Model Compression
Model compression is the set of techniques used to reduce the size and computational cost of a trained model while keeping its accuracy almost the same.
Why It Exists
- Speed up inference
- Reduce memory footprint
- Fit models on cheaper hardware
- Reduce serving cost
- Enable on-device ML (phones, edge devices, IoT)
- Allow high-traffic systems to scale
Without Compression
- Slow predictions
- GPU or CPU bottlenecks
- More servers needed to keep up
- Higher inference bill
- Some environments can't run your model at all
- Increased latency kills user experience
Photo Analogy
Compressing a photo (say, exporting it as a smaller JPEG) shrinks the file dramatically while the image still looks almost the same. Model compression does the same for a trained model: a much smaller, cheaper artifact with nearly the same accuracy.
The most popular mechanisms include:
- Quantization
Quantization refers to the process of reducing the precision or bit-width of the numerical values used to represent model parameters, usually from n bits to k bits, where n > k.
In ML, FP32 (32-bit floating point) is the default precision; quantization converts those 32-bit values to 16-bit or 8-bit representations while achieving similar results, as sketched below.
https://colab.research.google.com/drive/1SHGqVZhk8tKpuGQ3KqLhUXIk8NU9W2Er?usp=sharing
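A minimal sketch of post-training dynamic quantization with PyTorch, assuming a toy FP32 model whose layer sizes are made up for illustration: the weights of the selected layer types are stored as INT8, and activations are quantized on the fly at inference time.

```python
import os

import torch
import torch.nn as nn

# Toy FP32 model standing in for whatever you actually trained (hypothetical sizes).
model_fp32 = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# Dynamic quantization: store the weights of the chosen layer types as INT8;
# activations are quantized on the fly at inference time.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,
    {nn.Linear},        # layer types to quantize
    dtype=torch.qint8,
)

def size_mb(model: nn.Module) -> float:
    """Serialize the state dict to disk to compare on-disk footprints."""
    torch.save(model.state_dict(), "tmp.pt")
    mb = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return mb

print(f"FP32: {size_mb(model_fp32):.2f} MB, INT8: {size_mb(model_int8):.2f} MB")
```

On real models the accuracy drop from INT8 is usually small, but it should always be measured on a validation set before the quantized model replaces the FP32 one.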
When using this with MLflow, log both models as run artifacts and serve whichever variant the use case calls for, for example as sketched below.
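One way to do that, assuming the `model_fp32` and `model_int8` objects from the sketch above and a standard MLflow tracking setup (the run name and artifact paths here are arbitrary placeholders):

```python
import mlflow
import mlflow.pytorch

# Log both variants from the quantization sketch above so either can be
# served later, depending on the deployment target.
with mlflow.start_run(run_name="compression-comparison"):
    mlflow.pytorch.log_model(model_fp32, artifact_path="model_fp32")
    mlflow.pytorch.log_model(model_int8, artifact_path="model_int8")
    mlflow.log_param("quantization_dtype", "qint8")

# Later, load whichever variant fits the use case, e.g.:
# mlflow.pytorch.load_model("runs:/<run_id>/model_int8")
```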
- Distillation
Model distillation, also known as knowledge distillation, is a technique where a smaller model, often referred to as a student model, is trained to mimic the behavior of a larger, more complex model, known as a teacher model. The goal is to transfer the knowledge and performance of the larger model to the smaller one.
Analogy: reading the whole book yourself versus being nudged toward the answer with hints and references by someone who already has; a minimal training-step sketch follows.
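A sketch of one common distillation recipe: the student matches the teacher's temperature-softened logits via a KL term, blended with ordinary cross-entropy on the true labels. The teacher/student architectures, the temperature `T=2.0`, and the weight `alpha=0.5` are placeholder choices for illustration, not the only way to set this up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical teacher (large) and student (small) classifiers: 128 features, 10 classes.
teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10))
student = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 10))

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
x = torch.randn(64, 128)               # dummy batch of features
labels = torch.randint(0, 10, (64,))   # dummy labels

teacher.eval()
with torch.no_grad():
    teacher_logits = teacher(x)        # teacher is frozen; it only provides targets

optimizer.zero_grad()
loss = distillation_loss(student(x), teacher_logits, labels)
loss.backward()
optimizer.step()
print(f"distillation loss: {loss.item():.4f}")
```

In practice this step runs over the full training set for many epochs; only the student's weights are updated, and the trained student is what gets deployed.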