Model Compression
Model compression is the set of techniques used to reduce the size and computational cost of a trained model while keeping its accuracy almost the same.
Why It Exists
- Speed up inference
- Reduce memory footprint
- Fit models on cheaper hardware
- Reduce serving cost
- Enable on-device ML (phones, edge devices, IoT)
- Allow high-traffic systems to scale
Without Compression
- Slow predictions
- GPU or CPU bottlenecks
- More servers needed to keep up
- Higher inference bill
- Some environments can't run your model at all
- Increased latency kills user experience
Photo Analogy
Compressing a photo (say, exporting it as a smaller JPEG) shrinks the file dramatically while the image still looks almost the same. Model compression does the same for a trained model: a much smaller, cheaper artifact with nearly the same accuracy.
The most popular mechanisms include:
- Quantization
Quantization refers to the process of reducing the precision or bit-width of the numerical values used to represent model parameters, usually from n bits to k bits, where n > k.
In ML, FP32 (32-bit floating point) is the default precision; quantization converts those 32-bit values to 16-bit or 8-bit representations while achieving similar results, as sketched below.
https://colab.research.google.com/drive/1SHGqVZhk8tKpuGQ3KqLhUXIk8NU9W2Er?usp=sharing
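A minimal sketch of post-training dynamic quantization with PyTorch, assuming a toy FP32 model whose layer sizes are made up for illustration: the weights of the selected layer types are stored as INT8, and activations are quantized on the fly at inference time.

```python
import os

import torch
import torch.nn as nn

# Toy FP32 model standing in for whatever you actually trained (hypothetical sizes).
model_fp32 = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# Dynamic quantization: store the weights of the chosen layer types as INT8;
# activations are quantized on the fly at inference time.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,
    {nn.Linear},        # layer types to quantize
    dtype=torch.qint8,
)

def size_mb(model: nn.Module) -> float:
    """Serialize the state dict to disk to compare on-disk footprints."""
    torch.save(model.state_dict(), "tmp.pt")
    mb = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return mb

print(f"FP32: {size_mb(model_fp32):.2f} MB, INT8: {size_mb(model_int8):.2f} MB")
```

On real models the accuracy drop from INT8 is usually small, but it should always be measured on a validation set before the quantized model replaces the FP32 one.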
When using this with MLflow, log both models as run artifacts and serve whichever variant the use case calls for, for example as sketched below.
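One way to do that, assuming the `model_fp32` and `model_int8` objects from the sketch above and a standard MLflow tracking setup (the run name and artifact paths here are arbitrary placeholders):

```python
import mlflow
import mlflow.pytorch

# Log both variants from the quantization sketch above so either can be
# served later, depending on the deployment target.
with mlflow.start_run(run_name="compression-comparison"):
    mlflow.pytorch.log_model(model_fp32, artifact_path="model_fp32")
    mlflow.pytorch.log_model(model_int8, artifact_path="model_int8")
    mlflow.log_param("quantization_dtype", "qint8")

# Later, load whichever variant fits the use case, e.g.:
# mlflow.pytorch.load_model("runs:/<run_id>/model_int8")
```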
- Distillation
Model distillation, also known as knowledge distillation, is a technique where a smaller model, often referred to as a student model, is trained to mimic the behavior of a larger, more complex model, known as a teacher model. The goal is to transfer the knowledge and performance of the larger model to the smaller one.
Analogy: reading the whole book yourself versus being nudged toward the answer with hints and references by someone who already has; a minimal training-step sketch follows.
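A sketch of one common distillation recipe: the student matches the teacher's temperature-softened logits via a KL term, blended with ordinary cross-entropy on the true labels. The teacher/student architectures, the temperature `T=2.0`, and the weight `alpha=0.5` are placeholder choices for illustration, not the only way to set this up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical teacher (large) and student (small) classifiers: 128 features, 10 classes.
teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10))
student = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 10))

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
x = torch.randn(64, 128)               # dummy batch of features
labels = torch.randint(0, 10, (64,))   # dummy labels

teacher.eval()
with torch.no_grad():
    teacher_logits = teacher(x)        # teacher is frozen; it only provides targets

optimizer.zero_grad()
loss = distillation_loss(student(x), teacher_logits, labels)
loss.backward()
optimizer.step()
print(f"distillation loss: {loss.item():.4f}")
```

In practice this step runs over the full training set for many epochs; only the student's weights are updated, and the trained student is what gets deployed.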