[Avg. reading time: 5 minutes]

Model Compression

Model compression is the set of techniques used to reduce the size and computational cost of a trained model while keeping its accuracy almost the same.

Why It Exists

  • Speed up inference
  • Reduce memory footprint
  • Fit models on cheaper hardware
  • Reduce serving cost
  • Enable on-device ML (phones, edge devices, IoT)
  • Allow high-traffic systems to scale

Without Compression

  • Slow predictions
  • GPU or CPU bottlenecks
  • More servers needed to keep up
  • Higher inference bill
  • Some environments can’t run your model at all
  • Increased latency kills user experience

Photo Analogy

Compressing a model is like saving a photo as a JPEG instead of keeping the RAW file: the result is far smaller and cheaper to store and serve, yet to the eye it looks almost the same.

The popular mechanisms include:

  • Quantization

Quantization refers to the process of reducing the precision, or bit-width, of the numerical values used to represent model parameters, usually from n bits to m bits, where n > m.
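As a rough illustration of that mapping, here is a minimal sketch of affine (scale and zero-point) quantization of FP32 values down to 8-bit integers; the function names and the random example array are made up for illustration.

import numpy as np

def quantize(x_fp32: np.ndarray, n_bits: int = 8):
    """Map FP32 values onto an unsigned n_bits-wide integer grid."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (x_fp32.max() - x_fp32.min()) / (qmax - qmin)
    zero_point = int(np.round(qmin - x_fp32.min() / scale))
    q = np.clip(np.round(x_fp32 / scale) + zero_point, qmin, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    """Approximate reconstruction of the original FP32 values."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(6).astype(np.float32)
q, scale, zp = quantize(weights, n_bits=8)
print(weights)                   # original 32-bit values
print(dequantize(q, scale, zp))  # close approximation stored in 8 bits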

In ML, FP32 (32-bit floating point) is the default precision; with quantization we convert those 32-bit values to 16-bit or 8-bit representations and achieve similar results.
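As a concrete sketch, post-training dynamic quantization in PyTorch converts the weights of selected layers from FP32 to INT8 in one call; the tiny model below is a made-up example, not a real workload.

import torch
import torch.nn as nn

# Hypothetical FP32 model standing in for a real network.
model_fp32 = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# Store Linear weights as INT8; activations are quantized on the fly at inference.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(model_fp32(x))  # FP32 inference
print(model_int8(x))  # INT8 inference: smaller model, usually faster on CPU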

https://colab.research.google.com/drive/1SHGqVZhk8tKpuGQ3KqLhUXIk8NU9W2Er?usp=sharing

When using this with MLflow, log both models as artifacts in the same run and serve whichever one the use case calls for.
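A minimal sketch of that workflow using the mlflow.pytorch flavor; the run name, artifact paths, and toy model below are made up for illustration.

import mlflow
import mlflow.pytorch
import torch
import torch.nn as nn

model_fp32 = nn.Sequential(nn.Linear(128, 10))   # hypothetical trained model
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

with mlflow.start_run(run_name="quantization-demo"):
    # Log both variants under the same run so either can be served later.
    mlflow.pytorch.log_model(model_fp32, "model_fp32")
    mlflow.pytorch.log_model(model_int8, "model_int8")

# Serve whichever artifact fits the use case, e.g.:
#   mlflow models serve -m runs:/<run_id>/model_int8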

  • Distillation

Model distillation, also known as knowledge distillation, is a technique where a smaller model, often referred to as a student model, is trained to mimic the behavior of a larger, more complex model, known as a teacher model. The goal is to transfer the knowledge and performance of the larger model to the smaller one.

Analogy: reading the whole book yourself (the teacher) vs. being nudged with hints and references (the student).
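A minimal sketch of the usual distillation loss, assuming teacher and student are PyTorch classifiers over the same label set; the temperature, mixing weight, and toy models are illustrative choices, not fixed recipe values.

import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: the usual cross-entropy against ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy stand-ins: a "large" frozen teacher and a small student.
teacher = nn.Linear(32, 5)
student = nn.Linear(32, 5)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(8, 32)
y = torch.randint(0, 5, (8,))

with torch.no_grad():
    teacher_logits = teacher(x)      # teacher is not updated

optimizer.zero_grad()
loss = distillation_loss(student(x), teacher_logits, y)
loss.backward()
optimizer.step()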

#compression #quantization #distillation