Quantization

Quantization converts a model from floating-point formats (like 32-bit floats) to lower-precision formats (like 8-bit integers).

How does quantization work?

Post-Training Static Quantization: The entire model, including weights and activations, is converted to lower precision. The model is calibrated using a small calibration dataset to minimize the impact on accuracy. It may not always give the best accuracy.
Dynamic Quantization: Weights are quantized statically, but activations are quantized dynamically at runtime. This method is often used for models where activation ranges can vary significantly depending on the input data.
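Quantization-Aware Training (QAT): Quantization is simulated during training, so the model learns weights that tolerate the rounding error before it is converted to lower precision.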
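
To make the float-to-integer mapping concrete, here is a minimal sketch of affine (scale and zero-point) quantization to 8-bit integers in plain Python. It is only an illustration: the helper names `compute_qparams`, `quantize`, and `dequantize` and the sample values are made up for this example, and real frameworks differ in details such as per-channel scales and rounding modes. In static quantization the value range for activations would come from the calibration dataset; in dynamic quantization it would be recomputed from each input at runtime.

```python
def compute_qparams(values, qmin=-128, qmax=127):
    """Derive a scale and zero point from the observed value range (hypothetical helper)."""
    # Extend the range to include 0.0 so that zero maps exactly to an integer.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:  # all values are zero; any scale works
        return 1.0, 0
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers in [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


# Illustrative values standing in for weights or activations.
weights = [-1.0, 0.0, 0.5, 2.0]
scale, zero_point = compute_qparams(weights)
q_weights = quantize(weights, scale, zero_point)
restored = dequantize(q_weights, scale, zero_point)

print("scale:", scale, "zero point:", zero_point)
print("quantized:", q_weights)
print("restored:", restored)
```

The restored values differ from the originals by at most about half a quantization step (`scale / 2`), which is the rounding error that calibration and QAT try to keep harmless.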

The smallest Llama 2 model has 7 billion parameters. If every parameter is stored in 32 bits, then we need $\frac{7 \times 10^{9} \times 32}{8 \times 10^{9}} = 28\ \text{GB}$ just to store the parameters on disk.
For inference, we need to load all of its parameters into memory.
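
The same arithmetic can be repeated for lower precisions to see what quantization saves. The sketch below assumes 1 GB = 10^9 bytes, as in the formula above, and counts parameters only; it ignores activations and any extra metadata such as scales and zero points. The function name `param_storage_gb` and the 16-, 8-, and 4-bit widths are chosen here for illustration.

```python
def param_storage_gb(num_params, bits_per_param):
    """Storage for the parameters alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


LLAMA2_7B_PARAMS = 7e9  # parameter count taken from the text

# 32-bit is the case from the text; the other widths are illustrative.
for bits in (32, 16, 8, 4):
    size = param_storage_gb(LLAMA2_7B_PARAMS, bits)
    print(f"{bits:>2}-bit: {size:.1f} GB")
```

Under these assumptions, going from 32-bit floats to 8-bit integers cuts the parameter storage to a quarter of its original size.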