Converting a model from using FP (like 32-bit floats) to using lower-precision formats (like 8-bit integers).

How does quantization work?

  1. Post-Training Static quantization: Entire model including weights and activations are converted to lower precision. Model is calibrated using a small calibration dataset to minimize the impact on accuracy. May not always have best accuracy.
  2. Dynamic Quantization: Weights quantized statically, but activations are quantized dynamocally at runtime. Method is often used for models wehre activation ranges can vary significantly depending on the input data.
  3. Quantization-Aware Training (QAT)

source: https://www.youtube.com/watch?v=0VdNflU08yA

Problem

  • Smallest Llama 2 has 7 billion parameters. If every parameter is 32 bit, then we need just to store parameters on disk
  • For inference, we need to load all its parameters in memory

How are integers represented in the CPU (or GPU)?

Python can represent arbitraty big numbers by using so BigNum arithmetic: each number is stored as an array of digits in base

How are floating point numbers represented?

IEEE-754

Applying quantization

  1. Quantization first happens
  2. Then the integ arithmetic happens
  3. and then it’s deuqnatized and sent to the next layer

Types of qunatization

Assymetric vs Symmetric quantization

Low-precision matrix multiplication

Multiply-Accumulate (MAC)

How to choose range

Min-max:

  • Sensitive to outliers

Percentile