12 Model Compression
GPT-4 reportedly has over a trillion parameters. LLaMA 3.1 405B needs over 800 GB just to store its weights in 16-bit precision. Running these models requires clusters of A100 GPUs that cost tens of thousands of dollars per month. Meanwhile, you have a laptop with 16 GB of RAM and a dream.
Model compression bridges this gap. It is the art of making big models small enough to be useful without making them so degraded that they are useless. And the results are remarkable: a well-quantized 70B model running in 4-bit precision on a gaming PC can outperform a full-precision 13B model on many benchmarks. Compression is not a compromise; it is an optimization.
Here is a surprising fact: for most practitioners, compression matters more than training. The vast majority of LLM users will never pre-train a model from scratch (it costs millions of dollars), but nearly everyone will need to run a model efficiently. Quantization, pruning, and distillation are the tools that make AI accessible to the 99% of developers who do not have a data center.
12.1 Quantization: Fewer Bits, More Speed
Quantization reduces the numerical precision of model weights from 16-bit or 32-bit floating point to lower-bit representations: 8-bit, 4-bit, or even lower. The insight is that most weight values are clustered near zero, and the fine distinctions between close values matter less than you might think.
12.1.1 The Basics of Number Representation
A quick refresher: neural networks typically train in FP32 (32-bit floating point) or BF16/FP16 (16-bit). Each weight occupies 4 bytes (FP32) or 2 bytes (FP16). A 7B parameter model in FP16 requires \(7 \times 10^9 \times 2 = 14\) GB just for the weights, plus additional memory for activations and KV-cache during inference.
Quantization maps these floating-point values to a smaller set of discrete levels. In 4-bit quantization (INT4), each weight is one of \(2^4 = 16\) possible values and occupies only half a byte. The same 7B model now needs about 3.5 GB for its weights (slightly more in practice, because the scale factors used to decode the 4-bit values must be stored too), which fits comfortably on a modern laptop.
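The arithmetic above is easy to reproduce. The following is a minimal sketch, assuming 1 GB = \(10^9\) bytes and counting only the weights; the helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes), ignoring activations and KV-cache."""
    return n_params * bits_per_weight / 8 / 1e9

# Footprint of a 7B-parameter model at several precisions.
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"7B model in {name}: {weight_memory_gb(7e9, bits):.1f} GB")
```

Real quantized files are a little larger than this estimate because they also store scale factors and metadata.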
12.1.2 Post-Training Quantization (PTQ)
PTQ applies quantization after training, without any retraining. You take a finished model, reduce its precision, and deploy it.
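The simplest form of PTQ is plain round-to-nearest. The sketch below is an illustrative baseline, not any particular library's implementation: it assumes symmetric quantization with a single scale for the whole tensor, and the function name `quantize_rtn` and the random weights are placeholders.

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int = 4):
    """Symmetric round-to-nearest quantization with one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = np.abs(w).max() / qmax                  # largest weight maps to +/- qmax
    codes = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # stand-in for trained weights
codes, scale = quantize_rtn(w)
w_hat = codes * scale                               # dequantized weights

print("distinct levels used:", len(np.unique(codes)))
print("max abs error:", float(np.max(np.abs(w - w_hat))))
```

Every weight lands within half a quantization step of its original value. The methods below improve on this baseline by choosing the rounding more carefully.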
GPTQ (Frantar et al. 2022) is one of the most widely used post-training quantization methods. It uses approximate second-order information (the Hessian of each layer's reconstruction error, estimated from a small calibration set) to decide how to round each weight. The clever trick: it processes weights column by column, and after quantizing each column, it adjusts the remaining unquantized columns to compensate for the rounding error. This “error compensation” is what makes GPTQ produce markedly better results than simply rounding every weight to the nearest quantized value.
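The column-by-column error compensation can be sketched on a toy layer. This is a simplified illustration of the idea, not the reference GPTQ implementation: it assumes a fixed per-row 4-bit grid, updates the inverse Hessian with a direct rank-one correction (the paper uses a Cholesky-based, blocked formulation for speed and stability), and all names (`gptq_sketch`, `round_to_grid`, the damping value, the synthetic data) are made up for this example.

```python
import numpy as np

def round_to_grid(w, scale, bits=4):
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def gptq_sketch(W, X, bits=4, damp=0.01):
    """W: (d_out, d_in) weights, X: (d_in, n) calibration inputs."""
    W = W.astype(np.float64).copy()
    d_in = W.shape[1]
    scale = np.abs(W).max(axis=1) / (2 ** (bits - 1) - 1)  # one scale per output row
    H = X @ X.T                                            # layer-wise Hessian (up to a constant)
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)         # damping for numerical stability
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for i in range(d_in):
        Q[:, i] = round_to_grid(W[:, i], scale, bits)      # quantize column i
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])     # compensate in the remaining columns
        Hinv -= np.outer(Hinv[:, i], Hinv[i, :]) / Hinv[i, i]  # drop column i from the problem
    return Q, scale

rng = np.random.default_rng(0)
d_out, d_in, n = 16, 32, 512
X = rng.normal(size=(d_in, n)) + 2.0 * rng.normal(size=(1, n))  # correlated input channels
W = rng.normal(size=(d_out, d_in))

Q_gptq, scale = gptq_sketch(W, X)
Q_rtn = round_to_grid(W, scale[:, None])                   # plain rounding on the same grid
err_gptq = np.linalg.norm((W - Q_gptq) @ X) ** 2
err_rtn = np.linalg.norm((W - Q_rtn) @ X) ** 2
print(f"output error  RTN: {err_rtn:.1f}   compensated: {err_gptq:.1f}")
```

Both results use exactly the same 4-bit grid; the only difference is that the compensated version lets later columns absorb the rounding error of earlier ones. The benefit shows up when input channels are correlated, as they are in this synthetic data.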
AWQ (Activation-Aware Weight Quantization) (Lin et al. 2023) takes a different approach. It observes that a small fraction of weight channels are “salient”: they disproportionately affect output quality. AWQ identifies these critical channels by examining activation magnitudes on a small calibration dataset, then protects them during quantization. Rather than keeping them at higher precision, which would complicate the inference kernels, it scales the salient channels up before quantization (and scales the matching activations down), which shrinks their relative rounding error. The result often matches or outperforms GPTQ at the same bit width.
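A rough sketch of the activation-aware scaling idea follows. It is a simplification under stated assumptions, not the AWQ reference code: it uses one quantization scale per output row instead of small weight groups, searches a single exponent `alpha` over a coarse grid, and the names `awq_sketch` and `quantize_rows` and the synthetic data are invented for this illustration.

```python
import numpy as np

def quantize_rows(W, bits=4):
    """Round-to-nearest with one symmetric scale per output row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def awq_sketch(W, X, bits=4, n_grid=11):
    """W: (d_out, d_in), X: (d_in, n). Search per-channel scales s = act**alpha."""
    act = np.abs(X).mean(axis=1)                  # mean |activation| per input channel
    best_err, best_alpha, best_W = np.inf, None, None
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = act ** alpha                          # alpha = 0 gives s = 1, i.e. plain rounding
        W_hat = quantize_rows(W * s, bits) / s    # effective weights after folding 1/s back
        err = np.linalg.norm((W - W_hat) @ X) ** 2
        if err < best_err:
            best_err, best_alpha, best_W = err, alpha, W_hat
    return best_W, best_alpha, best_err

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 256))
X[:4] *= 20.0                                     # a few channels with large activations
W = rng.normal(size=(32, 64))

W_awq, alpha, err_awq = awq_sketch(W, X)
err_rtn = np.linalg.norm((W - quantize_rows(W)) @ X) ** 2
print(f"best alpha: {alpha:.1f}   RTN error: {err_rtn:.1f}   scaled error: {err_awq:.1f}")
```

Because `alpha = 0` is part of the search grid, the selected scaling can never do worse than plain rounding on the calibration data.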
If you have ever downloaded a model from Hugging Face with “Q4_K_M” or “Q5_K_S” in the name, you have used quantized models in the GGUF format. Created by Georgi Gerganov for his llama.cpp project, GGUF supports k-quant methods, which quantize weights in small blocks with their own scale factors and spend more bits on the tensors that are most sensitive to quantization. The naming convention tells you the quantization: Q4_K_M means 4-bit k-quant, medium variant (the S, M, and L suffixes trade file size against quality). Q8_0 means 8-bit, basic block quantization. This ecosystem has made it easy to run large models on consumer hardware: go to Hugging Face, search for a model name plus “GGUF,” download it, and run it locally with Ollama or llama.cpp.
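To make “basic quantization” concrete, here is a NumPy sketch in the spirit of Q8_0. It assumes the commonly described layout of blocks of 32 weights that share one 16-bit scale; it illustrates the idea only, is not llama.cpp's code, and does not produce or read GGUF files. The function names are hypothetical.

```python
import numpy as np

BLOCK = 32  # weights per block (assumed Q8_0-style layout)

def q8_0_quantize(x):
    """Quantize a flat array to int8 codes with one FP16 scale per block of 32."""
    blocks = x.astype(np.float32).reshape(-1, BLOCK)
    d = np.abs(blocks).max(axis=1, keepdims=True) / 127.0       # per-block scale
    inv = np.divide(1.0, d, out=np.zeros_like(d), where=d > 0)  # avoid dividing by zero
    codes = np.round(blocks * inv).astype(np.int8)
    return d.astype(np.float16), codes

def q8_0_dequantize(d, codes):
    return (d.astype(np.float32) * codes).reshape(-1)

rng = np.random.default_rng(0)
x = rng.normal(size=4096).astype(np.float32)
d, codes = q8_0_quantize(x)
x_hat = q8_0_dequantize(d, codes)

bits_per_weight = (2 + BLOCK) * 8 / BLOCK   # 2-byte scale plus 32 one-byte codes
print("bits per weight:", bits_per_weight)
print("max abs error:", float(np.max(np.abs(x - x_hat))))
```

The per-block scale is why an “8-bit” format costs slightly more than 8 bits per weight. The k-quant formats build on the same block idea with more elaborate scale storage.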
12.1.3 Quantization-Aware Training (QAT)
QAT simulates quantization during training. The forward pass uses quantized weights, but gradients are computed with respect to the full-precision “shadow” weights using straight-through estimators (which pretend the rounding function has a gradient of 1). This allows the model to learn weight values that are robust to quantization, typically yielding better results than PTQ at very low bit widths (2-bit, 1-bit).
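The straight-through estimator is easiest to see on a tiny problem. The sketch below trains a linear regression whose forward pass only ever sees quantized weights; it writes the gradient by hand in NumPy rather than using an autograd framework, and the grid step, learning rate, data, and function names are all arbitrary choices for this illustration.

```python
import numpy as np

def fake_quant(w, step=0.25, qmax=7):
    """Snap weights to a fixed 4-bit grid with spacing `step`."""
    return np.clip(np.round(w / step), -qmax - 1, qmax) * step

def mse(X, w, y):
    return float(np.mean((X @ w - y) ** 2))

def train_qat(X, y, steps=300, lr=0.05):
    w = np.zeros(X.shape[1])                       # full-precision shadow weights
    for _ in range(steps):
        w_q = fake_quant(w)                        # forward pass uses quantized weights
        grad_wq = 2.0 * X.T @ (X @ w_q - y) / len(y)
        w -= lr * grad_wq                          # STE: treat d(w_q)/d(w) as 1
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))
w_true = np.array([1.3, -0.7, 0.45, -1.1, 0.9, -0.2, 0.6, -1.4])
y = X @ w_true

w = train_qat(X, y)
loss_before = mse(X, fake_quant(np.zeros(8)), y)
loss_after = mse(X, fake_quant(w), y)
print(f"quantized-model loss: {loss_before:.3f} -> {loss_after:.3f}")
```

The rounding function has zero gradient almost everywhere, so without the straight-through trick the shadow weights would never move. With it, they drift until the quantized weights sit on grid points close to the true solution.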
12.1.4 Extreme Quantization: BitNet
How low can you go? BitNet b1.58 (Ma et al. 2025) constrains every weight to exactly three values: \(\{-1, 0, +1\}\), requiring only 1.58 bits per weight (since \(\log_2(3) \approx 1.58\)). This sounds absurd, but the authors report that BitNet models match the quality of full-precision models with comparable parameter counts. Because multiplying by \(-1\), \(0\), or \(+1\) is just a subtraction, a skip, or an addition, the matrix multiplications that dominate inference need no floating-point multiplication at all. This could fundamentally change AI hardware: a chip optimized for ternary operations would be dramatically simpler, cheaper, and more energy-efficient than current GPU architectures.
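A small sketch shows why ternary weights remove the multiplications. It assumes absmean scaling to pick the ternary values and integer activations; the helper names and random inputs are made up for illustration, and a real kernel would pack the weights and vectorize the sums.

```python
import numpy as np

def ternarize(W, eps=1e-8):
    """Absmean quantization to {-1, 0, +1} with one scale per matrix."""
    gamma = np.abs(W).mean()
    W_t = np.clip(np.round(W / (gamma + eps)), -1, 1).astype(np.int8)
    return W_t, gamma

def ternary_matvec(W_t, x):
    """Compute W_t @ x using only additions and subtractions of entries of x."""
    y = np.zeros(W_t.shape[0], dtype=x.dtype)
    for r in range(W_t.shape[0]):
        y[r] = x[W_t[r] == 1].sum() - x[W_t[r] == -1].sum()
    return y

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64))
x = rng.integers(-127, 128, size=64).astype(np.int32)   # integer activations

W_t, gamma = ternarize(W)
y_add_only = ternary_matvec(W_t, x)
y_reference = W_t.astype(np.int32) @ x                  # ordinary multiply-accumulate
print(np.array_equal(y_add_only, y_reference))
```

The scale `gamma` is applied once per output after the integer sums, so the expensive inner loop contains no multiplications.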
Running a single ChatGPT query uses roughly 10 times the energy of a Google search. As AI usage scales to billions of queries per day, energy consumption becomes a first-order concern. Quantization directly reduces energy costs: 4-bit operations use roughly 4 times less energy than 16-bit operations, and BitNet's integer-only arithmetic uses orders of magnitude less. Model compression is not just about fitting models on your laptop; it is about making AI environmentally sustainable.
12.2 Mixed Precision Training and Inference
Mixed precision uses different numerical precisions for different parts of the computation:
- FP16/BF16 for compute: Matrix multiplications use half-precision, which is \(2\times\) faster on modern GPUs and uses half the memory.
- FP32 for master weights: A full-precision copy of weights is maintained for optimizer updates, preventing the accumulation of tiny rounding errors over millions of steps.
- Loss scaling: To prevent gradient underflow in FP16 (where small gradients round to zero), the loss is multiplied by a large constant before the backward pass, then gradients are scaled back down before the optimizer step.
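The underflow problem that loss scaling addresses can be reproduced with NumPy's `float16` type. This is a toy demonstration on a single number, assuming a scale factor of \(2^{16}\); real training frameworks apply the same idea to whole gradient tensors and adjust the scale dynamically.

```python
import numpy as np

grad = 1e-8                  # a tiny gradient value
scale = 65536.0              # loss scale, 2**16 (an arbitrary choice here)

naive = np.float16(grad)                             # too small for FP16: becomes 0
scaled = np.float16(grad * scale)                    # scaled value is representable
recovered = np.float32(scaled) / np.float32(scale)   # unscale in FP32 before the update

print("FP16 without scaling:", naive)
print("recovered after scaling:", recovered)
```

Multiplying the loss by a constant multiplies every gradient by the same constant, which is why scaling the loss once is enough to shift all gradients into FP16's representable range.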
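In PyTorch, mixed precision is as simple as wrapping the forward pass and loss computation in torch.autocast() and using a GradScaler for the backward pass (older code uses the equivalent torch.cuda.amp.autocast()). Hugging Face's Trainer handles it automatically with a single flag (fp16=True or bf16=True in TrainingArguments).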
BFloat16 (BF16) has the same exponent range as FP32 but less mantissa precision than FP16. This means BF16 can represent the same range of numbers as FP32 (so you rarely get overflows or underflows) but with less precision. In practice, BF16 training is more stable than FP16 training and often does not require loss scaling. If your hardware supports BF16 (A100, H100, Apple M-series), prefer it over FP16.
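The range-versus-precision trade-off can be checked directly. NumPy has no native BF16 type, so the sketch below emulates it by keeping the top 16 bits of the FP32 encoding; this truncates rather than rounds to nearest as real hardware typically does, and the helper `to_bf16` is written for this example only.

```python
import numpy as np

def to_bf16(x):
    """Emulate BF16 by zeroing the low 16 bits of each FP32 value (truncation)."""
    bits = np.asarray(x, dtype=np.float32).reshape(-1).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

vals = np.array([1e5, 1.001], dtype=np.float32)
with np.errstate(over="ignore"):
    fp16 = vals.astype(np.float16)    # 1e5 exceeds FP16's maximum of 65504
bf16 = to_bf16(vals)

print("FP16:", fp16)   # large value overflows, small difference from 1 survives
print("BF16:", bf16)   # large value survives, small difference from 1 is lost
```

FP16 keeps the fine detail near 1 but overflows on the large value, while BF16 does the opposite. For gradients and activations that span many orders of magnitude, the wider range is usually the more valuable property.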
12.3 Pruning: Removing What Does Not Matter
Pruning removes redundant parameters from the model entirely:
Unstructured pruning sets individual weights to zero based on their magnitude. The idea is simple: weights close to zero contribute little to the output, so zeroing them out has minimal impact. You can often reach high sparsity with surprisingly small accuracy drops: 90%+ sparsity (90% of weights are zero) has been reported for some smaller networks, although LLMs usually tolerate less without retraining. The catch: modern GPUs are designed for dense computation, so unstructured sparsity does not actually speed up inference without sparse kernels or specialized hardware. NVIDIA's Sparse Tensor Cores, for example, accelerate only a constrained 2:4 pattern rather than arbitrary sparsity.
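Magnitude pruning takes only a few lines. The sketch below assumes a single global threshold over one weight matrix; the function name and the random matrix are placeholders, and practical pipelines usually prune gradually and fine-tune between steps.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(sparsity * W.size)                 # number of weights to remove
    if k == 0:
        return W.copy()
    threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    return np.where(np.abs(W) <= threshold, 0.0, W)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W_pruned = magnitude_prune(W, sparsity=0.9)
print("fraction of zeros:", float(np.mean(W_pruned == 0)))
```

The pruned matrix has the same shape as the original and is still stored densely, which is exactly why this kind of sparsity saves nothing at inference time unless the kernel or storage format can skip the zeros.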
Structured pruning removes entire attention heads, FFN neurons, or even whole layers. This directly reduces the number of operations and memory usage without requiring sparse hardware. Recent work like LLM-Pruner and Sheared LLaMA applies structured pruning to large language models, achieving 20-30% size reduction with minimal quality loss.
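For contrast with unstructured pruning, here is a sketch of width pruning on a single ReLU feed-forward block. The importance score (product of each hidden neuron's input and output weight norms) is a simple heuristic chosen for this illustration; methods such as LLM-Pruner use more informed criteria, and all names here are hypothetical.

```python
import numpy as np

def ffn(x, W1, b1, W2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0)        # ReLU feed-forward block

def prune_ffn_neurons(W1, b1, W2, keep_ratio=0.8):
    """W1: (d_hidden, d_model), W2: (d_model, d_hidden). Drop low-scoring hidden neurons."""
    score = np.linalg.norm(W1, axis=1) * np.linalg.norm(W2, axis=0)
    n_keep = int(keep_ratio * W1.shape[0])
    keep = np.sort(np.argsort(score)[-n_keep:])     # indices of the neurons to keep
    return W1[keep], b1[keep], W2[:, keep], keep

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64
W1, b1 = rng.normal(size=(d_hidden, d_model)), rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_model, d_hidden))

W1p, b1p, W2p, keep = prune_ffn_neurons(W1, b1, W2)
print("hidden size:", W1.shape[0], "->", W1p.shape[0])
```

The pruned matrices are genuinely smaller, so an ordinary dense matrix multiplication on them is faster with no special hardware support.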
Removing layers (depth pruning) is generally more harmful than removing neurons (width pruning). Deep networks build up representations hierarchically: early layers extract basic features, middle layers compose them, and later layers specialize. Removing a layer disrupts this chain. But removing 10% of neurons across all layers barely affects the model because of the enormous redundancy in each layer's representations.
12.4 Speculative Decoding: Compression Meets Speed
Speculative decoding is an inference optimization that uses compression principles in a clever way. The idea: run a small, fast draft model to generate candidate tokens, then use the large target model to verify them in parallel.
The draft model generates \(k\) tokens autoregressively (cheap, because it is small). The target model then scores all \(k\) tokens in a single forward pass (this is the key: verifying \(k\) tokens in parallel costs roughly the same as generating one token). If the target model agrees with the draft model's predictions (which it often does for predictable tokens like “the,” “is,” “a”), you get \(k\) tokens for the cost of roughly one target-model forward pass.
The speedup depends on how well the draft model's distribution matches the target model's. Speedups of roughly 2\(\times\) to 3\(\times\) are commonly reported, and larger gains are possible when the draft closely tracks the target. In every case there is no loss in output quality: the final output is statistically identical to what the target model would have produced alone.
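The dependence on draft quality can be made quantitative under a simplifying assumption: if each drafted token is accepted independently with probability `alpha`, the expected number of tokens produced per target-model pass with `k` drafted tokens is \((1 - \alpha^{k+1}) / (1 - \alpha)\). The sketch below evaluates that expression; it ignores the cost of running the draft model, so it is an upper bound on the speedup rather than a prediction.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens per target forward pass, assuming i.i.d. acceptance rate alpha."""
    if alpha >= 1.0:
        return float(k + 1)          # every draft accepted, plus one token from the target
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

for alpha in (0.5, 0.8, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_pass(alpha, k=4):.2f} tokens per pass")
```

Even with four drafted tokens, a draft that is accepted half the time yields fewer than two tokens per pass, which is why a well-matched draft model matters more than a long draft.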
Speculative decoding is one of the rare free lunches in AI: you get faster inference with mathematically guaranteed identical output quality. The verification step uses a rejection sampling scheme that ensures the combined system's output distribution is exactly the target model's distribution. If the draft model proposes a bad token, it is simply rejected and the target model fills in. You never sacrifice quality for speed.
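The accept-or-correct rule can be verified empirically on a toy vocabulary. The sketch below handles a single token position with made-up distributions `p` (target) and `q` (draft); a real decoder applies the same rule to each of the \(k\) drafted positions in turn and stops at the first rejection.

```python
import numpy as np

def speculative_step(p, q, rng):
    """Propose a token from draft distribution q, then accept or correct it against target p."""
    x = rng.choice(len(q), p=q)
    if rng.random() < min(1.0, p[x] / q[x]):          # accept with probability min(1, p/q)
        return int(x), True
    residual = np.maximum(p - q, 0.0)                 # on rejection, resample from max(p - q, 0)
    return int(rng.choice(len(p), p=residual / residual.sum())), False

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])    # target model's next-token distribution (toy values)
q = np.array([0.3, 0.3, 0.4])    # draft model's next-token distribution (toy values)

samples = [speculative_step(p, q, rng)[0] for _ in range(20000)]
freq = np.bincount(samples, minlength=3) / len(samples)
print("target distribution:", p)
print("empirical frequency:", freq)
```

Although every proposal comes from the draft distribution, the emitted tokens follow the target distribution. The draft model affects only how often a proposal is accepted, that is, the speed.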
Choosing the draft model: The draft model should be much smaller than the target (e.g., a 0.5B draft for a 7B target) and ideally from the same model family (sharing vocabulary and tokenizer). Some systems use quantized versions of the target model as the draft, which share the same knowledge but run much faster.
12.5 Sparsity-Aware Hardware
Compression techniques are only as good as the hardware that supports them. Modern AI chips are increasingly designed with sparsity and low-precision arithmetic in mind:
NVIDIA's Sparse Tensor Cores support 2:4 structured sparsity: out of every four consecutive weights, exactly two must be zero. This constraint is compatible with a hardware-efficient format that achieves nearly 2\(\times\) speedup on A100 and H100 GPUs with minimal accuracy loss.
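The 2:4 pattern itself is simple to produce. The sketch below keeps the two largest-magnitude weights in every group of four consecutive weights along each row; it only shows the pruning pattern and assumes the row length is a multiple of four. Getting the speedup additionally requires the vendor's compressed storage format and sparse kernels, which are not shown.

```python
import numpy as np

def prune_2_of_4(W):
    """Zero the two smallest-magnitude weights in every group of four consecutive weights."""
    groups = W.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # the two smallest per group
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(W.shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
W_24 = prune_2_of_4(W)
print(W_24[0])          # every run of four values contains exactly two zeros
```

Because the zeros are spread evenly, the hardware can store only the two surviving values per group plus a small index, which is what makes the pattern efficient to execute.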
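Apple's M-series chips pair a Neural Engine optimized for 16-bit and 8-bit inference with unified memory shared by the CPU and GPU, making them well-suited for running quantized models locally. Tools like Ollama and llama.cpp run on the GPU through Metal and benefit from that large, fast memory pool, which is a big part of why they work so well on Mac hardware.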
Custom AI accelerators from companies like Cerebras, Groq, and d-Matrix are designing hardware specifically for sparse, low-precision computation. Groq's LPU (Language Processing Unit) achieves extremely fast inference by eschewing the batch-processing paradigm of GPUs entirely.
A fascinating dynamic is emerging: hardware companies design chips optimized for common compression formats (2:4 sparsity, INT4/INT8), which encourages researchers to develop compression methods that match those formats, which drives hardware companies to optimize further. This co-evolution is rapidly closing the gap between the theoretical speedups of compression and the actual speedups achievable in practice.
12.6 Knowledge Distillation: Teaching a Smaller Model
Knowledge distillation (Hinton et al. 2015) trains a smaller “student” model to mimic the outputs of a larger “teacher” model. Instead of training the student on hard labels (the ground truth), you train it on the teacher's soft probability distributions (computed from its logits), which contain richer information about the relationships between categories.
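A common way to write the soft-target objective is a KL divergence between temperature-softened distributions, scaled by \(T^2\). The sketch below is a minimal NumPy version of that loss, assuming a temperature of 2 and random logits as stand-ins for real model outputs; in practice it is usually mixed with the ordinary hard-label loss.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """T^2 * KL(teacher || student) on softened distributions, averaged over the batch."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return float(T ** 2 * kl.mean())

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 10))             # batch of 4, 10 classes
student = rng.normal(size=(4, 10))
print("loss vs. teacher:", distillation_loss(student, teacher))
print("loss when student == teacher:", distillation_loss(teacher, teacher))
```

A higher temperature flattens the teacher's distribution, so the student also learns from the small probabilities the teacher assigns to wrong but related classes.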
For a much deeper treatment of distillation, including its application to LLMs, self-distillation, and dataset distillation, see Chapter 15.
12.7 Combining Techniques: The Compression Pipeline
In practice, compression techniques are combined. A typical deployment pipeline might look like:
- Start with a pre-trained 70B model.
- Apply structured pruning to remove 20% of attention heads and FFN neurons (now effectively about 56B parameters).
- Quantize to 4-bit using GPTQ or AWQ (from about 112 GB to about 28 GB).
- Fine-tune with QLoRA on a small instruction dataset to recover any quality lost during compression.
- Deploy with vLLM or llama.cpp for efficient serving.
The result: a model that fits on a single GPU, runs at interactive speeds, and retains 95%+ of the original model's quality.
12.8 Exercises
- Download a 7B model (e.g., Mistral 7B) in both FP16 and 4-bit GGUF formats. Run both on the same prompts and compare output quality subjectively. Can you tell the difference?
- Quantize a model to 4-bit using both GPTQ (via auto-gptq) and AWQ (via autoawq). Measure perplexity on a held-out dataset and compare both with the FP16 baseline. Which method preserves quality better for your model?
- Apply QLoRA fine-tuning to a 4-bit quantized model on a small instruction dataset (e.g., Alpaca 52K). Compare the fine-tuned quantized model with the original FP16 model on the same evaluation prompts. Can fine-tuning close the quality gap?
- Experiment with different GGUF quantization levels (Q2_K, Q4_K_M, Q5_K_M, Q8_0) and benchmark inference speed vs. quality on your hardware. Find the sweet spot for your use case.
- Read the BitNet paper and explain why ternary weights (\(\{-1, 0, +1\}\)) can replace floating-point multiplication with integer addition. What are the implications for future hardware design?