22 In-Depth Technical Overview
This chapter is for the curious reader who wants to understand how things actually work beneath the abstractions. We revisit the transformer, scaling laws, mixture-of-experts, distributed training, and inference optimization with full mathematical detail. If the earlier chapters told you what to build, this chapter tells you why it works.
This chapter assumes comfort with linear algebra (matrix multiplication, eigenvalues), calculus (gradients, chain rule), and basic probability (softmax, KL divergence). If you need a refresher, Grant Sanderson's “3Blue1Brown” YouTube channel covers linear algebra and calculus with extraordinary clarity. For probability, review Chapters 1 and 2 of Bishop's “Pattern Recognition and Machine Learning.”
22.1 The Transformer Architecture in Full Detail
The transformer (Vaswani et al. 2017) is a stack of identical layers, each containing two sublayers: multi-head self-attention and a position-wise feed-forward network. Both sublayers use residual connections and layer normalization. Let us dissect each component.
22.1.1 Scaled Dot-Product Attention
The core operation. Given matrices of queries \(Q \in \mathbb{R}^{n \times d_k}\), keys \(K \in \mathbb{R}^{n \times d_k}\), and values \(V \in \mathbb{R}^{n \times d_v}\): \[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V\]
Why the \(\sqrt{d_k}\) scaling? Without it, when \(d_k\) is large, the dot products \(q_i \cdot k_j\) tend to have high variance (growing proportionally to \(d_k\)), pushing the softmax into extremely peaked distributions where gradients vanish. The scaling factor keeps the dot products in a regime where the softmax has useful gradients.
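To see where the variance claim comes from, assume (as an idealization, not something guaranteed in a trained model) that the components of \(q\) and \(k\) are independent with zero mean and unit variance. Then each product \(q_l k_l\) has zero mean and variance 1, and the \(d_k\) terms add up: \[\operatorname{Var}(q \cdot k) = \sum_{l=1}^{d_k} \operatorname{Var}(q_l k_l) = \sum_{l=1}^{d_k} \mathbb{E}[q_l^2]\,\mathbb{E}[k_l^2] = d_k, \qquad \operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1.\] Dividing by \(\sqrt{d_k}\) therefore keeps the typical size of the logits independent of the head dimension.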
Picture a standard dictionary: you look up a word (the key) and get back its definition (the value). Attention works the same way, except it is differentiable. Each query asks “what am I looking for?”, each key advertises “what do I contain?”, and each value holds “what information should I pass along?” The softmax over \(QK^\top\) computes a soft match between every query and all keys, returning a smooth, weighted mixture of all the values rather than a single exact match.
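The formula can be sketched in a few lines of NumPy. This is a minimal illustration for a single unbatched, unmasked sequence; the function names and the random test shapes are choices made here, not part of any library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for Q, K of shape (n, d_k) and V of shape (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) query-key similarities
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 5
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
out, weights = attention(Q, K, V)
print(out.shape)  # (4, 5)
```

Each row of `weights` is the "soft match" of one query against all keys, and each row of `out` is the corresponding weighted mixture of values.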
22.1.2 Multi-Head Attention
Running a single attention function would force the model to compress all task-relevant information into one set of attention weights. Multi-head attention solves this by running \(h\) parallel attention functions, each with its own learned projections: \[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W^O\] where \(\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\) and each projection matrix has dimensions \(W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}\), \(W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}\), with \(d_k = d_v = d_{\text{model}} / h\).
Different heads learn to attend to different types of relationships: some track syntax (subject-verb agreement), others track semantics (coreference), others track positional patterns. This specialization emerges naturally from training.
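The multi-head formula can be sketched as follows, assuming self-attention (queries, keys, and values all come from the same input `X`), no masking, and \(d_v = d_k\). The weight layout (one stacked array per projection type) is a convenience chosen for this sketch; real implementations usually fuse the heads into a single matrix multiplication.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Self-attention with h heads.

    X: (n, d_model); W_q, W_k, W_v: (h, d_model, d_k); W_o: (h * d_k, d_model).
    """
    h, _, d_k = W_q.shape
    heads = []
    for i in range(h):
        Q, K, V = X @ W_q[i], X @ W_k[i], X @ W_v[i]        # each (n, d_k)
        weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (n, n)
        heads.append(weights @ V)                           # head_i, (n, d_k)
    return np.concatenate(heads, axis=-1) @ W_o             # (n, d_model)

rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4
d_k = d_model // h
W_q, W_k, W_v = (rng.standard_normal((h, d_model, d_k)) for _ in range(3))
W_o = rng.standard_normal((h * d_k, d_model))
X = rng.standard_normal((n, d_model))
Y = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(Y.shape)  # (6, 16)
```

Because \(d_k = d_{\text{model}} / h\), the concatenated heads have width \(d_{\text{model}}\) again, so the total cost is comparable to one full-width attention.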
22.1.3 Positional Encodings: RoPE
Self-attention is permutation-equivariant: it treats the input as a set, ignoring order. Position information must be injected explicitly.
Modern models use Rotary Position Embeddings (RoPE) (Su et al. 2021), which encode relative position through rotation matrices applied to query and key vectors before the dot product. For position \(m\) and dimension pair \((2i, 2i+1)\): \[\text{RoPE}(x_m, m) = \begin{pmatrix} x_{m,2i} \cos(m\theta_i) - x_{m,2i+1} \sin(m\theta_i) \\ x_{m,2i} \sin(m\theta_i) + x_{m,2i+1} \cos(m\theta_i) \end{pmatrix}\] where \(\theta_i = 10000^{-2i/d}\). The key property: the dot product between two RoPE-encoded vectors depends only on their relative position, not their absolute position. This enables better length generalization than fixed positional encodings.
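The relative-position property can be checked numerically. The sketch below applies the rotation to a single vector with adjacent dimensions paired as in the formula; the function name `rope` and the test positions are illustrative choices, and production implementations often pair dimensions differently and operate on whole batches.

```python
import numpy as np

def rope(x, m, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) of a d-dimensional vector by angle m * theta_i."""
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)          # theta_i = 10000^(-2i/d)
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same relative offset (3), very different absolute positions.
score_near = rope(q, 5) @ rope(k, 2)
score_far = rope(q, 105) @ rope(k, 102)
print(np.isclose(score_near, score_far))  # True
```

Since each pair is only rotated, the vector norms are unchanged as well; position affects the angle between query and key, not their magnitudes.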
22.1.4 Feed-Forward Networks
Each transformer layer includes a position-wise FFN applied independently to each token's representation: \[\text{FFN}(x) = W_2 \cdot \text{act}(W_1 x + b_1) + b_2\]
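Modern architectures use SwiGLU or GeGLU activations instead of ReLU. These replace the single activation \(\text{act}(W_1 x + b_1)\) with a gated product of two projections: \[\text{SwiGLU}(x) = \text{Swish}(W_1 x) \odot (W_3 x)\] where \(\odot\) is elementwise multiplication and Swish\((x) = x \cdot \sigma(x)\). The gated result is then projected by \(W_2\) as before, typically without bias terms.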
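A minimal NumPy sketch of a SwiGLU feed-forward block follows. It uses the row-vector convention (`x @ W`, one token per row), omits biases, and applies an output projection `W2` after the gate; those choices, the function names, and the layer sizes are assumptions made for illustration.

```python
import numpy as np

def swish(x):
    # Swish(x) = x * sigmoid(x), written as x / (1 + exp(-x)).
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W1, W3, W2):
    """x: (n, d_model); W1, W3: (d_model, d_ff); W2: (d_ff, d_model)."""
    gated = swish(x @ W1) * (x @ W3)   # elementwise gate, shape (n, d_ff)
    return gated @ W2                  # project back to d_model

rng = np.random.default_rng(0)
n, d_model, d_ff = 3, 8, 32
x = rng.standard_normal((n, d_model))
W1 = rng.standard_normal((d_model, d_ff))
W3 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))
y = swiglu_ffn(x, W1, W3, W2)
print(y.shape)  # (3, 8)
```

Each row of `x` is processed independently, which is what "position-wise" means: there is no mixing across tokens in this sublayer.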
An influential hypothesis: the FFN layers function as key-value memories. The first matrix \(W_1\) computes “keys” (pattern matchers), the activation function selects which keys are triggered, and the second matrix \(W_2\) retrieves the corresponding “values” (output patterns). On this view, FFN layers are where much factual knowledge is stored: “Paris is the capital of France” corresponds to a key (a pattern matching “capital of France”) associated with a value (a representation of “Paris”). Knowledge editing techniques like ROME exploit this interpretation to modify individual facts.
22.2 Scaling Laws
One of the most important empirical discoveries in modern AI: model performance follows predictable power-law relationships with model size, data, and compute.
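Kaplan et al. (2020) first established that loss \(L\) decreases as a power law with model parameters \(N\), dataset size \(D\), and compute budget \(C\). For model size, when data and compute are not the bottleneck: \[L(N) \approx \left(\frac{N_0}{N}\right)^{\alpha_N} + L_\infty\] where \(N_0\) and \(\alpha_N\) are fitted constants and \(L_\infty\) is the irreducible loss.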
Hoffmann et al. (2022) refined these laws with the Chinchilla scaling law, showing that model size and training data should be scaled in equal proportion as the compute budget grows. At a fixed compute budget, a compute-optimal model trained on the right amount of data for its size outperforms a larger model trained on less data. Concretely: for a 10B parameter model, you should use approximately 200B training tokens, or about 20 tokens per parameter. By this measure, GPT-3 (175B parameters trained on 300B tokens) was significantly undertrained.
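The arithmetic behind the "undertrained" claim is short enough to write out. The constant of 20 tokens per parameter is simply the ratio implied by the 10B-parameter, 200B-token example; treat it as a rule of thumb rather than an exact law.

```python
TOKENS_PER_PARAM = 20  # rule of thumb implied by the 10B / 200B example

def compute_optimal_tokens(n_params):
    """Rough compute-optimal training-token count for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

def tokens_per_param(n_params, n_tokens):
    return n_tokens / n_params

print(compute_optimal_tokens(10e9) / 1e9)    # 200.0 (billion tokens)

gpt3_ratio = tokens_per_param(175e9, 300e9)
print(round(gpt3_ratio, 2))                  # 1.71 tokens per parameter
print(compute_optimal_tokens(175e9) / 1e12)  # 3.5 (trillion tokens)
```

Under this rule of thumb, a 175B-parameter model would want on the order of 3.5 trillion tokens, more than ten times what GPT-3 saw.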
Scaling laws let you predict the performance of a model before training it. This is enormously valuable when training runs cost millions of dollars: you can estimate whether your planned model will achieve the target performance, or whether you need more data, more parameters, or both. Labs like Anthropic and DeepMind use scaling laws extensively for planning their training runs.
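As a sketch of how such a prediction works, the code below generates synthetic losses from a power law of the form \((N_0/N)^{\alpha_N} + L_\infty\), fits the exponent from four small "runs", and extrapolates to a larger model. All constants are made up for illustration, and the irreducible loss is assumed known; fitting real training runs is noisier and usually estimates \(L_\infty\) jointly.

```python
import numpy as np

# Hypothetical constants, chosen only to generate synthetic data.
N0, ALPHA, L_INF = 1e13, 0.08, 1.7

def loss(N):
    return (N0 / N) ** ALPHA + L_INF

# Four small models we pretend to have trained.
N_small = np.array([1e6, 5e6, 25e6, 125e6])
L_small = loss(N_small)

# With L_inf known, log(L - L_inf) is linear in log N with slope -alpha.
slope, intercept = np.polyfit(np.log(N_small), np.log(L_small - L_INF), 1)
alpha_hat = -slope
N0_hat = np.exp(intercept / alpha_hat)

# Extrapolate to a model roughly 8x larger than the largest small run.
N_big = 1e9
L_pred = (N0_hat / N_big) ** alpha_hat + L_INF
print(round(alpha_hat, 4))  # 0.08
```

On a log-log plot the reducible loss \(L - L_\infty\) is a straight line, which is why a handful of cheap runs can constrain the behavior of a much more expensive one.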
22.3 Mixture of Experts (MoE)
MoE architectures (Shazeer et al. 2017) replace the dense FFN with a set of \(E\) expert sub-networks and a learned gating function that routes each token to the top-\(k\) experts: \[\text{MoE}(x) = \sum_{i \in \text{TopK}(G(x))} G(x)_i \cdot \text{Expert}_i(x)\] where \(G(x) = \text{softmax}(W_g \cdot x)\) is the gating function.
This decouples total parameters from active parameters: Mixtral (Jiang et al. 2024) has 47B total parameters but activates only 13B per token (top-2 of 8 experts). DeepSeek-V3 (DeepSeek-AI 2024) pushes this further with 256 fine-grained experts.
Load balancing is the key engineering challenge: if the router sends most tokens to the same expert, training becomes inefficient and the other experts are wasted. An auxiliary loss term penalizes imbalanced routing.
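The routing formula and a load-balancing penalty can be sketched as below. The experts here are tiny random `tanh` layers standing in for real FFNs, gate weights are used as-is without renormalizing over the top-\(k\), and the penalty shown (number of experts times the sum of routed fraction times mean gate probability) is one simple form chosen for illustration; the text does not specify which auxiliary loss is meant.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(X, W_g, experts, k=2):
    """X: (n, d); W_g: (d, E); experts: list of E callables mapping (d,) -> (d,)."""
    gates = softmax(X @ W_g, axis=-1)              # G(x) for every token, (n, E)
    topk = np.argsort(-gates, axis=-1)[:, :k]      # indices of the k largest gates
    out = np.zeros_like(X)
    for t in range(X.shape[0]):
        for i in topk[t]:
            out[t] += gates[t, i] * experts[i](X[t])   # G(x)_i * Expert_i(x)
    return out, gates, topk

def load_balance_loss(gates, topk):
    """E * sum_i f_i * P_i: f_i is the fraction of routing slots sent to expert i,
    P_i the mean gate probability of expert i. Equals 1 when routing is uniform."""
    E = gates.shape[1]
    f = np.bincount(topk.ravel(), minlength=E) / topk.size
    P = gates.mean(axis=0)
    return E * np.sum(f * P)

rng = np.random.default_rng(0)
n, d, E = 32, 8, 4
expert_weights = [rng.standard_normal((d, d)) for _ in range(E)]
experts = [lambda x, W=W: np.tanh(x @ W) for W in expert_weights]
X = rng.standard_normal((n, d))
W_g = rng.standard_normal((d, E))
out, gates, topk = moe_layer(X, W_g, experts, k=2)
aux = load_balance_loss(gates, topk)
```

Only `k` of the `E` experts run for each token, which is the sense in which active parameters are decoupled from total parameters.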
22.4 Training at Scale: Parallelism Strategies
Training a 70B+ parameter model on a single GPU is impossible (the model alone requires 140+ GB in FP16). You must split the work across hundreds of GPUs:
Data Parallelism (DP): The simplest approach. Replicate the entire model on each GPU; split the data batch. Each replica computes gradients independently, then gradients are averaged via all-reduce. Memory inefficient: every GPU stores the full model.
Fully Sharded Data Parallel (FSDP/ZeRO): Shard the model parameters, gradients, and optimizer states across GPUs. On \(N\) GPUs, each GPU stores only \(1/N\)th of the total state. Parameters are gathered on-demand for computation and released immediately after. This is what makes training 70B+ models feasible on clusters of roughly 100 GPUs.
Tensor Parallelism (TP): Split individual weight matrices across GPUs. For example, slice an attention projection matrix row-wise or column-wise so each GPU computes a shard of the matrix multiplication. Requires fast interconnect (NVLink) between GPUs on the same node.
Pipeline Parallelism (PP): Assign different layers to different GPUs. Data flows through the pipeline in micro-batches to keep all stages busy (reducing the “pipeline bubble” where some GPUs idle).
In practice, large training runs combine these strategies: TP across GPUs within a node (connected by NVLink), PP across nodes, and data parallelism, often in its sharded FSDP/ZeRO form, across the resulting model replicas over the full cluster.
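A back-of-the-envelope estimate shows why sharding matters. The figure of 16 bytes per parameter is an assumption for mixed-precision Adam (FP16 weights and gradients plus FP32 master weights, momentum, and variance); activations and temporary buffers are ignored, and the 128-GPU cluster is a hypothetical example.

```python
def training_state_gb(n_params, n_gpus=1, bytes_per_param=16):
    """Per-GPU memory (GB) for weights, gradients, and optimizer state, fully sharded.

    bytes_per_param=16 is an assumption: FP16 weights (2) + FP16 gradients (2)
    + FP32 master weights, momentum, and variance for Adam (4 + 4 + 4).
    """
    return n_params * bytes_per_param / n_gpus / 1e9

n_params = 70e9
weights_fp16_gb = n_params * 2 / 1e9                  # FP16 weights alone
replicated_gb = training_state_gb(n_params)           # plain DP: every GPU holds everything
sharded_gb = training_state_gb(n_params, n_gpus=128)  # hypothetical 128-GPU cluster
print(weights_fp16_gb, replicated_gb, sharded_gb)     # 140.0 1120.0 8.75
```

Plain data parallelism would need over a terabyte of state per GPU under these assumptions, while full sharding across 128 GPUs brings the persistent state under 10 GB each, leaving room for activations.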
22.5 Inference Optimizations
22.5.1 FlashAttention
Standard attention requires materializing the full \(n \times n\) attention matrix, using \(O(n^2)\) memory. FlashAttention (Dao et al. 2022) fuses the attention computation into a single GPU kernel using a tiling strategy that keeps data in fast GPU SRAM (on-chip memory) rather than slower GPU HBM (high-bandwidth memory). The result is still exact attention, but the extra memory drops to \(O(n)\) and the kernel runs 2 to 4\(\times\) faster in practice. FlashAttention-2 further improves parallelism and work partitioning on the GPU, for example by also parallelizing over the sequence dimension.
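The idea that makes tiling possible is that softmax can be computed incrementally with a running maximum and a running denominator. The NumPy sketch below tiles only over keys and values and runs on the CPU, so it illustrates the arithmetic rather than the actual fused GPU kernel; the function names and block size are choices made here.

```python
import numpy as np

def standard_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])            # full (n, n) score matrix
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Process keys/values block by block, keeping only running statistics per query."""
    n, d_k = Q.shape
    m = np.full((n, 1), -np.inf)        # running row-wise max of the scores
    l = np.zeros((n, 1))                # running softmax denominator
    acc = np.zeros((n, V.shape[-1]))    # running unnormalized output
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d_k)     # only an (n, block) slice of the scores
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)       # rescale statistics accumulated so far
        P = np.exp(S - m_new)
        l = l * scale + P.sum(axis=-1, keepdims=True)
        acc = acc * scale + P @ Vb
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
Q = rng.standard_normal((10, 8))
K = rng.standard_normal((10, 8))
V = rng.standard_normal((10, 5))
print(np.allclose(standard_attention(Q, K, V), tiled_attention(Q, K, V)))  # True
```

The tiled version never holds more than one `(n, block)` slice of scores at a time, yet produces the same output as the version that builds the full matrix.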
22.5.2 KV-Cache and PagedAttention
During autoregressive generation, each new token attends to all previous tokens. Without caching, this means recomputing all key and value projections for every token generated. The KV-cache stores these projections, so each generation step computes only the new token's query, key, and value and then attends over the cached keys and values.
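A single-head sketch of the caching step follows. Random vectors stand in for token hidden states, and the variable names are illustrative; the point is only that appending one key/value row per step gives the same attention output as recomputing everything.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    """Single-query attention over all keys/values seen so far."""
    return softmax(K @ q / np.sqrt(q.shape[-1])) @ V

rng = np.random.default_rng(0)
d_model, d_k, steps = 16, 8, 5
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
tokens = rng.standard_normal((steps, d_model))   # stand-ins for token hidden states

K_cache = np.empty((0, d_k))
V_cache = np.empty((0, d_k))
cached_outputs = []
for x in tokens:
    # Only the new token's projections are computed; older K/V rows are reused.
    K_cache = np.vstack([K_cache, x @ W_k])
    V_cache = np.vstack([V_cache, x @ W_v])
    cached_outputs.append(attend(x @ W_q, K_cache, V_cache))

# Reference: recompute every key and value from scratch at each step.
recomputed = [
    attend(tokens[t] @ W_q, tokens[: t + 1] @ W_k, tokens[: t + 1] @ W_v)
    for t in range(steps)
]
print(all(np.allclose(a, b) for a, b in zip(cached_outputs, recomputed)))  # True
```

The cached loop performs one key and one value projection per step, while the reference performs `t + 1` of each at step `t`.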
The challenge: KV-cache memory grows linearly with sequence length and batch size, and in a serving system, different requests have different lengths. PagedAttention (Kwon et al. 2023) (used in vLLM) manages KV-cache like virtual memory: it allocates blocks of cache on demand, avoiding the need to pre-allocate memory for the maximum possible sequence length.
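The bookkeeping behind this idea can be illustrated with a toy block table. The class below is a hypothetical sketch, not vLLM's API: it only tracks which fixed-size blocks belong to which request and allocates a new block when the current ones are full.

```python
class PagedKVAllocator:
    """Toy block table: maps each request to a list of fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.lengths = {}        # request id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve space for one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:   # all allocated blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def free(self, request_id):
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(40):
    alloc.append_token("request-a")   # 40 tokens -> ceil(40 / 16) = 3 blocks
for _ in range(5):
    alloc.append_token("request-b")   # 5 tokens -> 1 block
print(len(alloc.block_tables["request-a"]), len(alloc.block_tables["request-b"]))  # 3 1
```

A request wastes at most one partially filled block, instead of reserving space for the longest sequence the server might ever see.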
22.5.3 Grouped-Query Attention (GQA)
Multi-head attention uses separate key and value projections for each head. GQA shares K/V heads across groups of query heads. If you have 32 query heads but only 8 KV heads (each KV head shared by 4 query heads), you reduce KV-cache size by \(4\times\). Multi-Query Attention (MQA) is the extreme: all query heads share a single KV head.
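The sharing pattern can be sketched with `np.repeat`: each KV head is reused by a group of consecutive query heads. The shapes below mirror the 32-query-head, 8-KV-head example; the sequence length, head dimension, and function names are arbitrary choices for this sketch, and masking is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(Q, K, V):
    """Q: (n_q_heads, n, d); K, V: (n_kv_heads, n, d); n_q_heads divisible by n_kv_heads."""
    group = Q.shape[0] // K.shape[0]
    # Each KV head serves `group` consecutive query heads.
    K_full = np.repeat(K, group, axis=0)     # (n_q_heads, n, d)
    V_full = np.repeat(V, group, axis=0)
    scores = Q @ K_full.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V_full

rng = np.random.default_rng(0)
n_q_heads, n_kv_heads, n, d = 32, 8, 10, 16
Q = rng.standard_normal((n_q_heads, n, d))
K = rng.standard_normal((n_kv_heads, n, d))
V = rng.standard_normal((n_kv_heads, n, d))
out = grouped_query_attention(Q, K, V)

mha_cache_elems = 2 * n_q_heads * n * d    # K and V if every query head had its own
gqa_cache_elems = K.size + V.size          # K and V actually stored
print(out.shape, mha_cache_elems // gqa_cache_elems)  # (32, 10, 16) 4
```

Only `K` and `V` need to be cached; the repeated copies are views of the same data conceptually, so the cache shrinks by the group size while the number of query heads is unchanged.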
Putting it all together, a modern inference system uses: FlashAttention for fast attention computation, KV-cache with PagedAttention for memory-efficient generation, GQA for small KV-cache footprint, continuous batching to maximize GPU utilization (multiple requests share the GPU, with new requests added as old ones finish), and speculative decoding (a small “draft” model generates candidate tokens that the larger model verifies in parallel, reducing the number of forward passes).
22.6 Tokenization In Depth
Byte Pair Encoding (BPE) (Sennrich et al. 2016) builds a vocabulary by iteratively merging the most frequent pair of adjacent tokens in the training corpus. Starting from individual characters (or bytes), the algorithm discovers common subwords (“th” + “e” \(\to\) “the”, “un” + “der” \(\to\) “under”).
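The merge loop can be sketched in plain Python. This is a minimal byte-level version with no pre-tokenization, special tokens, or tie-breaking rules, so its merges will not match those of any production tokenizer; the function names and the toy corpus are illustrative.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most frequent adjacent pair in a token-id sequence."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    ids = list(text.encode("utf-8"))   # start from raw bytes (ids 0..255)
    merges = {}                        # (id, id) -> new id, in merge order
    for step in range(num_merges):
        if len(ids) < 2:
            break
        pair = most_frequent_pair(ids)
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

def decode(ids, merges):
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():   # dict order is merge order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

text = "the theme of the thesis"
ids, merges = train_bpe(text, num_merges=3)
print(len(text.encode("utf-8")), "->", len(ids), "tokens")
```

Each merge adds one vocabulary entry and shortens the sequence, which is the trade-off between vocabulary size and compression discussed next.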
Key design decisions:
- Vocabulary size: Larger vocabularies compress text better (fewer tokens per sentence) but require larger embedding matrices. Common sizes: 32K (LLaMA), 50K (GPT-2), 100K+ (GPT-4).
- Byte-level vs. character-level: Byte-level BPE (GPT-2 and successors) operates on raw UTF-8 bytes, ensuring any text can be tokenized without unknown tokens. Character-level BPE may need special handling for rare Unicode characters.
- Multilingual balance: A tokenizer trained mostly on English text will be highly inefficient for other languages (a single Chinese character might require multiple tokens). Multilingual tokenizers must balance coverage across languages.
- Implementations: tiktoken (OpenAI, used by GPT-4), SentencePiece (Google, used by LLaMA), and the Hugging Face Tokenizers library are the most widely used.
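The byte-level point is easy to see directly: before any merges are learned, a character costs as many tokens as it has UTF-8 bytes. The characters below are arbitrary examples written as escapes so the snippet does not depend on file encoding.

```python
# Example characters: ASCII "a", Latin "e with acute", a CJK character, an emoji.
examples = ["a", "\u00e9", "\u4e2d", "\U0001F642"]

# Number of UTF-8 bytes, i.e. the token count before any BPE merges apply.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in examples}
print(list(byte_lengths.values()))  # [1, 2, 3, 4]
```

A tokenizer trained mostly on English learns few merges that cover the multi-byte sequences, which is why such text can stay at several tokens per character.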
Andrej Karpathy's “Let's build the GPT Tokenizer” video on YouTube walks through implementing BPE from scratch in Python. It is the single best resource for understanding how tokenization actually works, including all the edge cases and design decisions that textbooks gloss over. Watching it will demystify one of the most under-appreciated components of modern LLMs.
22.7 Exercises
- Implement scaled dot-product attention from scratch in PyTorch. Verify your implementation matches torch.nn.functional.scaled_dot_product_attention on random inputs with a tolerance of \(10^{-5}\).
- Implement a simple MoE layer with 4 experts and top-2 routing. Train a small model with and without MoE on the same data. Compare training curves: does MoE achieve the same loss faster?
- Implement BPE tokenization from scratch following Karpathy's approach. Train it on a small corpus and compare the resulting vocabulary with tiktoken's vocabulary on the same text. Where do they agree and disagree?
- Run a scaling experiment: train character-level GPT models at four sizes (1M, 5M, 25M, 125M parameters) on proportionally scaled data. Plot loss vs. parameter count on a log-log scale. Do you observe a power law?
- Estimate the KV-cache memory for a LLaMA 3 8B model (32 layers, 32 query heads with GQA group size 4, i.e. 8 KV heads, head dimension 128) serving a batch of 64 requests with average sequence length 2048. Compare with Multi-Head Attention (no GQA, 32 KV heads). How much memory does GQA save?