6 Training Foundational Models
Every large language model begins as a blank slate: billions of randomly initialized numbers. The journey from that random noise to a coherent, knowledgeable, instruction-following assistant is one of the most fascinating engineering achievements of our time. In this chapter, we walk through that entire journey: gathering data, tokenizing it, pre-training a transformer, and then the critical post-training pipeline that transforms a raw text predictor into a useful AI system.
If you want to build deep intuition for what happens inside a language model, there is no better starting point than Andrej Karpathy's “Let's build GPT from scratch” YouTube tutorial and his nanoGPT repository on GitHub. Karpathy walks through building a GPT-style language model from first principles in pure PyTorch, complete with tokenization, attention, training loops, and text generation. This chapter complements that hands-on approach with the broader context of how production-scale models are trained.
Since the general reader might not have access to massive GPU clusters, we focus on reproducing smaller models. However, toward the end we explain how the same principles scale to models with billions of parameters.
6.1 Data Gathering and Dataset Generation
Training a foundational model requires an enormous amount of text. GPT-3 (Brown et al. 2020) was trained on roughly 300 billion tokens; LLaMA 3 (Grattafiori et al. 2024) on over 15 trillion tokens. The quality and diversity of this data fundamentally shape everything the model will know and how it will behave.
6.1.1 Where the Data Comes From
Production-scale training datasets are assembled from multiple sources:
- Web crawls: Common Crawl provides petabytes of web text. This raw data must be aggressively filtered and deduplicated to remove spam, boilerplate, and low-quality content.
- Curated datasets: The Pile (Gao et al. 2020) by EleutherAI is a well-known open dataset combining 22 diverse sources, including books, Wikipedia, GitHub code, StackExchange, PubMed abstracts, and more. You can explore it at https://pile.eleuther.ai/.
- Synthetic data: Increasingly, high-quality data is generated by existing models. The Phi series by Microsoft demonstrated that carefully curated synthetic data can train surprisingly capable small models.
- Specialized corpora: Code repositories (The Stack), scientific papers (S2ORC), mathematical reasoning data, and multilingual text.
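The filtering and deduplication applied to web crawls can be illustrated with a minimal sketch. It assumes exact deduplication by hashing whitespace-normalized text plus a crude length check; production pipelines typically add near-duplicate detection and learned quality classifiers. The function names and the `min_words` threshold are hypothetical, chosen only for illustration.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def dedup_and_filter(docs: list[str], min_words: int = 5) -> list[str]:
    """Keep the first copy of each document that passes a simple length check."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:  # hypothetical quality heuristic
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:  # exact duplicate after normalization
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate once normalized
    "Click here!",  # too short to keep
    "Transformers process tokens in parallel using attention.",
]
cleaned = dedup_and_filter(corpus)
print(len(cleaned))  # 2
```

Hashing the normalized text keeps memory use proportional to the number of unique documents rather than their total size, which is why hash-based exact deduplication is a common first pass.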
HuggingFace released FineWeb, a massive filtered web dataset, and FineWeb-Edu, a subset scored by educational quality. If you are training a model from scratch for learning purposes, FineWeb-Edu is an excellent choice: it is freely available, well-documented, and produces models with strong reasoning abilities even at small scales.
6.1.2 Data Quality Matters More Than Quantity
A recurring finding in the field is that data quality matters at least as much as data quantity. On the quantity side, the Chinchilla scaling laws (Hoffmann et al. 2022) showed that, for a fixed compute budget, a model should be trained on roughly 20 tokens per parameter. Beyond quantity, aggressive deduplication, quality filtering (removing low-quality web pages), and domain balancing (ensuring appropriate representation of code, science, conversation, etc.) have outsized effects on model quality.
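As a quick worked example of the 20-tokens-per-parameter rule of thumb, the sketch below estimates a token budget for a small model. The compute estimate uses the widely quoted approximation of about 6 FLOPs per parameter per token, which is an assumption added here and not a figure from this chapter; the function names are hypothetical.

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal token count for a model of n_params."""
    return n_params * tokens_per_param


def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute, assuming ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


params = 125e6  # a 125M-parameter model
tokens = chinchilla_tokens(params)
flops = approx_train_flops(params, tokens)
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")
```

For a 125M-parameter model this gives about 2.5 billion tokens. Many recent models are deliberately trained well past this point, because a smaller model trained on more tokens is cheaper to serve.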
6.2 Tokenization
Before text can enter a neural network, it must be converted to numbers. This process, called tokenization, is deceptively important: the choice of tokenizer affects model efficiency, multilingual performance, and even reasoning ability.
6.2.1 Tokenization Strategies
There are several levels at which text can be tokenized:
- Character-level: Each character is a token. Simple but inefficient: the model must learn to spell every word from individual letters, and sequences become very long.
- Word-level: Each word is a token. Efficient for common words, but the vocabulary explodes with rare words, morphological variants, and multilingual text.
- Subword-level (BPE): The dominant approach. Byte Pair Encoding (Sennrich et al. 2016) starts with individual bytes (or characters) and iteratively merges the most frequent pair into a new token. This naturally handles common words as single tokens while breaking rare words into meaningful subword pieces. “unhappiness” might become [“un”, “happiness”] or [“un”, “happ”, “iness”].
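The merge loop at the heart of BPE can be sketched in a few lines. This is a minimal character-level illustration, not the implementation used by any production tokenizer: it assumes the corpus is already split into words, ignores byte-level handling and special tokens, and breaks ties by first occurrence. All function names are hypothetical.

```python
from collections import Counter


def get_pair_counts(words: dict) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(words: dict, pair: tuple) -> dict:
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus: list, num_merges: int):
    """Learn up to `num_merges` merges from a list of words."""
    words = dict(Counter(tuple(w) for w in corpus))
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, vocab_words = train_bpe("low low low lower lowest".split(), num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" has become a single token, while "lower" and "lowest" are represented as "low" followed by individual characters.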
OpenAI's tiktoken library lets you experiment with tokenization interactively. Install it with pip install tiktoken, then tokenize any text to see how GPT-4 breaks it into tokens. You will quickly notice that common English words are single tokens, while rare or technical terms are split into pieces. Karpathy also has an excellent video tutorial called “Let's build the GPT Tokenizer” that walks through building a BPE tokenizer from scratch.
6.2.2 Vocabulary Size Tradeoffs
Larger vocabularies compress text more efficiently (each word maps to fewer tokens) but require larger embedding matrices and may underfit rare tokens. GPT-2 uses a vocabulary of 50,257 tokens; LLaMA 3 uses 128,256. The sweet spot depends on the training data and target languages.
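To make the embedding-matrix cost concrete, the sketch below multiplies vocabulary size by hidden dimension. The vocabulary sizes come from the text; the hidden sizes (768 for the smallest GPT-2, 4096 for LLaMA 3 8B) are assumptions added for this illustration.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of parameters in a vocab_size x d_model embedding matrix."""
    return vocab_size * d_model


# Hidden sizes below are assumed values for the smallest GPT-2 and LLaMA 3 8B.
gpt2 = embedding_params(50_257, 768)
llama3 = embedding_params(128_256, 4096)
print(f"GPT-2:   {gpt2:,}")    # 38,597,376
print(f"LLaMA 3: {llama3:,}")  # 525,336,576
```

Under these assumptions the embedding table alone is a large share of a small model's parameter budget. An untied output projection of the same shape would double the cost.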
6.3 Model Architecture
The overwhelming majority of modern LLMs use the decoder-only transformer architecture, popularized by OpenAI's GPT-1 (Radford et al. 2018). The model consists of stacked transformer blocks, each containing multi-head self-attention and a feed-forward network (for a detailed mathematical treatment, see Chapter 18).
6.3.1 Choosing an Architecture
For a from-scratch training exercise, we recommend starting with a GPT-2-style decoder-only model, updated with components that later models such as LLaMA made standard:
- Decoder-only transformer with causal (autoregressive) attention.
- Rotary Position Embeddings (RoPE) (Su et al. 2021) instead of learned absolute positions, for better length generalization.
- RMSNorm instead of LayerNorm, for training stability.
- SwiGLU activation in the feed-forward layers.
- Grouped-Query Attention (GQA) for memory efficiency at inference.
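To show what one of these component swaps means in practice, here is a minimal sketch contrasting LayerNorm with RMSNorm on a single vector. It assumes plain Python lists and omits the learnable gain (and LayerNorm's bias) that real implementations include; the function names are hypothetical.

```python
import math


def layer_norm(x: list, eps: float = 1e-5) -> list:
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x: list, eps: float = 1e-5) -> list:
    """Divide by the root mean square only; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


x = [1.0, 2.0, 3.0, 4.0]
print(layer_norm(x))  # centered around zero
print(rms_norm(x))    # rescaled, but not centered
```

RMSNorm skips the mean computation, so it is slightly cheaper per call, and the output keeps the sign pattern of the input.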
For a hands-on reference implementation, EleutherAI's GPT-J architecture is available on HuggingFace at https://huggingface.co/docs/transformers/en/model_doc/gptj. Karpathy's nanoGPT is an even simpler starting point: a clean, readable implementation of GPT-2 in PyTorch whose model definition is roughly 300 lines.
You do not need a $100 million compute budget to learn model training. A 125M parameter GPT-2 reproduction can be trained on a single GPU in a few days. A character-level model on Shakespeare (as in Karpathy's tutorial) can be trained in minutes on a laptop. The principles of attention, loss, and gradient descent are the same at any scale. Start small, understand deeply, then scale up.
6.4 Pre-Training
Pre-training is the process of training the model on a massive text corpus using the next-token prediction objective. Given a sequence of tokens \([t_1, t_2, ..., t_{n-1}]\), the model learns to predict \(t_n\). The loss function is simply the cross-entropy between the model's predicted probability distribution and the actual next token.
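The next-token cross-entropy loss can be written out directly. This is a minimal sketch in plain Python, assuming the model's outputs are given as one list of raw logits per position; a real training loop would use a framework's fused, batched implementation. The function names are hypothetical.

```python
import math


def softmax(logits: list) -> list:
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position: list, targets: list) -> float:
    """Mean of -log p(correct next token) over all positions."""
    losses = [
        -math.log(softmax(logits)[target])
        for logits, target in zip(logits_per_position, targets)
    ]
    return sum(losses) / len(losses)


# A model with uniform logits over a 4-token vocabulary is maximally unsure.
uniform_loss = next_token_loss([[0.0, 0.0, 0.0, 0.0]], [2])
print(uniform_loss)  # equals ln(4)
```

A useful sanity check follows from this: an untrained model that predicts a uniform distribution has a loss of ln(vocabulary size), which is about 10.8 for GPT-2's 50,257-token vocabulary. A loss curve should start near that value and fall from there.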
This seemingly simple objective, applied at enormous scale, produces models with remarkable emergent capabilities: factual knowledge, reasoning, code generation, multilingual translation, and more. The model learns all of this from the statistical patterns in text.
6.4.1 Training Infrastructure
Pre-training large models requires distributed computing across many GPUs. The key parallelism strategies (covered in detail in Chapter 18) are:
- Data parallelism: Replicate the model on each GPU, split the batch.
- Tensor parallelism: Split individual weight matrices across GPUs.
- Pipeline parallelism: Assign different layers to different GPUs.
- FSDP / ZeRO: Shard parameters, gradients, and optimizer states across GPUs.
For single-GPU training of small models, PyTorch with mixed-precision training (torch.cuda.amp, exposed as torch.amp in recent releases) is sufficient. For multi-GPU training, tools like DeepSpeed, Megatron-LM, torchtitan, and nanotron handle the distributed complexity.
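The idea behind data parallelism can be checked on a toy problem. The sketch below assumes a one-parameter linear model with mean-squared-error loss and a batch that divides evenly among workers; it simulates the workers sequentially in one process, so it illustrates the arithmetic and is not a distributed implementation. Function names are hypothetical.

```python
def grad_mse(w: float, xs: list, ys: list) -> float:
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(w: float, xs: list, ys: list, num_workers: int) -> float:
    """Each worker computes a gradient on its shard; the results are averaged."""
    shard = len(xs) // num_workers  # assumes an evenly divisible batch
    local_grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    return sum(local_grads) / num_workers  # stands in for an all-reduce (mean)


xs, ys, w = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], 0.5
full = grad_mse(w, xs, ys)
sharded = data_parallel_grad(w, xs, ys, num_workers=2)
print(full, sharded)  # both -22.5
```

Because the averaged shard gradients equal the full-batch gradient, every replica applies the same update and the model copies stay in sync.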
6.5 Post-Training: From Text Predictor to Assistant
A pre-trained model is a powerful text predictor, but it is not an assistant. It does not know how to follow instructions, refuse harmful requests, or maintain a helpful conversation. The post-training pipeline transforms the raw model into a useful, safe, well-behaved AI. This is where the magic happens.
Post-training typically has two major stages: (1) Supervised Fine-Tuning (SFT), which teaches the model the format and style of helpful responses, and (2) Reinforcement Learning from feedback, which aligns the model's behavior with human preferences. Modern models like LLaMA 3 (Grattafiori et al. 2024) undergo multiple rounds of both stages with progressively refined data.
6.5.1 Supervised Fine-Tuning (SFT)
SFT trains the model on (instruction, response) pairs. The training data consists of questions, tasks, or conversation turns paired with high-quality human-written or human-verified responses. After SFT, the model learns the conversational format, follows instructions, and produces structured responses.
Instruction Fine-Tuning (IFT) is a subset of SFT specifically focused on teaching the model to follow diverse instructions: summarize this text, translate to French, write code for X, explain Y to a five-year-old. The key insight is that a relatively small amount of high-quality instruction data (tens of thousands of examples) can dramatically change the model's behavior.
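A single SFT training example might be assembled as in the sketch below. The prompt template, the whitespace "tokenizer", and the `<eos>` marker are all hypothetical stand-ins. Computing the loss only on response tokens is a common practice assumed here, not something this chapter prescribes.

```python
def build_sft_example(instruction: str, response: str):
    """Return (tokens, loss_mask) for one instruction/response pair."""
    # Hypothetical template; real chat templates are model-specific.
    prompt = f"### Instruction:\n{instruction}\n### Response:\n"
    prompt_tokens = prompt.split()            # toy whitespace tokenizer
    response_tokens = response.split() + ["<eos>"]
    tokens = prompt_tokens + response_tokens
    # 0 = ignore in the loss (prompt), 1 = train on this token (response).
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


tokens, loss_mask = build_sft_example("Translate to French: cat", "chat")
print(tokens)
print(loss_mask)
```

With this mask, the model is still conditioned on the instruction but is only penalized for mistakes in the response, including the end-of-sequence token that teaches it when to stop.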
6.5.2 RLHF: Reinforcement Learning from Human Feedback
After SFT gives the model the right format, RLHF (Ziegler et al. 2020) teaches it to produce responses that humans actually prefer. The process works as follows:
- Collect preferences: Present human raters with two model responses to the same prompt and ask which is better. This creates a dataset of pairwise preferences.
- Train a reward model: Train a separate neural network to predict which of two responses a human would prefer. This reward model assigns a scalar score to any (prompt, response) pair.
- Optimize with RL: Use Proximal Policy Optimization (PPO) (Schulman et al. 2017) to fine-tune the language model to maximize the reward model's score, while staying close to the SFT model (via a KL divergence penalty to prevent “reward hacking”).
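Two pieces of the steps above can be sketched numerically: the pairwise loss often used to train the reward model, and a KL-penalized reward for the RL stage. The sketch assumes a Bradley-Terry style loss on scalar scores and a simple per-sample log-ratio as the KL estimate; both are common choices but not the only ones. The function names and the value of `beta` are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigma(r_chosen - r_rejected): low when the chosen response scores higher."""
    return -math.log(sigmoid(r_chosen - r_rejected))


def kl_shaped_reward(reward: float, logp_policy: float, logp_ref: float,
                     beta: float = 0.1) -> float:
    """Reward minus a penalty for drifting away from the reference (SFT) model."""
    return reward - beta * (logp_policy - logp_ref)


print(pairwise_reward_loss(2.0, 0.0))  # small: ranking agrees with the rater
print(pairwise_reward_loss(0.0, 2.0))  # large: ranking disagrees
print(kl_shaped_reward(1.0, logp_policy=-5.0, logp_ref=-6.0))  # 0.9
```

The penalty term grows when the policy assigns much higher probability to a response than the reference model does, which discourages the policy from exploiting quirks of the reward model.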
SFT teaches the model what good responses look like, but it treats all training examples as equally good. RLHF adds a sense of degree: some responses are better than others, and the model should learn to produce the best ones. The reward model captures subtle human preferences about helpfulness, clarity, safety, and style that are hard to specify in a fixed dataset. This is why RLHF-trained models feel noticeably more polished than SFT-only models.
6.5.3 DPO: Cutting Out the Middleman
Direct Preference Optimization (DPO) (Rafailov et al. 2023) showed that you can skip the explicit reward model entirely. DPO exploits a closed-form relationship between the reward function and the optimal policy under a KL constraint, which turns the RL problem into a supervised learning problem on preference pairs. Given a prompt \(x\), a preferred response \(y_w\), and a dispreferred response \(y_l\), DPO minimizes the loss:
\[\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\]
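Here \(\sigma\) is the logistic function, \(\pi_{\text{ref}}\) is the frozen reference model (typically the SFT model), and \(\beta\) controls how strongly the policy is kept close to it. DPO is simpler to implement, more stable to train, and achieves results competitive with PPO-based RLHF. It has become the go-to approach for many open-source model trainers.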
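The DPO loss for a single preference pair can be computed directly from four sequence log-probabilities. This is a minimal sketch assuming those log-probabilities have already been summed over response tokens; the function name and the default `beta` are hypothetical.

```python
import math


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """-log sigma(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) written as log(1 + exp(-m))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is zero and the loss is ln(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy raises the preferred response relative to the reference: loss drops.
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

Only the change relative to the reference model matters: a response that both models already prefer contributes no gradient until the policy's log-ratios diverge.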
6.6 RL for Reasoning: The o1 Paradigm
A pivotal shift occurred when researchers discovered that LLMs could develop reasoning capabilities through pure RL, without any supervised chain-of-thought data.
OpenAI's o1 model (OpenAI 2024) demonstrated that a model trained with RL can learn to “think” internally: producing long chains of reasoning tokens before answering. The key insight is that spending more compute at inference time (more thinking tokens) produces better answers. This is called inference-time scaling or test-time compute.
What makes this approach remarkable is that the reasoning behavior need not be taught by example. OpenAI has not published o1's full training recipe, but DeepSeek-R1-Zero (Guo et al. 2025) showed publicly that RL alone, with no human-written chain-of-thought demonstrations, can incentivize a model to discover that “thinking step by step” leads to higher rewards on reasoning tasks. The model develops its own reasoning strategies through trial and error, a case of emergent behavior driven by the right training signal.
6.6.1 RLAIF: Scaling Feedback with AI
Instead of relying on expensive human raters, RLAIF (Reinforcement Learning from AI Feedback) uses another LLM to judge response quality. The judge model evaluates outputs and provides the preference signal that drives the RL loop. This approach scales much more easily and is believed to have been used by OpenAI for the o1 and o3 models (OpenAI 2024, 2025).
6.6.2 GRPO: DeepSeek's Approach
DeepSeek-R1 (Guo et al. 2025) popularized Group Relative Policy Optimization (GRPO), first proposed in the earlier DeepSeekMath work (Shao et al. 2024). GRPO is an elegant alternative to PPO that eliminates the critic (value) network. For each prompt, GRPO generates a group of responses, scores them, and uses the group statistics as a baseline. Responses that score above the group average are reinforced; those below are penalized. On verifiable tasks the score can come from a simple rule-based verifier, such as checking whether a math answer is correct, so no learned reward model is needed either. This simplicity allowed DeepSeek to train a strong reasoning model at a fraction of the cost of PPO-based approaches.
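The group-relative baseline can be sketched as follows. The example assumes binary rewards from a verifier and normalizes by the population standard deviation of the group; the small `eps` guards against division by zero when every response receives the same score. The function name is hypothetical, and the clipped policy-gradient update that consumes these advantages is omitted.

```python
import statistics


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Advantage of each response relative to its own group's mean and spread."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one math prompt; 1.0 means the verifier accepted it.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # roughly [1, -1, -1, 1]
```

When all responses in a group are correct (or all wrong), every advantage is zero and the prompt contributes no learning signal, which is one reason prompt difficulty matters for this style of training.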
6.7 Parameter-Efficient Fine-Tuning
Full fine-tuning updates every parameter in the model, which is prohibitively expensive for large models. Parameter-efficient methods update only a small subset of parameters while achieving competitive performance.
6.7.1 LoRA and Its Variants
LoRA (Low-Rank Adaptation) (Hu et al. 2021) is the most widely used approach. Instead of updating a weight matrix \(W\) directly, LoRA adds a low-rank decomposition: \(W' = W + BA\), where \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times d}\) with rank \(r \ll d\) (typically 8 to 64). Only \(A\) and \(B\) are trained; \(W\) is frozen. This reduces trainable parameters by 100x or more.
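A small sketch makes both the update rule and the parameter savings concrete. It assumes a square \(d \times d\) weight matrix, plain Python lists for the matrices, and a zero-initialized \(B\) so that the adapted model starts out identical to the base model; the helper names are hypothetical.

```python
def matmul(a: list, b: list) -> list:
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w: list, b: list, a: list) -> list:
    """Return W' = W + BA."""
    ba = matmul(b, a)
    return [[wij + dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, ba)]


def lora_param_counts(d: int, r: int):
    """(full, lora) trainable parameter counts for one d x d weight matrix."""
    return d * d, 2 * d * r


# Tiny example with d = 3, r = 1. B starts at zero, so W' equals W.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[0.0], [0.0], [0.0]]   # d x r
A = [[0.5, -0.2, 0.1]]      # r x d
merged = lora_merge(W, B, A)

full, lora = lora_param_counts(d=4096, r=8)
print(full, lora, full // lora)  # 16777216 65536 256
```

For a 4096 x 4096 matrix at rank 8, the adapter trains 256 times fewer parameters than full fine-tuning of that matrix. After training, \(BA\) can be folded back into \(W\), so the adapter adds no inference latency.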
LoRA has spawned a rich family of variants, each addressing a specific limitation:
- QLoRA (Dettmers et al. 2023): Quantizes the base model to 4-bit and trains LoRA adapters in 16-bit, enabling fine-tuning of 65B models on a single 48GB GPU.
- DoRA: Decomposes weight updates into magnitude and direction, improving performance.
- LoRA+, VeRA, LoHa: Various tweaks to the training recipe or the low-rank structure.
- Prefix/Prompt Tuning: A related parameter-efficient alternative rather than a LoRA variant. Instead of modifying weights, it prepends learnable embedding vectors to the input.
For most practitioners, QLoRA is the sweet spot: it enables fine-tuning large models on consumer hardware with minimal quality loss.
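A back-of-the-envelope calculation shows why 4-bit quantization makes the difference. The sketch below counts only the base model's weights; it deliberately ignores quantization constants, the LoRA adapters, activations, and optimizer state, so real memory use is higher. The function name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n = 65e9  # a 65B-parameter model
print(weight_memory_gb(n, 16))  # 130.0 GB in 16-bit
print(weight_memory_gb(n, 4))   # 32.5 GB in 4-bit
```

At 16 bits the weights alone exceed any single 48GB GPU, while at 4 bits they leave headroom for the adapters and activations.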
6.8 Model Evaluation
After training, how do you know if your model is any good? Evaluation is surprisingly tricky for generative models, because there is no single “correct” output for most tasks.
6.8.1 Automated Benchmarks
The LM Evaluation Harness (Gao et al. 2024) by EleutherAI is the standard tool for benchmarking language models, with support for hundreds of tasks. Commonly reported benchmarks, most of which can be run through the harness, include:
- MMLU (Hendrycks et al. 2021): 57 academic subjects, testing broad knowledge.
- HumanEval / MBPP: Code generation benchmarks.
- GSM8K: Grade-school math word problems.
- TruthfulQA: Tests whether the model avoids common misconceptions.
- ARC-AGI (Chollet 2024): Novel visual reasoning puzzles.
6.8.2 Human Evaluation
For conversational models, automated benchmarks only tell part of the story. The Chatbot Arena (https://lmarena.ai/) ranks models through blind human preference votes in real conversations. This Elo-style ranking is widely regarded as one of the most reliable measures of overall model quality.
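The way pairwise votes turn into a ranking can be illustrated with the classic Elo update. This is a sketch of textbook Elo, assuming a K-factor of 32 and the conventional 400-point scale; it is not a description of the exact statistical method the Arena leaderboard uses. The function names are hypothetical.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one vote (score_a: 1 win, 0.5 tie, 0 loss)."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Two equally rated models; a human voter prefers model A.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # (1016.0, 984.0)
```

An upset win against a much higher-rated model moves the ratings more than an expected win, so rankings converge as votes accumulate.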
In practice, experienced practitioners combine benchmark scores with what the community affectionately calls “vibes”: spending time chatting with the model, testing edge cases, and getting an intuitive feel for its strengths and weaknesses. Benchmarks tell you how the model performs on known tasks; vibes tell you how it feels to actually use it. Both matter.
6.9 Exercises
- Watch Karpathy's “Let's build GPT from scratch” video and follow along, training a character-level model on Shakespeare. Modify the model size and context length and observe how it affects generation quality.
- Train a 125M parameter GPT-2-style model on a subset of FineWeb-Edu using PyTorch. Track loss curves and generate sample text at each checkpoint.
- Fine-tune LLaMA 3.1 8B on a custom instruction dataset using QLoRA (via the peft and trl libraries). Compare outputs before and after fine-tuning.
- Tokenize the same paragraph using three different tokenizers (tiktoken for GPT-4, the LLaMA tokenizer, and a character-level tokenizer). Compare the number of tokens produced and discuss the tradeoffs.
- Run the LM Evaluation Harness on a model before and after your fine-tuning. Which benchmarks improved? Which degraded?