24 Reinforcement Learning
In March 2016, a program called AlphaGo defeated Lee Sedol, one of the greatest Go players in history, in a five-game match (Silver et al. 2016). Go has more possible board positions than atoms in the observable universe, and experts had predicted it would take at least another decade before AI could beat a top professional. AlphaGo did not learn Go by studying human games alone; it learned by playing millions of games against itself and discovering strategies that no human had ever conceived.
This is reinforcement learning: learning by doing, by trying, by failing, and by occasionally stumbling onto something brilliant. It is the mathematical framework behind many of AI's most spectacular achievements, and it has become central to how we train and align modern LLMs.
If you are primarily interested in language models and wonder why an RL chapter is in this book, here is the answer: RLHF, DPO, GRPO, and the reasoning capabilities of models like o1 and DeepSeek-R1 all rely on reinforcement learning. Understanding RL is no longer optional for anyone working with modern LLMs. The post-training pipeline that transforms a base model into a helpful assistant is fundamentally an RL problem.
24.1 The Reinforcement Learning Framework
RL formalizes the problem of an agent interacting with an environment. At each time step, the agent observes a state, takes an action, receives a reward, and transitions to a new state. The goal: find a strategy (policy) that maximizes the total reward over time.
Formally, the standard RL setup is a Markov Decision Process (MDP):
- \(\mathcal{S}\): the set of possible states
- \(\mathcal{A}\): the set of possible actions
- \(P(s'|s,a)\): transition dynamics (probability of reaching state \(s'\) given state \(s\) and action \(a\))
- \(R(s,a)\): reward function (immediate reward for taking action \(a\) in state \(s\))
- \(\gamma \in [0,1)\): discount factor (how much the agent values future vs. immediate rewards)
The agent's goal is to learn a policy \(\pi(a|s)\) (a mapping from states to action probabilities) that maximizes the expected cumulative discounted reward: \[J(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]\]
RL faces a fundamental tension: should the agent exploit the best strategy it has found so far, or should it explore new strategies that might be even better? A restaurant analogy: do you go to your favorite restaurant (exploit) or try a new one that might be amazing or terrible (explore)? Too much exploitation leads to suboptimal strategies. Too much exploration wastes time on bad options. Balancing the two is one of the deepest problems in RL.
24.2 Value Functions
State-value function: \(V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid s_0 = s\right]\) estimates the expected total reward from state \(s\) under policy \(\pi\).
Action-value function: \(Q^\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid s_0 = s, a_0 = a\right]\) estimates the expected total reward from state \(s\) after taking action \(a\), then following policy \(\pi\).
The optimal policy \(\pi^*\) satisfies \(V^*(s) = \max_a Q^*(s,a)\) for all states: always take the action with the highest expected future reward.
24.3 Key Algorithms
24.3.1 Q-Learning and DQN
Q-Learning is a model-free algorithm that directly learns the optimal action-value function \(Q^*\) without knowing the environment's dynamics. The update rule: \[Q(s,a) \leftarrow Q(s,a) + \alpha \left[R + \gamma \max_{a'} Q(s', a') - Q(s,a)\right]\]
This converges to \(Q^*\) under mild conditions. The agent can then act greedily: \(\pi^*(s) = \arg\max_a Q^*(s,a)\).
Deep Q-Networks (DQN) (Mnih et al. 2015) replace the Q-table with a neural network, enabling RL to scale to high-dimensional state spaces like video game pixels. DQN achieved human-level performance on dozens of Atari 2600 games using only raw pixels as input.
DQN introduced two techniques that were essential for stability: (1) Experience replay: store transitions \((s, a, r, s')\) in a buffer and sample random mini-batches for training, breaking the correlation between consecutive samples. (2) Target network: use a slowly-updated copy of the Q-network to compute target values, preventing the training target from changing too quickly. Without these tricks, deep Q-learning diverges spectacularly.
24.3.2 Policy Gradient Methods
Instead of learning a value function and deriving a policy, policy gradient methods directly optimize the policy \(\pi_\theta\) by gradient ascent on expected reward. The REINFORCE algorithm estimates the gradient: \[\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right]\] where \(G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k}\) is the return from time \(t\).
The intuition: increase the probability of actions that led to high returns and decrease the probability of actions that led to low returns. While simple, REINFORCE has high variance and is slow to converge.
24.3.3 Proximal Policy Optimization (PPO)
PPO (Schulman et al. 2017) is the workhorse of modern RL. It addresses REINFORCE's instability by constraining how much the policy can change in a single update, using a clipped surrogate objective: \[L^{\text{CLIP}}(\theta) = \mathbb{E}\!\left[\min\!\left(r_t(\theta) \hat{A}_t,\; \text{clip}(r_t(\theta), 1{-}\epsilon, 1{+}\epsilon) \hat{A}_t\right)\right]\] where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) is the probability ratio between the new and old policies, and \(\hat{A}_t\) is the advantage estimate (how much better this action is compared to the average).
The clipping prevents catastrophically large policy updates: if an action's probability changes by more than \(\epsilon\) (typically 0.2), the gradient is zeroed out. This makes training remarkably stable.
PPO is used for RLHF in ChatGPT, Claude, and Gemini. It trained OpenAI Five (Dota 2), robot manipulation policies, and autonomous driving systems. Its popularity comes from its rare combination of simplicity (easy to implement), stability (rarely diverges), and effectiveness (works well across diverse domains). If you learn only one RL algorithm, learn PPO.
24.4 RL for Language Models
The connection between RL and LLMs is profound. Generating text is a sequential decision problem: at each step, the model (agent) observes the context (state), produces a token (action), and (eventually) receives a reward signal from human preferences.
RLHF (Ouyang et al. 2022): The original approach. Train a reward model on human preference data, then use PPO to optimize the LLM against this reward model. This is how InstructGPT/ChatGPT was trained. The KL penalty between the RL-trained policy and the base model prevents the model from “reward hacking” (finding degenerate outputs that score high with the reward model but are low quality).
DPO (Rafailov et al. 2023): Direct Preference Optimization eliminates the reward model entirely. It optimizes the policy directly from preference pairs using a clever reparameterization (see Chapter 3 for details). DPO is simpler to implement and more stable than PPO-based RLHF.
GRPO (Guo et al. 2025): Group Relative Policy Optimization, used by DeepSeek-R1 to train reasoning capabilities. For each prompt, generate \(k\) responses, compute rewards for each (e.g., “is the math answer correct?”), and use the group's mean and variance as a baseline. Responses better than average are reinforced; responses worse than average are penalized. No separate reward model or critic needed.
Test-time RL: Models like o1 (OpenAI 2024) use RL not just during training but conceptually during inference: they explore multiple reasoning paths (internal “chain of thought”) and select the best one. The RL training teaches the model when to think more and when to stop.
24.5 Multi-Agent Reinforcement Learning
What happens when multiple RL agents interact? Multi-agent RL (MARL) studies settings where agents must cooperate, compete, or do both simultaneously. This is far more complex than single-agent RL because each agent's optimal strategy depends on what the other agents are doing---and they are all learning simultaneously.
Cooperative MARL: Agents share a common objective. Examples include robot teams coordinating to move a heavy object, or multiple AI assistants collaborating to answer a complex query. The key challenge is credit assignment: when the team succeeds, which agent's actions deserve credit?
Competitive MARL: Agents have opposing objectives. This is the setting of games like Go, chess, and poker. Self-play (an agent playing against copies of itself) is the dominant training paradigm, as demonstrated by AlphaZero (Silver et al. 2017).
Mixed-motive settings: The most realistic and most difficult. Agents have partially aligned, partially conflicting interests. Think of autonomous vehicles at an intersection: everyone wants to get through quickly (individual goals), but a crash hurts everyone (shared risk). Game theory provides the mathematical framework for analyzing these situations.
Self-play is one of the most powerful ideas in RL: the agent generates its own curriculum by playing against itself. As it improves, its opponent improves, creating an ever-escalating challenge. AlphaZero's chess play exceeded the accumulated chess knowledge of centuries of human play after just four hours of training. Self-play works because it provides an infinite source of appropriately challenging training data.
24.6 Safe Reinforcement Learning
RL agents optimize for reward, and they are very good at it---sometimes too good. Reward hacking occurs when the agent finds a way to maximize reward that violates the spirit of the task. A cleaning robot might hide trash under the rug (it “cleaned” the floor). A game-playing agent might exploit bugs in the game engine. An LLM might produce sycophantic responses that get high human ratings but are actually unhelpful.
Safe RL addresses this with several strategies:
Constrained optimization: Add constraints to the RL objective (e.g., “maximize helpfulness subject to the constraint that toxicity stays below 0.01”). This is formalized as a Constrained MDP, where the agent maximizes one reward while keeping other costs below specified thresholds.
Conservative exploration: Limit how far the agent can deviate from a known-safe policy during exploration. This prevents the agent from trying dangerous actions even to learn from them.
Reward modeling with uncertainty: Instead of a single reward model, maintain an ensemble of reward models. When the models disagree (high uncertainty), the agent should be cautious and ask for human guidance rather than acting on uncertain rewards.
Safe RL is directly connected to the AI alignment problem (Chapter 12). RLHF, DPO, and GRPO are all attempts to align LLMs with human values through reward-based training. The same challenges---reward hacking, distributional shift, specification gaming---appear in both the RL and alignment literatures. Understanding safe RL gives you the foundation for understanding AI safety at the frontier.
24.7 Model-Based RL and Planning
Model-free RL (Q-learning, PPO) learns entirely from experience. Model-based RL first learns a world model (a neural network that predicts what will happen next) and then uses this model for planning. This is far more sample-efficient but requires the world model to be accurate.
AlphaZero (Silver et al. 2017) combined a learned model with Monte Carlo Tree Search (MCTS) to master chess, shogi, and Go from self-play alone, without any human game data. DreamerV3 (Hafner et al. 2023) demonstrated that a single world-model-based algorithm can master tasks ranging from Atari games to robot locomotion. For more on world models, see Chapter 19.
RL pioneer Richard Sutton wrote an influential essay (Sutton 2019) arguing that the biggest lesson in 70 years of AI research is that general methods that leverage computation (like search and learning) ultimately beat methods that leverage human domain knowledge. He argues against hand-engineering solutions and in favor of scalable approaches that improve with more compute and data. This “bitter lesson” directly motivates the scaling approach to AI: instead of building in human knowledge, build systems that can discover knowledge through experience.
24.8 Exercises
- Implement Q-Learning from scratch on a simple grid-world (e.g., FrozenLake from Gymnasium). Visualize the learned Q-values as arrows showing the optimal action in each cell. How many episodes does it take to converge?
- Train a DQN agent on CartPole-v1 using PyTorch. Implement both experience replay and a target network. Plot the learning curve (episode reward vs. training steps). Then remove experience replay and observe what happens to training stability.
- Implement PPO and apply it to LunarLander-v2 from Gymnasium. Compare the learning speed with a vanilla REINFORCE baseline. How much faster does PPO converge?
- Apply DPO to fine-tune a small language model (e.g., GPT-2) on a preference dataset (e.g., a subset of Anthropic's HH-RLHF). Compare the outputs of the DPO-trained model with the SFT-only model on a set of test prompts. Is the DPO model more helpful? More cautious?
- Read Sutton's “The Bitter Lesson” essay (available at
incompleteideas.net). Write a one-page response: do you agree or disagree? What are the limits of the “scale and compute” approach?