5  History of Deep Learning

Tip: A Note on “History”

When we say history here, we're not going all the way back to the Dartmouth workshop in 1956 or Ada Lovelace's notes on the Analytical Engine (though both are fascinating rabbit holes). This chapter picks up where deep learning captured the public imagination and everything started moving absurdly fast. Buckle up.

5.1 Concepts

This chapter traces the arc of deep learning from its theoretical roots to the era of frontier AI systems. The key ideas we will cover are:

  • The Scaling Hypothesis - the observation that simply making models larger and training them on more data consistently yields better performance, often rendering clever architectural tricks unnecessary.
  • The rise of “generic” models - models that can perform tasks they were never explicitly trained on. This encompasses zero-shot (no examples), one-shot, and \(n\)-shot learning, where a model generalizes from minimal demonstrations provided in-context.
  • Foundational Models - massive pre-trained models such as GPT-3 (Brown et al. 2020) and GPT-4 (OpenAI 2023), together with the Chinchilla scaling laws (Hoffmann et al. 2022), which formalized how to balance model size and dataset size for a given compute budget.
  • Mixture of Experts (MoE) - architectures that route different inputs through different subnetworks, enabling models with trillions of total parameters while keeping inference costs manageable (Shazeer et al. 2017).
  • Multimodality - “fake” multimodality (pipelining separate text, vision, and audio models together) vs. “true” multimodality (a single model with a shared embedding space across modalities).
  • Prompting - the art of eliciting desired behavior from a pre-trained model through carefully crafted inputs, including chain-of-thought prompting (Wei et al. 2022).
  • Inference-time scaling - the paradigm shift from making models bigger at training time to making them “think harder” at inference time, via chain-of-thought reasoning combined with reinforcement learning. This gave rise to OpenAI's o1 (OpenAI 2024), o3 (OpenAI 2025), and DeepSeek-R1 (Guo et al. 2025).
  • Computer-using agents - systems like Claude Computer Use and UFO (Zhang et al. 2024) that can operate desktop applications and web browsers autonomously.
  • Ultra-generic benchmarks - MMLU (Hendrycks et al. 2021), ARC-AGI (Chollet 2024), SWE-bench (Jimenez et al. 2024), and others that attempt to measure broad intelligence rather than narrow task performance.
  • Automated AI coders - the emergence of fully autonomous coding agents such as Devin (Cognition), Magic AI, and others that aim to replace or augment human software engineers.
  • New compute architectures - from Google's Titans architecture to Yann LeCun's JEPA framework (LeCun 2022), the search for post-transformer paradigms.

Important: The Big Picture

This chapter covers the most transformative decade in AI history. The key insight threading through everything: the combination of simple, scalable methods with massive compute consistently beats clever, hand-engineered approaches. This principle, called the scaling hypothesis, motivated GPT-3, drove the creation of ChatGPT, and continues to shape the field today. Understanding this history is not just academic; it explains why the field looks the way it does right now.

5.2 Richard Sutton's Scaling Hypothesis

In 2019, Richard Sutton published a short essay called “The Bitter Lesson” (http://www.incompleteideas.net/IncIdeas/BitterLesson.html) that would become one of the most influential pieces of writing in modern AI. His argument is simple and, as he acknowledges, bitter for researchers to accept:

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.

The key insights:

  • Search and learning scale arbitrarily. Methods that can exploit additional compute - bigger models, more data, longer training - consistently win over methods that rely on hand-engineered features or clever domain-specific architectures.
  • Fancier architectures become meaningless over time. When compute was scarce, researchers invested enormous effort in encoding human knowledge into AI systems (expert systems, hand-crafted features, rule-based NLP). As the amount of available compute increased massively and the cost per unit of compute decreased, these efforts were overtaken by simple, scalable methods.
  • This makes computer scientists upset, but it is the reality. Researchers naturally want their intellectual contributions to matter. It is disheartening to learn that a larger version of a simpler model consistently outperforms a smaller version of a more elegant one.
  • Opposition to this principle has often hindered progress. In chess (Deep Blue's hand-crafted evaluation vs. AlphaZero's learned evaluation (Silver et al. 2017)), in computer vision (SIFT features vs. learned CNNs (Krizhevsky et al. 2012)), and in speech recognition (HMMs with phonetic knowledge vs. end-to-end neural systems), the pattern repeats: scale wins.

Tip: Why It Is Called “Bitter”

Sutton calls it the bitter lesson because it is uncomfortable for researchers. If you have spent years designing clever features, elegant architectures, or domain-specific tricks, learning that “just make it bigger” outperforms your work is genuinely painful. But the historical record is hard to argue with. Hand-crafted chess evaluation of the kind used in Deep Blue was eventually surpassed by AlphaZero's learned evaluation. Hand-designed image features (SIFT, HOG) were obliterated by learned convolutional features. The bitter lesson does not mean clever ideas are worthless; it means they must be scalable to survive.

The scaling hypothesis directly motivated the creation of GPT-3 and subsequent large language models. OpenAI's bet was that scaling a relatively simple transformer architecture (Vaswani et al. 2017) to hundreds of billions of parameters, trained on a large share of the publicly available text on the web, would produce capabilities that no amount of architectural cleverness could match at smaller scale. They were right.

5.3 History: DeepMind vs. OpenAI

The history of modern AI can be understood through the parallel journeys of two labs that took fundamentally different approaches - and eventually converged.

5.3.1 ~2016-2017: DeepMind and the RL Revolution

DeepMind made extraordinary progress with reinforcement learning applied to game-playing AI. AlphaGo (Silver et al. 2016) defeated the world champion at Go in 2016 - a game long considered a grand challenge for AI due to its enormous search space. AlphaZero (Silver et al. 2017) followed in 2017, mastering chess, shogi, and Go from scratch through pure self-play, with no human knowledge beyond the rules. Remarkably, these systems were built on the same principles created by Arthur Samuel half a century earlier (1950s-1960s): learn from experience through trial and error. DeepMind proved that with modern compute and neural networks, these ideas could achieve superhuman performance.

5.3.2 ~2019: OpenAI and the Scaling Bet

While DeepMind focused on RL in constrained environments, OpenAI bet on scaling language models. GPT-2 (Radford et al. 2019) achieved impressive text generation by training a 1.5-billion-parameter transformer on web text. The results were strong enough that OpenAI initially withheld the full model, citing concerns about misuse. Internally, OpenAI decided to scale this approach to an extreme extent.

5.3.3 ~2020-2021: GPT-3 Changes Everything

OpenAI released GPT-3 (Brown et al. 2020) - 175 billion parameters, trained on a massive internet corpus. A paper describing the model was published, but the weights remained proprietary and the model was accessible only through an API. GPT-3 demonstrated remarkable few-shot learning: by providing a few examples in the prompt, it could perform translation, arithmetic, code generation, and creative writing without any fine-tuning. The model was then fine-tuned to respond to instructions (InstructGPT (Ouyang et al. 2022)), making it conversational. By re-feeding the history of (model, human) exchanges to the model, users could have continuous, coherent conversations.
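
Few-shot prompting and conversation re-feeding are both just string assembly on the caller's side. The sketch below is a minimal illustration; the function names and the "Input:/Output:" and "Human:/Model:" layouts are assumptions chosen for readability, not the format any particular model requires.

```python
from typing import List, Tuple


def build_few_shot_prompt(task: str, examples: List[Tuple[str, str]], query: str) -> str:
    """Assemble an in-context prompt: task description, worked examples, new query."""
    lines = [task, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The prompt ends on an open "Output:" so the model continues with the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


def build_chat_prompt(history: List[Tuple[str, str]], new_message: str) -> str:
    """Re-feed every earlier (human, model) exchange so the model sees the whole conversation."""
    lines = []
    for human, model in history:
        lines.append(f"Human: {human}")
        lines.append(f"Model: {model}")
    lines.append(f"Human: {new_message}")
    lines.append("Model:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "peppermint",
)
chat = build_chat_prompt([("Hi", "Hello!")], "How are you?")
print(prompt)
print(chat)
```

Note that no weights change in either case: the "learning" happens entirely through what is placed in the context window.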

5.3.4 ~2022: ChatGPT and the Public Awakening

ChatGPT was released in November 2022, built on GPT-3.5 (an improved version of GPT-3 with RLHF). It became, at the time, the fastest-growing consumer application on record, reaching an estimated 100 million users within two months. This was the moment AI moved from a research curiosity to a mainstream technology. Immediate replication efforts began in the open-source community.

Note: The ChatGPT Effect

ChatGPT did not represent a fundamental research breakthrough. GPT-3.5 was not dramatically different from GPT-3. What changed was the interface: a free, web-based chat UI that anyone could use. This is a profound lesson about technology adoption. The best technology does not always win; the most accessible technology does. RLHF made the model conversational enough for non-experts, and the chat interface made it feel like talking to a person. The combination turned a research artifact into a cultural phenomenon overnight.

5.3.5 ~2022-2023: The Ecosystem Explodes

The period after ChatGPT saw an explosion of both closed and open-source activity:

  • Engineering around LLMs: An entire ecosystem of tools emerged to optimize LLM pipelines and deliver more intelligence without additional model training:

    • Vector Databases - databases optimized for storing and retrieving high-dimensional embeddings.
    • RAG (Retrieval Augmented Generation) (Lewis et al. 2020) - grounding LLM responses in current knowledge from proprietary databases, web search, or document stores.
    • Tools and Function Calling - LLMs learned to call APIs, run code, use calculators, and interact with external software (Schick et al. 2023).
    • The first intelligent agents - systems like AutoGPT (Significant Gravitas 2023) and BabyAGI (Nakajima 2023) that looped LLM calls with planning and tool use.
  • Open-source models: Meta released LLaMA (Touvron, Lavril, et al. 2023) and LLaMA 2 (Touvron, Martin, et al. 2023), Mistral AI released Mistral 7B (Jiang et al. 2023) and Mixtral (Jiang et al. 2024), and Alibaba released the Qwen series - breaking OpenAI's monopoly on capable models.
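
The vector-database and RAG items above can be sketched in a few lines. This is a toy under stated assumptions: the three-dimensional embeddings are hand-written stand-ins for what a real embedding model would produce, and TinyVectorStore and build_rag_prompt are hypothetical names, not part of any library.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Brute-force stand-in for a vector database: stores (text, embedding) pairs."""

    def __init__(self) -> None:
        self.items: List[Tuple[str, List[float]]] = []

    def add(self, text: str, embedding: List[float]) -> None:
        self.items.append((text, embedding))

    def search(self, query_embedding: List[float], k: int = 2) -> List[Tuple[float, str]]:
        """Return the k stored texts most similar to the query, best first."""
        scored = [(cosine(query_embedding, emb), text) for text, emb in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def build_rag_prompt(question: str, passages: List[str]) -> str:
    """Ground the model by pasting retrieved passages into the prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


store = TinyVectorStore()
store.add("The office is closed on Fridays.", [0.9, 0.1, 0.0])
store.add("The cafeteria serves soup daily.", [0.0, 0.2, 0.9])
store.add("Parking permits renew in March.", [0.1, 0.9, 0.1])

# Hand-written query embedding; a real system would embed the question text.
hits = store.search([0.8, 0.2, 0.1], k=2)
rag_prompt = build_rag_prompt("When is the office closed?", [text for _, text in hits])
print(rag_prompt)
```

Production vector databases replace the brute-force scan with approximate nearest-neighbor indexes, but the interface (embed, search, paste into the prompt) is essentially the same.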

5.4 The Present

As of 2024-2025, the field has entered a new phase characterized by several converging trends:

  • Reasoning agents: Models like o1 (OpenAI 2024), o3 (OpenAI 2025), and DeepSeek-R1 (Guo et al. 2025) can “reason” - they generate internal chains of thought via RL training, solving problems that require multi-step, human-like thinking.
  • LLMs/VLMs + RL: The combination of large pre-trained models with reinforcement learning has unlocked capabilities that neither approach achieves alone. The models understand the world (from pre-training) and can pursue goals (from RL).
  • Ultra large-scale multimodality: Models like GPT-4o natively process text, images, audio, and video within a single architecture. The pipelined approach (separate encoder per modality) is giving way to truly unified models.
  • Deployment to real-world robots: Vision-language-action models (Kim et al. 2024; Brohan et al. 2023) are being deployed on physical robots, connecting language understanding with motor control.
  • Extreme model compression: Quantization to 4-bit and below (Frantar et al. 2022; Lin et al. 2023), distillation (Hinton et al. 2015), and pruning are making frontier-class models run on consumer hardware.
  • Inference-time compute scaling: Instead of only making models bigger, researchers are making them think harder at inference time - spending more compute per question to achieve better answers.
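
To make the compression item concrete, here is a minimal sketch of the simplest possible 4-bit scheme: symmetric round-to-nearest quantization with a single scale per group of weights. This is deliberately naive and is not the method of the cited papers, which add calibration and error compensation on top of this basic idea; the function names are illustrative.

```python
from typing import List, Tuple


def quantize_absmax(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Map floats to signed integer codes in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.32, 0.11, 0.0]
codes, scale = quantize_absmax(weights, bits=4)
restored = dequantize(codes, scale)
print(codes, scale)
print(restored)
```

Each weight now needs 4 bits plus a share of the scale, instead of 16 or 32 bits, at the cost of a rounding error of at most half a quantization step per weight.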

5.5 The Rise of Generic Models

Caution: From Specialist to Generalist

For most of machine learning's history, models were specialists: one model for spam detection, another for sentiment analysis, a third for translation. Each required its own training data, its own architecture, and its own evaluation pipeline. The rise of generic models inverted this entirely. A single model, trained once, can perform tasks it was never explicitly designed for. This shift from “one model per task” to “one model, many tasks” is arguably the most important conceptual change in the field's history.

One of the most surprising developments in deep learning has been the emergence of generic models - systems that can perform tasks they were never explicitly trained on. This represents a fundamental shift from the traditional machine learning paradigm, where each task required a dedicated model trained on task-specific data.

Segment Anything (SAM) (Kirillov et al. 2023) by Meta demonstrated zero-shot segmentation of arbitrary objects in images. Given any image and a point, box, or text prompt, SAM can segment the indicated object - even objects it has never seen before. This was achieved by training on over 1 billion mask annotations, creating a “foundation model for segmentation” that generalizes across domains (medical imaging, satellite imagery, microscopy) without domain-specific training. Meta subsequently released Segment Anything 2 for video segmentation and extensions to 3D point clouds.

The pattern generalizes: GPT-3 demonstrated zero-shot language capabilities, CLIP (Radford et al. 2021) enabled zero-shot image classification, and Whisper (Radford et al. 2022) provided zero-shot multilingual speech recognition. In each case, scale and diverse training data replaced task-specific engineering.

5.6 The ChatGPT Revolution

GPT-3 (Brown et al. 2020) was the model that proved the scaling hypothesis right. At 175 billion parameters, it showed that a sufficiently large language model trained on enough data develops emergent capabilities - few-shot learning, reasoning, translation, and code generation - that were not explicitly programmed.
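
A rough back-of-the-envelope calculation shows what this scale means in compute terms. The sketch assumes two widely used approximations that are not stated in this chapter: training compute is about 6 FLOPs per parameter per token, and the Chinchilla result (Hoffmann et al. 2022) is often summarized as roughly 20 training tokens per parameter. The 300 billion token figure is the training set size reported in the GPT-3 paper.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute, assuming ~6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


def chinchilla_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Rule-of-thumb compute-optimal token count (assumed ~20 tokens per parameter)."""
    return tokens_per_param * params


gpt3_params = 175e9   # 175 billion parameters
gpt3_tokens = 300e9   # ~300 billion training tokens (Brown et al. 2020)

flops = training_flops(gpt3_params, gpt3_tokens)
optimal_tokens = chinchilla_tokens(gpt3_params)

print(f"Approximate GPT-3 training compute: {flops:.2e} FLOPs")
print(f"Rule-of-thumb token budget for 175B parameters: {optimal_tokens:.2e} tokens")
```

Under these assumptions, a 175-billion-parameter model would want on the order of trillions of tokens, which is one way to read the Chinchilla finding that earlier large models were trained on too little data for their size.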

The training pipeline for such foundational models involves:

  1. Scrape the web: This is the most time-consuming and expensive step. Massive amounts of text data must be gathered from diverse sources - web pages, books, academic papers, code repositories, forums. The resulting corpus can be hundreds of terabytes.
  2. Clean the data: Raw web data is noisy. Deduplication, filtering low-quality content, removing personally identifiable information, and balancing data sources are critical. The quality of the pre-training data directly determines the quality of the resulting model.
  3. Divide into batches: The cleaned data is sharded into training batches distributed across thousands of GPUs.
  4. Tokenize: Text is converted into sequences of integer tokens using algorithms like Byte Pair Encoding (BPE) (Sennrich et al. 2016). Tokenization can happen on the fly during training or be precomputed once and cached.
  5. Pre-train: The model is trained with a simple next-token prediction objective: given all previous tokens, predict the next one. Despite this simplicity, the model learns grammar, facts, reasoning, and even some common sense.
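
The tokenization step is the easiest part of this pipeline to demonstrate at toy scale. The sketch below learns BPE merges on a five-word corpus by repeatedly fusing the most frequent adjacent symbol pair. It is a simplified illustration: it omits the end-of-word marker and byte-level handling used by production tokenizers, and the function names are assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Word = Tuple[str, ...]


def most_frequent_pair(words: Dict[Word, int]) -> Optional[Tuple[str, str]]:
    """Count adjacent symbol pairs, weighted by word frequency; return the most common."""
    pairs: Counter = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words: Dict[Word, int], pair: Tuple[str, str]) -> Dict[Word, int]:
    """Replace every occurrence of the pair with a single fused symbol."""
    merged: Dict[Word, int] = {}
    for symbols, freq in words.items():
        out: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus: str, num_merges: int):
    """Start from characters and apply num_merges greedy merges."""
    words: Dict[Word, int] = dict(Counter(tuple(w) for w in corpus.split()))
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, vocab_words = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(vocab_words)
```

After two merges the frequent stem "low" has become a single token, while the rarer suffixes remain split into characters. The same greedy procedure, run for tens of thousands of merges on a large corpus, yields the subword vocabularies used by GPT-style models.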

GPT-3.5 improved upon GPT-3 through instruction fine-tuning (InstructGPT (Ouyang et al. 2022)) and RLHF (Ziegler et al. 2020), making it conversational, helpful, and safer. This became the model behind ChatGPT.

5.7 GPT-4

GPT-4 (OpenAI 2023) represented another leap in capability. While OpenAI disclosed very few architectural details, it is widely rumored to be a Mixture of Experts (MoE) model with approximately 1.8 trillion total parameters - specifically, eight expert sub-models of roughly 220 billion parameters each, with a gating network that routes each token to the two most relevant experts. If the rumor is accurate, the total parameter count is enormous, but only a fraction (around 220-440 billion parameters) is active for any given input, keeping inference costs manageable. None of these figures has been confirmed by OpenAI.

The MoE approach was also adopted by Mistral AI in their Mixtral 8x7B model (Jiang et al. 2024). Despite the name, its eight experts share the attention layers and replicate only the feed-forward blocks, so the model has roughly 47 billion total parameters rather than 56 billion, of which about 13 billion are active per token. It achieves performance competitive with much larger dense models at a fraction of the inference cost. Meta kept the LLaMA family dense through LLaMA 3 and adopted MoE only with Llama 4 in 2025. The trade-off is that MoE models are harder to train (load balancing across experts, communication overhead across GPUs) but more efficient at inference time.

GPT-4 was also multimodal: it could accept both text and image inputs, though it initially only produced text outputs. This made it one of the first widely available commercial models to demonstrate strong vision-language understanding within a single system.

Tip: The MoE Trade-Off

Mixture of Experts is a “have your cake and eat it too” architecture, but it comes with real trade-offs. The total model is huge (all experts must be stored in memory), but only a fraction is active per token (fast inference). Training is more complex because you need load-balancing losses to prevent “expert collapse” (all tokens routing to the same expert). And distributed training requires careful placement of experts across GPUs. The payoff: MoE models consistently punch above their active parameter count. Mixtral 8x7B, with about 13B active parameters, matches or outperforms the dense LLaMA 2 70B on the benchmarks reported by its authors.

5.8 “True” Multimodality

Multimodality in AI refers to the ability to process and relate information across different modes - text, images, audio, video, depth maps, and more. However, not all multimodal systems are created equal. There is a crucial distinction between pipelined multimodality and true multimodality.

5.8.1 Shared Latent Spaces

In pipelined multimodal systems, separate encoder models handle each modality independently, and their outputs are combined downstream. For example, an early “multimodal chatbot” might use Whisper to transcribe audio to text, GPT-4 to generate a text response, and a TTS model to convert the response back to speech. Each component operates in its own representation space.

True multimodality, by contrast, embeds multiple modalities into a shared latent space - a single high-dimensional vector space where semantically related concepts from different modalities are nearby. A photo of a dog, the word “dog”, and the sound of barking would all map to similar regions of this shared space.

ImageBind (Girdhar et al. 2023) by Meta exemplifies this approach. It creates a single embedding space that binds together six modalities: images, text, audio, depth maps, thermal (heat map) images, and IMU (inertial measurement unit) data. The key insight is that images naturally co-occur with all other modalities (photos have captions, videos have audio, depth sensors produce aligned depth maps), so image embeddings can serve as the “binding” modality that aligns everything else. This enables zero-shot cross-modal retrieval - for example, retrieving images from audio queries or matching audio clips to depth maps - without ever training on those specific pairings.

5.8.2 CLIP: Contrastive Language-Image Pre-training

CLIP (Radford et al. 2021) was the model that pioneered the shared latent space approach for vision and language. Developed by OpenAI, CLIP was trained on 400 million image-text pairs scraped from the internet using a contrastive learning objective: given a batch of images and captions, the model learns to maximize the similarity between matching (image, text) pairs while minimizing the similarity between non-matching pairs.
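
The contrastive objective can be written down compactly. The sketch below is a pure-Python illustration of a symmetric cross-entropy loss over a batch similarity matrix, assuming row k of the image batch matches row k of the text batch. The fixed temperature of 0.07 is an assumption for illustration (in CLIP the temperature is a learned parameter), and the two-dimensional embeddings are toy values.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits: List[float], target: int) -> float:
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]


def clip_style_loss(image_embs: List[List[float]], text_embs: List[List[float]],
                    temperature: float = 0.07) -> float:
    """Symmetric contrastive loss: matching pairs sit on the diagonal of the logits matrix."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    # Image-to-text: each row should pick out its own caption.
    loss_i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image: each column should pick out its own image.
    loss_t2i = sum(cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


aligned_loss = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, swapped_loss)
```

The loss is near zero when each image is closest to its own caption and large when the pairings are swapped, which is exactly the pressure that pulls matching images and texts together in the shared space.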

The result is a model that can perform zero-shot image classification. To classify an image, one simply computes the CLIP embedding of the image and compares it to the CLIP embeddings of candidate text labels (e.g., “a photo of a cat”, “a photo of a dog”). The label with the highest similarity wins. CLIP matched or exceeded supervised classifiers on many benchmarks - without ever being trained on those specific datasets.
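
The classification rule itself is only a few lines once embeddings are available. In the sketch below the embeddings are hand-written three-dimensional stand-ins, not outputs of the real CLIP encoders, and zero_shot_classify is a hypothetical helper name.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def zero_shot_classify(image_embedding: List[float],
                       label_embeddings: Dict[str, List[float]]) -> Tuple[str, Dict[str, float]]:
    """Pick the text label whose embedding is most similar to the image embedding."""
    scores = {label: cosine_similarity(image_embedding, emb)
              for label, emb in label_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings standing in for text-encoder and image-encoder outputs.
label_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.2],
    "a photo of a dog": [0.1, 0.9, 0.3],
}
image_embedding = [0.2, 0.8, 0.4]

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label, scores)
```

Changing the task means changing only the label strings; no retraining is involved, which is what makes the classifier "zero-shot".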

Tip: Try It Yourself

You can experiment with CLIP in minutes. Install the package (the official repository suggests pip install git+https://github.com/openai/CLIP.git), load the model, and embed any image and any text. Compute the cosine similarity and you have a zero-shot classifier. Try classifying your own photos with creative labels (“a photo taken on a rainy day”, “a photo of something delicious”). The results will give you an intuitive sense of what shared embedding spaces can do.

5.8.3 Vision-Language Models (VLMs)

Building on CLIP's success, Vision-Language Models (VLMs) take the integration further. These are ultra-massive transformer-based models trained jointly on images and text, enabling them to answer questions about images, generate image descriptions, reason about visual content, and follow instructions that involve both modalities.

Notable VLMs include GPT-4V (GPT-4 with vision), Google's Gemini, and open-source models such as LLaVA and InternVL. Unlike CLIP, which produces embeddings for retrieval, VLMs generate free-form text responses conditioned on visual input, making them far more versatile. The training typically involves three stages: (1) pre-training a vision encoder (often a CLIP-style model), (2) pre-training a language model, and (3) aligning the two through joint fine-tuning on image-text instruction data.

5.9 Mixture of Experts (MoE)

The Mixture of Experts (MoE) architecture (Shazeer et al. 2017) draws a powerful analogy from neuroscience: different areas of the human brain specialize in different cognitive functions - the visual cortex processes sight, Broca's area handles speech production, and the hippocampus manages memory. Similarly, an MoE model consists of multiple “expert” sub-networks, each specializing in different types of inputs or tasks, with a gating network (router) that decides which expert(s) to activate for each input token.

The key advantages of MoE are:

  • Scaling without proportional compute cost: A model with 8 experts of 7B parameters each has 56B total parameters, but if only 2 experts are active per token, the inference cost is comparable to a 14B dense model.
  • Specialization: Each expert can develop expertise in different domains (e.g., code, mathematics, natural language, multilingual text), leading to better performance than a dense model of equal active parameter count.
  • Training efficiency: Although the total parameter count is large, the gradient computation per token only involves the active experts, making training faster per step.
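
The routing mechanism behind these advantages can be sketched in a few lines. This is a minimal illustration of top-k gating for a single token: in a real model the router logits come from a learned linear layer applied to the token, whereas here they are passed in directly, and the "experts" are toy functions that simply scale their input.

```python
import math
from typing import Callable, List, Tuple

Expert = Callable[[List[float]], List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(token: List[float], router_logits: List[float],
              experts: List[Expert], top_k: int = 2) -> Tuple[List[float], List[int]]:
    """Run only the top_k highest-scoring experts and mix their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize the gate weights over the selected experts only.
    gates = softmax([router_logits[i] for i in chosen])
    output = [0.0] * len(token)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](token)  # the other experts are never evaluated
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen


# Four toy experts; expert i multiplies its input by (i + 1).
experts: List[Expert] = [
    (lambda t, s=s: [s * x for x in t]) for s in (1.0, 2.0, 3.0, 4.0)
]
router_logits = [0.1, 2.0, 0.3, 2.0]

output, chosen = moe_layer([1.0, -1.0], router_logits, experts, top_k=2)
print(chosen, output)
```

Only two of the four experts run for this token, so compute scales with top_k rather than with the total number of experts, while all four sets of weights still have to be held in memory.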

The challenges include load balancing (ensuring all experts are utilized, not just a few), communication overhead in distributed training (experts may reside on different GPUs), and the increased memory footprint (all expert weights must be stored even though only a subset is active).

GPT-4 is widely rumored to use an MoE architecture with approximately eight experts of ~220 billion parameters each, for a total of ~1.8 trillion parameters. Mixtral 8x7B (Jiang et al. 2024) demonstrated that MoE works at smaller scales too, matching the performance of LLaMA 2 70B while being significantly faster at inference.

5.10 State of the Art in Multimodality

The frontier of multimodal AI has advanced rapidly, moving from pipelined systems toward natively integrated architectures.

5.10.1 Video Understanding

Models with 7 billion or more parameters now exist that can take video as input and produce detailed text descriptions, answer questions about the video content, and even reason about temporal dynamics. Examples include Video-LLaVA, InternVideo, and Qwen-VL, which process video frames through a vision encoder and feed the resulting tokens into a language model alongside text prompts.

5.10.2 Native Audio Modality

GPT-4o introduced audio as a native modality in its training. This enables:

  • Direct audio-in, audio-out - the model processes raw audio waveforms and generates spoken responses natively.
  • Text-in, audio-out and audio-in, text-out - seamless cross-modal interaction.

This represents a fundamental shift from the previous pipelined approach:

  1. Audio \(\rightarrow\) Whisper (ASR) \(\rightarrow\) text transcription
  2. Text \(\rightarrow\) GPT-4 (LLM) \(\rightarrow\) text response
  3. Text \(\rightarrow\) TTS model \(\rightarrow\) synthesized speech

The pipelined approach introduces latency at each stage, loses paralinguistic information (tone, emotion, emphasis), and compounds errors across components. A natively multimodal model avoids the hand-offs between components and, with them, most of these issues.

5.10.3 ImageBind and Unified Embeddings

Meta's ImageBind extends the shared embedding space to six modalities simultaneously. This enables remarkable zero-shot cross-modal capabilities: retrieving images from audio queries, matching audio clips to depth maps, and linking thermal imagery to textual descriptions - all without explicit training on those modality pairings.

5.10.4 Multimodal In, Multimodal Out

The ultimate stage of multimodal AI is a system that can accept any combination of modalities as input and produce any combination as output. Imagine an agent you interact with live that can generate video, images, text, 3D content, and audio on the fly, all within a single coherent interaction.

This is not purely hypothetical. Models already exist that can generate interactive environments: Google's Genie (Bruce et al. 2024) and Genie 2 (Google DeepMind 2024) can generate playable video game worlds from a single image prompt. The convergence of generation, understanding, and interaction across all modalities is the defining research frontier of 2024-2025.

Important: The Multimodality Progression

The history of multimodal AI follows a clear trajectory: (1) separate models for each modality, pipelined together; (2) aligned embedding spaces (CLIP, ImageBind) that let models understand cross-modal relationships; (3) natively multimodal models (GPT-4o) that process multiple modalities within a single architecture; (4) multimodal generation (Sora, Genie 2) that creates content across modalities. Each stage is a genuine leap in capability, not just an incremental improvement. We are currently between stages 3 and 4.

5.11 Ultra-Generic Benchmarks

As AI models become more general-purpose, the community has developed increasingly ambitious benchmarks to measure broad intelligence:

  • MMLU (Massive Multitask Language Understanding) (Hendrycks et al. 2021) - Tests knowledge and reasoning across 57 academic subjects, from elementary mathematics to professional law and medicine. A model's MMLU score approximates the breadth of a generalist's education. Frontier models now exceed 90% accuracy, surpassing the average human performance in many domains.
  • SWE-bench (Jimenez et al. 2024) - A software engineering benchmark where models must resolve real GitHub issues in popular open-source Python repositories. Given an issue description and the repository code, the model must produce a patch that fixes the issue and passes all relevant tests. This benchmark measures practical, real-world coding ability and is one of the hardest to game. At the time of writing, top-performing agents solve around 50% of issues.
  • MMMU (Massive Multi-discipline Multimodal Understanding) - Extends the MMLU concept to multimodal reasoning, requiring models to answer questions that involve understanding images, charts, diagrams, and mathematical notation alongside text. This tests whether models can truly integrate visual and textual information for reasoning, not just process them independently.

5.11.1 Emerging AGI Benchmarks

ARC-AGI (Chollet 2024) (and its successor ARC-AGI-2) is a benchmark created by François Chollet, a prominent skeptic of the “scale is all you need” narrative. ARC consists of visual pattern-recognition puzzles that are trivial for humans but extremely difficult for current AI systems. Each puzzle requires the solver to infer an abstract transformation rule from a few input-output examples and apply it to a new input.

The key insight behind ARC is that it tests fluid intelligence - the ability to reason about novel problems - rather than crystallized intelligence - recall of memorized knowledge. Current LLMs excel at crystallized intelligence (they have “read” the internet) but struggle with fluid intelligence. ARC-AGI is therefore considered one of the most meaningful benchmarks for measuring progress toward genuine artificial general intelligence.

5.12 AI vs. Humans (2023)

As of 2023, there is a fundamental asymmetry between how AI and humans approach problem-solving:

AI (instantaneous): When you ask a question to an AI system, it immediately produces an answer. At most, it performs a vector-database lookup, runs some RAG (Retrieval Augmented Generation) over web data or a proprietary knowledge base, and then responds. The entire process takes seconds. There is no deliberation, no reflection, no revision.

Humans (deliberate): A human given a complex task will research the problem, ponder it for days or weeks, draft a solution, gather feedback, iterate, and eventually reach a conclusion. This deliberative process - with its pauses, course corrections, and accumulated understanding - is precisely what produces high-quality, creative, and robust work.

This deliberative capacity was largely absent from generative AI through 2023. The instantaneous nature of response, and the lack of any mechanism for genuine reflection, limited what LLMs could achieve on hard problems. Early attempts to bridge this gap - such as BabyAGI (Nakajima 2023) and AutoGPT (Significant Gravitas 2023), which re-fed prompts to the model in a loop - showed that naive self-prompting is insufficient. Without a true objective function and the ability to evaluate progress, these systems often spiraled into incoherent loops.

Note: System 1 vs. System 2 Thinking

Daniel Kahneman's distinction (Kahneman 2011) between System 1 (fast, automatic, intuitive) and System 2 (slow, deliberate, analytical) thinking is a useful lens for understanding AI progress. Through 2023, LLMs were pure System 1 thinkers: they responded instantly based on learned patterns. The introduction of chain-of-thought reasoning and inference-time compute (o1, DeepSeek-R1) gave them something resembling System 2: the ability to slow down, think step by step, and check their work. This distinction will keep appearing throughout the book.

The analogy that captures this limitation: generative AI is like the proverbial roomful of monkeys with typewriters hoping to produce Shakespeare. They produce text at enormous speed, but they have no conception of whether they are progressing toward a meaningful goal. Exploration of non-generative, objective-driven models - systems that can set goals, evaluate their own progress, and adjust their strategy - emerged as the critical research direction. This is precisely what reinforcement learning would provide, as discussed in the next section.

5.13 Adding Reinforcement Learning to the Loop

The integration of reinforcement learning with large language models has evolved through three distinct phases, each more powerful than the last.

5.13.1 Phase 1: RLHF

The first application of RL to LLMs was Reinforcement Learning from Human Feedback (RLHF) (Ziegler et al. 2020; Ouyang et al. 2022). The goal was straightforward: make models more helpful, harmless, and honest. Human annotators ranked model outputs, a reward model was trained on these rankings, and the LLM was fine-tuned with Proximal Policy Optimization (PPO) to maximize the reward model's score. This is what transformed GPT-3 into ChatGPT.
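
The reward-model step can be made concrete with the pairwise ranking loss commonly used for this purpose: for each human comparison, the reward model is pushed to score the preferred output above the rejected one. The sketch below shows only the loss on scalar rewards; the reward model itself and the PPO fine-tuning stage are omitted, and the function name is illustrative.

```python
import math


def pairwise_ranking_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    # softplus(-diff) = max(-diff, 0) + log(1 + exp(-|diff|))
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Reward model agrees with the human ranking: small loss.
agree = pairwise_ranking_loss(reward_chosen=2.0, reward_rejected=0.0)
# Reward model is indifferent: loss is log(2).
tie = pairwise_ranking_loss(reward_chosen=1.0, reward_rejected=1.0)
# Reward model prefers the rejected output: large loss.
disagree = pairwise_ranking_loss(reward_chosen=0.0, reward_rejected=2.0)

print(agree, tie, disagree)
```

Minimizing this loss over many human comparisons yields a scalar scoring function that can stand in for the annotators during RL fine-tuning.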

5.13.2 Phase 2: Goal and Temporal Understanding

The next evolution combined LLMs and VLMs with RL not just for alignment, but for capability. RL adds two critical dimensions that pure language modeling lacks:

  • Goal understanding: RL provides an objective function - a reward signal - that the model can optimize toward. This enables the model to pursue multi-step goals rather than simply generating the most likely next token.
  • Temporal understanding: RL operates over sequential decision-making, giving the model a sense of time, state progression, and the consequences of actions.

5.13.3 Phase 3: Reasoning Models

The most recent and most powerful phase combines chain-of-thought prompting (Wei et al. 2022) with reinforcement learning, producing models that can reason through complex problems step by step:

  • “Direct response” - the model answers immediately, analogous to subconscious or System 1 thinking. This is fast but error-prone on hard problems.
  • “Thoughtful response” - the model generates an internal chain of reasoning before answering, analogous to deliberate or System 2 thinking. The model effectively self-prompts, breaking the problem into sub-steps and working through each one.

This breakthrough is often attributed in part to the high-quality interaction data collected from conversations with earlier GPT models, but the exact methodology remains proprietary: OpenAI has not published how it trains its reasoning models. The result was OpenAI's o1 (OpenAI 2024) and o3 (OpenAI 2025), which achieved dramatic improvements on mathematics, coding, and scientific reasoning benchmarks.

DeepSeek-R1 (Guo et al. 2025) later replicated much of this capability in an open-source setting, demonstrating that reasoning through RL is not solely an OpenAI phenomenon.

Caution: The DeepSeek Surprise

DeepSeek-R1's release in January 2025 shocked the industry. A Chinese lab with a fraction of the compute budget of OpenAI or Google produced a reasoning model competitive with o1, and released it with open weights. The key innovation was not scale but efficiency: clever training recipes, MoE architectures, and engineering optimizations. DeepSeek demonstrated that the frontier is not solely determined by how much money you spend. This is excellent news for anyone who does not work at a trillion-dollar company.

A side note on openness: Since 2022, the state of the art has shifted from open-source to closed-source. The most capable models are held proprietary by OpenAI, Google DeepMind, and Anthropic. Open-source models (LLaMA, Mistral, Qwen, DeepSeek) have narrowed the gap significantly but typically lag the frontier by months.

Large Models + Reinforcement Learning

The combination of large pre-trained models with reinforcement learning is arguably the most important development in AI since the original transformer paper. The insight is that each approach contributes something the other lacks:

  • Ultra-massive pre-trained models excel at understanding the world. Through pre-training on internet-scale data, they develop rich internal representations of language, vision, common sense, and factual knowledge. However, they lack the ability to pursue goals, plan over time, or learn from trial and error.
  • Reinforcement learning excels at task execution. RL agents can optimize toward specific objectives, plan over long horizons, and improve through experience. However, traditional RL agents start from scratch - they have no prior knowledge of the world.

By combining both, we get models that understand the world (from pre-training) and can act within it toward goals (from RL). This is exactly the paradigm behind:

  • OpenAI o1 (OpenAI 2024) and o3 (OpenAI 2025) - LLMs trained with RL to reason through multi-step problems, achieving state-of-the-art results on mathematics, coding, and science.
  • Google Gemini Flash Thinking - Google's reasoning models that apply inference-time compute scaling through extended chain-of-thought reasoning.
  • DeepSeek-R1 (Guo et al. 2025) - An open-source replication of the reasoning model paradigm, demonstrating that the approach generalizes beyond proprietary systems.
  • Upcoming replications from Anthropic (Claude with extended thinking) and Meta, which are expected to further validate this paradigm.

The convergence is clear: the future of AI is not just bigger models or better RL algorithms, but the synthesis of world knowledge with goal-directed behavior.

VLMs + RL + What is Next?

The natural next step beyond reasoning LLMs is world model generation - AI systems that can not only understand and reason about the world, but actively generate and simulate it.

5.13.4 World Models and Action Generation

A world model is an internal representation that allows an agent to predict the consequences of actions before taking them. Humans do this constantly: before crossing a street, you simulate the trajectories of oncoming cars in your mind. Building world models into AI systems is a long-standing goal of the field.
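The street-crossing example can be turned into a toy sketch of planning with a world model: the agent rolls a predictive model forward in imagination and only acts if the simulated outcome is safe. Everything here (the `State` fields, the constants, the hand-written `predict_next` dynamics) is a hypothetical stand-in for what would be a learned model in a real system.

```python
from dataclasses import dataclass

ROAD_WIDTH_M = 6.0      # assumed width of the road
WALK_SPEED_MPS = 1.5    # assumed walking speed


@dataclass(frozen=True)
class State:
    car_distance_m: float         # distance of the oncoming car from the crossing
    car_speed_mps: float          # speed of the oncoming car
    pedestrian_progress_m: float  # how far across the road the pedestrian is


def predict_next(state: State, action: str, dt: float = 1.0) -> State:
    """Hand-written world model: predict the state one time step ahead."""
    step = WALK_SPEED_MPS * dt if action == "cross" else 0.0
    return State(
        car_distance_m=state.car_distance_m - state.car_speed_mps * dt,
        car_speed_mps=state.car_speed_mps,
        pedestrian_progress_m=state.pedestrian_progress_m + step,
    )


def is_safe_to_cross(state: State, horizon: int = 10) -> bool:
    """Simulate crossing in imagination and check whether the car arrives first."""
    for _ in range(horizon):
        state = predict_next(state, "cross")
        if state.car_distance_m <= 0.0:
            return False  # the car reaches the crossing before we are across
        if state.pedestrian_progress_m >= ROAD_WIDTH_M:
            return True   # we reach the far side first
    return False


def choose_action(state: State) -> str:
    """Act only after the simulated outcome has been checked."""
    return "cross" if is_safe_to_cross(state) else "wait"


print(choose_action(State(car_distance_m=100.0, car_speed_mps=10.0, pedestrian_progress_m=0.0)))
print(choose_action(State(car_distance_m=30.0, car_speed_mps=10.0, pedestrian_progress_m=0.0)))
```

The key design point is that no real step is taken until the imagined rollout has been evaluated; replacing `predict_next` with a learned model is what turns this toy into the research problem described above.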

The combination of Vision-Language Models (VLMs) with RL brings us closer to this goal. VLMs provide rich visual and linguistic understanding; RL provides the mechanism to act, explore, and improve. Together, they enable systems that can perceive the world, reason about it, plan actions, and learn from outcomes.

5.13.5 Google Genie and Genie 2

Google DeepMind's Genie (https://sites.google.com/view/genie-2024/) demonstrated that generative models can create interactive, playable 2D video game environments from a single image. The model learns the dynamics of game worlds from unlabeled internet video and can generate consistent, interactive environments that a user or an RL agent can explore.

Genie 2 (https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/) extended this to ultra-high-fidelity 3D environments. It functions as a large-scale foundation world model - given a single image prompt, it generates a consistent, explorable 3D world with realistic physics, lighting, and object interactions. This represents a convergence of generative modeling, world simulation, and interactive AI that points toward the future of embodied intelligence.

5.14 AI for Software Engineering

AI is transforming software engineering at three levels of increasing autonomy:

5.14.1 Requirement-Based Software Development

At the simplest level, AI acts as a coding assistant. Given explicit requirements - a function specification, a bug report, a user story - the AI generates code to satisfy them. This is the paradigm of tools like GitHub Copilot, Cursor, and similar code-completion systems. The human defines what needs to be built; the AI helps with how.

5.14.2 Solution-Oriented Software Development

At a higher level, AI moves from implementing requirements to creating them. Given a problem description (“users are churning after the onboarding flow”), the AI analyzes data, proposes solutions, generates the requirements, and then implements them. This is prescriptive AI - it does not just follow instructions but actively recommends what should be done. Systems like Devin (Cognition) and SWE-agent, which is evaluated on the SWE-bench benchmark (Jimenez et al. 2024), aim for this level of autonomy.

5.14.3 Automatically Identifying Problems

The most ambitious level is fully autonomous problem detection and resolution:

  • Log analysis and event correlation: AI continuously monitors application logs, metrics, and alerts, correlating events across distributed systems to identify emerging issues before they become incidents.
  • Root-cause analysis: Given a detected anomaly, the AI traces causality through the system to identify the root cause.
  • Autonomous remediation: The AI generates and deploys fixes, tests them, and monitors the results - closing the loop entirely.

These tools leverage the latest advances in deep learning - not just generative AI, but also anomaly detection, time-series forecasting, and graph neural networks - to create systems that function as “AI co-workers” rather than simple “coding assistants.”

5.15 New Compute Architectures

While the transformer (Vaswani et al. 2017) has dominated AI for years, researchers are actively exploring architectures that address its fundamental limitations - particularly the quadratic cost of self-attention with respect to sequence length.
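A little arithmetic shows why the quadratic cost matters. Standard self-attention compares every token with every other token, so one attention head computes one score per token pair. The sketch below assumes 2 bytes per score (16-bit floats) and that the full score matrix is stored, which optimized implementations may avoid; the sequence lengths are arbitrary examples.

```python
def attention_score_count(seq_len: int) -> int:
    """Number of query-key scores in one attention head: one per token pair."""
    return seq_len * seq_len


def score_matrix_bytes(seq_len: int, bytes_per_score: int = 2) -> int:
    """Memory needed to store the full score matrix for one head."""
    return attention_score_count(seq_len) * bytes_per_score


for seq_len in (1_000, 10_000, 100_000):
    gib = score_matrix_bytes(seq_len) / 2**30
    print(f"{seq_len:>7} tokens -> {attention_score_count(seq_len):>14,} scores, "
          f"{gib:8.3f} GiB per head")
```

Multiplying the sequence length by 10 multiplies the number of scores by 100, which is why long-context modeling motivates the alternative architectures discussed in this section.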

5.15.1 Google Titans

Google's Titans architecture extends the transformer with a learned memory module inspired by how the human brain consolidates short-term experiences into long-term memory. Titans augments the standard attention mechanism with a neural long-term memory that can store and retrieve information across extremely long contexts, potentially enabling models to maintain coherent reasoning over entire books or codebases without the memory and compute costs of attending to every token.

5.15.2 JEPA: Joint Embedding Predictive Architecture

Yann LeCun (Meta, Chief AI Scientist) has proposed the Joint Embedding Predictive Architecture (JEPA) (LeCun 2022) as an alternative path toward human-level AI. Rather than generating predictions in pixel or token space (as autoregressive models do), JEPA predicts in embedding space - it learns to predict abstract representations of future states rather than raw sensory data.

The motivation is that the real world is too complex and stochastic to predict at the pixel level. Humans do not predict every pixel of what they will see next; they predict abstract outcomes (“if I push this cup, it will fall”). JEPA aims to build world models that capture this abstract predictive ability, and LeCun argues this is a necessary step toward machines that truly understand the world, rather than merely generating plausible text or images.
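The difference between predicting in pixel space and predicting in embedding space can be illustrated with a toy example. The `encode` and `predict_embedding` functions below are deliberately trivial, hypothetical stand-ins; an actual JEPA uses learned neural encoders and predictors, plus mechanisms to keep the representations from collapsing. The point is only that two futures which differ in low-level detail can share the same abstract representation.

```python
def encode(observation):
    """Hypothetical encoder: summarize raw values as (mean, range)."""
    mean = sum(observation) / len(observation)
    spread = max(observation) - min(observation)
    return [mean, spread]


def predict_embedding(context_embedding):
    """Hypothetical predictor: assume the abstract state stays the same."""
    return list(context_embedding)


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def embedding_space_loss(context, target):
    """JEPA-style objective: compare predicted and actual target embeddings."""
    return squared_error(predict_embedding(encode(context)), encode(target))


def pixel_space_loss(predicted_pixels, target):
    """Generative-style objective: compare raw values one by one."""
    return squared_error(predicted_pixels, target)


context = [0.0, 1.0, 0.0, 1.0]
future_a = [0.0, 1.0, 0.0, 1.0]  # one plausible future
future_b = [1.0, 0.0, 1.0, 0.0]  # same pattern, shifted low-level detail

print("embedding loss, future A:", embedding_space_loss(context, future_a))
print("embedding loss, future B:", embedding_space_loss(context, future_b))
print("pixel loss if A was predicted but B occurred:", pixel_space_loss(future_a, future_b))
```

A pixel-space predictor is penalized heavily for guessing the "wrong" but equally plausible future, while the embedding-space objective treats both futures as correct because they agree on the abstract outcome.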

JEPA represents a fundamentally different philosophy from the scaling-focused approach of OpenAI and Google: one that prioritizes architectural innovation over brute-force scaling. Whether JEPA or a variant will ultimately succeed remains an open question, but it highlights that the transformer may not be the final architecture for AI.

Tip: The Architecture Debate

The AI community is split on whether the transformer is the “final” architecture or a stepping stone. OpenAI and Anthropic bet on scaling transformers. LeCun bets on JEPA and world models. Google explores Titans and state-space models. The honest answer: nobody knows. But this is exactly why understanding the principles (scaling, attention, representation learning) matters more than memorizing any specific architecture. The principles transfer; the architectures may not.

5.16 Notable Models and Products

To ground the theoretical developments discussed in this chapter, here are some of the most notable AI models and products as of 2024-2025:

  • Claude Computer Use (Anthropic): An AI agent that can operate a computer like a human - clicking buttons, typing text, navigating applications, and browsing the web. This represents the frontier of computer-using agents, moving AI from conversation to action.
  • Google Genie 2: A foundational world model that generates interactive, high-fidelity 3D environments from a single image. It demonstrates that generative models can create not just static content but dynamic, explorable worlds with consistent physics.
  • Figure AI: A robotics company building general-purpose humanoid robots powered by VLMs. Their robots can understand spoken instructions, perceive their environment through vision, and manipulate objects - combining language understanding with physical dexterity.
  • OpenAI o3: OpenAI's latest reasoning model at the time of writing, reporting state-of-the-art performance on the ARC-AGI benchmark (Chollet 2024) and suggesting that inference-time compute scaling (thinking longer) can partly substitute for model scaling (training bigger).
  • DeepSeek-R1: An open-source reasoning model from a Chinese AI lab that matched much of o1's performance, released with full weights and training details, democratizing access to reasoning model capabilities.
  • UFO (Microsoft): A UI-focused agent framework (Zhang et al. 2024) that can autonomously operate Windows applications by understanding screenshots and generating mouse/keyboard actions, enabling AI to use any software designed for humans.

5.17 Suggested Papers and References

The following papers and projects provide deeper context for the topics discussed in this chapter:

  1. Yann LeCun's Lecture on Objective-Driven AI (LeCun 2022) - A foundational talk laying out the case for JEPA and world models as alternatives to autoregressive generation.

    https://www.youtube.com/watch?v=MiqLoAZFRSE

  2. AI as Automated Project Manager - A practical exploration of using LLM agents for project management, planning, and task decomposition.

    https://www.reddit.com/r/ArtificialIntelligence/comments/1485z75/

  3. From Task-Driven AI Copilots to Goal-Driven AI Pair Programmers - An academic paper proposing the shift from reactive coding assistants to proactive, goal-oriented programming agents.

    https://arxiv.org/pdf/2404.10225

  4. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face - Demonstrates using an LLM as a controller that orchestrates specialized models from Hugging Face to solve complex, multimodal AI tasks.

    https://arxiv.org/abs/2303.17580

  5. OpenVLA: An Open-Source Vision-Language-Action Model (Kim et al. 2024) - A 7B-parameter open-source VLA model for robot manipulation, demonstrating how language and vision understanding can be connected to physical actions.

    https://arxiv.org/abs/2406.09246

  6. BabyAGI (Nakajima 2023) - One of the first attempts at an autonomous AI agent that creates, prioritizes, and executes tasks using LLMs, demonstrating both the promise and limitations of self-prompting loops.

    https://github.com/yoheinakajima/babyagi

  7. AutoGPT (Significant Gravitas 2023) - An experimental open-source project that chains GPT-4 calls with internet access, file I/O, and memory to accomplish user-defined goals autonomously.

    https://github.com/Significant-Gravitas/AutoGPT

  8. Reinforcement Learning from Vision-Language Foundation Model Feedback - Proposes using VLM outputs as reward signals for RL, replacing human feedback with model-generated evaluations for training embodied agents.

    https://arxiv.org/abs/2402.03681

5.18 Exercises

  1. Read Richard Sutton's “The Bitter Lesson” (http://www.incompleteideas.net/IncIdeas/BitterLesson.html). In your own words, summarize the argument in one paragraph. Then write a second paragraph: do you agree? Can you think of counter-examples where clever, specialized methods beat brute-force scaling?
  2. Compare DeepMind's and OpenAI's approaches from 2016 to 2024. Write a one-page timeline of each lab's major releases. Where did their philosophies diverge, and where did they converge?
  3. Reproduce GPT-3-style few-shot learning: using any modern LLM (ChatGPT, Claude, LLaMA), provide three examples of a task (e.g., translating English to French) in the prompt, then ask it to perform the task on a new input. Vary the number of examples from zero to five. How does performance change?
  4. Read the Chinchilla paper (Hoffmann et al. 2022) (or a summary of it). What is the “compute-optimal” relationship between model size and training data? How did this finding change how subsequent models were trained?
  5. Pick any model from the “Notable Models and Products” section. Research its current status: Is it still active? Has it been superseded? What has changed since it was released? This exercise will teach you how fast the field moves.

References

Brohan, Anthony, Noah Brown, Justice Carbajal, et al. 2023. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” arXiv Preprint arXiv:2307.15818.
Brown, Tom B, Benjamin Mann, Nick Ryder, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33: 1877-901.
Bruce, Jake, Michael Dennis, Ashley Edwards, et al. 2024. “Genie: Generative Interactive Environments.” International Conference on Machine Learning.
Chollet, François. 2024. ARC-AGI: A Benchmark for General Intelligence. https://arcprize.org/.
Frantar, Elias, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. “GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers.” arXiv Preprint arXiv:2210.17323.
Girdhar, Rohit, Alaaeldin El-Nouby, Zhuang Liu, et al. 2023. “ImageBind: One Embedding Space to Bind Them All.” CVPR.
Google DeepMind. 2024. Genie 2: A Large-Scale Foundation World Model. https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/.
Guo, Daya, Dejian Yang, Haowei Zhang, et al. 2025. “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” arXiv Preprint arXiv:2501.12948.
Hendrycks, Dan, Collin Burns, Steven Basart, et al. 2021. “Measuring Massive Multitask Language Understanding.” arXiv Preprint arXiv:2009.03300.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” arXiv Preprint arXiv:1503.02531.
Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, et al. 2022. “Training Compute-Optimal Large Language Models.” arXiv Preprint arXiv:2203.15556.
Jiang, Albert Q, Alexandre Sablayrolles, Arthur Mensch, et al. 2023. “Mistral 7B.” arXiv Preprint arXiv:2310.06825.
Jiang, Albert Q, Alexandre Sablayrolles, Antoine Roux, et al. 2024. “Mixtral of Experts.” arXiv Preprint arXiv:2401.04088.
Jimenez, Carlos E, John Yang, Alexander Wettig, et al. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? https://arxiv.org/abs/2310.06770.
Kahneman, Daniel. 2011. Thinking, Fast and Slow.
Kim, Moo Jin, Karl Pertsch, Siddharth Karamcheti, et al. 2024. “OpenVLA: An Open-Source Vision-Language-Action Model.” arXiv Preprint arXiv:2406.09246.
Kirillov, Alexander, Eric Mintun, Nikhila Ravi, et al. 2023. “Segment Anything.” arXiv Preprint arXiv:2304.02643.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems.
LeCun, Yann. 2022. “A Path Towards Autonomous Machine Intelligence.” OpenReview.
Lewis, Patrick, Ethan Perez, Aleksandra Piktus, et al. 2020. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Advances in Neural Information Processing Systems 33: 9459-74.
Lin, Ji, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. 2023. “AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration.” arXiv Preprint arXiv:2306.00978.
Nakajima, Yohei. 2023. BabyAGI. GitHub. https://github.com/yoheinakajima/babyagi.
OpenAI. 2023. GPT-4 Technical Report. https://arxiv.org/abs/2303.08774.
OpenAI. 2024. o1 System Card. https://cdn.openai.com/o1-system-card.pdf.
OpenAI. 2025. o3-mini System Card. https://cdn.openai.com/o3-mini-system-card-feb10.pdf.
Ouyang, Long, Jeffrey Wu, Xu Jiang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” Advances in Neural Information Processing Systems 35: 27730-44.
Radford, Alec, Jong Wook Kim, Chris Hallacy, et al. 2021. “Learning Transferable Visual Models from Natural Language Supervision.” arXiv Preprint arXiv:2103.00020.
Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. “Robust Speech Recognition via Large-Scale Weak Supervision.” arXiv Preprint arXiv:2212.04356.
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask Learners.” OpenAI Blog.
Schick, Timo, Jane Dwivedi-Yu, Roberto Dessì, et al. 2023. “Toolformer: Language Models Can Teach Themselves to Use Tools.” arXiv Preprint arXiv:2302.04761.
Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. “Neural Machine Translation of Rare Words with Subword Units.” arXiv Preprint arXiv:1508.07909.
Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, et al. 2017. “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” arXiv Preprint arXiv:1701.06538.
Significant Gravitas. 2023. AutoGPT. GitHub. https://github.com/Significant-Gravitas/AutoGPT.
Silver, David, Aja Huang, Chris J Maddison, et al. 2016. “Mastering the Game of Go with Deep Neural Networks and Tree Search.” Nature 529: 484-89.
Silver, David, Thomas Hubert, Julian Schrittwieser, et al. 2017. “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm.” arXiv Preprint arXiv:1712.01815.
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, et al. 2023. “LLaMA: Open and Efficient Foundation Language Models.” arXiv Preprint arXiv:2302.13971.
Touvron, Hugo, Louis Martin, Kevin Stone, et al. 2023. “Llama 2: Open Foundation and Fine-Tuned Chat Models.” arXiv Preprint arXiv:2307.09288.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems.
Wei, Jason, Xuezhi Wang, Dale Schuurmans, et al. 2022. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” arXiv Preprint arXiv:2201.11903.
Zhang, Chaoyun, Liqun Li, Shilin He, et al. 2024. “UFO: A UI-Focused Agent for Windows OS Interaction.” arXiv Preprint arXiv:2402.07939.
Ziegler, Daniel M., Nisan Stiennon, Jeffrey Wu, et al. 2020. Fine-Tuning Language Models from Human Preferences. https://arxiv.org/abs/1909.08593.