10 Multimodality

The real world is not made of text. It is a rich tapestry of images, sounds, language, motion, and sensation. A truly intelligent system must be able to perceive and reason across all of these modalities simultaneously, not switch between isolated specialists. This insight has driven the field of multimodal AI, which aims to build models that can process, relate, and generate data across multiple types of input and output.

A modality is simply a type of data: text, images, video, audio, depth maps, infrared, or even inertial measurements from a device's accelerometer and gyroscope. Unimodal models operate on a single modality (e.g., a language model that only sees text), while multimodal models can handle two or more. The motivation is clear: our world is inherently multimodal, and so are we. Humans perceive through sight, hearing, touch, taste, and smell, effortlessly combining these streams into a coherent understanding of our environment. To build AI that interacts naturally with the world, we need models that can do the same.

Why Multimodality Matters

Consider a simple task: a robot must “pick up the red cup next to the laptop.” This requires visual perception (recognizing the cup and laptop), language understanding (parsing the instruction), spatial reasoning (locating “next to”), and motor control (executing the grasp). No single modality suffices. Multimodal models unify these capabilities in a single system.

The benefits of multimodal training extend beyond practical necessity. Joint training across modalities can produce embeddings that capture richer semantic content, which often translates into better performance on downstream tasks. A model that has seen both images and their descriptions can develop a deeper understanding of visual concepts than one trained on images alone.

In this chapter, we will explore the landscape of multimodal AI. We begin with the distinction between pipelined and truly integrated multimodal systems, then examine foundational models like CLIP and ImageBind, survey the rise of Vision-Language Models (VLMs), explore Vision-Language-Action (VLA) models and their applications in robotics, discuss generative multimodal models for images, video, and audio, and close with a look at where the field is headed.

10.1 Pipelined vs. Truly Integrated Multimodality

There are two broad approaches to building multimodal systems. The first, and historically more common, is the pipelined approach: separate specialist models are trained for each modality (a vision encoder, a language model, an audio encoder) and then connected through adapters, projection layers, or simple concatenation. The output of one model becomes the input to another. For example, an image captioning system might use a CNN to extract visual features, project them into the language model's embedding space, and then decode text.

The second approach is true integration: a single model processes all modalities natively, converting each into a common token or embedding representation and processing them together in one unified architecture. GPT-4o and Google's Gemini are generally cited as examples of this paradigm, in which text, images, and audio are all converted to tokens and processed by the same transformer.

The Spectrum of Integration

Most real systems fall somewhere between these extremes. LLaVA, for instance, uses a frozen CLIP vision encoder and connects it to a language model via a learnable projection, making it a hybrid. The trend, however, is toward deeper integration, as it lets the model learn cross-modal relationships that loosely coupled pipelines struggle to capture.

The advantage of true integration is that the model can learn subtle cross-modal correlations during training. A pipelined system where the vision encoder is frozen cannot adapt its visual representations based on language context, while an integrated model can jointly optimize across all modalities.

10.2 CLIP: Contrastive Language-Image Pretraining

CLIP (Radford et al. 2021), released by OpenAI, was a watershed moment for multimodal AI. It demonstrated that simple contrastive learning on large-scale image-text pairs could produce visual representations that rival or exceed those of supervised models, with remarkable zero-shot transfer capabilities.

10.2.1 Architecture and Training

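CLIP consists of two encoders: an image encoder (either a ResNet or a Vision Transformer) and a text encoder (a standard Transformer). Given a batch of \(N\) image-text pairs, CLIP computes embeddings for all images and all texts, producing an \(N \times N\) matrix of cosine similarities. The training objective is to maximize the similarity between the \(N\) correct image-text pairs (the diagonal of the matrix) while minimizing the similarity between the \(N^2 - N\) incorrect pairs (the off-diagonal elements). Concretely, the similarities are scaled by a learned temperature and treated as logits for a cross-entropy loss, applied once across each row (matching every image to its text) and once down each column (matching every text to its image), and the two terms are averaged. This symmetric objective is the contrastive loss.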

Contrastive Learning in a Nutshell

Suppose a training batch contains 100 images, each paired with its caption. For every image, there is exactly one correct caption out of the hundred. The model learns to push each image's embedding close to its matching caption and far from all 99 wrong ones. Over billions of such comparisons, a shared image-text embedding space emerges where semantically similar concepts cluster together.
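The objective described above can be sketched in a few lines of plain Python. This is a minimal illustration, not CLIP's actual training code: the function names, the toy two-dimensional embeddings, and the fixed temperature of 0.07 (in CLIP the temperature is a learned parameter) are all assumptions made for the example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the N x N cosine-similarity matrix."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]
    # Image -> text: in each row, the correct "class" is the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: the same loss applied to each column.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch of N = 3 pairs (made-up numbers, not real encoder outputs).
image_embs = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.2]]
matched_texts = [[0.9, 0.0], [0.1, 1.1], [-0.8, 0.1]]
# Rotate the captions so that every image is paired with a wrong one.
shuffled_texts = [matched_texts[1], matched_texts[2], matched_texts[0]]

loss_matched = clip_contrastive_loss(image_embs, matched_texts)
loss_shuffled = clip_contrastive_loss(image_embs, shuffled_texts)
print(f"matched: {loss_matched:.4f}  shuffled: {loss_shuffled:.4f}")
```

The loss is low when each image is most similar to its own caption and high when the pairing is scrambled, which is the signal that shapes the shared embedding space.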

CLIP was trained on 400 million image-text pairs scraped from the internet. The scale and diversity of this dataset are what give CLIP its generalization power. Unlike supervised models trained on fixed label sets (e.g., ImageNet's 1,000 classes), CLIP can in principle recognize any visual concept that can be described in natural language, although its accuracy varies considerably across domains.

10.2.2 Zero-Shot Classification

CLIP's most celebrated capability is zero-shot image classification. To classify an image into one of \(K\) categories, you simply create \(K\) text prompts (e.g., “a photo of a dog,” “a photo of a cat”) and compute the cosine similarity between the image embedding and each text embedding. The category with the highest similarity is the prediction. No task-specific training is needed.
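The procedure reduces to a nearest-neighbor lookup in embedding space. The sketch below assumes the embeddings have already been produced by the two encoders; the three-dimensional vectors and the helper name `zero_shot_classify` are hypothetical stand-ins chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, prompt_embs):
    """Return the best-matching prompt and all scores.

    prompt_embs maps each text prompt to its (precomputed) text embedding.
    """
    scores = {prompt: cosine_similarity(image_emb, emb)
              for prompt, emb in prompt_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical embeddings; a real system would obtain these from the encoders.
prompt_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]

prediction, scores = zero_shot_classify(image_emb, prompt_embs)
print(prediction)  # a photo of a dog
```

Changing the set of categories only requires writing new prompts and embedding them; nothing about the model is retrained.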

CLIP's Impact

CLIP's learned representations have become foundational building blocks across AI. They power the text-to-image generation in DALL-E 2, provide the vision backbone for LLaVA and many other VLMs, and serve as the image encoder in countless downstream applications. CLIP showed that natural language supervision is a powerful, scalable alternative to manual labeling.

10.3 ImageBind by Meta

While CLIP aligns two modalities (images and text), Meta's ImageBind (Girdhar et al. 2023) extends this idea to six modalities: images, text, audio, depth, thermal (infrared), and IMU (inertial measurement unit) data. The result is a single embedding space where all six modalities coexist.

10.3.1 The Bridge Modality Insight

ImageBind's key innovation is elegant: it does not require data paired across all six modalities. Collecting such data would be prohibitively expensive. Instead, ImageBind leverages the observation that large-scale paired datasets already exist for several modality pairs that share images: (image, text), (image, audio), (image, depth), (image, thermal), and (image, IMU). By training separate contrastive objectives for each pair, with images as the common anchor, all modalities are pulled into a shared embedding space.
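One way to write down the per-pair objective, using notation introduced here for illustration: let \(q_i\) be the normalized image embedding and \(k_i\) the normalized embedding of the paired sample from another modality \(M\), with temperature \(\tau\) and batch size \(N\). The image-to-modality term has the same contrastive form as in CLIP, and the total is written here as a sum over the non-image modalities (in practice each pair is typically drawn from its own dataset and batch):

\[
\mathcal{L}_{I,M} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(q_i^\top k_i / \tau\right)}{\sum_{j=1}^{N} \exp\left(q_i^\top k_j / \tau\right)},
\qquad
\mathcal{L} = \sum_{M} \left( \mathcal{L}_{I,M} + \mathcal{L}_{M,I} \right)
\]

Here \(\mathcal{L}_{M,I}\) is the same expression with the roles of \(q\) and \(k\) swapped. No term compares two non-image modalities directly; their alignment comes only from each being pulled toward the image embeddings.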

Emergent Cross-Modal Abilities

Because all modalities are aligned through images, ImageBind exhibits “emergent” zero-shot capabilities that it was never explicitly trained for. For example, given a sound (a dog barking), it can retrieve the most relevant text description, even though it was never trained on direct audio-text pairs. It can also match text to depth maps, or audio to thermal images. Images serve as a universal bridge connecting all other modalities.

10.3.2 Implications

ImageBind demonstrates that you do not need paired data for every combination of modalities. As long as there is a common anchor modality (in this case, images), you can indirectly align any two modalities through that anchor. This has profound implications for scaling to even more modalities: as long as you can pair a new modality with images, it automatically gains alignment with all other modalities in the space.

10.4 Vision-Language Models

Vision-Language Models (VLMs) combine visual perception with language understanding and generation. Unlike CLIP, which produces embeddings but cannot generate text, VLMs can hold conversations about images, answer visual questions, describe scenes, and reason about visual content.

10.4.1 LLaVA: Visual Instruction Tuning

LLaVA, the Large Language and Vision Assistant (Liu et al. 2023), is one of the most influential open-source VLMs. Its design is remarkably simple: take a pre-trained CLIP vision encoder, take a pre-trained language model (such as Vicuna or LLaMA), and connect them with a small learnable projection layer (a linear layer or a small MLP).

At both training and inference time, an image is passed through the CLIP encoder to produce a sequence of visual tokens. These tokens are projected into the language model's embedding space and prepended to the text tokens. The language model then processes both visual and text tokens together, generating a text response. Training proceeds in two stages: first, training only the projection layer on image-caption data while the vision encoder and language model stay frozen (alignment), then fine-tuning the projection layer and the language model on visual instruction-following data (instruction tuning).
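The data flow through the projection layer can be sketched as follows. This is a toy illustration with made-up dimensions and random numbers standing in for real encoder outputs and learned weights; the function names are hypothetical, and a single linear layer is used where some variants use a small MLP.

```python
import random

def linear_projection(tokens, weight, bias):
    """Apply y = W x + b to every token; weight has shape [d_out][d_in]."""
    return [[sum(w * x for w, x in zip(row, tok)) + b
             for row, b in zip(weight, bias)]
            for tok in tokens]

def build_llm_input(visual_features, text_embeddings, weight, bias):
    """Project visual features into the LLM embedding space and prepend them."""
    visual_tokens = linear_projection(visual_features, weight, bias)
    return visual_tokens + text_embeddings

random.seed(0)
D_VISION, D_LLM = 4, 6          # toy sizes; real models use far larger widths
NUM_PATCHES, NUM_TEXT = 3, 2    # toy sequence lengths

# The only trainable piece in the alignment stage: the projection parameters.
weight = [[random.gauss(0.0, 0.02) for _ in range(D_VISION)] for _ in range(D_LLM)]
bias = [0.0] * D_LLM

# Stand-ins for frozen vision-encoder outputs and embedded text tokens.
visual_features = [[random.random() for _ in range(D_VISION)] for _ in range(NUM_PATCHES)]
text_embeddings = [[random.random() for _ in range(D_LLM)] for _ in range(NUM_TEXT)]

sequence = build_llm_input(visual_features, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # 5 6
```

After projection the visual tokens have the same width as the text embeddings, so the language model can attend over the combined sequence without any architectural change.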

Why LLaVA Works So Well

LLaVA's success comes from standing on the shoulders of two giants: CLIP's powerful visual representations and a strong pre-trained language model. The projection layer is small, so training is cheap and fast. The instruction tuning data (generated using GPT-4) teaches the model to follow complex visual instructions, answer questions, and engage in multi-turn dialogues about images.

10.4.2 GPT-4o: Native Multimodality

GPT-4o, released by OpenAI in 2024, represents a fundamentally different approach. Unlike LLaVA, which stitches together pre-trained components, GPT-4o is described by OpenAI as a single model trained end-to-end across text, vision, and audio.

OpenAI has not published GPT-4o's architecture in detail, so what follows is a plausible sketch of a natively multimodal design rather than a confirmed description. The key difference from a modular system is that no modality is handed off to a separately trained specialist model. Instead, each modality is converted into a sequence of tokens using learnable front-end modules. Text is tokenized as usual. Images are converted into sequences of visual tokens. Audio is converted into audio tokens. All of these are concatenated into a single sequence that the transformer processes, allowing the model to attend across modalities at every layer.

Training uses mixed sequences: sometimes pure text, sometimes image and text together, sometimes audio and text, or all three at once. The model learns to reason across modalities seamlessly because it has always seen them together. GPT-4o can look at a picture, hear your voice, understand both, and respond in speech or text, all within a single model rather than a chain of separate ones.

10.4.3 Gemini

Google's Gemini models take a similar approach to native multimodality. Gemini is trained from the ground up on interleaved text, image, audio, and video data. Gemini's architecture processes all modalities within a unified transformer, and it can both understand and generate content across multiple modalities. The Gemini family spans multiple sizes, from the lightweight Gemini Nano (designed for on-device use) to the powerful Gemini Ultra.

The Convergence of VLMs

The trend is clear: the most capable VLMs are moving toward native multimodal training in a single unified architecture, rather than bolting separate components together. This enables deeper cross-modal reasoning and more natural interaction. However, the modular approach (exemplified by LLaVA) remains important for research and for settings where compute is limited, since it allows reusing existing pre-trained models.

10.5 Vision-Language-Action Models and Robotics

Vision-Language-Action (VLA) models extend multimodal AI into the physical world. While VLMs take in images and text and output text, VLAs additionally output actions: motor commands that control a robot's joints, grippers, or wheels.

The core idea is to leverage the rich visual and language understanding of pre-trained VLMs and extend them to predict robotic actions. If a model can look at a scene, understand a natural language instruction, and produce the sequence of motor commands needed to carry out that instruction, then we have a general-purpose robot controller.

10.5.1 RT-1 and RT-2

Google's RT-1 (Brohan et al. 2022) was a robotics transformer trained on 130,000 real-world robot demonstrations. It takes camera images and natural language instructions (“pick up the can”) as input and outputs discretized motor actions. The model learned to map visual observations and language goals directly to low-level control commands.

RT-2 (Brohan et al. 2023) took this further by fine-tuning large VLMs (PaLI-X and PaLM-E) to additionally output robot actions, which are represented as text tokens so that the model emits them in the same way it emits words. The insight was that web-scale visual-language pre-training transfers to robotic control: the model could follow novel instructions it had never seen during robot training, because it had learned rich visual and linguistic representations from internet data.

From Web Knowledge to Robot Actions

RT-2 demonstrated that a model trained on billions of web images and text paragraphs can transfer that knowledge to a robot arm. When asked to “pick up the extinct animal” (a toy dinosaur), RT-2 succeeded even though it had never been trained on that specific instruction with a robot. Its web-scale training gave it the concept of “extinct animal,” and its robotic fine-tuning gave it the motor skills.
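The “discretized motor actions” mentioned for RT-1 can be illustrated with a uniform binning scheme, which is what lets a transformer treat control as token prediction. The sketch below assumes 256 bins per action dimension, the figure reported for RT-1; the action bounds of \([-1, 1]\) and the function names are hypothetical choices for the example.

```python
def discretize(value, low, high, num_bins=256):
    """Map a continuous action value to an integer bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # The upper bound itself would land in bin num_bins, so clamp it.
    return min(int(fraction * num_bins), num_bins - 1)

def undiscretize(bin_index, low, high, num_bins=256):
    """Map a bin index back to the continuous value at the center of the bin."""
    width = (high - low) / num_bins
    return low + (bin_index + 0.5) * width

# A toy 3-dimensional action, e.g. two end-effector deltas and a gripper command.
LOW, HIGH = -1.0, 1.0   # assumed normalized action bounds
action = [0.25, -0.5, 1.0]

tokens = [discretize(a, LOW, HIGH) for a in action]
recovered = [undiscretize(t, LOW, HIGH) for t in tokens]
print(tokens)     # [160, 64, 255]
print(recovered)
```

The round trip loses at most half a bin width per dimension, which is the precision cost paid for turning control into a classification problem over a fixed vocabulary.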

10.5.2 OpenVLA

OpenVLA (Kim et al. 2024) is an open-source 7B parameter VLA model built on a Llama 2 language model with a visual encoder that fuses SigLIP and DINOv2 features. Rather than attaching a separate action head, it predicts robot actions as discretized tokens drawn from the language model's own output vocabulary. It was trained on roughly 970,000 robot episodes from the Open X-Embodiment dataset, spanning multiple robot platforms and tasks. OpenVLA can be fine-tuned for specific robots with a comparatively small number of demonstrations, making it a practical starting point for researchers and labs building robotic systems.

10.5.3 Challenges in VLA Deployment

Despite rapid progress, VLAs face several significant challenges:

Precise manipulation: Current models struggle with tasks requiring sub-millimeter precision, such as inserting a key into a lock or threading a needle.
Long-horizon tasks: Multi-step tasks (“clean the kitchen”) require planning over extended time horizons, which current VLAs handle poorly.
Novel environments: Sim-to-real transfer remains difficult. Models trained in simulation often fail when confronted with the messiness of real-world lighting, textures, and physics.
Safety: A robot that misinterprets an instruction can cause physical harm. Robust safety mechanisms are essential before deployment.

10.6 VLA Models and RL

VLAs and Reinforcement Learning (RL) are deeply complementary. Most VLAs are initially trained via behavior cloning: supervised learning on expert demonstrations. While this provides a strong starting point, it inherits the limitations of the demonstration data and tends not to improve beyond the demonstrator's skill level.

RL addresses this by allowing the agent to learn from trial and error, optimizing a reward signal rather than imitating fixed demonstrations. The combination is powerful:

Pre-training with imitation, fine-tuning with RL: VLAs are pre-trained via behavior cloning on large demonstration datasets, then fine-tuned with RL in simulation or the real world to improve robustness and handle edge cases the demonstrations did not cover.
Language-conditioned RL: An RL agent receives its reward based on whether it successfully followed a natural language instruction. The VLA's language understanding enables zero-shot generalization to new tasks described in words.
World model planning: A learned world model (discussed in Chapter 19) can simulate future states, allowing the VLA to plan actions “in imagination” before executing them physically. This drastically reduces the number of costly real-world interactions needed for learning.
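The difference between the two training signals can be stated compactly. In notation introduced here for illustration, \(\pi_\theta\) is the policy, \(o\) an observation, \(\ell\) a language instruction, \(a\) an action, \(\mathcal{D}\) the demonstration dataset, \(r\) a reward function, and \(\gamma\) a discount factor:

\[
\mathcal{L}_{\text{BC}}(\theta) = -\mathbb{E}_{(o, \ell, a) \sim \mathcal{D}} \left[ \log \pi_\theta(a \mid o, \ell) \right],
\qquad
J_{\text{RL}}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \gamma^{t} \, r(s_t, a_t, \ell) \right]
\]

Behavior cloning minimizes \(\mathcal{L}_{\text{BC}}\), so the expectation is over a fixed dataset of expert actions. RL maximizes \(J_{\text{RL}}\), where the expectation is over trajectories \(\tau\) generated by the policy itself, which is why it can discover behavior that no demonstration contains.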

The Role of Simulation

Simulation environments like MuJoCo, Isaac Gym, and Habitat are critical for VLA training. They allow millions of RL episodes to be run in parallel at low cost compared with real hardware. The challenge is the sim-to-real gap: policies learned in simulation must transfer to the real world, where physics, lighting, and object properties differ from the simulator's approximations. Domain randomization (training with varied simulation parameters) helps bridge this gap.
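Domain randomization amounts to resampling the simulator's settings at the start of every episode. The sketch below is simulator-agnostic: the parameter names, their ranges, and the `SimParams` container are hypothetical, and a real setup would pass the sampled values to whichever simulator is in use.

```python
import random
from dataclasses import dataclass

@dataclass
class SimParams:
    """Physical and visual settings for one simulated episode (illustrative)."""
    friction: float
    object_mass_kg: float
    light_intensity: float
    camera_jitter_deg: float

# Assumed ranges; in practice these are tuned to bracket real-world conditions.
RANGES = {
    "friction": (0.5, 1.5),
    "object_mass_kg": (0.05, 0.5),
    "light_intensity": (0.3, 1.0),
    "camera_jitter_deg": (-5.0, 5.0),
}

def sample_sim_params(rng):
    """Draw one independent uniform sample for every randomized parameter."""
    return SimParams(**{name: rng.uniform(low, high)
                        for name, (low, high) in RANGES.items()})

rng = random.Random(0)
episodes = [sample_sim_params(rng) for _ in range(5)]
for params in episodes:
    # A training loop would reset the simulator with these values here.
    print(params)
```

Because the policy never sees the same physics or lighting twice, it is pushed toward behavior that works across the whole range, with the hope that real-world conditions fall somewhere inside it.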

10.7 Text-to-Image and Text-to-Video Generation

Generative multimodal models can produce images and videos from text descriptions. This is, in a sense, the inverse of visual understanding: instead of perceiving images and outputting text, these models perceive text and output images.

10.7.1 Diffusion Models

The dominant paradigm for image generation is the diffusion model. The core idea is to train a neural network to reverse a gradual noising process. During training, noise is progressively added to clean images until they become pure Gaussian noise. The model learns to predict and remove this noise step by step. At generation time, you start from random noise and iteratively denoise, guided by a text prompt, until a clean image emerges.
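The noising half of this process has a convenient closed form in DDPM-style formulations: the noisy sample at step \(t\) is \(x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon\) with \(\epsilon\) drawn from a standard Gaussian. The sketch below assumes that formulation and a linear noise schedule with 1,000 steps, neither of which is specified in the text above; the four-element list stands in for an image.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta) up to and including step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0; returns the noisy sample and the noise used."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]
    return x_t, noise

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x0 = [0.5, -0.2, 0.9, 0.0]  # toy "image" of four pixel values

for t in (0, 499, 999):
    x_t, noise = add_noise(x0, t, alpha_bars, rng)
    # Early steps keep most of the signal; by the last step almost none remains.
    print(t, f"signal fraction = {math.sqrt(alpha_bars[t]):.4f}")
```

During training, the network receives `x_t` and `t` and is asked to predict `noise`; at generation time that prediction is what gets subtracted, step by step, starting from pure noise.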

How Text Guides Image Generation

In a text-conditioned diffusion model, the text prompt is encoded (often using a CLIP text encoder) and injected into the denoising network via cross-attention layers. At each denoising step, the model attends to the text embedding, which steers the generation toward the described content. “A golden retriever playing in snow” pushes the model to denoise in a direction consistent with dogs, gold fur, and snowy landscapes.
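The cross-attention step can be sketched with scaled dot-product attention in which queries come from image positions and keys and values come from text tokens. This is a simplified illustration: the learned query, key, and value projections are omitted, and all vectors and token labels are made-up toy values.

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each image position takes a weighted average of the text values.

    queries: one vector per image (latent) position
    keys, values: one vector each per text token
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

text_tokens = ["retriever", "snow", "playing"]      # labels for readability only
keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]                  # two image positions

outputs, weights = cross_attention(queries, keys, values)
for position, w in enumerate(weights):
    focus = text_tokens[w.index(max(w))]
    print(f"image position {position} attends most to '{focus}'")
```

Each image position ends up with a different mixture of text information, which is how a prompt can influence different regions of the image in different ways.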

10.7.2 Key Models

DALL-E 2 (OpenAI, 2022) uses a two-stage process: a prior model maps CLIP text embeddings to CLIP image embeddings, and then a diffusion decoder generates the image from the image embedding. This architecture leverages CLIP's learned image-text alignment.

Stable Diffusion (Stability AI, 2022) performs diffusion in a compressed latent space rather than pixel space, significantly reducing computation. An encoder compresses images to latent representations, diffusion operates in this latent space, and a decoder maps back to pixels. This Latent Diffusion Model (LDM) architecture made high-quality image generation practical on consumer hardware.
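A quick calculation shows the size of the saving. The numbers below assume the commonly cited Stable Diffusion v1 configuration, which the text above does not specify: a 512 by 512 RGB image, 8x spatial downsampling by the encoder, and 4 latent channels.

```python
def num_elements(height, width, channels):
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels

IMAGE_SIZE = 512        # assumed input resolution
DOWNSAMPLE = 8          # assumed spatial compression factor of the encoder
LATENT_CHANNELS = 4     # assumed number of latent channels

pixel_elems = num_elements(IMAGE_SIZE, IMAGE_SIZE, 3)
latent_elems = num_elements(IMAGE_SIZE // DOWNSAMPLE,
                            IMAGE_SIZE // DOWNSAMPLE,
                            LATENT_CHANNELS)
reduction = pixel_elems / latent_elems

print(pixel_elems)   # 786432
print(latent_elems)  # 16384
print(reduction)     # 48.0
```

Under these assumptions every denoising step operates on about 48 times fewer values than it would in pixel space, and that saving is repeated at each of the many steps needed to generate one image.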

FLUX (Black Forest Labs, 2024) pairs a flow matching training objective with a transformer backbone in place of the U-Net used by earlier latent diffusion models, and was reported at release to deliver strong image quality and prompt adherence.

10.7.3 Video Generation: Sora and Beyond

Extending image generation to video adds the dimension of temporal consistency: generated frames must be coherent across time, maintaining object identity, motion continuity, and physical plausibility.

OpenAI's Sora (2024) demonstrated that diffusion transformers (DiTs) can generate photorealistic videos up to a minute long from text descriptions. Sora processes spacetime patches (3D chunks of video) as tokens, allowing it to handle variable durations and resolutions. The results were often visually convincing in their physics, lighting, and camera motion, though artifacts and physically implausible events remain.
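Treating video as spacetime patches means the token count follows directly from the clip's dimensions, which is why one architecture can accept different durations and resolutions. Sora's actual patch sizes are not public, so the values below are purely hypothetical, as is the function name.

```python
def num_spacetime_patches(frames, height, width, patch_t, patch_h, patch_w):
    """Count non-overlapping 3D patches in a frames x height x width volume."""
    if frames % patch_t or height % patch_h or width % patch_w:
        raise ValueError("volume dimensions must be divisible by the patch size")
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)

# Hypothetical patch size: 2 frames x 2 x 2 spatial positions.
PATCH = (2, 2, 2)

short_clip = num_spacetime_patches(16, 32, 32, *PATCH)
long_clip = num_spacetime_patches(32, 32, 32, *PATCH)

print(short_clip)  # 2048
print(long_clip)   # 4096
```

Doubling the duration doubles the sequence length while the model itself stays the same; the cost shows up in attention, which grows faster than linearly with the number of tokens.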

Video as World Simulation

Some researchers have described video generation models as “world simulators.” If a model can generate realistic video of a ball rolling down a hill, bouncing off a wall, and coming to rest, it must encode implicit knowledge of gravity, friction, and collision. This connects to the world models discussed in Chapter 19: generative video models may eventually serve as learned physics engines for planning and reasoning.

10.8 Audio and Speech Models

Multimodal AI extends beyond vision and language to the auditory domain. Several recent models have demonstrated remarkable capabilities in speech and audio processing.

Whisper (OpenAI, 2022) is a speech recognition model trained on 680,000 hours of multilingual audio-text data. It uses a simple encoder-decoder transformer architecture and achieves robust transcription across dozens of languages, accents, and acoustic conditions. Its success comes from the sheer scale and diversity of its training data.

VALL-E (Microsoft, 2023) is a text-to-speech model that treats speech synthesis as a language modeling problem. Given a 3-second audio sample of a speaker and a text prompt, VALL-E generates speech in that speaker's voice. It uses neural audio codec tokens (from EnCodec) as its “vocabulary,” treating speech generation as next-token prediction over audio tokens.

MusicGen (Meta, 2023) generates music from text descriptions or melody inputs. It operates on multiple streams of audio tokens simultaneously and can produce coherent, high-quality musical compositions in various styles.

Audio Tokens: The Key Insight

A recurring pattern across audio models is the use of neural audio codecs (such as EnCodec or SoundStream) to tokenize audio waveforms. These codecs compress audio into discrete token sequences, analogous to how text tokenizers convert words to token IDs. Once audio is tokenized, it can be processed by transformers using the same next-token prediction framework that powers language models. This unification is what enables truly integrated multimodal systems.
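Codecs such as SoundStream and EnCodec produce their tokens with residual vector quantization: each stage picks the nearest entry in its codebook and passes the leftover error to the next stage. The sketch below shows that idea on a single two-dimensional vector with tiny hand-written codebooks; real codecs first encode the waveform with a neural network and use learned codebooks with many more entries.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to vector (squared Euclidean distance)."""
    def sq_dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def rvq_encode(vector, codebooks):
    """Quantize in stages; each stage encodes what the previous stages missed."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        tokens.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out

# Toy codebooks: a coarse stage followed by a finer one.
coarse = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
fine = [[0.25, 0.25], [-0.25, -0.25], [0.25, -0.25], [-0.25, 0.25]]
codebooks = [coarse, fine]

frame = [0.8, -1.2]  # stand-in for one encoded audio frame
tokens = rvq_encode(frame, codebooks)
reconstruction = rvq_decode(tokens, codebooks)

print(tokens)          # [2, 1]
print(reconstruction)  # [0.75, -1.25]
```

Each frame becomes a short tuple of integers, one per stage, and these integers are the audio “vocabulary” that a transformer can model with next-token prediction.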

10.9 The Future of Multimodal AI

The trajectory of multimodal AI points toward increasingly unified, capable systems. Several trends are shaping this future:

Universal architectures: The distinction between “vision model,” “language model,” and “audio model” is dissolving. Future systems will be trained from scratch on interleaved data from all modalities, with a single architecture that makes no distinction between seeing, reading, and hearing.

More modalities: Current systems handle a handful of modalities. Future models may incorporate touch (haptic feedback), smell (chemical sensors), proprioception (body position awareness), and even electromagnetic or lidar data. ImageBind's bridge modality approach shows how this can scale without requiring all-pairs data.

Embodied multimodality: As VLAs mature, multimodal AI will increasingly inhabit physical bodies: robots, autonomous vehicles, surgical systems, and assistive devices. The combination of perception, language understanding, and action in a single model is the key to general-purpose robotics.

Real-time interaction: GPT-4o's ability to see, hear, and speak in real time previews a future where AI assistants interact with humans as naturally as another person would, perceiving and responding to facial expressions, tone of voice, and visual context simultaneously.

The Multimodal Scaling Hypothesis

Just as the “scaling hypothesis” for language models predicted that larger models trained on more text would develop emergent capabilities, many researchers believe that scaling multimodal models across more modalities, more data, and more compute will yield emergent cross-modal reasoning abilities that we cannot yet anticipate. The early evidence from models like GPT-4o and Gemini is consistent with this hypothesis, though it is far from conclusive.

Multimodal AI is not merely about adding new input types to existing models. It represents a fundamental shift in how we think about intelligence: from narrow, single-modality specialists to integrated systems that perceive and act in the world as holistically as biological organisms do. The models discussed in this chapter are early steps on that path, and the most exciting developments likely lie ahead.

References

Brohan, Anthony, Noah Brown, Justice Carbajal, et al. 2022. “RT-1: Robotics Transformer for Real-World Control at Scale.” arXiv Preprint arXiv:2212.06817.

Brohan, Anthony, Noah Brown, Justice Carbajal, et al. 2023. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” arXiv Preprint arXiv:2307.15818.

Girdhar, Rohit, Alaaeldin El-Nouby, Zhuang Liu, et al. 2023. “ImageBind: One Embedding Space to Bind Them All.” CVPR.

Kim, Moo Jin, Karl Pertsch, Siddharth Karamcheti, et al. 2024. “OpenVLA: An Open-Source Vision-Language-Action Model.” arXiv Preprint arXiv:2406.09246.

Liu, Haotian, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. “Visual Instruction Tuning.” arXiv Preprint arXiv:2304.08485.

Radford, Alec, Jong Wook Kim, Chris Hallacy, et al. 2021. “Learning Transferable Visual Models from Natural Language Supervision.” arXiv Preprint arXiv:2103.00020.