Appendix D - Glossary

Adversarial suffix
An optimized token sequence appended to a prompt to make an aligned model comply with requests it would otherwise refuse (e.g., the GCG attack).
Agent
An LLM-based system that can reason, plan, and take actions by calling external tools or APIs.
Alignment
The problem of ensuring an AI system's learned behavior matches human intentions and values.
Attention (self-attention)
The core mechanism in transformers, allowing each token to attend to every other token in a sequence and learn contextual relationships.
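As a minimal sketch of the mechanism, the following computes scaled dot-product attention in plain Python. It assumes the queries, keys, and values are given directly; a real transformer layer derives them from learned projections and, in decoder-only models, adds a causal mask. The function names and toy vectors are illustrative only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors.

    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j.
    """
    d = len(Q[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

# Two toy tokens with orthogonal vectors (illustrative values only).
X = [[1.0, 0.0], [0.0, 1.0]]
out, attn = self_attention(X, X, X)
```

Each row of `attn` sums to 1, and each token here attends most strongly to itself because its query matches its own key best.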
Backpropagation
The algorithm for computing gradients of a loss function with respect to every weight in a neural network, enabling gradient-based training.
Byte Pair Encoding (BPE)
A tokenization algorithm that iteratively merges the most frequent pair of adjacent symbols (initially characters or bytes) into a new token until a target vocabulary size is reached.
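The merge loop can be sketched on a toy corpus as follows. This is a simplified illustration, assuming whitespace-split words and character-level starting symbols; production tokenizers add byte-level handling, pre-tokenization rules, and special tokens. All function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

merges, words = learn_bpe("low low low lower", num_merges=2)
```

After two merges on this corpus, the three characters of "low" have collapsed into a single symbol, while the rarer suffix "er" is still split into characters.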
Chain-of-thought (CoT)
A prompting strategy that elicits step-by-step reasoning from an LLM before it produces a final answer.
Chinchilla scaling
The finding (Hoffmann et al., 2022) that compute-optimal training scales model size and training tokens in roughly equal proportion, which works out to about 20 training tokens per parameter.
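A rough worked example, under two commonly used approximations that are assumptions here rather than exact results: training compute is about \(C \approx 6ND\) FLOPs for \(N\) parameters and \(D\) tokens, and the compute-optimal ratio is about \(D \approx 20N\). The constant and function names are illustrative.

```python
import math

TOKENS_PER_PARAM = 20.0       # rule-of-thumb ratio D / N (assumption)
FLOPS_PER_PARAM_TOKEN = 6.0   # from the approximation C ~ 6 * N * D (assumption)

def compute_optimal(compute_flops):
    """Split a FLOP budget into (parameters, training tokens) under the assumptions above."""
    # Substituting D = 20 N into C = 6 N D gives C = 120 N^2.
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

n_params, n_tokens = compute_optimal(5.88e23)
```

For a budget of \(5.88 \times 10^{23}\) FLOPs this gives roughly 70 billion parameters and 1.4 trillion tokens, which is approximately the configuration of the Chinchilla model itself.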
Continuous batching
An inference-serving technique where new requests can join a running batch dynamically, improving GPU utilization.
Corrigibility
An AI system's willingness to be corrected, modified, or shut down by its operators.
Data contamination
When benchmark test data leaks into a model's pre-training corpus, artificially inflating evaluation scores.
Decoder-only transformer
The architecture used in GPT-style models, generating tokens left-to-right with causal masking.
Diffusion model
A generative model that learns to reverse a gradual noising process, producing images, audio, or video from random noise.
Direct Preference Optimization (DPO)
A simpler alternative to RLHF that optimizes the policy directly on human preference pairs, without training a separate reward model.
Distillation
See Knowledge distillation.
Embedding
A dense, continuous vector representation of a discrete input (token, sentence, image) in a high-dimensional space.
Equivariance
A property where a function's output transforms predictably when its input undergoes a specific transformation (e.g., rotation).
Feature attribution
Methods (gradient-based, LIME, SHAP) that assign importance scores to individual inputs to explain a model's prediction.
Fine-tuning
Continuing training of a pre-trained model on a smaller, task-specific dataset to adapt it to a particular use case.
FlashAttention
An exact, memory-efficient attention algorithm that computes attention in tiles and avoids materializing the full \(N \times N\) attention matrix, reducing memory from \(O(N^2)\) to \(O(N)\) in sequence length.
Goodhart's Law
“When a measure becomes a target, it ceases to be a good measure.” A recurring concern in AI evaluation.
Graph Neural Network (GNN)
A neural network that operates on graph-structured data, aggregating information from neighboring nodes via message passing.
Guardrail model
A secondary model that monitors and filters an LLM's inputs or outputs for safety and policy compliance.
Hallucination
When a model generates plausible-sounding but factually incorrect or unsupported content.
In-context learning
A model's ability to learn from examples provided in the prompt at inference time, without any weight updates.
Induction head
An attention-head circuit that implements in-context pattern matching and copying.
Invariance
A property where a function's output is unchanged under specific transformations of its input.
Jailbreaking
Prompt techniques designed to bypass an LLM's safety guardrails and elicit disallowed outputs.
JEPA (Joint Embedding Predictive Architecture)
Yann LeCun's framework for self-supervised learning that predicts in embedding space rather than pixel or token space.
Knowledge distillation
Training a smaller “student” model to mimic a larger “teacher” model's outputs, transferring capability at reduced cost.
KV-cache
Cached key and value tensors from previously processed tokens (both prompt and generated), avoiding redundant computation during autoregressive inference.
Latent diffusion
Running the diffusion process in a compressed latent space rather than raw pixel space, making generation much faster.
Latent space
The internal, compressed representation space learned by a neural network.
Linear probing
Training a simple linear classifier on a model's hidden states to test what information is encoded at each layer.
Logit lens
A technique that projects intermediate transformer hidden states to vocabulary space to visualize how predictions evolve layer by layer.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that adds small trainable low-rank matrices alongside frozen model weights.
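A minimal sketch of the idea in plain Python, assuming a single linear layer with frozen weight matrix `W` of shape \(d \times k\) and trainable factors `B` (\(d \times r\)) and `A` (\(r \times k\)). The helper names are hypothetical; libraries such as PEFT wrap this logic around real model layers.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """Compute h = W x + (alpha / r) * B (A x).

    W stays frozen; only A (r x k) and B (d x r) would be trained.
    """
    r = len(A)
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * dl for b, dl in zip(base, delta)]

def param_counts(d, k, r):
    """Return (full fine-tuning parameters, LoRA trainable parameters) for one layer."""
    return d * k, r * (d + k)

full, lora = param_counts(d=4096, k=4096, r=8)
```

With `B` initialized to zeros, as in the original LoRA setup, the adapted layer starts out identical to the frozen one. For a \(4096 \times 4096\) layer at rank 8, the trainable parameter count drops from 16,777,216 to 65,536.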
Mechanistic interpretability
The practice of reverse-engineering the specific algorithms and circuits implemented by a neural network's weights.
Message passing
The GNN paradigm where each node aggregates feature vectors from its neighbors to update its own representation.
Mixture of Experts (MoE)
An architecture where a routing mechanism activates only a subset of parameters (“experts”) for each input, enabling larger models at lower compute cost.
Multi-head attention
Running multiple attention computations in parallel, allowing the model to capture different types of relationships simultaneously.
PagedAttention
A technique, introduced with vLLM, that manages KV-cache memory in fixed-size pages (like OS virtual memory), nearly eliminating fragmentation during inference serving.
Perplexity
A standard metric for language models: the exponential of the average negative log-likelihood per token on held-out text, which measures how surprised the model is by that text. Lower is better.
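As a small illustration, perplexity can be computed from the probabilities a model assigned to each actual next token, as the exponential of the mean negative log-probability. The function below is a sketch that assumes those per-token probabilities are already available and nonzero.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the observed tokens)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every observed token.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

A model that always spreads its probability uniformly over four choices has a perplexity of about 4, which is why perplexity is often read as an effective branching factor.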
Polysemantic neuron
A neuron that activates for multiple, seemingly unrelated concepts; a consequence of superposition.
Post-training
The alignment stage after pre-training, typically including supervised fine-tuning, RLHF, and/or DPO.
Pre-training
The initial large-scale training phase where a model learns general language (or multimodal) representations from massive corpora.
Prompt injection
Manipulating an LLM's behavior by embedding adversarial instructions inside untrusted input data.
Pruning
Removing unimportant weights or entire structures (heads, layers) from a model to reduce its size and latency.
Quantization
Reducing the numerical precision of model weights, and sometimes activations (e.g., FP32 \(\to\) INT4), shrinking memory footprint and speeding up inference.
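One simple scheme, sketched below, is symmetric per-tensor quantization: a single scale maps the largest-magnitude weight to the largest representable integer. This is an illustrative assumption, not the only approach; practical methods often use per-channel or per-group scales and calibration data.

```python
def quantize(weights, bits=8):
    """Symmetric quantization of a list of floats to signed integers of the given width."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integer levels back to approximate float values."""
    return [qi * scale for qi in q]

weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize(weights, bits=4)
approx = dequantize(q, scale)
```

The round-trip error for each weight is at most half the scale, so fewer bits (a larger scale) means a coarser approximation.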
RAG (Retrieval-Augmented Generation)
Grounding LLM responses by first retrieving relevant documents from an external knowledge base and including them in the prompt.
ReAct
An agent framework that interleaves reasoning traces (“thought”) with tool-calling actions (“action”), feeding each tool result (“observation”) back into an alternating loop.
Recursive self-improvement
A hypothetical scenario where an AI system improves its own capabilities in a continuous feedback loop.
Red teaming
Systematically probing an AI system for vulnerabilities, biases, or failure modes.
Representation engineering
Directly modifying a model's internal activations at inference time to steer its behavior (e.g., increasing honesty or reducing toxicity).
Reward model
A model trained on human preference data that scores LLM outputs, used as the reward signal in RLHF.
Scaling laws
Empirical power-law relationships between compute, dataset size, model parameters, and resulting performance.
Sparse autoencoder (SAE)
An autoencoder trained with a sparsity constraint to decompose a model's polysemantic internal activations into more interpretable, ideally monosemantic, features.
Speculative decoding
Using a small, fast “draft” model to propose candidate tokens that are then verified in parallel by the full model.
Superposition
When a neural network encodes more conceptual features than it has dimensions, overlapping them in a compressed representation.
Tokenization
The process of converting raw text into a sequence of discrete integer tokens for model input.
Transformer
The dominant neural network architecture for language and multimodal AI, based entirely on self-attention rather than recurrence or convolution.
Vector database
A database optimized for storing and searching high-dimensional embedding vectors via approximate nearest-neighbor algorithms.
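For intuition, the exact search that approximate nearest-neighbor indexes try to speed up can be sketched as a brute-force cosine-similarity ranking. The function names and toy vectors below are illustrative assumptions, and the vectors are assumed to be nonzero.

```python
import math

def cosine(a, b):
    """Cosine similarity between two nonzero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, vectors, k=1):
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(vectors.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Toy 2-dimensional "embeddings" (illustrative values only).
store = {"cat": [0.9, 0.1], "dog": [0.8, 0.3], "car": [0.1, 0.9]}
top2 = nearest([1.0, 0.0], store, k=2)
```

This scan compares the query against every stored vector, which is why real systems replace it with approximate indexes once the collection grows large.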
World model
An internal model that predicts the consequences of actions before they are executed, enabling planning and imagination.
Zero-shot / few-shot
Performing a task with zero or a few examples in the prompt, relying on the model's pre-trained knowledge.