Appendix D - Glossary

Adversarial suffix: An optimized token sequence appended to a prompt to bypass a model's safety filters (e.g., the GCG attack).
Agent: An LLM-based system that can reason, plan, and take actions by calling external tools or APIs.
Alignment: The problem of ensuring an AI system's learned behavior matches human intentions and values.
Attention (self-attention): The core mechanism in transformers, allowing each token to attend to every other token in a sequence and learn contextual relationships.
Backpropagation: The algorithm for computing gradients of a loss function with respect to every weight in a neural network, enabling gradient-based training.
Byte Pair Encoding (BPE): A tokenization algorithm that builds its vocabulary by iteratively merging the most frequent pair of adjacent symbols (initially characters or bytes) into a new token.
Chain-of-thought (CoT): A prompting strategy that elicits step-by-step reasoning from an LLM before it produces a final answer.
Chinchilla scaling: The observation that, for a fixed compute budget, compute-optimal training scales model parameters and training tokens in roughly equal proportion (about 20 training tokens per parameter).
Continuous batching: An inference-serving technique where new requests can join a running batch dynamically, improving GPU utilization.
Corrigibility: An AI system's willingness to be corrected, modified, or shut down by its operators.
Data contamination: When benchmark test data leaks into a model's pre-training corpus, artificially inflating evaluation scores.
Decoder-only transformer: The architecture used in GPT-style models, generating tokens left-to-right with causal masking.
Diffusion model: A generative model that learns to reverse a gradual noising process, producing images, audio, or video from random noise.
Direct Preference Optimization (DPO): A simpler alternative to RLHF that optimizes the model directly on human preference pairs, without training a separate reward model.
Distillation: See Knowledge distillation.
Embedding: A dense, continuous vector representation of a discrete input (token, sentence, image) in a high-dimensional space.
Equivariance: A property where a function's output transforms predictably when its input undergoes a specific transformation (e.g., rotation).
Feature attribution: Methods (gradient-based, LIME, SHAP) that assign importance scores to individual inputs to explain a model's prediction.
Fine-tuning: Continuing training of a pre-trained model on a smaller, task-specific dataset to adapt it to a particular use case.
FlashAttention: A memory-efficient attention algorithm that avoids materializing the full \(N \times N\) attention matrix, reducing memory from \(O(N^2)\) to \(O(N)\).
Goodhart's Law: “When a measure becomes a target, it ceases to be a good measure.” A recurring concern in AI evaluation.
Graph Neural Network (GNN): A neural network that operates on graph-structured data, aggregating information from neighboring nodes via message passing.
Guardrail model: A secondary model that monitors and filters an LLM's inputs or outputs for safety and policy compliance.
Hallucination: When a model generates plausible-sounding but factually incorrect or unsupported content.
In-context learning: A model's ability to learn from examples provided in the prompt at inference time, without any weight updates.
Induction head: An attention-head circuit that implements in-context pattern matching and copying.
Invariance: A property where a function's output is unchanged under specific transformations of its input.
Jailbreaking: Prompt techniques designed to bypass an LLM's safety guardrails and elicit disallowed outputs.
JEPA (Joint Embedding Predictive Architecture): Yann LeCun's framework for self-supervised learning that predicts in embedding space rather than pixel or token space.
Knowledge distillation: Training a smaller “student” model to mimic a larger “teacher” model's outputs, transferring capability at reduced cost.
KV-cache: Cached key and value tensors for previously processed tokens (prompt and generated), avoiding redundant computation during autoregressive inference.
Latent diffusion: Running the diffusion process in a compressed latent space rather than raw pixel space, making generation much faster.
Latent space: The internal, compressed representation space learned by a neural network.
Linear probing: Training a simple linear classifier on a model's hidden states to test what information is encoded at each layer.
Logit lens: A technique that projects intermediate transformer hidden states to vocabulary space to visualize how predictions evolve layer by layer.
LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method that adds small trainable low-rank matrices alongside frozen model weights.
Mechanistic interpretability: The practice of reverse-engineering the specific algorithms and circuits implemented by a neural network's weights.
Message passing: The GNN paradigm where each node aggregates feature vectors from its neighbors to update its own representation.
Mixture of Experts (MoE): An architecture where a routing mechanism activates only a subset of parameters (“experts”) for each input, enabling larger models at lower per-token compute cost.
Multi-head attention: Running multiple attention computations in parallel, allowing the model to capture different types of relationships simultaneously.
PagedAttention: Managing KV-cache memory in fixed-size pages (like OS virtual memory), eliminating fragmentation during inference serving.
Perplexity: A standard metric for language models: the exponential of the average per-token negative log-likelihood on held-out text, interpretable as how surprised the model is by that text. Lower is better.
Polysemantic neuron: A neuron that activates for multiple, seemingly unrelated concepts, a consequence of superposition.
Post-training: The alignment stage after pre-training, typically including supervised fine-tuning, RLHF, and/or DPO.
Pre-training: The initial large-scale training phase where a model learns general language (or multimodal) representations from massive corpora.
Prompt injection: Manipulating an LLM's behavior by embedding adversarial instructions inside untrusted input data.
Pruning: Removing unimportant weights or entire structures (heads, layers) from a model to reduce its size and latency.
Quantization: Reducing the numerical precision of model weights (e.g., FP32 \(\to\) INT4), shrinking memory footprint and speeding up inference.
RAG (Retrieval-Augmented Generation): Grounding LLM responses by first retrieving relevant documents from an external knowledge base and including them in the prompt.
ReAct: An agent framework that interleaves reasoning traces (“thought”) with tool-calling actions (“act”) in an alternating loop.
Recursive self-improvement: A hypothetical scenario where an AI system improves its own capabilities in a continuous feedback loop.
Red teaming: Systematically probing an AI system for vulnerabilities, biases, or failure modes.
Representation engineering: Directly modifying a model's internal activations at inference time to steer its behavior (e.g., increasing honesty or reducing toxicity).
Reward model: A model trained on human preference data that scores LLM outputs, used as the reward signal in RLHF.
Scaling laws: Empirical power-law relationships between compute, dataset size, model parameters, and resulting performance.
Sparse autoencoder (SAE): A learned decomposition that extracts interpretable, monosemantic features from a model's polysemantic internal activations.
Speculative decoding: Using a small, fast “draft” model to propose candidate tokens that are then verified in parallel by the full model.
Superposition: When a neural network encodes more conceptual features than it has dimensions, overlapping them in a compressed representation.
Tokenization: The process of converting raw text into a sequence of discrete integer tokens for model input.
Transformer: The dominant neural network architecture for language and multimodal AI, based entirely on self-attention rather than recurrence or convolution.
Vector database: A database optimized for storing and searching high-dimensional embedding vectors via approximate nearest-neighbor algorithms.
World model: An internal model that predicts the consequences of actions before they are executed, enabling planning and imagination.
Zero-shot / few-shot: Performing a task with zero or a few examples in the prompt, relying on the model's pre-trained knowledge.
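
Worked example (Perplexity): The Perplexity entry can be made concrete with a small calculation. The sketch below assumes the model's per-token natural-log probabilities for a held-out text are already available as a list; the function name `perplexity` and the sample values are illustrative and do not come from any particular library.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the average negative log-likelihood per token.

    token_log_probs: natural-log probabilities the model assigned to each
    token of a held-out text (hypothetical input for this sketch).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that spreads probability uniformly over 4 choices at every step
# has perplexity of about 4: it is "as surprised" as a fair 4-way guess.
uniform = [math.log(0.25)] * 8

# A model that assigns probability 0.9 to each observed token is far less
# surprised, so its perplexity is much closer to 1.
confident = [math.log(0.9)] * 8

print(perplexity(uniform))
print(perplexity(confident))
```

A perplexity of about k can be read as the model being as uncertain as a uniform guess among k tokens at each step, which is why lower values indicate a better fit to the held-out text.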