28 AI Software Ecosystem

Knowing the theory of transformers, RL, and fine-tuning is necessary but not sufficient. To actually build things, you need to master the tools: the frameworks, libraries, and platforms that turn ideas into running code. This chapter is your guided tour through the software ecosystem of modern AI, from training frameworks to production serving to the small utilities that save hours of debugging.

Tools Change, Principles Endure

The specific tools in this chapter will evolve. Some may not exist in two years; others not yet created will become essential. (When the first edition of this book was being written, vLLM was six months old and SGLang barely existed.) The meta-skill is knowing how to evaluate tools: Does it have active maintenance? Good documentation? A community that answers questions? Does it solve your specific problem, or is it a general-purpose tool you would have to bend into shape? Learn the ecosystem, not just individual tools.

28.1 Training Frameworks

28.1.1 PyTorch: The Lingua Franca

PyTorch has won the framework wars. It is the dominant framework for AI research and increasingly for production, having displaced TensorFlow through a combination of Pythonic design, eager execution (you can debug with print statements!), and a research community that overwhelmingly adopted it.

If you are starting your AI journey, learn PyTorch. Not TensorFlow, not JAX, not some wrapper that hides the details. PyTorch. You will need to read research code, and research code is written in PyTorch.

The key components you will use daily:

torch.nn: Module-based building blocks. Your model is a tree of modules, each tracking its own parameters. This is the foundation of every model you will build.
torch.optim: Optimizers. AdamW with cosine learning rate decay is the default for most LLM training. If someone asks you what optimizer to use and you do not know the specifics, say “AdamW” and you will be right 90% of the time.
torch.cuda.amp: Automatic mixed precision. Wrap your training loop in autocast() and use GradScaler() for stable FP16 training. This roughly doubles your effective batch size for free.
torch.compile: JIT compilation introduced in PyTorch 2.0. Add one line (model = torch.compile(model)) and the compiler fuses operations into optimized CUDA kernels, yielding 30 to 200% speedups depending on your model.
torch.distributed: Multi-GPU training primitives. You probably will not use these directly (higher-level tools wrap them), but understanding all-reduce, broadcast, and gather helps when debugging distribution failures.

The Karpathy Path

Andrej Karpathy's “Neural Networks: Zero to Hero” YouTube series is the best way to learn PyTorch by building things. Start with “micrograd” (backpropagation from scratch in 100 lines), then “makemore” (character-level language models), then “Let's build GPT” (a full transformer). By the end, you will understand both PyTorch and transformer architecture at a level that most practitioners never reach.

28.1.2 The Hugging Face Ecosystem

Hugging Face has become the GitHub of AI: the central hub where models, datasets, and tools are shared. If PyTorch is the programming language, Hugging Face is the standard library.

Their libraries cover the entire ML workflow, and they are designed to work together:

transformers gives you access to thousands of pre-trained models with a unified API. Three lines of code to load any model, any tokenizer. This is where most people's AI journey begins in practice.

datasets handles data loading and processing with memory-efficient streaming (for datasets too large for RAM) and built-in preprocessing. It removes the most tedious part of ML engineering.

accelerate handles multi-GPU and multi-node training with minimal code changes. Write your training loop for one GPU; accelerate handles device placement, mixed precision, gradient accumulation, and distributed strategies. It is the bridge between “this works on my laptop” and “this runs on a cluster.”

peft implements parameter-efficient fine-tuning methods (Hu et al. 2021): LoRA, QLoRA, IA3, and other adapter techniques in a unified interface. This is what makes fine-tuning accessible on consumer hardware.

trl provides trainers for SFT, DPO, PPO, ORPO, and other RL-based fine-tuning approaches. It handles the considerable complexity of reinforcement learning from human feedback so you can focus on your data and evaluation rather than training infrastructure.

The 15-Minute Fine-Tune

With Hugging Face's ecosystem, you can go from “I have a dataset” to “I have a fine-tuned model” in about 15 minutes of setup. Load a base model with transformers, prepare your data with datasets, configure LoRA with peft, and train with trl's SFT trainer. The entire pipeline fits in a single notebook. Two years ago, this would have taken weeks of custom engineering.

28.1.3 Distributed Training: When One GPU Is Not Enough

Training models beyond 7B parameters requires distributing computation across multiple GPUs and often multiple machines. This is where things get interesting (and occasionally painful).

DeepSpeed (Microsoft) introduced ZeRO (Rajbhandari et al. 2020) (Zero Redundancy Optimizer), which partitions optimizer states, gradients, and model parameters across GPUs to reduce memory per device. ZeRO Stage 3 can train models that would never fit on a single GPU. DeepSpeed has become the standard for training models up to hundreds of billions of parameters.

Megatron-LM (NVIDIA) provides tensor parallelism (splitting individual layers across GPUs) and pipeline parallelism (assigning different layers to different GPUs). It is often combined with DeepSpeed for maximum efficiency. If you are training a model with more than 70B parameters, you are probably using some combination of these two.

torchtitan (Meta) is a relatively new reference implementation for distributed training of LLaMA-style models. It showcases current best practices and is a great learning resource even if you do not use it directly.

Axolotl takes the opposite approach: maximum simplicity. Configure your fine-tuning run with a YAML file (model, dataset, LoRA rank, learning rate) and Axolotl handles everything else. It supports LoRA, QLoRA, full fine-tuning, and DPO. If you want to fine-tune a model and do not want to write training code, Axolotl is your tool.

28.2 Inference: Getting Answers Out of Models

28.2.1 Local Inference: Your Computer, Your Models

Running models locally gives you complete privacy (your data never leaves your machine), zero API costs (after the initial hardware investment), and complete control over the deployment. The local inference ecosystem has matured remarkably quickly.

Ollama is the easiest path to local models. Install it, run ollama pull llama3 to download a model, run ollama run llama3 to chat with it. It provides an OpenAI-compatible API, so any application that works with ChatGPT can work with your local model by changing one URL. The entire experience takes about five minutes from zero to chatting.

llama.cpp is the engine behind most local inference. Written in C/C++ by Georgi Gerganov (one person!), it supports CPU, GPU, and Apple Silicon inference with remarkable efficiency. The GGUF format it introduced allows quantization from F16 down to 2-bit, letting you trade quality for speed and memory in a controlled way. Ollama, LM Studio, and many other tools are built on top of llama.cpp.

LM Studio provides a desktop GUI for browsing, downloading, and chatting with local models. It includes a built-in server mode and a visual interface for comparing model responses side by side. If you prefer clicking to typing commands, LM Studio is excellent.

MLX (Apple) is optimized for Apple Silicon's unified memory architecture. If you have a MacBook with an M1/M2/M3/M4 chip, MLX lets you use the full system memory for model weights, which means you can run surprisingly large models (up to 70B at low quantization) on hardware you carry in your backpack.

Starting with Local Models

If you have never run a local model, start with Ollama. Install it, run ollama run llama3.2, and have a conversation. Then try ollama run deepseek-r1:8b for a reasoning model. Then try ollama run phi3 for a small, fast model. Within an hour, you will have an intuitive sense of what different model sizes and families feel like. This is the fastest way to develop taste for models.

28.2.2 Production Serving: Handling Real Traffic

When you need to serve models to many users with low latency and high throughput, consumer inference tools are not enough. Production serving engines are optimized for concurrent requests, batching, and hardware utilization.

vLLM (Kwon et al. 2023) introduced PagedAttention, which manages the KV cache the way operating systems manage virtual memory: allocating and freeing memory in pages rather than contiguous blocks. Combined with continuous batching (new requests join a running batch without waiting for others to finish) and speculative decoding (using a small model to draft tokens that the large model verifies), vLLM has become the standard for self-hosted production serving.

SGLang is growing rapidly as a vLLM alternative. Its RadixAttention enables efficient prefix caching (reusing KV cache across requests that share a common prefix, like system prompts), and it has strong support for structured generation (constraining model output to valid JSON, SQL, or other formats).

TensorRT-LLM (NVIDIA) squeezes maximum performance from NVIDIA hardware with custom CUDA kernels, INT8/FP8 quantization, and inflight batching. If you are serving a model on NVIDIA GPUs and latency matters above all else, TensorRT-LLM provides the best raw throughput.

Text Generation Inference (TGI) from Hugging Face offers production-ready serving with token streaming, watermarking, and easy deployment. It integrates naturally with the rest of the Hugging Face ecosystem.

Choosing Your Inference Stack

For personal tinkering: Ollama. For Mac development: MLX. For production with moderate traffic: vLLM. For maximum throughput on NVIDIA hardware: TensorRT-LLM. For structured generation needs: SGLang. The right choice depends on your hardware, traffic patterns, and latency requirements. If you are unsure, start with vLLM; it has the broadest community support.

28.3 Application Frameworks: Building with LLMs

Inference engines give you a model that can generate text. Application frameworks help you build products on top of that capability.

28.3.1 LangChain: The Swiss Army Knife

LangChain (Chase 2023) provides abstractions for the common patterns in LLM applications: prompt templates, chains (sequences of LLM calls), agents (LLMs that decide which tools to call and in what order), memory (conversation history), and output parsers. It is the most widely used framework for LLM applications.

LangChain has been criticized for over-abstraction: wrapping simple API calls in layers of classes that make debugging harder. This criticism has merit, and the LangChain team has responded by simplifying the core library and introducing LangGraph, which models agent workflows as state machines with explicit state transitions. LangGraph is better suited to complex, multi-step agent workflows than the original chain abstraction.

28.3.2 LlamaIndex: The RAG Specialist

LlamaIndex (Liu 2023) specializes in connecting LLMs with your data. It handles the entire RAG pipeline: document ingestion (PDFs, databases, APIs, web pages), chunking (splitting documents into retrievable pieces), indexing (vector, tree, keyword, and hybrid), and query engines (combining retrieval with generation).

If LangChain is a general-purpose application framework, LlamaIndex is a domain-specific one. For RAG applications, LlamaIndex often provides a faster path to a working system because it makes opinionated choices about retrieval that you would have to implement yourself in LangChain.

28.3.3 DSPy: Programming, Not Prompting

DSPy (Khattab et al. 2023) takes a radically different approach from both LangChain and LlamaIndex. Instead of writing prompts (which are essentially natural language programs with no type system, no testing framework, and no compiler), you define signatures (input/output specifications like “question, context \(\rightarrow\) answer”) and modules (composable LLM operations). DSPy then automatically optimizes prompts and few-shot examples to maximize a metric you define.

The DSPy Paradigm Shift

Prompt engineering is manual hyperparameter tuning. You try a prompt, evaluate the output, tweak the wording, try again. DSPy automates this loop entirely. You specify what you want (“given a question and context, produce an answer with citations”) and DSPy figures out how to prompt the model to do it well. It generates and evaluates thousands of prompt variations automatically. This is a genuine paradigm shift: from crafting prompts to programming with LLMs. If you find prompt engineering tedious and brittle, DSPy is worth your time.

28.4 Experiment Tracking: Remembering What You Tried

ML experiments produce a combinatorial explosion of results. You train a model with learning rate \(3 \times 10^{-4}\), LoRA rank 16, batch size 32. Then you try rank 32. Then you change the dataset. Then you change the base model. Within a week, you have 40 runs and no idea which configuration produced the best result. Experiment tracking tools prevent this chaos.

Weights & Biases (W&B) is the most popular experiment tracker. It logs metrics, hyperparameters, model checkpoints, and artifacts automatically. Its dashboard lets you compare runs side by side, and its hyperparameter sweep tool automates search over configurations. It is free for academics and individual researchers, which is why it has become the default in research labs.

MLflow is the open-source alternative. It covers experiment tracking, model registry (versioning and staging models), and deployment. It is more self-hosted than W&B, which makes it attractive for organizations with data privacy requirements.

TensorBoard is the lightweight option: simple visualization of training metrics (loss curves, learning rate schedules, gradient norms) with no account required. It is built into PyTorch's SummaryWriter. For quick local experiments, TensorBoard is often all you need.

Start Tracking from Day One

The biggest experiment tracking mistake is not starting early enough. “I will add tracking later” means “I will lose my first twenty experiments.” Add W&B logging to your training script from the very first run. It takes five lines of code and will save you hours of frustration when you need to remember which hyperparameters produced your best result.

28.5 Vector Databases: Search Over Meaning

Vector databases store and retrieve high-dimensional embeddings, enabling similarity search: “find me documents that are about the same thing as this query, even if they do not share any words.” This is the core retrieval mechanism in RAG systems and the reason LLMs can answer questions about your documents.

FAISS (Meta) is the gold standard library for similarity search. It supports billions of vectors with approximate nearest neighbor algorithms (HNSW, IVF) and runs on both CPU and GPU. FAISS is a library, not a database: it does the math, but you manage storage, persistence, and metadata separately. Use it when you want maximum control and performance.

ChromaDB is the developer-friendly option: lightweight, runs in-process (no separate server), and perfect for prototyping. You can build a working RAG system with ChromaDB in under 50 lines of Python. Use it for prototypes and small-scale applications.

Pinecone is fully managed: you send vectors over an API, and Pinecone handles storage, indexing, scaling, and availability. Zero operational burden, but vendor lock-in and costs that grow with scale. Good for startups that want to ship fast without managing infrastructure.

Weaviate is open-source with a distinctive feature: built-in vectorization modules that can embed text and images automatically, so you can index documents without running a separate embedding model.

Milvus is the scalable open-source option for production workloads, with features like high availability, horizontal scaling, and hybrid search (combining vector similarity with traditional filtering).

Choosing a Vector Database

For prototyping: ChromaDB (simplest setup). For maximum performance and control: FAISS (library, not managed). For managed production: Pinecone (easiest ops, highest cost). For self-hosted production: Milvus or Weaviate. Start simple and scale up when the simple option hits its limits.

28.6 Exercises

The local model tour: Install Ollama. Pull three models of different sizes (e.g., phi3, llama3.2, deepseek-r1:14b). Ask each model the same ten questions (mixing factual, creative, coding, and reasoning tasks). Which model performs best overall? Where does the smallest model surprise you?
Your first RAG system: Build a RAG application using LlamaIndex over a collection of PDF documents (use your own class notes, research papers, or book chapters). Try two different chunking strategies (256 vs. 1024 tokens) and compare retrieval quality on ten test questions you write yourself.
DSPy vs. prompt engineering: Implementation a question-answering pipeline two ways: once with carefully hand-crafted prompts, once with DSPy's automatic optimization. Run both on the same 50 test questions and compare accuracy. Does the computer beat the human prompt engineer?
Serving benchmarks: Deploy a 7B model with vLLM and benchmark throughput at batch sizes of 1, 4, 16, and 64. Test at FP16 and INT4 quantization. Plot tokens per second vs. batch size for each configuration.
Experiment tracking: Set up Weights & Biases and log a fine-tuning run (even a short one with a small model). Create two runs with different hyperparameters (e.g., different learning rates or LoRA ranks). Use the W&B dashboard to compare them. Which hyperparameter mattered more?

References

Chase, Harrison. 2023. LangChain. GitHub. https://github.com/langchain-ai/langchain.

Hu, Edward J, Yelong Shen, Phillip Wallis, et al. 2021. “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv Preprint arXiv:2106.09685.

Khattab, Omar, Arnav Singhvi, Paridhi Maheshwari, et al. 2023. “DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines.” arXiv Preprint arXiv:2310.03714.

Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, et al. 2023. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” arXiv Preprint arXiv:2309.06180.

Liu, Jerry. 2023. LlamaIndex. GitHub. https://github.com/run-llama/llama_index.

Rajbhandari, Samyam, Jeff Rasley, Olatunji Rber, and Yuxiong He. 2020. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.” arXiv Preprint arXiv:1910.02054.