8 Build Your Own AI

This is the chapter where we get our hands dirty. Everything we have discussed so far: transformers, tokenization, pre-training, fine-tuning, RLHF, and more, comes together here. By the end of this chapter, you will have a concrete roadmap for building, customizing, and deploying your own AI system, whether that means training a model from scratch, fine-tuning an existing one, or orchestrating a RAG pipeline over your own data.

Pick Your Adventure

There is no single “right way” to build AI. Your path depends on your resources, goals, and data. A solo developer with a laptop and a weekend has a very different optimal strategy than a startup with a cluster of A100s. This chapter covers all the paths, from the simplest (prompting an existing model) to the most ambitious (pre-training from scratch), so you can choose the one that fits.

8.1 Choosing Your Approach

There are four main paths to building your own AI, in order of increasing effort and control:

Prompt engineering + RAG. Use an existing model (via API or local) with retrieval-augmented generation (Lewis et al. 2020) to ground its answers in your own data. No weight updates needed. This is the fastest path from zero to a useful system.
Fine-tune an existing model. Start from an open-weight model such as LLaMA (Touvron et al. 2023) or Mistral (Jiang et al. 2023) and adapt it to your domain using LoRA (Hu et al. 2021) or QLoRA (Dettmers et al. 2023). A few thousand high-quality examples and a single GPU can produce impressive results.
Agentic wrappers. Combine a capable LLM with tool use, memory, and planning layers (see Chapter 4) to create an autonomous agent that can browse the web, write code, query databases, and take actions.
Train from scratch. Pre-train a transformer on a large corpus. This is the most resource-intensive path but gives full control over the model's knowledge and architecture (see Chapter 3).

The 80/20 Rule

For most real-world applications, option 1 (RAG) or option 2 (fine-tuning) will get you 80% of the way there with 20% of the effort. Training from scratch is only necessary when you need a model with fundamentally different knowledge or capabilities than any existing model provides. Start with the simplest approach that could work, and escalate only if needed.

8.2 Hardware: What Do You Actually Need?

One of the most common questions is: “What GPU do I need?” The answer depends entirely on what you are doing.

8.2.1 For Inference (Running Models)

CPU only: Thanks to llama.cpp and GGUF quantization, you can run 7B models on a modern laptop with no GPU at all. It will be slow (1 to 5 tokens per second), but it works.
Consumer GPU (8 to 24GB VRAM): An NVIDIA RTX 3090 or 4090 can run quantized models up to 70B parameters and serve them at interactive speeds. This is the sweet spot for individual developers.
Apple Silicon (M2/M3/M4): Apple's unified memory architecture lets you load surprisingly large models. An M4 Max with 128GB unified memory can run 70B models via MLX or llama.cpp at decent speeds.

8.2.2 For Fine-Tuning

Single GPU (24GB): QLoRA makes it possible to fine-tune 7B to 13B parameter models on a single RTX 4090.
Cloud GPUs: For larger models or full fine-tuning, cloud GPU instances (A100 80GB, H100) from providers like Lambda Labs, RunPod, Vast.ai, or major clouds (AWS, GCP) are the standard approach. Budget $1 to $3 per GPU-hour.

8.2.3 For Pre-Training

Pre-training requires significantly more compute. A 7B model trained on 1 trillion tokens requires approximately 150,000 GPU-hours on A100s. This is cloud-scale compute. For educational purposes, training a 125M model on a subset of data is perfectly feasible on a single GPU.

Free and Cheap GPU Access

Google Colab provides free access to T4 GPUs (enough for inference and small fine-tuning). Kaggle offers free P100 GPUs. For more serious work, Lambda Labs and Vast.ai offer competitive rates. Many universities provide GPU clusters for research. Do not let hardware be a barrier to getting started.

8.3 Step-by-Step: Building a RAG System

RAG (Retrieval-Augmented Generation) (Lewis et al. 2020) is the fastest path to a custom AI that knows about your data. The idea is simple: when the user asks a question, retrieve the most relevant documents from your knowledge base and include them in the LLM's context, so it can answer based on your specific data rather than just its training knowledge.

Ingest your data. Collect your documents: PDFs, Markdown files, web pages, code repositories, Notion exports, whatever you have. Use a library like LlamaIndex (Liu 2023) or LangChain (Chase 2023) to load and parse them.
Chunk the documents. Split documents into chunks of 256 to 1024 tokens. Overlap adjacent chunks by 50 to 100 tokens to preserve context across boundaries. The chunking strategy matters more than you might expect.
Embed the chunks. Pass each chunk through an embedding model (e.g., text-embedding-3-small from OpenAI, or the open-source bge-large from BAAI) to produce a dense vector representation.
Store in a vector database. Insert the embeddings into a vector store: ChromaDB (simple, local), FAISS (fast, from Meta), Pinecone (managed), or Weaviate (full-featured). Each chunk's embedding is stored alongside its original text.
Query and retrieve. When the user asks a question, embed the question using the same model, search the vector database for the $k$ most similar chunks (typically $k = 3$ to 10), and retrieve them.
Generate with context. Construct a prompt that includes the retrieved chunks and the user's question, and pass it to the LLM. The model generates an answer grounded in your specific data.

Common RAG Pitfalls

RAG is simple in concept but tricky in practice. Common failure modes include: chunks that are too small (losing context) or too large (diluting relevance), poor embedding models that do not capture domain-specific semantics, retrieving irrelevant chunks that confuse the model, and not including enough context for the model to answer accurately. Start with a simple setup, evaluate on real questions, and iterate.

8.4 Step-by-Step: Fine-Tuning a 7B Model

Fine-tuning adapts a pre-trained model to your specific domain or task. With QLoRA, this is now accessible on consumer hardware.

Choose a base model. Pick an open-weight model from HuggingFace. Good starting points include meta-llama/Llama-3.1-8B-Instruct or mistralai/Mistral-7B-Instruct-v0.3. Choose an instruction-tuned model if you want to fine-tune for a specific task; choose a base (non-instruct) model if you want to teach it a new format entirely.
Prepare your dataset. Format training data as instruction-response pairs. The most common formats are ChatML and ShareGPT JSON. Quality matters far more than quantity: 1,000 carefully curated examples often outperform 100,000 noisy ones.
Set up QLoRA. Use the peft library to configure LoRA adapters. Typical settings: rank 16 to 64, alpha 32 to 128, targeting the attention projection matrices (q_proj, k_proj, v_proj, o_proj). Load the base model in 4-bit quantization using bitsandbytes.
Train. Use the trl library's SFTTrainer for a clean training loop. Alternatively, Axolotl provides a YAML-based configuration that handles datasets, training, and evaluation out of the box. Monitor loss convergence with Weights & Biases.
Merge and export. After training, merge the LoRA weights back into the base model. Export to GGUF format (for llama.cpp / Ollama) or safetensors (for HuggingFace / vLLM).
Evaluate. Run the LM Evaluation Harness (Gao et al. 2024) on benchmarks relevant to your domain. Compare against the base model to measure improvement. Also test informally: chat with your model and see if it has learned what you wanted.

Axolotl: The Easy Button

If you want to fine-tune without writing much code, the Axolotl framework by OpenAccess-AI-Collective lets you configure an entire fine-tuning run in a single YAML file. It handles dataset loading, LoRA configuration, training, and evaluation. Many of the top models on the Open LLM Leaderboard were trained with Axolotl.

8.5 Step-by-Step: Pre-Training a Small Model

For educational purposes, training a small transformer from scratch is one of the most instructive exercises in AI. Nothing builds intuition like watching a model go from outputting random characters to generating coherent text.

Choose a corpus. Download a manageable dataset: a subset of FineWeb-Edu, The Pile (Gao et al. 2020), or even just the complete works of Shakespeare (for a character-level model). For a BPE-tokenized model, aim for at least a few billion tokens.
Tokenize. Train a BPE tokenizer using the tokenizers library, or reuse an existing tokenizer (GPT-2's tokenizer via tiktoken).
Define the architecture. Use a GPT-2-style decoder-only transformer. Start with 125M parameters (12 layers, 768 hidden dim, 12 heads). Add RoPE and RMSNorm for a modern touch.
Train. Write a training loop in PyTorch with mixed-precision training. Use the AdamW optimizer with cosine learning rate schedule. On a single A100, a 125M model can be trained on several billion tokens in a few days.
Generate and evaluate. At each checkpoint, generate sample text and watch the quality improve. Track loss curves. The progression from gibberish to coherent English is deeply satisfying.

The Karpathy Challenge

Andrej Karpathy's nanoGPT repository trains a GPT-2-scale model in about 300 lines of PyTorch. His build-nanogpt video walks through rebuilding GPT-2 from scratch and reproducing OpenAI's original results. If you complete this exercise, you will understand transformer training more deeply than 99% of people who use LLMs daily. It is worth the effort.

8.6 Deploying Your Model

Once your model is trained or fine-tuned, you need to serve it. Several options exist, from local to production-scale:

Ollama: The easiest path to local deployment. Convert your model to GGUF format, create an Ollama Modelfile, and run ollama create mymodel. You instantly get a local API endpoint. Ollama also powers many desktop chat applications.
vLLM (Kwon et al. 2023): For high-throughput production serving. vLLM uses PagedAttention for efficient KV-cache management, supports continuous batching, and provides an OpenAI-compatible API. It handles multiple concurrent users efficiently.
Building a chat UI: Gradio (by HuggingFace) lets you build a web-based chat interface in 10 lines of Python. Streamlit is another popular option. For a desktop experience, Open WebUI provides a polished ChatGPT-like interface that connects to Ollama or any OpenAI-compatible API.
Edge deployment: For mobile or embedded devices, quantize aggressively (4-bit or lower) and use frameworks like MLC-LLM or ExecuTorch. Apple's CoreML and Google's MediaPipe support on-device inference.

The Full Stack

A complete “build your own AI” stack might look like this: fine-tune a model with QLoRA, merge the weights, convert to GGUF, serve with Ollama, build a RAG pipeline with LlamaIndex over your documents, and wrap it all in a Gradio chat interface. This entire stack can run on a single machine with a consumer GPU, and you own every piece of it.

8.7 Exercises

Build a RAG chatbot over your own documents (course notes, a textbook, or personal knowledge base). Use LlamaIndex with ChromaDB and a local Ollama model. Evaluate it by asking 20 questions and scoring the answer quality.
Fine-tune LLaMA 3.1 8B on a custom dataset of your choice using QLoRA. Compare the model's outputs before and after fine-tuning on 10 test prompts.
Follow Karpathy's nanoGPT tutorial and train a character-level GPT on a text corpus of your choice. Generate samples at 5 checkpoints during training and document how the output quality evolves.
Deploy your fine-tuned model with Ollama and build a simple chat interface with Gradio. Share it with a friend and collect feedback on response quality.
Set up a complete pipeline: fine-tune a model, deploy with Ollama, add RAG with LlamaIndex, and build a Gradio frontend. Document the entire process and measure end-to-end response latency.

References

Chase, Harrison. 2023. LangChain. GitHub. https://github.com/langchain-ai/langchain.

Dettmers, Tim, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. “QLoRA: Efficient Finetuning of Quantized LLMs.” arXiv Preprint arXiv:2305.14314.

Gao, Leo, Stella Biderman, Sid Black, et al. 2020. “The Pile: An 800GB Dataset of Diverse Text for Language Modeling.” arXiv Preprint arXiv:2101.00027.

Gao, Leo, Jonathan Tow, Baber Abbasi, et al. 2024. A Framework for Few-Shot Language Model Evaluation. Version v0.4.3. Zenodo. https://doi.org/10.5281/zenodo.12608602.

Hu, Edward J, Yelong Shen, Phillip Wallis, et al. 2021. “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv Preprint arXiv:2106.09685.

Jiang, Albert Q, Alexandre Sablayrolles, Arthur Mensch, et al. 2023. “Mistral 7B.” arXiv Preprint arXiv:2310.06825.

Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, et al. 2023. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” arXiv Preprint arXiv:2309.06180.

Lewis, Patrick, Ethan Perez, Aleksandra Piktus, et al. 2020. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Advances in Neural Information Processing Systems 33: 9459-74.

Liu, Jerry. 2023. LlamaIndex. GitHub. https://github.com/run-llama/llama_index.

Touvron, Hugo, Thibaut Lavril, Gautier Izacard, et al. 2023. “LLaMA: Open and Efficient Foundation Language Models.” arXiv Preprint arXiv:2302.13971.