21 Advanced Projects

The projects in the previous chapter are fun and educational. The projects in this chapter are harder, more open-ended, and closer to the kind of systems that AI researchers and engineers build professionally. Each one will stretch your skills and potentially produce something genuinely novel.

21.1 Multi-Agent Collaboration System

Build a system where multiple AI agents collaborate to complete complex tasks, each specializing in a different role.

What you will learn: Agent architectures, inter-agent communication, task decomposition, error recovery, the challenges of coordinating autonomous systems.

The Architecture:

Planner Agent: Takes a high-level task description and breaks it into subtasks with dependencies.
Coder Agent: Receives a subtask specification and writes executable code.
Critic Agent: Reviews the code for correctness, style, and potential bugs.
Executor Agent: Runs the code in a sandboxed environment and reports results.
Orchestrator: Manages the workflow, routes messages between agents, and handles failures.
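
One way to make "subtasks with dependencies" concrete is to treat the Planner's output as a dependency graph and let the Orchestrator derive a valid execution order from it. The sketch below is illustrative only: the plan format (a dictionary mapping each subtask name to the set of subtasks it depends on) and the subtask names are assumptions, not part of any framework. It relies on the standard-library graphlib module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical Planner output: each subtask maps to the subtasks it depends on.
plan = {
    "fetch_stories": set(),
    "parse_stories": {"fetch_stories"},
    "save_json": {"parse_stories"},
    "write_tests": {"parse_stories"},
}


def execution_order(plan: dict[str, set[str]]) -> list[str]:
    """Return subtasks in an order that respects every dependency."""
    try:
        return list(TopologicalSorter(plan).static_order())
    except CycleError as err:
        # A cyclic plan cannot be executed; surface it so the Planner can retry.
        raise ValueError(f"plan contains a dependency cycle: {err.args[1]}") from err


order = execution_order(plan)
print(order)
```

Rejecting cyclic plans early gives the Orchestrator a cheap check on the Planner's output before any code is written.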

How to build it:

Start simple: implement just the Planner and Coder with a hardcoded orchestration loop.
Add the Critic: after the Coder generates code, the Critic reviews it and either approves or sends it back with feedback.
Add the Executor: run the approved code and feed results back to the Planner.
Add error recovery: when the Executor reports an error, the Planner diagnoses the issue and assigns a fix subtask.
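
The first two steps above can be sketched as a hardcoded loop. This is a minimal sketch under stated assumptions: each agent is modeled as a plain Python callable, the Review type and the stub agents are hypothetical stand-ins for real LLM calls, and the Executor and error recovery steps are left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Review:
    approved: bool
    feedback: str = ""


# Each agent is a plain callable so any LLM client can be plugged in later.
Planner = Callable[[str], list[str]]    # task -> subtask specifications
Coder = Callable[[str, str], str]       # (subtask, feedback) -> code
Critic = Callable[[str, str], Review]   # (subtask, code) -> review


def run_task(task: str, planner: Planner, coder: Coder, critic: Critic,
             max_rounds: int = 3) -> dict[str, str]:
    """Plan once, then loop Coder -> Critic on each subtask until approved."""
    results: dict[str, str] = {}
    for subtask in planner(task):
        feedback = ""
        for _ in range(max_rounds):
            code = coder(subtask, feedback)
            review = critic(subtask, code)
            if review.approved:
                results[subtask] = code
                break
            feedback = review.feedback  # send the critique back to the Coder
        else:
            raise RuntimeError(
                f"no approved code for {subtask!r} after {max_rounds} rounds")
    return results


# Stub agents standing in for LLM calls (hypothetical behavior for illustration).
def stub_planner(task: str) -> list[str]:
    return [f"{task}: step {i}" for i in (1, 2)]


def stub_coder(subtask: str, feedback: str) -> str:
    body = "return 0" if feedback else "pass"
    return f"def solve():\n    {body}\n"


def stub_critic(subtask: str, code: str) -> Review:
    if "return" in code:
        return Review(approved=True)
    return Review(approved=False, feedback="function must return a value")


approved = run_task("demo", stub_planner, stub_coder, stub_critic)
print(len(approved), "subtasks approved")
```

The max_rounds cap matters more than it looks: without it, a Coder and Critic that disagree can loop forever and burn through your API budget.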

Frameworks for Multi-Agent Systems

Several frameworks simplify building multi-agent systems. AutoGen (Wu et al. 2023), from Microsoft, provides a conversation-based framework where agents communicate by sending messages; CrewAI defines agents with roles, goals, and tools; LangGraph represents agent workflows as state machines. All three support tool use, memory, and human-in-the-loop interaction. Start with the framework whose core abstraction (conversations, roles, or state machines) best matches how you already think about the problem.

21.2 Build a Retrieval-Augmented Research Assistant

Build an AI assistant that can read, summarize, and synthesize information from research papers. Given a research question, it searches for relevant papers, reads them, extracts key findings, and produces a literature review.

What you will learn: Advanced RAG, multi-document summarization, citation handling, the challenge of synthesizing information across sources.

How to build it:

Integrate the Semantic Scholar API to search for papers by keyword and retrieve PDFs where an open-access copy is available.
Build a pipeline that extracts text from PDFs, chunks them intelligently (respecting section boundaries), and indexes them in a vector store.
Implement a multi-step query pipeline: (a) search for papers, (b) retrieve relevant chunks, (c) summarize each paper's contribution, (d) synthesize a coherent overview.
Add citation tracking: every claim in the output should be traceable to a specific paper and section.
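
Section-aware chunking and citation tracking fit together naturally: if every chunk records the paper and section it came from, every retrieved passage can be cited. The following is a minimal sketch, not a production parser. It assumes the PDF text has already been extracted, that headings are single lines of the form "2.3 Results", and that paper_id is whatever identifier your index uses; all names here are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    paper_id: str   # identifier of the source paper (format is an assumption)
    section: str    # heading the chunk was found under
    text: str


# Assumed heading convention: a single line like "1 Introduction" or "2.3 Results".
HEADING = re.compile(r"^\d+(?:\.\d+)*\s+\S.*$")


def chunk_by_section(paper_id: str, text: str, max_chars: int = 800) -> list[Chunk]:
    """Split at headings, then pack paragraphs into chunks within each section."""
    chunks: list[Chunk] = []
    section, buffer = "Front matter", ""

    def flush() -> None:
        nonlocal buffer
        if buffer.strip():
            chunks.append(Chunk(paper_id, section, buffer.strip()))
        buffer = ""

    for block in re.split(r"\n\s*\n", text):
        block = block.strip()
        if not block:
            continue
        if "\n" not in block and HEADING.match(block):
            flush()               # never let a chunk span two sections
            section = block
        elif buffer and len(buffer) + len(block) + 2 > max_chars:
            flush()               # current chunk is full; start a new one
            buffer = block
        else:
            buffer = f"{buffer}\n\n{block}" if buffer else block
    flush()
    return chunks


sample = """1 Introduction

LoRA freezes the base weights.

It trains low-rank updates.

2 Results

Quality matches full fine-tuning."""

for chunk in chunk_by_section("hu2021", sample):
    print(f"[{chunk.paper_id} | {chunk.section}] {chunk.text[:40]}")
```

Store the paper_id and section alongside each embedding in the vector store, and pass them through to the synthesis prompt so the model can attach them to each claim.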

Stretch goals: Add a “follow-up” capability that identifies gaps in the literature and suggests research questions. Add support for tables and figures using a multimodal model.

21.3 Fine-Tune a Domain-Specific Expert Model

Take a general-purpose 7B model and fine-tune it into a domain expert: a medical assistant, a legal advisor, a financial analyst, or a specialized coding assistant.

What you will learn: Data curation, QLoRA fine-tuning, evaluation methodology, the gap between “performs well on benchmarks” and “is actually useful in practice.”

How to build it:

Curate a dataset of domain-specific instruction-response pairs. Use a mix of existing datasets (e.g., PubMedQA for medicine, LegalBench for law) and synthetic data generated by a stronger model.
Fine-tune using QLoRA (4-bit base model + LoRA adapters). Use Axolotl, TRL, or the Hugging Face PEFT library.
Evaluate rigorously: use domain-specific benchmarks, blind human evaluation (have domain experts rate outputs), and adversarial testing (try to make the model give dangerously wrong advice).
Iterate: identify systematic failure modes, augment training data to address them, and re-fine-tune.
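
Blind human evaluation is easy to get wrong if raters can guess which model produced which answer. One possible setup, sketched below, randomizes the order of each answer pair and keeps the answer key separate from the sheet the rater sees. The function names and the data layout are assumptions made for illustration.

```python
import random
from collections import Counter


def build_blind_sheet(questions, base_answers, tuned_answers, seed=0):
    """Return (sheet, key): the sheet hides which model wrote which answer."""
    rng = random.Random(seed)
    sheet, key = [], []
    for question, base, tuned in zip(questions, base_answers, tuned_answers):
        pair = [("base", base), ("tuned", tuned)]
        rng.shuffle(pair)  # randomize which model appears as answer A
        sheet.append({"question": question, "A": pair[0][1], "B": pair[1][1]})
        key.append({"A": pair[0][0], "B": pair[1][0]})
    return sheet, key


def tally(key, votes):
    """Map rater votes ('A', 'B', or 'tie') back to model names and count wins."""
    counts = Counter()
    for entry, vote in zip(key, votes):
        counts[entry.get(vote, "tie")] += 1
    return counts


questions = ["What is a contraindication?", "Define half-life."]
sheet, key = build_blind_sheet(
    questions,
    base_answers=["base answer 1", "base answer 2"],
    tuned_answers=["tuned answer 1", "tuned answer 2"],
)
votes = ["A", "tie"]  # what a rater might enter after reading the sheet
print(tally(key, votes))
```

Hand the rater only the sheet. Keep the key out of sight until all votes are in.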

The Evaluation Trap

Fine-tuning is easy. Evaluating the fine-tuned model properly is hard. A model that scores well on multiple-choice medical questions may still give dangerous advice in open-ended conversations. Always evaluate in the format your users will actually interact with, not just the format that is easiest to grade automatically.

The Portfolio Effect

Each project in this chapter is designed to demonstrate a different dimension of AI engineering: multi-agent coordination, information synthesis, domain adaptation, multimodal processing, scientific reproduction, and real-world deployment. Completing even two or three of these projects gives you a portfolio that demonstrates breadth and depth far beyond what any course certificate can show. For job seekers and research applicants alike, demonstrated work beats claimed knowledge every time.

21.4 Build a Multimodal AI Application

Build a system that processes and generates multiple modalities: text, images, and audio.

Project idea: AI Podcast Generator. Given a research paper or blog post, the system: (1) summarizes the content, (2) generates a conversational script between two “hosts,” (3) synthesizes speech for each host using different voices, and (4) produces a podcast-style audio file, complete with intro music (generated by MusicGen).

What you will learn: Pipeline orchestration across modalities, text-to-speech, the challenges of maintaining quality when chaining multiple AI models.

Components:

LLM for summarization and script generation
TTS model for voice synthesis (e.g., Bark, Coqui TTS, OpenVoice)
MusicGen for intro/outro music
FFmpeg for audio mixing and assembly
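
The glue between these components is mostly bookkeeping: split the script into speaker turns, synthesize one clip per turn, and hand FFmpeg an ordered list of clips. The sketch below covers only that bookkeeping and calls neither a TTS model nor FFmpeg. The "HOST_A:" script format, the voice names, and the file names are assumptions; the FFmpeg arguments use its concat demuxer.

```python
import re
import shlex

# Hypothetical voice names; substitute whatever your TTS model exposes.
VOICES = {"HOST_A": "voice_1", "HOST_B": "voice_2"}
TURN = re.compile(r"^(HOST_[AB]):\s*(.+)$")


def parse_script(script: str) -> list[tuple[str, str]]:
    """Turn 'HOST_A: text' lines into (voice, text) pairs, skipping other lines."""
    turns = []
    for line in script.splitlines():
        match = TURN.match(line.strip())
        if match:
            turns.append((VOICES[match.group(1)], match.group(2)))
    return turns


def concat_list(clip_paths: list[str]) -> str:
    """Contents of the list file read by FFmpeg's concat demuxer.

    Paths containing single quotes would need escaping; this sketch assumes none.
    """
    return "".join(f"file '{path}'\n" for path in clip_paths)


def ffmpeg_concat_command(list_path: str, output_path: str) -> list[str]:
    """Arguments that join same-format clips without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output_path]


script = "HOST_A: Welcome to the show.\nHOST_B: Today we cover LoRA.\n(laughs)"
turns = parse_script(script)
clips = ["intro.wav"] + [f"turn_{i:03d}.wav" for i in range(len(turns))]
print(concat_list(clips))
print(shlex.join(ffmpeg_concat_command("clips.txt", "episode.wav")))
```

Stream copy ("-c copy") only works when every clip shares the same codec, sample rate, and channel layout. Clips produced by different models often do not, so plan on a resampling step before concatenation.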

21.5 Reproduce a Research Paper

Pick a recent paper that interests you, implement it from scratch, and reproduce the main results. This is simultaneously the most educational and the most humbling project on this list.

What you will learn: The gap between reading a paper and implementing it, the importance of details that papers omit, debugging ML systems, and a deep understanding of the chosen technique.

Recommended papers for reproduction:

LoRA (Hu et al. 2021): Relatively straightforward to implement and test.
Retrieval-Augmented Generation (Lewis et al. 2020): Teaches both retrieval and generation.
GPTQ (Frantar et al. 2022): Teaches quantization from first principles.
Sparse autoencoders for interpretability (Bricken et al. 2023): Implement the approach from Anthropic's “Towards Monosemanticity.”

How to approach it:

Read the paper thoroughly (three-pass method from Chapter 13a).
If the authors released code, resist the urge to look at it until you have tried your own implementation.
Start with the simplest possible version and verify it works before adding complexity.
Document every discrepancy between the paper and your implementation. These discrepancies are where the real learning happens.
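
To illustrate what "the simplest possible version" can look like, here is a dependency-free sketch of a LoRA forward pass (Hu et al. 2021) with two sanity checks. It is a toy under stated assumptions, not a training implementation: matrices are nested Python lists, the dimensions are tiny, and the pretrained weight is a random stand-in. The formula h = Wx + (alpha / r) BAx and the initialization (Gaussian A, zero B) follow the paper.

```python
import random

Matrix = list[list[float]]


def linear(x: Matrix, weight: Matrix) -> Matrix:
    """Compute x @ weight^T, where weight has shape (out_features, in_features)."""
    return [[sum(xi * wi for xi, wi in zip(row, w_row)) for w_row in weight]
            for row in x]


def lora_forward(x: Matrix, W: Matrix, A: Matrix, B: Matrix,
                 alpha: float, r: int) -> Matrix:
    """h = x W^T + (alpha / r) * x A^T B^T; W stays frozen, A and B are trained."""
    base = linear(x, W)                # frozen path, shape (batch, d_out)
    update = linear(linear(x, A), B)   # low-rank path through rank r
    scale = alpha / r
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]


def gaussian(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_in, d_out, r, alpha = 8, 6, 2, 4.0
W = gaussian(d_out, d_in, rng)           # pretrained weight (random stand-in)
A = gaussian(r, d_in, rng)               # A starts as small Gaussian noise
B = [[0.0] * r for _ in range(d_out)]    # B starts at zero, so B A = 0
x = gaussian(3, d_in, rng)

# Sanity check 1: at initialization the adapted layer equals the base layer.
assert lora_forward(x, W, A, B, alpha, r) == linear(x, W)

# Sanity check 2: the adapter adds r * (d_in + d_out) trainable parameters.
trainable = r * (d_in + d_out)
print(f"trainable: {trainable}, frozen: {d_in * d_out}")
```

At these toy dimensions the adapter is not much smaller than the frozen weight; the savings only become dramatic at real model sizes, where d_in and d_out are in the thousands and r stays small.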

Why Reproduction Matters

Some of the best researchers in the field got their start by reproducing papers. Karpathy's nanoGPT is a from-scratch reproduction of GPT-2. Neel Nanda built TransformerLens for reproducing interpretability experiments. The act of reproduction forces you to understand every detail, and often reveals that the paper's description is incomplete or slightly wrong. That discovery is itself valuable knowledge.

The Reproduction Mindset

Reproducing a paper is the single best way to transition from “I understand AI conceptually” to “I can build AI systems.” The gap between reading a paper and implementing it is enormous: papers omit details, use ambiguous notation, and sometimes contain errors. Discovering these gaps is not frustrating; it is the entire point. Every experienced ML engineer has stories about the paper that took three weeks to reproduce because of one undocumented hyperparameter. Those stories are how expertise is built.

21.6 Build an AI Agent for a Real-World Task

Build an agent that accomplishes a genuinely useful real-world task: managing your email, organizing your files, monitoring news in a specific domain, or automating a repetitive workflow.

What you will learn: Tool integration, real-world data handling, error recovery, the difference between demos and production systems.

Key challenges:

Real-world data is messy, inconsistent, and full of edge cases.
APIs fail, rate-limit, and change their interfaces.
Users (including you) provide ambiguous instructions.
Safety: an agent with access to your email or file system can cause real damage if it misunderstands a command.
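
For the API failures mentioned above, a common mitigation is to retry with exponential backoff and jitter. The sketch below shows one way to do it. TransientAPIError is a hypothetical exception type: in a real agent you would map your client's timeout and rate-limit errors onto it. The sleep function is injectable so the logic can be tested without waiting.

```python
import random
import time


class TransientAPIError(Exception):
    """Hypothetical error type for failures worth retrying (timeouts, rate limits)."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientAPIError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out retries


# Usage with a fake endpoint that fails twice before succeeding.
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientAPIError("rate limited")
    return {"status": "ok"}


print(call_with_backoff(flaky_fetch, sleep=lambda seconds: None))
```

Only retry errors you know to be transient. Retrying a malformed request or an authorization failure wastes time and can trigger stricter rate limits.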

Recommended constraints: Start with read-only access (the agent can read your emails and draft responses, but you must approve before sending). Add write access only after you trust the system's judgment on at least 50 consecutive actions.
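
The constraint above can be enforced in code rather than by discipline. A minimal sketch, assuming a hypothetical ApprovalGate class of our own design: read actions run freely, write actions wait for human approval, and the agent becomes autonomous only after an unbroken streak of approvals. A single rejection resets the streak.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ApprovalGate:
    """Holds write actions for human approval until a trust threshold is met."""
    required_streak: int = 50
    streak: int = 0

    @property
    def autonomous(self) -> bool:
        return self.streak >= self.required_streak

    def run(self, description: str, action: Callable[[], object],
            is_write: bool, ask_human: Callable[[str], bool]):
        """Run read actions freely; gate write actions until trust is earned."""
        if not is_write or self.autonomous:
            return action()
        approved = ask_human(description)
        # Any rejection resets the count: trust requires consecutive approvals.
        self.streak = self.streak + 1 if approved else 0
        return action() if approved else None


# Usage with a low threshold and an always-approving human, for illustration.
gate = ApprovalGate(required_streak=2)
sent = []
for i in range(3):
    gate.run(f"send draft {i}", lambda i=i: sent.append(i), is_write=True,
             ask_human=lambda description: True)
print(sent, gate.autonomous)
```

In a real deployment, ask_human would show the drafted email or file operation and wait for an explicit confirmation.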

21.7 Exercises

Build the multi-agent coding system (Planner + Coder + Critic). Give it a task like “write a Python script that downloads the top 10 stories from Hacker News and saves them to a JSON file.” Does the system produce working code? How many iterations does it take?
Fine-tune a 7B model on a domain you know well. Create a test set of 50 questions that a domain expert would ask. Have the fine-tuned model and the base model both answer the questions. Blind-evaluate the results (or have a colleague evaluate). By how much does fine-tuning improve domain performance?
Reproduce one of the recommended papers listed above. Write a blog post documenting your experience: what was harder than expected, what the paper left out, and what you learned.
Build the AI podcast generator. Generate a 5-minute podcast episode from a paper of your choice. Play it for someone who has not read the paper. Can they understand the key ideas?

References

Bricken, Trenton, Adly Templeton, Joshua Batson, et al. 2023. “Towards Monosemanticity: Decomposing Language Models with Dictionary Learning.” Anthropic.

Frantar, Elias, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. “GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers.” arXiv Preprint arXiv:2210.17323.

Hu, Edward J, Yelong Shen, Phillip Wallis, et al. 2021. “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv Preprint arXiv:2106.09685.

Lewis, Patrick, Ethan Perez, Aleksandra Piktus, et al. 2020. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Advances in Neural Information Processing Systems 33: 9459-74.

Wu, Qingyun, Gagan Bansal, Jieyu Zhang, et al. 2023. “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation.” arXiv Preprint arXiv:2308.08155.