7 Agentic Systems
Andrej Karpathy memorably coined the term “LLM Psychology” - the art and science of coaxing the right behavior out of language models. But agents go far beyond psychology. An agent is anything that perceives its environment through sensors and acts upon that environment through actuators. In the context of modern AI, an intelligent agent is an LLM that has been given tools - functions it can call to interact with the outside world, the digital space, or both.
This chapter walks through the complete engineering stack for building agentic systems: from the inference engines that power them, through the memory and retrieval layers that ground them, to the orchestration frameworks that coordinate them. Along the way, we will meet agents that hack into computer networks, run scientific experiments, and swarm together to solve problems no single model could tackle alone.
The term “agentic” distinguishes systems that actively take actions in the world - calling APIs, writing files, browsing the web, running code - from passive chatbots that merely generate text in response to prompts. If a system can observe, decide, and act in a loop, it is agentic.
7.1 Inference Engines
Before you can build an agent, you need a way to run the underlying language model. Two open-source inference engines have become the backbone of the agentic ecosystem.
Ollama is designed for local, single-machine inference. It wraps quantized models in a Docker-like interface - you simply ollama run llama3 and get a local API endpoint. Ollama is ideal for prototyping, privacy-sensitive applications, and running agents on laptops without sending data to external servers.
vLLM (Kwon et al. 2023) is a high-throughput inference engine built for production. Its key innovation, PagedAttention, manages the KV-cache like virtual memory pages in an operating system, dramatically reducing memory waste and enabling higher batch sizes. If you are serving agents at scale - handling hundreds of concurrent requests - vLLM is the standard choice.
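The paging idea can be illustrated with a toy allocator. This is only a sketch of block-based bookkeeping under assumed names (`BlockAllocator`, `BLOCK_SIZE`, `append_token`, and `release` are hypothetical); it is not vLLM's implementation and stores no actual key/value tensors.

```python
# Toy illustration of paged KV-cache bookkeeping (not vLLM's actual code).
BLOCK_SIZE = 16  # tokens per block; an assumed value for illustration


class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token of a sequence."""
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # last block is full, or none allocated yet
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


allocator = BlockAllocator(num_blocks=8)
for _ in range(20):  # a 20-token sequence needs ceil(20 / 16) = 2 blocks
    allocator.append_token("request-a")
blocks_used = len(allocator.block_tables["request-a"])
```

Because blocks are handed out on demand, a sequence wastes at most one partially filled block, and released blocks are immediately reusable by other requests.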
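For local experimentation and privacy-first workflows, start with Ollama. For production deployments with multiple users, use vLLM. Many developers prototype with Ollama and deploy with vLLM. The two engines favor different weight formats (Ollama runs GGUF files, while vLLM primarily loads Safetensors checkpoints), but popular open-weight models are usually published in both, and both engines can expose an OpenAI-compatible HTTP API, so agent code typically carries over with little change.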
7.2 Vector Databases
Vector databases store high-dimensional embeddings of text, images, audio, or any other data. But why store embeddings rather than raw data? Because language models understand meaning through embeddings. In high-dimensional space, the cosine similarity between the embeddings of “water” and “wet” will be close to 1, while the cosine similarity between “water” and “fire” will be noticeably lower. The exact values depend on the embedding model, and real models rarely push loosely related words all the way to 0.
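Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it for three hand-made, three-dimensional vectors; the numbers are invented for illustration and do not come from any real embedding model.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy "embeddings"; real ones have hundreds or thousands of dimensions.
water = [0.9, 0.8, 0.1]
wet = [0.8, 0.9, 0.2]
fire = [0.1, 0.2, 0.9]

similarity_wet = cosine_similarity(water, wet)    # roughly 0.99
similarity_fire = cosine_similarity(water, fire)  # roughly 0.30
```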
This has profound practical implications. If you have fine-tuned a model on proprietary data, or you want to ground its responses in a corporate knowledge base, you cannot simply dump an entire database into the prompt. Instead, you convert both the user's query and your stored documents into embeddings, find the nearest neighbors, and inject only the most relevant passages into the context window. This keeps the prompt short and, unlike plain keyword search, matches passages by meaning even when they share few exact words with the query.
Popular vector databases include Pinecone, Weaviate, Milvus, Qdrant, and ChromaDB. Each offers slightly different trade-offs in terms of scalability, hosting options, and integration with popular LLM frameworks.
The quality of your vector database depends entirely on the quality of your embedding model. Models like OpenAI's text-embedding-3-large, Cohere's embed-v3, and open-source options like bge-large or nomic-embed-text all produce different embedding spaces. Choosing the right one for your domain is as important as choosing the right LLM.
7.3 RAG - Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) (Lewis et al. 2020) is one of the most impactful engineering patterns to emerge from the LLM era. The idea is elegant:
- Embed the user query: The user's question is converted into an embedding vector.
- Retrieve relevant context: A nearest-neighbor search in a vector database retrieves the text chunks most semantically similar to the query.
- Inject context: The retrieved text, along with the original user question, is injected into the LLM's prompt as context.
- Generate a grounded response: The language model generates its answer conditioned on both the question and the retrieved evidence.
RAG is powerful because it can substantially reduce hallucinations - the model does not have to “guess” missing knowledge; it retrieves it. It does not eliminate them, since the model can still misread or ignore the retrieved passages. The vector database can be updated with fresh data periodically, keeping the system current without expensive retraining. And it scales naturally: instead of fine-tuning the model (which is expensive and done infrequently), you simply update the database.
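The first three RAG steps can be sketched in a few lines. As a loud assumption, the sketch replaces the neural embedding model with bag-of-words counts and the vector database with a Python list; `embed`, `retrieve`, `build_prompt`, and the sample documents are hypothetical. The final step, generation, would pass the prompt to an LLM and is omitted.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts (a real system uses a neural model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


DOCUMENTS = [
    "vLLM uses PagedAttention to manage the KV-cache",
    "Ollama runs quantized models on a single machine",
    "SearXNG is a self-hosted meta-search engine",
]
INDEX = [(doc, embed(doc)) for doc in DOCUMENTS]  # stand-in for a vector database


def retrieve(query: str, k: int = 1) -> list:
    """Steps 1 and 2: embed the query, return the k most similar chunks."""
    query_vector = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query: str) -> str:
    """Step 3: inject the retrieved context next to the question."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


prompt = build_prompt("how does vLLM manage the KV-cache")
```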
RAG has become the default architecture for enterprise AI systems. Legal firms use it to search case law. Healthcare companies use it to query medical literature. Software teams use it to search documentation and codebases. For a comprehensive survey, see (Fan et al. 2024).
7.4 Web Search
When ChatGPT was first released, it could not search the web. It could not upload PDFs or Word documents. It was a closed system, limited to whatever it had seen during training.
So how does it search the web now? As part of its token output, the model emits special search query tokens. The inference engine monitors the output stream for these tokens - for example, something like <search_start>...<search_end> - and intercepts them. The enclosed query is executed against a search engine, the results are fetched and injected back into the model's context, and generation resumes with the new information available.
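A minimal sketch of the interception step follows. It reuses the illustrative `<search_start>...<search_end>` markers from the text; `fake_search`, `intercept_search`, and the `<search_results>` wrapper are hypothetical, and a real engine would watch the token stream incrementally rather than a finished string.

```python
import re

SEARCH_PATTERN = re.compile(r"<search_start>(.*?)<search_end>", re.DOTALL)


def fake_search(query: str) -> str:
    """Hypothetical backend; a real one would query a search engine."""
    return f"[results for: {query}]"


def intercept_search(model_output: str, search=fake_search) -> str:
    """Replace each search request with the fetched results."""
    def run(match):
        query = match.group(1).strip()
        return f"<search_results>{search(query)}</search_results>"

    return SEARCH_PATTERN.sub(run, model_output)


draft = "Let me check. <search_start> weather in Paris today <search_end>"
context = intercept_search(draft)
```

The returned string is what would be placed back in the model's context before generation resumes.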
This pattern generalizes beyond web search. Tools like SearXNG (a privacy-respecting meta-search engine) can be integrated as the backend, giving the agent access to multiple search engines simultaneously without tracking.
If you want to give your local agent web search capabilities, SearXNG, Tavily, and Serper all provide search endpoints that an agent can call. Tavily and Serper are hosted, proprietary search-as-a-service APIs; SearXNG is open source and can be self-hosted for complete privacy.
7.5 Tools and Function Calling
Tool use is what transforms a language model from a text generator into an agent. The mechanism is surprisingly simple: the model is trained (or prompted) to emit structured outputs - often JSON - that describe a function call. A parser in the inference engine intercepts these outputs, executes the corresponding function, and feeds the result back into the model's context alongside the original user query.
For example, a user asks “What is the square root of 1764?” The model, instead of trying to compute it via token generation (which is unreliable), emits a structured call that names a calculator tool and passes the expression as its argument.
The engine runs the calculator, obtains \(42\), and feeds \(42\) back to the model, which then responds: “The square root of 1764 is 42.”
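The round trip can be sketched as follows. The JSON shape (`tool`, `arguments`), the `calculator` function, and `dispatch` are assumptions made for illustration; real providers each define their own tool-call schema.

```python
import json
import math


def calculator(operation: str, value: float) -> float:
    """Hypothetical tool supporting a single operation."""
    if operation == "sqrt":
        return math.sqrt(value)
    raise ValueError(f"unsupported operation: {operation}")


TOOLS = {"calculator": calculator}


def dispatch(model_output: str) -> str:
    """Parse a JSON tool call, run the tool, return the result as text."""
    call = json.loads(model_output)
    tool = TOOLS[call["tool"]]
    result = tool(**call["arguments"])
    return json.dumps({"tool": call["tool"], "result": result})


# What the model might emit instead of answering directly (assumed format).
model_output = '{"tool": "calculator", "arguments": {"operation": "sqrt", "value": 1764}}'
tool_message = dispatch(model_output)  # fed back into the model's context
```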
Tool use can be reactive (triggered by the model's output tokens during generation) or proactive (the orchestration layer pre-determines which tools are available and instructs the model to use them when relevant).
Anthropic's Model Context Protocol (MCP) is an emerging open standard for connecting LLMs to external tools and data sources. Think of it as a USB-C for AI - a universal interface that allows any model to connect to any tool through a standardized protocol, rather than requiring custom integrations for each tool-model pair.
7.6 Automatic Operation of Computers
One of the most exciting frontiers in agentic AI is agents that can operate computers the same way humans do - clicking buttons, typing text, navigating menus, and browsing the web through a graphical user interface.
Browser Use (Müller and Žunič 2024) enables LLMs to control web browsers programmatically: navigating to URLs, filling forms, clicking elements, and extracting information from rendered web pages. UFO (Zhang et al. 2024) by Microsoft extends this to desktop applications - it can control any Windows application by understanding screenshots, identifying UI elements, and generating mouse and keyboard actions.
Anthropic's Claude Computer Use pushes this further, enabling the model to operate an entire desktop environment autonomously: opening applications, switching between windows, copying data between programs, and completing multi-step workflows that span multiple applications.
The shift from API-based tool use to GUI-based computer operation is profound. API agents can only interact with software that exposes an API. GUI agents can interact with any software designed for humans - legacy enterprise software, desktop applications, proprietary tools with no API. This makes AI accessible in environments where API integration is impossible or prohibitively expensive.
7.7 Agents: The Core Abstraction
An agent is an LLM augmented with the ability to perceive its environment, reason about goals, and take actions. While a vanilla LLM simply generates text in response to a prompt, an agent uses the LLM as a “brain” within a loop: observe \(\to\) think \(\to\) act \(\to\) observe (Yao, Zhao, et al. 2023).
The ReAct (Reasoning + Acting) paradigm (Yao, Zhao, et al. 2023) interleaves chain-of-thought reasoning with tool calls. The LLM generates a thought (“I need to search for the population of France”), then an action (calling a search tool), then observes the result, and continues reasoning. This tight coupling of thinking and doing is what makes agents qualitatively different from chatbots.
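A ReAct-style loop can be sketched with a scripted stand-in for the model. Everything here is hypothetical: `scripted_llm` returns canned steps instead of calling an LLM, and `search_tool` reads from a one-entry lookup table with a placeholder answer.

```python
def search_tool(query: str) -> str:
    """Hypothetical tool backed by a lookup table, not a real search engine."""
    return {"population of France": "about 68 million"}.get(query, "no result")


def scripted_llm(transcript: list) -> dict:
    """Stand-in for a model: chooses the next step from the transcript so far."""
    if not any(line.startswith("Observation:") for line in transcript):
        return {"thought": "I need to search for the population of France",
                "action": "search", "input": "population of France"}
    observation = transcript[-1].removeprefix("Observation: ")
    return {"thought": "I have what I need", "action": "finish",
            "input": f"France has {observation} people."}


def react(question: str, max_steps: int = 5) -> str:
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = scripted_llm(transcript)                  # think
        transcript.append(f"Thought: {step['thought']}")
        if step["action"] == "finish":
            return step["input"]
        transcript.append(f"Action: search[{step['input']}]")              # act
        transcript.append(f"Observation: {search_tool(step['input'])}")    # observe
    return "step limit reached"


answer = react("What is the population of France?")
```

The `max_steps` cap is a design choice worth keeping in any real loop, since a model can otherwise keep acting indefinitely.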
The key components of an LLM agent are:
Planning. Complex goals must be decomposed into manageable sub-tasks. Techniques like chain-of-thought prompting (Wei et al. 2022), tree of thoughts (Yao, Yu, et al. 2023), and hierarchical task decomposition allow agents to break down problems like “build me a web application” into sequences of concrete steps.
Memory. Agents need both short-term memory (the conversation context and recent tool outputs) and long-term memory (vector database retrieval of past interactions, domain knowledge, and learned preferences). Without memory, agents are goldfish - they forget everything between sessions.
Tool use. The ability to call external APIs, run code, search the web, read and write files, and interact with databases (Schick et al. 2023). Tools are the agent's hands.
Reflection. Self-evaluation and iterative refinement. The Reflexion framework (Shinn et al. 2023) enables agents to review their own outputs, identify errors, and retry with improved strategies. This is analogous to a human proofreading their own work.
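The reflection component can be sketched as a retry loop in which critiques from failed attempts are carried forward. This is a simplified illustration in the spirit of Reflexion, not the framework's own code; `attempt` is a scripted stand-in for the model, and `evaluate` is a hypothetical checker.

```python
from typing import Optional


def attempt(task: str, feedback: list) -> str:
    """Stand-in for the actor model: it improves once it has seen feedback."""
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"  # deliberately wrong first try


def evaluate(code: str) -> Optional[str]:
    """Return a critique, or None when the attempt passes the check."""
    namespace = {}
    exec(code, namespace)
    if namespace["add"](2, 3) != 5:
        return "add(2, 3) should be 5; check the operator"
    return None


def reflect_and_retry(task: str, max_attempts: int = 3):
    feedback = []
    candidate = ""
    for attempt_number in range(1, max_attempts + 1):
        candidate = attempt(task, feedback)
        critique = evaluate(candidate)
        if critique is None:
            return candidate, attempt_number
        feedback.append(critique)  # the reflection is kept for the next attempt
    return candidate, max_attempts


solution, attempts_used = reflect_and_retry("write add(a, b)")
```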
Think of an agent as having a cognitive architecture analogous to the human mind: the LLM is the “thinking” module (System 2), the tools are the “motor” system, the memory is the “hippocampus,” and the reflection module is the “inner critic.” Understanding this analogy helps when designing agent systems - every component must work together coherently.
7.8 Multi-Agent Systems
When a single agent is not enough, you bring in a team. Multi-agent systems involve multiple specialized agents collaborating on a task (Hong et al. 2023). Each agent has a distinct role - coder, reviewer, tester, project manager - and they communicate through structured messages, much like a human team communicating via Slack or email.
MetaGPT (Hong et al. 2023) is perhaps the most elegant example. It assigns agents roles following a real software development workflow: product manager \(\to\) architect \(\to\) developer \(\to\) QA engineer. Each agent produces artifacts (requirements documents, architecture diagrams, code, test reports) that are consumed by the next agent in the pipeline. The result is a system that can go from a one-line product description to a working codebase.
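The handoff structure of such a pipeline can be sketched with plain functions. This is not MetaGPT's API: each role below is a hypothetical stub that would wrap an LLM call in practice, and the artifact is just a dictionary that grows as it moves down the line.

```python
def product_manager(artifact: dict) -> dict:
    return {**artifact, "requirements": f"requirements for {artifact['idea']}"}


def architect(artifact: dict) -> dict:
    return {**artifact, "design": f"design based on {artifact['requirements']}"}


def developer(artifact: dict) -> dict:
    return {**artifact, "code": f"code implementing {artifact['design']}"}


def qa_engineer(artifact: dict) -> dict:
    return {**artifact, "report": f"test report for {artifact['code']}"}


PIPELINE = [product_manager, architect, developer, qa_engineer]


def run_pipeline(idea: str) -> dict:
    """Pass the artifact through each role in order; each adds its output."""
    artifact = {"idea": idea}
    for role in PIPELINE:
        artifact = role(artifact)
    return artifact


result = run_pipeline("a to-do list app")
```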
AutoGPT (Significant Gravitas 2023) takes a different approach: a single autonomous agent that decomposes goals, creates plans, and executes tasks with minimal human intervention. It was one of the first viral demonstrations of agentic AI, showing that an LLM could use the internet, write files, and pursue long-range goals autonomously - though it also showed the limitations of early agent architectures (infinite loops, hallucinated actions, context window overflow).
BabyAGI (Nakajima 2023) simplified the concept to its core: a task-driven agent that creates, prioritizes, and executes tasks in a loop. Its simplicity made it an excellent starting point for understanding agent architectures.
HuggingGPT (Shen et al. 2023) introduced a meta-agent pattern: an LLM controller that decomposes complex AI tasks and dispatches subtasks to specialized models on Hugging Face. Need to segment an image, caption it, and translate the caption? HuggingGPT figures out which models to call and in what order.
For a comprehensive survey of the multi-agent landscape, see (L. Wang et al. 2024) and (Wang et al. 2025).
A sequential pipeline (like MetaGPT) works best when the task has a natural workflow with clear handoff points. An autonomous loop (like AutoGPT) works for open-ended exploration. A manager-worker pattern works when a “boss” agent can decompose work and delegate to specialists. There is no one-size-fits-all architecture - match the architecture to the problem.
7.9 Agentic Swarms
What happens when you scale multi-agent systems from a handful of agents to hundreds or thousands? You get agentic swarms - large collections of lightweight agents that self-organize, communicate, and collaborate to solve problems that no individual agent could handle.
The inspiration comes from nature: ant colonies, bee swarms, and flocks of birds all accomplish complex collective behavior through simple local interactions. Each individual follows basic rules, but the emergent behavior of the collective is sophisticated and adaptive. Agentic swarms apply the same principle to AI: each agent is simple (perhaps a small model with one or two tools), but the swarm as a whole can tackle complex, multi-faceted problems.
OpenAI's Swarm framework provides a lightweight, ergonomic way to build multi-agent systems. Unlike heavy orchestration frameworks, Swarm focuses on simplicity: an agent is little more than a set of instructions (a system prompt) plus a list of Python functions it can call as tools. An agent hands off to another agent when one of its functions returns that agent, enabling dynamic, context-dependent routing. The framework intentionally avoids persistent state or complex coordination protocols, making it easy to reason about and debug.
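The handoff mechanism can be imitated in plain Python. The sketch below does not import or reproduce the Swarm library; the `Agent` dataclass, `run_turn`, and the two example agents are hypothetical stand-ins that mimic the handoff-by-return idea.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    instructions: str
    functions: list = field(default_factory=list)


def transfer_to_billing():
    """A tool whose return value is another agent, which signals a handoff."""
    return billing_agent


billing_agent = Agent("Billing", "Handle refunds and invoices.")
triage_agent = Agent("Triage", "Route the user to the right agent.", [transfer_to_billing])


def run_turn(agent: Agent, requested_function: str) -> Agent:
    """Call the function the model asked for; switch agents if it returns one."""
    for function in agent.functions:
        if function.__name__ == requested_function:
            result = function()
            if isinstance(result, Agent):
                return result  # handoff: the returned agent takes over
    return agent


active_agent = run_turn(triage_agent, "transfer_to_billing")
```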
Microsoft AutoGen (Wu et al. 2023) takes a more structured approach, providing a conversation-based framework where agents communicate through messages. AutoGen supports group chats (multiple agents discussing a topic), sequential pipelines, and nested conversations where one agent system can be invoked as a “tool” by another.
CrewAI models agents as “crew members” with roles, goals, and backstories. It provides a high-level API for assembling teams of agents that can collaborate on tasks, with built-in support for task delegation, memory, and interoperability with hundreds of tools.
The key difference between a multi-agent system and a swarm is scale and emergence. A multi-agent system typically has 2-10 agents with pre-defined roles and communication patterns. A swarm has dozens to hundreds of agents that self-organize, with behavior emerging from local interactions rather than top-down control. Swarms are better for problems that are massively parallelizable (e.g., competitive analysis, large-scale code review, distributed data analysis).
7.10 Agent Orchestration
Orchestration is the problem of coordinating multiple agents, tools, and data sources in a production system. It is the “plumbing” that holds everything together, and it is harder than it looks.
Control flow is the first challenge. Should agents communicate in a fixed pipeline (agent A always passes to agent B), or should they dynamically decide who to invoke next based on the current state? LangGraph (part of the LangChain ecosystem (Chase 2023)) answers this with a graph-based approach: you define agents as nodes and transitions as edges, creating a state machine that controls the flow of execution. This gives you the flexibility of dynamic routing with the predictability of a defined graph.
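The nodes-and-edges idea can be sketched as a small state machine. This is plain Python rather than the LangGraph API; the node functions, the `next_node` routing rule, and the "approve after two revisions" condition are all hypothetical.

```python
from typing import Optional


def research(state: dict) -> dict:
    return {**state, "notes": f"notes on {state['topic']}"}


def write(state: dict) -> dict:
    return {**state, "draft": f"draft from {state['notes']}",
            "revisions": state.get("revisions", 0) + 1}


def review(state: dict) -> dict:
    # Hypothetical rule: approve once the draft has been revised twice.
    return {**state, "approved": state["revisions"] >= 2}


NODES = {"research": research, "write": write, "review": review}


def next_node(current: str, state: dict) -> Optional[str]:
    """Edges of the graph; the edge out of 'review' depends on the state."""
    if current == "research":
        return "write"
    if current == "write":
        return "review"
    return None if state["approved"] else "write"


def run_graph(start: str, state: dict, max_steps: int = 10):
    path, current = [], start
    while current is not None and len(path) < max_steps:
        state = NODES[current](state)
        path.append(current)
        current = next_node(current, state)
    return state, path


final_state, path = run_graph("research", {"topic": "vector databases"})
```

The fixed edges give predictability, while the conditional edge after `review` supplies the dynamic routing.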
Error handling is critical. Agents can produce incorrect tool calls, hallucinate actions, or enter infinite loops. Robust orchestration requires timeouts, retry limits, fallback strategies, and human-in-the-loop checkpoints where a human can review and approve actions before they are executed.
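Two of these safeguards, retry limits with a fallback and a human approval checkpoint, can be sketched as below. The names (`FlakyTool`, `call_with_retries`, `execute_action`, the `risky` flag) are hypothetical, and timeouts are left out to keep the example short.

```python
class FlakyTool:
    """Test double that fails a fixed number of times before succeeding."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def __call__(self, argument: str) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError("temporary failure")
        return f"ok: {argument}"


def call_with_retries(tool, argument: str, max_retries: int = 2) -> str:
    """Try the tool up to max_retries + 1 times, then use a fallback."""
    last_error = None
    for _ in range(max_retries + 1):
        try:
            return tool(argument)
        except Exception as error:  # a real system would catch narrower errors
            last_error = error
    return f"fallback: tool failed ({last_error})"


def execute_action(action: dict, tool, approve) -> str:
    """Ask a human reviewer before running any action marked as risky."""
    if action.get("risky") and not approve(action):
        return "rejected by reviewer"
    return call_with_retries(tool, action["argument"])


tool = FlakyTool(failures=2)
outcome = execute_action({"argument": "deploy", "risky": True}, tool, approve=lambda a: True)
```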
State management involves maintaining shared context - conversation history, intermediate results, file artifacts, database connections - across multiple agents. Without careful state management, agents can step on each other's work or lose track of progress.
Evaluation is perhaps the hardest part. Measuring the performance of an agentic system is far more complex than evaluating a single LLM call. Benchmarks like SWE-bench (Jimenez et al. 2024) test end-to-end agent capability on real-world software engineering tasks, but agentic evaluation remains an open research problem.
Just as MCP standardizes how models connect to tools, Google's Agent-to-Agent (A2A) protocol aims to standardize how agents communicate with each other. A2A defines a common language for agents to discover each other's capabilities, negotiate tasks, and exchange results - even when they are built on different frameworks and run in different organizations. Think of it as HTTP for agents.
7.11 Agents for Cybersecurity
One of the most electrifying applications of agentic AI is in cybersecurity - specifically, autonomous penetration testing.
PentAGI (PentAGI Contributors 2025) is a fully autonomous AI agent for penetration testing. Given a target (with proper authorization), PentAGI can:
- Reconnaissance: Scan the target's network, enumerate open ports, identify running services and their versions.
- Vulnerability analysis: Cross-reference discovered services against known vulnerability databases (CVEs), identifying potential attack vectors.
- Exploitation: Attempt exploits against discovered vulnerabilities, gaining access to systems.
- Post-exploitation: Pivot within the network, escalate privileges, and assess the full extent of potential damage.
- Reporting: Generate a detailed penetration testing report with findings, severity ratings, and remediation recommendations.
The entire pipeline runs autonomously. PentAGI uses a combination of LLM reasoning (to decide what to try next), tool use (to run Nmap, Metasploit, and other security tools), and memory (to keep track of what it has discovered and tried). It essentially replicates the workflow of a human penetration tester, but it can work 24/7, never gets tired, and can test hundreds of systems in parallel.
Autonomous penetration testing must always be conducted with explicit, written authorization from the system owner. Unauthorized penetration testing is illegal in virtually every jurisdiction. PentAGI and similar tools are designed for authorized security assessments, red-team exercises, and defensive security research. Never deploy them against systems you do not own or have permission to test.
Beyond PentAGI, the cybersecurity-AI intersection includes defensive agents that monitor networks for anomalies and automatically respond to incidents, threat intelligence agents that continuously scan the dark web and vulnerability databases for emerging threats, and compliance agents that audit systems against security frameworks (SOC 2, ISO 27001, NIST) and flag non-compliance.
The cybersecurity community has always organized around “red teams” (attackers) and “blue teams” (defenders). AI is now playing both sides. Red-team agents like PentAGI probe for vulnerabilities; blue-team agents detect and respond to attacks. The result is an AI arms race in security, where the quality of defense is limited only by the quality of the offense used to test it.
7.12 Agents for Scientific Research
Perhaps the most transformative application of agentic AI is in scientific research. The dream of a system that can formulate hypotheses, design experiments, run them, analyze the results, and write up the findings - all autonomously - is no longer science fiction.
7.12.1 The AI Scientist
The AI Scientist (Lu et al. 2024) is a landmark project from Sakana AI that introduced the concept of a fully autonomous scientific discovery agent. Given a research area and some starter code, The AI Scientist can:
- Generate novel research ideas by surveying related work and identifying gaps.
- Design and implement experiments by writing code, running it, and collecting results.
- Analyze results by producing plots, computing metrics, and identifying trends.
- Write a complete research paper in standard academic format, including abstract, introduction, methods, results, and discussion.
- Conduct automated peer review of its own and other papers, providing detailed scores and critiques.
The system produces papers that, as scored by its own automated reviewer, sometimes clear the acceptance threshold of a major machine learning conference. That judgment comes from an LLM-based reviewer rather than from human peer review, and the quality is not yet at the level of top-tier venue publications, but the fact that a machine ran the entire scientific pipeline autonomously represents a paradigm shift.
The AI Scientist can generate a complete research paper - including all experiments - for approximately $15 in API costs. Even if only a fraction of these papers contain genuinely novel insights, the cost-effectiveness for generating preliminary research ideas and prototypes is extraordinary.
7.12.2 The AI Scientist v2
The AI Scientist v2 (Yamada et al. 2025) dramatically expands the scope. While v1 was limited to machine learning experiments that could be run in code, v2 introduces:
- Agentic tree search over the space of possible experiments, systematically exploring promising directions and pruning dead ends.
- Multi-disciplinary support extending beyond pure ML to domains like physics, chemistry, and biology.
- Improved experiment management with better code generation, error recovery, and experiment tracking.
- Higher-quality paper generation with improved LaTeX formatting, better figure generation, and more rigorous analysis.
V2 papers were rated by human reviewers as reaching the acceptance threshold of a peer-reviewed workshop (the “workshop-level” of the paper's title) - a significant improvement over v1.
The AI Scientist v1: “The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery” (Lu et al. 2024).
The AI Scientist v2: “The AI Scientist v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search” (Yamada et al. 2025).
These papers are essential reading for anyone interested in the future of AI-assisted research. Both papers (and the code) are fully open-source.
7.12.3 Denario and Other Scientific Agents
Beyond The AI Scientist, several other projects are pushing the boundaries of AI in research:
Denario is an agent-based system for data science and quantitative research. It can autonomously analyze datasets, generate hypotheses about the data, run statistical tests, build predictive models, and produce visualizations and reports. Think of it as a data scientist that never sleeps.
ChemCrow (Bran et al. 2024) is a chemistry-specific agent that can plan chemical syntheses, look up molecular properties, predict reaction outcomes, and interact with chemistry databases - all by augmenting an LLM with a curated set of chemistry tools.
OpenHands (formerly OpenDevin) (X. Wang et al. 2024) is an open-source platform for building software development agents. It provides a sandboxed environment where agents can write code, run tests, browse the web, and interact with a full Linux terminal - making it particularly well-suited for computational science.
Imagine a research lab where AI agents formulate hypotheses, design experiments, run them on robotic lab equipment, analyze the results, and submit papers - all while human scientists focus on the creative, high-level direction-setting. This is not a distant fantasy; projects like The AI Scientist and ChemCrow are building it piece by piece, right now.
7.13 Agents for Software Engineering
Software engineering may be the domain most immediately transformed by agentic AI. Agents are moving from “autocomplete on steroids” (code completion) to fully autonomous software engineers.
Devin (by Cognition AI) was the first widely publicized autonomous software engineering agent. Given a natural language task (“fix this bug,” “add user authentication,” “refactor this module”), Devin can use a web browser, code editor, and terminal to plan, write, test, and debug code - all without human intervention.
SWE-Agent (Jimenez et al. 2024) is an open-source alternative that achieves competitive results on SWE-bench by combining an LLM with a custom computer interface optimized for code navigation and editing.
OpenHands (X. Wang et al. 2024) provides the most complete open-source platform for coding agents, with a sandboxed environment, browser access, and a rich library of pre-built agent architectures.
Cursor, Windsurf, and GitHub Copilot represent the commercial frontier - IDE-integrated agents that can understand your entire codebase, make multi-file changes, run tests, and iterate on feedback.
SWE-bench (Jimenez et al. 2024) has become the gold standard for evaluating coding agents. It consists of real GitHub issues from popular Python repositories, along with the tests that verify whether the fix is correct. Top agents now solve over 50% of issues - remarkable progress considering the benchmark was first released in late 2023.
7.14 Agentic Coding: Deep Dive
The most advanced coding agents operate in a loop that closely mirrors human software engineering:
- Understand the task: Read the issue, bug report, or feature request. Browse relevant code files and documentation.
- Plan the approach: Decide which files to modify, what tests to write, and in what order to make changes.
- Implement: Write the code changes, handling edge cases and following the codebase's existing patterns.
- Test: Run existing tests to verify correctness. Write new tests if needed.
- Debug: If tests fail, read the error messages, form hypotheses about the cause, and iterate.
- Submit: When all tests pass, submit the changes for review.
What makes this hard is not any single step - it is the integration of all steps, the ability to recover from errors, and the need to maintain coherent context across a long-running session. Context window limitations, hallucination, and the difficulty of precise code editing remain the primary bottlenecks.
Give the agent concrete, specific instructions - not vague wishes. “Fix the authentication bug in auth.py where tokens expire prematurely” is far more actionable than “fix the login.” Provide test cases whenever possible - they give the agent a clear success criterion. And always review the output - agents are powerful but not infallible.
7.15 Agent Protocols: MCP and A2A
As the agentic ecosystem matures, two standardization efforts are shaping how agents connect to tools and to each other.
7.15.1 Model Context Protocol (MCP)
Anthropic's Model Context Protocol is an open standard that defines how language models connect to external data sources, tools, and services. Before MCP, every tool integration required custom code: parsing the model's output, formatting tool results, managing authentication. MCP provides a universal interface - a server exposes “resources” (data) and “tools” (actions), and any MCP-compatible model can discover and use them without bespoke integration.
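MCP messages travel in a JSON-RPC 2.0 envelope, with methods such as `tools/list` for discovery and `tools/call` for invocation. The sketch below only builds the request payloads; the `calculator` tool and its argument are hypothetical, transport and initialization are omitted, and the MCP specification remains the authoritative reference for exact message fields.

```python
import json
from itertools import count
from typing import Optional

_request_ids = count(1)


def jsonrpc_request(method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request with a fresh id."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# Discovery: ask the server which tools it exposes.
list_request = jsonrpc_request("tools/list")

# Invocation: call one tool by name ("calculator" is a hypothetical tool).
call_request = jsonrpc_request(
    "tools/call",
    {"name": "calculator", "arguments": {"expression": "sqrt(1764)"}},
)
```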
Think of MCP as the USB-C of AI: one standard connector that works with everything.
7.15.2 Agent-to-Agent Protocol (A2A)
Google's A2A protocol standardizes communication between agents. When agents from different organizations, built on different frameworks, need to collaborate, A2A provides a common language for:
- Capability discovery: An agent publishes what it can do via an “Agent Card.”
- Task negotiation: A requesting agent sends a task; the receiving agent can accept, reject, or negotiate.
- Streaming results: Long-running tasks can stream intermediate results back to the requester.
- Multi-modal payloads: Tasks and results can include text, images, files, and structured data.
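Capability discovery and task negotiation can be sketched with plain dictionaries. The card fields, URLs, and the `discover` and `negotiate` helpers below are illustrative assumptions, not the A2A wire format; the A2A specification defines the actual Agent Card schema and task lifecycle.

```python
AGENT_CARDS = [
    {
        "name": "Translation Agent",
        "description": "Translates short documents.",
        "url": "https://agents.example.com/translate",  # placeholder URL
        "capabilities": {"streaming": True},
        "skills": [{"id": "translate", "description": "Translate text"}],
    },
    {
        "name": "Charting Agent",
        "description": "Renders charts from tabular data.",
        "url": "https://agents.example.com/charts",  # placeholder URL
        "capabilities": {"streaming": False},
        "skills": [{"id": "plot", "description": "Plot a dataset"}],
    },
]


def discover(cards: list, skill_id: str) -> list:
    """Requesting side: find agents whose card advertises the skill."""
    return [card for card in cards
            if any(skill["id"] == skill_id for skill in card["skills"])]


def negotiate(card: dict, task: dict) -> str:
    """Receiving side: accept only tasks that match an advertised skill."""
    offered = {skill["id"] for skill in card["skills"]}
    return "accepted" if task["skill"] in offered else "rejected"


task = {"skill": "translate", "payload": {"text": "Bonjour", "target": "en"}}
candidates = discover(AGENT_CARDS, task["skill"])
status = negotiate(candidates[0], task)
```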
Together, MCP (agent \(\leftrightarrow\) tool) and A2A (agent \(\leftrightarrow\) agent) are laying the foundation for an “Internet of Agents” - a world where specialized AI agents can discover each other, negotiate services, and collaborate on tasks, regardless of who built them or where they run. This is still early, but the trajectory is clear.
7.16 The Rise of Agentic Frameworks
The agentic ecosystem has exploded with frameworks, each offering different trade-offs:
LangChain (Chase 2023) and LangGraph provide a comprehensive toolkit for building LLM applications, including agents. LangGraph's graph-based orchestration is particularly well-suited for complex multi-agent workflows with branching, looping, and human-in-the-loop patterns.
LlamaIndex (Liu 2023) specializes in data-centric agents - systems that need to query, index, and reason over structured and unstructured data. Its “query engine” abstraction makes it easy to build agents that pull insights from databases, documents, and APIs.
DSPy (Khattab et al. 2023) takes a radically different approach: instead of writing prompts by hand, you write programs that specify what the model should do, and DSPy automatically optimizes the prompts, few-shot examples, and even the model choice. This is “compiling” prompts rather than writing them.
Smolagents (by Hugging Face) is a minimalist framework focused on code-execution agents. Instead of the model generating JSON tool calls, the model writes Python code that is executed directly. This is more flexible than JSON-based tool calling and allows complex logic like loops, conditionals, and variable assignments within a single agent step.
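The difference from JSON tool calls can be seen in a small sketch: one emitted code snippet loops over several tool calls and accumulates a result in a single step. This is not the Smolagents API; `get_price`, `run_code_action`, and the emitted snippet are hypothetical, and emptying `__builtins__` is not a real sandbox.

```python
def get_price(item: str) -> float:
    """Hypothetical tool with hard-coded prices."""
    return {"apple": 0.5, "pear": 0.75}[item]


# Code a model might emit as one action: a loop over several tool calls.
model_code = """
total = 0.0
for item, quantity in [("apple", 4), ("pear", 2)]:
    total += get_price(item) * quantity
"""


def run_code_action(code: str, tools: dict) -> dict:
    """Execute emitted code with only the given tools in scope."""
    namespace = {"__builtins__": {}, **tools}  # NOT a secure sandbox
    exec(code, namespace)
    return {name: value for name, value in namespace.items()
            if name not in tools and name != "__builtins__"}


variables = run_code_action(model_code, {"get_price": get_price})
```

Expressing the same computation with JSON tool calls would take one model round trip per `get_price` call plus another to add up the results.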
AG2 (formerly AutoGen) (Wu et al. 2023) provides a high-level abstraction for building conversational multi-agent systems, supporting group chats, nested conversations, and a wide variety of agent topologies.
Which framework should you choose? The answer depends on your use case. For data-heavy RAG applications, use LlamaIndex. For complex multi-agent workflows, use LangGraph or AG2. For prompt optimization research, use DSPy. For simple, code-executing agents, use Smolagents. For quick prototypes, OpenAI's Agents SDK or Anthropic's Claude Code are the fastest paths.
7.17 HuggingGPT
One of the earliest and most influential visions of agentic AI was HuggingGPT (Shen et al. 2023) - a system where a single LLM acts as a controller that decomposes complex tasks and dispatches them to hundreds of specialist AI models hosted on Hugging Face. It deserves a deeper look because it anticipated much of today's agentic architecture years before the term “agentic” became mainstream.
The HuggingGPT pipeline works in four stages:
- Task planning: The LLM analyzes the user's request and decomposes it into a sequence of sub-tasks, identifying which AI capability each sub-task requires (e.g., “object detection,” “image captioning,” “text-to-speech”).
- Model selection: For each sub-task, the controller consults the Hugging Face model hub - which hosts thousands of specialist models - and selects the most appropriate one based on model descriptions, download counts, and task compatibility.
- Task execution: The selected models are invoked in the correct order, with outputs from one model feeding into the next. The controller handles data format conversions and dependency resolution.
- Response generation: The LLM aggregates all sub-task results into a coherent response for the user.
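The four stages can be sketched end to end with stubs. Nothing here calls Hugging Face: `MODEL_HUB` is an invented registry with made-up model names and download counts, `plan_tasks` returns a fixed plan instead of asking an LLM, and selection simply picks the most-downloaded entry.

```python
# Invented registry: task -> list of (model name, download count).
MODEL_HUB = {
    "image-captioning": [("captioner-small", 1200), ("captioner-large", 54000)],
    "translation": [("translator-en-fr", 31000)],
}


def plan_tasks(request: dict) -> list:
    """Stage 1 (stubbed): an LLM would derive this plan from the request."""
    return ["image-captioning", "translation"]


def select_model(task: str) -> str:
    """Stage 2: pick a model for the task, here simply by download count."""
    return max(MODEL_HUB[task], key=lambda entry: entry[1])[0]


def execute(model: str, data: str) -> str:
    """Stage 3 (stubbed): a real system would run the selected model."""
    return f"{model}({data})"


def respond(request: dict) -> dict:
    """Stage 4: chain the outputs and assemble the final response."""
    data, models_used = request["input"], []
    for task in plan_tasks(request):
        model = select_model(task)
        data = execute(model, data)  # output of one model feeds the next
        models_used.append(model)
    return {"answer": data, "models": models_used}


response = respond({"text": "Caption this photo in French", "input": "photo.jpg"})
```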
What made HuggingGPT visionary was the insight that you do not need one model that can do everything. Instead, you need one model that can coordinate everything. The LLM does not segment images or synthesize speech - it figures out which model does, calls it, and integrates the result. This “LLM as brain, specialist models as hands” paradigm is now the dominant architecture for complex agentic systems.
HuggingGPT's core insight - an LLM as a task planner that orchestrates specialist models - directly influenced systems like Microsoft's TaskWeaver, Gorilla (which specializes in API calling), and even modern agent frameworks. Every time an agent “decides which tool to call,” it is applying the HuggingGPT pattern.
7.18 Reducing Bloat in Agentic Systems
As agentic systems grow in complexity - more tools, more agents, more orchestration layers - they accumulate bloat: excessive token usage, redundant tool descriptions, inflated system prompts, and sprawling conversation histories that consume context windows and inflate costs. This is one of the most important practical challenges in deploying agents at scale.
The sources of bloat are insidious:
System prompt inflation. Every tool the agent has access to must be described in the system prompt. An agent with 50 tools might have a system prompt of 5,000+ tokens before any user interaction. Multi-agent systems compound this: if each of 5 agents has its own tool descriptions, the total prompt overhead can exceed 25,000 tokens.
Conversation history accumulation. Agents work in loops - observe, think, act, observe - and each iteration adds tokens. A complex task might require 20+ iterations, and the full conversation history (including tool outputs, error messages, and retry attempts) must be maintained. Long tool outputs (e.g., an entire web page or a large code file) are particularly wasteful.
Redundant reasoning. Agents often re-derive information they have already computed, wasting reasoning tokens on thoughts like “Let me think about what I already know...” followed by a recapitulation of the entire conversation history.
Strategies for reducing bloat include:
- Dynamic tool loading: Only inject tool descriptions relevant to the current sub-task. If the agent is writing code, it does not need the description of the email-sending tool in its context.
- Context compression: Summarize older conversation turns into compact representations. Frameworks like LangChain offer conversation summary memory that periodically compresses the history.
- Structured output enforcement: Force the agent to produce concise, structured outputs (JSON, function calls) rather than verbose natural language reasoning when reasoning is not needed.
- Tiered agent architectures: Use a cheap, fast model (e.g., GPT-4o-mini) for routine routing and tool selection, and only invoke the expensive model (e.g., Claude Opus, GPT-4o) for complex reasoning steps.
- Tool output truncation: Automatically summarize or truncate large tool outputs before injecting them back into the context.
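Two of these strategies, dynamic tool loading and tool output truncation, can be sketched in a few lines. The tool catalog, keyword sets, and character budget below are hypothetical; keyword matching stands in for whatever retrieval or routing step a real system would use, and the budget is counted in characters rather than tokens to keep the example self-contained.

```python
# Hypothetical tool catalog: name -> (trigger keywords, description sent to the model).
TOOLS = {
    "run_python": ({"code", "script", "python"}, "Execute a Python snippet and return stdout."),
    "send_email": ({"email", "mail", "notify"}, "Send an email to a recipient."),
    "web_search": ({"search", "web", "lookup"}, "Search the web and return top results."),
}


def select_tools(subtask: str) -> dict:
    """Dynamic tool loading: keep only tools whose keywords appear in the sub-task."""
    words = set(subtask.lower().split())
    return {name: desc for name, (keywords, desc) in TOOLS.items() if keywords & words}


def truncate_output(text: str, max_chars: int = 200) -> str:
    """Tool output truncation: keep the head and tail, drop the middle."""
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    omitted = len(text) - 2 * half
    return f"{text[:half]}\n[... {omitted} characters omitted ...]\n{text[-half:]}"


active = select_tools("write a python script that parses the log")
page = "x" * 10_000                      # stands in for a large web page or file
short = truncate_output(page)
print(sorted(active), len(page), len(short))
```

Only the `run_python` description would be injected for this sub-task, and the 10,000-character tool output shrinks to a few hundred characters before it re-enters the context.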
As a rule of thumb, roughly 80% of the token cost in an agentic system comes from 20% of the interactions - usually the tool descriptions in the system prompt and the raw outputs of web searches or file reads. Optimizing these two sources alone can often cut costs by 3-5\(\times\).
7.19 Rewarding Communities of Agents
When multiple agents collaborate, a fundamental question arises: how do you assign credit? If a swarm of 10 agents collaborates on a task and achieves a good result, which agents contributed most? Which ones were deadweight? And how do you incentivize agents to develop useful specializations over time?
This is the multi-agent credit assignment problem, and it has deep roots in both economics (how do you pay members of a team?) and reinforcement learning (how do you assign rewards in multi-agent RL?).
7.19.1 Reward Shaping for Agent Communities
In multi-agent RL, reward shaping involves designing reward functions that encourage both individual competence and collective cooperation. A naive approach - rewarding all agents equally based on the team's outcome - leads to free-riding (lazy agents coasting on the work of productive ones). A purely individual reward - rewarding each agent based on its own output - leads to competition and misalignment.
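The most promising approaches use Shapley values from cooperative game theory: each agent's contribution is its marginal value averaged over every possible coalition of the other agents (equivalently, over every order in which the team could be assembled). The exact computation grows exponentially with the number of agents, so it is usually approximated by sampling, but the resulting allocation is provably fair in the sense that it is the only one satisfying the standard Shapley axioms (efficiency, symmetry, the null-player property, and additivity).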
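A small worked example makes the Shapley calculation concrete. The sketch below computes exact values by enumerating every ordering of the team, which is only practical for a handful of agents. The agent names and the `team_score` function are hypothetical, chosen so that the result is easy to check by hand.

```python
from itertools import permutations


def shapley_values(agents, value):
    """Exact Shapley values: average each agent's marginal contribution
    over every order in which the team could be assembled."""
    totals = {a: 0.0 for a in agents}
    orderings = list(permutations(agents))
    for order in orderings:
        coalition = frozenset()
        for agent in order:
            with_agent = coalition | {agent}
            totals[agent] += value(with_agent) - value(coalition)
            coalition = with_agent
    return {a: total / len(orderings) for a, total in totals.items()}


# Hypothetical team score: the planner and coder are only useful together,
# while the idler adds nothing to any coalition.
def team_score(coalition):
    return 10.0 if {"planner", "coder"} <= coalition else 0.0


phi = shapley_values(["planner", "coder", "idler"], team_score)
print(phi)  # {'planner': 5.0, 'coder': 5.0, 'idler': 0.0}
```

The free-rider receives zero credit, the two productive agents split the reward evenly, and the values sum to the team's total score of 10.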
7.19.2 Emergent Specialization
When agents in a community are rewarded for collective outcomes over many episodes, something remarkable happens: they develop spontaneous specialization. Even if all agents start identical, the optimal strategy for the group is for each agent to develop a distinct skill. This mirrors the division of labor in human economies - Adam Smith's pin factory, realized in silicon.
Research on emergent communication in multi-agent systems (Foerster et al. 2016) shows that agents can develop their own communication protocols, specialized roles, and even something resembling a rudimentary “culture” - shared norms and conventions that emerge from interaction rather than being programmed.
Some researchers envision future agent systems as economies: agents offer services, negotiate prices, and form contracts. Just as human economies allocate resources through markets, agent economies could allocate computational resources through emergent market mechanisms. The A2A protocol is a first step toward this vision - it provides the infrastructure for agents to discover, negotiate with, and transact with each other.
7.20 The Five Levels of AGI
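Two related frameworks are often cited for classifying AI systems along the path toward AGI. Google DeepMind's 2023 paper (Morris et al. 2023) grades systems by performance (Emerging, Competent, Expert, Virtuoso, Superhuman) and by generality (narrow versus general). The five-level ladder below, with agents as its third rung, matches the scale that OpenAI was reported to use internally in 2024: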
- Level 1 - Chatbots: Conversational AI. Equal to or better than an unskilled human at conversation. (ChatGPT circa 2022.)
- Level 2 - Reasoners: AI that can solve problems requiring multi-step reasoning. Equal to a skilled adult. (GPT-4, Claude 3.5, o1.)
- Level 3 - Agents: AI that can take actions in the real or digital world over extended time periods. (Current frontier: Devin, Claude Computer Use, SWE-Agent.)
- Level 4 - Innovators: AI that can generate genuinely novel ideas, make scientific discoveries, and create new knowledge. (Emerging: The AI Scientist.)
- Level 5 - Organizations: AI that can perform the work of an entire organization - coordinating multiple agents, managing resources, pursuing long-term goals, and adapting strategy. (Not yet achieved.)
The jump from Level 3 to Level 5 is staggering. A Level 3 agent can fix a bug. A Level 5 “agent organization” could, in principle, run a startup - from identifying the market opportunity, to designing the product, writing the code, deploying it, handling customer support, and iterating based on feedback.
Have we actually reached Level 3? The honest answer is: barely. Current agents can handle tasks that take minutes to a few hours, in constrained domains, with significant human oversight. Reliable multi-day autonomy, robust error recovery, and genuine real-world action execution remain unsolved. Level 3 is where the hardest engineering problems live - and it is where most of the research and product development energy is focused right now. For further reading, see “Levels of AGI: Operationalizing Progress on the Path to AGI” (Morris et al. 2023).
7.21 Agents in Finance
The financial industry is one of the most natural domains for agentic AI. Financial workflows are data-intensive, time-sensitive, highly structured, and enormously valuable - exactly the characteristics where agents excel.
7.21.1 Quantitative Research Agents
Quantitative research - the systematic analysis of financial data to identify trading opportunities - is being transformed by AI agents. An agentic quant researcher can:
- Ingest data from multiple sources: market data feeds, SEC filings, earnings call transcripts, news articles, social media sentiment.
- Generate hypotheses about relationships between variables (“Does unusual options activity predict earnings surprises?”).
- Backtest strategies by writing and executing Python code against historical data.
- Evaluate results using standard financial metrics (Sharpe ratio, maximum drawdown, alpha).
- Iterate on strategies that show promise, refining parameters and adding risk controls.
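The evaluation step is the easiest part of this loop to pin down in code. The sketch below computes two of the metrics mentioned above, an annualized Sharpe ratio and maximum drawdown, from a list of per-period returns. The return series is hypothetical, and the defaults (252 trading periods per year, zero risk-free rate) are common conventions rather than requirements.

```python
import math
import statistics


def sharpe_ratio(returns, periods_per_year=252, risk_free_per_period=0.0):
    """Annualized Sharpe ratio: mean excess return divided by its standard deviation."""
    excess = [r - risk_free_per_period for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


# Hypothetical daily strategy returns produced by a backtest.
daily_returns = [0.01, -0.02, 0.015, 0.003, -0.01, 0.02]
print(round(sharpe_ratio(daily_returns), 2), round(max_drawdown(daily_returns), 4))
```

An agent that writes its own backtests should still have metrics like these computed by fixed, reviewed code, so that a hallucinated formula cannot inflate a strategy's apparent performance.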
FinRobot (Yang et al. 2024) is a multi-agent framework specifically designed for financial applications. It provides specialized agents for market analysis, portfolio management, and risk assessment, each with access to financial data APIs and analytical tools.
7.21.2 Compliance and Risk Agents
Financial regulations are extraordinarily complex - thousands of pages of rules across multiple jurisdictions, updated constantly. Compliance agents can:
- Monitor transactions for suspicious patterns (anti-money-laundering).
- Audit portfolios against regulatory constraints (concentration limits, approved investment lists).
- Generate regulatory reports automatically.
- Track regulatory changes and flag impacts on existing operations.
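The portfolio audit in particular reduces to deterministic checks that an agent can call as a tool. The following is a minimal sketch: the tickers, the approved list, and the 45% concentration limit are hypothetical, and a real mandate would carry many more rules.

```python
def audit_portfolio(weights, approved, max_weight=0.10):
    """Return a list of human-readable violations for a portfolio.

    weights: mapping ticker -> fraction of portfolio value.
    approved: set of tickers the mandate allows.
    max_weight: hypothetical single-position concentration limit.
    """
    violations = []
    for ticker, weight in sorted(weights.items()):
        if ticker not in approved:
            violations.append(f"{ticker}: not on approved list")
        if weight > max_weight:
            violations.append(f"{ticker}: weight {weight:.0%} exceeds limit {max_weight:.0%}")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        violations.append("weights do not sum to 100%")
    return violations


# Hypothetical portfolio and mandate, for illustration only.
portfolio = {"AAA": 0.50, "BBB": 0.08, "CCC": 0.42}
report = audit_portfolio(portfolio, approved={"AAA", "BBB"}, max_weight=0.45)
print(report)
```

Here the audit flags the oversized `AAA` position and the unapproved `CCC` holding; the language model's job is then to explain the findings, not to decide whether a limit was breached.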
7.21.3 Robo-Advisors and Portfolio Management
AI-powered robo-advisors have existed for years, but agentic robo-advisors take the concept further. Instead of following a fixed rules-based allocation, an agentic advisor can research current market conditions, read analyst reports, form views on asset classes, and dynamically adjust portfolios - all while explaining its reasoning to the client in natural language.
Financial agents operate in a domain where errors have direct monetary consequences. Hallucination is not just wrong - it is expensive. Any financial agent must include robust guardrails: human approval for trades above a threshold, hard constraints on portfolio allocations, and comprehensive audit logs. The “agentic mindset” of “let the model figure it out” is dangerous when applied without safeguards to real money.
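These guardrails can be enforced outside the model, in plain code that the agent cannot talk its way around. The sketch below gates every proposed trade through an approval threshold and a hard limit, and records each decision. The threshold values, the `Trade` type, and the `human_approves` callback are hypothetical stand-ins for a real order-management workflow.

```python
from dataclasses import dataclass


@dataclass
class Trade:
    ticker: str
    notional: float  # trade size in account currency


APPROVAL_THRESHOLD = 10_000.0   # hypothetical: larger trades need a human
HARD_LIMIT = 100_000.0          # hypothetical: never exceeded, even with approval
audit_log = []                  # every decision is recorded, including rejections


def submit_trade(trade, human_approves=lambda t: False):
    """Gate an agent-proposed trade and record the decision."""
    if trade.notional > HARD_LIMIT:
        decision = "rejected: hard limit"
    elif trade.notional > APPROVAL_THRESHOLD:
        approved = human_approves(trade)
        decision = "executed: human approved" if approved else "blocked: awaiting approval"
    else:
        decision = "executed: auto"
    audit_log.append((trade.ticker, trade.notional, decision))
    return decision


print(submit_trade(Trade("AAA", 500.0)))
print(submit_trade(Trade("AAA", 50_000.0)))
print(submit_trade(Trade("AAA", 50_000.0), human_approves=lambda t: True))
print(submit_trade(Trade("AAA", 1_000_000.0), human_approves=lambda t: True))
```

The default callback denies approval, so a missing or misconfigured human-in-the-loop step fails closed rather than open.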
7.22 Latent Space Communication
Here is a question that borders on the philosophical: why do agents communicate in natural language at all?
When Agent A needs to send information to Agent B, the current paradigm works like this:
1. Agent A forms an internal representation (activations in its neural network).
2. Agent A decodes this representation into natural language tokens (e.g., “The database query returned 42 rows”).
3. The text is sent to Agent B as input.
4. Agent B encodes the text back into internal representations.
5. Agent B reasons over these representations.
Steps 2-4 are a bottleneck. Natural language is a lossy compression of internal model states. When Agent A converts its rich, high-dimensional understanding into a sequence of tokens, information is lost. When Agent B re-encodes those tokens, it reconstructs a representation that is similar to Agent A's original state, but not identical. This is like two humans communicating by describing their thoughts in words - it works, but it is far less efficient than if they could directly share their mental states.
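A toy model shows why a discrete channel is lossy. In the sketch below, an agent's internal state is a point in a 2-D space, and the only way to transmit it is to pick the nearest word from a three-word vocabulary. The vocabulary, its coordinates, and the example state are all hypothetical; real models work with thousands of dimensions and multi-token messages, but the rounding effect is the same in kind.

```python
import math

# Toy shared vocabulary: each "word" is a point in a 2-D representation space.
VOCAB = {
    "low": (0.0, 0.0),
    "medium": (0.5, 0.5),
    "high": (1.0, 1.0),
}


def to_text(state):
    """Agent A: pick the single word closest to its internal state."""
    return min(VOCAB, key=lambda w: math.dist(VOCAB[w], state))


def from_text(word):
    """Agent B: map the word back to a representation."""
    return VOCAB[word]


state_a = (0.62, 0.41)               # hypothetical internal state of Agent A
message = to_text(state_a)           # discrete message sent over the channel
state_b = from_text(message)         # what Agent B reconstructs
error = math.dist(state_a, state_b)  # information lost in transit
print(message, state_b, round(error, 3))  # medium (0.5, 0.5) 0.15
```

Agent B ends up near Agent A's state but not at it; the gap is the price of forcing a continuous representation through a discrete vocabulary.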
7.22.1 The Latent Space Communication Hypothesis
What if agents could skip the language bottleneck entirely and communicate directly through their internal representations - their latent states? This is the idea behind latent space communication, sometimes informally called “machine telepathy.”
Instead of Agent A generating text and Agent B reading it, Agent A would send its final-layer activations (or a compressed version of them) directly to Agent B, which would inject them into its own processing pipeline. No tokenization, no detokenization, no information loss from the language bottleneck.
There are several fundamental obstacles to latent space communication:
- Representation incompatibility: Different models - even different versions of the same model - have different internal representation spaces. Agent A's latent states are meaningless to Agent B unless their embedding spaces are aligned.
- No shared “language” of latents: Natural language is a universal interface precisely because it is standardized. Latent spaces are not.
- Interpretability loss: When agents communicate in text, humans can inspect and audit the communication. Latent space messages are opaque, making debugging and oversight much harder.
- Security concerns: Latent space messages could be used to smuggle adversarial signals between agents in ways that are impossible to detect through text-based monitoring.
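The first obstacle, representation incompatibility, can be illustrated with a deliberately simple assumption: suppose Agent B encodes every concept exactly as Agent A does, except that its space is rotated by 90 degrees. The rotation and the example vector are hypothetical; in practice the mapping between two models' latent spaces is unknown, generally nonlinear, and has to be learned.

```python
import math


def rotate(vec, degrees):
    """Rotate a 2-D vector; stands in for a mismatch between two latent spaces."""
    t = math.radians(degrees)
    x, y = vec
    return (x * math.cos(t) - y * math.sin(t), x * math.sin(t) + y * math.cos(t))


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# Toy assumption: Agent B's vector for a concept is Agent A's vector rotated by 90 degrees.
concept_in_a = (1.0, 0.0)
concept_in_b = rotate(concept_in_a, 90)

raw = cosine(concept_in_a, concept_in_b)                  # B reads A's latent directly
aligned = cosine(rotate(concept_in_a, 90), concept_in_b)  # A's latent mapped into B's space first
print(round(raw, 6), round(aligned, 6))
```

Read directly, A's vector is orthogonal to B's representation of the same concept and carries no usable signal; once the alignment map is applied, the two match exactly.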
7.22.2 Current Research Directions
Despite these challenges, several research directions are making latent communication more feasible:
Shared embedding spaces. Models like CLIP (Radford et al. 2021) and ImageBind (Girdhar et al. 2023) demonstrate that different modalities can be aligned into a shared latent space through contrastive learning. The same technique could potentially align the latent spaces of different LLMs, enabling cross-model latent communication.
Soft tokens and continuous prompts. Work on prefix tuning and prompt tuning (Li and Liang 2021) shows that continuous vectors (“soft tokens”) can be injected into a model's input alongside regular discrete tokens. These soft tokens can carry information that is richer than what can be expressed in natural language. One agent could generate soft tokens that are consumed by another.
CALM: Composition of Augmented Language Models. The CALM framework (Bansal et al. 2024) demonstrates that two language models can be composed by learning a small cross-attention module that connects the representations of one model to the other. This allows one model to “read” another model's internal states, enabling a form of latent communication without requiring the models to share the same architecture.
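The mechanism that lets one model “read” another's states is cross-attention: the reading model's hidden states act as queries, and the other model's hidden states supply the keys and values. The following is a minimal single-head sketch in plain Python, not the CALM implementation. It uses identity projections and hypothetical 2-D hidden states, whereas a real composition layer would use learned projection matrices so that the two models need not share a hidden size.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, memory):
    """Single-head cross-attention with identity projections.

    queries: hidden states of the reading model (list of vectors).
    memory: hidden states of the other model, used as both keys and values.
    Returns one context vector per query, a weighted mix of the memory vectors.
    """
    dim = len(memory[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in memory]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, memory)) for i in range(dim)])
    return outputs


# Hypothetical hidden states for illustration.
states_a = [[1.0, 0.0], [0.0, 1.0]]   # model being read
states_b = [[5.0, 0.0]]               # model doing the reading
context = cross_attend(states_b, states_a)
print(context)
```

The query aligns with the first memory vector, so the returned context is dominated by it; no text is generated or parsed at any point in the exchange.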
Thinking tokens and hidden reasoning. Recent reasoning models (o1, DeepSeek-R1) generate extended “thinking” chains that are kept separate from the final answer and, in some products, hidden from the user. In a multi-agent setting, these thinking tokens could be shared directly between agents - not as text, but as the continuous representations that produced them - enabling a richer form of inter-agent communication.
Imagine a future where agents are not separate programs communicating through text, but nodes in a larger neural network of agents. Each agent is a module that processes information, transforms it, and passes activations - not strings - to the next module. The entire system would be differentiable, allowing end-to-end training of multi-agent systems through backpropagation, just as we train multi-layer neural networks today. This is speculative, but the theoretical foundations - differentiable communication channels, shared embedding spaces, learned cross-model attention - already exist.
7.22.3 Emergent Communication
Even without explicit latent space engineering, agents can develop their own communication protocols. Research on emergent communication (Foerster et al. 2016) in multi-agent reinforcement learning has shown that when agents are trained together on cooperative tasks, they spontaneously develop compact, efficient “languages” that bear little resemblance to natural language but are highly effective for coordination.
These emergent languages tend to be:
- Compositional: The meaning of a message is determined by the meanings of its parts and how they are combined - just like human language.
- Efficient: Agents develop symbols for frequently needed concepts and compress redundant information.
- Task-specific: The language evolves to support exactly the information exchange needed for the task, with no wasted expressiveness.
The connection to latent space communication is direct: these emergent languages are, in a sense, a naturally discovered latent communication protocol. The agents have learned to compress their internal states into compact messages that another agent can decode - they have invented their own form of machine telepathy.
If agents can communicate through latent representations, develop emergent languages, specialize through reward shaping, and self-organize into swarms - have we built a hive mind? The parallels with biological systems are striking. Ant colonies communicate through pheromones (a form of “latent communication”), develop specialized castes, and achieve collective intelligence far exceeding that of any individual ant. The difference is that our artificial hive minds can be designed, debugged, and scaled deliberately. Whether this is exciting or terrifying depends on your perspective.
7.23 Generalist Agents
The ultimate vision of agentic AI is the generalist agent - a single system that can handle any task across any domain, adapting its strategy, tools, and communication style to the situation at hand. Rather than deploying a cybersecurity agent, a coding agent, a finance agent, and a scientific research agent separately, the generalist agent dynamically assembles the right capabilities for each task.
7.23.1 What Makes a Generalist Agent?
A true generalist agent requires:
- Broad world knowledge: A frontier LLM (GPT-4o, Claude Opus, Gemini Ultra) as its reasoning core.
- Dynamic tool selection: Access to a large library of tools, with the ability to discover and learn new tools on the fly. The Model Context Protocol (MCP) enables this - the agent can connect to any MCP-compatible tool server.
- Domain adaptation: The ability to shift its behavior based on context. When working on code, it follows software engineering best practices; when analyzing financial data, it applies risk management frameworks; when doing security testing, it follows ethical guidelines.
- Meta-cognition: The ability to recognize when it is out of its depth and either seek help (from a human or a specialist agent) or decline the task.
7.23.2 Current Generalist Systems
Several systems approximate the generalist agent vision:
Claude Code (Anthropic) is designed as a general-purpose coding agent that can also browse the web, analyze data, and interact with various tools through MCP. Its extended thinking capability allows it to tackle complex, multi-step problems across domains.
Manus emerged as one of the first “general-purpose AI agents” that can handle tasks across domains - from research and data analysis to software development and content creation - all within a single interface. It operates by autonomously planning, browsing the web, writing and executing code, and producing deliverables.
OpenAI's Operator extends GPT-4o with the ability to use a web browser, effectively creating a generalist agent that can perform many of the tasks a human could do through a browser - booking travel, filling out forms, researching products, and managing accounts.
Specialist agents are more reliable and efficient within their domain. A cybersecurity agent built with curated security tools will outperform a generalist on penetration testing. But the generalist offers flexibility - it can handle the long tail of tasks that no one anticipated. The emerging pattern is generalist orchestrator, specialist workers: a generalist agent that understands the task, selects the right specialist(s), and coordinates the work.
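The “generalist orchestrator, specialist workers” pattern can be sketched as a router with a fallback. Everything named below is hypothetical: the specialists are stub functions, and keyword overlap stands in for the classification step that a real orchestrator would delegate to an LLM.

```python
# Hypothetical specialist workers, keyed by domain.
SPECIALISTS = {
    "security": lambda task: f"[security agent] scanned: {task}",
    "finance": lambda task: f"[finance agent] analyzed: {task}",
}

# Hypothetical keyword hints; a real orchestrator would ask an LLM to classify.
DOMAIN_HINTS = {
    "security": {"vulnerability", "pentest", "exploit"},
    "finance": {"portfolio", "earnings", "backtest"},
}


def classify(task):
    """Return the best-matching domain, or None if nothing matches."""
    words = set(task.lower().split())
    scores = {domain: len(hints & words) for domain, hints in DOMAIN_HINTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def orchestrate(task):
    """Route to a specialist when one fits; otherwise handle the long tail directly."""
    domain = classify(task)
    if domain is not None:
        return SPECIALISTS[domain](task)
    return f"[generalist] handled: {task}"


print(orchestrate("run a backtest on the portfolio"))
print(orchestrate("plan a birthday party"))
```

The first request goes to the finance specialist, while the second, which no one anticipated, falls through to the generalist.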
7.24 Key Research Papers on Agents
The following papers represent essential reading for understanding the agentic AI landscape:
- ReAct (Yao, Zhao, et al. 2023) - The foundational paper on combining reasoning and acting in LLM agents.
- Toolformer (Schick et al. 2023) - Teaching LLMs to decide when and how to call tools.
- A Survey on LLM-based Agents (L. Wang et al. 2024) - A comprehensive survey covering architectures, capabilities, and applications.
- Foundation Agents (Wang et al. 2025) - A forward-looking analysis of where agentic AI is heading.
- Agents vs. Agentic (Xi et al. 2025) - Clarifying the taxonomy and distinguishing true autonomy from tool-augmented generation.
- MetaGPT (Hong et al. 2023) - The state of the art in multi-agent software development.
- The AI Scientist (Lu et al. 2024) - Fully autonomous scientific discovery.
- HuggingGPT (Shen et al. 2023) - The LLM as a controller that orchestrates specialist models.
- Emergent Communication (Foerster et al. 2016) - How agents develop their own languages.
- Levels of AGI (Morris et al. 2023) - A framework for measuring progress toward AGI.
7.25 The Future of Agentic AI
The trajectory of agentic AI points toward increasingly autonomous systems (Xi et al. 2025):
Computer-using agents like UFO (Zhang et al. 2024) and Browser Use (Müller and Žunič 2024) are evolving from demonstrations to production tools. The day when an AI can reliably use any software - including legacy systems with no API - is approaching fast.
Self-improving agents are the next frontier: agents that can evaluate their own performance, identify weaknesses, generate training data from their mistakes, and fine-tune their underlying models. This creates a virtuous cycle where the agent gets better through use.
Long-horizon autonomy is the grand challenge. Today's agents work best on tasks that take minutes to hours. Moving to agents that can pursue goals over days or weeks - managing their own context, memory, error recovery, and resource allocation - requires fundamental advances in planning and state management.
Agent marketplaces and ecosystems will emerge where specialized agents can be discovered, composed, and deployed dynamically. Need an agent that can do tax accounting? Combine it with one that can access government databases. Need one that can negotiate contracts? Chain it with a legal analysis agent. The A2A protocol is the early infrastructure for this vision.
Building agentic systems requires a different mindset from traditional software engineering. You are not writing deterministic code; you are designing an environment in which a probabilistic reasoning engine can succeed. This means thinking about guardrails, fallbacks, evaluation criteria, and human oversight - not just features and functions.