A multi-agent conversational system must account for the memory limitations of language models. Large models (GPT-4, Claude, Mistral, Gemini) offer very wide context windows, but they still operate on a “sliding window” basis. This is illustrated in Fig. 1: when the context window fills up, new tokens “push” older ones out of the model’s memory, and earlier information is lost. In practice, this means that over a long conversation the model can forget earlier turns, start repeating itself, or make coherence errors. This phenomenon is sometimes called Context Degradation Syndrome: after just a few dozen to a few hundred exchanges, the model can “lose the thread” and generate increasingly imprecise answers ([Context Degradation Syndrome: When Large Language Models Lose the Plot](https://jameshoward.us/2024/11/26/context-degradation-syndrome-when-large-language-models-lose-the-plot)).
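The sliding-window behavior can be mimicked with a small sketch that trims a conversation to a fixed token budget, so the newest turns always fit while older ones fall out. The names here (`count_tokens`, `trim_history`, the word-count tokenizer) are illustrative, not any particular library's API:

```python
MAX_TOKENS = 50  # tiny budget, purely for demonstration

def count_tokens(message: str) -> int:
    # Crude stand-in for a real tokenizer (e.g. tiktoken): 1 word ≈ 1 token.
    return len(message.split())

def trim_history(history: list[str], budget: int = MAX_TOKENS) -> list[str]:
    """Keep the most recent messages whose combined size fits the budget."""
    kept, used = [], 0
    for message in reversed(history):   # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget:
            break                       # older turns fall out of the window
        kept.append(message)
        used += cost
    return list(reversed(kept))         # restore chronological order
```

Everything dropped by `trim_history` is simply gone from the model's point of view, which is exactly why long conversations degrade without an external memory strategy.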
In multi-agent systems, agents can collaborate in two primary ways: handoff (transferring control) or agent-as-tool (using an agent as a tool). In the handoff pattern, one agent completes its part of the work and passes the entire context to the next “specialist” agent instead of continuing to process it itself ([Handoffs — AutoGen], [Multi-agent systems – Agent Development Kit]). Conversely, in the agent-as-tool pattern the main agent invokes a secondary agent like a function or API call, then integrates its response into the ongoing conversation ([Multi-agent systems – Agent Development Kit](https://google.github.io/adk-docs/agents/multi-agents/)).
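The difference between the two patterns can be sketched framework-agnostically. Both functions below use the same two hypothetical agents; only the control flow differs:

```python
def triage_agent(query: str) -> str:
    return f"triage: classified {query!r}"

def billing_agent(query: str) -> str:
    return f"billing: resolved {query!r}"

def run_with_handoff(query: str) -> str:
    """Handoff: triage does its part, then transfers the whole conversation
    to the specialist, whose answer becomes the final answer."""
    triage_agent(query)            # triage finishes its step...
    return billing_agent(query)    # ...and control moves entirely

def run_agent_as_tool(query: str) -> str:
    """Agent-as-tool: the main agent calls the specialist like a function
    and weaves the result back into its own reply."""
    sub_result = billing_agent(query)   # invoked like an API call
    return f"main agent answer, using [{sub_result}]"
```

In the handoff case the specialist owns the rest of the conversation; in the agent-as-tool case the main agent stays in charge and only borrows the specialist's capability.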
When designing APIs intended for intelligent agents (conversational LLMs, multi-agent orchestration, etc.), it’s helpful to treat them as “machine interfaces”—clear and unambiguous not only to developers, but to algorithms as well. A good starting point is to produce a full OpenAPI specification for your service (for example using FastAPI, which automatically generates Swagger/OpenAPI docs). The OpenAPI standard lets agents read the entire API definition—what resources and parameters are available, how to authenticate, what inputs to send, and what responses to expect ([Building an AI agent with OpenAPI: LangChain vs. Haystack]). Crafting complete, AI-ready documentation is critical ([Is Your API AI-ready? Our Guidelines and Best Practices]).
- Rich descriptions and metadata. Every endpoint and parameter should have an exhaustive description: not just a repeat of its name, but an explanation of “what this endpoint does,” “what data it expects,” and “what it returns.”
Chat-based AI apps often need to perform long-running operations (e.g. document generation, data analysis) that can’t block the user interaction. To keep the system responsive, these tasks must run in the background, outside the normal request–response cycle. A common pattern is to introduce an asynchronous task queue: the user’s request is immediately enqueued (e.g. returning “task accepted”) ([Background Tasks – FastAPI])([Using FastAPI with SocketIO to Display Real-Time Progress of Celery Tasks | by Fadi Shaar | Medium]), and a separate process (or cluster of workers) executes the job. This way, FastAPI can instantly handle further requests while heavy work proceeds independently, keeping the API responsive for other clients ([Using FastAPI with SocketIO to Display Real-Time Progress of Celery Tasks | by Fadi Shaar | Medium])([Background Tasks – FastAPI]).
For example, a contract-generation chatbot can immediately confirm receipt (“Document generation in progress…”) and deliver the finished document once the background job completes.
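The accept-then-work pattern can be sketched in-process with a worker thread and a status map; a production system would use a real queue (e.g. Celery with Redis) instead, but the shape of the flow is the same. All names here are hypothetical:

```python
import threading
import time
import uuid

TASKS: dict[str, str] = {}          # task_id -> "pending" | "done"

def generate_document(task_id: str) -> None:
    time.sleep(0.1)                 # stand-in for slow document generation
    TASKS[task_id] = "done"

def submit_job() -> str:
    """Immediately acknowledge the request; heavy work continues in the background."""
    task_id = str(uuid.uuid4())
    TASKS[task_id] = "pending"
    threading.Thread(target=generate_document, args=(task_id,)).start()
    return task_id                  # client polls a status endpoint with this id
```

`submit_job` returns instantly with a task id, which is exactly what lets the chat endpoint answer “task accepted” without blocking on the work itself.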
Modern conversational AI systems often split functionality into multiple tools or sub-agents, each specialized for a task (e.g. search, booking, math, etc.). When a user sends a query, the system must interpret intent and dispatch it to the right tool/agent. There are two broad approaches: letting a general-purpose LLM handle intent detection itself, or using a dedicated router component. In practice, many practitioners use a hybrid: an initial “router” classifies the intent and then a specialized agent or tool handles the task. Below we survey best practices and examples of each approach, referencing frameworks like LangChain and Semantic Router.
A common approach is to have the LLM itself decide which tool or chain to invoke. For example, one can prompt the model to output a JSON field indicating the desired “tool” or “function” (using OpenAI’s function calling or ChatGPT Plugins).
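The routing step itself is straightforward once the model's decision is in hand. In the sketch below, a keyword heuristic stands in for the LLM's JSON decision; in a real system the `{"tool": ..., "input": ...}` object would come from a function-calling response, and the tool names are invented for illustration:

```python
import json

TOOLS = {
    "search": lambda q: f"search results for {q!r}",
    "math":   lambda q: str(eval(q, {"__builtins__": {}})),  # toy calculator only
}

def route(query: str) -> str:
    # Stand-in for the LLM: pretend it returned {"tool": ..., "input": ...}.
    tool = "math" if any(ch.isdigit() for ch in query) else "search"
    decision = json.loads(json.dumps({"tool": tool, "input": query}))
    return TOOLS[decision["tool"]](decision["input"])
```

The important property is that the router's output is structured (parseable JSON naming a tool), so dispatch is a dictionary lookup rather than free-text interpretation.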
The development of sophisticated multi-agent systems introduces significant challenges in managing the flow of data and context between individual agents. As the complexity of these systems grows, with multiple agents collaborating to achieve a common goal, the potential for errors, inefficiencies, and unpredictable behavior due to mismanaged data also increases. Uncontrolled data flow can lead to agents receiving irrelevant or incorrectly formatted information, hindering their ability to perform their designated tasks effectively. The OpenAI Agents SDK is designed to address these challenges by providing a set of primitives, including handoffs, which facilitate the intelligent transfer of control between agents.[^1] This SDK aims to enable the construction of complex multi-agent workflows.
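One concrete form this control takes is filtering the context at the handoff boundary, so the receiving agent sees only what it needs. The sketch below is a toy illustration of that idea, not the SDK's own API; the field names and filter are invented:

```python
def handoff_filter(context: dict, allowed: set[str]) -> dict:
    """Strip irrelevant or sensitive keys before transferring control."""
    return {k: v for k, v in context.items() if k in allowed}

conversation_context = {
    "user_query": "refund order 42",
    "payment_token": "secret",      # should not reach the refund specialist
    "order_id": 42,
}

# Only the fields the specialist actually needs cross the handoff boundary.
forwarded = handoff_filter(conversation_context, {"user_query", "order_id"})
```

Filtering at the boundary keeps each agent's input small and well-formed, which is precisely the failure mode (irrelevant or incorrectly formatted information) the paragraph above describes.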
The integration of OpenAI's reasoning models (o-series) with the Agents SDK presents intriguing possibilities for developers who want to observe an agent's thinking process in real time. While there are limitations to accessing the complete "train of thought," there are several methods to stream insights into an agent's reasoning as it works.
OpenAI's reasoning models (o1, o3, o4 series) utilize a special type of processing called "reasoning tokens" in addition to standard input and output tokens. These reasoning tokens represent the model's internal thinking process as it breaks down problems and considers multiple approaches[^9].
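A consequence worth making explicit: reasoning tokens are hidden from the response but are still billed and still consume context, counted alongside visible output. The toy accounting sketch below illustrates this; the numbers and class are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class Usage:
    input_tokens: int
    reasoning_tokens: int        # internal "thinking", never returned as text
    visible_output_tokens: int   # the completion the caller actually sees

    @property
    def billed_output_tokens(self) -> int:
        # Hidden reasoning tokens count toward output for billing and context.
        return self.reasoning_tokens + self.visible_output_tokens

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.billed_output_tokens
```

So a short visible answer can still be expensive if the model "thought" at length before producing it.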