Notes on LLM-based Autonomous Agents: Hype vs. Reality, May 2024.
While general LLM agents promise flexibility, developers find them too unreliable for production applications.
There has been a lot of hype around the promise of LLM-based autonomous agent workflows. As of mid-2024, all major LLMs are capable of tool use and function calling, enabling the LLM to perform sequences of tasks with some autonomy.
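The tool-use pattern behind these workflows is simple in outline: the model emits a structured "function call," the agent runtime dispatches it, and the result is fed back until the model produces a final answer. A minimal sketch of that loop, with the model stubbed out (the tool name, message shapes, and `fake_llm` stand-in are all illustrative, not any vendor's actual API):

```python
# Hypothetical tool registry; a real agent would expose real functions here.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

def fake_llm(messages):
    # Stand-in for a real model: it first requests the weather tool,
    # then answers using the tool result it was fed back.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "get_weather", "args": {"city": "Berlin"}}
    return {"answer": messages[-1]["content"]}

def agent_loop(user_msg, max_steps=5):
    """Dispatch model-requested tool calls until the model answers."""
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = fake_llm(messages)
        if "answer" in reply:
            return reply["answer"]
        # Execute the requested tool and feed the result back to the model.
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("agent did not finish within max_steps")

print(agent_loop("What's the weather in Berlin?"))  # → Sunny in Berlin
```

Every real agent product is some elaboration of this loop; the hard part, as the rest of these notes argue, is making each iteration reliable.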
But reality is proving more challenging than anticipated.
The WebArena leaderboard, which benchmarks LLM agents against real-world tasks, shows that even the best-performing models have a success rate of only 35.8%.
After seeing many attempts at building AI agents, I believe it's still too early: too expensive, too slow, too unreliable. It feels like many AI agent startups are waiting for a model breakthrough that will start the race to productize agents.
- Reliability: LLMs are prone to hallucinations and inconsistencies, and chaining multiple AI steps compounds these issues, especially for tasks requiring exact outputs.
- Performance and costs: GPT-4o, Gemini-1.5, and Claude Opus handle tool use/function calling quite well, but they are still slow and expensive, particularly when loops and automatic retries are needed.
- Legal concerns: Companies may be held liable for the mistakes of their agents.
- User trust: The "black box" nature of AI agents and stories like the above make it hard for users to understand and trust their outputs. Gaining user trust for sensitive tasks involving payments or personal information (paying bills, shopping, etc.) will be hard.
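The reliability point above can be made concrete with a bit of arithmetic: if each step of a chained workflow succeeds independently with probability p, the whole chain succeeds with probability p^n (the numbers below are illustrative, not a measured benchmark):

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that a chain of n independent steps all succeed."""
    return p_step ** n_steps

# Even a 95%-reliable step collapses over a 10-step agent workflow:
print(round(chain_success(0.95, 10), 2))  # → 0.6
```

This is why a per-step accuracy that sounds impressive in a demo still yields end-to-end success rates like WebArena's 35.8%.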
Several startups are tackling the AI agent space, but most are still experimental or invite-only:
- adept.ai - $350M funding, the leadership team was recently acqui-hired by Amazon
- MultiOn - funding unknown, their API-first approach seems promising
- HyperWrite - $2.8M funding, started with an AI writing assistant and expanded into the agent space
- minion.ai - Chat-based RPA interface that is now in open beta. They also have a native iOS app.
Only MultiOn seems to be pursuing the "give it instructions and watch it go" approach, which is more in line with the promise of AI agents. All others are going down the record-and-replay RPA route, which may be necessary for reliability at this stage.
These tech demos are impressive, but we'll see how well these agent capabilities will work when released publicly and tested against real-world scenarios instead of hand-picked demo cases.
AI agents are overhyped and most of them are simply not ready for mission-critical work. However, the underlying models and architectures continue to advance quickly, and we can expect to see more successful real-world applications.
The most promising path forward likely looks like this:
- Augmenting existing tools with AI in the near term, rather than offering broad, fully autonomous standalone services.
- Human-in-the-loop approaches that keep humans involved for oversight and for handling edge cases.
- Setting realistic expectations about current capabilities and limitations.
By combining tightly constrained LLMs, good evaluation data, human-in-the-loop oversight, and traditional engineering methods, we can achieve reliably good results when automating tasks of moderate complexity.
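The constrained, human-in-the-loop pattern described above can be sketched in a few lines: validate the LLM output against a strict check, and escalate to a human instead of retrying blindly. All names here (`run_with_oversight`, the lambdas) are illustrative, not a real library:

```python
def run_with_oversight(llm_call, validate, ask_human, task):
    """Accept LLM output only if it passes validation; else escalate."""
    draft = llm_call(task)
    if validate(draft):
        return draft               # constrained path: output passed checks
    return ask_human(task, draft)  # edge case: a human handles it

# Toy usage: extract an amount; reject anything that isn't a number.
result = run_with_oversight(
    llm_call=lambda task: "42.50",                       # stubbed model
    validate=lambda s: s.replace(".", "", 1).isdigit(),  # strict check
    ask_human=lambda task, draft: "human-reviewed",      # escalation path
    task="extract invoice total",
)
print(result)  # → 42.50
```

The design point is that the traditional-engineering layer (the `validate` check) bounds the damage a hallucination can do, and the human path catches what the check rejects.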
Will AI agents automate tedious repetitive work, such as web scraping, form filling, and data entry? Yes, absolutely.
Will AI agents autonomously book your vacation without your intervention? Unlikely, at least in the near future.
Is the CALM approach from the "Task-Oriented Dialogue with In-Context Learning" paper the missing recipe?