TL;DR
The speaker, Ankur Goyal from Braintrust, discusses the evolution of AI evaluations (evals), highlighting how manual they remain despite the advanced automation in the AI products they measure. He notes that Braintrust users run a high volume of evals daily (~13 per day for an average organization, with advanced customers exceeding 3,000), often spending hours manually analyzing dashboards to determine necessary code or prompt changes. This manual process is poised for a revolution with the introduction of Loop, an AI agent built directly into Braintrust. Loop leverages recent breakthroughs in frontier models, particularly Claude 4 (which performs nearly 6x better than the previous leading model), to automatically optimize prompts, complex agents, datasets, and scorers. The agent presents side-by-side suggestions in the UI, supports various models, and aims to transform evaluation from a manual task into an automated, efficient process. The core message is that the future of evals lies in intelligent automation driven by advanced AI models.
Information Mind Map
- Braintrust's Journey: Almost two years working with leading AI companies.
- High Volume of Evals:
  - Average organization: ~13 evals/day.
  - Some advanced customers: >3,000 evals/day.
  - Time spent: >2 hours/day in the product on evals.
- Manual Process:
  - Despite building automated AI products/agents, evaluation remains largely manual.
  - Primary method: Looking at a dashboard.
  - Requires human decision: "What changes can I make to my code or prompts so that this eval does better?"
  - Implication: A bottleneck in the AI development workflow.
- What is Loop?: An agent built directly into Braintrust.
- Enabling Breakthrough:
  - Only possible due to recent advancements in frontier models.
  - Braintrust has run quarterly evals on these models for two years.
  - Models were "not very good" at improving prompts/data/scorers until very, very recently.
  - Key Model: Claude 4, identified as a "real breakthrough moment."
    - Performance: Performs almost 6 times better than the previous leading model.
- Loop's Core Capabilities:
  - Automatic Optimization:
    - Optimizes prompts (from simple to complex agents).
    - Helps build better datasets.
    - Helps build better scorers.
  - Holistic Approach: Emphasizes that "it's really the combination of these three things (prompts, datasets, scorers) that make for really great evals."
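The talk's claim is that great evals come from the combination of three pieces: a prompt, a dataset, and a scorer. As a rough illustration of how those pieces fit together (a hypothetical sketch only — none of these names come from Braintrust's actual SDK), a minimal eval harness might look like:

```python
# Hypothetical eval harness illustrating the three ingredients the talk
# names: a prompt, a dataset, and a scorer. All names are invented for
# this sketch and are not Braintrust's API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    input: str
    expected: str


def exact_match_scorer(output: str, expected: str) -> float:
    """A minimal scorer: 1.0 if the model output matches exactly, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_eval(
    prompt_template: str,
    dataset: list[EvalCase],
    scorer: Callable[[str, str], float],
    model: Callable[[str], str],
) -> float:
    """Render the prompt for each case, call the model, and average the scores."""
    scores = [
        scorer(model(prompt_template.format(input=case.input)), case.expected)
        for case in dataset
    ]
    return sum(scores) / len(scores)


# Toy stand-in "model" that uppercases the text after the prompt prefix,
# just to make the sketch runnable end to end.
def toy_model(prompt: str) -> str:
    return prompt.rsplit(": ", 1)[-1].upper()


dataset = [EvalCase(input="hello", expected="HELLO")]
score = run_eval("Uppercase this: {input}", dataset, exact_match_scorer, toy_model)
print(score)  # → 1.0
```

Improving any one of the three pieces changes the eval's result, which is why an agent like Loop targets all of them rather than prompts alone.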
- Availability:
  - Can be used today.
  - Existing Braintrust users: Flip on a feature flag called Loop.
  - New users: Sign up for the product.
- Model Flexibility:
  - Default model: Claude 4.
  - User selectable: Any model the user has access to (e.g., OpenAI, Gemini, custom LLMs).
- Integrated UI/UX:
  - Runs directly inside of Braintrust.
  - Side-by-Side View: When Loop suggests an edit (to data, scoring ideas, or prompts), users can see it directly in the UI.
    - Benefit: Maintains the importance of looking at data and prompts while working.
  - "Go for It" Toggle: For adventurous users, a toggle to let Loop optimize automatically without manual review.
    - Effectiveness: "Which actually works really well."
- Recap: Evals have been critical but "incredibly manual."
- Future Outlook: Over the next year, evals will be "completely revolutionized" by frontier models.
- Braintrust's Role: Excited to incorporate these advancements.
- Engagement:
  - Try out the Braintrust product.
  - Try out Loop and provide feedback.
  - Get in touch for discussions.
- Hiring: Braintrust is hiring for roles in UI, AI, and Infrastructure.
- Contact: Scan the QR code (mentioned in video).