TL;DR
The speaker, Ankur Goyal from Braintrust, discusses the evolution of AI evaluations (evals), highlighting how manual they remain despite the advanced automation in the AI products they measure. He notes that Braintrust users run a high volume of evals daily (~13 per day for an average organization, with advanced customers exceeding 3,000), often spending hours manually analyzing dashboards to determine necessary code or prompt changes. This manual process is poised for a revolution with the introduction of Loop, an AI agent built directly into Braintrust. Loop leverages recent breakthroughs in frontier models, particularly Claude 4 (which performs nearly 6x better than the previous leading model), to automatically optimize prompts, complex agents, datasets, and scorers. The agent presents side-by-side suggestions in the UI, supports various models, and aims to transform evaluation from a manual task into an automated, efficient process. The core message is that the future of evals lies in intelligent automation driven by advanced AI models.
Information Mind Map
- Braintrust's Journey: Almost two years working with leading AI companies.
- High Volume of Evals:
  - Average organization: ~13 evals/day.
  - Some advanced customers: >3,000 evals/day.
  - Time spent: >2 hours/day in the product on evals.
- Manual Process:
  - Despite building automated AI products/agents, evaluation remains largely manual.
  - Primary method: Looking at a dashboard.
  - Requires human decision: "What changes can I make to my code or prompts so that this eval does better?"
  - Implication: A bottleneck in the AI development workflow.
- What is Loop?: An agent built directly into Braintrust.
- Enabling Breakthrough:
  - Only possible due to recent advancements in frontier models.
  - Braintrust has run quarterly evals on these models for two years.
  - Models were "not very good" at improving prompts/data/scorers until very, very recently.
  - Key Model: Claude 4, identified as a "real breakthrough moment."
    - Performance: Performs almost 6 times better than the previous leading model.
- Loop's Core Capabilities:
  - Automatic Optimization:
    - Optimizes prompts (from simple to complex agents).
    - Helps build better datasets.
    - Helps build better scorers.
  - Holistic Approach: Emphasizes that "it's really the combination of these three things (prompts, datasets, scorers) that make for really great evals."
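The talk's claim is that great evals come from the combination of three pieces: a prompt, a dataset, and a scorer. As a rough illustration of how those pieces fit together (a hypothetical sketch only — none of these names come from Braintrust's actual SDK), a minimal eval harness might look like:

```python
# Hypothetical eval harness illustrating the three ingredients the talk
# names: a prompt, a dataset, and a scorer. All names are invented for
# this sketch and are not Braintrust's API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    input: str
    expected: str


def exact_match_scorer(output: str, expected: str) -> float:
    """A minimal scorer: 1.0 if the model output matches exactly, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_eval(
    prompt_template: str,
    dataset: list[EvalCase],
    scorer: Callable[[str, str], float],
    model: Callable[[str], str],
) -> float:
    """Render the prompt for each case, call the model, and average the scores."""
    scores = [
        scorer(model(prompt_template.format(input=case.input)), case.expected)
        for case in dataset
    ]
    return sum(scores) / len(scores)


# Toy stand-in "model" that uppercases the text after the prompt prefix,
# just to make the sketch runnable end to end.
def toy_model(prompt: str) -> str:
    return prompt.rsplit(": ", 1)[-1].upper()


dataset = [EvalCase(input="hello", expected="HELLO")]
score = run_eval("Uppercase this: {input}", dataset, exact_match_scorer, toy_model)
print(score)  # → 1.0
```

Improving any one of the three pieces changes the eval's result, which is why an agent like Loop targets all of them rather than prompts alone.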
- Availability:
  - Can be used today.
  - Existing Braintrust users: Flip on a feature flag called Loop.
  - New users: Sign up for the product.
- Model Flexibility:
  - Default model: Claude 4.
  - User selectable: Any model the user has access to (e.g., OpenAI, Gemini, custom LLMs).
- Integrated UI/UX:
  - Runs directly inside of Braintrust.
  - Side-by-Side View: When Loop suggests an edit (to data, scoring ideas, or prompts), users can see it directly in the UI.
    - Benefit: Maintains the importance of looking at data and prompts while working.
  - "Go for It" Toggle: For adventurous users, a toggle to let Loop optimize automatically without manual review.
    - Effectiveness: "Which actually works really well."
- Recap: Evals have been critical but "incredibly manual."
- Future Outlook: Over the next year, evals will be "completely revolutionized" by frontier models.
- Braintrust's Role: Excited to incorporate these advancements.
- Engagement:
  - Try out the Braintrust product.
  - Try out Loop and provide feedback.
  - Get in touch for discussions.
- Hiring: Braintrust is hiring for roles in UI, AI, and Infrastructure.
- Contact: Scan the QR code (mentioned in video).