TL;DR John Dickerson, CEO of Mozilla AI and co-founder of Arthur AI, argues that 2025 will finally be "the year of evaluations" for AI, unlike previous years where it remained a niche concern. This shift is attributed to a "perfect storm" of three concurrent factors: the launch of ChatGPT in late 2022, which made AI capabilities understandable and "wow-worthy" for non-technical C-suite executives (CEOs, CFOs, CISOs); a simultaneous enterprise budget freeze that paradoxically funneled limited discretionary funds specifically into GenAI "pet projects"; and the current deployment of autonomous or semi-autonomous agentic AI systems that act for humans, introducing significant complexity and risk requiring quantitative evaluation. This confluence has led to unprecedented C-suite alignment on the critical need for AI evaluation, driving hockey-stick growth for evaluation companies and shifting focus to monitoring entire multi-agent systems rather than just individual models.
Information Mind Map
- Speaker: John Dickerson, CEO of Mozilla AI (formerly Co-founder & Chief Scientist at Arthur AI)
- Mozilla AI Mission: Open-source AI tooling and stack, enabling the open-source community to engage with leaders like Sam Altman.
- Arthur AI Experience: Six years in observability, evaluation, and security across:
- Traditional ML
- Deep Learning Revolution
- GenAI Revolution
- Agentic Revolution
- Current Outlook: Companies in this space are now seeing "hockey stick growth."
- Interdependence: Cannot do monitoring/observability without measurement; measurement is core to evaluation.
- Historical Context (Pre-2025):
- Evaluation was not top of mind for the C-suite (CEO, CFO, CISO).
- ML models often "spit out numbers" ingested into larger, opaque systems.
- The complexity of these systems "erased" the top-of-mind need for model-level evaluation for decision-makers.
- ML monitoring has existed since ~2012 (e.g., H2O, Algorithmia, Seldon, WhyLabs, Aporia, Arize, Arthur, Galileo, Fiddler, Protect AI, Snowflake, Databricks, Datadog, SageMaker, Vertex AI).
- Challenge: Tenuous connection to downstream business KPIs (dollars saved/earned); selling was primarily into the CIO.
- Common Pitch Deck Slide: "This is the year a CEO gets fired due to an ML screw-up" (to the speaker's knowledge, it never happened).
- Example (JPMC): Jamie Dimon's 2022 report showed a "comically small" AI/ML spend (~$100M from 2017 to 2021 on the consumer side).
- Three Concurrent Factors (The Triangle):
- AI Became Understandable to C-Suite:
- Event: ChatGPT launch (November 30, 2022).
- Impact: Less technical CEOs, CFOs, and CISOs could interact easily with a UI and be "wowed" by AI.
- Shifted AI from a technical problem to a business opportunity/risk.
- Perfectly Timed Budget Freeze:
- Context: Deep fears of impending recession (late 2022, before ChatGPT launch).
- Effect: Most enterprises froze or shrunk IT budgets for 2023.
- Exception: Discretionary budget was unlocked specifically for GenAI as the "CEO's pet project."
- Emergence of Agentic Systems:
- Definition: Systems now acting for humans/teams (autonomously or semi-autonomously), not just providing inputs.
- Consequence: Introduces significant complexity and risk, forcing evaluation to be top-of-mind.
- Result: Big takeoff for evaluation companies (e.g., Braintrust, Arthur, Arize AI, Galileo).
- 2023: The Year of GenAI Science Projects
- Austerity due to frozen budgets.
- The only newly allocated money went to GenAI.
- Focus on GenAI as a "cool technology."
- 2024: GenAI Applications in Production
- Internal chat apps, hiring tools.
- Business-suited folks (non-ML/data-science) began asking about:
- ROI
- Governance
- Risk
- Compliance
- Brand optics
- This brought the C-suite closer to caring about evaluation (e.g., quantitative risk estimates).
- 2025: Shipping & Scaling of AI/Agentic Systems
- Scale-ups and revenue growth for frontier model providers and evaluation companies.
- C-suite comfortable allocating large, real budgets for AI.
- Technology has become "amazing."
- Community, open source, venture capital, big tech are all heavily invested.
- Crucially, ML systems are moving toward autonomy.
- 2025: Clearly the "Year of the Agent."
- Agent Definition (dating to the 1950s):
- Perceive environment.
- Learn.
- Abstract & Generalize.
- Key Difference from Traditional ML: agents also Reason and Act (see the toy loop sketched after this list).
- Impact: Introduces significant complexity and risk into systems, which is positive for evaluation companies.
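A minimal sketch of that perceive/reason/act loop. All names here (Thermostat, run_agent) are illustrative toys, not drawn from any real framework:

```python
class Thermostat:
    """Toy environment: a room temperature the agent can nudge."""
    def __init__(self, temp: float = 30.0) -> None:
        self.temp = temp

    def observe(self) -> float:
        return self.temp

    def apply(self, delta: float) -> None:
        self.temp += delta

def run_agent(env: Thermostat, target: float = 21.0, steps: int = 5) -> None:
    for _ in range(steps):
        obs = env.observe()            # perceive the environment
        action = (target - obs) * 0.5  # reason: choose an action toward the goal
        env.apply(action)              # act: change the world, not just report on it
        print(f"observed {obs:.1f}, acted {action:+.1f}")

run_agent(Thermostat())
```

The act step is what separates this from a traditional ML model, which would stop at emitting the observation-derived number.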
- Core Need: Connecting product value to downstream business KPIs (risk mitigation, revenue gains, cost reduction).
- Evaluation's Role: Quantifying things for these KPIs.
- C-Suite Roles & Their Buy-in:
- CEO: Understands GenAI and agentic-system capabilities; comfortable talking to experts, allocating budget, and reporting to the board/shareholders.
- CFO: Cares about the bottom line; needs quantitative evaluation for allocation and budget planning (Excel spreadsheets).
- CISO: Sees AI as both a huge security risk and an opportunity (e.g., hallucination detection, prompt injection); more willing than the CIO to write smaller checks for startups (less overhead/process).
- CIO: Remains on board, wants to keep their job.
- CTO: Demands standards and data-driven decisions; numbers from evaluation are crucial.
- Overall: All key C-suite members are now aligned on the need to understand AI evaluation.
- Industry Shift: All evaluation, observability, monitoring, and security companies have shifted to multi-agent system monitoring.
- Principle: Monitor the whole system, not just individual models (see the sketch after this item).
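A minimal sketch of that principle, assuming hypothetical Span/Trace types rather than any vendor's SDK (real stacks typically use OpenTelemetry-style traces): every per-model check can pass while the system-level check still fails.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    agent: str   # which agent/model produced this step
    output: str  # what it emitted

@dataclass
class Trace:
    task: str
    spans: list[Span] = field(default_factory=list)

def span_level_ok(span: Span) -> bool:
    # Model-level check: each individual output is non-empty/well-formed.
    return bool(span.output.strip())

def system_level_ok(trace: Trace, expected: str) -> bool:
    # System-level check: did the *final* output actually solve the task?
    # This catches bad hand-offs between agents that per-span checks miss.
    return expected.lower() in trace.spans[-1].output.lower()

trace = Trace(
    task="Refund order #123",
    spans=[
        Span("planner", "Route to billing agent"),
        Span("billing", "Escalated to human"),  # well-formed, but task not done
    ],
)
assert all(span_level_ok(s) for s in trace.spans)   # every model looks fine...
assert not system_level_ok(trace, "refund issued")  # ...yet the system failed
```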
- Revenue Growth: Leaked revenue numbers for evaluation startups (e.g., Glean, Galileo, Braintrust) from mid-April (lagged by 6-8 months) are no longer representative; current numbers are significantly higher.
- Prediction: Early-2026 leaks will show that revenue no longer lags at AI evaluation startups, confirming 2025 as the year of AI evaluation.
- Problem: GenAI evaluations often require domain expertise (e.g., validating a discounted cash flow spreadsheet). Traditional ML dealt with structured data; GenAI output is unstructured, requiring human-like quality measurement.
- Solution (Current): Hiring human experts for validation (e.g., Mercor provides experts at $50-$200/hour).
- Experts sit alongside multi-agent systems, performing "expensive human validation."
- Justification: High stakes (making/losing money, job loss if wrong) justify the cost.
- Future Outlook (5 years): What happens when this expert data is incorporated into the systems themselves?
- Competitive Moat: Dataset creation and environment creation are paramount in the eval space. Investing in high-quality competitive environments (e.g., a DCF environment; a minimal sketch follows) can be significant capex and a competitive advantage.
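For concreteness, a discounted cash flow is just a net-present-value formula, so a hand-built "DCF environment" can grade an agent's spreadsheet against exact ground truth. A minimal sketch; the grading function and tolerance are assumptions for illustration:

```python
def npv(cash_flows: list[float], rate: float) -> float:
    """Net present value: discount each future cash flow back to today."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows, start=1))

def grade_agent_dcf(agent_value: float, cash_flows: list[float], rate: float,
                    tol: float = 0.01) -> bool:
    """Pass/fail check an eval environment might run on an agent's output."""
    return abs(agent_value - npv(cash_flows, rate)) < tol

# Three years of $100 cash flows at a 10% discount rate ≈ $248.69.
assert grade_agent_dcf(248.685, [100, 100, 100], 0.10)
```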
- LLM-as-a-Judge Paradigm: Already used in practice.
- Challenges: Known biases (e.g., toward conciseness or helpfulness) relative to humans (per the speaker's ICLR paper).
- Benefit: Solves dataset creation to some extent; the LLM serves as a "poor man's version" of human judging.
- Crucial Need: Still requires validation to ensure it is not drifting in a "weird bias direction" (a minimal judge sketch follows).
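A minimal LLM-as-a-judge sketch using the OpenAI Python client. The model name and rubric are assumptions, and per the point above, scores should be spot-checked against human labels:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Rate the RESPONSE to the QUESTION for factual accuracy on a 1-5 scale.
Reply with only the integer score.

QUESTION: {question}
RESPONSE: {response}"""

def judge(question: str, response: str) -> int:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model works here
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, response=response),
        }],
        temperature=0,
    )
    # Judges are known to drift toward biases (e.g., rewarding concise answers),
    # so periodically validate a sample of scores against human experts.
    return int(completion.choices[0].message.content.strip())

print(judge("What is 2 + 2?", "4"))
```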
- Product: any-agent (open source, not monetized)
- Description: "LiteLLM for multi-agent systems."
- Functionality: Implements various multi-agent system frameworks under a unified interface (see the usage sketch below).
- Call to Action: Encourages anyone interested in multi-agent system frameworks to play around with it.
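A hedged usage sketch based on the mozilla-ai/any-agent README; the exact API (AnyAgent.create, AgentConfig, the framework and model identifiers) is an assumption and may differ across versions:

```python
from any_agent import AgentConfig, AnyAgent

# Swap the framework string ("tinyagent", "langchain", "openai", ...)
# without changing the rest of the code; that is the unified interface.
agent = AnyAgent.create(
    "tinyagent",
    AgentConfig(
        model_id="gpt-4.1-mini",  # assumption: any supported model id
        instructions="Answer concisely.",
    ),
)

agent_trace = agent.run("Which agent frameworks does any-agent wrap?")
print(agent_trace.final_output)
```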