TL;DR John Dickerson, CEO of Mozilla AI and co-founder of Arthur AI, argues that 2025 will finally be "the year of evaluations" for AI, unlike previous years where it remained a niche concern. This shift is attributed to a "perfect storm" of three concurrent factors: the launch of ChatGPT in late 2022, which made AI capabilities understandable and "wow-worthy" for non-technical C-suite executives (CEOs, CFOs, CISOs); a simultaneous enterprise budget freeze that paradoxically funneled limited discretionary funds specifically into GenAI "pet projects"; and the current deployment of autonomous or semi-autonomous agentic AI systems that act for humans, introducing significant complexity and risk requiring quantitative evaluation. This confluence has led to unprecedented C-suite alignment on the critical need for AI evaluation, driving hockey-stick growth for evaluation companies and shifting focus to monitoring entire multi-agent systems rather than just individual models.
Information Mind Map
- Speaker: John Dickerson, CEO of Mozilla AI (formerly Co-founder & Chief Scientist at Arthur AI)
- Mozilla AI Mission: Open-source AI tooling and stack, enabling the open-source community to engage with leaders like Sam Altman.
- Arthur AI Experience: Six years in observability, evaluation, and security across:
- Traditional ML
- Deep Learning Revolution
- GenAI Revolution
- Agentic Revolution
- Current Outlook: Companies in this space are now seeing "hockey stick growth."
- Interdependence: Cannot do monitoring/observability without measurement; measurement is core to evaluation.
- Historical Context (Pre-2025):
- Evaluation was not top of mind for the C-suite (CEO, CFO, CISO).
- ML models often "spit out numbers" ingested into larger, opaque systems.
- The complexity of these systems "erased" the top-of-mind need for model-level evaluation for decision-makers.
- ML monitoring has existed since ~2012 (e.g., H2O, Algorithmia, Seldon, WhyLabs, Aporia, Arize, Arthur, Galileo, Fiddler, Protect AI, Snowflake, Databricks, Datadog, SageMaker, Vertex AI).
- Challenge: Tenuous connection to downstream business KPIs (dollars saved/earned); selling was primarily into the CIO.
- Common Pitch Deck Slide: "This is the year a CEO gets fired due to an ML screw-up" (to the speaker's knowledge, it never happened).
- Example (JPMC): Jamie Dimon's 2022 report showed a "comically small" AI/ML spend (~$100M from 2017 to 2021 on the consumer side).
- Three Concurrent Factors (The Triangle):
- AI Became Understandable to C-Suite:
- Event: ChatGPT launch (November 30, 2022).
- Impact: Less technical CEOs, CFOs, and CISOs could interact easily with a UI and be "wowed" by AI.
- Shifted AI from a technical problem to a business opportunity/risk.
- Perfectly Timed Budget Freeze:
- Context: Deep fears of impending recession (late 2022, before ChatGPT launch).
- Effect: Most enterprises froze or shrunk IT budgets for 2023.
- Exception: Discretionary budget was unlocked specifically for GenAI as the "CEO's pet project."
- Emergence of Agentic Systems:
- Definition: Systems now acting for humans/teams (autonomously or semi-autonomously), not just providing inputs.
- Consequence: Introduces significant complexity and risk, forcing evaluation to be top-of-mind.
- Result: Big takeoff for evaluation companies (e.g., Braintrust, Arthur, Arize AI, Galileo).
- 2023: The Year of GenAI Science Projects
- Austerity due to frozen budgets.
- The only newly allocated money went to GenAI.
- Focus on GenAI as a "cool technology."
- 2024: GenAI Applications in Production
- Internal chat apps, hiring tools.
- Business-suited folks (non-ML/data-science) began asking about:
- ROI
- Governance
- Risk
- Compliance
- Brand optics
- This brought the C-suite closer to caring about evaluation (e.g., quantitative risk estimates).
- 2025: Shipping & Scaling of AI/Agentic Systems
- Scale-ups and revenue growth for frontier model providers and evaluation companies.
- C-suite comfortable allocating large, real budgets for AI.
- Technology has become "amazing."
- Community, open source, venture capital, big tech are all heavily invested.
- Crucially, ML systems are moving toward autonomy.
- 2025: Clearly the "Year of the Agent."
- Agent Definition (dating to the 1950s):
- Perceive environment.
- Learn.
- Abstract & Generalize.
- Key Difference from Traditional ML: agents also Reason and Act (see the toy loop sketched after this list).
- Impact: Introduces significant complexity and risk into systems, which is positive for evaluation companies.
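A minimal sketch of that perceive/reason/act loop. All names here (Thermostat, run_agent) are illustrative toys, not drawn from any real framework:

```python
class Thermostat:
    """Toy environment: a room temperature the agent can nudge."""
    def __init__(self, temp: float = 30.0) -> None:
        self.temp = temp

    def observe(self) -> float:
        return self.temp

    def apply(self, delta: float) -> None:
        self.temp += delta

def run_agent(env: Thermostat, target: float = 21.0, steps: int = 5) -> None:
    for _ in range(steps):
        obs = env.observe()            # perceive the environment
        action = (target - obs) * 0.5  # reason: choose an action toward the goal
        env.apply(action)              # act: change the world, not just report on it
        print(f"observed {obs:.1f}, acted {action:+.1f}")

run_agent(Thermostat())
```

The act step is what separates this from a traditional ML model, which would stop at emitting the observation-derived number.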
- Core Need: Connecting product value to downstream business KPIs (risk mitigation, revenue gains, cost reduction).
- Evaluation's Role: Quantifying things for these KPIs.
- C-Suite Roles & Their Buy-in:
- CEO: Understands GenAI and agentic-system capabilities; comfortable talking to experts, allocating budget, and reporting to the board/shareholders.
- CFO: Cares about the bottom line; needs quantitative evaluation for allocation and budget planning (Excel spreadsheets).
- CISO: Sees AI as both a huge security risk and an opportunity (e.g., hallucination detection, prompt injection); more willing than the CIO to write smaller checks for startups (less overhead/process).
- CIO: Remains on board, wants to keep their job.
- CTO: Demands standards and data-driven decisions; numbers from evaluation are crucial.
- Overall: All key C-suite members are now aligned on the need to understand AI evaluation.
- Industry Shift: All evaluation, observability, monitoring, and security companies have shifted to multi-agent system monitoring.
- Principle: Monitor the whole system, not just individual models (see the sketch after this item).
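A minimal sketch of that principle, assuming hypothetical Span/Trace types rather than any vendor's SDK (real stacks typically use OpenTelemetry-style traces): every per-model check can pass while the system-level check still fails.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    agent: str   # which agent/model produced this step
    output: str  # what it emitted

@dataclass
class Trace:
    task: str
    spans: list[Span] = field(default_factory=list)

def span_level_ok(span: Span) -> bool:
    # Model-level check: each individual output is non-empty/well-formed.
    return bool(span.output.strip())

def system_level_ok(trace: Trace, expected: str) -> bool:
    # System-level check: did the *final* output actually solve the task?
    # This catches bad hand-offs between agents that per-span checks miss.
    return expected.lower() in trace.spans[-1].output.lower()

trace = Trace(
    task="Refund order #123",
    spans=[
        Span("planner", "Route to billing agent"),
        Span("billing", "Escalated to human"),  # well-formed, but task not done
    ],
)
assert all(span_level_ok(s) for s in trace.spans)   # every model looks fine...
assert not system_level_ok(trace, "refund issued")  # ...yet the system failed
```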
- Revenue Growth: Leaked revenue numbers for evaluation startups (e.g., Glean, Galileo, Braintrust) from mid-April (lagged by 6-8 months) are no longer representative; current numbers are significantly higher.
- Prediction: Early-2026 leaks will show that revenue no longer lags at AI evaluation startups, confirming 2025 as the year of AI evaluation.
- Problem: GenAI evaluations often require domain expertise (e.g., validating a discounted cash flow spreadsheet). Traditional ML dealt with structured data; GenAI output is unstructured, requiring human-like quality measurement.
- Solution (Current): Hiring human experts for validation (e.g., Mercor provides experts at $50-$200/hour).
- Experts sit alongside multi-agent systems, performing "expensive human validation."
- Justification: High stakes (making/losing money, job loss if wrong) justify the cost.
- Future Outlook (5 years): What happens when this expert data is incorporated into the systems themselves?
- Competitive Moat: Dataset creation and environment creation are paramount in the eval space. Investing in high-quality competitive environments (e.g., a DCF environment; a minimal sketch follows) can be significant capex and a competitive advantage.
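For concreteness, a discounted cash flow is just a net-present-value formula, so a hand-built "DCF environment" can grade an agent's spreadsheet against exact ground truth. A minimal sketch; the grading function and tolerance are assumptions for illustration:

```python
def npv(cash_flows: list[float], rate: float) -> float:
    """Net present value: discount each future cash flow back to today."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows, start=1))

def grade_agent_dcf(agent_value: float, cash_flows: list[float], rate: float,
                    tol: float = 0.01) -> bool:
    """Pass/fail check an eval environment might run on an agent's output."""
    return abs(agent_value - npv(cash_flows, rate)) < tol

# Three years of $100 cash flows at a 10% discount rate ≈ $248.69.
assert grade_agent_dcf(248.685, [100, 100, 100], 0.10)
```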
- LLM-as-a-Judge Paradigm: Already used in practice.
- Challenges: Known biases (e.g., toward conciseness or helpfulness) relative to humans (per the speaker's ICLR paper).
- Benefit: Solves dataset creation to some extent; the LLM serves as a "poor man's version" of human judging.
- Crucial Need: Still requires validation to ensure it is not drifting in a "weird bias direction" (a minimal judge sketch follows).
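A minimal LLM-as-a-judge sketch using the OpenAI Python client. The model name and rubric are assumptions, and per the point above, scores should be spot-checked against human labels:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Rate the RESPONSE to the QUESTION for factual accuracy on a 1-5 scale.
Reply with only the integer score.

QUESTION: {question}
RESPONSE: {response}"""

def judge(question: str, response: str) -> int:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model works here
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, response=response),
        }],
        temperature=0,
    )
    # Judges are known to drift toward biases (e.g., rewarding concise answers),
    # so periodically validate a sample of scores against human experts.
    return int(completion.choices[0].message.content.strip())

print(judge("What is 2 + 2?", "4"))
```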
- Product: any-agent (open source, not monetized)
- Description: "LiteLLM for multi-agent systems."
- Functionality: Implements various multi-agent system frameworks under a unified interface (see the usage sketch below).
- Call to Action: Encourages anyone interested in multi-agent system frameworks to play around with it.
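A hedged usage sketch based on the mozilla-ai/any-agent README; the exact API (AnyAgent.create, AgentConfig, the framework and model identifiers) is an assumption and may differ across versions:

```python
from any_agent import AgentConfig, AnyAgent

# Swap the framework string ("tinyagent", "langchain", "openai", ...)
# without changing the rest of the code; that is the unified interface.
agent = AnyAgent.create(
    "tinyagent",
    AgentConfig(
        model_id="gpt-4.1-mini",  # assumption: any supported model id
        instructions="Answer concisely.",
    ),
)

agent_trace = agent.run("Which agent frameworks does any-agent wrap?")
print(agent_trace.final_output)
```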