2025 is the Year of Evals! Just like 2024, and 2023, and … — John Dickerson, CEO Mozilla AI

Mind map for the YouTube video of the same name.

TL;DR: John Dickerson, CEO of Mozilla AI and co-founder of Arthur AI, argues that 2025 will finally be "the year of evaluations" for AI, after years in which evals remained a niche concern. He attributes the shift to a "perfect storm" of three concurrent factors: the launch of ChatGPT in late 2022, which made AI capabilities understandable and "wow-worthy" for non-technical C-suite executives (CEOs, CFOs, CISOs); a simultaneous enterprise budget freeze that paradoxically funneled the limited discretionary funds into GenAI "pet projects"; and the deployment of autonomous or semi-autonomous agentic AI systems that act on behalf of humans, introducing complexity and risk that demand quantitative evaluation. This confluence has produced unprecedented C-suite alignment on the need for AI evaluation, driving hockey-stick growth for evaluation companies and shifting the focus from individual models to monitoring entire multi-agent systems.


Information Mind Map

🧠 2025: The Year of AI Evals (Finally)

👤 Speaker Introduction & Background

  • Speaker: John Dickerson, CEO Mozilla AI (formerly Co-founder & Chief Scientist at Arthur AI)
  • Mozilla AI Mission: Open-source AI tooling & stack, enabling the open-source community to engage with leaders like Sam Altman.
  • Arthur AI Experience: Six years in observability, evaluation, and security across:
    • Traditional ML
    • Deep Learning Revolution
    • GenAI Revolution
    • Agentic Revolution
  • Current Outlook: Companies in this space are now seeing "hockey stick growth."

💡 Core Thesis: AI/ML Monitoring & Evaluation as Two Sides of the Same Sword

  • Interdependence: Cannot do monitoring/observability without measurement; measurement is core to evaluation.
  • Historical Context (Pre-2025):
    • Evaluation was not top of mind for the C-suite (CEO, CFO, CISO).
    • ML models often "spit out numbers" ingested into larger, opaque systems.
    • The complexity of these systems "erased" the top-of-mind need for model-level evaluation for decision-makers.
    • ML monitoring has existed since ~2012 (e.g., H2O, Algorithmia, Seldon, WhyLabs, Aporia, Arize, Arthur, Galileo, Fiddler, Protect AI, Snowflake, Databricks, Datadog, SageMaker, Vertex AI).
    • Challenge: Tenuous connection to downstream business KPIs (dollars saved/earned). Selling was primarily into the CIO.
    • Common Pitch Deck Slide: "This is the year a CEO gets fired due to an ML screw-up" – never happened to speaker's knowledge.
    • Example (JPMC): Jamie Dimon's 2022 report showed "comically small" AI/ML spend ($100M from 2017-2021 in the consumer business).

📈 The Catalyst: A "Perfect Storm" for Evaluation

  • Three Concurrent Factors (The Triangle):
    1. AI Became Understandable to C-Suite:
      • Event: ChatGPT launch (November 30, 2022).
      • Impact: CEOs, CFOs, CISOs (less technical) could interact easily with a UI and be "wowed" by AI.
      • Shifted AI from a technical problem to a business opportunity/risk.
    2. Perfectly Timed Budget Freeze:
      • Context: Deep fears of impending recession (late 2022, before ChatGPT launch).
      • Effect: Most enterprises froze or shrunk IT budgets for 2023.
      • Exception: Discretionary budget was unlocked specifically for GenAI as the "CEO's pet project."
    3. Emergence of Agentic Systems:
      • Definition: Systems now acting for humans/teams (autonomously or semi-autonomously), not just providing inputs.
      • Consequence: Introduces significant complexity and risk, forcing evaluation to be top-of-mind.
  • Result: Big takeoff for evaluation companies (e.g., Braintrust, Arthur, Arize AI, Galileo).

⏳ Evolution of AI Adoption & Evaluation Need (2023-2025 Timeline)

  • 2023: The Year of GenAI Science Projects
    • Austerity due to frozen budgets.
    • The only newly allocated money went to GenAI.
    • Focus on GenAI as a "cool technology."
  • 2024: GenAI Applications in Production
    • Internal chat apps, hiring tools.
    • Business-suited folks (non-ML/data science) began asking about:
      • ROI
      • Governance
      • Risk
      • Compliance
      • Brand optics
    • Closer to C-suite caring about evaluation (e.g., quantitative risk estimates).
  • 2025: Shipping & Scaling of AI/Agentic Systems
    • Scale-ups and revenue growth for frontier model providers and evaluation companies.
    • C-suite comfortable allocating large, real budgets for AI.
    • Technology has become "amazing."
    • Community, open source, venture capital, big tech are all heavily invested.
    • Crucially, ML systems are moving toward autonomy.

🤖 The Agentic Revolution

  • 2025: Clearly the "Year of the Agent."
  • Agent Definition (since 1950s):
    • Perceive environment.
    • Learn.
    • Abstract & Generalize.
    • Key Difference from Traditional ML: Reason and Act (see the loop sketch after this list).
  • Impact: Introduces significant complexity and risk into systems, which is positive for evaluation companies.
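
To make the perceive/reason/act loop concrete, here is a minimal sketch; everything in it (the `Agent` class, the toy environment) is illustrative, not from the talk. The point it demonstrates is the key difference above: unlike a traditional ML model that only emits predictions, an agent closes the loop by acting on its environment.

```python
# Hypothetical sketch of the perceive -> reason -> act loop that
# distinguishes agents from traditional predict-only ML models.
from dataclasses import dataclass, field

@dataclass
class Agent:
    memory: list = field(default_factory=list)  # abstraction/generalization store

    def perceive(self, environment: dict) -> dict:
        # Reduce raw environment state to an observation.
        return {"observation": environment.get("state")}

    def reason(self, observation: dict) -> str:
        # Decide on an action; a real agent would call a planner or an LLM here.
        self.memory.append(observation)
        return "act" if observation["observation"] else "wait"

    def act(self, action: str, environment: dict) -> dict:
        # Apply the action back to the environment: the step traditional ML lacks.
        environment["last_action"] = action
        return environment

env = {"state": "order_pending"}
agent = Agent()
env = agent.act(agent.reason(agent.perceive(env)), env)
print(env["last_action"])  # -> "act"
```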

🤝 C-Suite Alignment on Evaluation

  • Core Need: Connecting product value to downstream business KPIs (risk mitigation, revenue gains, cost reduction).
  • Evaluation's Role: Quantifying things for these KPIs.
  • C-Suite Roles & Their Buy-in:
    • CEO: Understands GenAI and agentic system capabilities; comfortable talking to experts, allocating budget, and reporting to board/shareholders.
    • CFO: Cares about bottom line; needs quantitative evaluation for allocation and budget planning (Excel spreadsheets).
    • CISO: Sees AI as a huge security risk and opportunity (e.g., hallucination detection, prompt injection). More willing to write smaller checks for startups (less overhead/process than CIO).
    • CIO: Remains on board, wants to keep job.
    • CTO: Demands standards and data-driven decisions; numbers from evaluation are crucial.
  • Overall: All key C-suite members are now aligned on the need to understand AI evaluation.

📈 Market Impact & Future Outlook

  • Industry Shift: All evaluation, observability, monitoring, and security companies have shifted to multi-agent systems monitoring.
    • Principle: Monitor the whole system, not just individual models (see the tracing sketch after this list).
  • Revenue Growth: Revenue figures for evaluation startups (e.g., Glean, Galileo, Braintrust) leaked in mid-April lag reality by 6-8 months and are no longer representative; current numbers are significantly higher.
  • Prediction: Revenue leaks in early 2026 (reflecting 2025 performance) will confirm 2025 as the year of AI evaluation.
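
A small sketch of the "monitor the whole system" principle referenced above: wrap every agent step in a trace span so latency and failures can be attributed across the full multi-agent run rather than to a single model call. The decorator and the toy agents are hypothetical, not any vendor's API.

```python
# Hypothetical whole-system tracing: one trace per run, one span per agent step.
import time
import uuid

def traced(step_name, trace):
    def wrap(fn):
        def inner(*args, **kwargs):
            span = {"id": uuid.uuid4().hex[:8], "step": step_name, "start": time.time()}
            try:
                result = fn(*args, **kwargs)
                span["status"] = "ok"
                return result
            except Exception as exc:
                span["status"] = f"error: {exc}"
                raise
            finally:
                span["end"] = time.time()
                trace.append(span)
        return inner
    return wrap

trace = []  # one trace object for the whole multi-agent run

@traced("research_agent", trace)
def research(query):
    return f"notes on {query}"

@traced("writer_agent", trace)
def write(notes):
    return f"report from {notes}"

write(research("Q3 churn"))
for span in trace:  # system-level view: every step, not just one model
    print(span["step"], span["status"], round(span["end"] - span["start"], 4))
```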

❓ Q&A

Q1: Domain Expertise in GenAI Evaluation (e.g., Financial Analysis)

  • Problem: GenAI evaluation often requires domain expertise (e.g., validating a discounted-cash-flow spreadsheet). Traditional ML dealt in structured data; GenAI output is unstructured, requiring human-like quality judgment.
  • Solution (Current): Hiring human experts for validation (e.g., Mercor provides experts at $50-$200/hour).
    • Experts sit alongside multi-agent systems, performing "expensive human validation" (a routing sketch follows this Q&A item).
    • Justification: High stakes (make/lose money, job loss if wrong) justify the cost.
  • Future Outlook (5 years): What happens when this expert data is incorporated into the systems themselves?
    • Competitive Moat: Dataset creation and environment creation are paramount in the eval space. Investing in high-quality evaluation environments (e.g., a DCF environment) is significant capex and a competitive advantage.
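
A minimal sketch of the validation gate this answer describes: high-stakes or low-confidence outputs are routed to a human expert queue instead of shipping directly. The task names and the 0.9 confidence threshold are assumptions for illustration.

```python
# Hypothetical human-in-the-loop gate for high-stakes GenAI outputs.
HIGH_STAKES = {"dcf_model", "credit_decision"}  # illustrative task types

def needs_expert_review(task_type: str, model_confidence: float) -> bool:
    # Anything high-stakes or low-confidence goes to a paid domain expert.
    return task_type in HIGH_STAKES or model_confidence < 0.9

review_queue = []  # items awaiting "expensive human validation"

def route(output: dict) -> str:
    if needs_expert_review(output["task"], output["confidence"]):
        review_queue.append(output)  # expert validates (e.g., at $50-$200/hour)
        return "pending_expert_review"
    return "auto_approved"

print(route({"task": "dcf_model", "confidence": 0.97}))        # -> pending_expert_review
print(route({"task": "meeting_summary", "confidence": 0.95}))  # -> auto_approved
```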

Q2: Timeline for LLM-Driven Evaluation

  • LLM-as-a-Judge Paradigm: Already used in practice.
  • Challenges: Known biases (e.g., toward conciseness and "helpfulness") relative to human judges, per the speaker's ICLR paper.
  • Benefit: Partially solves dataset creation; the LLM acts as a "poor man's version" of human judging.
  • Crucial Need: Still requires validation to ensure the judge isn't drifting in a "weird bias direction" (see the sketch below).
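
A bare-bones LLM-as-a-judge sketch consistent with the caveats above. `call_llm` is a hypothetical stand-in for whatever completion API you use, and the rubric explicitly tells the judge to ignore length and tone to counter the conciseness bias mentioned in the answer.

```python
# Hypothetical LLM-as-a-judge: score an answer with a rubric prompt.
JUDGE_PROMPT = """You are grading an answer for factual accuracy only.
Ignore style, length, and tone; do not reward concision for its own sake.
Question: {question}
Answer: {answer}
Reply with a single integer from 1 (wrong) to 5 (fully correct)."""

def call_llm(prompt: str) -> str:
    # Stub stand-in; replace with a real completion call to your provider.
    return "4"

def judge(question: str, answer: str) -> int:
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(raw.strip())

print(judge("When did ChatGPT launch?", "November 30, 2022."))  # stubbed -> 4
```

As the answer stresses, judge scores should still be spot-checked against human labels to catch drift toward a "weird bias direction."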

🌐 Mozilla AI Offering

  • Product: any-agent (open source, not monetized)
  • Description: "A LiteLLM for multi-agent systems."
  • Functionality: Implements various multi-agent system frameworks under a unified interface.
  • Call to Action: Try it out if you're interested in multi-agent system frameworks (a usage sketch follows).
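
For the curious, a usage sketch in the spirit of the project's README. The class and parameter names here (`AnyAgent.create`, `AgentConfig`, `model_id`) are my recollection of the library's pattern, not verified against the current release; treat them as assumptions and check the mozilla-ai/any-agent repository.

```python
# Assumed any-agent API (unverified): one config, swappable agent frameworks.
from any_agent import AnyAgent, AgentConfig

# The first argument selects the underlying framework; swapping it
# (e.g., "tinyagent" -> "langchain") should not require code changes.
agent = AnyAgent.create(
    "tinyagent",
    AgentConfig(model_id="gpt-4.1-mini"),
)

trace = agent.run("Summarize why 2025 is the year of evals.")
print(trace.final_output)
```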