Last active
August 9, 2025 23:26
LLM Evals Are Just Tests - Code Examples
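| // Single-run eval written as an ordinary test: generate once, score with the judge, assert on the score. | |
| it("detects when themes haven't been added", async () => { | |
|   const response = await generateAIResponse(context) | |
|   const evaluation = await evaluatePendingCommandAwareness(response) | |
|   expect(evaluation.score).toBeGreaterThan(0.85) | |
| }) |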
| - name: Download Main Branch Scorecards | |
|   uses: dawidd6/action-download-artifact@v3 | |
|   with: | |
|     workflow: evals2-tests.yml | |
|     branch: main | |
|     name: eval-scorecards | |
|     path: scorecards-main/ |
| // Run the same eval 10 times in parallel; each run passes only if every criterion holds. | |
| const results = await Promise.all( | |
|   Array.from({ length: 10 }, async () => { | |
|     const response = await generateAIResponse(context) | |
|     const evaluation = await evaluatePendingCommandAwareness(response) | |
|     return evaluation.hasGuidanceText && | |
|       !evaluation.hasStartResearchCommand && | |
|       evaluation.mentionsPendingThemes && | |
|       evaluation.offersNextSteps | |
|   }) | |
| ) | |
| // Fraction of passing runs; the test fails unless it exceeds 85%. | |
| const successRate = results.filter(r => r).length / results.length | |
| expect(successRate).toBeGreaterThan(0.85) |
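The repeated-run pattern above can be factored into a small reusable helper. This is only a sketch: `measureSuccessRate` and its parameters are hypothetical names, not part of the original gist, and it assumes each run can be reduced to a single boolean.

```typescript
// Hypothetical helper (not from the original gist): run an async pass/fail
// check `trials` times in parallel and return the fraction that passed.
async function measureSuccessRate(
  trials: number,
  check: () => Promise<boolean>,
): Promise<number> {
  if (!Number.isInteger(trials) || trials <= 0) {
    throw new RangeError("trials must be a positive integer")
  }
  // Start every trial up front, then wait for all of them to settle.
  const results = await Promise.all(
    Array.from({ length: trials }, () => check()),
  )
  return results.filter(Boolean).length / trials
}
```

Inside a test, the result could be asserted the same way as before, for example `expect(await measureSuccessRate(10, runOnce)).toBeGreaterThan(0.85)`, where `runOnce` stands in for the generate-and-evaluate step.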
| const { object } = await generateObject({ | |
|   model: googleAI("gemini-2.5-pro"), | |
|   schema: z.object({ | |
|     hasGuidanceText: z.boolean(), | |
|     mentionsPendingThemes: z.boolean(), | |
|     // ... other criteria | |
|   }), | |
|   prompt: `Evaluate this response: "${response}"...` | |
| }) |
| const start = performance.now() | |
| const result = await fn() | |
| const duration = performance.now() - start |
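The three timing lines above could be wrapped so any async step reports its duration alongside its result. A minimal sketch, assuming a runtime where `performance` is a global (modern Node.js or a browser); `timed` and `Timed` are hypothetical names.

```typescript
// Hypothetical wrapper around the timing pattern: returns the function's
// result together with the elapsed wall-clock time in milliseconds.
interface Timed<T> {
  result: T
  durationMs: number
}

async function timed<T>(fn: () => Promise<T>): Promise<Timed<T>> {
  const start = performance.now()
  const result = await fn()
  return { result, durationMs: performance.now() - start }
}
```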
| <!-- EVALS2_SCORECARD --> | |
| <details> | |
| <summary><b>🟢 Eval Scorecards: 10/10 passed</b></summary> | |
| Test Suite / Metric                           Current   Main      Δ │ | |
| ────────────────────────────────────────────────────────────────────── | |
| Available Actions | |
| Start Research After Execution                   100%    90%   +10% │ 🟢 | |
| Start Research Includes Leaf Themes              100%   100%     0% │ ⚪ | |
| Start Research Excludes Parent Themes            100%   100%     0% │ ⚪ | |
| Start Research Uses Current Theme Context         96%    96%    +1% │ 🟢 | |
| Start Research Avoids Find Stocks Phrase         100%   100%     0% │ ⚪ | |
| Start Research Generates Valid Command           100%   100%     0% │ ⚪ | |
| Command Generation | |
| First response accuracy (command -> command)     100%   100%     0% │ ⚪ | |
| Second response accuracy (TL;DR -> analysis)     100%   100%     0% │ ⚪ | |
| Overall accuracy (both correct)                  100%   100%     0% │ ⚪ | |
| Context Awareness | |
| Pending Command Awareness                        100%   100%     0% │ ⚪ | |
| User Override                                    100%   100%     0% │ ⚪ | |
| Continuation Responses | |
| AddTheme Continuation                            100%   100%     0% │ ⚪ | |
| StartResearch Continuation                       100%   100%     0% │ ⚪ | |
| Research After Theme                             100%   100%     0% │ ⚪ | |
| First Interaction | |
| Success rate                                     100%   100%     0% │ ⚪ | |
| StartResearch Guard Rails | |
| Avoids False Analysis Claims                     100%   100%     0% │ ⚪ | |
| Explains Thematic Matching Only                  100%   100%     0% │ ⚪ | |
| No Targeted Analysis Promises                    100%   100%     0% │ ⚪ | |
| Provides Honest Limitations                      100%   100%     0% │ ⚪ | |
| Stock Defense Quality | |
| Information density                               88%    74%   +14% │ 🟢 | |
| Institutional credibility                         95%    95%    +0% │ 🟢 | |
| Content quality                                   97%    96%    +1% │ 🟢 | |
| Evidence prioritization                           89%    84%    +5% │ 🟢 | |
| Protocol URL Correctness                         100%   100%     0% │ ⚪ | |
| Protocol URL Usage                               100%   100%     0% │ ⚪ | |
| Theme Analysis Quality | |
| Overall quality score                            100%    73%   +27% │ 🟢 | |
| Human-friendly quality                            63%    58%    +5% │ 🟢 | |
| Number translation judgment                       96%    73%   +23% │ 🟢 | |
| References analysis                              100%    78%   +22% │ 🟢 | |
| Theme Generation Quality | |
| Overall quality score                             91%    96%    -5% │ 🔴 | |
| Reasoning quality score                           91%    95%    -4% │ 🔴 | |
| Variance score (avoiding rigid counts)            13%    10%    +3% │ 🟢 | |
| ────────────────────────────────────────────────────────────────────── | |
| Updated: 2025-08-09T04:40:28.297Z | |
| </details> |
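Each scorecard row compares the current value with the one from main and shows a signed delta plus a status indicator. The sketch below is one way such a row could be derived; it is an assumption inferred from the rows above (for instance a row showing 95% against 95% with a `+0%` delta marked as improved suggests the status follows the unrounded difference), not the gist author's implementation. `compareMetric`, `Comparison`, and `Status` are hypothetical names.

```typescript
// Hypothetical comparison of one metric against the main branch.
type Status = "improved" | "regressed" | "unchanged"

interface Comparison {
  delta: number  // current minus main, in percentage points, unrounded
  status: Status // decided by the sign of the unrounded delta
  label: string  // rounded, signed text for display, e.g. "+10%"
}

function compareMetric(current: number, main: number): Comparison {
  const delta = current - main
  const status: Status =
    delta > 0 ? "improved" : delta < 0 ? "regressed" : "unchanged"
  // Round the magnitude so positive and negative deltas round symmetrically.
  const rounded = Math.round(Math.abs(delta))
  const sign = delta > 0 ? "+" : delta < 0 ? "-" : ""
  return { delta, status, label: `${sign}${rounded}%` }
}
```

Deciding the status from the unrounded delta means a small real change is still flagged even when the displayed percentages look identical.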
| await writeScorecard("pending-command-awareness", { | |
|   title: "Pending Command Awareness", | |
|   metrics: [{ | |
|     name: "Success Rate", | |
|     value: successRate * 100, | |
|     threshold: 85, | |
|     unit: "%", | |
|     passed: successRate > 0.85 | |
|   }], | |
|   overallPassed: successRate > 0.85, | |
|   timestamp: new Date().toISOString() | |
| }) |
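The scorecard object above repeats the threshold comparison in three places. A small builder could derive `passed` and `overallPassed` from each metric's `value` and `threshold` instead. This is a sketch only: the field names mirror the object passed to `writeScorecard`, but `buildScorecard` and the two interfaces are hypothetical, and it assumes a metric passes when its value is strictly greater than its threshold, as in the original comparison.

```typescript
// Hypothetical types mirroring the fields used in the scorecard object.
interface Metric {
  name: string
  value: number
  threshold: number
  unit: string
  passed: boolean
}

interface Scorecard {
  title: string
  metrics: Metric[]
  overallPassed: boolean
  timestamp: string
}

// Derive pass/fail flags from value and threshold so they cannot drift apart.
function buildScorecard(
  title: string,
  inputs: Array<Omit<Metric, "passed">>,
  now: Date = new Date(),
): Scorecard {
  const metrics = inputs.map((m) => ({ ...m, passed: m.value > m.threshold }))
  return {
    title,
    metrics,
    overallPassed: metrics.every((m) => m.passed),
    timestamp: now.toISOString(),
  }
}
```

The result could then be handed to the writer, for example `await writeScorecard("pending-command-awareness", buildScorecard("Pending Command Awareness", [{ name: "Success Rate", value: successRate * 100, threshold: 85, unit: "%" }]))`.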