@camwest
Last active August 9, 2025 23:26
LLM Evals Are Just Tests - Code Examples
```ts
it("detects when themes haven't been added", async () => {
  const response = await generateAIResponse(context)
  // `eval` is a reserved word in strict-mode JS/TS, so name the result explicitly
  const evaluation = await evaluatePendingCommandAwareness(response)
  expect(evaluation.score).toBeGreaterThan(0.85)
})
```
```yaml
- name: Download Main Branch Scorecards
  uses: dawidd6/action-download-artifact@v3
  with:
    workflow: evals2-tests.yml
    branch: main
    name: eval-scorecards
    path: scorecards-main/
```
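Once the main-branch scorecards are downloaded, the current run can be diffed against them to produce the Ξ” and trend columns shown in the PR comment. A minimal sketch of that comparison, with an assumed `Metric` shape (hypothetical, not the gist's actual types):

```typescript
// Hypothetical comparison step: given a metric's value on the current branch
// and on main, compute the delta string and trend marker for the scorecard.
// The Metric shape is an assumption, not the gist author's exact type.
interface Metric { name: string; value: number; threshold: number }

function compareMetric(current: Metric, mainValue: number) {
  const delta = current.value - mainValue
  const sign = delta > 0 ? "+" : ""
  return {
    delta: `${sign}${delta}%`,
    passed: current.value >= current.threshold,
    trend: delta > 0 ? "🟒" : delta < 0 ? "πŸ”΄" : "βšͺ",
  }
}

const row = compareMetric({ name: "Success Rate", value: 100, threshold: 85 }, 90)
console.log(row) // { delta: "+10%", passed: true, trend: "🟒" }
```

A metric can pass its absolute threshold while still regressing against main, which is why the scorecard tracks both a βœ…/❌ status and a separate trend marker.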
```ts
const results = await Promise.all(
  Array.from({ length: 10 }, async () => {
    const response = await generateAIResponse(context)
    const evaluation = await evaluatePendingCommandAwareness(response)
    return evaluation.hasGuidanceText &&
      !evaluation.hasStartResearchCommand &&
      evaluation.mentionsPendingThemes &&
      evaluation.offersNextSteps
  })
)

const successRate = results.filter(r => r).length / results.length
expect(successRate).toBeGreaterThan(0.85)
```
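The sampling pattern above generalizes to a small helper: run any async boolean check N times and return the fraction that passed. This is a sketch with hypothetical names (`successRate`, `demo`); the deterministic stub stands in for a real LLM call, which is the part that makes the check non-deterministic in practice.

```typescript
// Hypothetical helper generalizing the sampling pattern above: run an async
// boolean check `n` times concurrently and return the pass fraction.
async function successRate(
  check: () => Promise<boolean>,
  n = 10
): Promise<number> {
  const results = await Promise.all(
    Array.from({ length: n }, () => check())
  )
  return results.filter(Boolean).length / results.length
}

// Usage with a deterministic stub in place of a real LLM call:
async function demo() {
  let i = 0
  const rate = await successRate(async () => (i++ % 10) !== 0, 10)
  console.log(rate) // 9 of the 10 stubbed checks pass, so 0.9
  return rate
}
```

Thresholding the rate (rather than asserting on a single run) is what turns a flaky LLM behavior into a stable pass/fail signal.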
```ts
const { object } = await generateObject({
  model: googleAI("gemini-2.5-pro"),
  schema: z.object({
    hasGuidanceText: z.boolean(),
    mentionsPendingThemes: z.boolean(),
    // ... other criteria
  }),
  prompt: `Evaluate this response: "${response}"...`
})
```
```ts
const start = performance.now()
const result = await fn()
const duration = performance.now() - start
```
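The timing pattern above can be wrapped in a reusable helper that returns both the result and the elapsed milliseconds. `timed` and `demoTimed` are hypothetical names for this sketch, not functions from the gist:

```typescript
// Hypothetical wrapper around the timing pattern above: measure any async
// function and return both its result and the elapsed wall-clock milliseconds.
async function timed<T>(
  fn: () => Promise<T>
): Promise<{ result: T; duration: number }> {
  const start = performance.now()
  const result = await fn()
  const duration = performance.now() - start
  return { result, duration }
}

// Usage with a trivial stub in place of a real eval call:
async function demoTimed() {
  const { result, duration } = await timed(async () => 42)
  console.log(result, duration)
  return { result, duration }
}
```

Capturing duration alongside the eval result makes it easy to record latency as one more scorecard metric.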
<!-- EVALS2_SCORECARD -->
<details>
<summary><b>🟒 Eval Scorecards: 10/10 passed</b></summary>

| Test Suite / Metric | Current | Main | Ξ” | Status |
| --- | ---: | ---: | ---: | :---: |
| **Available Actions** | | | | |
| Start Research After Execution | 100% | 90% | +10% | βœ… 🟒 |
| Start Research Includes Leaf Themes | 100% | 100% | 0% | βœ… βšͺ |
| Start Research Excludes Parent Themes | 100% | 100% | 0% | βœ… βšͺ |
| Start Research Uses Current Theme Context | 96% | 96% | +1% | βœ… 🟒 |
| Start Research Avoids Find Stocks Phrase | 100% | 100% | 0% | βœ… βšͺ |
| Start Research Generates Valid Command | 100% | 100% | 0% | βœ… βšͺ |
| **Command Generation** | | | | |
| First response accuracy (command -> command) | 100% | 100% | 0% | βœ… βšͺ |
| Second response accuracy (TL;DR -> analysis) | 100% | 100% | 0% | βœ… βšͺ |
| Overall accuracy (both correct) | 100% | 100% | 0% | βœ… βšͺ |
| **Context Awareness** | | | | |
| Pending Command Awareness | 100% | 100% | 0% | βœ… βšͺ |
| User Override | 100% | 100% | 0% | βœ… βšͺ |
| **Continuation Responses** | | | | |
| AddTheme Continuation | 100% | 100% | 0% | βœ… βšͺ |
| StartResearch Continuation | 100% | 100% | 0% | βœ… βšͺ |
| Research After Theme | 100% | 100% | 0% | βœ… βšͺ |
| **First Interaction** | | | | |
| Success rate | 100% | 100% | 0% | βœ… βšͺ |
| **StartResearch Guard Rails** | | | | |
| Avoids False Analysis Claims | 100% | 100% | 0% | βœ… βšͺ |
| Explains Thematic Matching Only | 100% | 100% | 0% | βœ… βšͺ |
| No Targeted Analysis Promises | 100% | 100% | 0% | βœ… βšͺ |
| Provides Honest Limitations | 100% | 100% | 0% | βœ… βšͺ |
| **Stock Defense Quality** | | | | |
| Information density | 88% | 74% | +14% | βœ… 🟒 |
| Institutional credibility | 95% | 95% | +0% | βœ… 🟒 |
| Content quality | 97% | 96% | +1% | βœ… 🟒 |
| Evidence prioritization | 89% | 84% | +5% | βœ… 🟒 |
| Protocol URL Correctness | 100% | 100% | 0% | βœ… βšͺ |
| Protocol URL Usage | 100% | 100% | 0% | βœ… βšͺ |
| **Theme Analysis Quality** | | | | |
| Overall quality score | 100% | 73% | +27% | βœ… 🟒 |
| Human-friendly quality | 63% | 58% | +5% | βœ… 🟒 |
| Number translation judgment | 96% | 73% | +23% | βœ… 🟒 |
| References analysis | 100% | 78% | +22% | βœ… 🟒 |
| **Theme Generation Quality** | | | | |
| Overall quality score | 91% | 96% | -5% | βœ… πŸ”΄ |
| Reasoning quality score | 91% | 95% | -4% | βœ… πŸ”΄ |
| Variance score (avoiding rigid counts) | 13% | 10% | +3% | βœ… 🟒 |

Updated: 2025-08-09T04:40:28.297Z
</details>
```ts
await writeScorecard("pending-command-awareness", {
  title: "Pending Command Awareness",
  metrics: [{
    name: "Success Rate",
    value: successRate * 100,
    threshold: 85,
    unit: "%",
    passed: successRate > 0.85
  }],
  overallPassed: successRate > 0.85,
  timestamp: new Date().toISOString()
})
```
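A scorecard written this way presumably feeds the PR-comment table. A sketch of rendering one metric into a plain-text row, with the metric shape taken from the `writeScorecard` call above (the `renderMetricRow` helper itself is hypothetical):

```typescript
// Hypothetical rendering step: turn one scorecard metric plus its main-branch
// value into a plain-text row like the ones in the comment table.
interface ScorecardMetric {
  name: string
  value: number
  threshold: number
  unit: string
  passed: boolean
}

function renderMetricRow(m: ScorecardMetric, mainValue: number): string {
  const delta = m.value - mainValue
  const deltaStr = `${delta > 0 ? "+" : ""}${delta}${m.unit}`
  const status = m.passed ? "βœ…" : "❌"
  // Pad the name so the numeric columns line up in a monospace comment body
  return `${m.name.padEnd(45)} ${m.value}${m.unit}  ${mainValue}${m.unit}  ${deltaStr}  ${status}`
}

const line = renderMetricRow(
  { name: "Success Rate", value: 100, threshold: 85, unit: "%", passed: true },
  90
)
console.log(line)
```

Keeping the scorecard as structured JSON and rendering the table only at comment time means the same artifact can drive both the PR comment and the main-branch baseline download shown earlier.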