Eval Engineering
Eval engineering is the discipline of designing and version-controlling automated test suites that rigorously quantify whether each model or prompt change is an improvement, acting like unit tests for AI systems.
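A minimal sketch of that idea using pytest, assuming a hypothetical `generate()` wrapper around the model under test and a small golden dataset committed to the repo:

```python
# eval_suite.py -- a version-controlled eval suite that runs like unit tests.
# Hypothetical names: generate() wraps the model/prompt under test, and
# golden/qa_cases.json is a golden dataset checked into version control.
import json

import pytest

from my_app import generate  # hypothetical wrapper around the current model + prompt

with open("golden/qa_cases.json") as f:
    CASES = json.load(f)  # e.g. [{"prompt": "...", "must_contain": "..."}, ...]

@pytest.mark.parametrize("case", CASES, ids=lambda c: c["prompt"][:40])
def test_output_contains_expected_fact(case):
    # Every model or prompt change must keep this suite green, like unit tests.
    output = generate(case["prompt"])
    assert case["must_contain"].lower() in output.lower()
```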
Conditional evals can save costs, especially in multi-step workflows: run cheap deterministic checks on every sample and invoke the expensive judge only when they pass (sketched after the next point).
Common eval approaches: LLM-as-a-judge, heuristic-based evals, and golden datasets.
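A sketch of that gating, combining a heuristic eval with an LLM judge; `llm_judge` is a hypothetical callable returning a 0-1 quality score:

```python
# conditional_eval.py -- cheap heuristic checks gate the expensive LLM judge,
# so judge calls (and their cost) only happen for outputs that pass the gates.
from typing import Callable

def heuristic_checks(output: str) -> bool:
    """Fast, deterministic gates that run on every sample essentially for free."""
    return bool(output.strip()) and len(output) < 4000 and "Traceback" not in output

def conditional_eval(output: str, llm_judge: Callable[[str], float]) -> float:
    if not heuristic_checks(output):
        return 0.0  # failed a cheap gate: no judge call, no cost
    return llm_judge(output)  # only surviving samples reach the paid judge
```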
In LoRA finetuning, we can build a pipeline that runs the test suite for a given set of hyperparameters and stores the adapter along with its eval results, then iterate over different values of rank, alpha, and target modules based on those results.
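A sketch of such a sweep; `train_lora()` and `run_eval_suite()` are hypothetical stand-ins for your training and eval code, and `save_pretrained()` follows the PEFT adapter-saving convention:

```python
# lora_sweep.py -- sweep rank/alpha/target modules, storing each adapter next
# to the eval results that justify (or rule out) that configuration.
import itertools
import json
from pathlib import Path

from my_training import train_lora, run_eval_suite  # hypothetical

RANKS = [8, 16, 32]
ALPHAS = [16, 32]
TARGET_MODULES = [("q_proj", "v_proj"), ("q_proj", "k_proj", "v_proj", "o_proj")]

for rank, alpha, targets in itertools.product(RANKS, ALPHAS, TARGET_MODULES):
    run_dir = Path(f"adapters/r{rank}_a{alpha}_{len(targets)}mods")
    run_dir.mkdir(parents=True, exist_ok=True)
    adapter = train_lora(rank=rank, alpha=alpha, target_modules=list(targets))
    scores = run_eval_suite(adapter)  # e.g. {"accuracy": 0.91, "faithfulness": 0.88}
    adapter.save_pretrained(run_dir)  # adapter and eval results live side by side
    (run_dir / "eval_results.json").write_text(json.dumps(scores, indent=2))
```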
Benchmarking refers to standardised comparison on predefined tasks; evaluation refers to overall model performance and suitability for the intended task.
Eval focus and example metrics by system type:

| System | Focus | Metrics |
| --- | --- | --- |
| RAG | Retrieval and context grounding | Context precision (RAGAS), faithfulness score |
| MCP systems | Structured doc inputs and complex flow evaluation | Field extraction accuracy, pipeline completion rate |
| Agents | Tool use and multi-step consistency | Function call validity rate, step consistency / reproducibility |
| Decision support | User-aligned, judgment-heavy outputs | Field extraction accuracy, pipeline completion rate |
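As a concrete instance of one metric from the table, a sketch of computing function call validity rate for an agent; the tool-call log format and the `TOOL_SCHEMAS` registry are assumptions:

```python
# metrics.py -- one metric from the table above: function call validity rate.
# Assumes each agent tool call is logged as {"name": ..., "args": {...}};
# TOOL_SCHEMAS is a hypothetical registry of required argument names per tool.
TOOL_SCHEMAS = {
    "search": {"query"},
    "get_weather": {"city"},
}

def is_valid_call(call: dict) -> bool:
    """Valid = the tool exists and every required argument is present."""
    required = TOOL_SCHEMAS.get(call.get("name"))
    return required is not None and required <= set(call.get("args", {}))

def function_call_validity_rate(calls: list[dict]) -> float:
    if not calls:
        return 0.0
    return sum(is_valid_call(c) for c in calls) / len(calls)

# Two of these three calls are valid, so the rate is ~0.67.
calls = [
    {"name": "search", "args": {"query": "eval engineering"}},
    {"name": "get_weather", "args": {}},  # missing the required "city" arg
    {"name": "get_weather", "args": {"city": "Pune"}},
]
print(round(function_call_validity_rate(calls), 2))  # 0.67
```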
- Use a higher-quality model for scoring, even if the prompt under test uses a cheaper model; scorers benefit from better reasoning and nuance.
- Treat scorers like judges: evaluate intent match, style accuracy, and overall output quality, not just correctness.
- Break scoring into multiple focused scorers (e.g., accuracy, creativity, formatting) to pinpoint issues (sketched after this list).
- Avoid overloading the scorer prompt with context; focus it on the relevant input and output for fair, consistent evaluation.
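A sketch of that multi-scorer pattern using the OpenAI Python client; the judge model name and the scorer prompts are illustrative:

```python
# scorers.py -- several focused LLM-as-judge scorers, each with a narrow
# prompt, run on a stronger model than the one that produced the output.
from openai import OpenAI

client = OpenAI()
JUDGE_MODEL = "gpt-4o"  # illustrative: a stronger model than the one under test

SCORER_PROMPTS = {
    "accuracy": "Rate 1-5 how factually accurate the answer is for the question.",
    "intent": "Rate 1-5 how well the answer matches the user's intent.",
    "formatting": "Rate 1-5 how well the answer follows the requested format.",
}

def run_scorers(question: str, answer: str) -> dict[str, int]:
    scores = {}
    for name, instruction in SCORER_PROMPTS.items():
        resp = client.chat.completions.create(
            model=JUDGE_MODEL,
            messages=[
                {"role": "system", "content": instruction + " Reply with the number only."},
                # Keep the judge focused: only the relevant input and output.
                {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
            ],
        )
        scores[name] = int(resp.choices[0].message.content.strip())
    return scores
```

Separate per-dimension scores make regressions easy to localize: a drop in "formatting" alone points at the output template rather than the model's reasoning.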