Eval Engineering
Eval engineering is the discipline of designing and version-controlling automated test suites that rigorously quantify whether each model or prompt change is an improvement, acting like unit tests for AI systems.
Conditional evals, which run an expensive check only when a cheaper condition is met, can save costs, especially in multi-step workflows.
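One way to read "conditional" here is that a cheap heuristic gates a more expensive judge call. The sketch below assumes that reading. The names `cheap_check`, `conditional_eval`, and `fake_judge` are hypothetical, the gating rule is illustrative, and the judge is a stand-in function rather than a real LLM call.

```python
from typing import Callable

def cheap_check(output: str) -> bool:
    """Heuristic gate: non-empty and under a length limit (illustrative rule)."""
    return bool(output.strip()) and len(output) <= 500

def conditional_eval(output: str, expensive_judge: Callable[[str], float]) -> dict:
    """Run the expensive judge only if the cheap check passes."""
    if not cheap_check(output):
        # Fail fast: no judge call, so no judge cost.
        return {"passed_gate": False, "score": 0.0, "judge_called": False}
    return {"passed_gate": True, "score": expensive_judge(output), "judge_called": True}

# Stand-in for an LLM judge; it counts calls so the savings are visible.
calls = {"n": 0}

def fake_judge(output: str) -> float:
    calls["n"] += 1
    return 1.0 if "refund" in output.lower() else 0.5

outputs = ["", "Your refund is on the way.", "   "]
results = [conditional_eval(o, fake_judge) for o in outputs]
print(calls["n"], "judge call(s) for", len(outputs), "outputs")
```

Only the output that passes the gate reaches the judge, so the two blank outputs cost nothing to score.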
Common eval approaches: LLM-as-a-judge, heuristic-based evals, and golden datasets.
In LoRA fine-tuning, we can build a pipeline that runs the test suite for each set of hyperparameters and stores the adapter along with its eval results. We can then iterate over different values of rank, alpha, and target modules based on those results.
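A rough sketch of such a sweep, assuming each configuration gets its own directory that holds the adapter and a JSON file with the eval results. `train_adapter` and `run_eval_suite` are hypothetical placeholders, not calls into a real training library, and the module names and dummy score are illustrative only.

```python
import itertools
import json
import tempfile
from pathlib import Path

def train_adapter(rank, alpha, target_modules, out_dir):
    """Placeholder for a real LoRA training run; writes an empty adapter file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "adapter.bin").write_bytes(b"")  # a real run would save weights here

def run_eval_suite(adapter_dir):
    """Placeholder eval suite; a real one would load the adapter and score it."""
    return {"accuracy": 0.0}  # dummy value, not a real result

def sweep(root, ranks, alphas, module_sets):
    """Train and evaluate every combination, storing results next to each adapter."""
    records = []
    for rank, alpha, modules in itertools.product(ranks, alphas, module_sets):
        run_dir = root / f"r{rank}_a{alpha}_{'-'.join(modules)}"
        train_adapter(rank, alpha, modules, run_dir)
        config = {"rank": rank, "alpha": alpha, "target_modules": list(modules)}
        record = {"config": config, "eval": run_eval_suite(run_dir)}
        (run_dir / "eval_results.json").write_text(json.dumps(record, indent=2))
        records.append(record)
    return records

with tempfile.TemporaryDirectory() as tmp:
    records = sweep(Path(tmp), ranks=[8, 16], alphas=[16, 32],
                    module_sets=[("q_proj", "v_proj")])
    saved = sorted(p.parent.name for p in Path(tmp).rglob("eval_results.json"))
```

Keeping the config and the eval results in the same record makes it straightforward to rank runs later and pick the next values to try.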
Benchmarking refers to a standardised comparison of models on predefined tasks. Evaluation refers to a model's overall performance and its suitability for the intended task.
RAG
Focused on retrieval and context grounding
Metrics:
Context Precision (RAGAS)
Faithfulness Score
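To make the two RAG metrics concrete, here is a small sketch that assumes the RAGAS-style definitions: context precision as the mean of precision@k over the ranks that hold a relevant chunk, and faithfulness as the share of response claims supported by the retrieved context. In practice an LLM usually produces the relevance and support labels; here they are passed in directly, and the function names are hypothetical.

```python
def context_precision(relevance: list[int]) -> float:
    """relevance[i] is 1 if the chunk at rank i+1 is relevant, else 0."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    weighted = 0.0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        weighted += (hits / k) * rel  # precision@k counts only at relevant ranks
    return weighted / total_relevant

def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of claims in the answer that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)

cp = context_precision([1, 0, 1])               # relevant chunks at ranks 1 and 3
fs = faithfulness([True, True, False, True])    # 3 of 4 claims supported
```

Context precision rewards placing relevant chunks early: the same two relevant chunks at ranks 1 and 2 would score 1.0 instead of roughly 0.83.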
MCP Systems
Structured doc inputs and complex flow eval
Metrics:
Field extraction accuracy
Pipeline completion rate
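The notes do not define these two metrics, so the sketch below assumes simple definitions: field extraction accuracy as the fraction of expected fields whose extracted value matches exactly, and pipeline completion rate as the fraction of runs that reach a final "completed" status. Field names, status strings, and function names are hypothetical.

```python
def field_extraction_accuracy(predicted: dict, expected: dict) -> float:
    """Fraction of expected fields whose predicted value matches exactly."""
    if not expected:
        return 0.0
    correct = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return correct / len(expected)

def pipeline_completion_rate(run_statuses: list[str]) -> float:
    """Fraction of runs that ended with the status 'completed'."""
    if not run_statuses:
        return 0.0
    return run_statuses.count("completed") / len(run_statuses)

expected = {"invoice_id": "INV-1", "total": "42.00", "currency": "USD"}
predicted = {"invoice_id": "INV-1", "total": "42.00", "currency": "EUR"}
accuracy = field_extraction_accuracy(predicted, expected)  # 2 of 3 fields match

completion = pipeline_completion_rate(["completed", "failed", "completed", "completed"])
```

Exact matching is the strictest choice; a real suite might normalise values (dates, currency formats) before comparing.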
Agents
Tool use and multi-step consistency
Metrics:
Function call validity rate
Step consistency / reproducibility
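A possible way to compute both agent metrics, assuming tool calls arrive as JSON strings with `name` and `arguments` keys and that consistency is measured as the share of repeated runs that follow the most common tool sequence. The `TOOLS` registry and all function names are hypothetical.

```python
import json
from collections import Counter

# Hypothetical tool registry: tool name -> required argument names and types.
TOOLS = {
    "get_weather": {"city": str},
    "add": {"a": int, "b": int},
}

def is_valid_call(raw: str) -> bool:
    """A call is valid if it parses, names a known tool, and has matching typed args."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict) or not isinstance(call.get("name"), str):
        return False
    spec = TOOLS.get(call["name"])
    args = call.get("arguments")
    if spec is None or not isinstance(args, dict):
        return False
    return set(args) == set(spec) and all(isinstance(args[k], t) for k, t in spec.items())

def validity_rate(raw_calls: list[str]) -> float:
    if not raw_calls:
        return 0.0
    return sum(is_valid_call(c) for c in raw_calls) / len(raw_calls)

def step_consistency(runs: list[tuple[str, ...]]) -> float:
    """Share of repeated runs that follow the most common tool sequence."""
    if not runs:
        return 0.0
    return Counter(runs).most_common(1)[0][1] / len(runs)

raw_calls = [
    '{"name": "add", "arguments": {"a": 1, "b": 2}}',
    '{"name": "add", "arguments": {"a": "1"}}',   # missing "b", wrong type
    "not json",
    '{"name": "get_weather", "arguments": {"city": "Paris"}}',
]
rate = validity_rate(raw_calls)
consistency = step_consistency([("search", "add"), ("search", "add"), ("add",)])
```

Running the same task several times and comparing tool sequences is a cheap reproducibility signal before looking at the quality of the final answers.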
Decision Support
User-aligned, judgment-heavy outputs
Metrics:
Field extraction accuracy
Pipeline completion rate
Use a higher-quality model for scoring, even if the prompt under test runs on a cheaper model. Scorers benefit from better reasoning and nuance.
Treat scorers like judges: evaluate intent match, style accuracy, and overall output quality—not just correctness.
Break scoring into multiple focused scorers (e.g., accuracy, creativity, formatting) to pinpoint issues.
Avoid overloading the scorer prompt with context. Focus it on the relevant input and output for fair, consistent evaluation.
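Splitting scoring into focused scorers can be sketched as below. Each scorer sees only the input and the output and returns a score between 0 and 1. The scorers here are simple heuristics standing in for judge-model calls, and every name and rule is a hypothetical illustration.

```python
from typing import Callable

Scorer = Callable[[str, str], float]  # (input, output) -> score in [0, 1]

def formatting_scorer(prompt: str, output: str) -> float:
    """Illustrative rule: every non-empty line should be a '- ' bullet."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(line.lstrip().startswith("- ") for line in lines) / len(lines)

def length_scorer(prompt: str, output: str) -> float:
    """Illustrative rule: at most 50 words."""
    return 1.0 if len(output.split()) <= 50 else 0.0

def run_scorers(scorers: dict[str, Scorer], prompt: str, output: str) -> dict[str, float]:
    """Apply each focused scorer independently and report scores by name."""
    return {name: scorer(prompt, output) for name, scorer in scorers.items()}

scorers = {"formatting": formatting_scorer, "length": length_scorer}
scores = run_scorers(scorers, "List three fruits as bullets.", "- apples\n- pears\nbananas")
```

Because each score is reported separately, a low formatting score next to a perfect length score points directly at the unbulleted line rather than at a vague overall failure.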