Eval Engineering
Eval engineering is the discipline of designing and version-controlling automated test suites that rigorously quantify whether each model or prompt change is an improvement, acting like unit tests for AI systems.
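A minimal sketch of that idea using pytest, assuming a hypothetical `generate()` wrapper around the model under test and a small golden dataset committed to the repo:

```python
# eval_suite.py -- a version-controlled eval suite that runs like unit tests.
# Hypothetical names: generate() wraps the model/prompt under test, and
# golden/qa_cases.json is a golden dataset checked into version control.
import json

import pytest

from my_app import generate  # hypothetical wrapper around the current model + prompt

with open("golden/qa_cases.json") as f:
    CASES = json.load(f)  # e.g. [{"prompt": "...", "must_contain": "..."}, ...]

@pytest.mark.parametrize("case", CASES, ids=lambda c: c["prompt"][:40])
def test_output_contains_expected_fact(case):
    # Every model or prompt change must keep this suite green, like unit tests.
    output = generate(case["prompt"])
    assert case["must_contain"].lower() in output.lower()
```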
Conditional evals can save costs, especially in multi-step workflows: run cheap deterministic checks on every sample and invoke the expensive judge only when they pass (sketched after the next point).
Common eval approaches: LLM-as-a-judge, heuristic-based evals, and golden datasets.
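A sketch of that gating, combining a heuristic eval with an LLM judge; `llm_judge` is a hypothetical callable returning a 0-1 quality score:

```python
# conditional_eval.py -- cheap heuristic checks gate the expensive LLM judge,
# so judge calls (and their cost) only happen for outputs that pass the gates.
from typing import Callable

def heuristic_checks(output: str) -> bool:
    """Fast, deterministic gates that run on every sample essentially for free."""
    return bool(output.strip()) and len(output) < 4000 and "Traceback" not in output

def conditional_eval(output: str, llm_judge: Callable[[str], float]) -> float:
    if not heuristic_checks(output):
        return 0.0  # failed a cheap gate: no judge call, no cost
    return llm_judge(output)  # only surviving samples reach the paid judge
```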
In LoRA finetuning, we can build a pipeline that runs the test suite for a given set of hyperparameters and stores the adapter along with its eval results, then iterate over different values of rank, alpha, and target modules based on those results.
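A sketch of such a sweep; `train_lora()` and `run_eval_suite()` are hypothetical stand-ins for your training and eval code, and `save_pretrained()` follows the PEFT adapter-saving convention:

```python
# lora_sweep.py -- sweep rank/alpha/target modules, storing each adapter next
# to the eval results that justify (or rule out) that configuration.
import itertools
import json
from pathlib import Path

from my_training import train_lora, run_eval_suite  # hypothetical

RANKS = [8, 16, 32]
ALPHAS = [16, 32]
TARGET_MODULES = [("q_proj", "v_proj"), ("q_proj", "k_proj", "v_proj", "o_proj")]

for rank, alpha, targets in itertools.product(RANKS, ALPHAS, TARGET_MODULES):
    run_dir = Path(f"adapters/r{rank}_a{alpha}_{len(targets)}mods")
    run_dir.mkdir(parents=True, exist_ok=True)
    adapter = train_lora(rank=rank, alpha=alpha, target_modules=list(targets))
    scores = run_eval_suite(adapter)  # e.g. {"accuracy": 0.91, "faithfulness": 0.88}
    adapter.save_pretrained(run_dir)  # adapter and eval results live side by side
    (run_dir / "eval_results.json").write_text(json.dumps(scores, indent=2))
```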
Benchmarking refers to standardised comparison on predefined tasks; evaluation refers to overall model performance and suitability for the intended task.
Eval focus and example metrics by system type:

| System | Focus | Metrics |
| --- | --- | --- |
| RAG | Retrieval and context grounding | Context precision (RAGAS), faithfulness score |
| MCP systems | Structured doc inputs and complex flow evaluation | Field extraction accuracy, pipeline completion rate |
| Agents | Tool use and multi-step consistency | Function call validity rate, step consistency / reproducibility |
| Decision support | User-aligned, judgment-heavy outputs | Field extraction accuracy, pipeline completion rate |
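As a concrete instance of one metric from the table, a sketch of computing function call validity rate for an agent; the tool-call log format and the `TOOL_SCHEMAS` registry are assumptions:

```python
# metrics.py -- one metric from the table above: function call validity rate.
# Assumes each agent tool call is logged as {"name": ..., "args": {...}};
# TOOL_SCHEMAS is a hypothetical registry of required argument names per tool.
TOOL_SCHEMAS = {
    "search": {"query"},
    "get_weather": {"city"},
}

def is_valid_call(call: dict) -> bool:
    """Valid = the tool exists and every required argument is present."""
    required = TOOL_SCHEMAS.get(call.get("name"))
    return required is not None and required <= set(call.get("args", {}))

def function_call_validity_rate(calls: list[dict]) -> float:
    if not calls:
        return 0.0
    return sum(is_valid_call(c) for c in calls) / len(calls)

# Two of these three calls are valid, so the rate is ~0.67.
calls = [
    {"name": "search", "args": {"query": "eval engineering"}},
    {"name": "get_weather", "args": {}},  # missing the required "city" arg
    {"name": "get_weather", "args": {"city": "Pune"}},
]
print(round(function_call_validity_rate(calls), 2))  # 0.67
```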
- Use a higher-quality model for scoring, even if the prompt under test uses a cheaper model; scorers benefit from better reasoning and nuance.
- Treat scorers like judges: evaluate intent match, style accuracy, and overall output quality, not just correctness.
- Break scoring into multiple focused scorers (e.g., accuracy, creativity, formatting) to pinpoint issues (sketched after this list).
- Avoid overloading the scorer prompt with context; focus it on the relevant input and output for fair, consistent evaluation.
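A sketch of that multi-scorer pattern using the OpenAI Python client; the judge model name and the scorer prompts are illustrative:

```python
# scorers.py -- several focused LLM-as-judge scorers, each with a narrow
# prompt, run on a stronger model than the one that produced the output.
from openai import OpenAI

client = OpenAI()
JUDGE_MODEL = "gpt-4o"  # illustrative: a stronger model than the one under test

SCORER_PROMPTS = {
    "accuracy": "Rate 1-5 how factually accurate the answer is for the question.",
    "intent": "Rate 1-5 how well the answer matches the user's intent.",
    "formatting": "Rate 1-5 how well the answer follows the requested format.",
}

def run_scorers(question: str, answer: str) -> dict[str, int]:
    scores = {}
    for name, instruction in SCORER_PROMPTS.items():
        resp = client.chat.completions.create(
            model=JUDGE_MODEL,
            messages=[
                {"role": "system", "content": instruction + " Reply with the number only."},
                # Keep the judge focused: only the relevant input and output.
                {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
            ],
        )
        scores[name] = int(resp.choices[0].message.content.strip())
    return scores
```

Separate per-dimension scores make regressions easy to localize: a drop in "formatting" alone points at the output template rather than the model's reasoning.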