| Model Name | Accuracy Score | Helpfulness Score | Specificity Score | Clarity Score |
|---|---|---|---|---|
| anthropic/claude-3.5-sonnet | 1.419014 | 1.440141 | 0.957746 | 0.992958 |
| google/gemma-2-9b-it | 1.197183 | 1.232394 | 0.802817 | 0.985915 |
| meta-llama/llama-3.1-8b-instruct | 1.116197 | 1.140845 | 0.757042 | 0.961268 |
| mistralai/mistral-nemo | 1.183099 | 1.214789 | 0.806338 | 0.950704 |
| openai/gpt-4o-mini | 1.281690 | 1.338028 | 0.859155 | 0.985915 |

| Model Name | Final Score |
|---|---|
| anthropic/claude-3.5-sonnet | 4.809859 |
| openai/gpt-4o-mini | 4.464789 |
| google/gemma-2-9b-it | 4.218310 |
| mistralai/mistral-nemo | 4.154930 |
| meta-llama/llama-3.1-8b-instruct | 3.975352 |
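
The final score for each model is consistent with the sum of its four per-dimension scores above, within six-decimal rounding. A minimal check (column values copied from the tables; the `scores` mapping is just an illustrative structure, not part of the original pipeline):

```python
# Each tuple: (accuracy, helpfulness, specificity, clarity, final).
# Checks that final == sum of the four dimensions, within rounding.
scores = {
    "anthropic/claude-3.5-sonnet":      (1.419014, 1.440141, 0.957746, 0.992958, 4.809859),
    "google/gemma-2-9b-it":             (1.197183, 1.232394, 0.802817, 0.985915, 4.218310),
    "meta-llama/llama-3.1-8b-instruct": (1.116197, 1.140845, 0.757042, 0.961268, 3.975352),
    "mistralai/mistral-nemo":           (1.183099, 1.214789, 0.806338, 0.950704, 4.154930),
    "openai/gpt-4o-mini":               (1.281690, 1.338028, 0.859155, 0.985915, 4.464789),
}

for model, (*dims, final) in scores.items():
    # Allow a small tolerance: the published values are rounded to 6 decimals.
    assert abs(sum(dims) - final) < 1e-5, model

print("final score = sum of the four dimension scores (within rounding)")
```

The ranking in the final-score table follows directly from these sums.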

| Model Name | Failed-to-Score Count | Total Questions | Margin of Error |
|---|---|---|---|
| anthropic/claude-3.5-sonnet | 0 | 200 | 0.0 |
| google/gemma-2-9b-it | 0 | 200 | 0.0 |
| meta-llama/llama-3.1-8b-instruct | 0 | 200 | 0.0 |
| mistralai/mistral-nemo | 0 | 200 | 0.0 |
| openai/gpt-4o-mini | 0 | 200 | 0.0 |