| Model Name | Accuracy Score | Helpfulness Score | Specificity Score | Clarity Score |
|---|---|---|---|---|
| anthropic/claude-3.5-sonnet | 1.419014 | 1.440141 | 0.957746 | 0.992958 |
| google/gemma-2-9b-it | 1.197183 | 1.232394 | 0.802817 | 0.985915 |
| meta-llama/llama-3.1-8b-instruct | 1.116197 | 1.140845 | 0.757042 | 0.961268 |
| mistralai/mistral-nemo | 1.183099 | 1.214789 | 0.806338 | 0.950704 |
| openai/gpt-4o-mini | 1.281690 | 1.338028 | 0.859155 | 0.985915 |

| Model Name | Final Score |
|---|---|
| anthropic/claude-3.5-sonnet | 4.809859 |
| openai/gpt-4o-mini | 4.464789 |
| google/gemma-2-9b-it | 4.218310 |
| mistralai/mistral-nemo | 4.154930 |
| meta-llama/llama-3.1-8b-instruct | 3.975352 |
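
The final score for each model is consistent with the sum of its four per-dimension scores above, within six-decimal rounding. A minimal check (column values copied from the tables; the `scores` mapping is just an illustrative structure, not part of the original pipeline):

```python
# Each tuple: (accuracy, helpfulness, specificity, clarity, final).
# Checks that final == sum of the four dimensions, within rounding.
scores = {
    "anthropic/claude-3.5-sonnet":      (1.419014, 1.440141, 0.957746, 0.992958, 4.809859),
    "google/gemma-2-9b-it":             (1.197183, 1.232394, 0.802817, 0.985915, 4.218310),
    "meta-llama/llama-3.1-8b-instruct": (1.116197, 1.140845, 0.757042, 0.961268, 3.975352),
    "mistralai/mistral-nemo":           (1.183099, 1.214789, 0.806338, 0.950704, 4.154930),
    "openai/gpt-4o-mini":               (1.281690, 1.338028, 0.859155, 0.985915, 4.464789),
}

for model, (*dims, final) in scores.items():
    # Allow a small tolerance: the published values are rounded to 6 decimals.
    assert abs(sum(dims) - final) < 1e-5, model

print("final score = sum of the four dimension scores (within rounding)")
```

The ranking in the final-score table follows directly from these sums.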

| Model Name | Failed-to-Score Count | Total Questions | Margin of Error |
|---|---|---|---|
| anthropic/claude-3.5-sonnet | 0 | 200 | 0.0 |
| google/gemma-2-9b-it | 0 | 200 | 0.0 |
| meta-llama/llama-3.1-8b-instruct | 0 | 200 | 0.0 |
| mistralai/mistral-nemo | 0 | 200 | 0.0 |
| openai/gpt-4o-mini | 0 | 200 | 0.0 |