| Model Name | Score Accuracy | Score Helpfulness | Score Specificity | Score Clarity |
|---|---|---|---|---|
| anthropic/claude-3.5-sonnet | 1.419014 | 1.440141 | 0.957746 | 0.992958 |
| google/gemma-2-9b-it | 1.197183 | 1.232394 | 0.802817 | 0.985915 |
| meta-llama/llama-3.1-8b-instruct | 1.116197 | 1.140845 | 0.757042 | 0.961268 |
| mistralai/mistral-nemo | 1.183099 | 1.214789 | 0.806338 | 0.950704 |
| openai/gpt-4o-mini | 1.281690 | 1.338028 | 0.859155 | 0.985915 |

| Model Name | Score Final |
|---|---|
| anthropic/claude-3.5-sonnet | 4.809859 |
| openai/gpt-4o-mini | 4.464789 |
| google/gemma-2-9b-it | 4.218310 |
| mistralai/mistral-nemo | 4.154930 |
| meta-llama/llama-3.1-8b-instruct | 3.975352 |
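
The final score appears to be the sum of the four per-criterion scores; for example, anthropic/claude-3.5-sonnet: 1.419014 + 1.440141 + 0.957746 + 0.992958 ≈ 4.809859. The snippet below is a minimal sketch that checks this relationship for every model, with the dictionaries simply transcribed from the two tables above.

```python
# Minimal sketch: verify that "Score Final" equals the sum of the four
# per-criterion scores, using the values from the tables above.

criterion_scores = {
    "anthropic/claude-3.5-sonnet":      (1.419014, 1.440141, 0.957746, 0.992958),
    "google/gemma-2-9b-it":             (1.197183, 1.232394, 0.802817, 0.985915),
    "meta-llama/llama-3.1-8b-instruct": (1.116197, 1.140845, 0.757042, 0.961268),
    "mistralai/mistral-nemo":           (1.183099, 1.214789, 0.806338, 0.950704),
    "openai/gpt-4o-mini":               (1.281690, 1.338028, 0.859155, 0.985915),
}

final_scores = {
    "anthropic/claude-3.5-sonnet":      4.809859,
    "openai/gpt-4o-mini":               4.464789,
    "google/gemma-2-9b-it":             4.218310,
    "mistralai/mistral-nemo":           4.154930,
    "meta-llama/llama-3.1-8b-instruct": 3.975352,
}

for model, scores in criterion_scores.items():
    total = sum(scores)
    # The published values are rounded to six decimal places, so compare
    # within that rounding tolerance rather than exactly.
    assert abs(total - final_scores[model]) < 1e-5, model
    print(f"{model}: sum of criteria = {total:.6f}, reported final = {final_scores[model]:.6f}")
```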

| Model Name | Failed to Score Sum | Total Questions | Margin of Error |
|---|---|---|---|
| anthropic/claude-3.5-sonnet | 0 | 200 | 0.0 |
| google/gemma-2-9b-it | 0 | 200 | 0.0 |
| meta-llama/llama-3.1-8b-instruct | 0 | 200 | 0.0 |
| mistralai/mistral-nemo | 0 | 200 | 0.0 |
| openai/gpt-4o-mini | 0 | 200 | 0.0 |