Model | AGIEval | GPT4All | TruthfulQA | Bigbench |
---|---|---|---|---|
Anthropic_RLFH_ORDP_40k | 30.55 | Error: File does not exist | 45.38 | 36.75 |
Task | Version | Metric | Value | Stderr | |
---|---|---|---|---|---|
agieval_aqua_rat | 0 | acc | 21.26 | ± | 2.57 |
| | | acc_norm | 22.83 | ± | 2.64 |
agieval_logiqa_en | 0 | acc | 28.11 | ± | 1.76 |
| | | acc_norm | 32.72 | ± | 1.84 |
agieval_lsat_ar | 0 | acc | 17.39 | ± | 2.50 |
| | | acc_norm | 19.13 | ± | 2.60 |
agieval_lsat_lr | 0 | acc | 36.27 | ± | 2.13 |
| | | acc_norm | 29.02 | ± | 2.01 |
agieval_lsat_rc | 0 | acc | 46.84 | ± | 3.05 |
| | | acc_norm | 33.83 | ± | 2.89 |
agieval_sat_en | 0 | acc | 57.28 | ± | 3.45 |
| | | acc_norm | 44.17 | ± | 3.47 |
agieval_sat_en_without_passage | 0 | acc | 37.38 | ± | 3.38 |
| | | acc_norm | 28.64 | ± | 3.16 |
agieval_sat_math | 0 | acc | 40.45 | ± | 3.32 |
| | | acc_norm | 34.09 | ± | 3.20 |
Average: 30.55%
Average (GPT4All): not available (Error: File does not exist)
Task | Version | Metric | Value | Stderr | |
---|---|---|---|---|---|
truthfulqa_mc | 1 | mc1 | 28.03 | ± | 1.57 |
| | | mc2 | 45.38 | ± | 1.42 |
Average: 45.38%
Task | Version | Metric | Value | Stderr | |
---|---|---|---|---|---|
bigbench_causal_judgement | 0 | multiple_choice_grade | 56.32 | ± | 3.61 |
bigbench_date_understanding | 0 | multiple_choice_grade | 68.83 | ± | 2.41 |
bigbench_disambiguation_qa | 0 | multiple_choice_grade | 31.40 | ± | 2.89 |
bigbench_geometric_shapes | 0 | multiple_choice_grade | 19.78 | ± | 2.11 |
| | | exact_str_match | 0.00 | ± | 0.00 |
bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 26.00 | ± | 1.96 |
bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 20.00 | ± | 1.51 |
bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 43.33 | ± | 2.87 |
bigbench_movie_recommendation | 0 | multiple_choice_grade | 31.00 | ± | 2.07 |
bigbench_navigate | 0 | multiple_choice_grade | 50.90 | ± | 1.58 |
bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 56.75 | ± | 1.11 |
bigbench_ruin_names | 0 | multiple_choice_grade | 27.90 | ± | 2.12 |
bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 25.15 | ± | 1.37 |
bigbench_snarks | 0 | multiple_choice_grade | 45.86 | ± | 3.71 |
bigbench_sports_understanding | 0 | multiple_choice_grade | 51.32 | ± | 1.59 |
bigbench_temporal_sequences | 0 | multiple_choice_grade | 25.60 | ± | 1.38 |
bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 21.92 | ± | 1.17 |
bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 16.06 | ± | 0.88 |
bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 43.33 | ± | 2.87 |
Average: 36.75%
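The per-benchmark averages above appear to be unweighted means of each table's primary metric: acc_norm for the eight AGIEval tasks and multiple_choice_grade for the eighteen Bigbench tasks. A minimal sketch reproducing them (score lists copied from the tables; the helper name is illustrative, not part of the harness):

```python
# Illustrative sketch: reproduce the section averages from the tabulated scores.
# acc_norm values from the AGIEval table above.
agieval_acc_norm = [22.83, 32.72, 19.13, 29.02, 33.83, 44.17, 28.64, 34.09]

# multiple_choice_grade values from the Bigbench table above.
bigbench_scores = [56.32, 68.83, 31.40, 19.78, 26.00, 20.00, 43.33, 31.00,
                   50.90, 56.75, 27.90, 25.15, 45.86, 51.32, 25.60, 21.92,
                   16.06, 43.33]

def average(scores):
    """Unweighted mean, rounded to two decimals as in the report."""
    return round(sum(scores) / len(scores), 2)

print(average(agieval_acc_norm))  # 30.55, matching "Average: 30.55%"
print(average(bigbench_scores))   # 36.75, matching "Average: 36.75%"
```

The TruthfulQA average (45.38%) is simply its mc2 value, since that table reports a single scored task.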
Average score: Not available due to errors
Elapsed time: 01:57:48