CultriX-Github/Unsloth_PhiTome-3.5-4k-Nous.md Secret

Created September 8, 2024 13:22

Unsloth_PhiTome-3.5-4k-Nous.md

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---:|---:|---:|---:|---:|
| Unsloth_PhiTome-3.5-4k | 23.18 | 36.52 | 50.05 | 29.51 | 34.81 |

AGIEval

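| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| agieval_aqua_rat | 0 | acc | 17.72 | ± | 2.40 |
| | | acc_norm | 19.29 | ± | 2.48 |
| agieval_logiqa_en | 0 | acc | 21.35 | ± | 1.61 |
| | | acc_norm | 25.50 | ± | 1.71 |
| agieval_lsat_ar | 0 | acc | 18.70 | ± | 2.58 |
| | | acc_norm | 21.74 | ± | 2.73 |
| agieval_lsat_lr | 0 | acc | 13.33 | ± | 1.51 |
| | | acc_norm | 21.18 | ± | 1.81 |
| agieval_lsat_rc | 0 | acc | 24.54 | ± | 2.63 |
| | | acc_norm | 22.30 | ± | 2.54 |
| agieval_sat_en | 0 | acc | 27.67 | ± | 3.12 |
| | | acc_norm | 24.27 | ± | 2.99 |
| agieval_sat_en_without_passage | 0 | acc | 27.67 | ± | 3.12 |
| | | acc_norm | 24.76 | ± | 3.01 |
| agieval_sat_math | 0 | acc | 27.27 | ± | 3.01 |
| | | acc_norm | 26.36 | ± | 2.98 |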

Average: 23.18%

GPT4All

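| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| arc_challenge | 0 | acc | 19.37 | ± | 1.15 |
| | | acc_norm | 24.23 | ± | 1.25 |
| arc_easy | 0 | acc | 30.64 | ± | 0.95 |
| | | acc_norm | 30.56 | ± | 0.95 |
| boolq | 1 | acc | 42.32 | ± | 0.86 |
| hellaswag | 0 | acc | 26.70 | ± | 0.44 |
| | | acc_norm | 27.84 | ± | 0.45 |
| openbookqa | 0 | acc | 18.00 | ± | 1.72 |
| | | acc_norm | 29.00 | ± | 2.03 |
| piqa | 0 | acc | 52.67 | ± | 1.16 |
| | | acc_norm | 52.18 | ± | 1.17 |
| winogrande | 0 | acc | 49.49 | ± | 1.41 |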

Average: 36.52%

TruthfulQA

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| truthfulqa_mc | 1 | mc1 | 23.62 | ± | 1.49 |
| | | mc2 | 50.05 | ± | 1.69 |

Average: 50.05%

Bigbench

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 56.84 | ± | 3.60 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 15.72 | ± | 1.90 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 30.62 | ± | 2.88 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 10.03 | ± | 1.59 |
| | | exact_str_match | 0.00 | ± | 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 19.80 | ± | 1.78 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 14.43 | ± | 1.33 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 34.00 | ± | 2.74 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 25.20 | ± | 1.94 |
| bigbench_navigate | 0 | multiple_choice_grade | 50.10 | ± | 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 16.30 | ± | 0.83 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 49.55 | ± | 2.36 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 16.83 | ± | 1.18 |
| bigbench_snarks | 0 | multiple_choice_grade | 52.49 | ± | 3.72 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 46.86 | ± | 1.59 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 24.80 | ± | 1.37 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 19.92 | ± | 1.13 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 13.71 | ± | 0.82 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 34.00 | ± | 2.74 |

Average: 29.51%

Average score: 34.81%
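
The report does not say how the per-suite averages are derived. The sketch below is one aggregation rule that appears consistent with the reported figures, and it is an assumption, not something the report confirms: take `acc_norm` where a task reports it and `acc` otherwise, use `mc2` for TruthfulQA, and use `multiple_choice_grade` for Bigbench. The overall score is then the unweighted mean of the four suite averages. All values are copied from the tables above; the variable names are illustrative.

```python
from statistics import mean

# Per-task values copied from the tables above.
# Assumed rule: acc_norm where reported, otherwise acc.
suites = {
    "AGIEval": [19.29, 25.50, 21.74, 21.18, 22.30, 24.27, 24.76, 26.36],
    # boolq and winogrande report only acc, so acc is used for those two.
    "GPT4All": [24.23, 30.56, 42.32, 27.84, 29.00, 52.18, 49.49],
    # Assumed rule: mc2 only (mc1 is ignored).
    "TruthfulQA": [50.05],
    # Assumed rule: multiple_choice_grade only (exact_str_match is ignored).
    "Bigbench": [
        56.84, 15.72, 30.62, 10.03, 19.80, 14.43, 34.00, 25.20, 50.10,
        16.30, 49.55, 16.83, 52.49, 46.86, 24.80, 19.92, 13.71, 34.00,
    ],
}

# Unweighted mean within each suite, then across the four suites.
suite_averages = {name: mean(values) for name, values in suites.items()}
overall = mean(list(suite_averages.values()))

for name, avg in suite_averages.items():
    print(f"{name}: {avg:.2f}")
print(f"Average score: {overall:.2f}")
```

Under this rule each recomputed value lands within 0.01 of the reported one. Because the overall score weights the four suites equally, the single TruthfulQA task counts as much as all eighteen Bigbench tasks combined.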

Elapsed time: 01:38:30
