mlabonne/FuseChat-7B-VaRM-Nous.md Secret

Created February 28, 2024 15:14


| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---:|---:|---:|---:|---:|
| FuseChat-7B-VaRM | 41.91 | 72.02 | 46.76 | 42.96 | 50.91 |

AGIEval

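| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| agieval_aqua_rat | 0 | acc | 27.95 | ± | 2.82 |
| | | acc_norm | 29.13 | ± | 2.86 |
| agieval_logiqa_en | 0 | acc | 37.79 | ± | 1.90 |
| | | acc_norm | 38.56 | ± | 1.91 |
| agieval_lsat_ar | 0 | acc | 25.65 | ± | 2.89 |
| | | acc_norm | 23.48 | ± | 2.80 |
| agieval_lsat_lr | 0 | acc | 48.04 | ± | 2.21 |
| | | acc_norm | 44.90 | ± | 2.20 |
| agieval_lsat_rc | 0 | acc | 56.51 | ± | 3.03 |
| | | acc_norm | 57.25 | ± | 3.02 |
| agieval_sat_en | 0 | acc | 73.79 | ± | 3.07 |
| | | acc_norm | 73.30 | ± | 3.09 |
| agieval_sat_en_without_passage | 0 | acc | 40.29 | ± | 3.43 |
| | | acc_norm | 36.41 | ± | 3.36 |
| agieval_sat_math | 0 | acc | 36.82 | ± | 3.26 |
| | | acc_norm | 32.27 | ± | 3.16 |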

Average: 41.91%

GPT4All

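| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| arc_challenge | 0 | acc | 56.14 | ± | 1.45 |
| | | acc_norm | 59.90 | ± | 1.43 |
| arc_easy | 0 | acc | 82.70 | ± | 0.78 |
| | | acc_norm | 81.23 | ± | 0.80 |
| boolq | 1 | acc | 86.88 | ± | 0.59 |
| hellaswag | 0 | acc | 63.15 | ± | 0.48 |
| | | acc_norm | 81.76 | ± | 0.39 |
| openbookqa | 0 | acc | 29.80 | ± | 2.05 |
| | | acc_norm | 41.00 | ± | 2.20 |
| piqa | 0 | acc | 81.72 | ± | 0.90 |
| | | acc_norm | 82.92 | ± | 0.88 |
| winogrande | 0 | acc | 70.48 | ± | 1.28 |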

Average: 72.02%

TruthfulQA

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| truthfulqa_mc | 1 | mc1 | 31.33 | ± | 1.62 |
| | | mc2 | 46.76 | ± | 1.51 |

Average: 46.76%

Bigbench

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 57.37 | ± | 3.60 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 64.77 | ± | 2.49 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 59.69 | ± | 3.06 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 28.41 | ± | 2.38 |
| | | exact_str_match | 27.30 | ± | 2.35 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 26.40 | ± | 1.97 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 19.86 | ± | 1.51 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 49.33 | ± | 2.89 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 37.40 | ± | 2.17 |
| bigbench_navigate | 0 | multiple_choice_grade | 51.60 | ± | 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 64.75 | ± | 1.07 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 44.20 | ± | 2.35 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 20.64 | ± | 1.28 |
| bigbench_snarks | 0 | multiple_choice_grade | 65.19 | ± | 3.55 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 62.07 | ± | 1.55 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 31.40 | ± | 1.47 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 24.24 | ± | 1.21 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 16.63 | ± | 0.89 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 49.33 | ± | 2.89 |

Average: 42.96%

Average score: 50.91%
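
The report does not say how the averages are derived. As a sanity check, the sketch below reproduces them under one assumption: each suite average is the plain mean of one metric per task (`acc_norm` where it is reported, otherwise `acc`; `mc2` for TruthfulQA; `multiple_choice_grade` for Bigbench), and the overall score is the mean of the four suite averages. The variable names are illustrative only, and the values are copied from the tables above.

```python
from statistics import mean

# One value per task, copied from the tables above.
# Assumption: acc_norm is preferred over acc when both are reported.
scores = {
    "AGIEval": [29.13, 38.56, 23.48, 44.90, 57.25, 73.30, 36.41, 32.27],
    "GPT4All": [59.90, 81.23, 86.88, 81.76, 41.00, 82.92, 70.48],
    "TruthfulQA": [46.76],  # mc2 only
    "Bigbench": [  # multiple_choice_grade only
        57.37, 64.77, 59.69, 28.41, 26.40, 19.86, 49.33, 37.40, 51.60,
        64.75, 44.20, 20.64, 65.19, 62.07, 31.40, 24.24, 16.63, 49.33,
    ],
}

# Per-suite mean, rounded to two decimals as in the report.
suite_averages = {name: round(mean(vals), 2) for name, vals in scores.items()}

# Overall score: unweighted mean of the four suite averages.
overall = round(mean(list(suite_averages.values())), 2)

for name, avg in suite_averages.items():
    print(f"{name}: {avg:.2f}%")
print(f"Average score: {overall:.2f}%")
```

Under this assumption the computed values match the reported 41.91, 72.02, 46.76, 42.96 and 50.91. Note that the overall score weights each suite equally rather than each task.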

Elapsed time: 02:05:01
