mlabonne/Kunoichi-DPO-v2-7B-Nous.md

Created March 22, 2024 18:27

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---:|---:|---:|---:|---:|
| Kunoichi-DPO-v2-7B | 44.79 | 75.05 | 65.68 | 47.65 | 58.29 |

AGIEval

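| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| agieval_aqua_rat | 0 | acc | 26.38 | ± | 2.77 |
| | | acc_norm | 24.02 | ± | 2.69 |
| agieval_logiqa_en | 0 | acc | 38.71 | ± | 1.91 |
| | | acc_norm | 38.40 | ± | 1.91 |
| agieval_lsat_ar | 0 | acc | 25.65 | ± | 2.89 |
| | | acc_norm | 23.91 | ± | 2.82 |
| agieval_lsat_lr | 0 | acc | 52.55 | ± | 2.21 |
| | | acc_norm | 52.75 | ± | 2.21 |
| agieval_lsat_rc | 0 | acc | 64.68 | ± | 2.92 |
| | | acc_norm | 64.68 | ± | 2.92 |
| agieval_sat_en | 0 | acc | 79.13 | ± | 2.84 |
| | | acc_norm | 78.64 | ± | 2.86 |
| agieval_sat_en_without_passage | 0 | acc | 42.72 | ± | 3.45 |
| | | acc_norm | 43.20 | ± | 3.46 |
| agieval_sat_math | 0 | acc | 34.09 | ± | 3.20 |
| | | acc_norm | 32.73 | ± | 3.17 |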

Average: 44.79%

GPT4All

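| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| arc_challenge | 0 | acc | 61.95 | ± | 1.42 |
| | | acc_norm | 62.80 | ± | 1.41 |
| arc_easy | 0 | acc | 84.18 | ± | 0.75 |
| | | acc_norm | 80.22 | ± | 0.82 |
| boolq | 1 | acc | 87.74 | ± | 0.57 |
| hellaswag | 0 | acc | 68.57 | ± | 0.46 |
| | | acc_norm | 85.67 | ± | 0.35 |
| openbookqa | 0 | acc | 37.60 | ± | 2.17 |
| | | acc_norm | 48.60 | ± | 2.24 |
| piqa | 0 | acc | 81.72 | ± | 0.90 |
| | | acc_norm | 83.03 | ± | 0.88 |
| winogrande | 0 | acc | 77.27 | ± | 1.18 |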

Average: 75.05%
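
The report does not state how each suite average is derived. The figures are consistent with taking `acc_norm` for every task that reports it, falling back to `acc` otherwise, and averaging with equal weight per task. The sketch below checks that reading against the GPT4All table; the selection rule and the helper name `suite_average` are assumptions inferred from the numbers, not something the source confirms.

```python
# Scores copied from the GPT4All table above: task -> {metric: value}.
gpt4all = {
    "arc_challenge": {"acc": 61.95, "acc_norm": 62.80},
    "arc_easy": {"acc": 84.18, "acc_norm": 80.22},
    "boolq": {"acc": 87.74},
    "hellaswag": {"acc": 68.57, "acc_norm": 85.67},
    "openbookqa": {"acc": 37.60, "acc_norm": 48.60},
    "piqa": {"acc": 81.72, "acc_norm": 83.03},
    "winogrande": {"acc": 77.27},
}


def suite_average(results):
    """Unweighted mean over tasks (hypothetical helper).

    Assumed rule: use acc_norm when a task reports it, otherwise acc.
    """
    picked = [metrics.get("acc_norm", metrics["acc"]) for metrics in results.values()]
    return sum(picked) / len(picked)


gpt4all_avg = round(suite_average(gpt4all), 2)
print(gpt4all_avg)  # 75.05, matching the reported average
```

The same rule applied to the `acc_norm` rows of the AGIEval table gives 44.79, which also matches the reported value.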

TruthfulQA

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| truthfulqa_mc | 1 | mc1 | 49.94 | ± | 1.75 |
| | | mc2 | 65.68 | ± | 1.54 |

Average: 65.68%

Bigbench

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 59.47 | ± | 3.57 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 63.96 | ± | 2.50 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 34.88 | ± | 2.97 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 22.28 | ± | 2.20 |
| | | exact_str_match | 0.00 | ± | 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 34.40 | ± | 2.13 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 23.29 | ± | 1.60 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 54.00 | ± | 2.88 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 47.00 | ± | 2.23 |
| bigbench_navigate | 0 | multiple_choice_grade | 50.70 | ± | 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 70.70 | ± | 1.02 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 52.68 | ± | 2.36 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 40.88 | ± | 1.56 |
| bigbench_snarks | 0 | multiple_choice_grade | 75.69 | ± | 3.20 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 74.24 | ± | 1.39 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 58.60 | ± | 1.56 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 22.96 | ± | 1.19 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 17.94 | ± | 0.92 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 54.00 | ± | 2.88 |

Average: 47.65%

Average score: 58.29%
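
The overall score is consistent with an unweighted mean of the four suite averages reported above. A minimal check, assuming equal weight per suite (the source does not state the weighting) and using a dictionary name chosen only for illustration:

```python
# Suite averages as reported in this document.
suite_scores = {
    "AGIEval": 44.79,
    "GPT4All": 75.05,
    "TruthfulQA": 65.68,
    "Bigbench": 47.65,
}

# Assumption: each suite counts equally, regardless of its number of tasks.
overall = round(sum(suite_scores.values()) / len(suite_scores), 2)
print(overall)  # 58.29
```

Because each suite counts once, the 18 Bigbench tasks carry the same total weight as the single TruthfulQA `mc2` score.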

Elapsed time: 01:59:46