| Group | Benchmark | Explanation | Link |
| --- | --- | --- | --- |
| English | MMLU (EM) | Multi-task language understanding across diverse knowledge domains; evaluates general academic proficiency. | MMLU Benchmark |
| English | MMLU-Redux (EM) | A re-annotated subset of MMLU with erroneous or ambiguous questions corrected. | No dedicated link available. |
| English | MMLU-Pro (EM) | A harder variant of MMLU with more reasoning-focused questions and more answer options per question. | No dedicated link available. |
| English | DROP (3-shot F1) | Discrete reasoning over paragraphs: questions requiring counting, arithmetic, and other numerical inference over text, scored with token-level F1 in a 3-shot setting. | DROP Benchmark |
| English | IF-Eval (Prompt Strict) | Instruction-following evaluation with verifiable constraints; "Prompt Strict" requires every instruction in a prompt to be followed exactly. | No dedicated link available. |
| English | GPQA-Diamond (Pass@1) | Graduate-level, "Google-proof" science questions; Diamond is the hardest, highest-quality subset. | No dedicated link available. |
| English | SimpleQA (Correct) | Short, open-domain factual questions with a single verifiable answer; measures factual accuracy. | SimpleQA Dataset |
| English | FRAMES (Acc.) | Factuality, Retrieval, And reasoning MEasurement Set: questions that require retrieving and reasoning over information spread across multiple documents. | FRAMES Dataset |
| English | LongBench v2 (Acc.) | Long-context comprehension and multi-step reasoning over extended documents. | No dedicated link available. |
| Code | HumanEval-Mul (Pass@1) | Functional correctness of code generated from natural-language specifications, extended to multiple programming languages. | HumanEval |
| Code | LiveCodeBench (Pass@1-COT) | Continuously collected, recent competitive-programming problems; this variant scores solutions generated with chain-of-thought reasoning. | No dedicated link available. |
| Code | LiveCodeBench (Pass@1) | The same problem set, scored without chain-of-thought prompting. | No dedicated link available. |
| Code | Codeforces (Percentile) | Percentile ranking against human competitors on Codeforces competitive-programming problems. | Codeforces |
| Code | SWE Verified (Resolved) | SWE-bench Verified: real GitHub issues that must be resolved by patching a repository, restricted to a human-validated subset. | No dedicated link available. |
| Code | Aider-Edit (Acc.) | Accuracy of context-aware edits to existing codebases, as measured by the Aider editing benchmark. | No dedicated link available. |
| Code | Aider-Polyglot (Acc.) | Aider's multi-language editing benchmark, testing code edits across several programming languages. | No dedicated link available. |
| Math | AIME 2024 (Pass@1) | Challenging problems from the 2024 American Invitational Mathematics Examination. | AIME |
| Math | MATH-500 (EM) | A 500-problem subset of the MATH benchmark spanning diverse topics and difficulty levels. | Mathematics Dataset |
| Math | CNMO 2024 (Pass@1) | Problems from the 2024 Chinese National Mathematical Olympiad. | No dedicated link available. |
| Chinese | CLUEWSC (EM) | Chinese Winograd Schema Challenge from the CLUE benchmark, testing common-sense coreference resolution. | CLUE Benchmark |
| Chinese | C-Eval (EM) | Chinese multiple-choice exam questions across academic and professional disciplines. | C-Eval Benchmark |
| Chinese | C-SimpleQA (Correct) | Chinese counterpart of SimpleQA: short factual questions with verifiable answers. | No dedicated link available. |
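
The metric names in parentheses (EM, F1, Pass@1, Acc.) describe how each benchmark is scored rather than what it tests. As a rough illustration, the sketch below shows the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) and a naive exact-match (EM) check; the normalization and example numbers are illustrative assumptions, and real evaluation harnesses apply benchmark-specific answer extraction and comparison rules.

```python
# Minimal sketches of two scoring conventions from the table.
# pass@k follows the unbiased estimator from the HumanEval paper;
# exact match (EM) is shown as a naive normalized string comparison
# (an assumption -- real harnesses use benchmark-specific extraction).
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled completions passes,
    given n completions per problem of which c passed."""
    if n - c < k:  # fewer failing samples than draws -> guaranteed pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def exact_match(prediction: str, reference: str) -> bool:
    """EM after trivial normalization (lowercase, collapse whitespace)."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return norm(prediction) == norm(reference)


if __name__ == "__main__":
    # Hypothetical numbers: 200 samples per problem, 37 of them correct.
    print(f"pass@1 = {pass_at_k(n=200, c=37, k=1):.3f}")  # 0.185
    print(exact_match("  The Answer ", "the answer"))      # True
```

For k = 1 the estimator reduces to the fraction of correct samples, which is why Pass@1 is often reported simply as accuracy over one sampled completion per problem.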