Group | Benchmark | Summary Explanation | Link |
---|---|---|---|
English | MMLU (EM) | Multiple-choice questions spanning 57 academic and professional subjects, measuring broad knowledge and language understanding; scored by exact match (EM). | MMLU Benchmark |
English | MMLU-Redux (EM) | A manually re-annotated subset of MMLU with corrected ground-truth labels, giving a cleaner signal than the original question set. | No dedicated link available. |
English | MMLU-Pro (EM) | A harder variant of MMLU with ten answer options per question and a larger share of reasoning-intensive items. | No dedicated link available. |
English | DROP (3-shot F1) | Evaluates discrete reasoning over paragraphs: questions that require counting, arithmetic, and comparison over the text, scored with token-level F1 (a simplified EM/F1 scorer is sketched after the table). | DROP Benchmark |
English | IF-Eval (Prompt Strict) | Measures instruction following on prompts with automatically verifiable constraints; "Prompt Strict" is the strict prompt-level accuracy. | No dedicated link available. |
English | GPQA-Diamond (Pass@1) | Graduate-level, "Google-proof" multiple-choice science questions (biology, physics, chemistry); Diamond is the hardest, highest-quality subset. | No dedicated link available. |
English | SimpleQA (Correct) | Short, fact-seeking questions used to measure factual accuracy on open-domain knowledge. | SimpleQA Dataset |
English | FRAMES (Acc.) | Tests factuality and multi-hop reasoning over long, retrieved contexts that span multiple documents. | FRAMES Dataset |
English | LongBench v2 (Acc.) | Assesses understanding and reasoning over very long contexts, including long documents, multi-document QA, and long dialogues. | No dedicated link available. |
Code | HumanEval-Mul (Pass@1) | Evaluates functional correctness of code generated from natural-language instructions across multiple programming languages (see the Pass@k sketch after the table). | HumanEval |
Code | LiveCodeBench (Pass@1-COT) | Code generation on recently released competitive-programming problems, evaluated with chain-of-thought prompting to limit training-data contamination. | No dedicated link available. |
Code | LiveCodeBench (Pass@1) | The same LiveCodeBench problems evaluated with direct (non-chain-of-thought) prompting. | No dedicated link available. |
Code | Codeforces (Percentile) | Performance on Codeforces competitive-programming problems, reported as a percentile rating relative to human contestants. | Codeforces |
Code | SWE Verified (Resolved) | SWE-bench Verified: a human-validated subset of SWE-bench measuring whether models can resolve real GitHub issues in open-source repositories. | No dedicated link available. |
Code | Aider-Edit (Acc.) | Measures the accuracy of context-aware edits to existing code on the Aider editing benchmark. | No dedicated link available. |
Code | Aider-Polyglot (Acc.) | Aider's multi-language editing benchmark, covering tasks across several programming languages (not natural languages). | No dedicated link available. |
Math | AIME 2024 (Pass@1) | Tests mathematical problem-solving on challenging problems from the American Invitational Mathematics Examination. | AIME |
Math | MATH-500 (EM) | A 500-problem subset of the MATH benchmark of competition-level mathematics problems, scored by exact match on the final answer. | Mathematics Dataset |
Math | CNMO 2024 (Pass@1) | Mathematical problem-solving on problems from the 2024 Chinese National Mathematical Olympiad (CNMO). | No dedicated link available. |
Chinese | CLUEWSC (EM) | Tests Chinese language understanding using a Winograd Schema Challenge for common-sense reasoning. | CLUE Benchmark |
Chinese | C-Eval (EM) | Evaluates models' abilities in Chinese across various academic and professional disciplines. | C-Eval Benchmark |
Chinese | C-SimpleQA (Correct) | The Chinese counterpart of SimpleQA: short, fact-seeking questions measuring factual accuracy in Chinese. | No dedicated link available. |
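
Several rows above report Pass@1: the probability that a single sampled solution passes the benchmark's checks. Below is a minimal sketch of the standard unbiased pass@k estimator (the formulation popularized by the HumanEval paper), assuming `n` samples were drawn per problem and `c` of them passed; the reported score is the mean over all problems.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    computed as a running product for numerical stability.
    n: samples drawn per problem, c: samples that passed, k: attempt budget."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples generated for one problem, 37 pass the tests.
print(pass_at_k(n=200, c=37, k=1))   # ~0.185, i.e. 37/200
```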
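
The EM and F1 metrics (e.g. MATH-500's EM, DROP's F1) are string-matching scores over normalized answers. The sketch below is a simplified SQuAD-style scorer for both; the official DROP scorer additionally handles numbers and multi-span answers, which is omitted here.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    """EM: 1.0 only if the normalized strings are identical."""
    return float(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The answer is 42", "42"))   # 0.0 -- extra tokens break EM
print(round(f1("The answer is 42", "42"), 3))  # 0.5
```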