| Group | Benchmark | Explanation | Link |
| --- | --- | --- | --- |
| English | MMLU (EM) | Multi-task language understanding across diverse knowledge domains; evaluates general academic proficiency. | MMLU Benchmark |
| English | MMLU-Redux (EM) | A re-annotated subset of MMLU with erroneous or ambiguous questions corrected. | No dedicated link available. |
| English | MMLU-Pro (EM) | A harder variant of MMLU with more reasoning-focused questions and more answer options per question. | No dedicated link available. |
| English | DROP (3-shot F1) | Discrete reasoning over paragraphs: questions requiring counting, arithmetic, and other numerical inference over text, scored with token-level F1 in a 3-shot setting. | DROP Benchmark |
| English | IF-Eval (Prompt Strict) | Instruction-following evaluation with verifiable constraints; "Prompt Strict" requires every instruction in a prompt to be followed exactly. | No dedicated link available. |
| English | GPQA-Diamond (Pass@1) | Graduate-level, "Google-proof" science questions; Diamond is the hardest, highest-quality subset. | No dedicated link available. |
| English | SimpleQA (Correct) | Short, open-domain factual questions with a single verifiable answer; measures factual accuracy. | SimpleQA Dataset |
| English | FRAMES (Acc.) | Factuality, Retrieval, And reasoning MEasurement Set: questions that require retrieving and reasoning over information spread across multiple documents. | FRAMES Dataset |
| English | LongBench v2 (Acc.) | Long-context comprehension and multi-step reasoning over extended documents. | No dedicated link available. |
| Code | HumanEval-Mul (Pass@1) | Functional correctness of code generated from natural-language specifications, extended to multiple programming languages. | HumanEval |
| Code | LiveCodeBench (Pass@1-COT) | Continuously collected, recent competitive-programming problems; this variant scores solutions generated with chain-of-thought reasoning. | No dedicated link available. |
| Code | LiveCodeBench (Pass@1) | The same problem set, scored without chain-of-thought prompting. | No dedicated link available. |
| Code | Codeforces (Percentile) | Percentile ranking against human competitors on Codeforces competitive-programming problems. | Codeforces |
| Code | SWE Verified (Resolved) | SWE-bench Verified: real GitHub issues that must be resolved by patching a repository, restricted to a human-validated subset. | No dedicated link available. |
| Code | Aider-Edit (Acc.) | Accuracy of context-aware edits to existing codebases, as measured by the Aider editing benchmark. | No dedicated link available. |
| Code | Aider-Polyglot (Acc.) | Aider's multi-language editing benchmark, testing code edits across several programming languages. | No dedicated link available. |
| Math | AIME 2024 (Pass@1) | Challenging problems from the 2024 American Invitational Mathematics Examination. | AIME |
| Math | MATH-500 (EM) | A 500-problem subset of the MATH benchmark spanning diverse topics and difficulty levels. | Mathematics Dataset |
| Math | CNMO 2024 (Pass@1) | Problems from the 2024 Chinese National Mathematical Olympiad. | No dedicated link available. |
| Chinese | CLUEWSC (EM) | Chinese Winograd Schema Challenge from the CLUE benchmark, testing common-sense coreference resolution. | CLUE Benchmark |
| Chinese | C-Eval (EM) | Chinese multiple-choice exam questions across academic and professional disciplines. | C-Eval Benchmark |
| Chinese | C-SimpleQA (Correct) | Chinese counterpart of SimpleQA: short factual questions with verifiable answers. | No dedicated link available. |
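
The metric names in parentheses (EM, F1, Pass@1, Acc.) describe how each benchmark is scored rather than what it tests. As a rough illustration, the sketch below shows the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) and a naive exact-match (EM) check; the normalization and example numbers are illustrative assumptions, and real evaluation harnesses apply benchmark-specific answer extraction and comparison rules.

```python
# Minimal sketches of two scoring conventions from the table.
# pass@k follows the unbiased estimator from the HumanEval paper;
# exact match (EM) is shown as a naive normalized string comparison
# (an assumption -- real harnesses use benchmark-specific extraction).
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled completions passes,
    given n completions per problem of which c passed."""
    if n - c < k:  # fewer failing samples than draws -> guaranteed pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def exact_match(prediction: str, reference: str) -> bool:
    """EM after trivial normalization (lowercase, collapse whitespace)."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return norm(prediction) == norm(reference)


if __name__ == "__main__":
    # Hypothetical numbers: 200 samples per problem, 37 of them correct.
    print(f"pass@1 = {pass_at_k(n=200, c=37, k=1):.3f}")  # 0.185
    print(exact_match("  The Answer ", "the answer"))      # True
```

For k = 1 the estimator reduces to the fraction of correct samples, which is why Pass@1 is often reported simply as accuracy over one sampled completion per problem.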