Recent research reveals a troubling phenomenon in AI evaluation: leading language models have been "cheating" on benchmarks designed to test their capabilities. The paper "Benchmarking Benchmark Leakage in Large Language Models" (BenBench) [1] demonstrates how benchmark dataset leakage has become increasingly prevalent, undermining fair comparisons between models. This occurs when models are trained on data that includes benchmark test sets, allowing them to memorize answers rather than demonstrate genuine understanding.
The researchers introduced a detection pipeline built on two metrics, Perplexity and N-gram Accuracy, to identify potential data leakage in models from major companies including Alibaba, Google, Meta, Microsoft, Mistral AI, and
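To make the n-gram accuracy idea concrete, here is a minimal sketch of how such a memorization probe could be implemented with Hugging Face `transformers`. The model name, n-gram size, toy benchmark strings, and scoring details below are illustrative assumptions, not the paper's exact pipeline; the intuition is simply that a model which reproduces the exact next n tokens of benchmark items far more reliably than of paraphrased versions has likely seen those items during training.

```python
# Sketch of an n-gram accuracy leakage probe (assumptions: "gpt2" as a
# stand-in model, 5-token n-grams, toy benchmark strings). The real
# BenBench pipeline differs in details; this only illustrates the idea.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder; swap in the model under audit
NGRAM_SIZE = 5        # length of the n-gram the model must reproduce

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def ngram_accuracy(samples: list[str]) -> float:
    """Fraction of samples whose final n tokens the model reproduces
    exactly via greedy decoding, given the preceding prefix."""
    hits, total = 0, 0
    for text in samples:
        ids = tokenizer(text, return_tensors="pt").input_ids[0]
        # Need at least a short prefix plus one full n-gram to score.
        if ids.size(0) < NGRAM_SIZE + 2:
            continue
        split = ids.size(0) - NGRAM_SIZE      # prefix ends here
        prefix, target = ids[:split], ids[split:]
        with torch.no_grad():
            out = model.generate(
                prefix.unsqueeze(0),
                max_new_tokens=NGRAM_SIZE,
                do_sample=False,              # greedy: tests memorization
                pad_token_id=tokenizer.eos_token_id,
            )
        predicted = out[0, split:split + NGRAM_SIZE]
        hits += int(torch.equal(predicted, target))
        total += 1
    return hits / max(total, 1)

# Toy usage: compare accuracy on original test items vs. paraphrases.
# A large gap hints that the originals were seen during training.
original = ["Q: What is 17 * 24? A: 408", "Q: Capital of France? A: Paris"]
paraphrased = ["Q: Compute 17 times 24. A: 408", "Q: France's capital? A: Paris"]
print("original   :", ngram_accuracy(original))
print("paraphrased:", ngram_accuracy(paraphrased))
```

A perplexity-based check follows the same pattern: score the benchmark text and a paraphrased variant under the model and look for an unusually large gap in favor of the original wording.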