How an Average LLM Can Fool Benchmarks and Match GPT-4

How can an average LLM fool benchmarks and achieve performance comparable to GPT-4?

In this paper, the researchers show how a 13-billion-parameter Llama model can artificially achieve very high benchmark scores, on par with GPT-4.

The key lies in contamination: training or fine-tuning the model on data very similar to the test benchmarks, without getting caught. Contamination checks exist to mitigate this, but they are not infallible.

Current contamination detection methods rely on searching for n-gram overlaps between training and test data. Simple variations of the test data (paraphrases, translations) easily circumvent these decontamination measures.
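To see why, here is a minimal, illustrative sketch of word-level n-gram overlap detection. This is not any benchmark's exact implementation: the example strings are invented, and the 13-gram window is just a commonly cited choice. A verbatim copy of a test sample is caught, while a simple paraphrase slips through:

```python
# Illustrative sketch of n-gram overlap decontamination (not the paper's
# exact implementation). A training sample is flagged as contaminated if
# it shares at least one n-gram with a test sample.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_sample: str, test_sample: str, n: int = 13) -> bool:
    """Flag overlap if the two samples share at least one n-gram."""
    return bool(ngrams(train_sample, n) & ngrams(test_sample, n))

test = ("Write a function that returns the sum of all even numbers "
        "in a list of integers given as input to the function")
exact_copy = test
paraphrase = ("Implement a routine computing the total of every even "
              "value contained in an input list of whole numbers")

print(is_contaminated(exact_copy, test))   # True  -- verbatim copy is caught
print(is_contaminated(paraphrase, test))   # False -- rephrasing slips through
```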

In this study, the researchers propose a more robust, LLM-based decontamination method: embedding similarity first narrows the candidate training samples, then a strong LLM judges whether any candidate is a rephrasing of a test sample.
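Below is a hedged sketch of that two-stage pipeline. The helpers `embed` and `llm_judge` are hypothetical placeholders for whichever embedding model and judge LLM you plug in (the paper itself uses OpenAI embeddings and GPT-4 as the judge); the retrieval-plus-judge structure is the point, not the specific models:

```python
# Sketch of a two-stage rephrased-contamination detector:
# (1) for each test sample, retrieve the top-k most similar training
#     samples by embedding cosine similarity;
# (2) ask a strong LLM whether any retrieved sample is a rephrasing.
# `embed` and `llm_judge` are placeholders, not real library calls.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return a vector embedding of `text`."""
    raise NotImplementedError("plug in your embedding model here")

def llm_judge(test_sample: str, train_sample: str) -> bool:
    """Placeholder: prompt a judge LLM with both samples and ask
    whether the training sample is a rephrasing of the test sample."""
    raise NotImplementedError("plug in your judge LLM here")

def find_rephrased_contamination(test_set, train_set, k=5):
    """Return (test_sample, train_sample) pairs judged to be rephrasings."""
    train_vecs = np.stack([embed(t) for t in train_set])
    train_vecs /= np.linalg.norm(train_vecs, axis=1, keepdims=True)
    hits = []
    for test_sample in test_set:
        v = embed(test_sample)
        v /= np.linalg.norm(v)
        sims = train_vecs @ v              # cosine similarity to all train samples
        top_k = np.argsort(sims)[-k:]      # indices of the k nearest samples
        for idx in top_k:
            if llm_judge(test_sample, train_set[idx]):
                hits.append((test_sample, train_set[idx]))
    return hits
```

The design choice here is pragmatic: running an LLM judge over every (train, test) pair would be prohibitively expensive, so cheap embedding retrieval does the coarse filtering and the expensive judge only sees the top-k candidates.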

They applied their method to popular datasets, revealing significant, previously unknown overlap with the test sets. For example, in pretraining corpora like RedPajama-Data-1T and StarCoderData, they identified an 8 to 18% overlap with HumanEval.

They also found such contamination even in synthetic datasets generated by GPT-3.5/4, pointing to a risk of unintentional contamination.

Link to the article: https://arxiv.org/abs/2311.04850
