The Fleeting Value of LLM Benchmarks

You can’t improve what you don’t measure. However, the relevance of measurement can degrade over time.

For LLMs, and machine learning in general, we often rely on benchmarks to rank models…

Yet their evaluative power is ephemeral.

They are necessary because they let us, on the one hand, compare models with one another and, on the other, quantify the progress made when we try to improve them.

The best known for LLMs include MMLU, HellaSwag, HumanEval… And for embedding models: STS, GLUE, MS MARCO, etc.

However, that power is temporary. At first a benchmark is relevant, but over time scores inflate: new models post better results without necessarily representing a fundamental breakthrough.

You could say this is a more or less indirect form of overfitting: over time, model designers get to know the benchmark datasets better and better, and consciously or unconsciously adjust their training parameters and datasets to maximize scores.

Some go so far as to train models directly on the benchmark datasets, which is obviously cheating. Contamination tests can catch this, but they are not always enough.

Indeed, simple variations of the test data (paraphrases, translations) can easily slip past these contamination checks.
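
To make this concrete, here is a minimal sketch of one common style of contamination check, word-level n-gram overlap between benchmark examples and the training corpus. The function names and the 8-gram threshold are illustrative assumptions, not any lab's actual pipeline; the point is that verbatim matching is exactly what a paraphrase or translation defeats.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(test_example: str, training_docs: Iterable[str], n: int = 8) -> bool:
    """Flag a benchmark example if any of its n-grams appears verbatim in the
    training corpus. The weakness: a paraphrased or translated copy of the
    example produces different n-grams and passes the check."""
    test_ngrams = ngrams(test_example, n)
    return any(test_ngrams & ngrams(doc, n) for doc in training_docs)


# An exact copy is caught, a light paraphrase is not.
train = ["The quick brown fox jumps over the lazy dog near the river bank today"]
print(is_contaminated("quick brown fox jumps over the lazy dog near the river bank", train))   # True
print(is_contaminated("a fast brown fox leaps over a sleepy dog close to the riverbank", train))  # False
```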

It is therefore hard to tell, from benchmark scores alone, how much reflects real progress and how much is “overfitting”.

Ideally, benchmarks would be refreshed on a regular basis, but the work involved is substantial.

Benchmarks are useful for preselecting models, but nothing beats building evaluation datasets tailored to your specific needs.
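
As a sketch of what “tailored to your specific needs” can look like, here is a tiny evaluation harness: a handful of domain-specific prompts with accepted answers and a scoring loop. The `generate` callable, the example questions, and the substring-match scoring are all placeholder assumptions; a real evaluation usually needs stricter matching or an LLM judge.

```python
from typing import Callable, List, Tuple


def evaluate(generate: Callable[[str], str],
             dataset: List[Tuple[str, List[str]]]) -> float:
    """Score a model on a hand-built eval set.
    `generate` is whatever callable wraps your model; each example
    pairs a prompt with the answers you are willing to accept."""
    hits = 0
    for prompt, accepted_answers in dataset:
        output = generate(prompt).strip().lower()
        if any(answer.lower() in output for answer in accepted_answers):
            hits += 1
    return hits / len(dataset)


# A small eval set drawn from your own use case, not a public benchmark
# (the questions below are purely hypothetical examples).
my_eval_set = [
    ("What is the SLA response time for a P1 incident?", ["15 minutes"]),
    ("Which database backs the billing service?", ["postgresql", "postgres"]),
]

# accuracy = evaluate(my_llm_call, my_eval_set)  # plug in your own model wrapper
```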
