Added AIME Support #2892
Conversation
Thank you for your generous contribution. If possible, I would prefer to split the AIME problems into multiple subsets by year. Additionally, it seems that 2024-I is missing.
Thank you for the contribution! I think AIME 2024 specifically is something people evaluate on a lot, and having it as a callable subset would be valuable. I'm less sure every year needs to be evaluated independently, but having that as an option could be good. The DeepSeek R1 paper reports AIME 2024 results for a number of models; can you check whether this PR reproduces their numbers? See also this reference: agentica-project/rllm#3
Hi @Gresham429 @StellaAthena, thank you for your valuable suggestions! I'll first try to divide AIME into subsets, and then see if I can reproduce previous results. As for the missing AIME 2024-I, I'll double-check the data source.
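For illustration, a per-year split could be built by filtering the dataset on a year field. This is only a rough sketch; the dataset path and the `year` column name below are assumptions, not necessarily what this PR uses:

```python
# Hypothetical sketch of splitting AIME problems into per-year subsets.
# The dataset path ("path/to/aime") and the "year" column name are
# illustrative assumptions, not the actual identifiers in this PR.
from datasets import load_dataset


def build_year_subsets(dataset_path: str = "path/to/aime"):
    ds = load_dataset(dataset_path, split="test")
    years = sorted(set(ds["year"]))
    # One subset per year, e.g. exposable as separate harness tasks
    # (aime_1983, ..., aime_2024) or as a single grouped task.
    return {year: ds.filter(lambda ex, y=year: ex["year"] == y) for year in years}


if __name__ == "__main__":
    subsets = build_year_subsets()
    for year, subset in subsets.items():
        print(year, len(subset))
```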
It would be great to also support metrics like pass@1 and cons@64. Technical reports often include these metrics, since only 30 problems are released each year. To calculate them, each question needs to be run 64 times (repeats: 64), and we will certainly need to evaluate only the selected year.
Thank you, this sounds like a great suggestion! Will try to add it.
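As a minimal sketch of what these metrics mean (function names and data layout are mine, not from this PR): pass@1 averaged over k repeats is the fraction of sampled answers that match the reference, and cons@k takes the majority-vote answer among the k samples before checking it.

```python
# Minimal sketch of pass@1 (averaged over repeats) and cons@k.
# Names and data layout are illustrative assumptions, not part of this PR.
from collections import Counter


def pass_at_1(samples: list[str], reference: str) -> float:
    """Fraction of the k sampled answers that match the reference."""
    return sum(s == reference for s in samples) / len(samples)


def cons_at_k(samples: list[str], reference: str) -> float:
    """Majority-vote ("consensus") accuracy: take the most common answer
    among the k samples and check it against the reference."""
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return float(majority_answer == reference)


# Example with k = 4 repeats (a real run would use repeats: 64):
samples = ["204", "204", "73", "204"]
print(pass_at_1(samples, "204"))  # 0.75
print(cons_at_k(samples, "204"))  # 1.0
```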
What does this PR do?
This PR adds the AIME (1983-2024) dataset to the lm-evaluation-harness library, as requested in issue #2766.
Implementation
The implementation of the AIME task largely follows that of gsm8k-platinum, which is also a math QA dataset and uses exact match to verify answer correctness.
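Since AIME answers are integers from 000 to 999, the exact-match check essentially reduces to extracting the final integer from the model's output and comparing it with the gold answer. The regex and normalization below are an illustrative sketch, not the exact filter used in this PR:

```python
# Rough sketch of an exact-match check for AIME-style integer answers (000-999).
# The regex and normalization are illustrative, not the exact filter in this PR.
import re


def extract_answer(generation: str) -> str | None:
    """Pull the last integer from the model output and strip leading zeros."""
    matches = re.findall(r"\d+", generation)
    if not matches:
        return None
    return str(int(matches[-1]))


def exact_match(generation: str, gold: str) -> bool:
    return extract_answer(generation) == str(int(gold))


print(exact_match("The final answer is 073.", "73"))  # True
```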
In my test, Meta's Llama-3.1-8B-Instruct achieves 111/933 (~11.9%) accuracy across all AIME problems, which is consistent with the previous observation that Llama-3.1-8B-Instruct can answer roughly 10% of AIME problems.
I'd be happy to see more tests run on AIME. Feel free to report any problems you encounter when using these tasks or to propose improvements!
Checklist
For adding novel benchmarks/datasets to the library: