
Added AIME Support #2892


Open · wants to merge 4 commits into main

Conversation

Zephyr271828

What does this PR do?

This PR adds the AIME (1983-2024) dataset to the lm-evaluation-harness library, as requested in issue #2766.

Implementation

The implementation largely follows that of gsm8k-platinum, which is also a math QA dataset and uses exact match to verify answer correctness.
In my test, Meta's Llama-3.1-8B-Instruct answers 111 of 933 AIME problems correctly (~11.9%), which is consistent with the previous observation that Llama-3.1-8B-Instruct can solve roughly 10% of AIME problems.
More testing on AIME would be very welcome; feel free to report any problems you run into or propose improvements!
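
For context, a task config in this style looks roughly like the sketch below. This is illustrative only rather than the exact file in this PR: the `dataset_path`, column names, prompt format, and answer-extraction regex are placeholders modeled on the existing gsm8k configs.

```yaml
# aime.yaml -- illustrative sketch only; field values below are assumptions
task: aime
dataset_path: <hf-namespace>/aime-1983-2024  # placeholder Hugging Face dataset id
output_type: generate_until
test_split: train
doc_to_text: "Problem: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
generation_kwargs:
  until:
    - "Problem:"
  do_sample: false
filter_list:
  - name: flexible-extract
    filter:
      - function: regex
        group_select: -1
        regex_pattern: "([0-9]+)"  # AIME answers are integers from 000 to 999
      - function: take_first
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
```

Once registered, the task runs the usual way, e.g. `lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --tasks aime`.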

Checklist

For adding novel benchmarks/datasets to the library:

  • Is the task an existing benchmark in the literature?
  • Have you referenced the original paper that introduced the task?
  • If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

@Gresham429

Thank you for your contribution. If possible, I would prefer to split the AIME problems into subsets by year. Additionally, it seems that 2024-I is missing.

@StellaAthena
Member

Thank you for the contribution! AIME 2024 specifically is something people evaluate on a lot, and having it as a callable subset would be valuable. I'm less sure every year needs its own subset, but having that as an option could be good.

The DeepSeek-R1 paper reports AIME 2024 numbers for a range of models; can you see if this PR reproduces them?

See also this reference: agentica-project/rllm#3

@Zephyr271828
Author

Hi @Gresham429 @StellaAthena, thank you for your valuable suggestions! I'll first try to divide AIME into per-year subsets, and then see if I can reproduce the previously reported results. As for the missing AIME 2024-I, I'll double-check the data source.
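
One possible shape for the split, as a rough sketch using the harness's `include` and `!function` hooks; the task names and the `utils.filter_2024` helper are hypothetical placeholders, not final:

```yaml
# aime_2024.yaml -- hypothetical per-year subset layered on the base config
include: aime.yaml
task: aime_2024
process_docs: !function utils.filter_2024  # hypothetical helper keeping only 2024 problems

# _aime.yaml -- hypothetical group config so `--tasks aime` still runs every year
group: aime
task:
  - aime_2024
  - aime_2023
  # ... one entry per year back to 1983
```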

@seldereyy

It would also be great to support metrics like pass@1 and cons@64. Technical reports often include these metrics, since only 30 problems are released each year. To compute them, each question needs to be run 64 times (repeats: 64), and we would certainly need the ability to evaluate only a selected year.
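
For cons@64 specifically, something along the lines of the existing GSM8K self-consistency configs might work; a rough sketch (the task name, sampling settings, and the `majority_vote` filter wiring are assumptions, not tested):

```yaml
# aime_2024_cons64.yaml -- rough sketch, not tested
include: aime_2024.yaml
task: aime_2024_cons64
repeats: 64                       # sample each of the 30 questions 64 times
generation_kwargs:
  do_sample: true
  temperature: 0.6
filter_list:
  - name: maj@64
    filter:
      - function: regex
        group_select: -1
        regex_pattern: "([0-9]+)"
      - function: majority_vote   # most common extracted answer across the 64 samples
      - function: take_first
metric_list:
  - metric: exact_match           # scored against the majority answer -> cons@64
    aggregation: mean
    higher_is_better: true
```

pass@1 averaged over the 64 samples would presumably need separate handling, since the filter above collapses the repeats into a single answer per question.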

@Zephyr271828
Author


Thank you, this sounds like a great suggestion! I'll try to add it.
Sorry @Gresham429 @StellaAthena @seldereyy, I've been a bit swamped with other work recently. Is it OK if I come back to this PR in May?
