Added AIME Support #2892
Conversation
Thank you for your generous contribution. If possible, I would prefer to split the AIME problems into multiple subsets by year. Additionally, it seems that 2024-I is missing.
Thank you for the contribution! I think AIME 2024 specifically is something people evaluate on a lot, and having it as a callable subset would be valuable. I'm less sure every year needs to be evaluated independently, but having that as an option could be good. The DeepSeek R1 paper reports AIME 2024 results for a number of models; can you check whether this PR reproduces their numbers? See also this reference: agentica-project/rllm#3
Hi @Gresham429 @StellaAthena, thank you for your valuable suggestions! I'll first try to divide AIME into subsets, and then see if I can reproduce previous results. As for the missing AIME 2024-I, I'll double-check the data source.
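For illustration, a per-year split could be built by filtering the dataset on a year field. This is only a rough sketch; the dataset path and the `year` column name below are assumptions, not necessarily what this PR uses:

```python
# Hypothetical sketch of splitting AIME problems into per-year subsets.
# The dataset path ("path/to/aime") and the "year" column name are
# illustrative assumptions, not the actual identifiers in this PR.
from datasets import load_dataset


def build_year_subsets(dataset_path: str = "path/to/aime"):
    ds = load_dataset(dataset_path, split="test")
    years = sorted(set(ds["year"]))
    # One subset per year, e.g. exposable as separate harness tasks
    # (aime_1983, ..., aime_2024) or as a single grouped task.
    return {year: ds.filter(lambda ex, y=year: ex["year"] == y) for year in years}


if __name__ == "__main__":
    subsets = build_year_subsets()
    for year, subset in subsets.items():
        print(year, len(subset))
```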
It would be great to also support metrics like pass@1 and cons@64. Technical reports often include these metrics, since only 30 problems are released each year. To calculate them, each question needs to be run 64 times (repeats: 64), and we will certainly need to evaluate only the selected year.
Thank you, this sounds like a great suggestion! Will try to add it.
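As a minimal sketch of what these metrics mean (function names and data layout are mine, not from this PR): pass@1 averaged over k repeats is the fraction of sampled answers that match the reference, and cons@k takes the majority-vote answer among the k samples before checking it.

```python
# Minimal sketch of pass@1 (averaged over repeats) and cons@k.
# Names and data layout are illustrative assumptions, not part of this PR.
from collections import Counter


def pass_at_1(samples: list[str], reference: str) -> float:
    """Fraction of the k sampled answers that match the reference."""
    return sum(s == reference for s in samples) / len(samples)


def cons_at_k(samples: list[str], reference: str) -> float:
    """Majority-vote ("consensus") accuracy: take the most common answer
    among the k samples and check it against the reference."""
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return float(majority_answer == reference)


# Example with k = 4 repeats (a real run would use repeats: 64):
samples = ["204", "204", "73", "204"]
print(pass_at_1(samples, "204"))  # 0.75
print(cons_at_k(samples, "204"))  # 1.0
```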
What does this PR do?
This PR adds the AIME (1983-2024) dataset to the lm-evaluation-harness library, as requested in issue #2766.
Implementation
The implementation of the AIME task largely follows that of gsm8k-platinum, which is also a math QA dataset and uses exact match to verify answer correctness.
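Since AIME answers are integers from 000 to 999, the exact-match check essentially reduces to extracting the final integer from the model's output and comparing it with the gold answer. The regex and normalization below are an illustrative sketch, not the exact filter used in this PR:

```python
# Rough sketch of an exact-match check for AIME-style integer answers (000-999).
# The regex and normalization are illustrative, not the exact filter in this PR.
import re


def extract_answer(generation: str) -> str | None:
    """Pull the last integer from the model output and strip leading zeros."""
    matches = re.findall(r"\d+", generation)
    if not matches:
        return None
    return str(int(matches[-1]))


def exact_match(generation: str, gold: str) -> bool:
    return extract_answer(generation) == str(int(gold))


print(exact_match("The final answer is 073.", "73"))  # True
```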
In my test, Meta's Llama-3.1-8B-Instruct achieves 111/933 (~11.9%) accuracy across all AIME problems, which is consistent with the previous observation that Llama-3.1-8B-Instruct can answer roughly 10% of AIME problems.
I'd be happy to see more tests run on AIME. Feel free to report any problems you encounter when using these tasks or to propose improvements!
Checklist
For adding novel benchmarks/datasets to the library: