Adds Math-500 and AIME24 evals #4

edbeeching · 2025-01-24T20:29:34Z

Evaluating models (internal)

For small models, use --data_parallel=$NUM_GPUS; for large models, shard with --tensor_parallel=$NUM_GPUS.
Example for evaluating deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:

NUM_GPUS=1
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,data_parallel=$NUM_GPUS,max_model_length=4096,gpu_memory_utilisation=0.8"
TASK=aime24 # or math
OUTPUT_DIR=evals/$MODEL

lighteval $MODEL_ARGS $TASK --use-chat-template --custom-tasks src/open_r1/eval/$TASK.py --output-dir $OUTPUT_DIR --system-prompt="Please reason step by step, and put your final answer within \boxed{}."
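
The choice between data and tensor parallelism described above could be sketched as follows. This is a minimal illustration, not part of the PR: the SHARD switch and the PARALLEL variable are hypothetical names introduced here, and passing tensor_parallel in the same key=value form as data_parallel inside MODEL_ARGS is an assumption based on the example above. The script only builds and prints the argument string; it does not invoke lighteval.

```shell
#!/usr/bin/env bash
# Sketch: choose the parallelism entry for MODEL_ARGS.
# SHARD is a hypothetical switch for this example only:
#   0 = replicate the model on each GPU (data parallel, small models)
#   1 = shard the model across GPUs (tensor parallel, large models)
NUM_GPUS=8
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
SHARD=${SHARD:-0}

if [ "$SHARD" = "1" ]; then
  # Assumption: tensor_parallel uses the same key=value form as data_parallel.
  PARALLEL="tensor_parallel=$NUM_GPUS"
else
  PARALLEL="data_parallel=$NUM_GPUS"
fi

MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,$PARALLEL,max_model_length=4096,gpu_memory_utilisation=0.8"
echo "$MODEL_ARGS"
```

Running the script with SHARD=1 in the environment swaps the data_parallel entry for tensor_parallel and leaves the rest of the string unchanged.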

lewtun

LGTM!

* adds evals
* up max model len

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

adds evals

1d7a158

lewtun approved these changes Jan 24, 2025

edbeeching and others added 2 commits January 24, 2025 21:05

up max model len

4142817

Merge branch 'main' into ed-evals

0914613

lewtun merged commit 9c39897 into main Jan 24, 2025
0 of 2 checks passed

lewtun deleted the ed-evals branch January 24, 2025 22:09

pyh314 mentioned this pull request Feb 2, 2025

NCCL problem occured when multiple GPU cards are saving model.safetensors #160

Open

GitMonkey0 pushed a commit to GitMonkey0/open-r1 that referenced this pull request Feb 24, 2025

Adds Math-500 and AIME24 evals (huggingface#4)

9239a8e

* adds evals
* up max model len

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
