Load/Save Checkpoint Fails using DeepSpeed - GRPO #2

Open
zaddy6 opened this issue Feb 7, 2025 · 0 comments
zaddy6 commented Feb 7, 2025

Reproduction
When using DeepSpeed, checkpoint saving fails. The training run is launched with the following arguments:

```
    --output_dir outputs/Llama-3.1-8B-Instruct-zerox \
    --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --max_prompt_length 512 \
    --max_completion_length 512 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 3e-6 \
    --adam_beta1 0.9 \
    --adam_beta2 0.99 \
    --weight_decay 0.1 \
    --warmup_ratio 0.1 \
    --logging_steps 1 \
    --num_generations 2 \
    --save_steps 2 \
    --max_steps 1000 \
    --torch_dtype bfloat16 \
    --use_vllm \
    --vllm_gpu_memory_utilization 0.7 \
    --bf16
```

Output:

```
[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/simple_grpo/src/train_zero.py", line 275, in <module>
[rank1]:     main(training_args, model_args)
[rank1]:   File "/workspace/simple_grpo/src/train_zero.py", line 268, in main
[rank1]:     trainer.train()
[rank1]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2185, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2555, in _inner_training_loop
[rank1]:     self._maybe_log_save_evaluate(
[rank1]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 3035, in _maybe_log_save_evaluate
[rank1]:     self._save_checkpoint(model, trial)
[rank1]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 3160, in _save_checkpoint
[rank1]:     shutil.rmtree(checkpoint_dir)
[rank1]:   File "/opt/conda/lib/python3.11/shutil.py", line 752, in rmtree
[rank1]:     _rmtree_safe_fd(fd, path, onerror)
[rank1]:   File "/opt/conda/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
[rank1]:     onerror(os.unlink, fullname, sys.exc_info())
[rank1]:   File "/opt/conda/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
[rank1]:     os.unlink(entry.name, dir_fd=topfd)
[rank1]: FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_2.pth'
[rank4]: Traceback (most recent call last):
[rank4]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 3157, in _save_checkpoint
[rank4]:     os.renames(output_dir, checkpoint_dir)
[rank4]:   File "<frozen os>", line 272, in renames
[rank4]: FileExistsError: [Errno 17] File exists: 'outputs/Llama-3.1-8B-Instruct-zerox/tmp-checkpoint-g_fa26gf' -> 'outputs/Llama-3.1-8B-Instruct-zerox/checkpoint-2'

[rank4]: During handling of the above exception, another exception occurred:

[rank4]: Traceback (most recent call last):
[rank4]:   File "/workspace/simple_grpo/src/train_zero.py", line 275, in <module>
[rank4]:     main(training_args, model_args)
[rank4]:   File "/workspace/simple_grpo/src/train_zero.py", line 268, in main
[rank4]:     trainer.train()
[rank4]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2185, in train
[rank4]:     return inner_training_loop(
[rank4]:            ^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2555, in _inner_training_loop
[rank4]:     self._maybe_log_save_evaluate(
[rank4]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 3035, in _maybe_log_save_evaluate
[rank4]:     self._save_checkpoint(model, trial)
[rank4]:   File "/workspace/simple_grpo/venv/lib/python3.11/site-packages/transformers/trainer.py", line 3160, in _save_checkpoint
[rank4]:     shutil.rmtree(checkpoint_dir)
[rank4]:   File "/opt/conda/lib/python3.11/shutil.py", line 752, in rmtree
[rank4]:     _rmtree_safe_fd(fd, path, onerror)
[rank4]:   File "/opt/conda/lib/python3.11/shutil.py", line 683, in _rmtree_safe_fd
[rank4]:     onerror(os.rmdir, fullname, sys.exc_info())
[rank4]:   File "/opt/conda/lib/python3.11/shutil.py", line 681, in _rmtree_safe_fd
[rank4]:     os.rmdir(entry.name, dir_fd=topfd)
[rank4]: FileNotFoundError: [Errno 2] No such file or directory: 'global_step2'
```

System Info

```
Platform: Linux-5.15.0-130-generic-x86_64-with-glibc2.35
Python version: 3.11.9
PyTorch version: 2.5.1
CUDA device(s): NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3
Transformers version: 4.48.2
Accelerate version: 1.3.0
Accelerate config: not found
Datasets version: 3.2.0
HF Hub version: 0.28.1
TRL version: 0.15.0.dev0
bitsandbytes version: not installed
DeepSpeed version: not installed
Diffusers version: not installed
Liger-Kernel version: 0.5.2
LLM-Blender version: not installed
OpenAI version: 1.61.1
PEFT version: 0.14.0
```
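From the two tracebacks it looks like more than one rank runs the same tmp-checkpoint-* → checkpoint-2 rename and cleanup at the same time, so the ranks remove files out from under each other. Below is a minimal standalone sketch of that pattern and of a rank-guarded variant; it is only an illustration of the suspected race, not the actual transformers Trainer code, and the `barrier` hook is an assumption:

```python
import os
import shutil


def finalize_checkpoint(tmp_dir: str, checkpoint_dir: str) -> None:
    """Rename the staged tmp dir to its final name (mirrors the failing pattern above)."""
    try:
        os.renames(tmp_dir, checkpoint_dir)   # rank 4's FileExistsError came from a call like this
    except FileExistsError:
        shutil.rmtree(checkpoint_dir)         # rank 1 and rank 4 both crashed in shutil.rmtree with
        os.renames(tmp_dir, checkpoint_dir)   # FileNotFoundError, consistent with another rank
                                              # deleting the same files concurrently


def finalize_checkpoint_guarded(tmp_dir: str, checkpoint_dir: str,
                                is_main_process: bool, barrier) -> None:
    """Workaround sketch: only one rank touches the directory, the rest wait."""
    if is_main_process:
        finalize_checkpoint(tmp_dir, checkpoint_dir)
    barrier()  # e.g. torch.distributed.barrier() in a real multi-rank run
```

In a real run the guard would correspond to letting only the main process rename/clean the checkpoint directory while the other ranks wait, but I'm not sure whether that is the intended fix here.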