
I adjusted the training configuration, and it shows I'm training on the CPU #47

Open
wzk2239115 opened this issue Feb 27, 2025 · 9 comments

@wzk2239115

export CUDA_VISIBLE_DEVICES=6,7

accelerate launch \
  --num_processes 2 \
  --config_file deepspeed_zero3.yaml \
  train_Datawhale-R1.py \
  --config Datawhale-R1.yaml

These are the cards I configured; I want to use physical cards 6 and 7.

Then I set the following in Datawhale-R1.yaml:

# GRPO algorithm parameters
beta: 0.001 # KL penalty coefficient, tuned; see the discussion below
max_prompt_length: 256 # maximum input prompt length; barely changes in this experiment
max_completion_length: 4096 # output length, including the reasoning chain; 4K is reasonable
num_generations: 8
use_vllm: true # enable vLLM to speed up inference
vllm_device: cuda:0 # reserve one card for vLLM inference; see the discussion below
vllm_gpu_memory_utilization: 0.9

So effectively I'm using physical card 6 for vLLM inference, i.e. cuda:0 (as sketched below).
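For context, a minimal sketch of that device renumbering, assuming CUDA_VISIBLE_DEVICES=6,7 is exported before the process starts (standard torch.cuda calls, not from the original post):

```python
import torch

# With CUDA_VISIBLE_DEVICES=6,7 set before launch, PyTorch renumbers the
# visible cards: physical GPU 6 -> cuda:0, physical GPU 7 -> cuda:1.
# vllm_device: cuda:0 therefore refers to physical card 6.
print(torch.cuda.device_count())      # expected: 2
print(torch.cuda.get_device_name(0))  # physical card 6
print(torch.cuda.get_device_name(1))  # physical card 7
```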

But I get this warning:
[2025-02-27 15:14:28,807] [INFO] [config.py:734:init] Config mesh_device None world_size = 2
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').

@anine09
Contributor

anine09 commented Feb 27, 2025

Hi @wzk2239115, please make sure your PyTorch can actually detect the CUDA devices. Check the output of torch.cuda.is_available().
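A quick check along these lines (a minimal sketch; run it in the same environment and with the same CUDA_VISIBLE_DEVICES as the training launch):

```python
import torch

# Both values should reflect the cards exposed via CUDA_VISIBLE_DEVICES.
print(torch.cuda.is_available())  # True if PyTorch can reach the CUDA driver
print(torch.cuda.device_count())  # 2 for CUDA_VISIBLE_DEVICES=6,7
```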

@wzk2239115
Author

It returns True. I also called torch.cuda.device_count(), which returns 2.

@anine09
Contributor

anine09 commented Feb 27, 2025

And when your code is actually running, is it running on the cards?

@wzk2239115
Author

It shows:

[2025-02-27 15:28:53,614] [INFO] [config.py:734:init] Config mesh_device None world_size = 2
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').

[image: GPU usage screenshot]

Cards 6 and 7 are being used by this job. Something feels off, but I don't know why.

@anine09
Contributor

anine09 commented Feb 27, 2025

Oh, I see. As we mention in the article, this code needs to keep one card free as the vLLM inference card, so if you are only opening two cards, you need to set --num_processes to 1.
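Concretely, only the process count changes relative to the launch command above (a sketch; everything else stays as before, with one card left free for the vLLM inference worker per the article's setup):

```bash
export CUDA_VISIBLE_DEVICES=6,7

accelerate launch \
  --num_processes 1 \
  --config_file deepspeed_zero3.yaml \
  train_Datawhale-R1.py \
  --config Datawhale-R1.yaml
```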

@wzk2239115
Author

I changed --num_processes to 1, but then it exits immediately with this error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/roots/clouditera/grpo/RL_sec/Datawhale-R1/train_Datawhale-R1.py", line 412, in <module>
[rank0]:     main()
[rank0]:   File "/home/roots/clouditera/grpo/RL_sec/Datawhale-R1/train_Datawhale-R1.py", line 409, in main
[rank0]:     grpo_function(model_args, dataset_args, training_args, callbacks=callbacks)
[rank0]:   File "/home/roots/clouditera/grpo/RL_sec/Datawhale-R1/train_Datawhale-R1.py", line 338, in grpo_function
[rank0]:     trainer = GRPOTrainer(
[rank0]:               ^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/unlockDeepseek/lib/python3.12/site-packages/trl/trainer/grpo_trainer.py", line 346, in __init__
[rank0]:     raise ValueError(
[rank0]: ValueError: The global train batch size (1 x 1) must be evenly divisible by the number of generations per prompt (8). Given the current train batch size, the valid values for the number of generations are: [].

So I set it back to 2 to get it running.
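For reference, the arithmetic behind that ValueError (a sketch of the constraint, with values taken from this setup):

```python
# TRL's GRPOTrainer requires the global train batch size to be evenly
# divisible by num_generations, so each prompt's samples fit into batches.
num_processes = 1                # from --num_processes
per_device_train_batch_size = 1  # from Datawhale-R1.yaml
num_generations = 8              # from Datawhale-R1.yaml

global_batch = num_processes * per_device_train_batch_size  # (1 x 1) = 1
print(global_batch % num_generations == 0)  # False -> raises the ValueError
```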

@anine09
Contributor

anine09 commented Feb 27, 2025

I need to take a look at your training config file. It seems you modified the batch size.

@wzk2239115
Author

Sure, it's the Datawhale-R1.yaml file, right?
```yaml
# Model parameters
model_name_or_path: /home/roots/grpo/Qwen2.5-3B-Instruct
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
bf16: true
tf32: true
output_dir: /home/roots/grpo/RL_sec/Datawhale-R1/output

# Dataset parameters
dataset_id_or_path: /home/roots/grpo/RL_sec/cseval

# Swanlab experiment-tracking parameters
swanlab: true # whether to enable Swanlab
workspace: kzw99999
project: sec-R1-by_wzk
experiment_name: qwen2.5-3B-lr:5e-7_beta:0.001

# Training parameters
max_steps: 450 # maximum number of training steps
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
learning_rate: 5.0e-7 # learning rate, tuned; see the discussion below
lr_scheduler_type: cosine # learning-rate decay schedule
warmup_ratio: 0.03 # learning-rate warmup ratio (over the total steps); works well!
seed: 2025 # random seed, for reproducibility

# GRPO algorithm parameters
beta: 0.001 # KL penalty coefficient, tuned; see the discussion below
max_prompt_length: 256 # maximum input prompt length; barely changes in this experiment
max_completion_length: 4096 # output length, including the reasoning chain; 4K is reasonable
num_generations: 8
use_vllm: true # enable vLLM to speed up inference
vllm_device: cuda:0 # reserve one card for vLLM inference; see the discussion below
vllm_gpu_memory_utilization: 0.9

# Logging arguments
logging_strategy: steps
logging_steps: 1
save_strategy: "steps"
save_steps: 50 # save a checkpoint every this many steps
```

@anine09
Contributor

anine09 commented Feb 27, 2025

Looking at the error message, you need to make sure that per_device_train_batch_size % num_generations == 0. Try adjusting those two settings, and keep --num_processes at 1.
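For example (a sketch, not the only valid combination; lowering num_generations to a divisor of the batch size is the other lever):

```yaml
# With --num_processes 1, the global train batch size equals
# per_device_train_batch_size, so make it a multiple of num_generations.
per_device_train_batch_size: 8
num_generations: 8
```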
