I adjusted the training configuration, and it shows training is running on CPU #47
Comments
Hi @wzk2239115, please make sure your PyTorch can correctly detect the CUDA device; check your |
It shows True, and I also called |
Then when your code actually runs, is it running on the GPU? |
Oh, I see. As mentioned in our article, this code needs to reserve one GPU as the vLLM inference card, so if you only have two GPUs available, you need to set |
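The GPU budgeting the maintainer describes can be sketched as follows. This is an illustrative helper, not part of the project code; it only assumes what the thread states, namely that one card is always reserved for vLLM inference:

```python
def training_processes(total_gpus: int) -> int:
    """Number of accelerate training processes when one GPU is
    reserved for the vLLM inference engine (per this project's setup).
    The function name is made up for illustration."""
    if total_gpus < 2:
        raise ValueError("need at least one training GPU plus one vLLM GPU")
    return total_gpus - 1

print(training_processes(2))  # with two visible GPUs, run one training process
```

So with only two cards, `--num_processes` should be 1, leaving the second card free for vLLM.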
I changed it to --num_process=1 and it exits abnormally, showing |
I need to take a look at your training configuration file; you seem to have modified the batch size |
OK, it's the Datawhale-R1.yaml file:

```yaml
# dataset parameters
dataset_id_or_path: /home/roots/grpo/RL_sec/cseval
# Swanlab training logging parameters
swanlab: true # whether to enable Swanlab
# training parameters
max_steps: 450 # maximum number of training steps
# GRPO algorithm parameters
beta: 0.001 # KL penalty coefficient, adjusted; see the explanation below
# Logging arguments
logging_strategy: steps
```
Judging from the error message, you need to make sure that |
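The exact constraint is truncated above, but in TRL's GRPOTrainer the effective generation batch (num_processes × per_device_train_batch_size) must be divisible by num_generations so completions split evenly into groups per prompt; take this as an assumption based on TRL's implementation rather than this thread. A minimal check:

```python
def grpo_batch_ok(num_processes: int,
                  per_device_train_batch_size: int,
                  num_generations: int) -> bool:
    """Sketch of the divisibility constraint (assumed from TRL's
    GRPOTrainer): the global batch must split evenly into groups
    of num_generations completions per prompt."""
    global_batch = num_processes * per_device_train_batch_size
    return global_batch % num_generations == 0

# e.g. with one training process and num_generations: 8,
# per_device_train_batch_size must itself be a multiple of 8
print(grpo_batch_ok(1, 8, 8))   # True
print(grpo_batch_ok(1, 4, 8))   # False
```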
```shell
export CUDA_VISIBLE_DEVICES=6,7
accelerate launch \
    --num_processes 2 \
    --config_file deepspeed_zero3.yaml \
    train_Datawhale-R1.py \
    --config Datawhale-R1.yaml
```
These are the cards I set; I want to use physical GPU 6 and GPU 7.
Then I configured Datawhale-R1.yaml:
```yaml
# GRPO algorithm parameters
beta: 0.001 # KL penalty coefficient, adjusted; see the explanation below
max_prompt_length: 256 # maximum input prompt length; it barely changes in this experiment
max_completion_length: 4096 # output length, including the reasoning chain of thought; 4K is a reasonable setting
num_generations: 8
use_vllm: true # enable vLLM to speed up inference
vllm_device: cuda:0 # reserve one GPU for vLLM inference; see the explanation below
vllm_gpu_memory_utilization: 0.9
```
So effectively I'm using physical GPU 6, i.e. cuda:0, for vLLM inference.
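For reference, CUDA_VISIBLE_DEVICES renumbers the visible GPUs from zero, so with `6,7` exported, `cuda:0` is physical GPU 6 and `cuda:1` is physical GPU 7. A small sketch of that mapping rule (the helper name is made up for illustration):

```python
import os

def logical_to_physical(logical_index: int) -> int:
    """Map a logical cuda:N index to the physical GPU id under
    the CUDA_VISIBLE_DEVICES remapping rule."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    physical_ids = [int(d) for d in visible.split(",") if d.strip()]
    return physical_ids[logical_index]

os.environ["CUDA_VISIBLE_DEVICES"] = "6,7"
print(logical_to_physical(0))  # 6 -> vllm_device: cuda:0 runs on physical GPU 6
print(logical_to_physical(1))  # 7
```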
But I get this warning:
```
[2025-02-27 15:14:28,807] [INFO] [config.py:734:init] Config mesh_device None world_size = 2
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
```