Training with the DeepSpeed version fails with RuntimeError: The size of tensor a (780) must match the size of tensor b (781) #40

Open
tize-72 opened this issue Feb 18, 2025 · 6 comments

Comments


tize-72 commented Feb 18, 2025

The non-unsloth version also fails during training with a dimension mismatch error. Has anyone else run into this?


hellobiek commented Feb 18, 2025

I hit the same error, but I am using the unsloth version.


Mrkkew commented Feb 19, 2025

I ran into the following error:
[rank3]: File "/home/jovyan/work/tanzichang/miniconda/envs/vl_vllm/lib/python3.10/site-packages/transformers/trainer.py", line 2171, in train
[rank3]: return inner_training_loop(
[rank3]: File "/home/jovyan/work/tanzichang/miniconda/envs/vl_vllm/lib/python3.10/site-packages/transformers/trainer.py", line 2531, in _inner_training_loop
[rank3]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank3]: File "/home/jovyan/work/tanzichang/miniconda/envs/vl_vllm/lib/python3.10/site-packages/transformers/trainer.py", line 3675, in training_step
[rank3]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank3]: File "/home/jovyan/work/tanzichang/miniconda/envs/vl_vllm/lib/python3.10/site-packages/trl/trainer/grpo_trainer.py", line 495, in compute_loss
[rank3]: rewards_per_func[:, i] = torch.tensor(output_reward_func, dtype=torch.float32, device=device)
[rank3]: RuntimeError: The expanded size of the tensor (10) must match the existing size (11) at non-singleton dimension 0. Target sizes: [10]. Tensor sizes: [11]
What is causing this?
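
The last frame of the traceback writes a reward tensor of length 11 into a column of length 10. One possible reading, which is an assumption and not confirmed anywhere in this thread, is that a reward function returned a different number of values than the number of completions it was given. A minimal sketch of a guard that surfaces this earlier is shown here; `checked_rewards`, `length_reward`, and `buggy_reward` are hypothetical names for illustration and are not part of TRL or this project:

```python
from typing import Callable, List, Sequence


def checked_rewards(
    reward_func: Callable[[Sequence[str]], Sequence[float]],
    completions: Sequence[str],
) -> List[float]:
    """Call a reward function and verify it returns one reward per completion."""
    rewards = list(reward_func(completions))
    if len(rewards) != len(completions):
        raise ValueError(
            f"reward function returned {len(rewards)} rewards "
            f"for {len(completions)} completions"
        )
    return rewards


def length_reward(completions: Sequence[str]) -> List[float]:
    # Hypothetical well-behaved reward: exactly one value per completion.
    return [float(len(c)) for c in completions]


def buggy_reward(completions: Sequence[str]) -> List[float]:
    # Hypothetical faulty reward: returns one value too many.
    return [0.0] * (len(completions) + 1)


if __name__ == "__main__":
    print(checked_rewards(length_reward, ["a", "bb"]))
    try:
        checked_rewards(buggy_reward, ["a", "bb"])
    except ValueError as err:
        print("caught:", err)
```

Wrapping each reward function this way would turn the shape error inside `compute_loss` into a message that names the mismatched counts, which may make the faulty function easier to find.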


SwayDy commented Feb 19, 2025

I ran into this as well: RuntimeError: The size of tensor a (1024) must match the size of tensor b (1025) at non-singleton dimension 1


SwayDy commented Feb 19, 2025

I ran into this as well: RuntimeError: The size of tensor a (1024) must match the size of tensor b (1025) at non-singleton dimension 1

In train_Datawhale-R1_unsloth.py, line 201:
max_seq_length=training_args.max_completion_length, # set the maximum sequence length
Change it to:
max_seq_length=training_args.max_prompt_length + training_args.max_completion_length, # set the maximum sequence length
After that, training runs.
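
The thread does not establish why this change helps. One reading, stated here as an assumption, is that the sequence budget has to hold the prompt and the completion together, so sizing it from the completion length alone leaves no room for the prompt. The arithmetic can be sketched as follows; `LengthArgs` and `required_seq_length` are hypothetical names, and the numeric values are illustrative only, not taken from the project's config:

```python
from dataclasses import dataclass


@dataclass
class LengthArgs:
    """Hypothetical stand-in for the two length fields on training_args."""
    max_prompt_length: int
    max_completion_length: int


def required_seq_length(args: LengthArgs) -> int:
    # Budget for the prompt and the completion together, as in the fix above.
    return args.max_prompt_length + args.max_completion_length


if __name__ == "__main__":
    # Illustrative values only (assumption, not the project's settings).
    args = LengthArgs(max_prompt_length=256, max_completion_length=1024)

    old_budget = args.max_completion_length   # original line 201
    new_budget = required_seq_length(args)    # after the change

    print("old max_seq_length:", old_budget)
    print("new max_seq_length:", new_budget)
    print("tokens the old budget lacked:", new_budget - old_budget)
```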

@xiaoan17

> I ran into this as well: RuntimeError: The size of tensor a (1024) must match the size of tensor b (1025) at non-singleton dimension 1
>
> In train_Datawhale-R1_unsloth.py, line 201:
> max_seq_length=training_args.max_completion_length, # set the maximum sequence length
> Change it to:
> max_seq_length=training_args.max_prompt_length + training_args.max_completion_length, # set the maximum sequence length
> After that, training runs.

Thank you very much. After making this change, training runs normally.

Contributor

anine09 commented Feb 25, 2025

> I ran into this as well: RuntimeError: The size of tensor a (1024) must match the size of tensor b (1025) at non-singleton dimension 1
>
> In train_Datawhale-R1_unsloth.py, line 201:
> max_seq_length=training_args.max_completion_length, # set the maximum sequence length
> Change it to:
> max_seq_length=training_args.max_prompt_length + training_args.max_completion_length, # set the maximum sequence length
> After that, training runs.

I don't understand why this happens. @tize-72, with this change, does it run on the DeepSpeed version?


6 participants