Loss stays at 0 during training #21

After launching the training job, the loss stays at 0 the whole time. What could be causing this?

Comments
Honestly, I don't have a clear idea either. I ran into a similar situation myself, where the loss stayed at 0 for the first 90 steps. We are still discussing it.
Hi @QinHsiu, first please make sure you are not using the "thinking-length reward function", which can cause severe repetitive generation. As for the loss being 0, that is a known property of GRPO itself; see the official TRL explanation mentioned in the discussion above.
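For illustration, here is a minimal sketch of why GRPO reports a loss of 0 early in training, following the TRL explanation referenced above. The tensors and values are made up for demonstration; this is not the repository's actual training code.

```python
import torch

# Rewards for one group of G = 4 completions sampled for the same prompt.
rewards = torch.tensor([0.3, 0.9, 0.1, 0.7])

# GRPO computes advantages by normalizing rewards within the group,
# so the advantages are zero-mean by construction.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

# Stand-in for per-completion log-probs under the current policy.
logps = torch.randn(4, requires_grad=True)

# On the first optimizer step the policy equals the "old" policy, so the
# importance ratio exp(logp - logp.detach()) evaluates to exactly 1 ...
ratio = torch.exp(logps - logps.detach())
loss = -(ratio * advantages).mean()

print(loss.item())  # ~0.0: the mean of zero-mean advantages
loss.backward()
print(logps.grad)   # non-zero: the gradient still carries learning signal
```

So a reported loss of 0 does not by itself mean the model is not learning; the reward curve is the more informative signal to watch.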
The repetition problem is then most likely down to the model's own capability; other reproduction reports also observe frequent repetition. Personally I would suggest trying a longest-subsequence-matching penalty. There is no good solution for now, and switching to a larger model might help, but these are only my guesses. Also @QinHsiu, you may want to pull the latest code from our repository, which enables flash-attn for better efficiency; the change is right next to the code you showed.
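As a rough sketch of the suggested penalty, under one possible reading of "longest-subsequence matching": penalize a completion by the length of the longest n-gram that occurs more than once. The function names, whitespace tokenization, and -1..0 scaling below are all hypothetical and would need tuning against a real run.

```python
def longest_repeated_ngram(tokens):
    """Length of the longest n-gram that occurs at least twice in `tokens`."""
    def repeats(n):
        seen = set()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram in seen:
                return True
            seen.add(gram)
        return False

    # Binary search works because "some length-n gram repeats" is monotone:
    # if a length-n gram repeats, so does its length-(n-1) prefix.
    lo, hi = 0, max(len(tokens) - 1, 0)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if repeats(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


def repetition_penalty_reward(completions, **kwargs):
    """TRL-style reward function: 0 for clean text, approaching -1 for
    completions dominated by a single repeated span."""
    rewards = []
    for text in completions:
        tokens = text.split()  # crude tokenization, for illustration only
        frac = longest_repeated_ngram(tokens) / len(tokens) if tokens else 0.0
        rewards.append(-frac)
    return rewards
```

And for the flash-attn change, the standard way to enable it when loading a model with transformers (assuming the flash-attn package is installed and the GPU supports it; the model name here is a placeholder, not necessarily what the repository uses):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",        # placeholder; substitute your model
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```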
Got it, thanks for the heads-up, and thank you for your open-source work!