submit deepspeed task hangs #5775
After running for 6 hours, the task fails with: debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"dispatch resource timeout",grpc_status:13,
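Hi, I deployed FATE using AnsibleFATE_2.1.1_LLM_2.1.0_release_offline on two GPU machines. The flow test passes, but when I run my own training code the task hangs. The logs contain no errors, and nvidia-smi shows no running processes. After the log line [start submit deepspeed task deepspeed_202505221013311247640_nn_0_0_guest_9999], the log stops updating. The full log follows.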