Program keeps reporting OOM in Ray mode #601

Open
3 tasks done
charonkk opened this issue Feb 28, 2025 · 4 comments
Assignees
Labels
question Further information is requested

Comments

@charonkk

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of the main branch and run it again, and the problem still exists.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker.
RayTask ID: 4638cdb55ff58408ad461398b336d3dfb658d0cf03000000
Worker ID: 4372695af468e6caaaadb9c45bac2262cf4af671c857df34cd023a77
Node ID: b7bfd95a3426844644557f6a21a007819407435c2674b2994dd07065
Worker IP address: 100.102.190.142
Worker port: 10080
Worker PID: 41158
Worker exit type: SYSTEM_ERROR
Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Running Dataset. Active & requested resources: 1/512 CPU, 256.0MB/186.3GB object store: : 0.00 row [52:58, ? row/s]

  • ReadJSON->SplitBlocks(1024): Tasks: 0; Queued blocks: 0; Resources: 0.0 CPU, 6.3KB object store: : 11.0 row [52:58, 5.46 row/s]
  • MapBatches(process_batch_arrow)->...->MapBatches(partial): Tasks: 1; Queued blocks: 0; Resources: 1.0 CPU, 256.0MB object store: : 0.00 row [52:58, ? row/s]
  • limit=1: Tasks: 0; Queued blocks: 0; Resources: 0.0 CPU, 0.0B object store: : 0.00 row [52:58, ? row/s]

Additional information

@charonkk charonkk added the question Further information is requested label Feb 28, 2025
@charonkk
Author

I don't know where this requirement for 512 CPUs comes from. Is there a parameter I can set to reduce it?

@charonkk
Author

This is the log printed to the screen when running the Ray demo.

@charonkk
Author

My device is an NPU.

@pan-x-c
Collaborator

pan-x-c commented Feb 28, 2025

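By default, Ray should start as many workers as there are CPUs. What the log shows may be caused by Ray not correctly detecting the number of CPUs in the environment.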
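
One way to check this hypothesis is to compare the CPU count the OS reports with the count the process is actually allowed to use, and then cap Ray explicitly. The following is only a rough sketch, assuming a Linux host: `usable_cpu_count` and `start_local_ray` are hypothetical helper names, not part of Data-Juicer or Ray, and the cgroup v2 path may not exist in every container. `ray.init(num_cpus=...)` applies only when starting a new local Ray instance; for a cluster that is already running, the CPU count would instead be set when the node is started, e.g. `ray start --head --num-cpus=<N>`.

```python
import os


def usable_cpu_count() -> int:
    """Best-effort count of CPUs this process may actually use (hypothetical helper)."""
    total = os.cpu_count() or 1
    # On Linux, the scheduler affinity mask reflects cpuset/taskset restrictions.
    if hasattr(os, "sched_getaffinity"):
        total = min(total, len(os.sched_getaffinity(0)))
    # cgroup v2 CPU quota, if present: the file holds "<quota> <period>" or "max <period>".
    try:
        with open("/sys/fs/cgroup/cpu.max") as f:
            quota, period = f.read().split()
        if quota != "max":
            total = min(total, max(1, int(quota) // int(period)))
    except (OSError, ValueError):
        pass  # no readable cgroup v2 limit; keep the previous estimate
    return total


def start_local_ray(num_cpus: int):
    """Start a local Ray instance capped at num_cpus (hypothetical helper)."""
    import ray  # imported lazily so the CPU check above works without Ray installed

    ray.init(num_cpus=num_cpus)
    # cluster_resources() reports the totals Ray believes the cluster has.
    return ray.cluster_resources().get("CPU")


if __name__ == "__main__":
    print("CPUs reported by the OS:", os.cpu_count())
    print("CPUs usable by this process:", usable_cpu_count())
```

If the two printed numbers differ widely, or the first one is 512 on a machine where far fewer cores are available to the job, capping `num_cpus` reduces the number of concurrent workers and therefore the peak memory they can consume together. Whether that resolves the OOM reported here is not confirmed in this thread.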
