Pyxis install breaking with race condition #638
Comments
Worth exploring whether this is related to PR #632.
Okay, after some more digging, the simplest solution is to run … Based on a conversation with SDE, the sequence of events (race condition) seems to be:
A potential solution would be to move the …
Thanks, Aman. For a short-term solution, can you please add instructions to the workshop here to run … We should test whether moving the install_enroot_pyxis installation earlier in the lifecycle_script.py script prevents this issue altogether.
When creating HyperPod clusters with 2 ml.g5.8xlarge instances, we see errors when trying to run containers with Pyxis + Enroot.
CloudWatch does not show an error from the execution of the install_enroot_lifeycle.sh lifecycle script.
Reinstalling Enroot + Pyxis on all the nodes solves this.
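Since CloudWatch reports no error, a quick per-node check can help tell which nodes need the reinstall. The following is only a rough sketch, not part of the lifecycle scripts: the function names `pyxis_flag_present` and `node_looks_healthy` are made up for illustration, and the check assumes that a loaded Pyxis plugin shows up as a `--container-image` option in the output of `srun --help`, and that `enroot` is on the `PATH` when it is installed.

```python
import shutil
import subprocess


def pyxis_flag_present(srun_help: str) -> bool:
    """Return True if srun's help text advertises the --container-image option.

    Assumption: Pyxis registers --container-image with srun when it is loaded.
    """
    return "--container-image" in srun_help


def node_looks_healthy() -> bool:
    """Best-effort local check (hypothetical helper, not from the repo).

    Returns False if enroot or srun is not on PATH, or if srun's help
    output does not mention --container-image.
    """
    if shutil.which("enroot") is None or shutil.which("srun") is None:
        return False
    result = subprocess.run(
        ["srun", "--help"], capture_output=True, text=True, check=False
    )
    # Some builds print help to stderr, so look at both streams.
    return pyxis_flag_present(result.stdout + result.stderr)


if __name__ == "__main__":
    if node_looks_healthy():
        print("enroot and pyxis look installed on this node")
    else:
        print("enroot/pyxis check failed; this node may need a reinstall")
```

Running this on every node (for example over SSH) would only indicate whether the install appears to have landed; it does not explain or fix the race itself.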