Pyxis install breaking with race condition #638

Open · nghtm opened this issue Apr 9, 2025 · 4 comments

@nghtm
Collaborator

nghtm commented Apr 9, 2025

When creating HyperPod clusters with 2 ml.g5.8xlarge instances, we are seeing errors trying to run containers with Pyxis + Enroot.

srun: unrecognized option '--container-image'

CloudWatch does not show any error from the execution of the install_enroot_lifecycle.sh lifecycle script.

Reinstalling Enroot + Pyxis on all of the nodes resolves the issue.
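
For context, the failure shows up on any srun invocation that uses the Pyxis SPANK options. A minimal reproduction sketch (the container image and node count here are illustrative, not the exact job we ran):

```bash
# Any srun call that relies on the Pyxis plugin fails when srun/slurmd were
# started before plugstack.conf referenced spank_pyxis.so.
srun -N 2 --container-image=nvcr.io/nvidia/pytorch:24.04-py3 nvidia-smi
# srun: unrecognized option '--container-image'
```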

@nghtm
Collaborator Author

nghtm commented Apr 9, 2025

Worth exploring whether this is related to PR #632.

@nghtm
Collaborator Author

nghtm commented Apr 9, 2025

0: slurmstepd: error: *** STEP 9.0 ON ip-10-1-15-17 CANCELLED AT 2025-04-09T14:42:25 ***
1: [Auto Resume] Error: JobID: 9 StepID: 0 TaskID: 1 Task failed on node ip-10-1-52-158
1: [Auto Resume] Info: JobID: 9 StepID: 0 TaskID: 1 Successfully terminated the step since task exited with status: 256 
0: slurmstepd: error: pyxis: child 19181 terminated with signal 9
srun: error: ip-10-1-15-17: task 0: Exited with exit code 1
1: slurmstepd: error: pyxis: child 18826 terminated with signal 9
srun: error: ip-10-1-52-158: task 1: Exited with exit code 1
[Auto Resume] Info: JobID: 9 StepID: 0 Initiating communication with cluster agent to diagnose health of nodes: [ip-10-1-15-17,ip-10-1-52-158]
[Auto Resume] Info: JobID: 9 StepID: 0 Response from cluster agent: JobId=9, ResumeAction=NONE
[Auto Resume] Info: JobID: 9 StepID: 0 No hardware issues were detected. Cluster agent recommends to no-opt auto-resume.

@amanshanbhag
Collaborator

amanshanbhag commented Apr 25, 2025

Okay, after doing some more digging, the simplest solution is to run scontrol reconfigure on the compute nodes.
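
A hedged sketch of that workaround (run from the controller or any node that can reach slurmctld; scontrol reconfigure asks slurmctld to re-read its configuration and signals the slurmd daemons to pick up the updated config files):

```bash
# Ask slurmctld and the slurmd daemons to re-read slurm.conf / plugstack.conf.
sudo scontrol reconfigure

# Sanity check: the Pyxis option should now be visible to srun.
srun --help 2>&1 | grep -- --container-image
```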

Based on a conversation with an SDE, the sequence of events (the race condition) appears to be:

  1. slurmctld comes up (the controller node's lifecycle script finishes a bit ahead because of instance size).
  2. slurmd comes up on the compute instances. At this point, the Slurm configuration (slurm.conf) does not yet contain the Pyxis configuration.
  3. The compute nodes fetch plugstack.conf from the controller.
  4. Pyxis is installed on the controller. At this point, slurmd still uses the old (cached) Slurm configuration (and plugstack.conf); that is, even though plugstack.conf now exists on the controller with Pyxis (and is referenced by the controller's slurm.conf), the compute nodes keep using an older, cached version of slurm.conf and pyxis.conf (and thus a potentially older version of Pyxis).

A potential solution would be to move the Enroot/Pyxis installation on the controller to BEFORE the compute nodes fetch the configuration from the controller.
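
One way to check whether a given node has actually picked up the Pyxis configuration (a hedged diagnostic; the paths assume a typical HyperPod Slurm install under /opt/slurm and may differ on your cluster):

```bash
# Does slurm.conf reference a plugstack.conf, and does that file (or an
# included pyxis.conf) load spank_pyxis.so?
grep -i plugstack /opt/slurm/etc/slurm.conf
cat /opt/slurm/etc/plugstack.conf 2>/dev/null

# If srun still does not list the option below, this node is running on a
# stale/cached configuration from before Pyxis was installed.
srun --help 2>&1 | grep -- --container-image
```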

@nghtm
Collaborator Author

nghtm commented Apr 28, 2025

Thanks Aman. As a short-term solution, can you please add instructions to the workshop here to run scontrol reconfigure?

We should also test whether moving the install_enroot_pyxis installation earlier in the lifecycle_script.py script prevents this issue altogether.
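
One hedged way to confirm the ordering on a test cluster before and after moving that step (assumes a systemd-managed slurmd and Slurm installed under /opt/slurm; adjust unit name and paths as needed):

```bash
# Compare when slurmd started on a compute node with when the Pyxis plugin
# configuration landed on the controller. If slurmd started first, it cached
# a pre-Pyxis configuration and will hit the race described above.
systemctl show slurmd --property=ActiveEnterTimestamp        # on a compute node
ls -l --time-style=full-iso /opt/slurm/etc/plugstack.conf    # on the controller
```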
