Pyxis install breaking with race condition #638

Open · nghtm opened this issue Apr 9, 2025 · 4 comments

@nghtm
Collaborator

nghtm commented Apr 9, 2025

When creating HyperPod clusters with 2 ml.g5.8xlarge instances, we are seeing errors trying to run containers with Pyxis + Enroot.

srun: unrecognized option '--container-image'

CloudWatch does not show any error from the execution of the install_enroot_lifecycle.sh lifecycle script.

Reinstalling Enroot + Pyxis on all of the nodes resolves the issue.
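
For context, the failure shows up on any srun invocation that uses the Pyxis SPANK options. A minimal reproduction sketch (the container image and node count here are illustrative, not the exact job we ran):

```bash
# Any srun call that relies on the Pyxis plugin fails when srun/slurmd were
# started before plugstack.conf referenced spank_pyxis.so.
srun -N 2 --container-image=nvcr.io/nvidia/pytorch:24.04-py3 nvidia-smi
# srun: unrecognized option '--container-image'
```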

@nghtm
Collaborator Author

nghtm commented Apr 9, 2025

Worth exploring whether this is related to PR #632.

@nghtm
Collaborator Author

nghtm commented Apr 9, 2025

0: slurmstepd: error: *** STEP 9.0 ON ip-10-1-15-17 CANCELLED AT 2025-04-09T14:42:25 ***
1: [Auto Resume] Error: JobID: 9 StepID: 0 TaskID: 1 Task failed on node ip-10-1-52-158
1: [Auto Resume] Info: JobID: 9 StepID: 0 TaskID: 1 Successfully terminated the step since task exited with status: 256 
0: slurmstepd: error: pyxis: child 19181 terminated with signal 9
srun: error: ip-10-1-15-17: task 0: Exited with exit code 1
1: slurmstepd: error: pyxis: child 18826 terminated with signal 9
srun: error: ip-10-1-52-158: task 1: Exited with exit code 1
[Auto Resume] Info: JobID: 9 StepID: 0 Initiating communication with cluster agent to diagnose health of nodes: [ip-10-1-15-17,ip-10-1-52-158]
[Auto Resume] Info: JobID: 9 StepID: 0 Response from cluster agent: JobId=9, ResumeAction=NONE
[Auto Resume] Info: JobID: 9 StepID: 0 No hardware issues were detected. Cluster agent recommends to no-opt auto-resume.

@amanshanbhag
Collaborator

amanshanbhag commented Apr 25, 2025

Okay, after doing some more digging, the simplest solution is to run scontrol reconfigure on the compute nodes.
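
A hedged sketch of that workaround (run from the controller or any node that can reach slurmctld; scontrol reconfigure asks slurmctld to re-read its configuration and signals the slurmd daemons to pick up the updated config files):

```bash
# Ask slurmctld and the slurmd daemons to re-read slurm.conf / plugstack.conf.
sudo scontrol reconfigure

# Sanity check: the Pyxis option should now be visible to srun.
srun --help 2>&1 | grep -- --container-image
```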

Based on a conversation with an SDE, the sequence of events (the race condition) appears to be:

  1. slurmctld comes up (the controller node's lifecycle script finishes a bit ahead because of instance size).
  2. slurmd comes up on the compute instances. At this point, the Slurm configuration (slurm.conf) does not yet contain the Pyxis configuration.
  3. The compute nodes fetch plugstack.conf from the controller.
  4. Pyxis is installed on the controller. At this point, slurmd still uses the old (cached) Slurm configuration (and plugstack.conf); that is, even though plugstack.conf now exists on the controller with Pyxis (and is referenced by the controller's slurm.conf), the compute nodes keep using an older, cached version of slurm.conf and pyxis.conf (and thus a potentially older version of Pyxis).

A potential solution would be to move the Enroot/Pyxis installation on the controller to BEFORE the compute nodes fetch the configuration from the controller.
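
One way to check whether a given node has actually picked up the Pyxis configuration (a hedged diagnostic; the paths assume a typical HyperPod Slurm install under /opt/slurm and may differ on your cluster):

```bash
# Does slurm.conf reference a plugstack.conf, and does that file (or an
# included pyxis.conf) load spank_pyxis.so?
grep -i plugstack /opt/slurm/etc/slurm.conf
cat /opt/slurm/etc/plugstack.conf 2>/dev/null

# If srun still does not list the option below, this node is running on a
# stale/cached configuration from before Pyxis was installed.
srun --help 2>&1 | grep -- --container-image
```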

@nghtm
Collaborator Author

nghtm commented Apr 28, 2025

Thanks Aman. As a short-term solution, can you please add instructions to the workshop here to run scontrol reconfigure?

We should also test whether moving the install_enroot_pyxis installation earlier in the lifecycle_script.py script prevents this issue altogether.
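
One hedged way to confirm the ordering on a test cluster before and after moving that step (assumes a systemd-managed slurmd and Slurm installed under /opt/slurm; adjust unit name and paths as needed):

```bash
# Compare when slurmd started on a compute node with when the Pyxis plugin
# configuration landed on the controller. If slurmd started first, it cached
# a pre-Pyxis configuration and will hit the race described above.
systemctl show slurmd --property=ActiveEnterTimestamp        # on a compute node
ls -l --time-style=full-iso /opt/slurm/etc/plugstack.conf    # on the controller
```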
