Snakemake hangs forever with the SLURM executor after slurm_persist_conn_open_without_init #207

Open
Redmar-van-den-Berg opened this issue Jan 24, 2025 · 10 comments

Comments

@Redmar-van-den-Berg

The Snakemake process hangs forever after the sacct command receives a "Connection refused". It looks like Snakemake stops trying to query the status of the job after a few retries. In fact, the submitted job has already completed successfully, and I am able to query its job status manually using the sacct command.

Software Versions
snakemake-minimal=8.24.1
snakemake-executor-plugin-slurm=0.14.2
slurm 23.02.8

Describe the bug
After Snakemake is unable to connect to the SLURM database, it stops trying to connect, even after the connection issue has been resolved.

Logs

Job 6 has been submitted with SLURM jobid 20123883 (log: .snakemake/slurm_logs/rule_qc_seq_cutadapt/samplename/20123883.log).                                                       
The job status query failed with command: sacct -X --parsable2 --clusters all --noheader --format=JobIdRaw,State --starttime 2025-01-22T10:00 --endtime now --name 8ca29162-60ef-4968-8633-c28df38e937f                              
Error message: sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurm-hpc:6819: Connection refused                                                                                
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

The job status query failed with command: sacct -X --parsable2 --clusters all --noheader --format=JobIdRaw,State --starttime 2025-01-22T10:00 --endtime now --name 8ca29162-60ef-4968-8633-c28df38e937f                              
Error message: sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurm-hpc:6819: Connection refused                                                                                
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

The job status query failed with command: sacct -X --parsable2 --clusters all --noheader --format=JobIdRaw,State --starttime 2025-01-22T10:00 --endtime now --name 8ca29162-60ef-4968-8633-c28df38e937f                              
Error message: sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurm-hpc:6819: Connection refused                                                                                
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

The job status query failed with command: sacct -X --parsable2 --clusters all --noheader --format=JobIdRaw,State --starttime 2025-01-22T10:00 --endtime now --name 8ca29162-60ef-4968-8633-c28df38e937f                              
Error message: sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurm-hpc:6819: Connection refused                                                                                
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

The job status query failed with command: sacct -X --parsable2 --clusters all --noheader --format=JobIdRaw,State --starttime 2025-01-22T10:00 --endtime now --name 8ca29162-60ef-4968-8633-c28df38e937f                              
Error message: sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurm-hpc:6819: Connection refused                                                                                
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

Minimal example

Additional context

@Redmar-van-den-Berg
Author

I tried to give the Snakemake process a nudge with kill -SIGTERM ${snakemake_pid}, but it only prints:
Will exit after finishing currently running jobs (scheduler).

and doesn't appear to query for job status again.

@cmeesters
Member

Thank you for bringing this issue to my attention. I believe, however, that it cannot be attributed to the plugin itself: sacct is an internal part of any complete SLURM installation, and it can evidently be found; otherwise we would see a different error message. If, however, the SLURM DB is down, you should approach your admins.

Have you been running Snakemake on your cluster successfully before?

@Redmar-van-den-Berg
Author

Redmar-van-den-Berg commented Jan 27, 2025

The SLURM DB was down temporarily (or it was a network issue), but Snakemake did not continue after it came back up again, even though sacct shows that the job completed successfully.

I used to run Snakemake with my own SLURM wrapper, which can be found here. I treated any failure of the sacct command as the status being "running", and once the SLURM DB came back online it would pick up that the jobs had completed and Snakemake would continue on.
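
Roughly like this (a minimal sketch with hypothetical names, not the actual wrapper code): if the sacct call fails for any reason, the job is reported as running and the next poll gets to sort it out.

    import subprocess

    def query_status(jobid: str) -> str:
        """Return the SLURM state for a job, or "RUNNING" if sacct fails."""
        try:
            out = subprocess.run(
                ["sacct", "-X", "--parsable2", "--noheader",
                 "--format=JobIdRaw,State", "-j", jobid],
                capture_output=True, text=True, check=True,
            )
        except subprocess.CalledProcessError:
            # sacct exited non-zero (e.g. "Connection refused" to the SLURM DB):
            # pretend the job is still running and try again on the next poll.
            return "RUNNING"
        lines = out.stdout.strip().splitlines()
        return lines[0].split("|")[1] if lines else "RUNNING"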

Could this also be implemented in the executor plugin? I think transient network issues when connecting to the SLURM DB should not hang a pipeline indefinitely.

@cmeesters
Member

Definitely worth consideration. I need to teach over the coming days and will try to pick this up next week; I will therefore leave the issue open.

However, I would very much prefer to catch a specific error, e.g. slurm_persist_conn_open_without_init: failed to open persistent connection to host, rather than to consider every unknown state or error as "running".

@cmeesters
Member

Sorry, I could not get back to you any sooner. The plugin only considers a job "failed" if its status is in

    fail_stati = (
        "BOOT_FAIL",
        "CANCELLED",
        "DEADLINE",
        "FAILED",
        "NODE_FAIL",
        "OUT_OF_MEMORY",
        "TIMEOUT",
        "ERROR",
    )

It is just a blacklist, whereas you provide both a blacklist and a whitelist. I think our code is in essence equivalent in how it treats job statuses.

However, I opened a PR (#232), which reacts to a failed sacct attempt by simply waiting (up to 10 minutes). If the SLURM DB cannot respond over an extended period of time, something more serious is wrong with the cluster, IMO. Would you like to test the PR? Are there other error messages to be considered?
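
Conceptually, it works along these lines (a rough sketch of the idea only, not the code from the PR; the function name and the bounds are illustrative):

    import subprocess
    import time

    def run_sacct_with_patience(cmd, max_wait=600, pause=30):
        """Re-run a failing sacct query, giving up only after max_wait seconds."""
        waited = 0
        while True:
            proc = subprocess.run(cmd, capture_output=True, text=True)
            if proc.returncode == 0:
                return proc.stdout
            if waited >= max_wait:
                # The SLURM DB has been unreachable for too long; escalate.
                raise RuntimeError(f"sacct kept failing: {proc.stderr.strip()}")
            time.sleep(pause)  # e.g. a transient "Connection refused" to the DB
            waited += pause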

@Redmar-van-den-Berg
Author

I'm not an expert on SLURM, but I have seen cases on our cluster where the SLURM DB was unavailable for a long time, I think a couple of hours (someone inadvertently DoSed the database by submitting something like a million jobs).

While new jobs could not be scheduled, running jobs just happily kept going, since the execution nodes do not rely on the SLURM DB being there. Some jobs can run for weeks, so they could still be running by the time the SLURM DB comes back online. You don't want to kill a job that has been running for a week over a network issue that takes 15 minutes to resolve.

Another problem with throwing an error after a set amount of time is deciding what Snakemake should do about the error. The job could still be running on the execution nodes, and hence writing to the output files, but Snakemake would be unaware of it. And if the user restarts the pipeline, you could end up in a situation where multiple jobs on multiple nodes are writing into the same file.

SLURM has a whole bunch of states, but as far as Snakemake is concerned there are only three:

  • Not done yet
  • Successful
  • Failed

As long as a job is not successful or failed, we should query the cluster again in a bit to see if the job has changed to either of those two states. This elegantly handles the case where the SLURM-DB is not available: we just wait for a bit and ask again, until we get either successful or failed.
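
A sketch of that three-state view (a hypothetical helper; it reuses the fail_stati list quoted above, and treating "COMPLETED" as the only success state is an assumption): anything that is neither a success nor an explicit failure, including a failed query, simply means "ask again later".

    SUCCESS_STATI = {"COMPLETED"}
    FAIL_STATI = {"BOOT_FAIL", "CANCELLED", "DEADLINE", "FAILED",
                  "NODE_FAIL", "OUT_OF_MEMORY", "TIMEOUT", "ERROR"}

    def classify(slurm_state):
        """Collapse a SLURM state (or None, when sacct failed) into three outcomes."""
        if slurm_state in SUCCESS_STATI:
            return "successful"
        if slurm_state in FAIL_STATI:
            return "failed"
        # PENDING, RUNNING, an unknown state, or an unreachable SLURM DB:
        # not done yet, so simply poll again later.
        return "not done yet"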

@cmeesters
Member

What you describe is essentially a dysfunctional cluster state. There is just no way for a user process to decide when, or whether, it might recover. Waiting a little longer? OK. Waiting virtually indefinitely? No. That has all sorts of repercussions.

I will just implement:

  • tolerating this particular error (or perhaps more in the future)
  • a flag to increase the number of status retrieval attempts

johanneskoester pushed a commit that referenced this issue Mar 14, 2025
The plugin ought to wait a bit, if a connection to the SLURM database
cannot be established (the `sacct` command fails).

This PR addresses issue report #207.
@cmeesters
Member

Now in the main branch: a new flag, --slurm-status-attempts. It can be configured in your global profile, too. Set it to a high number and the workflow will more or less silently wait forever.

It is not yet released, because we are working on a huge overhaul of the documentation that is to be included.
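
For illustration only (the flag name is as quoted above; the value and the profile key spelling are assumptions, following the usual convention that profile keys mirror command-line flags without the leading dashes):

    # on the command line
    snakemake --executor slurm --slurm-status-attempts 100 ...

    # or in the profile's config.yaml
    executor: slurm
    slurm-status-attempts: 100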

@Redmar-van-den-Berg
Author

Thanks for adding a configuration option for this! Would it be possible to specify a negative number to indicate waiting indefinitely? That way both use cases are covered in a single setting.

@cmeesters
Member

If we assume a high number, say 1000, that would mean waiting through at least 1000 query attempts. If nothing happens, the wait would increase to 180 seconds per attempt. Plus, you can set Snakemake's rate limiter. In total, that is already indefinite for all practical purposes.

I would rather suggest increasing the resources for the SLURM setup ;-).
