Snakemake hangs forever with the SLURM executor after slurm_persist_conn_open_without_init #207

Open
Redmar-van-den-Berg opened this issue Jan 24, 2025 · 10 comments

Comments

@Redmar-van-den-Berg

The Snakemake process hangs forever after the sacct command receives a "Connection refused". It looks like Snakemake stops trying to query the status of the job after a few retries. In fact, the submitted job has already completed successfully, and I am able to query its job status manually using the sacct command.

Software Versions
snakemake-minimal=8.24.1
snakemake-executor-plugin-slurm=0.14.2
slurm 23.02.8

Describe the bug
After Snakemake is unable to connect to the SLURM database, it stops trying to connect, even after the connection issue has been resolved.

Logs

Job 6 has been submitted with SLURM jobid 20123883 (log: .snakemake/slurm_logs/rule_qc_seq_cutadapt/samplename/20123883.log).                                                       
The job status query failed with command: sacct -X --parsable2 --clusters all --noheader --format=JobIdRaw,State --starttime 2025-01-22T10:00 --endtime now --name 8ca29162-60ef-4968-8633-c28df38e937f                              
Error message: sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurm-hpc:6819: Connection refused                                                                                
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

The job status query failed with command: sacct -X --parsable2 --clusters all --noheader --format=JobIdRaw,State --starttime 2025-01-22T10:00 --endtime now --name 8ca29162-60ef-4968-8633-c28df38e937f                              
Error message: sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurm-hpc:6819: Connection refused                                                                                
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

The job status query failed with command: sacct -X --parsable2 --clusters all --noheader --format=JobIdRaw,State --starttime 2025-01-22T10:00 --endtime now --name 8ca29162-60ef-4968-8633-c28df38e937f                              
Error message: sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurm-hpc:6819: Connection refused                                                                                
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

The job status query failed with command: sacct -X --parsable2 --clusters all --noheader --format=JobIdRaw,State --starttime 2025-01-22T10:00 --endtime now --name 8ca29162-60ef-4968-8633-c28df38e937f                              
Error message: sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurm-hpc:6819: Connection refused                                                                                
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

The job status query failed with command: sacct -X --parsable2 --clusters all --noheader --format=JobIdRaw,State --starttime 2025-01-22T10:00 --endtime now --name 8ca29162-60ef-4968-8633-c28df38e937f                              
Error message: sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurm-hpc:6819: Connection refused                                                                                
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

Minimal example

Additional context

@Redmar-van-den-Berg
Author

I tried to give the Snakemake process a nudge with kill -SIGTERM ${snakemake_pid}, but it only prints:
Will exit after finishing currently running jobs (scheduler).

and doesn't appear to query for job status again.

@cmeesters
Member

Thank you for bringing this issue to my attention. I believe, however, that it cannot be attributed to the plugin itself: sacct is an internal part of any complete SLURM installation, and it can evidently be found; otherwise we would see a different error message. If, however, the SLURM DB is down, you should approach your admins.

Have you been running Snakemake on your cluster successfully before?

@Redmar-van-den-Berg
Author

Redmar-van-den-Berg commented Jan 27, 2025

The SLURM DB was down temporarily (or it was a network issue), but Snakemake did not continue after it came back up again, even though sacct shows that the job completed successfully.

I used to run Snakemake with my own SLURM wrapper, which can be found here. I treated any failure of the sacct command as the status being "running", and once the SLURM DB came back online it would pick up that the jobs had completed and Snakemake would continue on.
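
Roughly like this (a minimal sketch with hypothetical names, not the actual wrapper code): if the sacct call fails for any reason, the job is reported as running and the next poll gets to sort it out.

    import subprocess

    def query_status(jobid: str) -> str:
        """Return the SLURM state for a job, or "RUNNING" if sacct fails."""
        try:
            out = subprocess.run(
                ["sacct", "-X", "--parsable2", "--noheader",
                 "--format=JobIdRaw,State", "-j", jobid],
                capture_output=True, text=True, check=True,
            )
        except subprocess.CalledProcessError:
            # sacct exited non-zero (e.g. "Connection refused" to the SLURM DB):
            # pretend the job is still running and try again on the next poll.
            return "RUNNING"
        lines = out.stdout.strip().splitlines()
        return lines[0].split("|")[1] if lines else "RUNNING"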

Could this also be implemented in the executor plugin? I think transient network issues when connecting to the SLURM DB should not hang a pipeline indefinitely.

@cmeesters
Member

Definitely worth consideration. I need to teach over the coming days and will try to pick this up next week; I will therefore leave the issue open.

However, I would very much prefer to catch a specific error, e.g. slurm_persist_conn_open_without_init: failed to open persistent connection to host, rather than to consider every unknown state or error as "running".

@cmeesters
Member

Sorry, I could not get back to you any sooner. The plugin only considers a job "failed" if its status is in

    fail_stati = (
        "BOOT_FAIL",
        "CANCELLED",
        "DEADLINE",
        "FAILED",
        "NODE_FAIL",
        "OUT_OF_MEMORY",
        "TIMEOUT",
        "ERROR",
    )

It is just a blacklist, whereas you provide both a blacklist and a whitelist. I think our code is in essence equivalent in how it treats job statuses.

However, I opened a PR (#232), which reacts to a failed sacct attempt by simply waiting (up to 10 minutes). If the SLURM DB cannot respond over an extended period of time, something more serious is wrong with the cluster, IMO. Would you like to test the PR? Are there other error messages to be considered?
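
Conceptually, it works along these lines (a rough sketch of the idea only, not the code from the PR; the function name and the bounds are illustrative):

    import subprocess
    import time

    def run_sacct_with_patience(cmd, max_wait=600, pause=30):
        """Re-run a failing sacct query, giving up only after max_wait seconds."""
        waited = 0
        while True:
            proc = subprocess.run(cmd, capture_output=True, text=True)
            if proc.returncode == 0:
                return proc.stdout
            if waited >= max_wait:
                # The SLURM DB has been unreachable for too long; escalate.
                raise RuntimeError(f"sacct kept failing: {proc.stderr.strip()}")
            time.sleep(pause)  # e.g. a transient "Connection refused" to the DB
            waited += pause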

@Redmar-van-den-Berg
Author

I'm not an expert on SLURM, but I have seen cases on our cluster where the SLURM DB was unavailable for a long time, I think a couple of hours (someone inadvertently DoSed the database by submitting something like a million jobs).

While new jobs could not be scheduled, running jobs just happily kept going, since the execution nodes do not rely on the SLURM DB being there. Some jobs can run for weeks, so they could still be running by the time the SLURM DB comes back online. You don't want to kill a job that has been running for a week over a network issue that takes 15 minutes to resolve.

Another problem with throwing an error after a set amount of time is deciding what Snakemake should do about the error. The job could still be running on the execution nodes, and hence writing to the output files, but Snakemake would be unaware of it. And if the user restarts the pipeline, you could end up in a situation where multiple jobs on multiple nodes are writing into the same file.

SLURM has a whole bunch of states, but as far as Snakemake is concerned there are only three:

  • Not done yet
  • Successful
  • Failed

As long as a job is not successful or failed, we should query the cluster again in a bit to see if the job has changed to either of those two states. This elegantly handles the case where the SLURM-DB is not available: we just wait for a bit and ask again, until we get either successful or failed.
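
A sketch of that three-state view (a hypothetical helper; it reuses the fail_stati list quoted above, and treating "COMPLETED" as the only success state is an assumption): anything that is neither a success nor an explicit failure, including a failed query, simply means "ask again later".

    SUCCESS_STATI = {"COMPLETED"}
    FAIL_STATI = {"BOOT_FAIL", "CANCELLED", "DEADLINE", "FAILED",
                  "NODE_FAIL", "OUT_OF_MEMORY", "TIMEOUT", "ERROR"}

    def classify(slurm_state):
        """Collapse a SLURM state (or None, when sacct failed) into three outcomes."""
        if slurm_state in SUCCESS_STATI:
            return "successful"
        if slurm_state in FAIL_STATI:
            return "failed"
        # PENDING, RUNNING, an unknown state, or an unreachable SLURM DB:
        # not done yet, so simply poll again later.
        return "not done yet"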

@cmeesters
Member

What you describe is essentially a dysfunctional cluster state. There is just no way for a user process to decide when, or whether, it might recover. Waiting a little longer? OK. Waiting virtually indefinitely? No. That has all sorts of repercussions.

I will just implement:

  • tolerating this particular error (or perhaps more in the future)
  • a flag to increase the number of status retrieval attempts

johanneskoester pushed a commit that referenced this issue Mar 14, 2025
The plugin ought to wait a bit, if a connection to the SLURM database
cannot be established (the `sacct` command fails).

This PR addresses issue report #207.
@cmeesters
Member

Now in the main branch: a new flag, --slurm-status-attempts. It can be configured in your global profile, too. Set it to a high number and the workflow will more or less silently wait forever.

It is not yet released, because we are working on a huge overhaul of the documentation that is to be included.
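
For illustration only (the flag name is as quoted above; the value and the profile key spelling are assumptions, following the usual convention that profile keys mirror command-line flags without the leading dashes):

    # on the command line
    snakemake --executor slurm --slurm-status-attempts 100 ...

    # or in the profile's config.yaml
    executor: slurm
    slurm-status-attempts: 100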

@Redmar-van-den-Berg
Author

Thanks for adding a configuration option for this! Would it be possible to specify a negative number to indicate waiting indefinitely? That way both use cases are covered in a single setting.

@cmeesters
Member

If we assume a high number, say 1000, that would mean waiting through at least 1000 query attempts. If nothing happens, the wait would increase to 180 seconds per attempt. Plus, you can set Snakemake's rate limiter. In total, that is already indefinite for all practical purposes.

I would rather suggest increasing the resources for the SLURM setup ;-).
