Snakemake hangs forever with executor SLURM with slurm_persist_conn_open_without_init #207
Comments
I tried to give the Snakemake process a nudge, but it doesn't appear to query for job status again.
Thank you for bringing this issue to my attention. I believe, however, that it cannot be attributed to the plugin itself: have you been running Snakemake on your cluster successfully before?
The slurm db was down temporarily (or it was a network issue), but Snakemake did not continue after it came back up again, even after a nudge. I used to run Snakemake with my own slurm wrapper, which can be found here. It treated any failure of the status query as transient. Could this also be implemented in the executor plugin? I think transient network issues when connecting to the slurm db should not hang a pipeline indefinitely.
Definitely worth consideration. I need to teach in the coming days and will try to pick up on this next week, so I will leave the issue open. However, I would very much prefer to catch a specific exception rather than react to any failure.
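For illustration, a minimal sketch of what catching a specific failure could look like, assuming the plugin shells out to `sacct` via subprocess; the error fragments are taken from this issue's title and report, and everything else here is hypothetical rather than the plugin's actual code:

```python
import subprocess

# Error fragments seen in this issue report; matching on them is an
# assumption for illustration, not the plugin's actual behavior.
TRANSIENT_DB_ERRORS = (
    "slurm_persist_conn_open_without_init",
    "Connection refused",
)

def query_status(jobid: str) -> str | None:
    """Return sacct's State column, or None on a transient DB error."""
    proc = subprocess.run(
        ["sacct", "-j", jobid, "--format=State", "--noheader", "-X"],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        if any(err in proc.stderr for err in TRANSIENT_DB_ERRORS):
            return None  # transient: the caller should retry later
        raise RuntimeError(f"sacct failed: {proc.stderr.strip()}")
    return proc.stdout.strip()
```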
Sorry, I could not get back to you any sooner. The plugin only considers a status as "failed" if the status is in

```python
fail_stati = (
    "BOOT_FAIL",
    "CANCELLED",
    "DEADLINE",
    "FAILED",
    "NODE_FAIL",
    "OUT_OF_MEMORY",
    "TIMEOUT",
    "ERROR",
)
```

This is just a blacklist, whereas you provide both a blacklist and a whitelist. I think our codes are in essence equivalent in how they treat job statuses. However, I opened a PR (#232), which reacts to a faulty sacct attempt and simply waits (up to 10 minutes). If a SLURM DB cannot respond over an extended period of time, something more serious is wrong with the cluster, IMO. Would you like to test the PR? Are there other error messages to be considered?
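As a rough illustration of the PR's described behavior (waiting and retrying after a failed `sacct` call, for up to ten minutes in total): the function name, intervals, and backoff policy below are illustrative assumptions, not the merged code.

```python
import subprocess
import time

def get_status_with_patience(jobid: str, max_wait: float = 600.0) -> str:
    """Retry a failing sacct call for up to max_wait seconds in total."""
    deadline = time.monotonic() + max_wait
    pause = 10.0
    while True:
        proc = subprocess.run(
            ["sacct", "-j", jobid, "--format=State", "--noheader", "-X"],
            capture_output=True,
            text=True,
        )
        if proc.returncode == 0:
            return proc.stdout.strip()
        if time.monotonic() + pause > deadline:
            raise RuntimeError(
                f"sacct kept failing for {max_wait:.0f}s: {proc.stderr.strip()}"
            )
        time.sleep(pause)
        pause = min(pause * 2, 120.0)  # back off, but keep probing
```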
I'm not an expert on slurm, but I have seen cases on our cluster where the SLURM DB was unavailable for a long time, I think a couple of hours (someone inadvertently DoSed the database by submitting something like a million jobs). While new jobs could not be scheduled, running jobs just happily kept going, since the execution nodes do not rely on the SLURM DB being there. Some jobs can run for weeks, so they could still be running by the time the SLURM DB comes back online. You don't want to kill a job that has been running for a week over a network issue that takes 15 minutes to resolve.

Another problem with throwing an error after a set amount of time is: what should Snakemake do about the error? The job could still be running on the execution nodes, and hence writing to the output files, but Snakemake is unaware of it. And if the user restarts the pipeline, you could get into a situation where multiple jobs on multiple nodes are writing to the same file.

Slurm has a whole bunch of states, but as far as Snakemake is concerned there are only three: the job succeeded, the job failed, or the job is still running (see the sketch below). As long as a job is not in one of the terminal states, Snakemake should treat it as still running.
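A sketch of collapsing SLURM's states into those three outcomes, combining a whitelist for success with a blacklist for failure as described above; the function and its fallback policy are illustrative, not the plugin's code:

```python
# White list: states that mean success.
success_stati = ("COMPLETED",)
# Black list: states that mean failure (mirrors the plugin's tuple above).
fail_stati = (
    "BOOT_FAIL", "CANCELLED", "DEADLINE", "FAILED",
    "NODE_FAIL", "OUT_OF_MEMORY", "TIMEOUT",
)

def three_state(status: str) -> str:
    """Collapse SLURM's many job states into the three Snakemake cares about."""
    # sacct may report e.g. "CANCELLED by 12345", hence the prefix checks.
    if any(status.startswith(s) for s in success_stati):
        return "success"
    if any(status.startswith(s) for s in fail_stati):
        return "failed"
    # Anything else (PENDING, RUNNING, SUSPENDED, ...) counts as running;
    # err on that side rather than kill a job over an unknown status.
    return "running"
```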
What you describe is essentially a dysfunctional cluster state. There is just no way for a user process to decide when, and if, it might recover. Waiting a little longer? OK. Waiting virtually indefinitely? No. That has all sorts of repercussions. I will just implement a configurable number of status-query attempts.
The plugin ought to wait a bit if a connection to the SLURM database cannot be established (the `sacct` command fails). This PR addresses issue report #207.

Summary by CodeRabbit

- New Features
  - An optional integer field for configuring the number of attempts to query the status of active jobs, giving users more control over job monitoring.
  - An automatic retry mechanism for job status checks, improving overall monitoring reliability.
- Bug Fixes
  - Improved error handling for job status checks, with clearer logging for potential SLURM database issues and other errors.
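A rough sketch of what such an optional integer setting could look like in a Snakemake executor plugin, assuming the standard ExecutorSettingsBase dataclass pattern from the executor-plugin interface; the field name `status_attempts` and its default are hypothetical, not necessarily what the PR merged:

```python
from dataclasses import dataclass, field
from typing import Optional

from snakemake_interface_executor_plugins.settings import ExecutorSettingsBase

@dataclass
class ExecutorSettings(ExecutorSettingsBase):
    # Hypothetical field name and default; see the merged PR for the
    # actual setting.
    status_attempts: Optional[int] = field(
        default=5,
        metadata={
            "help": "Number of attempts to query the status of active jobs "
                    "before giving up.",
            "env_var": False,
            "required": False,
        },
    )
```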
Now in the main branch: a new flag to configure the number of job-status query attempts. It is not yet released, because we are working on a huge overhaul of the documentation, which will be included with it.
Thanks for adding a configuration option for this! Would it be possible to specify a negative number to indicate waiting indefinitely? That way both use cases are covered in a single setting.
If we assume a high number, say 1000, that would mean waiting through at least 1000 queries. If nothing happens, the wait grows to 180 s per attempt. Plus, you can set Snakemake's rate limiter. In total, this is already effectively indefinite. I rather suggest increasing the resources of the SLURM setup ;-).
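For scale: 1000 attempts at up to 180 s each is 1000 × 180 s = 180,000 s, roughly 50 hours of waiting before the plugin would give up, which is effectively indefinite for most workflows.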
The snakemake process hangs forever after the `sacct` command receives a "Connection refused". It looks like Snakemake stops trying to query the status of the job after a few retries. In fact, the submitted job has already completed successfully, and I'm able to manually query its job status using the `sacct` command, as shown below.
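For reference, a manual check along these lines can be done with something like `sacct -j <jobid> --format=JobID,State,ExitCode`, where `<jobid>` is a placeholder for the SLURM job ID; the exact columns shown depend on your site's configuration.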
Software Versions
snakemake-minimal=8.24.1
snakemake-executor-plugin-slurm=0.14.2
slurm 23.02.8
Describe the bug
After Snakemake was unable to connect to the slurm database, it stopped trying to connect, even after the connection issue had been resolved.
Logs
Minimal example
Additional context