
Commit c0f8fee

cmeesters, github-actions[bot], and johanneskoester authored
feat: multicluster (#56)
Putative fix for issue #53

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
1 parent 6a5ed46 commit c0f8fee

2 files changed: +79, -6 lines


docs/further.md

Lines changed: 56 additions & 0 deletions
@@ -119,6 +119,7 @@ You can use the following specifications:
| `--ntasks` | `tasks` | number of concurrent tasks / ranks |
| `--cpus-per-task` | `cpus_per_task` | number of cpus per task (in case of SMP, rather use `threads`) |
| `--nodes` | `nodes` | number of nodes |
+| `--clusters` | `clusters` | comma-separated string of clusters |

Each of these can be part of a rule, e.g.:

@@ -159,6 +160,10 @@ set-resources:
        cpus_per_task: 40
```

+## Multicluster Support
+
+For scheduling reasons, multicluster support is provided via the `clusters` flag in the resources section. Note that you have to write `clusters`, not `cluster`!
+
## Additional Custom Job Configuration

SLURM installations can support custom plugins, which may add support
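
As an illustration of the `clusters` resource documented in the hunk above, a minimal sketch of a rule-level setting (not part of the commit); the rule name and the cluster names are hypothetical placeholders:

```Python
rule my_rule:
    input: ...
    output: ...
    resources:
        # note the plural: 'clusters', not 'cluster'
        clusters="cluster1,cluster2"
    shell:
        "..."
```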
@@ -323,6 +328,57 @@ Some environments provide a shell within a SLURM job, for instance, IDEs started

If the plugin detects that it is running within a job, it will therefore issue a warning and stop for 5 seconds.

+## Retries - Or Trying Again When a Job Failed
+
+Some cluster jobs may fail. In this case Snakemake can be instructed to try another submission before the entire workflow fails, in this example up to 3 times:
+
+```console
+snakemake --retries=3
+```
+
+If a workflow fails entirely (e.g. when there are cluster failures), it can be resumed like any other Snakemake workflow:
+
+```console
+snakemake --rerun-incomplete
+```
+
+To prevent failures due to faulty parameterization, we can dynamically adjust the runtime behaviour:
+
+## Dynamic Parameterization
+
+Using dynamic parameterization we can react to different inputs and prevent our HPC jobs from failing.
+
+### Adjusting Memory Requirements
+
+Input file sizes may vary. [If we have an estimate for the RAM requirement due to varying input file sizes, we can use this to dynamically adjust our jobs.](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#dynamic-resources)
+
+### Adjusting Runtime
+
+Runtime adjustments can be made in a Snakefile:
+
+```Python
+def get_time(wildcards, attempt):
+    return f"{1 * attempt}h"
+
+rule foo:
+    input: ...
+    output: ...
+    resources:
+        runtime=get_time
+    ...
+```
+
+or in a workflow profile:
+
+```YAML
+set-resources:
+    foo:
+        runtime: f"{1 * attempt}h"
+```
+
+Be sure to use sensible settings for your cluster and make use of parallel execution (e.g. threads) and [global profiles](#using-profiles) to avoid I/O contention.
+
+
## Summary:

When put together, a frequent command line looks like:
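
The "Adjusting Memory Requirements" subsection in the hunk above only links to the Snakemake documentation on dynamic resources; a minimal sketch of what such a callable could look like (not part of the commit), assuming a hypothetical rule `bar` and arbitrary scaling values:

```Python
def get_mem_mb(wildcards, input, attempt):
    # scale with the total input size (in MB) and with the retry attempt;
    # the factor 2 and the 1000 MB floor are placeholder values
    return max(1000, int(2 * input.size_mb) * attempt)

rule bar:
    input: ...
    output: ...
    resources:
        mem_mb=get_mem_mb
    ...
```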

snakemake_executor_plugin_slurm/__init__.py

Lines changed: 23 additions & 6 deletions
@@ -7,6 +7,7 @@
from io import StringIO
import os
import re
+import shlex
import subprocess
import time
from dataclasses import dataclass, field
@@ -136,6 +137,9 @@ def run_job(self, job: JobExecutorInterface):
        call += self.get_account_arg(job)
        call += self.get_partition_arg(job)

+        if job.resources.get("clusters"):
+            call += f" --clusters {job.resources.clusters}"
+
        if job.resources.get("runtime"):
            call += f" -t {job.resources.runtime}"
        else:
@@ -200,7 +204,11 @@ def run_job(self, job: JobExecutorInterface):
                f"SLURM job submission failed. The error message was {e.output}"
            )

-        slurm_jobid = out.split(" ")[-1]
+        # Multicluster submissions yield submission infos like
+        # "Submitted batch job <id> on cluster <name>".
+        # To extract the job id in this case, we need to match the first
+        # number in the output string - a format which might change in
+        # future versions of SLURM.
+        slurm_jobid = re.search(r"\d+", out).group()
        slurm_logfile = slurm_logfile.replace("%j", slurm_jobid)
        self.logger.info(
            f"Job {job.jobid} has been submitted with SLURM jobid {slurm_jobid} "
@@ -264,15 +272,22 @@ async def check_active_jobs(
        # in line 218 - once v20.11 is definitively not in use any more,
        # the more readable version ought to be re-adapted

+        # -X: only show main job, no substeps
+        sacct_command = f"""sacct -X --parsable2 \
+                        --clusters all \
+                        --noheader --format=JobIdRaw,State \
+                        --starttime {sacct_starttime} \
+                        --endtime now --name {self.run_uuid}"""
+
+        # for better readability in verbose output
+        sacct_command = " ".join(shlex.split(sacct_command))
+
        # this code is inspired by the snakemake profile:
        # https://github.com/Snakemake-Profiles/slurm
        for i in range(status_attempts):
            async with self.status_rate_limiter:
                (status_of_jobs, sacct_query_duration) = await self.job_stati(
-                    # -X: only show main job, no substeps
-                    f"sacct -X --parsable2 --noheader --format=JobIdRaw,State "
-                    f"--starttime {sacct_starttime} "
-                    f"--endtime now --name {self.run_uuid}"
+                    sacct_command
                )
                if status_of_jobs is None and sacct_query_duration is None:
                    self.logger.debug(f"could not check status of job {self.run_uuid}")
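
For context (a sketch, not part of the commit): the `shlex.split` plus `" ".join` step above collapses the backslash continuations and padding of the triple-quoted command into a single line for readable verbose logs, e.g.:

```Python
import shlex

cmd = """sacct -X --parsable2 \
        --clusters all \
        --noheader"""

# the line continuations leave runs of spaces behind; splitting and
# re-joining yields one compact command string
print(" ".join(shlex.split(cmd)))
# -> sacct -X --parsable2 --clusters all --noheader
```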
@@ -364,8 +379,10 @@ def cancel_jobs(self, active_jobs: List[SubmittedJobInfo]):
                # about 30 sec, but can be longer in extreme cases.
                # Under 'normal' circumstances, 'scancel' is executed in
                # virtually no time.
+                scancel_command = f"scancel {jobids} --clusters=all"
+
                subprocess.check_output(
-                    f"scancel {jobids}",
+                    scancel_command,
                    text=True,
                    shell=True,
                    timeout=60,
