Jobs submitted multiple times in different groups #253

Open
marcsingleton opened this issue Apr 10, 2025 · 1 comment

@marcsingleton

Software Versions
snakemake 9.1.9
snakemake-executor-plugin-slurm 1.1.0

Describe the bug
When using job groups and checkpoints together, the executor will submit duplicate jobs in different groups if the workflow is resumed after a checkpoint.

Minimal example
In this example, checkpoint A produces a variable number of files. Rule B operates on those files individually, and rule gather_B forces execution of B for each discovered file by expanding the wildcards after the checkpoint has run.

from pathlib import Path

output_path = Path('output')

# Checkpoint producing a variable number of files
checkpoint A:
    output:
        dir = directory(output_path / 'A')
    shell:
        'mkdir -p {output}; for id in {{1..10}}; do touch {output}/file_$id.txt; done'

# Grouped rule that processes each checkpoint output file individually
rule B:
    input:
        Path(rules.A.output.dir) / 'file_{id}.txt'
    output:
        dir = directory(output_path / 'B/{id}/')
    group:
        'group_B'
    resources:
        gpus = 1
    shell:
        'mkdir -p {output}; echo id={wildcards.id} hostname=$(hostname) >> log.txt; sleep 10'

# Input function that resolves the {id} wildcards once checkpoint A has run
def gather_B_input(wildcards):
    As = checkpoints.A.get(**wildcards).output.dir
    A = Path(As) / 'file_{id}.txt'
    ids = glob_wildcards(A).id
    return sorted(expand(rules.B.output.dir, id=ids))

# Aggregation rule that triggers execution of B for every discovered id
rule gather_B:
    input:
        gather_B_input
    output:
        output_path / 'gather_B.txt'
    shell:
        'echo {input} > {output}'

With the following profile

executor: slurm
jobs: 5

resources:
    gpus: 1

set-resource-scopes:
    gpus: local

groups:
    group_B: group_B

group-components:
    group_B: 4

set-resources:
  B:
    slurm_partition: gpus
    runtime: 1h

the workflow should run 3 group jobs (the 10 rule B jobs split into groups of at most 4 components) with time limits of 4, 4, and 2 hours. (The gpu configuration is a holdover from the original workflow, but I don't think it's relevant to the issue.) Instead, if snakemake gather_B is run after running snakemake A, SLURM reports that 4 jobs are created:

JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
647974    gpus 77291256  mdsingl  RUNNING       0:15   4:00:00      1 gn125
647975    gpus 77291256  mdsingl  RUNNING       0:15   4:00:00      1 gn126
647972    gpus 77291256  mdsingl  RUNNING       0:16   2:00:00      1 gn65
647973    gpus 77291256  mdsingl  RUNNING       0:16  10:00:00      1 gn66
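
For reference, the two-step invocation described above looks roughly like the following sketch; the profile directory name and the use of --workflow-profile are assumptions, not part of the original report:

# first run: execute only the checkpoint (profiles/slurm is a placeholder path)
snakemake --workflow-profile profiles/slurm A
# second run: resume and request the downstream target; this is where the duplicate group jobs appear
snakemake --workflow-profile profiles/slurm gather_B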

Some of the rule B jobs are actually run multiple times, as shown by log.txt, which rule B appends to:

id=6 hostname=gn126
id=9 hostname=gn65
id=7 hostname=gn125
id=9 hostname=gn66
id=4 hostname=gn126
id=3 hostname=gn65

id=4 hostname=gn66
id=2 hostname=gn126
id=5 hostname=gn125
id=1 hostname=gn66
id=10 hostname=gn126
id=6 hostname=gn66

id=7 hostname=gn66
id=2 hostname=gn66
id=3 hostname=gn66
id=10 hostname=gn66
id=5 hostname=gn66
id=8 hostname=gn66

Snakemake also reports more than 100% completion as it repeats the jobs.

Finished jobid: 11 (Rule: B)
20 of 11 steps (182%) done
[Wed Apr  9 22:19:25 2025]
Finished jobid: 0 (Rule: gather_B)
21 of 11 steps (191%) done

Interestingly, this issue doesn't occur if the workflow is run start to finish in one command.
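
For comparison, a single start-to-finish run along these lines, from a clean working directory, executes each rule B job exactly once (again, the profile path and flag are placeholders):

snakemake --workflow-profile profiles/slurm gather_B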

@marcsingleton
Author

Is there any advice on a workaround for this issue? Thanks!
