SLURM executor appears to rebuild complete DAG for each individual job in large workflows #215

Open
mat10d opened this issue Feb 27, 2025 · 10 comments

@mat10d

mat10d commented Feb 27, 2025

Software Versions

snakemake --version
8.27.1
mamba list | grep "snakemake-executor-plugin-slurm"
snakemake-executor-plugin-slurm 0.15.0             pyhdfd78af_0    bioconda
snakemake-executor-plugin-slurm-jobstep 0.2.1              pyhdfd78af_0    bioconda
sinfo --version
slurm 23.11.5

Bug
When running Snakemake with the SLURM executor, each individual rule job appears to reload the Snakefile and rebuild the entire DAG. For large workflows, this causes significant memory usage and processing time overhead for every job. Our DAG takes ~109 seconds to build and consumes over 2GB of memory, which is then repeated for each of thousands of jobs.

Bash script run on the login node, via a tmux session:

snakemake --executor slurm --use-conda \
    --workflow-profile "slurm/" \
    --snakefile "/lab/barcheese01/mdiberna/brieflow/workflow/Snakefile_well_debug" \
    --configfile "config/config.yml" \
    --latency-wait 30 \
    --until extract_metadata_sbs extract_metadata_phenotype

This project is based on the suggested Snakemake workflow structure and consists of a Snakefile_well_debug.txt that loads rules and targets and then deploys them.
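For reference, one way to measure the DAG build alone on the login node is to time a dry run; a minimal sketch (the report file name is arbitrary):

# a dry run parses the Snakefile and builds the full DAG without submitting any jobs
/usr/bin/time -v -o dag_build_report.txt snakemake -n \
    --workflow-profile "slurm/" \
    --snakefile "/lab/barcheese01/mdiberna/brieflow/workflow/Snakefile_well_debug" \
    --configfile "config/config.yml" \
    --until extract_metadata_sbs extract_metadata_phenotype
# wall time and peak memory of the DAG build
grep -E "Elapsed|Maximum resident set size" dag_build_report.txt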

Expected behavior
We expect the DAG to be built once in the main Snakemake process (in the tmux session), and individual jobs to only need to execute their specific rule without rebuilding the entire workflow graph. If this were the case, the print statements we added (to test memory usage) would only appear in the tmux session the workflow is run in. This is especially important for large workflows where DAG building is resource-intensive in both memory and runtime.

What we're seeing
While the Snakefile print statements do appear in the tmux session, they also appear in every .out file generated by SLURM job submission, for example 5580823.txt. Furthermore, we see that the memory usage of the individual jobs is effectively the sum of what the main Snakefile consumes (stably ~2 GB) plus what the job itself requires (<200 MB). The time it takes to run each rule also appears to increase with the size of our input data (and thereby the DAG).

I can't seem to find any documentation about whether this is expected, or whether it can be bypassed with the plugin. Thanks in advance.

@mat10d mat10d changed the title SLURM executor rebuilds complete DAG for each individual job in large workflows SLURM executor appears to rebuild complete DAG for each individual job in large workflows Feb 27, 2025
@cmeesters
Member

Thank you for bringing this issue to our attention:

  • Yes, Snakemake submits itself. This is for a number of reasons: otherwise it could not, for example, provision its software environments on the compute nodes or guarantee the execution of group jobs. It certainly merits better documentation; I will attend to it.
  • As Snakemake operates on a single job when in a SLURM job context, it assumes it is executing an entire workflow (which might actually be the case in upcoming releases) and computes the DAG. This in itself is fast; after all, there is only one job. I am therefore not sure whether you are measuring the DAG building time or rather the accumulated start-up time (e.g. added file system latency). In particular, file system latency can kick in when operating on a parallel file system without striping, in directories with too many files, or on a network file system (NFS) that is strained, either in general or because of the many jobs. Another reason might be Snakemake itself: if you are running many jobs, the source code (Python), which is dynamically loaded every time, might need more time to become available on a compute node because it is accessed so many times; here it depends on the file system delivering the source code. Is it possible that any of those reasons applies in your case? We might(!) find a solution in staging your data in and out with the fs storage plugin (see the sketch after this list).
  • As for the memory footprint: yes, this is a concern. Thank you for submitting the little code snippet to underline this; we definitely need to work on limiting the memory footprint. We will discuss this.
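To illustrate the last point, staging in and out via the fs storage plugin could look roughly like this; a sketch only, assuming snakemake-storage-plugin-fs is installed and a node-local scratch path exists (the prefix below is purely illustrative):

snakemake --executor slurm --use-conda \
    --workflow-profile "slurm/" \
    --snakefile "/lab/barcheese01/mdiberna/brieflow/workflow/Snakefile_well_debug" \
    --configfile "config/config.yml" \
    --default-storage-provider fs \
    --local-storage-prefix "/local/scratch/$USER/snakemake-staging" \
    --until extract_metadata_sbs extract_metadata_phenotype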

@mat10d
Author

mat10d commented Feb 28, 2025

Thank you for your detailed response, for confirming this behavior, and for getting back to us so quickly. Points 1 and 3 are well noted, and I agree they would be nice to document. The memory footprint especially is quite complex, because one would expect identical rules to require the same amount of memory per job regardless of workflow size, but that isn't exactly the case.

I have been thinking about point 2. Just to be clear, are you saying that the DAG is computed, but that the computation only covers one job? I think the print statements in the .out file do point towards this -- the time we are measuring for the Snakefile to run doesn't exactly capture how long it takes to build the DAG, since that happens after the Snakefile is parsed, from what I can tell.

I've noticed this: while our print statements show that running the Snakefile takes about 120 seconds both on the scheduler node and in each job, the initial scheduling process takes about 30 minutes before it starts submitting jobs. However, the individual SLURM jobs don't seem to have the same startup delay, despite showing the same ~120-second parsing prints. I am not sure whether there is a way of reporting the actual time the DAG computation takes, both for the head scheduler and for each individual job.

This suggests (to me) that the main Snakemake process is doing much more comprehensive work (full dependency resolution, scheduling logic, etc.) than what's happening in each job. The jobs might be loading the Snakefile and building only the necessary portion of the DAG for their specific task, which still creates some delays, but is faster.

Thanks again for looking into this, and let me know if the above is accurate. Again, the memory footprint improvement would be very helpful for large-scale workflows like ours, and generally helpful for users across the board, I believe. I wonder whether there is a way of automatically adding the DAG-computation memory to the specified rule memory for each job.
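In the meantime, I suppose we could pad each rule's memory request by roughly the overhead we observe; a sketch of what I mean (the 2200 MB figure is just our observed ~2 GB overhead plus the rule's own <200 MB, not a recommendation):

snakemake --executor slurm --use-conda \
    --workflow-profile "slurm/" \
    --set-resources "extract_metadata_sbs:mem_mb=2200" \
    --until extract_metadata_sbs extract_metadata_phenotype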

I am not sure of the best way to share the entire workflow, which can be seen here. For just the first of the 5 steps in this analysis, this is the number of jobs generated:

Provided remote nodes: 300
Job stats:
job                       count
----------------------  -------
all_preprocess                1
calculate_ic_phenotype       45
calculate_ic_sbs            495
convert_phenotype         57645
convert_sbs              164835
total                    223021

This is reading and processing about 20TB of data.

@cmeesters
Member

cmeesters commented Mar 5, 2025

Well, regarding the memory footprint: I did a little test submission (of course, I cannot reproduce your workflow, for lack of inputs), once with and once without a pandas import: 72 MB and 27 MB, respectively. The difference is just the Pandas library; of course, since I was not doing anything with pandas, the GC could kick in. edit: yet the difference is about the size of the pandas lib. I cannot reproduce a 2 GB footprint. Perhaps you can dig into it?

To see the DAG calculation time within jobs, run with --verbose and --slurm-keep-successful-logs. You will realize that the DAG is actually built within a reasonable time (0.06 seconds for my toy test, a few seconds for a real one), whereas in a job context it is hardly built at all, since there is only one job; the log merely states that it is building the DAG. The issue, I presume, is that it still needs to perform a stat call to check for the presence of your input(s), and if the file system is laggy, that will consume time, particularly if it also has to wait for the Python files to be loaded.
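For example (a sketch; adjust the log path to wherever your job logs end up):

snakemake --executor slurm --use-conda \
    --workflow-profile "slurm/" \
    --verbose \
    --slurm-keep-successful-logs \
    --until extract_metadata_sbs extract_metadata_phenotype
# once some jobs have finished, search the kept job logs for the DAG-related lines
grep -i "dag" path/to/slurm_job_logs/*.out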

The sheer amount of data should not play a role, as long as you are not caching (in which case Snakemake will calculate hash sums, and doing that for 20 TB will take time).

@mat10d
Author

mat10d commented Mar 5, 2025

Got it -- I will use those flags to check the overhead, and I will dig into the footprint. Since we spoke, we have simplified our Snakefile to do as little computation and data loading as possible to decrease this overhead, and that seems to have helped.

One additional question -- should specifying a rerun trigger (we are currently using --rerun-triggers mtime) or marking outputs (e.g. temp, protected) also affect the speed of a run?

@cmeesters
Member

Regarding mtime, I am not sure myself; I need to dig into the code. I guess it will all result in a stat call. That is the downside of POSIX: tools tend to gather all stats and then report only a part of them. You can trace an ordinary ls call and compare it with ls -l to see what I mean; there is only a small difference per file.
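For instance, a quick way to see that difference (syscall counts only; run in any directory):

# count syscalls for a plain listing vs. a long listing
strace -c -o ls_plain.txt ls > /dev/null
strace -c -o ls_long.txt ls -l > /dev/null
diff ls_plain.txt ls_long.txt    # the long listing adds stat-family calls per file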

  • temp() can affect the speed of a re-run, not a run. It may, however, be beneficial when it comes to data management ...
  • likewise, protected() will only impact runtime when applied to many files. It might crash a workflow upon re-run.

Both effects will be negligible unless applied a zillion times (my gut feeling). If you run /usr/bin/time -v snakemake ... on a login node, you will realize that (see the sketch after this list):

  • the CPU time is a tiny fraction of the runtime of a workflow.
  • there are a zillion file operations in total (of course, since Snakemake will always check for correct outputs, etc.). That might improve if we implement better file-lookup strategies and shift the code base partially to Rust. It will come with a memory price tag, however; it is all a question of balance.
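Concretely, something like this (the grep patterns match fields of GNU time's -v report; the file name is arbitrary):

/usr/bin/time -v -o snakemake_time_report.txt snakemake --executor slurm --use-conda \
    --workflow-profile "slurm/" \
    --until extract_metadata_sbs extract_metadata_phenotype
# CPU share vs. wall time, plus the raw file system activity counters
grep -E "Percent of CPU|Elapsed \(wall clock\)|File system" snakemake_time_report.txt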

Now, did I guess correctly? Are you experiencing file system issues (due to too many files in a folder, NFS mounts, or similar)?

@mat10d
Author

mat10d commented Mar 6, 2025

Sounds good. So it should be safe to use those options with minimal slowdown.

I am not sure whether the file system issues were the trouble. We can split our run into 8 parts in a not-too-complicated fashion, because the data is organized that way. Doing that reduced the job count, made things easier and faster to run, and decreased overhead -- although there is still a bump from whatever runs in the Snakefile itself. We will examine further with others on our team whether the filesystem is responsible for slowing down the full run. Thanks for all your help @cmeesters, and we will get back to you soon with further details.

@mat10d
Author

mat10d commented Mar 8, 2025

Hello @cmeesters -- one other thing I was thinking is that job grouping may reduce some of this overhead, particularly the per-job DAG creation. Is that accurate? I can't find much documentation on grouping with respect to the SLURM executor.
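From what I can glean from the docs, groups can also be given a component count, so that several tiles land in one SLURM job instead of one job per tile; a rough sketch of what I mean (the group name and the count of 50 are made up):

snakemake --executor slurm --use-conda \
    --workflow-profile "slurm/" \
    --groups log_filter=sbs_batch find_peaks=sbs_batch segment_sbs=sbs_batch \
    --group-components sbs_batch=50 \
    --until all_sbs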

Furthermore, I did try some grouping, but I'm seeing some very weird issues:

    snakemake --executor slurm --use-conda \
        --workflow-profile "slurm/" \
        --snakefile "../brieflow/workflow/Snakefile" \
        --configfile "config/config.yml" \
        --latency-wait 60 \
        --rerun-triggers mtime \
        --until all_sbs \
        --groups log_filter=sbs_{plate}_{well}_{tile} \
                compute_standard_deviation=sbs_{plate}_{well}_{tile} \
                find_peaks=sbs_{plate}_{well}_{tile} \
                max_filter=sbs_{plate}_{well}_{tile} \
                apply_ic_field_sbs=sbs_{plate}_{well}_{tile} \
                segment_sbs=sbs_{plate}_{well}_{tile} \
                extract_bases=sbs_{plate}_{well}_{tile} \
                call_reads=sbs_{plate}_{well}_{tile} \
                call_cells=sbs_{plate}_{well}_{tile} \
                extract_sbs_info=sbs_{plate}_{well}_{tile} \
        --config plate_filter=$PLATE

The workflow profile in the slurm folder lists only one partition, 20:

default-resources:
    slurm_partition: 20
    slurm_account: wibrusers
    mem_mb: 3000
    tasks: 1
    cpus_per_task: 1
    runtime: 400
    slurm_extra: "'--output=slurm/slurm_output/rule/%j.out'"
jobs: 300

# for individual jobs
set-resources:
    # preprocessing
    extract_metadata_sbs:

This generates output which seems to suggest that the grouping is going according to plan, but for some reason the group partition is set to 60:

SLURM sbatch failed. The error message was sbatch: error: invalid partition specified: 60

Is this a bug? Do I need to change the config.yaml to specify a group partition (even though the default partition should scale to everything)? Should grouping improve anything at all?

@cmeesters
Member

  • I doubt job grouping influences the DAG creation (heavily). Did you measure with --verbose?
  • In contrast, the sheer number of jobs may result in a huge number of files in single directories, which rather indicates that my earlier suspicion (the file system might cause this) may be right. Is your workflow public (a repo link)?
  • What? HPC admins are weird folk! (I am one, so I am allowed to write that.) Your partition is a numeral? It is still weird that the partition changes and is a multiple of an earlier partition. Can you please give me (see the example after the list):
  1. the output of sinfo -sa
  2. a verbose log (run with --verbose)?
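For example (file names are illustrative):

# 1. partition overview
sinfo -sa > sinfo_summary.txt
# 2. a verbose run, captured to a log file
snakemake --executor slurm --use-conda \
    --workflow-profile "slurm/" \
    --verbose \
    --until all_sbs 2>&1 | tee snakemake_verbose.log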

@andynu

andynu commented Mar 10, 2025

😂 The partitions correspond to the Ubuntu LTS major version the node is based on, which also corresponds to the default set of software (glibc, etc.). They are numeric, e.g. 18, 20, 24, etc.

@cmeesters
Member

cmeesters commented Mar 11, 2025

@andynu & @mat10d the code at no point alters the partition name (except for removing the asterisk with which SLURM marks its default partition). So, while the fantasy of fellow admins in labelling, constraining, setting up, and ... "their" clusters never ceases to amaze me, I still need further information to pinpoint the issue(s). That is why I asked for it above.
