SLURM executor appears to rebuild complete DAG for each individual job in large workflows #215
Comments
Thank you for bringing this issue to our attention:
Thank you for your detailed response, for confirming this behavior, and for getting back to us so quickly. Points 1 and 3 are well noted, and I agree it would be nice to document them. The memory footprint in particular is not intuitive: one would expect a workflow made up of identical rules to require the same amount of memory for each job, but that isn't exactly what we observe.

I have been thinking about point 2. Just to be clear: are you saying that the DAG is computed inside each job, but that the computation covers only that one job? The print statements in the .out files do point towards this. The time we measure for the Snakefile to run doesn't capture how long the DAG takes to build, since that happens after the Snakefile has been evaluated, as far as I can tell. What I've noticed is that while our print statements show the Snakefile evaluation taking about ~120 seconds both on the scheduler node and in each job, the initial scheduling process takes about 30 minutes before it starts submitting jobs. The individual SLURM jobs don't seem to have the same startup delay, even though they show the same ~120-second prints. I am not sure whether there is a way of getting the actual DAG computation time, both for the head scheduler and for each individual job. This suggests (to me) that the main Snakemake process is doing much more comprehensive work (full dependency resolution, scheduling logic, etc.) than what happens in each job; the jobs might be loading the Snakefile and building only the portion of the DAG needed for their specific task, which still introduces a delay, but a smaller one.

Thanks again for looking into this, and let me know if the above is accurate. The memory footprint improvement would be very helpful for large-scale workflows like ours, and I believe generally helpful for users across the board. I wonder if there is a way of automatically adding the DAG-computation memory to the specified rule memory for each job.

I am not sure of the best way to share the entire workflow, which is seen here. For just running the first of the 5 steps in this analysis, this is the number of jobs generated:
This is reading and processing about 20 TB of data.
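For what it's worth, one way to separate the two kinds of parses is to tag the print statements with the environment they run in. A minimal sketch for the top and bottom of the Snakefile (names are placeholders; it only assumes that SLURM exports SLURM_JOB_ID inside submitted jobs, which it does by default):

```python
# Sketch of parse-time instrumentation (placeholder names). The same print
# statement distinguishes the main scheduler parse from the per-job parses,
# because SLURM_JOB_ID is only set inside submitted jobs.
import os
import resource
import time

_parse_start = time.time()
_where = f"SLURM job {os.environ['SLURM_JOB_ID']}" if "SLURM_JOB_ID" in os.environ else "main scheduler process"
print(f"[snakefile] parsing started in {_where} at {time.strftime('%H:%M:%S')}", flush=True)

# ... include: statements, target construction, rule definitions ...

# At the very end of the Snakefile: report parse time and peak memory.
_rss_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024  # ru_maxrss is in KiB on Linux
print(f"[snakefile] parsing finished in {_where} after {time.time() - _parse_start:.1f}s, "
      f"max RSS ~{_rss_mib:.0f} MiB", flush=True)
```

Note that this only times the Snakefile evaluation itself; DAG construction happens after the Snakefile has been evaluated, which is consistent with the ~120-second prints not accounting for the 30-minute scheduling delay in the main process.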
Well, regarding the memory footprint: I did a little test (of course, I cannot reproduce your workflows, due to lacking inputs), submitting once with and once without a pandas import: 72 MB and 27 MB, respectively. That is just the difference of the pandas library; of course, since I was not doing anything with pandas, the GC could kick in.

edit: Still, this is about the size of the pandas library. I cannot reproduce a 2 GB footprint. Perhaps you can dig into it?

To see the DAG calculation time within jobs, run with

The sheer amount of data should not play a role, as long as you are not caching (in which case Snakemake will calculate hash sums, and doing that for 20 TB will take time).
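If it helps with digging into the footprint, here is a minimal sketch of the comparison described above, placed at module level in the Snakefile (everything at module level in a Snakefile is plain Python; the function name is a placeholder):

```python
# Rough sketch for isolating the memory cost of a heavy import such as pandas.
# ru_maxrss is the peak resident set size; on Linux it is reported in KiB.
import resource

def max_rss_mib() -> float:
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

rss_before = max_rss_mib()
import pandas as pd  # the import whose footprint is being measured
rss_after = max_rss_mib()
print(f"max RSS before pandas: {rss_before:.0f} MiB, after: {rss_after:.0f} MiB "
      f"(delta ~{rss_after - rss_before:.0f} MiB)", flush=True)
```

A footprint on the order of 2 GB would more likely come from module-level work (reading metadata tables, globbing large directory trees, building large target lists) than from the imports themselves.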
Got it -- I will use those flags to check the overhead, and I will dig into the footprint. Since we spoke, we simplified our Snakefile to do as little computation and data loading as possible to decrease this overhead, and that seems to have helped. One additional question: should specifying a rerun trigger (we are currently using
Regarding the rerun trigger question:

Both effects will be negligible, unless applied a zillion times (my gut feeling). If you run

Now, did I guess correctly? Are you experiencing file system issues (due to too many files in a folder, NFS mounts, or similar)?
Sounds good -- so it should be safe to use those options with minimal slowdown. I am not sure whether file system issues were the trouble. We can split our run into 8 parts in a not-too-complicated fashion, because the data is organized that way. Doing that reduced the job count and made things easier and faster to run, and decreased the overhead, although there is still a bump that depends on what runs in the Snakefile. We will look further with others on our team into whether the filesystem is responsible for slowing down the full run. Thanks for all your help @cmeesters, and we will get back to you soon with further details.
Hello @cmeesters -- one other thing I was thinking is that job grouping may reduce some of this overhead, particularly the DAG creation for each job. Is that accurate? I can't find much documentation on grouping with respect to the SLURM executor. Furthermore, I did try some grouping, but I'm seeing some very odd issues:
The workflow profile in the slurm folder only lists one partition, 20.
This generates an output like the following, which seems to suggest that the grouping is going according to plan, but for some reason the group partition is set to 60:
Is this a bug? Do I need to change the config.yaml to specify a group partition (even though the default partition should scale to everything)? Should grouping improve anything at all?
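For reference, job grouping is configured with the `group` directive on a rule (or with `--groups` on the command line), plus `--group-components` to control how many jobs are bundled into a single submission. A hypothetical sketch with an invented rule name, not taken from the actual workflow:

```python
# Hypothetical example: "process_well" is an invented rule name. All instances
# of this rule that fall into the same group are submitted together as one
# SLURM job instead of one job per rule instance.
rule process_well:
    input:
        "data/{well}.tiff"
    output:
        "results/{well}.csv"
    group:
        "well_batch"
    shell:
        "process_well.py {input} {output}"
```

The number of jobs merged into one group submission is then set at run time, e.g. `--group-components well_batch=50`; groups can also be assigned without editing the Snakefile via `--groups process_well=well_batch`. Grouping reduces how many SLURM jobs are submitted, which is presumably why it can reduce the per-job overhead discussed above: the Snakefile is parsed once per group submission rather than once per rule instance.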
😂 The partitions correspond to the Ubuntu LTS major version the node is based on, which also corresponds to the default set of software (glibc, etc.). They are numeric, e.g. 18, 20, 24, etc.
@andynu & @mat10d the code at no point alters the partition name (except for removing an asterisk, with which SLURM indicates its default partition). So, while the fantasy of fellow admins to label, constrain, and set up "their" cluster never ceases to amaze me, I still need further information to pinpoint the issue(s); that is why I asked for more details.
Software Versions
Bug
When running Snakemake with the SLURM executor, each individual rule job appears to reload the Snakefile and rebuild the entire DAG. For large workflows, this adds significant memory and processing-time overhead to every job: our DAG takes ~109 seconds to build and consumes over 2 GB of memory, and that cost is repeated for each of thousands of jobs.
Bash script run on login node, via tmux session:
This project is based on the Snakemake suggested workflow and consists of a Snakefile (attached as Snakefile_well_debug.txt) that loads rules, builds the targets, and then deploys them.
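For readers without access to the attachment, the described layout roughly corresponds to something like the following sketch (file names, rule names, and the well count are placeholders, not taken from the actual project):

```python
# Placeholder sketch of the described structure: load rules, build the target
# list, and deploy them via "rule all". Everything at module level here runs
# on every parse of the Snakefile, i.e. once in the main scheduler process and
# again inside every submitted SLURM job.
include: "rules/preprocessing.smk"
include: "rules/analysis.smk"

WELLS = [f"well_{i:03d}" for i in range(384)]  # placeholder target set
TARGETS = expand("results/{well}/summary.csv", well=WELLS)

rule all:
    input:
        TARGETS
```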
Expected behavior
We expect the DAG to be built once, in the main Snakemake process (in the tmux session), and individual jobs to only execute their specific rule without rebuilding the entire workflow graph. If that were the case, the print statements we added (to test memory usage) would appear only in the tmux session in which Snakemake is run. This matters especially for large workflows, where DAG building is resource-intensive in terms of both memory and runtime.
What we're seeing
While the Snakefile print statements do appear in the tmux session, they also appear in every .out file generated by SLURM job submission, for example 5580823.txt. Furthermore, the memory usage of each individual job is effectively the sum of what the main Snakefile evaluation consumes (stably ~2 GB) and what the job itself requires (<200 MB). The time it takes to run each rule also appears to increase with the size of our input data (and thereby the DAG).
I can't seem to find any documentation about whether this is expected, or if this is something that can be bypassed with the package. Thanks in advance.