SLURM Priority assigns each job a priority score to decide what gets scheduled next. Priority is based on:
- Fairshare tracks past resource usage per user/project/account. Users who have used fewer resources recently get higher priority.
- Age tracks how long your job’s been waiting. Older jobs (waiting longer) gain priority over time.
- Job size tracks how many nodes/CPUs you’ve requested and/or are currently using. Larger jobs may be prioritized to make efficient use of resources.
- Partition reflects the priority weight assigned to the partition the job was submitted to.
- QoS (Quality of Service) adjusts the base priority of jobs and can control things like:
- Priority boost (Priority)
- Maximum wall time
- Maximum number of jobs per user/account
- Preemption (ability to interrupt lower-QoS jobs)
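To see how these factors combine for your own work, SLURM's sprio utility shows the configured factor weights (sprio -w) and the per-factor breakdown for a pending job (sprio -j <jobid>); the job ID below is just an example:

[user@host ~]$ sprio -w
[user@host ~]$ sprio -j 137945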
“There are two paths people can take. They can either play now and pay later or pay now and play later. Regardless of the choice, one thing is certain. [SLURM] will demand a payment.” — John C. SlurmWall
The wall time is how long your job is allowed to run. It is set with the --time flag on the srun command line or in an sbatch script with #SBATCH --time=HH:MM:SS.
- If your job exceeds its wall time, SLURM terminates it; if your request exceeds the wall-time limits, your job is put in the “PD” (pending) state, oftentimes with an error/reason (e.g., QOSMaxWallDurationPerJobLimit).
- Setting a shorter wall time can help your job start sooner, since the scheduler can fit it into smaller time gaps in the schedule.
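For example, a two-hour limit (the duration here is just a placeholder) can be requested on the command line:

[user@host ~]$ srun --time=02:00:00 --pty bash

or inside a submit script:

#SBATCH --time=02:00:00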
There are many ways to break jobs up so they can run under these limits. Normally it’s a matter of decomposing your workflow as you interact with the cluster and/or adjusting your submission scripts to break jobs into smaller parts. Applying fault-tolerance techniques to your HPC workflows pays off quickly: shorter, restartable jobs schedule sooner and lose far less work when something fails.
Arrays are a powerful concept in both general programming and in job-scheduling systems like SLURM. They provide a mechanism for defining, submitting, and managing collections of similar elements or workflows, offering:
- Efficient storage: instead of creating multiple variables (job1, job2, job3, etc.) you store them in a single variable (job[ ]).
- Iterability: you can easily loop through them using constructs like for or while.
- Scalability: code can adapt to more data without rewriting logic for each new variable.
- Organization: keeps related data grouped logically.
SLURM job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily, and can be:
- Specified using the --array flag from the command line or within a submit script (e.g., --array=1-1000%10, which submits one job with 1000 tasks, 10 running at a time).
- Indexed from a minimum of zero up to a maximum of MaxArraySize minus one:

[user@host ~]$ scontrol show config | grep -i maxarray
MaxArraySize = 2001
[user@host ~]$ sbatch --array=1-2001 brain-stew.sh
sbatch: error: Batch job submission failed: Invalid job array specification
[user@host ~]$ sbatch --array=1-2000 brain-stew.sh
Submitted batch job 137945
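Each task in the array sees its own index in the SLURM_ARRAY_TASK_ID environment variable, which it can use to pick its slice of the work. A minimal sketch (the input-file naming and my-analysis program are illustrative, not from the example scripts referenced here):

#!/bin/bash
#SBATCH --job-name=array-demo
#SBATCH --array=1-1000%10        # 1000 tasks, at most 10 running at once
#SBATCH --time=01:00:00

# Each task processes one input file selected by its array index
# (input-<N>.dat is a hypothetical naming scheme).
INPUT="input-${SLURM_ARRAY_TASK_ID}.dat"
echo "Task ${SLURM_ARRAY_TASK_ID} of array job ${SLURM_ARRAY_JOB_ID} is processing ${INPUT}"
./my-analysis "${INPUT}"         # placeholder for your program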
Checkpointing is a fault-tolerance technique based on Backward Error Recovery (BER), designed to overcome “fail-stop” failures (interruptions during the execution of a job). It makes your jobs more resilient to crashes and hardware failures, and it helps you get past lab-group/partition time limits by replacing a single long job with multiple shorter ones. You can use job arrays to run those shorter jobs in serial: each job writes a checkpoint file, and the following job reads the latest checkpoint to continue from the last state of the calculation. There are a variety of ways to checkpoint your jobs (some applications/modules even have checkpointing as a built-in feature with its own flags/options), but we’ll focus on the two most common methods: checkpointing logic within, or in tandem with, your code/submission scripts, and DMTCP (Distributed MultiThreaded Checkpointing).
Checkpointing from within or in tandem with your code: We’ve modified some existing submission scripts to leverage checkpointing, provided examples of data-science/bioinformatics software (TensorFlow/Keras/NumPy) that can checkpoint from within its own code, and written some new scripts as examples. Some of these involve resource requests that would be inhibited or outright prohibited by common SLURM scheduling limits if submitted without checkpointing in place. These scripts only outline basic principles, in hopes that they give you a skeleton to build your own code around.
- purple-rain.sbatch runs a large-scale array job (2000 tasks) where each task performs iterative work with checkpointing. This tests SLURM’s MaxArraySize and uses resubmission logic.
- purple-rain-requeue.sbatch is the same script as above, but it uses SLURM’s built-in requeue support instead of resubmission (which involves manually calling sbatch each time). At the end of your script, SLURM puts the same job back into the queue, preserving its array index. This method is more efficient because:
  - There are fewer jobs in the scheduler. You don’t flood the queue with brand-new jobs; SLURM just bumps your existing one to the back of the queue,
  - You keep the same JobID and your array context is preserved, and lastly
  - Fault recovery is built in. You can also couple this with different flavors of checkpointing or module-level preemption handling.
To allow your job to be requeued on failure or on request, the #SBATCH --requeue flag must be declared in your script (a minimal sketch of this pattern follows the configuration output below). Requeueing is enabled on both clusters:
[user@host ~]$ scontrol show config | grep -i requeue
JobRequeue = 1
MaxBatchRequeue = 5
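A minimal sketch of the requeue pattern described above, assuming a hypothetical per-task checkpoint file (state-<index>.chk), a placeholder worker program (my-worker) that can resume from it, and a done-<index> marker file it writes when finished; this is an outline, not the contents of purple-rain-requeue.sbatch:

#!/bin/bash
#SBATCH --job-name=requeue-demo
#SBATCH --array=1-2000%10
#SBATCH --time=01:00:00
#SBATCH --requeue                 # allow SLURM to put this job back in the queue

CKPT="state-${SLURM_ARRAY_TASK_ID}.chk"    # hypothetical per-task checkpoint file
DONE="done-${SLURM_ARRAY_TASK_ID}"         # hypothetical "finished" marker

# Resume from the checkpoint if one exists, otherwise start fresh.
if [[ -f "${CKPT}" ]]; then
    ./my-worker --resume "${CKPT}"
else
    ./my-worker --start
fi

# If the work isn't finished yet, ask SLURM to requeue this same task
# (same JobID, same array index) instead of submitting a brand-new job.
if [[ ! -f "${DONE}" ]]; then
    scontrol requeue "${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
fi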
- white-rabbit.sbatch sets up time-based checkpoints at a defined interval. It is purposely designed to request a longer running duration than any partition allows, and it checkpoints/requeues every 6 days to get past the 7-day default lab-group limit. This is intended to show how checkpointing and preemptive requeue logic let a user request as much time as their job(s) actually need. It prints the current system date/time every 60 seconds, which exercises the I/O path to the log file and also verifies that SLURM’s log stream stays accurate over extended time periods (a condensed sketch of the whole chain follows this list).
  - Inside each branch:

if (( $(date +%s) - $START_TIME >= $CHECKPOINT_INTERVAL )); then
    sbatch --time=6-00:00:00 --requeue
fi

  - Once 6 days have elapsed (i.e., hitting the wall-time limit), the script submits another copy of itself
    - With the same time limit and --requeue flag, as well as verifies
    - That the new job will start fresh (with a new START_TIME), continuing the chain, which
  - Effectively “chains” five 6-day jobs together to cover roughly a month (an arbitrary span) without ever hitting the 7-day default lab-group limit.
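The condensed sketch below shows how such a chain might be capped at five links; the CHAIN_COUNT variable and the way it is passed between submissions are assumptions for illustration, while the script name white-rabbit.sbatch is from above:

#!/bin/bash
#SBATCH --time=6-00:00:00
#SBATCH --requeue

START_TIME=$(date +%s)
CHECKPOINT_INTERVAL=$(( 6 * 24 * 3600 ))   # six days, in seconds
CHAIN_COUNT=${CHAIN_COUNT:-1}              # hypothetical counter; 1 on the first submission

while true; do
    date                                   # heartbeat written to the job log every minute
    sleep 60
    if (( $(date +%s) - START_TIME >= CHECKPOINT_INTERVAL )); then
        if (( CHAIN_COUNT < 5 )); then
            # Submit the next 6-day link in the chain before this one ends.
            sbatch --time=6-00:00:00 --requeue \
                   --export=ALL,CHAIN_COUNT=$(( CHAIN_COUNT + 1 )) white-rabbit.sbatch
        fi
        break
    fi
done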
- globus-pocus.sbatch uses a pattern called "self-requeuing" to create an "infinite" job despite enforced job time limits. It keeps a Globus job running "forever" by catching a warning signal 10 minutes before the job limit, gracefully saving state, and requeuing itself automatically, by:
  - Setting up a trap with:

trap_handler() { ... }
trap trap_handler USR1

  - This means: when SIGUSR1 is received, the job runs trap_handler instead of just dying.
  - trap_handler:
    - Prints a message: "Caught SIGUSR1". Optionally, save checkpoint state (you can add your own commands here), then
    - Most importantly, it runs: scontrol requeue $SLURM_JOB_ID
    - Which asks SLURM to immediately requeue the job (then exits cleanly).
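A minimal, self-contained sketch of this trap-and-requeue pattern, assuming the 10-minute warning is requested with --signal=B:USR1@600 and that long-running-command, its resource requests, and the checkpoint step are placeholders for your own workload:

#!/bin/bash
#SBATCH --time=08:00:00
#SBATCH --requeue
#SBATCH --signal=B:USR1@600     # deliver SIGUSR1 to the batch shell 600 s before the limit

trap_handler() {
    echo "Caught SIGUSR1"
    # Optional: save checkpoint state here before requeueing (application-specific).
    scontrol requeue "${SLURM_JOB_ID}"
    exit 0
}
trap trap_handler USR1

# Run the workload in the background and wait on it, so the trap can fire
# as soon as the signal arrives (a foreground command would delay handling).
./long-running-command &        # placeholder for the real work, e.g. a Globus transfer
wait $!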
- /vectors (contains 2 scripts that are intended to work in tandem and should reside in the same directory)
  - numpy-knuckles.sbatch (a sketch of a submit script along these lines follows this list):
    - Loads the Miniconda3 module,
    - Activates a NumPy (v2.2.3) conda environment, and
    - Submits numpy-vectors.py, a Python script that:
      - Loads a dataset and defines a simple neural network model,
      - Sets up a checkpoint system in which the Python code itself manages saving/restoring progress (less reliant on the system), then
      - Trains the model with the checkpoint callback active, so that
      - When you checkpoint and restore the model, you are effectively saving and restoring the trained arrays of neural-network weights, which means
      - The next time you submit, the script detects the checkpoint, loads the saved model, and continues training.
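A sketch of what such a submit script might look like; the conda environment name (numpy-env) and the resource requests are assumptions, while the module name and Python script name come from the description above:

#!/bin/bash
#SBATCH --job-name=numpy-ckpt
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

module load Miniconda3          # module name taken from the description above
source activate numpy-env       # hypothetical name for the NumPy 2.2.3 environment

# numpy-vectors.py detects its own checkpoint file and resumes from it,
# so resubmitting this same script simply continues the training run.
python numpy-vectors.py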
- /tensors (contains 3 scripts that are intended to work in tandem and should reside in the same directory)
  - kerasell.sbatch:
    - Loads the Miniconda3 module,
    - Activates a TensorFlow-GPU (v2.18.0) conda environment, and
    - Submits sensorflow-long.py (a production-style, resumable training loop) or sensorflow-short.py (a standalone illustration), Python TensorFlow scripts that:
      - Load and slice datasets for simple feed-forward networks, and
      - Set up checkpointed callback mechanisms that manage saving/loading of model state via Keras callbacks and load_weights logic.
We’ve covered Application-Level Checkpointing (handled in the application code itself) and checkpointing via BASH modifications to your submit scripts. Last, we’ll look at how to set up your own checkpointing mechanism with a separate piece of software (DMTCP).
DMTCP (Distributed MultiThreaded Checkpointing) transparently checkpoints a single-host or distributed computation to disk, in user space. It requires no modifications to the Linux kernel or to application binaries, and it can be used by unprivileged users (no root privilege needed). The newest version is installed as a module on BigSky and Skyline (currently dmtcp/3.2.0-qr37nrg). You can later restart from a checkpoint, or even migrate processes by moving the checkpoint files to another host prior to restarting.
- /DMTCP (contains 4 files that are intended to work in tandem and should reside in the same directory, including a test set of data (array/C++ code) and a pre-written submit script with an included wrapper script)
  - dmtcp-array.sbatch:
    - Loads the dmtcp module,
    - Starts a DMTCP coordinator process (a small background service) listening on TCP port 7779,
    - Runs the example.cpp code on example_array, and
    - Monitors the program, saving periodic checkpoint images of the process, and exits cleanly when finished.
  - mumble-wrapper.sh is a wrapper script that can be used to submit 10,000 tasks in chunks of 2000 each, with 10 running at a time (a rough sketch of this chunking idea follows below).
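One way such a wrapper might chunk the submissions to stay under MaxArraySize; the OFFSET variable and the way dmtcp-array.sbatch would consume it are assumptions for illustration, not the actual contents of mumble-wrapper.sh:

#!/bin/bash
# Submit 10,000 tasks as five 2000-task array jobs, with 10 tasks running at a time.
TOTAL=10000
CHUNK=2000

for (( start=1; start<=TOTAL; start+=CHUNK )); do
    # Each chunk reuses indices 1-2000 (safely below MaxArraySize = 2001);
    # OFFSET would let each task recover its "global" number as
    #   global_id = OFFSET + SLURM_ARRAY_TASK_ID   (hypothetical convention)
    sbatch --array=1-${CHUNK}%10 \
           --export=ALL,OFFSET=$(( start - 1 )) \
           dmtcp-array.sbatch
done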
It’s worth noting that, by default, DMTCP uses gzip to compress the checkpoint images. This can be turned off (dmtcp_launch --no-gzip, or by setting the environment variable DMTCP_GZIP=0). Doing so is faster, and if your memory is dominated by incompressible data it can be a real help: gzip can add seconds for large checkpoint images, whereas checkpoints and restarts typically take less than one second without it.
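For example (the application name a.out is a placeholder), either of these disables compression:

[user@host ~]$ dmtcp_launch --no-gzip ./a.out
[user@host ~]$ DMTCP_GZIP=0 dmtcp_launch ./a.out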
The general workflow consists of 2 main components, the Coordinator and the Launcher:
- SSH to a submit node and use srun to submit a SLURM job, specifically to the host/location where you want to run your program (optional: start a new tmux session here if you’d like, e.g., $ tmux new -s dmtcp-coordinator):

[user@host ~]$ srun -p gpu -w host -c 16 --mem 4G --pty bash
- Load the DMTCP module and launch a DMTCP coordinator (the default behavior is manual checkpointing using the [c] keystroke; however, you can specify an automatic checkpoint interval using the -i <time_in_seconds> flag):

[user@host ~]$ ml dmtcp && dmtcp_launch --version
dmtcp_launch (DMTCP) 3.2.0
[user@host ~]$ dmtcp_coordinator
dmtcp_coordinator starting...
    Host: host.domain.foo.bar (10.140.218.143)
    Port: 7779
    Checkpoint Interval: disabled (checkpoint manually instead)
    Exit on last client: 0
Type '?' for help.

[2025-04-23T13:55:12.667, 125367, 125367, Note] at coordinatorplugin.h:205 in tick; REASON='No active clients; starting stale timeout
    theStaleTimeout = 28800
dmtcp>
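If you prefer automatic checkpoints, pass the interval when starting the coordinator; the value below (one checkpoint per hour) is just an example:

[user@host ~]$ dmtcp_coordinator -i 3600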
- SSH to a submit node and use srun to submit a SLURM job specifically to the same host/location where you started your coordinator:
[user@host ~]$ srun -p gpu -w host -c 32 --mem 8G --pty bash
- Load the DMTCP module and launch your application using DMTCP (you must use the --join-coordinator flag to connect to your coordinator instance); the general form is dmtcp_launch <flags> <command>:

[user@host ~]$ ml dmtcp && dmtcp_launch --version
[user@host ~]$ dmtcp_launch --join-coordinator ./p-dd.sh
Running: dd iflag=fullblock if=/dev/zero of=/data/scratch/user/1-1G bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.525024 s, 2.0 GB/s
- On your coordinator window, you can begin to track the process as it connects and exchanges process information after execution of various components (press [s] then [enter] for a status):

[2025-04-23T14:13:34.398, 126044, 126044, Note] at dmtcp_coordinator.cpp:919 in onConnect; REASON='worker connected
    hello_remote.from = 40f5a9a622088167-64000-10deccda2e4203
    client->progname() = bash_(forked)
[2025-04-23T14:13:34.398, 126044, 126044, Note] at dmtcp_coordinator.cpp:680 in onData; REASON='Updating process Information after fork()
    client->hostname() = host.domain.foo.bar
    client->progname() = bash_(forked)
    msg.from = 40f5a9a622088167-65000-10deccda34fc14
    client->identity() = 40f5a9a622088167-40000-10dec20e3d5513
[2025-04-23T14:13:34.399, 126044, 126044, Note] at dmtcp_coordinator.cpp:680 in onData; REASON='Updating process Information after fork()
    client->hostname() = host.domain.foo.bar
    client->progname() = bash_(forked)
    msg.from = 40f5a9a622088167-66000-10deccda3eb86c
    client->identity() = 40f5a9a622088167-64000-10deccda2e4203
[2025-04-23T14:13:34.439, 126044, 126044, Note] at dmtcp_coordinator.cpp:689 in onData; REASON='Updating process Information after exec()
    progname = dd
    msg.from = 40f5a9a622088167-66000-10deccda3eb86c
    client->identity() = 40f5a9a622088167-66000-10deccda3eb86c
[2025-04-23T14:13:34.439, 126044, 126044, Note] at dmtcp_coordinator.cpp:689 in onData; REASON='Updating process Information after exec()
    progname = tee
    msg.from = 40f5a9a622088167-65000-10deccda34fc14
    client->identity() = 40f5a9a622088167-65000-10deccda34fc14
dmtcp> s
Status...
Host: host.domain.foo.bar (10.140.218.143)
Port: 7779
Checkpoint Interval: disabled (checkpoint manually instead)
Exit on last client: 0
Kill after checkpoint: 0
Computation Id: 40f5a9a622088167-40000-10e45ea0c96acd
Checkpoint Dir: /data/home/user
NUM_PEERS=4
RUNNING=yes
- You can manually checkpoint your running job at any time by pressing the [c] key; the most important part of the output is the path to the restart script. A new restart script is written after each successful checkpoint file is created:

dmtcp> c
[2025-04-23T14:15:52.263, 126044, 126044, Note] at dmtcp_coordinator.cpp:1164 in startCheckpoint; REASON='starting checkpoint; incrementing generation; suspending all nodes
    s.numPeers = 4
    compId.computationGeneration() = 1
[2025-04-23T14:15:53.086, 126044, 126044, Note] at dmtcp_coordinator.cpp:496 in releaseBarrier; REASON='Checkpoint complete; all workers running
[2025-04-23T14:15:53.174, 126044, 126044, Note] at dmtcp_coordinator.cpp:559 in recordCkptFilename; REASON='Checkpoint complete. Wrote restart script
    restartScriptPath = /data/home/user/dmtcp_restart_script_40f5a9a622088167-40000-10dec2040da13b.sh
If your job is cancelled or terminated (unexpectedly or on purpose), or your SSH session breaks or goes stale, you can srun back to your launcher instance and run the generated restart script directly to restart your job from the checkpoint (shown below).
DMTCP coordinator output showing the process being forcibly cancelled, the clients disconnecting, and the stale-timeout timer starting:
[2025-04-23T14:18:07.234, 126044, 126044, Note] at dmtcp_coordinator.cpp:689 in onData; REASON='Updating process Information after exec()
progname = dd
msg.from = 40f5a9a622088167-87000-10df0c5e0f9761
client->identity() = 40f5a9a622088167-87000-10df0c5e0f9761
[2025-04-23T14:19:29.014, 126044, 126044, Note] at dmtcp_coordinator.cpp:775 in onDisconnect; REASON='client disconnected
client->identity() = 40f5a9a622088167-86000-10df0c5e08ffc1
client->progname() = tee
[2025-04-23T14:19:54.078, 126044, 126044, Note] at dmtcp_coordinator.cpp:775 in onDisconnect; REASON='client disconnected
client->identity() = 40f5a9a622088167-87000-10df0c5e0f9761
client->progname() = dd
[2025-04-23T14:19:54.701, 126044, 126044, Note] at dmtcp_coordinator.cpp:775 in onDisconnect; REASON='client disconnected
client->identity() = 40f5a9a622088167-85000-10df0c5dfedd8a
client->progname() = bash_(forked)
[2025-04-23T14:19:55.772, 126044, 126044, Note] at dmtcp_coordinator.cpp:775 in onDisconnect; REASON='client disconnected
client->identity() = 40f5a9a622088167-40000-10dec20e3d5513
client->progname() = bash
[2025-04-23T14:19:56.775, 126044, 126044, Note] at coordinatorplugin.h:205 in tick; REASON='No active clients; starting stale timeout
theStaleTimeout = 28800
Showing the job being restarted by invoking our DMTCP restart script (./dmtcp_restart_script_ckptID.sh):
[user@host ~]$ /data/home/user/dmtcp_restart_script_40f5a9a622088167-40000-10dec2040da13b.sh
. . .
1000+0 records in
1000+0 records out
1073741824000 bytes (1.1 TB, 1000 GiB) copied, 269.9 s, 3.8 GB/s
. . .
Coordinator output showing the computation being reset and the clients reconnecting:
[2025-04-23T14:20:43.980, 126044, 126044, Note] at dmtcp_coordinator.cpp:811 in initializeComputation; REASON='Resetting computation
[2025-04-23T14:20:43.981, 126044, 126044, Note] at dmtcp_coordinator.cpp:995 in validateRestartingWorkerProcess; REASON='FIRST restart connection. Set numRestartPeers. Generate timestamp
numRestartPeers = 4
curTimeStamp = 4749000625441924
compId = 40f5a9a622088167-40000-10dec2040da13b
[2025-04-23T14:20:43.981, 126044, 126044, Note] at dmtcp_coordinator.cpp:919 in onConnect; REASON='worker connected
hello_remote.from = 40f5a9a622088167-40000-10dec20e3d5513
client->progname() = bash
[2025-04-23T14:20:43.986, 126044, 126044, Note] at dmtcp_coordinator.cpp:919 in onConnect; REASON='worker connected
hello_remote.from = 40f5a9a622088167-64000-10deccda2e4203
client->progname() = bash
[2025-04-23T14:20:43.986, 126044, 126044, Note] at dmtcp_coordinator.cpp:919 in onConnect; REASON='worker connected
hello_remote.from = 40f5a9a622088167-65000-10deccda34fc14
client->progname() = tee
[2025-04-23T14:20:43.989, 126044, 126044, Note] at dmtcp_coordinator.cpp:919 in onConnect; REASON='worker connected
hello_remote.from = 40f5a9a622088167-66000-10deccda3eb86c
client->progname() = dd