Slurm and parallel computing profiling #210
Let's try setting explicit CPU and memory limits and see if the scheduling problems persist. Set up an opts section along these lines:

opts:
  backend:
    singularity:
      auto_update: false
    slurm:
      enable: true
      srun_opts:
        time: 0
        cpus-per-task: 4   # make default conservative -- give more to specific cabs below

ratt-parrot:
  assign:
    dirs:
      temp: /scratch3/users/{config.run.env.USER}/tmp
    ncpu: 4                # default setting unless overridden by cab
  steps:
    upsample-2:
      params:
        num-threads: 32
    upsample-3:
      params:
        num-threads: 32

cabs:
  quartical:
    backend:
      slurm:
        srun_opts:
          mem: 128GB
          cpus-per-task: 32
  wsclean:
    backend:
      slurm:
        srun_opts:
          mem: 240GB
          cpus-per-task: 32
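The srun_opts keys presumably get passed through as srun command-line flags (--time, --cpus-per-task, --mem). A minimal sketch of that correspondence, for illustration only and not taken from the stimela source:

```python
# Illustration only (not the actual stimela backend code): how key/value pairs
# under srun_opts would map onto srun command-line options.
def srun_flags(srun_opts: dict) -> list[str]:
    return [f"--{key}={value}" for key, value in srun_opts.items()]

print(srun_flags({"mem": "128GB", "cpus-per-task": 32}))
# ['--mem=128GB', '--cpus-per-task=32']
```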
@Athanaseus, heads up:
Searching around, it seems that these are likely due to a memory limit not being specified (see https://stackoverflow.com/questions/35498763/parallel-but-different-slurm-srun-job-step-invocations-not-working). @Athanaseus, have you seen them since adding the memory settings?
I've added a feature to the slurm backend to check that memory settings are specified. See #208.
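For context, a rough sketch of what such a check could do (the function name and logic here are illustrative, not the code referenced in #208):

```python
# Illustrative sketch only (not the code in #208): refuse to submit a Slurm
# job step unless an explicit memory request is present in srun_opts.
def check_memory_specified(srun_opts: dict) -> None:
    if not any(key in srun_opts for key in ("mem", "mem-per-cpu", "mem-per-gpu")):
        raise ValueError(
            "no memory limit set for Slurm job step; "
            "add 'mem' or 'mem-per-cpu' under backend.slurm.srun_opts"
        )

check_memory_specified({"mem": "128GB", "cpus-per-task": 32})  # passes silently
```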
No, haven't spotted them. I only experience a longer queue wait when I set the larger memory request (mem: 240GB).
Got it.
By the way, the logs have a double extension (.txt.txt).
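For illustration, assuming the duplicate suffix comes from appending .txt to a log name that already ends in .txt, a guard along these lines would avoid it (a sketch, not stimela's actual log-naming code):

```python
# Sketch only: append the log extension at most once, even if the configured
# name already ends in ".txt".
def log_filename(base: str, ext: str = ".txt") -> str:
    return base if base.endswith(ext) else base + ext

print(log_filename("log-loop-pointings.mgpls-reductions-ms0.txt"))
# -> log-loop-pointings.mgpls-reductions-ms0.txt  (no second ".txt")
```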
CASE3
Here 2 of the 3 nodes succeeded to the end of the reductions.
The profiling:
This was running on the Pleiadi cluster. After the new
Yeah that figures. 240GB is pretty much a full ilifu fat node. So it needs to wait for an entire node to be available. Whereas with a lower memory request, it can squeeze you onto a partially used node.
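As an aside, one way to check what the nodes actually offer before picking a mem value (a sketch using sinfo's node-oriented output, where %N is the node name, %m its memory in MB, and %C its allocated/idle/other/total CPUs):

```python
import subprocess

# Sketch: list each node's advertised memory and CPU availability so a request
# like mem: 240GB can be sanity-checked against real node sizes.
out = subprocess.run(
    ["sinfo", "-N", "-h", "-o", "%N %m %C"],
    capture_output=True, text=True, check=True,
).stdout
for line in out.splitlines():
    node, mem_mb, cpus = line.split()
    print(f"{node}: {int(mem_mb.rstrip('+')) // 1024} GB RAM, CPUs A/I/O/T = {cpus}")
```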
Did the same step run with another MS on another node? If yes, then it's a problem with their filesystem (the image reads fine on some nodes but not on others); report it to the sysadmin on the chat. If no, maybe the casa image is bad; check/rebuild it.
Yes, the other MSs on other nodes finished successfully and even have profiling stats, while the other one just crashed. Here is another related one after re-running (different MS):
Might be the same problem... again an I/O error on the filesystem...
The error seems to go away after the new setup. I'll keep an eye on it.
This issue was moved to a discussion.
Hi. We are currently testing our stimela pipeline using Slurm on the ilifu cluster (and the Pleiadi cluster, on which the pipeline is currently running with no issues). Here, I will post some feedback while running on the ilifu cluster.
Three single-scan MS files with different pointings are used for testing.
CASE1
In this run, 1 of the 3 imaging steps passed, but then failed in the next step, which I'm currently investigating. Also, the log of the step that succeeded seems to be overwritten by the next step (we see this on the other cluster as well).
Here are parts of the logs:
logfile name: log-loop-pointings.mgpls-reductions-ms0.txt.txt
logfile name: log-loop-pointings.mgpls-reductions-ms2.txt.txt
This is the imaging step that succeeded, but we only get the log of the next step.
logfile name: log-loop-pointings.mgpls-reductions-ms1.txt.txt
Furthermore, looking at the log directory structure, for some reason, the profiling files were not dumped.
CASE2
Here we re-ran the pipeline with only the imaging steps, leaving out the MS file that succeeded in the initial run.
NB: It is also noticeable that srun didn't have to wait long until resources were allocated, unlike in the runs above.
Both loops failed with an out-of-memory error.
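To confirm the OOM kills and see the peak memory the failed steps actually used, Slurm's accounting can be queried; a sketch with a placeholder job ID:

```python
import subprocess

# Sketch (the job ID below is a placeholder): ask Slurm accounting for the
# state and peak memory of a failed step; a State of OUT_OF_MEMORY confirms
# that the step was killed for exceeding its memory request.
jobid = "123456"  # hypothetical job ID of the failed step
out = subprocess.run(
    ["sacct", "-j", jobid, "--parsable2",
     "-o", "JobID,State,ReqMem,MaxRSS,Elapsed"],
    capture_output=True, text=True, check=True,
).stdout
print(out)
```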
Here, the log directory structure includes the profiling files.
The resulting profiling does not present the total averaged time.
NB: