Slurm and parallel computing profiling #210

Closed
Athanaseus opened this issue Feb 1, 2024 · 10 comments

@Athanaseus
Contributor

Athanaseus commented Feb 1, 2024

Hi. We are currently testing our stimela pipeline with Slurm on the ilifu cluster (and on the Pleiadi cluster, where the pipeline currently runs without issues).
I will post some feedback here from the ilifu runs.
Three single-scan MS files with different pointings are used for testing.

CASE1

In this run, the imaging step passed for 1 of the 3 MSs, but that iteration then failed at the next step, which I'm currently investigating. Also, the log of the step that succeeded seems to be overwritten by the next step (we see this on the other cluster as well).

Here are excerpts from the logs:

logfile name: log-loop-pointings.mgpls-reductions-ms0.txt.txt

2024-01-31 23:33:58 STIMELA.loop-pointings.mgpls-reductions.image-pol-00 INFO: singularity image /scratch3/users/athanaseus/images/quay.io-stimela2-wsclean:cc0.0.2.simg exists
### running /opt/slurm/bin/srun -J loop-pointings.mgpls-reductions.image-pol-00 /software/common/singularity/3.11.3/bin/singularity exec --containall --pwd /scratch3/users/athanaseus --bind /scratch3/users/athanaseus:/scratch3/users/athanaseus:rw /scratch3/users/athanaseus/images/quay.io-stimela2-wsclean:cc0.0.2.simg wsclean -name mgpls-output-1/mgpls-imaging-00/T16R02C02-corr-scan3-chan512-t16-XXYY -data-column DATA -channels-out 4 -deconvolution-channels 4 -pol I -j 32 -weight briggs -0.3 -size 5000 5000 -multiscale -multiscale-scales 0,6,12,18,24,28,32,64,128,256,300,340,454,512 -multiscale-scale-bias 0.5 -scale 1.5asec -niter 100000 -nmiter 4 -threshold 3.5e-05 -auto-mask 12.0 -gain 0.1 -mgain 0.8 -join-channels -fit-spectral-pol 4 -padding 1.2 -nwlayers-factor 3.0 -parallel-gridding 72 -use-wgridder -log-time /scratch3/users/athanaseus/msdir/T16R02C02-corr-scan3-chan512-t16-XXYY.ms
# srun: Job 8414557 step creation temporarily disabled, retrying (Requested nodes are busy)
# srun: Job 8414557 step creation still disabled, retrying (Requested nodes are busy)
# [the above srun retry message repeated a further 17 times]
# srun: error: Unable to create step for job 8414557: Job/step already completing or completed
2024-02-01 01:16:55 STIMELA.loop-pointings.mgpls-reductions.image-pol-00 ERROR: step 'loop-pointings.mgpls-reductions.image-pol-00' has failed, aborting the recipe

logfile name: log-loop-pointings.mgpls-reductions-ms2.txt.txt

2024-01-31 23:33:43 STIMELA.loop-pointings.mgpls-reductions.image-pol-00 INFO: singularity image /scratch3/users/athanaseus/images/quay.io-stimela2-wsclean:cc0.0.2.simg exists
### running /opt/slurm/bin/srun -J loop-pointings.mgpls-reductions.image-pol-00 /software/common/singularity/3.11.3/bin/singularity exec --containall --pwd /scratch3/users/athanaseus --bind /scratch3/users/athanaseus:/scratch3/users/athanaseus:rw /scratch3/users/athanaseus/images/quay.io-stimela2-wsclean:cc0.0.2.simg wsclean -name mgpls-output-1/mgpls-imaging-00/T16R02C06-corr-scan5-chan512-t16-XXYY -data-column DATA -channels-out 4 -deconvolution-channels 4 -pol I -j 32 -weight briggs -0.3 -size 5000 5000 -multiscale -multiscale-scales 0,6,12,18,24,28,32,64,128,256,300,340,454,512 -multiscale-scale-bias 0.5 -scale 1.5asec -niter 100000 -nmiter 4 -threshold 3.5e-05 -auto-mask 12.0 -gain 0.1 -mgain 0.8 -join-channels -fit-spectral-pol 4 -padding 1.2 -nwlayers-factor 3.0 -parallel-gridding 72 -use-wgridder -log-time /scratch3/users/athanaseus/msdir/T16R02C06-corr-scan5-chan512-t16-XXYY.ms
# srun: Job 8414557 step creation still disabled, retrying (Requested nodes are busy)
# [the above srun retry message repeated a further 18 times]
# srun: Step created for StepId=8414557.13
# !!! WARNING: Parameter '-use-wgridder' is deprecated and will be removed in a future version of WSClean.
# !!!          Use parameter '-gridder' instead.
# 2024-Feb-01 01:13:44.097194
# 2024-Feb-01 01:13:44.097212 WSClean version 3.3 (2023-03-31)
# 2024-Feb-01 01:13:44.097214 This software package is released under the GPL version 3.
# 2024-Feb-01 01:13:44.097216 Author: André Offringa (offringa@gmail.com).
# 2024-Feb-01 01:13:44.097218
# 2024-Feb-01 01:13:44.806872 === IMAGING TABLE ===
# 2024-Feb-01 01:13:44.806892        # Pol Ch JG ²G FG FI In Freq(MHz)
# 2024-Feb-01 01:13:44.806897 | Independent group:
# 2024-Feb-01 01:13:44.806906 +-+-J- 0  I   0  0  0  0  0  0  856-1070 (1024)
# 2024-Feb-01 01:13:44.806950   +-J- 1  I   1  0  1  1  0  0  1070-1284 (1024)
# 2024-Feb-01 01:13:44.806960   +-J- 2  I   2  0  2  2  0  0  1284-1498 (1024)
# 2024-Feb-01 01:13:44.806969   +-J- 3  I   3  0  3  3  0  0  1498-1712 (1024)
.
.
.
# 2024-Feb-01 01:14:25.839509 Writing dirty image...
# 2024-Feb-01 01:14:26.384564 Writing dirty image...
# 2024-Feb-01 01:14:26.917101 Writing dirty image...
# 2024-Feb-01 01:14:27.494332 Writing dirty image...
# srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
# slurmstepd: error: *** STEP 8414557.13 ON compute-202 CANCELLED AT 2024-02-01T01:16:55 DUE TO TIME LIMIT ***

This is the MS whose imaging step succeeded, yet we only get the log of the next step.

logfile name: log-loop-pointings.mgpls-reductions-ms1.txt.txt

2024-02-01 01:13:41 STIMELA.loop-pointings.mgpls-reductions.quality-assess-00 INFO: ### validated inputs
2024-02-01 01:13:41 STIMELA.loop-pointings.mgpls-reductions.quality-assess-00 INFO: cab aimfast:
2024-02-01 01:13:41 STIMELA.loop-pointings.mgpls-reductions.quality-assess-00 INFO:   residual_image = mgpls-output-1/mgpls-imaging-00/T16R02C04-corr-scan4-chan512-t16-XXYY-MFS-residual.fits
2024-02-01 01:13:41 STIMELA.loop-pointings.mgpls-reductions.quality-assess-00 INFO:   all_source = False
2024-02-01 01:13:41 STIMELA.loop-pointings.mgpls-reductions.quality-assess-00 INFO:   outfile = mgpls-output-1/mgpls-imaging-00/T16R02C04-corr-scan4-chan512-t16-XXYYim00.json
2024-02-01 01:13:41 STIMELA.loop-pointings.mgpls-reductions.quality-assess-00 INFO: singularity image /scratch3/users/athanaseus/images/quay.io-stimela2-aimfast:cc0.0.2.simg exists
### running /opt/slurm/bin/srun -J loop-pointings.mgpls-reductions.quality-assess-00 /software/common/singularity/3.11.3/bin/singularity exec --containall --pwd /scratch3/users/athanaseus --bind /scratch3/users/athanaseus:/scratch3/users/athanaseus:rw /scratch3/users/athanaseus/images/quay.io-stimela2-aimfast:cc0.0.2.simg aimfast --residual-image mgpls-output-1/mgpls-imaging-00/T16R02C04-corr-scan4-chan512-t16-XXYY-MFS-residual.fits --outfile mgpls-output-1/mgpls-imaging-00/T16R02C04-corr-scan4-chan512-t16-XXYYim00.json
# srun: error: Unable to create step for job 8414557: Job/step already completing or completed

Furthermore, looking at the log directory contents, the profiling files were for some reason not dumped:

$ ls -lrt
total 30
449 Jan 31 23:17 log-loop-pointings.mgpls-reductions.txt
6279 Jan 31 23:17 stimela.recipe.deps
1479 Jan 31 23:17 log-loop-pointings.txt
12977 Feb  1 01:16 log-loop-pointings.mgpls-reductions-ms2.txt.txt
5972 Feb  1 01:16 log-loop-pointings.mgpls-reductions-ms0.txt.txt
1649 Feb  1 01:16 log-loop-pointings.mgpls-reductions-ms1.txt.txt

CASE2

Here we re-ran the pipeline with only the imaging steps, excluding the MS file that succeeded in the initial run.

NB: It is also notable that srun did not have to wait long for resources to be allocated, unlike in the runs above.

2024-02-01 07:55:35 STIMELA.loop-pointings.mgpls-reductions.image-pol-00 INFO: singularity image /scratch3/users/athanaseus/images/quay.io-stimela2-wsclean:cc0.0.2.simg exists
### running /opt/slurm/bin/srun -J loop-pointings.mgpls-reductions.image-pol-00 /software/common/singularity/3.11.3/bin/singularity exec --containall --pwd /scratch3/users/athanaseus --bind /scratch3/users/athanaseus:/scratch3/users/athanaseus:rw /scratch3/users/athanaseus/images/quay.io-stimela2-wsclean:cc0.0.2.simg wsclean -name mgpls-output-1/mgpls-imaging-00/T16R02C02-corr-scan3-chan512-t16-XXYY -data-column DATA -channels-out 4 -deconvolution-channels 4 -pol I -j 32 -weight briggs -0.3 -size 5000 5000 -multiscale -multiscale-scales 0,6,12,18,24,28,32,64,128,256,300,340,454,512 -multiscale-scale-bias 0.5 -scale 1.5asec -niter 100000 -nmiter 4 -threshold 3.5e-05 -auto-mask 12.0 -gain 0.1 -mgain 0.8 -join-channels -fit-spectral-pol 4 -padding 1.2 -nwlayers-factor 3.0 -parallel-gridding 72 -use-wgridder -log-time /scratch3/users/athanaseus/msdir/T16R02C02-corr-scan3-chan512-t16-XXYY.ms
# srun: job 8414665 queued and waiting for resources
# srun: job 8414665 has been allocated resources

Both loop iterations failed with an out-of-memory error.

# 2024-Feb-01 07:56:51.337033 Writing dirty image...
# slurmstepd: error: Detected 1 oom_kill event in StepId=8414666.0. Some of the step tasks have been OOM Killed.
# srun: error: compute-202: task 0: Out Of Memory
# srun: Terminating StepId=8414666.0
2024-02-01 07:56:59 STIMELA.loop-pointings.mgpls-reductions.image-pol-00 ERROR: step 'loop-pointings.mgpls-reductions.image-pol-00' has failed, aborting the recipe
─────────────────────────────────────────────────────────────────────────────── detailed error report follows ───────────────────────────────────────────────────────────────────────────────        ⚠ step 'loop-pointings.mgpls-reductions.image-pol-00' has failed, aborting the recipe
        └── error running cab 'wsclean'
            └── wsclean returns error code 1 after 0:01:23
# slurmstepd: error: Detected 1 oom_kill event in StepId=8414665.0. Some of the step tasks have been OOM Killed.
# srun: error: compute-201: task 0: Out Of Memory
# srun: Terminating StepId=8414665.0
2024-02-01 07:57:04 STIMELA.loop-pointings.mgpls-reductions.image-pol-00 ERROR: step 'loop-pointings.mgpls-reductions.image-pol-00' has failed, aborting the recipe
─────────────────────────────────────────────────────────────────────────────── detailed error report follows ───────────────────────────────────────────────────────────────────────────────        ⚠ step 'loop-pointings.mgpls-reductions.image-pol-00' has failed, aborting the recipe
        └── error running cab 'wsclean'
            └── wsclean returns error code 1 after 0:01:28

Here, the log directory does include the profiling files:

$ ls -lrt
total 43
328 Feb  1 07:55 log-loop-pointings.mgpls-reductions.txt
6279 Feb  1 07:55 stimela.recipe.deps
11580 Feb  1 07:56 log-loop-pointings.mgpls-reductions-ms1.txt.txt
11576 Feb  1 07:57 log-loop-pointings.mgpls-reductions-ms0.txt.txt
4336 Feb  1 07:57 stimela.stats.full
6732 Feb  1 07:57 stimela.stats.summary.txt
1530 Feb  1 07:57 log-loop-pointings.txt

NB: the resulting profiling does not report the total averaged time (the top-level loop-pointings row shows 0:00:00.0):

───────────────────────────────────────────────────────────────────────────────────── profiling results ─────────────────────────────────────────────────────────────────────────────────────

                                                                                    averages & total I/O
           ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━┳━━━━━━┓
           ┃                                                                                                 ┃  time hms ┃ CPU % ┃ Mem GB ┃ Load ┃ R GB/s ┃ W GB/s ┃ R GB ┃ W GB ┃
           ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━╇━━━━━━┩
           │ loop-pointings                                                                                  │ 0:00:00.0 │  5.68 │   6.00 │ 0.45 │   0.00 │   0.00 │ 0.00 │ 0.00 │
           │ loop-pointings.ms=/scratch3/users/athanaseus/msdir/T16R02C06-corr-scan5-chan512-t16-XXYY.ms 2/2 │ 0:01:23.7 │ 32.65 │   6.00 │ 0.45 │   0.00 │   0.00 │ 0.00 │ 0.00 │
           │ loop-pointings.mgpls-reductions                                                                 │ 0:01:23.7 │  4.86 │   6.00 │ 0.45 │   0.00 │   0.00 │ 0.00 │ 0.00 │
           │ loop-pointings.mgpls-reductions.image-pol-00                                                    │ 0:01:23.4 │  4.70 │   6.00 │ 0.45 │   0.00 │   0.00 │ 0.00 │ 0.00 │
           │ loop-pointings.ms=/scratch3/users/athanaseus/msdir/T16R02C02-corr-scan3-chan512-t16-XXYY.ms 1/2 │ 0:01:29.0 │ 44.50 │   6.00 │ 0.43 │   0.00 │   0.00 │ 0.00 │ 0.00 │
           └─────────────────────────────────────────────────────────────────────────────────────────────────┴───────────┴───────┴────────┴──────┴────────┴────────┴──────┴──────┘

                                                                                     peaks
           ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━┓
           ┃                                                                                                 ┃  time hms ┃ CPU % ┃ Mem GB ┃ Load ┃ R GB/s ┃ W GB/s ┃
           ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━┩
           │ loop-pointings                                                                                  │ 0:00:00.0 │ 50.00 │   6.00 │ 0.65 │   0.00 │   0.00 │
           │ loop-pointings.ms=/scratch3/users/athanaseus/msdir/T16R02C06-corr-scan5-chan512-t16-XXYY.ms 2/2 │ 0:01:23.7 │ 40.30 │   6.00 │ 0.55 │   0.00 │   0.00 │
           │ loop-pointings.mgpls-reductions                                                                 │ 0:01:23.7 │ 37.50 │   6.00 │ 0.65 │   0.00 │   0.00 │
           │ loop-pointings.mgpls-reductions.image-pol-00                                                    │ 0:01:23.4 │ 31.00 │   6.00 │ 0.65 │   0.00 │   0.00 │
           │ loop-pointings.ms=/scratch3/users/athanaseus/msdir/T16R02C02-corr-scan3-chan512-t16-XXYY.ms 1/2 │ 0:01:29.0 │ 50.00 │   6.00 │ 0.51 │   0.00 │   0.00 │
           └─────────────────────────────────────────────────────────────────────────────────────────────────┴───────────┴───────┴────────┴──────┴────────┴────────┘
@o-smirnov
Member

Let's try setting explicit CPU and memory limits, and see if the scheduling problems persist.

Do an _include_post: tweaks.yml in your recipe, and put something like this in tweaks.yml:

opts:
  backend:
    singularity:
      auto_update: false
    slurm:
      enable: true
      srun_opts:
        time: 0 
        cpus-per-task: 4  # make default conservative -- give more to specific cabs below

ratt-parrot:
  assign:
    dirs:
      temp: /scratch3/users/{config.run.env.USER}/tmp
    ncpu: 4  # default setting unless overridden by cab
  steps:
    upsample-2:
      params:
        num-threads: 32
    upsample-3:
      params:
        num-threads: 32

cabs:
  quartical:
    backend:
      slurm:
        srun_opts:
          mem: 128GB
          cpus-per-task: 32
  wsclean:
    backend:
      slurm:
        srun_opts:
          mem: 240GB
          cpus-per-task: 32
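
For reference, the recipe side then only needs the include itself; a minimal sketch, assuming tweaks.yml sits alongside the recipe file:

# at the top level of the recipe YAML (sketch)
_include_post: tweaks.yml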

@o-smirnov
Member

@Athanaseus, heads up: time: 0 gives me a funny error: # srun: Requested partition configuration not available now. Guess ilifu doesn't like time-unlimited jobs.

time: 1-12 (meaning 1 day 12 hours) works though.
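
In tweaks.yml terms that would be something like (sketch only):

opts:
  backend:
    slurm:
      srun_opts:
        time: 1-12  # 1 day 12 hours; time: 0 (unlimited) is rejected by the ilifu partition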

@o-smirnov
Member

Searching around, it seems that these:

# srun: Job 8414557 step creation temporarily disabled, retrying (Requested nodes are busy)
# srun: Job 8414557 step creation still disabled, retrying (Requested nodes are busy)

are likely due to a memory limit not being specified; see https://stackoverflow.com/questions/35498763/parallel-but-different-slurm-srun-job-step-invocations-not-working
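
In other words, the fix amounts to an explicit memory request in the default srun_opts; a minimal sketch (the 100GB figure is only an example):

opts:
  backend:
    slurm:
      srun_opts:
        mem: 100GB  # without an explicit mem, each job step claims the whole allocation and parallel steps block one another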

@Athanaseus, have you seen them since adding the mem option?

@o-smirnov
Member

I've added a feature to the slurm backend to check that memory settings are specified. See #208.

@Athanaseus
Contributor Author

@Athanaseus, have you seen them since adding the mem option?

No, I haven't spotted them. I only experienced a longer queue wait when I set mem: 240GB for the imager, and had to kill that run. Decreasing it to 200GB didn't leave me queued for long.

### running /opt/slurm/bin/srun -J loop-pointings.mgpls-reductions.image-pol-00 --time 1-12 --mem 240GB --cpus-per-task 32 /software/common/singularity/3.11.3/bin/singularity exec --containall 
--pwd /scratch3/users/athanaseus --bind /scratch3/users/athanaseus:/scratch3/users/athanaseus:rw /scratch3/users/athanaseus/images/quay.io-stimela2-wsclean:cc0.0.2.simg wsclean -name 
mgpls-output-001/mgpls-imaging-00/T16R02C04-corr-scan4-chan512-t16-XXYY -data-column DATA -channels-out 4 -deconvolution-channels 4 -pol I -weight briggs -0.3 -size 10000 10000 -multiscale     
-multiscale-scales 0,6,12,18,24,28,32,64,128,256,300,340,454,512 -multiscale-scale-bias 0.5 -scale 1.5asec -niter 100000 -nmiter 4 -threshold 3.5e-05 -auto-mask 12.0 -gain 0.1 -mgain 0.8       
-join-channels -fit-spectral-pol 4 -padding 1.2 -nwlayers-factor 3.0 -parallel-gridding 72 -use-wgridder -log-time /scratch3/users/athanaseus/msdir/T16R02C04-corr-scan4-chan512-t16-XXYY.ms     
# srun: job 8416180 queued and waiting for resources
⠋ 0:00:12 loop-pointings ⠋  0:01:47
⠹ 0:00:12 loop-pointings ⠹  0:02:35
⠏ 0:00:12 loop-pointings ⠏  0:03:09

I've added a feature to the slurm backend to check that memory settings are specified.

Got it.

opts:
  backend:
    singularity:
      auto_update: false
    slurm:
      enable: true
      srun_opts:
        time: 1-12
        mem: 100GB
        cpus-per-task: 4  # make default conservative -- give more to specific cabs below

cabs:
  quartical:
    backend:
      slurm:
        srun_opts:
          mem: 128GB
          cpus-per-task: 32
  wsclean:
    backend:
      slurm:
        srun_opts:
          mem: 200GB
          cpus-per-task: 32

By the way, the logs have a double extension (log-loop-pointings.mgpls-reductions-ms0.txt.txt).

@Athanaseus
Contributor Author

Athanaseus commented Feb 2, 2024

CASE3

Here, 2 of the 3 nodes ran to the end of the reductions.
One failed with the following error:

2024-02-01 22:39:45 STIMELA.loop-pointings.mgpls-reductions.flagsummary-stats-0 INFO: ### validated inputs
2024-02-01 22:39:45 STIMELA.loop-pointings.mgpls-reductions.flagsummary-stats-0 INFO: cab flagdata:
2024-02-01 22:39:45 STIMELA.loop-pointings.mgpls-reductions.flagsummary-stats-0 INFO:   spw =
2024-02-01 22:39:45 STIMELA.loop-pointings.mgpls-reductions.flagsummary-stats-0 INFO:   mode = summary
2024-02-01 22:39:45 STIMELA.loop-pointings.mgpls-reductions.flagsummary-stats-0 INFO:   ms = /mnt/beegfs/mktplus/msdir/T16R02C02-corr-scan3-chan512-t16-XXYY.ms
2024-02-01 22:39:45 STIMELA.loop-pointings.mgpls-reductions.flagsummary-stats-0 INFO: singularity image /mnt/beegfs/mktplus/SINGULARITY_IMAGES/quay.io-stimela2-casa:cc0.0.2.simg exists
### running /usr/bin/srun -J loop-pointings.mgpls-reductions.flagsummary-stats-0 --time 1-12 --cpus-per-task 4 /opt/tools/singularity-3.9.9//bin/singularity exec --containall --pwd /mnt/beegfs/mktplus --bind /mnt/beegfs/mktplus:/mnt/beegfs/mktplus:rw /mnt/beegfs/mktplus/SINGULARITY_IMAGES/quay.io-stimela2-casa:cc0.0.2.simg casa --log2term --nologger --nologfile -c
import sys, json
kw = json.loads(sys.argv[-1])

try:
    utype = unicode
except NameError:
    utype = bytes

def stringify(x):
    if isinstance(x, (utype, str)):
        return str(x)
    elif isinstance(x, list):
        return [stringify(y) for y in x]
    else:
        return x

kw = {key: stringify(value) for key, value in kw.items()}

flagdata(**kw)

 {"vis": "/mnt/beegfs/mktplus/msdir/T16R02C02-corr-scan3-chan512-t16-XXYY.ms", "spw": "", "mode": "summary"}
# FATAL:   container creation failed: mount /proc/self/fd/3->/opt/tools/singularity-3.9.9/var/singularity/mnt/session/rootfs error: while mounting image /proc/self/fd/3: kernel reported a bad superblock for squashfs image partition, possible causes are that your kernel doesn't support the compression algorithm or the image is corrupted
# srun: error: r35c1s12: task 0: Exited with exit code 255
2024-02-01 22:39:48 STIMELA.loop-pointings.mgpls-reductions.flagsummary-stats-0 ERROR: step 'loop-pointings.mgpls-reductions.flagsummary-stats-0' has failed, aborting the recipe

The profiling:

2024-02-02 00:48:35 STIMELA.loop-pointings.mgpls-reductions INFO: recipe 'MGPLS-HII-bubbles' executed successfully
─────────────────────────────────────────────────────────────────────────────────────── profiling results ───────────────────────────────────────────────────────────────────────────────────────

                                                                                      averages & total I/O
                ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━┳━━━━━━┓
                ┃                                                                                          ┃  time hms ┃ CPU % ┃ Mem GB ┃ Load ┃ R GB/s ┃ W GB/s ┃ R GB ┃ W GB ┃
                ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━╇━━━━━━┩
                │ loop-pointings                                                                           │ 0:00:00.0 │  0.09 │   5.00 │ 0.01 │   0.00 │   0.00 │ 0.00 │ 0.03 │
                │ loop-pointings.ms=/mnt/beegfs/mktplus/msdir/T16R02C02-corr-scan3-chan512-t16-XXYY.ms 1/3 │ 0:00:03.6 │  2.95 │   5.00 │ 0.29 │   0.00 │   0.00 │ 0.00 │ 0.00 │
                │ loop-pointings.mgpls-reductions                                                          │ 0:00:03.6 │  0.08 │   5.00 │ 0.01 │   0.00 │   0.00 │ 0.00 │ 0.03 │
                │ loop-pointings.mgpls-reductions.flagsummary-stats-0                                      │ 0:00:03.4 │  3.86 │   5.00 │ 0.29 │   0.00 │   0.00 │ 0.00 │ 0.00 │
                │ loop-pointings.ms=/mnt/beegfs/mktplus/msdir/T16R02C06-corr-scan5-chan512-t16-XXYY.ms 3/3 │ 1:28:09.8 │  3.95 │   5.00 │ 0.14 │   0.00 │   0.00 │ 0.00 │ 0.00 │
                │ loop-pointings.mgpls-reductions.save-flags-1                                             │ 0:00:07.4 │  1.47 │   5.00 │ 0.19 │   0.00 │   0.00 │ 0.00 │ 0.00 │
                │ loop-pointings.mgpls-reductions.flagsummary-stats-1                                      │ 0:00:09.0 │  1.46 │   5.00 │ 0.18 │   0.00 │   0.00 │ 0.00 │ 0.00 │
                │ loop-pointings.mgpls-reductions.image-pol-00                                             │ 0:37:31.2 │  0.06 │   5.00 │ 0.02 │   0.00 │   0.00 │ 0.00 │ 0.01 │
                │ loop-pointings.mgpls-reductions.quality-assess-00                                        │ 0:00:25.6 │  0.31 │   5.00 │ 0.02 │   0.00 │   0.00 │ 0.00 │ 0.00 │
                │ loop-pointings.mgpls-reductions.masking-0                                                │ 0:00:08.7 │  0.66 │   5.00 │ 0.01 │   0.00 │   0.00 │ 0.00 │ 0.00 │
                │ loop-pointings.mgpls-reductions.image-pol-0                                              │ 0:48:59.1 │  0.06 │   5.00 │ 0.01 │   0.00 │   0.00 │ 0.00 │ 0.02 │
                │ loop-pointings.mgpls-reductions.quality-assess-0                                         │ 0:00:25.4 │  0.38 │   5.00 │ 0.01 │   0.00 │   0.00 │ 0.00 │ 0.00 │
                │ loop-pointings.ms=/mnt/beegfs/mktplus/msdir/T16R02C04-corr-scan4-chan512-t16-XXYY.ms 2/3 │ 2:08:49.7 │  4.00 │   5.00 │ 0.14 │   0.00 │   0.00 │ 0.00 │ 0.00 │
                └──────────────────────────────────────────────────────────────────────────────────────────┴───────────┴───────┴────────┴──────┴────────┴────────┴──────┴──────┘

                                                                                      peaks
                ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━┓
                ┃                                                                                          ┃  time hms ┃ CPU % ┃ Mem GB ┃ Load ┃ R GB/s ┃ W GB/s ┃
                ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━┩
                │ loop-pointings                                                                           │ 0:00:00.0 │ 16.70 │   5.00 │ 0.29 │   0.00 │   0.00 │
                │ loop-pointings.ms=/mnt/beegfs/mktplus/msdir/T16R02C02-corr-scan3-chan512-t16-XXYY.ms 1/3 │ 0:00:03.6 │  4.50 │   5.00 │ 0.29 │   0.00 │   0.00 │
                │ loop-pointings.mgpls-reductions                                                          │ 0:00:03.6 │ 16.70 │   5.00 │ 0.29 │   0.00 │   0.00 │
                │ loop-pointings.mgpls-reductions.flagsummary-stats-0                                      │ 0:00:03.4 │  8.40 │   5.00 │ 0.29 │   0.00 │   0.00 │
                │ loop-pointings.ms=/mnt/beegfs/mktplus/msdir/T16R02C06-corr-scan5-chan512-t16-XXYY.ms 3/3 │ 1:28:09.8 │  5.40 │   5.00 │ 0.29 │   0.00 │   0.00 │
                │ loop-pointings.mgpls-reductions.save-flags-1                                             │ 0:00:07.4 │  5.50 │   5.00 │ 0.21 │   0.00 │   0.00 │
                │ loop-pointings.mgpls-reductions.flagsummary-stats-1                                      │ 0:00:09.0 │  5.30 │   5.00 │ 0.23 │   0.00 │   0.00 │
                │ loop-pointings.mgpls-reductions.image-pol-00                                             │ 0:37:31.2 │  4.40 │   5.00 │ 0.23 │   0.00 │   0.00 │
                │ loop-pointings.mgpls-reductions.quality-assess-00                                        │ 0:00:25.6 │  2.70 │   5.00 │ 0.02 │   0.00 │   0.00 │
                │ loop-pointings.mgpls-reductions.masking-0                                                │ 0:00:08.7 │  3.00 │   5.00 │ 0.02 │   0.00 │   0.00 │
                │ loop-pointings.mgpls-reductions.image-pol-0                                              │ 0:48:59.1 │  2.80 │   5.00 │ 0.08 │   0.00 │   0.00 │
                │ loop-pointings.mgpls-reductions.quality-assess-0                                         │ 0:00:25.4 │  4.30 │   5.00 │ 0.01 │   0.00 │   0.00 │
                │ loop-pointings.ms=/mnt/beegfs/mktplus/msdir/T16R02C04-corr-scan4-chan512-t16-XXYY.ms 2/3 │ 2:08:49.7 │  5.40 │   5.00 │ 0.29 │   0.00 │   0.00 │
                └──────────────────────────────────────────────────────────────────────────────────────────┴───────────┴───────┴────────┴──────┴────────┴────────┘

2024-02-02 00:48:35 STIMELA.loop-pointings INFO: saved full profiling stats to logs/log-20240201-223932/stimela.stats.full
2024-02-02 00:48:35 STIMELA.loop-pointings INFO: saved summary to logs/log-20240201-223932/stimela.stats.summary.txt
2024-02-02 00:48:35 STIMELA ERROR: run failed after 2:08:51
───────────────────────────────────────────────────────────────────────────────── detailed error report follows ─────────────────────────────────────────────────────────────────────────────────
        ⚠ run failed after 2:08:51
        └── 1/3 loop iterations failed
            └── step 'loop-pointings.mgpls-reductions' has failed, aborting the recipe
                └── step 'loop-pointings.mgpls-reductions.flagsummary-stats-0' has failed, aborting the recipe
                    └── error running cab 'flagdata'
                        └── flagdata returns error code 255 after 0:00:03
2024-02-02 00:48:35 STIMELA.loop-pointings INFO: last log directory was logs/log-20240201-223932

This was running on the Pleiadi cluster, after the new srun options but before the global mem setting from #189.

@o-smirnov
Member

I only experience a longer queue wait when I set mem: 240GB for the imager and had to kill it.

Yeah that figures. 240GB is pretty much a full ilifu fat node. So it needs to wait for an entire node to be available. Whereas with a lower memory request, it can squeeze you onto a partially used node.
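
For the wsclean cab that suggests keeping the request just under a full node; a sketch of the compromise you ended up with:

cabs:
  wsclean:
    backend:
      slurm:
        srun_opts:
          mem: 200GB  # leaves headroom so the step can be co-scheduled on a partially used node
          cpus-per-task: 32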

FATAL: container creation failed: mount /proc/self/fd/3->/opt/tools/singularity-3.9.9/var/singularity/mnt/session/rootfs error: while mounting image /proc/self/fd/3: kernel reported a bad superblock for squashfs image partition, possible causes are that your kernel doesn't support the compression algorithm or the image is corrupted

Did the same step run with another MS on another node? If yes, then it's a problem with their filesystem (the image reads fine on some nodes but not on others) -- report it to the sysadmins on the chat. If no, the casa image may be bad -- check/rebuild it.

@Athanaseus
Contributor Author

Athanaseus commented Feb 2, 2024

Yes, the other MSs on other nodes finished successfully and even have profiling stats, while this one just crashed.
OK, I'll report it to the sysadmin.

Here is another related one after re-running (different MS):

───────────────────────────────────────────────────────────────────────── loop-pointings.mgpls-reductions.save-flags-1 ──────────────────────────────────────────────────────────────────────────
2024-02-02 10:30:41 STIMELA.loop-pointings.mgpls-reductions.save-flags-1 INFO: ### validated inputs
2024-02-02 10:30:41 STIMELA.loop-pointings.mgpls-reductions.save-flags-1 INFO: cab flagmanager:
2024-02-02 10:30:41 STIMELA.loop-pointings.mgpls-reductions.save-flags-1 INFO:   versionname = mgpls_1_flagging
2024-02-02 10:30:41 STIMELA.loop-pointings.mgpls-reductions.save-flags-1 INFO:   mode = save
2024-02-02 10:30:41 STIMELA.loop-pointings.mgpls-reductions.save-flags-1 INFO:   ms = /mnt/beegfs/mktplus/msdir/T16R02C06-corr-scan5-chan512-t16-XXYY.ms
2024-02-02 10:30:41 STIMELA.loop-pointings.mgpls-reductions.save-flags-1 INFO: singularity image /mnt/beegfs/mktplus/SINGULARITY_IMAGES/quay.io-stimela2-casa:cc0.0.2.simg exists
### running /usr/bin/srun -J loop-pointings.mgpls-reductions.save-flags-1 --time 1-12 --mem 80GB --cpus-per-task 4 /opt/tools/singularity-3.9.9//bin/singularity exec --containall --pwd         
/mnt/beegfs/mktplus --bind /mnt/beegfs/mktplus:/mnt/beegfs/mktplus:rw /mnt/beegfs/mktplus/SINGULARITY_IMAGES/quay.io-stimela2-casa:cc0.0.2.simg casa --log2term --nologger --nologfile -c        
import sys, json
kw = json.loads(sys.argv[-1])

try:
    utype = unicode
except NameError:
    utype = bytes

def stringify(x):
    if isinstance(x, (utype, str)):
        return str(x)
    elif isinstance(x, list):
        return [stringify(y) for y in x]
    else:
        return x

kw = {key: stringify(value) for key, value in kw.items()}

flagmanager(**kw)

 {"vis": "/mnt/beegfs/mktplus/msdir/T16R02C06-corr-scan5-chan512-t16-XXYY.ms", "versionname": "mgpls_1_flagging", "mode": "save"}
# Traceback (most recent call last):
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/start_casa.py", line 10, in <module>
#     from IPython import start_ipython
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/IPython/__init__.py", line 49, in <module>
#     from .terminal.embed import embed
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/IPython/terminal/embed.py", line 15, in <module>
#     from IPython.core.magic import Magics, magics_class, line_magic
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/IPython/core/magic.py", line 22, in <module>

#     from IPython.core.magic import Magics, magics_class, line_magic
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/IPython/core/magic.py", line 22, in <module>
#     from IPython.core import oinspect
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/IPython/core/oinspect.py", line 48, in <module>
#     from pygments.lexers import PythonLexer
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/pygments/lexers/__init__.py", line 20, in <module>
#     from pygments.plugin import find_plugin_lexers
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/pygments/plugin.py", line 39, in <module>
#     import pkg_resources
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2994, in <module>
#     @_call_aside
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2980, in _call_aside
#     f(*args, **kwargs)
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 3007, in _initialize_master_working_set
#     working_set = WorkingSet._build_master()
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 649, in _build_master
#     ws = cls()
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 642, in __init__
#     self.add_entry(entry)
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 698, in add_entry
#     for dist in find_distributions(entry, True):
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2000, in find_on_path
#     path_item, entry, metadata, precedence=DEVELOP_DIST
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2409, in from_location
#     py_version=py_version, platform=platform, **kw
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2749, in _reload_version
#     md_version = _version_from_file(self._get_metadata(self.PKG_INFO))
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2374, in _version_from_file
#     line = next(iter(version_lines), '')
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2542, in _get_metadata
#     for line in self.get_metadata_lines(name):
# 
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1470, in get_metadata_lines
#     return yield_lines(self.get_metadata(name))
# =========================================
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1466, in get_metadata
# The start-up time of CASA may vary
#     value = self._get(self._fn(self.egg_info, name))
# depending on whether the shared libraries
#   File "/casa-release-5.8.0-109.el7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1576, in _get
#     return stream.read()
# are cached or not.
# IOError: [Errno 5] Input/output error
# =========================================
# srun: error: r35c1s12: task 0: Exited with exit code 1
# 
2024-02-02 10:30:41 STIMELA.loop-pointings.mgpls-reductions.save-flags-1 ERROR: step 'loop-pointings.mgpls-reductions.save-flags-1' has failed, aborting the recipe
───────────────────────────────────────────────────────────────────────────────── detailed error report follows ─────────────────────────────────────────────────────────────────────────────────
        ⚠ step 'loop-pointings.mgpls-reductions.save-flags-1' has failed, aborting the recipe
        └── error running cab 'flagmanager'
            └── flagmanager returns error code 1 after 0:00:01

@o-smirnov
Member

Might be the same problem... again an I/O error on the filesystem...

@Athanaseus
Contributor Author

The error seems to have gone away with the new setup. I'll keep an eye on it.
(This might have been due to a misallocation of resources, i.e. running on a compute node instead of the head node.)

caracal-pipeline locked and limited conversation to collaborators on Feb 2, 2024
o-smirnov converted this issue into discussion #216 on Feb 2, 2024
