Slurm: -o argument in addprocs_slurm leads to an error #65

Open
@stasis0

Hello everyone,

To add workers and schedule jobs on the cluster, I'm using the addprocs_slurm function from ClusterManagers:

slurm_cpus = 4
@async addprocs(SlurmManager(slurm_cpus), partition="all", t="00:10:0")

This works as intended:

Task (runnable) @0x00002b8be08c5cd0connecting to worker 1 out of 4

srun: job 13332841 queued and waiting for resources

julia> srun: job 13332841 has been allocated resources
connecting to worker 2 out of 4
connecting to worker 3 out of 4
connecting to worker 4 out of 4

However, each worker writes its own output file, so with many workers the working directory fills up with output files. To log everything into a single file instead, I added the -o argument:

slurm_cpus = 4
@async addprocs(SlurmManager(slurm_cpus), partition="all", t="00:10:0", o="log.out")

This does create the log file, with the following contents:

julia_worker:9007#131.169.193.109
julia_worker:9006#131.169.193.109
julia_worker:9008#131.169.193.109
julia_worker:9009#131.169.193.109
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.

but no workers are added:

Task (runnable) @0x00002b8be01f7260connecting to worker 1 out of 4

srun: job 13332876 queued and waiting for resources

julia> srun: job 13332876 has been allocated resources
srun: error: max-wn009: tasks 0-3: Exited with exit code 1

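I had a look at the source code. If I understand correctly, it always sets its own values for -o and -D, regardless of what I pass in, and my arguments are then appended through $(srunargs). The resulting srun command would therefore contain -o twice, which may be what causes the trouble: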

jobname = "julia-$(getpid())"
job_output_name = "$(jobname)-$(trunc(Int, Base.time() * 10))"
make_job_output_path(task_num) = joinpath(job_file_loc, "$(job_output_name)-$(task_num).out")
job_output_template = make_job_output_path("%4t")
srun_cmd = `srun -J $jobname -n $np -o "$(job_output_template)" -D $exehome $(srunargs) $exename $exeflags $(worker_arg())`
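To check that reading, here is a minimal sketch that builds the command without running it. It is not the ClusterManagers code: `kwargs_to_srunargs` is a hypothetical helper, the mapping of one-letter keywords to `-k value` and longer ones to `--key=value` is an assumption about how `srunargs` is assembled, and `julia --worker` is a stand-in for `$exename $exeflags $(worker_arg())`.

```julia
# Hypothetical helper (assumption, not the ClusterManagers implementation):
# one-letter keywords become "-k value", longer ones become "--key=value".
function kwargs_to_srunargs(; kwargs...)
    args = String[]
    for (k, v) in kwargs
        key = string(k)
        if length(key) == 1
            push!(args, "-$key")
            isempty(string(v)) || push!(args, string(v))
        else
            push!(args, "--$key=$v")
        end
    end
    return args
end

jobname = "julia-$(getpid())"
np = 4
exehome = pwd()
# Same shape as the manager's per-task template; %4t is expanded by srun.
job_output_template = joinpath(".", "$(jobname)-%4t.out")

# The keyword arguments from the failing call above.
srunargs = kwargs_to_srunargs(partition="all", t="00:10:0", o="log.out")

# Building a Cmd does not execute it, so srun does not need to be installed.
srun_cmd = `srun -J $jobname -n $np -o $job_output_template -D $exehome $srunargs julia --worker`

println(srun_cmd)
println("number of -o flags: ", count(==("-o"), srun_cmd.exec))
```

Under these assumptions the command carries the manager's -o followed by mine. Since log.out is the file that actually received the julia_worker lines, the later flag apparently takes effect. If the manager looks for the worker addresses in the files named by its own template (my guess, not verified), it would never find them, which would match the 60 second connection timeout reported by the workers.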
