launcher runs too many jobs when using multiple nodes #48

Open
schristley opened this issue Feb 21, 2018 · 3 comments

Comments

@schristley

I filed a TACC ticket (Ticket #43327) because I don't know whether this is a bug in launcher itself or only an issue with TACC's current version of launcher.

I'm using the launcher module to run multiple processes across multiple nodes. When the number of commands is less than the total number of processes available, launcher runs the "last" command multiple times.

A test job that reproduces the bug is available here:

/scratch/01114/vdj/vdj/launcher-test

I created a simple joblist with 12 echo commands. LAUNCHER_PPN=4 and the job requests 4 nodes, so a total of 16 concurrent processes (4 x 4) can run, though only 12 are needed. Here is the summary printed by launcher:

------------- SUMMARY ---------------
Number of hosts: 4
Working directory: /scratch/01114/vdj/vdj/launcher-test
Processes per host: 4
Total processes: 16
Total jobs: 12
Scheduling method: interleaved


The last command in joblist is "echo 12", and this command is actually run 5 times. job.out shows that although there are only 12 jobs, launcher runs 16: job numbers 12 through 16 all execute the last command.
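One quick way to confirm the duplication is to tally the commands that launcher reports running. The following is a minimal sketch, not part of launcher: the function name `count_runs` is made up for illustration, and it assumes the log uses the "Launcher: Task N running job M on HOST (COMMAND)" line format shown in this issue.

```shell
#!/bin/bash
# Hypothetical helper (not part of launcher): count how many times each
# command appears in launcher's "running job" log lines.
# Assumed line format, taken from the output in this issue:
#   Launcher: Task N running job M on HOST (COMMAND)
count_runs() {
    local log_file="$1"
    grep 'running job' "$log_file" \
        | sed -E 's/.*\((.*)\)$/\1/' \
        | sort \
        | uniq -c \
        | sort -k1,1nr    # most frequently run command first
}
```

Calling `count_runs job.out` prints one line per distinct command with its run count. For the log in this issue, "echo 12" should show a count of 5 and every other command a count of 1.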

@schristley

schristley commented Feb 21, 2018

As you probably don't have access to TACC, here are the test files. This is job.sh:

#!/bin/bash
#SBATCH -J repcalc_bcr4_test
#SBATCH -o job.out
#SBATCH -e job.err
#SBATCH -t 01:00:00
#SBATCH -p skx-normal
#SBATCH -N 4 -n 48
#SBATCH -A RepServer

module purge
module load TACC
module load launcher
module load python

rm -f joblist
touch joblist
echo "echo 1" >> joblist
echo "echo 2" >> joblist
echo "echo 3" >> joblist
echo "echo 4" >> joblist
echo "echo 5" >> joblist
echo "echo 6" >> joblist
echo "echo 7" >> joblist
echo "echo 8" >> joblist
echo "echo 9" >> joblist
echo "echo 10" >> joblist
echo "echo 11" >> joblist
echo "echo 12" >> joblist

# Launcher to use multicores on node
export LAUNCHER_WORKDIR=$PWD
export LAUNCHER_PPN=4
export LAUNCHER_JOB_FILE=joblist
export LAUNCHER_SCHED=interleaved

$LAUNCHER_DIR/paramrun
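
Until the cause is known, a pre-flight check can at least flag the condition that triggers the problem, namely fewer jobs than nodes x LAUNCHER_PPN. This is only a sketch under assumptions: `check_job_count` is a hypothetical helper, not a launcher feature, and it counts every non-empty line of the job file as one job.

```shell
#!/bin/bash
# Hypothetical pre-flight check (not part of launcher): warn when the job
# file has fewer jobs than the number of processes that will be started.
# Assumption: every non-empty line in the job file is one job.
check_job_count() {
    local job_file="$1" nodes="$2" ppn="$3"
    local jobs total
    jobs=$(grep -c . "$job_file")    # number of non-empty lines
    total=$(( nodes * ppn ))
    if [ "$jobs" -lt "$total" ]; then
        echo "warning: $jobs jobs for $total processes; $(( total - jobs )) tasks would have no job of their own"
        return 1
    fi
    echo "ok: $jobs jobs for $total processes"
}
```

In job.sh it could be called just before paramrun as `check_job_count "$LAUNCHER_JOB_FILE" "$SLURM_NNODES" "$LAUNCHER_PPN"`, where SLURM_NNODES is the node count that Slurm exports inside a batch job. For the test case here (12 jobs, 4 nodes, PPN of 4) it reports 4 surplus tasks. Lowering the node count or LAUNCHER_PPN so that the total does not exceed the job count may avoid the duplicates, but that is untested.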

@schristley

Here is the output from running the job:

Launcher: Setup complete.

------------- SUMMARY ---------------
   Number of hosts:    4
   Working directory:  /scratch/01114/vdj/vdj/launcher-test
   Processes per host: 4
   Total processes:    16
   Total jobs:         12
   Scheduling method:  interleaved

-------------------------------------
Launcher: Starting parallel tasks...
Launcher: Task 0 running job 1 on c479-111.stampede2.tacc.utexas.edu (echo 1)
Launcher: Task 3 running job 4 on c479-111.stampede2.tacc.utexas.edu (echo 4)
1
4
Launcher: Task 1 running job 2 on c479-111.stampede2.tacc.utexas.edu (echo 2)
2
Launcher: Task 2 running job 3 on c479-111.stampede2.tacc.utexas.edu (echo 3)
3
Launcher: Job 2 completed in 0 seconds.
Launcher: Job 1 completed in 0 seconds.
Launcher: Job 3 completed in 0 seconds.
Launcher: Job 4 completed in 0 seconds.
Launcher: Task 1 done. Exiting.
Launcher: Task 0 done. Exiting.
Launcher: Task 3 done. Exiting.
Launcher: Task 2 done. Exiting.
Launcher: Task 5 running job 6 on c479-112.stampede2.tacc.utexas.edu (echo 6)
Launcher: Task 6 running job 7 on c479-112.stampede2.tacc.utexas.edu (echo 7)
Launcher: Task 7 running job 8 on c479-112.stampede2.tacc.utexas.edu (echo 8)
Launcher: Task 4 running job 5 on c479-112.stampede2.tacc.utexas.edu (echo 5)
6
7
8
5
Launcher: Task 10 running job 11 on c490-084.stampede2.tacc.utexas.edu (echo 11)
11
Launcher: Task 8 running job 9 on c490-084.stampede2.tacc.utexas.edu (echo 9)
9
Launcher: Task 13 running job 14 on c490-091.stampede2.tacc.utexas.edu (echo 12)
Launcher: Task 15 running job 16 on c490-091.stampede2.tacc.utexas.edu (echo 12)
12
12
Launcher: Task 14 running job 15 on c490-091.stampede2.tacc.utexas.edu (echo 12)
12
Launcher: Task 11 running job 12 on c490-084.stampede2.tacc.utexas.edu (echo 12)
12
Launcher: Task 9 running job 10 on c490-084.stampede2.tacc.utexas.edu (echo 10)
10
Launcher: Task 12 running job 13 on c490-091.stampede2.tacc.utexas.edu (echo 12)
12
Launcher: Job 5 completed in 0 seconds.
Launcher: Job 7 completed in 0 seconds.
Launcher: Job 8 completed in 0 seconds.
Launcher: Job 11 completed in 0 seconds.
Launcher: Job 6 completed in 0 seconds.
Launcher: Job 9 completed in 0 seconds.
Launcher: Task 7 done. Exiting.
Launcher: Job 14 completed in 0 seconds.
Launcher: Task 6 done. Exiting.
Launcher: Task 4 done. Exiting.
Launcher: Job 12 completed in 0 seconds.
Launcher: Job 16 completed in 0 seconds.
Launcher: Job 10 completed in 0 seconds.
Launcher: Task 10 done. Exiting.
Launcher: Task 5 done. Exiting.
Launcher: Job 15 completed in 0 seconds.
Launcher: Task 8 done. Exiting.
Launcher: Job 13 completed in 0 seconds.
Launcher: Task 13 done. Exiting.
Launcher: Task 15 done. Exiting.
Launcher: Task 11 done. Exiting.
Launcher: Task 9 done. Exiting.
Launcher: Task 14 done. Exiting.
Launcher: Task 12 done. Exiting.
Launcher: Done. Job exited without errors

@schristley

I guess this is a duplicate of #16
