
From serialisation to parallelisation #6

Open
TCLamnidis opened this issue Apr 19, 2022 · 2 comments

@TCLamnidis
Collaborator

TCLamnidis commented Apr 19, 2022

In the current setup, run_Eager.sh uses nextflow to parallelise jobs within each batch, but batches are handled in series, with one run starting only after the previous one finishes/fails. To improve processing speeds I want to parallelise batches, so that up to 3 batches can run at once.

Multiple instances of nextflow cannot be launched from the same directory, as their .nextflow.log files will collide. A potential solution would be to bind each sequencing batch to a specific instance of run_Eager.sh, with each of the 3 instances running every 3rd batch.

2020-05-03-batch1.eager_input.txt  ## instance 1
2020-05-03-batch2.eager_input.txt  ## instance 2
2020-06-26-batch3.eager_input.txt  ## instance 3
2020-06-26-batch4.eager_input.txt  ## instance 1
2020-06-26-batch5.eager_input.txt  ## instance 2
2020-06-26-batch6.eager_input.txt  ## instance 3

Since batch file names start with the batch's initial creation date and are sorted alphabetically, the run_Eager.sh instance assigned to each batch will be stable, allowing resuming without issue (🤞).
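The every-3rd-batch assignment above could be sketched roughly as follows. This is only an illustration, not how run_Eager.sh currently works: the function name `batches_for_instance` and its argument order are made up for this example.

```shell
# Hypothetical helper (not part of run_Eager.sh): print the batch files
# that belong to one instance under a round-robin assignment.
# Usage: batches_for_instance <instance, 1-based> <n_instances> <file>...
batches_for_instance() {
    instance="$1"
    n_instances="$2"
    shift 2
    # Nothing to assign if no batch files were given.
    [ "$#" -gt 0 ] || return 0
    # Sort with a fixed locale so the order is the same on every call,
    # then keep every n-th line, starting at this instance's offset.
    printf '%s\n' "$@" | LC_ALL=C sort |
        awk -v i="$instance" -v n="$n_instances" '(NR - 1) % n == i - 1'
}
```

Called as `batches_for_instance 2 3 *.eager_input.txt` on the six files listed above, this should print batch2 and batch5. The assignment only stays stable while new batch files sort after all existing ones; a file that sorts into the middle of the list would shift every later batch to a different instance.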

@stschiff

But wouldn't they still be in the same directory? How does this solve the issue that you can't fire them off from the same dir?

@TCLamnidis
Collaborator Author

Problem 1: each run needs its own directory
Solution: Initialise each run from a different directory. This is what I have been doing over the last few weeks to speed up processing, but it raises problem 2.

Problem 2: When resuming processing of a run, nextflow should be launched from the same directory as the original run, otherwise the run restarts from scratch (i.e. past progress is ignored).
Solution: If the directory that nextflow is launched from is fixed for each batch, then resuming will also work as intended.
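One way to fix the launch directory per batch is to derive it from the batch file name, so that a resumed run always lands in the same place. The sketch below assumes a layout with one directory per batch under a common root; the function names and that layout are hypothetical, not taken from run_Eager.sh.

```shell
# Hypothetical layout: one launch directory per batch under <run_root>,
# named after the batch file without its .eager_input.txt suffix.
launch_dir_for_batch() {
    run_root="$1"
    batch_file="$2"
    batch_name="$(basename "$batch_file" .eager_input.txt)"
    printf '%s/%s\n' "$run_root" "$batch_name"
}

# Run an arbitrary command from the batch's fixed launch directory.
# Usage: run_in_batch_dir <run_root> <batch_file> <command> [args...]
run_in_batch_dir() {
    run_root="$1"
    batch_file="$2"
    shift 2
    dir="$(launch_dir_for_batch "$run_root" "$batch_file")"
    mkdir -p "$dir" || return 1
    # Subshell, so the caller's working directory is left untouched.
    ( cd "$dir" && "$@" )
}
```

The command passed in would be whatever nextflow invocation run_Eager.sh already uses, with `-resume` added. Because the directory depends only on the batch name, the first run and every later resume of that batch start from the same place.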

The extreme case for this would be to start all runs at the same time, launching nextflow for each batch in its own eager output directory. That would be fastest but would also block the cluster for everyone. So I'm leaning towards having a set number of "active" runs at a time.
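A cap on the number of "active" runs could also be enforced with a small worker pool instead of fixed instances. This is a different technique from the every-3rd-batch binding described earlier: here any free slot picks up the next batch. A minimal sketch using `xargs -P` (available in GNU and BSD xargs), where `run_batches_limited` and the per-batch runner script are hypothetical:

```shell
# Hypothetical helper: run <runner> once per batch file, with at most
# <max_active> runs in progress at any time.
# <runner> must be an executable (not a shell function) that takes one
# batch file as its only argument.
# Assumes batch file names contain no whitespace or quote characters.
run_batches_limited() {
    max_active="$1"
    runner="$2"
    shift 2
    [ "$#" -gt 0 ] || return 0
    # -n 1: one batch per runner call; -P: number of parallel runners.
    printf '%s\n' "$@" | LC_ALL=C sort | xargs -n 1 -P "$max_active" "$runner"
}
```

For example, `run_batches_limited 3 ./run_one_batch.sh *.eager_input.txt` would keep up to 3 batches running at once. Resuming still works only if the runner launches nextflow from a directory that is fixed for each batch.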
