Enhance LOCIDEX_MERGE
#42
base: dev
Conversation
Notice: the first implementation will break some of the tests. I will fix them shortly; basically, the outputs look different now that LOCIDEX_MERGE creates multiple processes, each with its own output file.
This looks amazing @sgsutcliffe. Thanks so much 😄. I've added a few initial comments.
@@ -0,0 +1,36 @@
process LOCIDEX_CONCAT {
    tag 'Concat LOCIDEX files'
    label 'process_medium'
Could you switch to process_single? Unless you need the increased resources, that is; I'm not sure how much csvtk concat actually uses for large files.
So csvtk has a --num-cpus option with a default of 4. That led me to believe it can be sped up with multiple processors. I should add --num-cpus $task.cpus regardless. 8360c0d
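For context, a minimal sketch of how the script block could look with that suggestion applied; the input/output file names (profiles, merged_profiles.tsv) and the process_single label are assumptions for illustration, not the module's actual interface:

process LOCIDEX_CONCAT {
    tag 'Concat LOCIDEX files'
    label 'process_single'

    input:
    path profiles                       // per-batch profile tables from LOCIDEX_MERGE

    output:
    path "merged_profiles.tsv", emit: merged_profiles

    script:
    """
    # csvtk concat joins tables that share a header; -t reads/writes TSV,
    # and --num-cpus is tied to the allocated CPUs as suggested above
    csvtk concat -t --num-cpus $task.cpus -o merged_profiles.tsv ${profiles}
    """
}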
nextflow.config (Outdated)
@@ -59,6 +59,9 @@ params {
    gm_method = "average"
    gm_delimiter = "."

    // Locidex
    batch_size = 2
I would default it to something much larger (maybe 1000?), though we can adjust this based on the results of our benchmarking.
For the test cases we can have two tests with smaller batch sizes (to test the mv condition when there is one batch, and csvtk when there are multiple).
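To make that concrete, a sketch of the config side under those assumptions (the test profile file name is hypothetical):

// nextflow.config: a much larger default batch size, pending benchmarking
params {
    // Locidex
    batch_size = 1000
}

// conf/test_concat.config (hypothetical test profile): a small batch size so the
// test data is split into several batches and exercises the csvtk concat path;
// a second test can leave batch_size above the sample count to hit the mv path
params {
    batch_size = 2
}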
Somehow, bifurcating the input into subchannels made the output inconsistent, which led to flaky tests. This change (0d84be3) should make the output consistent and reduce the flaky test results.
Running a benchmark we have used on previous releases of
The branch needed to be updated; it didn't have the patch release on it, which was causing the benchmark to fail.
STRY0016853: Enhance LOCIDEX_MERGE

We want to switch the LOCIDEX_MERGE steps to use multiple processes in gasnomenclature, so that the process is more efficient and can handle larger volumes of data.

Criteria
- Split the input to LOCIDEX_MERGE, allowing multiple instances to run in parallel.
- LOCIDEX_MERGE should output CSV files of subsets of samples.
- Add a step to gasnomenclature to merge the sample subset CSV files into a single profiles file.

Proposed Solution

Before running LOCIDEX_MERGE, the list of MLST JSON files that was originally passed to LOCIDEX_MERGE is subdivided into batches (size specified by the parameter --batch_size).

Then, downstream of LOCIDEX_MERGE, we run csvtk concat on the output (if more than one output file is generated). This is performed by the module LOCIDEX_CONCAT.
PR checklist
- Make sure your code lints (nf-core lint).
- Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
- docs/usage.md is updated.
- CHANGELOG.md is updated.
- README.md is updated.