Enhance `LOCIDEX_MERGE` #42

sgsutcliffe · 2025-02-13T19:14:11Z

STRY0016853: Enhance `LOCIDEX_MERGE`

We want to switch LOCIDEX_MERGE steps to use multiple processes in gasnomenclature, so that the process is more efficient and can handle larger volumes of data.

Criteria

The enhancement should batch up and pass a subset of samples to LOCIDEX_MERGE, allowing multiple instances to run in parallel.
LOCIDEX_MERGE should output CSV files of subsets of samples.
A post-processing step should be added in gasnomenclature to merge the sample subset CSV files into a single profiles file.

Proposed Solution

Before running LOCIDEX_MERGE, the list of MLST JSON files that was originally passed to LOCIDEX_MERGE is subdivided into batches (size specified by the parameter --batch_size):

    // Divide up inputs into groups for LOCIDEX
    grouped_ref_files = reference_values.flatten()
        .buffer( size: params.batch_size, remainder: true )
    grouped_query_files = query_values.flatten()
        .buffer( size: params.batch_size, remainder: true )
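
For readers less familiar with Nextflow's `buffer` operator, the batching behaviour can be sketched in plain Python. This is only an illustration of the semantics (fixed-size groups, with the final incomplete group kept because of `remainder: true`), not pipeline code; the `batch` helper and the file names are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items.

    The last list may be shorter, mirroring `remainder: true`.
    """
    if size < 1:
        raise ValueError("size must be a positive integer")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical file names, for illustration only.
mlst_files = [f"sample{i}.mlst.json" for i in range(1, 6)]
batches = list(batch(mlst_files, 2))

for number, group in enumerate(batches, start=1):
    print(number, group)
```

Without the remainder, the fifth file in this example would be dropped, which is why the option matters whenever the sample count is not a multiple of the batch size.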

Then, downstream of LOCIDEX_MERGE, we run csvtk concat on the output (if more than one output file is generated). This is performed by the LOCIDEX_CONCAT module.
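
Conceptually, the concatenation step stacks the rows of each batch file under a single header. The Python sketch below is a stand-in for that idea, not a description of how csvtk works internally; it assumes every batch file has an identical header, and the column names and values are made up.

```python
import csv
import io
from typing import List


def concat_csv(texts: List[str]) -> str:
    """Concatenate CSV documents that share a header, keeping the header once."""
    if not texts:
        raise ValueError("need at least one CSV document")
    if len(texts) == 1:
        return texts[0]  # single batch: nothing to merge
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for text in texts:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue  # skip empty documents
        if header is None:
            header = rows[0]
            writer.writerow(header)
        elif rows[0] != header:
            raise ValueError("CSV headers differ between batches")
        writer.writerows(rows[1:])
    return out.getvalue()


# Hypothetical profile content, for illustration only.
batch_a = "sample_id,l1,l2\nS1,1,2\nS2,1,3\n"
batch_b = "sample_id,l1,l2\nS3,2,2\n"
merged = concat_csv([batch_a, batch_b])
print(merged)
```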

PR checklist

Add tests (and fix broken tests)
Make sure your code lints (nf-core lint).
Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
Usage Documentation in docs/usage.md is updated.
CHANGELOG.md is updated.
README.md is updated.

github-actions · 2025-02-13T19:15:30Z

`nf-core pipelines lint` overall result: Passed ✅ ⚠️

Posted for pipeline commit cc66f83

✅ 145 tests passed
❔ 23 tests were ignored
❗ 4 tests had warnings

❗ Test warnings:

files_exist - File not found: conf/igenomes_ignored.config
nextflow_config - nf-validation has been detected in the pipeline. Please migrate to nf-schema: https://nextflow-io.github.io/nf-schema/latest/migration_guide/
nextflow_config - Config manifest.version should end in dev: 0.3.1
schema_lint - Schema $id should be https://raw.githubusercontent.com/phac-nml/gasnomenclature/master/nextflow_schema.json. Found https://raw.githubusercontent.com/phac-nml/gasnomenclature/main/nextflow_schema.json

❔ Tests ignored:

files_exist - File is ignored: assets/nf-core-gasnomenclature_logo_light.png
files_exist - File is ignored: docs/images/nf-core-gasnomenclature_logo_dark.png
files_exist - File is ignored: docs/images/nf-core-gasnomenclature_logo_light.png
files_exist - File is ignored: .github/workflows/awstest.yml
files_exist - File is ignored: .github/workflows/awsfulltest.yml
nextflow_config - Config variable ignored: manifest.name
nextflow_config - Config variable ignored: manifest.homePage
nextflow_config - Config variable ignored: params.max_cpus
files_unchanged - File ignored due to lint config: LICENSE or LICENSE.md or LICENCE or LICENCE.md
files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md
files_unchanged - File ignored due to lint config: .github/ISSUE_TEMPLATE/bug_report.yml
files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
files_unchanged - File ignored due to lint config: .github/workflows/branch.yml
files_unchanged - File ignored due to lint config: assets/email_template.html
files_unchanged - File ignored due to lint config: assets/email_template.txt
files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
files_unchanged - File does not exist: assets/nf-core-gasnomenclature_logo_light.png
files_unchanged - File does not exist: docs/images/nf-core-gasnomenclature_logo_light.png
files_unchanged - File does not exist: docs/images/nf-core-gasnomenclature_logo_dark.png
files_unchanged - File ignored due to lint config: docs/README.md
actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/gasnomenclature/gasnomenclature/.github/workflows/awstest.yml
actions_awsfulltest - actions_awsfulltest
pipeline_name_conventions - pipeline_name_conventions

✅ Tests passed:

files_exist - File found: .gitattributes
files_exist - File found: .gitignore
files_exist - File found: .nf-core.yml
files_exist - File found: .editorconfig
files_exist - File found: .prettierignore
files_exist - File found: .prettierrc.yml
files_exist - File found: CHANGELOG.md
files_exist - File found: CITATIONS.md
files_exist - File found: CODE_OF_CONDUCT.md
files_exist - File found: LICENSE or LICENSE.md or LICENCE or LICENCE.md
files_exist - File found: nextflow_schema.json
files_exist - File found: nextflow.config
files_exist - File found: README.md
files_exist - File found: .github/.dockstore.yml
files_exist - File found: .github/CONTRIBUTING.md
files_exist - File found: .github/ISSUE_TEMPLATE/bug_report.yml
files_exist - File found: .github/ISSUE_TEMPLATE/config.yml
files_exist - File found: .github/ISSUE_TEMPLATE/feature_request.yml
files_exist - File found: .github/PULL_REQUEST_TEMPLATE.md
files_exist - File found: .github/workflows/branch.yml
files_exist - File found: .github/workflows/ci.yml
files_exist - File found: .github/workflows/linting_comment.yml
files_exist - File found: .github/workflows/linting.yml
files_exist - File found: assets/email_template.html
files_exist - File found: assets/email_template.txt
files_exist - File found: assets/sendmail_template.txt
files_exist - File found: conf/modules.config
files_exist - File found: conf/test.config
files_exist - File found: conf/test_full.config
files_exist - File found: docs/output.md
files_exist - File found: docs/README.md
files_exist - File found: docs/README.md
files_exist - File found: docs/usage.md
files_exist - File found: main.nf
files_exist - File found: assets/multiqc_config.yml
files_exist - File found: conf/base.config
files_exist - File found: conf/igenomes.config
files_exist - File found: modules.json
files_exist - File not found check: .github/ISSUE_TEMPLATE/bug_report.md
files_exist - File not found check: .github/ISSUE_TEMPLATE/feature_request.md
files_exist - File not found check: .github/workflows/push_dockerhub.yml
files_exist - File not found check: .markdownlint.yml
files_exist - File not found check: .nf-core.yaml
files_exist - File not found check: .yamllint.yml
files_exist - File not found check: bin/markdown_to_html.r
files_exist - File not found check: conf/aws.config
files_exist - File not found check: docs/images/nf-core-gasnomenclature_logo.png
files_exist - File not found check: lib/Checks.groovy
files_exist - File not found check: lib/Completion.groovy
files_exist - File not found check: lib/NfcoreTemplate.groovy
files_exist - File not found check: lib/Utils.groovy
files_exist - File not found check: lib/Workflow.groovy
files_exist - File not found check: lib/WorkflowMain.groovy
files_exist - File not found check: lib/WorkflowGasnomenclature.groovy
files_exist - File not found check: parameters.settings.json
files_exist - File not found check: pipeline_template.yml
files_exist - File not found check: Singularity
files_exist - File not found check: lib/nfcore_external_java_deps.jar
files_exist - File not found check: .travis.yml
nextflow_config - Found nf-validation plugin
nextflow_config - Config variable found: manifest.nextflowVersion
nextflow_config - Config variable found: manifest.description
nextflow_config - Config variable found: manifest.version
nextflow_config - Config variable found: timeline.enabled
nextflow_config - Config variable found: trace.enabled
nextflow_config - Config variable found: report.enabled
nextflow_config - Config variable found: dag.enabled
nextflow_config - Config variable found: process.cpus
nextflow_config - Config variable found: process.memory
nextflow_config - Config variable found: process.time
nextflow_config - Config variable found: params.outdir
nextflow_config - Config variable found: params.input
nextflow_config - Config variable found: manifest.mainScript
nextflow_config - Config variable found: timeline.file
nextflow_config - Config variable found: trace.file
nextflow_config - Config variable found: report.file
nextflow_config - Config variable found: dag.file
nextflow_config - Config variable (correctly) not found: params.nf_required_version
nextflow_config - Config variable (correctly) not found: params.container
nextflow_config - Config variable (correctly) not found: params.singleEnd
nextflow_config - Config variable (correctly) not found: params.igenomesIgnore
nextflow_config - Config variable (correctly) not found: params.name
nextflow_config - Config variable (correctly) not found: params.enable_conda
nextflow_config - Config timeline.enabled had correct value: true
nextflow_config - Config report.enabled had correct value: true
nextflow_config - Config trace.enabled had correct value: true
nextflow_config - Config dag.enabled had correct value: true
nextflow_config - Config dag.file ended with .html
nextflow_config - Config variable manifest.nextflowVersion started with >= or !>=
nextflow_config - nextflow.config contains configuration profile test
nextflow_config - Config default value correct: params.gm_thresholds= 10,5,0
nextflow_config - Config default value correct: params.gm_method= average
nextflow_config - Config default value correct: params.gm_delimiter= .
nextflow_config - Config default value correct: params.pd_distm= hamming
nextflow_config - Config default value correct: params.pd_missing_threshold= 1.0
nextflow_config - Config default value correct: params.pd_sample_quality_threshold= 1.0
nextflow_config - Config default value correct: params.pd_file_type= text
nextflow_config - Config default value correct: params.batch_size= 100
nextflow_config - Config default value correct: params.max_cpus= 4
nextflow_config - Config default value correct: params.max_memory= 2.GB
nextflow_config - Config default value correct: params.max_time= 1.h
nextflow_config - Config default value correct: params.publish_dir_mode= copy
nextflow_config - Config default value correct: params.validate_params= true
files_unchanged - .gitattributes matches the template
files_unchanged - .prettierrc.yml matches the template
files_unchanged - .github/.dockstore.yml matches the template
files_unchanged - .github/ISSUE_TEMPLATE/feature_request.yml matches the template
files_unchanged - .github/workflows/linting_comment.yml matches the template
files_unchanged - .github/workflows/linting.yml matches the template
files_unchanged - .gitignore matches the template
files_unchanged - .prettierignore matches the template
actions_ci - '.github/workflows/ci.yml' is triggered on expected events
actions_ci - '.github/workflows/ci.yml' checks minimum NF version
readme - README Zenodo placeholder was replaced with DOI.
pipeline_todos - No TODO strings found
plugin_includes - No wrong validation plugin imports have been found
template_strings - Did not find any Jinja template strings (0 files)
schema_lint - Schema lint passed
schema_lint - Input mimetype lint passed: 'text/csv'
schema_params - Schema matched params returned from nextflow config
system_exit - No System.exit calls found
actions_schema_validation - Workflow validation passed: linting.yml
actions_schema_validation - Workflow validation passed: ci.yml
actions_schema_validation - Workflow validation passed: linting_comment.yml
actions_schema_validation - Workflow validation passed: branch.yml
merge_markers - No merge markers found in pipeline files
modules_json - Only installed modules found in modules.json
multiqc_config - assets/multiqc_config.yml found and not ignored.
multiqc_config - assets/multiqc_config.yml contains report_section_order
multiqc_config - assets/multiqc_config.yml contains export_plots
multiqc_config - assets/multiqc_config.yml contains report_comment
multiqc_config - assets/multiqc_config.yml follows the ordering scheme of the minimally required plugins.
multiqc_config - assets/multiqc_config.yml contains 'export_plots: true'.
modules_structure - modules directory structure is correct 'modules/nf-core/TOOL/SUBTOOL'
base_config - conf/base.config found and not ignored.
base_config - CUSTOM_DUMPSOFTWAREVERSIONS found in conf/base.config and Nextflow scripts.
modules_config - conf/modules.config found and not ignored.
modules_config - INPUT_ASSURE found in conf/modules.config and Nextflow scripts.
modules_config - LOCIDEX_MERGE_REF found in conf/modules.config and Nextflow scripts.
modules_config - LOCIDEX_MERGE_QUERY found in conf/modules.config and Nextflow scripts.
modules_config - PROFILE_DISTS found in conf/modules.config and Nextflow scripts.
modules_config - GAS_CALL found in conf/modules.config and Nextflow scripts.
modules_config - CUSTOM_DUMPSOFTWAREVERSIONS found in conf/modules.config and Nextflow scripts.
nfcore_yml - Repository type in .nf-core.yml is valid: pipeline
nfcore_yml - nf-core version in .nf-core.yml is set to the latest version: 3.0.1

Run details

nf-core/tools version 3.0.1
Run at 2025-02-20 18:46:20

sgsutcliffe · 2025-02-13T19:20:11Z

Notice: The first implementation will break some of the tests, which I will fix shortly. The outputs look different now that LOCIDEX_MERGE runs as multiple processes, each with its own output file.

apetkau

This looks amazing @sgsutcliffe. Thanks so much 😄. I've added a few initial comments.

apetkau · 2025-02-13T21:28:41Z

modules/local/locidex/concat/main.nf

@@ -0,0 +1,36 @@
+process LOCIDEX_CONCAT {
+    tag 'Concat LOCIDEX files'
+    label 'process_medium'


Could you switch to process_single, unless you need the increased resources? I'm not sure how much csvtk concat uses for large files.

sgsutcliffe · 2025-02-17

csvtk has a --num-cpus option with a default of 4, which led me to believe it can be sped up with multiple processors. I should add --num-cpus $task.cpus regardless. 8360c0d
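
As a rough illustration of tying the csvtk CPU count to the resources allocated to the task, the command line could be assembled as in the Python sketch below. The helper name, the file names, and the use of `--out-file` for the destination are assumptions made for this example; only `--num-cpus` is taken from the comment above, and the actual module script in 8360c0d may look different.

```python
from typing import List, Sequence


def csvtk_concat_argv(inputs: Sequence[str], out_file: str, cpus: int) -> List[str]:
    """Build an argv list for `csvtk concat` using an explicit CPU count.

    Hypothetical helper: it only constructs the command, it does not run it.
    """
    if cpus < 1:
        raise ValueError("cpus must be a positive integer")
    if not inputs:
        raise ValueError("at least one input file is required")
    return [
        "csvtk", "concat",
        "--num-cpus", str(cpus),   # mirrors `--num-cpus $task.cpus`
        "--out-file", out_file,    # assumed destination flag for this sketch
        *inputs,
    ]


# Hypothetical file names and CPU count.
argv = csvtk_concat_argv(["batch_1.csv", "batch_2.csv"], "profile.csv", cpus=2)
print(" ".join(argv))
```

Passing the count explicitly keeps the tool from assuming more cores than the scheduler granted, whichever process label ends up being used.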

apetkau · 2025-02-13T21:31:10Z

nextflow.config

@@ -59,6 +59,9 @@ params {
    gm_method = "average"
    gm_delimiter = "."

+    // Locidex
+    batch_size = 2


I would default it to something much larger (maybe 1000?). Though we can adjust this based on results of our benchmarking.

For test cases we can have two tests with smaller batch sizes (to test the mv condition if one batch and csvtk if multiple).
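
To pick batch sizes that exercise both branches, it helps to work out how many batches a given sample count produces. The Python sketch below does that arithmetic; the function names and the sample counts are hypothetical, and the branch labels simply echo the mv and csvtk conditions mentioned above.

```python
import math


def n_batches(n_samples: int, batch_size: int) -> int:
    """Number of batches when the final incomplete batch is kept."""
    if n_samples < 1 or batch_size < 1:
        raise ValueError("n_samples and batch_size must be positive integers")
    return math.ceil(n_samples / batch_size)


def merge_branch(n_samples: int, batch_size: int) -> str:
    """Return which branch the concat step would take for this input size."""
    return "mv" if n_batches(n_samples, batch_size) == 1 else "csvtk concat"


# Hypothetical test matrix: 3 samples with two different batch sizes.
for size in (5, 1):
    print(size, n_batches(3, size), merge_branch(3, size))
```

With three samples, a batch size of 5 yields one batch (the mv branch), while a batch size of 1 yields three batches (the csvtk branch).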

sgsutcliffe · 2025-02-17T19:22:05Z

By bifurcating the input into subchannels, the output order became inconsistent, which led to flaky tests. Commit 0d84be3 should make the output order consistent and reduce test flakiness.
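
One common way to get a stable order from parallel tasks that finish in arbitrary order is to sort their outputs by name before combining them. The Python sketch below shows that general idea only; it is not necessarily what 0d84be3 does, and the file names are hypothetical.

```python
import re
from typing import List, Union


def natural_key(name: str) -> List[Union[int, str]]:
    """Sort key that compares digit runs numerically (2 before 10).

    re.split with a capturing group alternates text and digit tokens,
    so the same list positions always hold the same type.
    """
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


# Hypothetical arrival order from tasks finishing at different times.
arrived = ["profile_10.csv", "profile_2.csv", "profile_1.csv"]
ordered = sorted(arrived, key=natural_key)
print(ordered)
```

Sorting on a key derived from the file name makes the concatenated result independent of task completion order, which is the property the tests rely on.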

sgsutcliffe · 2025-02-18T15:13:43Z

Running a benchmark we have used on previous releases of gasnomenclature to confirm it is a true enhancement.

sgsutcliffe · 2025-02-20T18:58:11Z

The branch needed to be updated: it didn't have the patch release, which was causing the benchmark to fail.

Initial implementation of parallel locidex

2ad9646

apetkau reviewed Feb 13, 2025

sgsutcliffe added 2 commits February 17, 2025 11:16

added multiple core argument to csvtk command

8360c0d

Makes output order consistent

0d84be3

sgsutcliffe added 7 commits February 17, 2025 14:24

prettier formatted

d195734

Modified tests to point to proper files in tests work directory

494a509

Proccess test for LOCIDEX_CONCAT

801f0b4

remove symlink files

37d376b

Remove symlink files from test

83ae1e5

Adjusted batch_size default to more realistic 100 default

b24e291

Fixed batch_size default

2d4f21b

sgsutcliffe added 2 commits February 19, 2025 09:22

Add retry strategy to reduce fail issue on large sample size

1ccb75c

Merge branch 'dev' into parallel-locidex-input

cc66f83
