Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance LOCIDEX_MERGE #42

Open
wants to merge 12 commits into
base: dev
Choose a base branch
from
Open

Enhance LOCIDEX_MERGE #42

wants to merge 12 commits into from

Conversation

sgsutcliffe
Copy link
Contributor

@sgsutcliffe sgsutcliffe commented Feb 13, 2025

STRY0016853: Enhance LOCIDEX_MERGE

We want to switch LOCIDEX_MERGE steps to use multiple processes in gasnomenclature, so that the process is more efficient and can handle larger volumes of data.

Criteria

  • The enhancement should batch up and pass a subset of samples to LOCIDEX_MERGE, allowing multiple instances to run in parallel.
  • LOCIDEX_MERGE should output CSV files of subsets of samples.
  • A post-processing step should be added in gasnomenclature to merge the sample subset CSV files into a single profiles file.

Proposed Solution

Before running LOCIDEX_MERGE, the list of MLST JSON files, that was orginally passed to LOCIDEX_MERGE, is subdivided into batches (size specified by the parameter --batch_size):

    // Divide up inputs into groups for LOCIDEX
    grouped_ref_files = reference_values.flatten() //
        .buffer( size: params.batch_size, remainder: true )
    grouped_query_files = query_values.flatten() //
        .buffer( size: params.batch_size, remainder: true )

Then downstream of LOCIDEX_MERGE we run csvtk concat on the output (if more than one output file is generated). This performed by the module LOCIDEX_CONCAT

PR checklist

  • Add tests (and fix broken tests)
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated.

Copy link

github-actions bot commented Feb 13, 2025

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit cc66f83

+| ✅ 145 tests passed       |+
#| ❔  23 tests were ignored |#
!| ❗   4 tests had warnings |!

❗ Test warnings:

❔ Tests ignored:

  • files_exist - File is ignored: assets/nf-core-gasnomenclature_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-gasnomenclature_logo_dark.png
  • files_exist - File is ignored: docs/images/nf-core-gasnomenclature_logo_light.png
  • files_exist - File is ignored: .github/workflows/awstest.yml
  • files_exist - File is ignored: .github/workflows/awsfulltest.yml
  • nextflow_config - Config variable ignored: manifest.name
  • nextflow_config - Config variable ignored: manifest.homePage
  • nextflow_config - Config variable ignored: params.max_cpus
  • files_unchanged - File ignored due to lint config: LICENSE or LICENSE.md or LICENCE or LICENCE.md
  • files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md
  • files_unchanged - File ignored due to lint config: .github/ISSUE_TEMPLATE/bug_report.yml
  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
  • files_unchanged - File ignored due to lint config: .github/workflows/branch.yml
  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • files_unchanged - File ignored due to lint config: assets/email_template.txt
  • files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
  • files_unchanged - File does not exist: assets/nf-core-gasnomenclature_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-gasnomenclature_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-gasnomenclature_logo_dark.png
  • files_unchanged - File ignored due to lint config: docs/README.md
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/gasnomenclature/gasnomenclature/.github/workflows/awstest.yml
  • actions_awsfulltest - actions_awsfulltest
  • pipeline_name_conventions - pipeline_name_conventions

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.1
  • Run at 2025-02-20 18:46:20

@sgsutcliffe
Copy link
Contributor Author

Notice: First implementation will break some of the tests. I will fix them shortly, basically the outputs look different now that LOCIDEX_MERGE is creating multiple different processes, each with it's own output file.

Copy link
Member

@apetkau apetkau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks amazing @sgsutcliffe. Thanks so much 😄 . I've added a few initial commments.

@@ -0,0 +1,36 @@
process LOCIDEX_CONCAT {
tag 'Concat LOCIDEX files'
label 'process_medium'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you switch to process_single? Unless you need the increased resources, not sure how much csvtk concat uses for large files.

Copy link
Contributor Author

@sgsutcliffe sgsutcliffe Feb 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So csvtk has a --num-cpus with a default being 4. This led me to believe it can be speed up with multiple processers. I should add --num-cpus $task.cpus regardless. 8360c0d

nextflow.config Outdated
@@ -59,6 +59,9 @@ params {
gm_method = "average"
gm_delimiter = "."

// Locidex
batch_size = 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would default it to something much larger (maybe 1000?). Though we can adjust this based on results of our benchmarking.

For test cases we can have two tests with smaller batch sizes (to test the mv condition if one batch and csvtk if multiple).

@sgsutcliffe
Copy link
Contributor Author

Some by bifurcating the input into subchannels -- the output became inconsistent which led to flakey tests. This change 0d84be3 should make output consistent and reduce the flakey results of tests.

@sgsutcliffe
Copy link
Contributor Author

Running a benchmark we have used on previous releases of gasnomenclature to confirm it is a true enhancement.

@sgsutcliffe
Copy link
Contributor Author

The branch needed to be update -- didn't have the patch release on the branch. Was causing benchmark to fail

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants