wildcard expansion in vsearch bug fix #8307

Merged
merged 5 commits into from
Apr 23, 2025
2 changes: 1 addition & 1 deletion modules/nf-core/vsearch/cluster/main.nf
@@ -60,7 +60,7 @@ process VSEARCH_CLUSTER {

if [[ $args3 == "--clusters" ]]
then
- gzip -n ${prefix}.${out_ext}*
+ find . -name \"${prefix}.${out_ext}*[0-9]\" | xargs gzip -n
Contributor

I would be consistent throughout: if you've had this issue with --clusters, others might have the same issue with --samout. Should that one be updated as well?

Also, is it a given that vsearch always appends a single digit to the end of the file name?

It might also be good to tell find that we're only looking for regular files, with -type f.

Contributor Author

I added the -type f; good call on that.

The --samout branch doesn't use wildcard expansion the way the --clusters branch does, so I think it should be OK as is. vsearch outputs many single-entry FASTA files, one for each centroid, whereas I believe that samtools makes a single multi-entry FASTA, so no wildcard expansion is needed. Happy to discuss further if I've missed something.

vsearch does append a number to the end of the file name when the --clusters flag is set, starting at 0 and counting up, with one file for each cluster centroid:

ASV_post_clustering.clusters.fasta0
ASV_post_clustering.clusters.fasta10000
ASV_post_clustering.clusters.fasta10001
ASV_post_clustering.clusters.fasta10002
ASV_post_clustering.clusters.fasta10003
...

So anchoring the glob pattern with a trailing digit does match all of those. gzip then appends .gz, so the files end up looking like

ASV_post_clustering.clusters.fasta0.gz
ASV_post_clustering.clusters.fasta10000.gz
ASV_post_clustering.clusters.fasta10001.gz
ASV_post_clustering.clusters.fasta10002.gz
ASV_post_clustering.clusters.fasta10003.gz
...

which will not be matched again, since they no longer end with a digit.
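
For anyone who wants to check this behaviour outside the pipeline, here is a minimal sketch rather than the module's actual script. The `prefix` and `out_ext` values are hypothetical stand-ins for the Nextflow variables, the file contents are dummy sequences, and the `-print0` / `xargs -0` pairing is an addition for safety with unusual file names that is not part of the change under review.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-ins for the module's ${prefix} and ${out_ext} values.
prefix="ASV_post_clustering"
out_ext="clusters.fasta"

# Work in a throwaway directory so nothing else can match the pattern.
workdir="$(mktemp -d)"
cd "$workdir"

# Simulate the per-cluster files that vsearch --clusters writes (dummy content).
for n in 0 1 10000 10001; do
    printf '>seq%s\nACGT\n' "$n" > "${prefix}.${out_ext}${n}"
done

# First pass: compress every regular file whose name ends in a digit.
find . -type f -name "${prefix}.${out_ext}*[0-9]" -print0 | xargs -0 gzip -n

# Second pass: the same pattern should now match nothing, because every
# name ends in ".gz" instead of a digit.
remaining="$(find . -type f -name "${prefix}.${out_ext}*[0-9]" | wc -l | tr -d ' ')"
compressed="$(find . -type f -name "${prefix}.${out_ext}*[0-9].gz" | wc -l | tr -d ' ')"

echo "remaining=${remaining} compressed=${compressed}"
# prints: remaining=0 compressed=4
```

Because find hands the file names to xargs instead of the shell expanding a wildcard on the gzip command line, already-compressed files from an earlier pass are never passed to gzip a second time.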

Contributor

Ok great, then it's good to go I believe!

Contributor Author

thank you!

elif [[ $args3 != "--samout" ]]
then
gzip -n ${prefix}.${out_ext}