Add command-line options for custom reference genomes #32
Add command-line options for custom reference genomes

- Add two new command-line options for Yleaf.py:
  - `-fg/--full_genome_reference`: Specify a custom full genome reference file
  - `-yr/--y_chromosome_reference`: Specify a custom Y chromosome reference file
- Create new utility script `extract_y_chromosome.py` to extract Y chromosome from a full genome
- Update documentation with examples of using the new options
- Maintain backward compatibility with existing default references

These changes allow users to run Yleaf with their own reference genomes without modifying the config.txt file, making it more flexible and user-friendly.
Increment minor version number to reflect the addition of new features:
- Command-line options for custom reference genomes
- Y chromosome extraction utility

Following semantic versioning principles for feature additions.
@dionzand @dmontielg @bramvanwersch I have an improvement I'd like to offer to make Yleaf easier to integrate into pipelines: just the ability to specify the reference genomes on the command line. I hope the pull request is clear, but if not, please ask questions!
Thanks @trianglegrrl for your PR! Please give me some time to review. Hopefully next week.
Thank you @dionzand ! I'm building a Nextflow module for Yleaf and we're going to integrate it into Eager, so I'll be excited to get your comments! Happy to make any changes you request.
Hi again @dionzand ! I'm wondering if you've had the opportunity to review this? We're hoping to integrate this into our pipelines and make it available to the larger nextflow/nf-core community as an nf-core module. The draft pull request for the module is here: nf-core/modules#8210. For testing, I've pinned my nf-core module to my local 3.3.0 version of Yleaf (from this branch). It's working great. To release the module, though, we'll need the Yleaf 3.3.0 version on bioconda. This PR does bump the version to 3.3.0, but you'll have to tag the release yourself, I believe! I can take care of the bioconda/biocontainer stuff for you and just link it here. I am also happy to commit to keeping bioconda up to date for you. At this point, now that Yleaf 3.2.1 is in bioconda, it's easy to create update PRs in bioconda-recipes for version upgrades.
@dionzand Hate to bug you again with this, but any chance you'll have some time to review it? I'm sure you have a thousand other things to do, too!
    y_chrom_found = False

    for line in fi:
        if line.startswith(">chrY") or line.startswith(">Y"):
I believe the new T2T reference starts with >CP086569.2. Please check if there are other nomenclatures for Y chromosomal record IDs.
@dionzand ... I checked a couple of T2T references (hg002v1.1.mat_Y_EBV_MT.fasta.gz and chm13v2.0_maskedY_rCRS.fa.gz) and they use chrY for the Y. Is there a different version I should be checking? I'm not super familiar with T2T so I might be missing something, but this is how I got what I got:
╰─ zgrep "^>" hg002v1.1.mat_Y_EBV_MT.fasta.gz | cut -d ' ' -f 1 | sed 's/>//'
[... other chromosome labels]
chrY_PATERNAL
╰─ zgrep "^>" chm13v2.0_maskedY_rCRS.fa.gz | cut -d ' ' -f 1 | sed 's/>//'
[... other chromosome labels]
chrY
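One way to make the header check less dependent on a single naming convention is to compare each record ID against a set of accepted Y names. The following is a minimal sketch of that idea, not the code in this PR: `Y_RECORD_IDS`, `record_id`, and `extract_y` are hypothetical names, and the set only contains the labels mentioned in this thread (including `CP086569.2`, which comes from the review comment above and has not been verified here).

```python
# Hypothetical set of accepted Y record IDs, taken from the names in this thread.
Y_RECORD_IDS = {"chrY", "Y", "chrY_PATERNAL", "CP086569.2"}


def record_id(header_line):
    """Return the FASTA record ID: the text after '>' up to the first whitespace."""
    parts = header_line[1:].split()
    return parts[0] if parts else ""


def extract_y(lines, y_ids=Y_RECORD_IDS):
    """Yield the header and sequence lines of the first record whose ID is in y_ids."""
    in_y = False
    for line in lines:
        if line.startswith(">"):
            if in_y:
                break  # the next record has started, so the Y record is complete
            in_y = record_id(line) in y_ids
        if in_y:
            yield line
```

Unlike a `startswith(">chrY")` test, an exact match on the record ID would not pick up unrelated records whose names merely begin with `chrY`, but it does require every accepted label to be listed explicitly.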
I'm sorry, there were indeed some other things that required priority. I requested two small changes in your PR. Otherwise I am happy to merge the changes, but will first have to check with @ArwinRalf tomorrow.
Of course, no worries at all, and thank you! I will get those changes in and tag you again for review.
Hi @dionzand, @ArwinRalf. @trianglegrrl is working hard on adding MT and Y haplotyping functionality for the upcoming [...]. Users of this functionality in nf-core/eager would then need to cite your original paper. Did you have a chance to discuss the changes? We're happy to make changes if necessary. :)
Add command-line options for custom reference genomes
Changes
This PR adds the ability to specify custom reference genome files directly via command-line options, making Yleaf more flexible and easier to use in pipelines with pre-specified reference genomes.
Added features:
- `-fg/--full_genome_reference`: Allows users to specify a custom full genome reference file
- `-yr/--y_chromosome_reference`: Allows users to specify a custom Y chromosome reference file
- `extract_y_chromosome.py`: For extracting just the Y chromosome from a full genome reference
Motivation
Previously, users were required to modify the config.txt file to use custom reference genomes. This approach had several limitations:
With these changes, users can specify reference files directly on the command line, making it easier to:
Testing
The changes have been tested with:
Example of successful execution with no custom references and a VCF output by nf-core/eager:
python3 -m yleaf.Yleaf -vcf /20tb/2025-01-project-drive/2024-12-19-yhaplo/ancient0003.chrY.vcf.gz -rg hg38 -o test_output -t 16 -dh
Example of successful execution with custom references and a VCF output by nf-core/eager:
python3 -m yleaf.Yleaf -vcf /20tb/2025-01-project-drive/2024-12-19-yhaplo/ancient0003.chrY.vcf.gz -fg /references/reference_genomes/hg38.analysisSet.fa -yr /references/reference_genomes/hg38.chrY.analysisSet.fa -rg hg38 -o test_output -t 16
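The two invocations above differ only in whether `-fg` and `-yr` are passed. As a rough illustration of how optional flags like these can stay backward compatible, here is a minimal sketch using `argparse`. It is not Yleaf's actual implementation: `build_parser`, `resolve_reference`, and the two default paths are hypothetical stand-ins for whatever the existing configuration resolves to.

```python
import argparse

# Hypothetical default paths, standing in for the references Yleaf would use
# when no custom file is given on the command line.
DEFAULT_FULL_GENOME = "defaults/full_reference.fa"
DEFAULT_Y_CHROMOSOME = "defaults/chrY.fa"


def build_parser():
    """Build a parser exposing only the two reference options from this PR."""
    parser = argparse.ArgumentParser(description="Sketch of the custom reference options")
    parser.add_argument("-fg", "--full_genome_reference", default=None,
                        help="Path to a custom full genome reference file")
    parser.add_argument("-yr", "--y_chromosome_reference", default=None,
                        help="Path to a custom Y chromosome reference file")
    return parser


def resolve_reference(cli_value, default_path):
    """Prefer the command-line value; otherwise fall back to the default."""
    return cli_value if cli_value is not None else default_path


# Only -fg is supplied here, so the Y chromosome reference falls back to the default.
args = build_parser().parse_args(["-fg", "/references/reference_genomes/hg38.analysisSet.fa"])
full_genome = resolve_reference(args.full_genome_reference, DEFAULT_FULL_GENOME)
y_chromosome = resolve_reference(args.y_chromosome_reference, DEFAULT_Y_CHROMOSOME)
```

Defaulting both options to `None` keeps existing invocations unchanged, since a run without either flag resolves to the same references as before.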