Skip to content

Add command-line options for custom reference genomes #32

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

trianglegrrl
Copy link

@trianglegrrl trianglegrrl commented Mar 28, 2025

Add command-line options for custom reference genomes

Changes

This PR adds the ability to specify custom reference genome files directly via command-line options, making Yleaf more flexible and easier to use in a pipeline, with pre-specified reference genomes.

Added features:

  • New command-line options:
    • -fg/--full_genome_reference: Allows users to specify a custom full genome reference file
    • -yr/--y_chromosome_reference: Allows users to specify a custom Y chromosome reference file
  • New utility script extract_y_chromosome.py: For extracting just the Y chromosome from a full genome reference
  • Updated README with documentation and examples for the new features

Motivation

Previously, users were required to modify the config.txt file to use custom reference genomes. This approach had several limitations:

  • It wasn't easily scriptable (required file edits)
  • It didn't allow for different references in the same pipeline
  • It made it harder to use Yleaf in automated workflows

With these changes, users can specify reference files directly on the command line, making it easier to:

  • Use custom reference files for specific analyses
  • Integrate Yleaf into automated pipelines
  • Run Yleaf with different references without modifying configuration files

Testing

The changes have been tested with:

  • Standard pre-existing reference download flow (backward compatibility)
  • Custom full genome and Y chromosome references
  • Various file formats and real-world datasets

Example of successful execution with no custom references and a VCF output by nf-core/eager:

python3 -m yleaf.Yleaf -vcf /20tb/2025-01-project-drive/2024-12-19-yhaplo/ancient0003.chrY.vcf.gz -rg hg38 -o test_output -t 16 -dh

INFO 10:32:02.076430 (0.001 s) - Running Yleaf with command: /home/a/Yleaf/yleaf/Yleaf.py -vcf /20tb/2025-01-project-drive/2024-12-19-yhaplo/ancient0003.chrY.vcf.gz -rg hg38 -o test_output -t 16 -dh
INFO 10:32:02.076593 (0.001 s) - No reference genome version was found. Downloading the hg38 reference genome. This should be a one time thing.
INFO 10:32:02.076702 (0.001 s) - Starting with preparing hg38...
INFO 10:47:19.166804 (917.091 s) - Finished downloading the reference genome.
/home/a/Yleaf/yleaf/Yleaf.py:188: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df_fmf = pd.concat([df_belowzero, df_readsthreshold, df_basemajority, df_discordantgenotype], axis=0, sort=True)
INFO 10:47:24.294138 (922.218 s) - Finished extracting genotypes for ancient0003.chrY.filtered.vcf.gz
INFO 10:47:24.304648 (922.229 s) - Starting haplogroup prediction...
INFO 10:47:24.403965 (922.328 s) - Starting with drawing haplogroups
INFO 10:47:24.485426 (922.410 s) - Finished drawing haplogroups
INFO 10:47:24.485505 (922.410 s) - Done!
╰─ cat test_output/hg_prediction.hg 
Sample_name     Hg      Hg_marker       Total_reads     Valid_markers   QC-score        QC-1    QC-2    QC-3
ancient0003.chrY.filtered       Q-Z770  CTS11007        VCF     128     1.0     1.0     1.0     1.0

Example of successful execution with custom references and a VCF output by nf-core/eager:

python3 -m yleaf.Yleaf -vcf /20tb/2025-01-project-drive/2024-12-19-yhaplo/ancient0003.chrY.vcf.gz -fg /references/reference_genomes/hg38.analysisSet.fa -yr /references/reference_genomes/hg38.chrY.analysisSet.fa -rg hg38 -o test_output -t 16

INFO 10:16:56.748596 (0.001 s) - Running Yleaf with command: /home/a/Yleaf/yleaf/Yleaf.py -vcf /20tb/2025-01-project-drive/2024-12-19-yhaplo/ancient0003.chrY.vcf.gz -fg /references/reference_genomes/hg38.analysisSet.fa -yr /references/reference_genomes/hg38.chrY.analysisSet.fa -rg hg38 -o test_output -t 16
INFO 10:16:56.748752 (0.001 s) - Using custom full genome reference: /references/reference_genomes/hg38.analysisSet.fa
INFO 10:16:56.748841 (0.001 s) - Using custom Y chromosome reference: /references/reference_genomes/hg38.chrY.analysisSet.fa
/home/a/Yleaf/yleaf/Yleaf.py:188: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df_fmf = pd.concat([df_belowzero, df_readsthreshold, df_basemajority, df_discordantgenotype], axis=0, sort=True)
INFO 10:17:01.835815 (5.088 s) - Finished extracting genotypes for ancient0003.chrY.filtered.vcf.gz
INFO 10:17:01.844207 (5.096 s) - Starting haplogroup prediction...
INFO 10:17:01.931244 (5.183 s) - Done!
> cat test_output/hg_prediction.hg
Sample_name     Hg      Hg_marker       Total_reads     Valid_markers   QC-score        QC-1    QC-2    QC-3
ancient0003.chrY.filtered       Q-Z770  CTS11007        VCF     128     1.0     1.0     1.0     1.0

Add command-line options for custom reference genomes

- Add two new command-line options for Yleaf.py:
  - `-fg/--full_genome_reference`: Specify a custom full genome
reference file
  - `-yr/--y_chromosome_reference`: Specify a custom Y chromosome
reference file
- Create new utility script `extract_y_chromosome.py` to extract Y
chromosome from a full genome
- Update documentation with examples of using the new options
- Maintain backward compatibility with existing default references

These changes allow users to run Yleaf with their own reference genomes
without
modifying the config.txt file, making it more flexible and
user-friendly.
Increment minor version number to reflect the addition of new features:
- Command-line options for custom reference genomes
- Y chromosome extraction utility

Following semantic versioning principles for feature additions.
@trianglegrrl
Copy link
Author

@dionzand @dmontielg @bramvanwersch I have an improvement I'd like to offer to make Yleaf easier to integrate into pipelines: just the ability to specify the reference genomes on the command line.

I hope the pull request is clear, but if not, please ask questions!

@dionzand
Copy link
Contributor

Thanks @trianglegrrl for your PR! Please give me some time to review. Hopefully next week

@trianglegrrl
Copy link
Author

Thank you @dionzand ! I'm building a Nextflow module for Yleaf and we're going to integrate it into Eager, so I'll be excited to get your comments! Happy to make any changes you request.

@trianglegrrl
Copy link
Author

trianglegrrl commented Apr 4, 2025

Hi again @dionzand ! I'm wondering if you've had the opportunity to review this?

We're hoping to integrate this into our pipelines and make it available to the larger nextflow/nf-core community as an nf-core module. The draft pull request for the module is here: nf-core/modules#8210)

For testing, I've pinned my nf-core module to my local 3.3.0 version of Yleaf (from this branch). It's working great.

To release the module, though, we'll need the Yleaf 3.3.0 version on bioconda. This PR does bump the version to 3.3.0, but you'll have to tag the release yourself, I believe!

I can take care of the bioconda/biocontainer stuff for you and just link it here. I am also happy to commit to keeping bioconda up to date for you. At this point, now that Yleaf 3.2.1 is in bioconda, it's easy to create update PRs in bioconda-recipes for version upgrades.

@trianglegrrl
Copy link
Author

@dionzand Hate to bug you again with this, but any chance you'll have some time to review it? I'm sure you have a thousand other things to do, too!

y_chrom_found = False

for line in fi:
if line.startswith(">chrY") or line.startswith(">Y"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the new T2T reference starts with >CP086569.2. Please check if there are other nomenclatures for Y chromosomal record IDs.

Copy link
Author

@trianglegrrl trianglegrrl Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dionzand ... I checked a couple of T2T references (hg002v1.1.mat_Y_EBV_MT.fasta.gz, and chm13v2.0_maskedY_rCRS.fa.gz) and they use chrY for the Y. Is there a different version I should be checking? I'm not super familiar with T2T so I might be missing something, but this is how I got what I got:

╰─ zgrep "^>" hg002v1.1.mat_Y_EBV_MT.fasta.gz | cut -d ' ' -f 1 | sed 's/>//'
[... other chromosome labels]
chrY_PATERNAL


╰─ zgrep "^>" chm13v2.0_maskedY_rCRS.fa.gz | cut -d ' ' -f 1 | sed 's/>//'
[... other chromosome labels]
chrY

@dionzand
Copy link
Contributor

I'm sorry, there were indeed some other things that required priority. I requested two small changes in your PR. Otherwise I am happy to merge the changes, but will first have to check with @ArwinRalf tomorrow.

@trianglegrrl
Copy link
Author

I'm sorry, there were indeed some other things that required priority. I requested two small changes in your PR. Otherwise I am happy to merge the changes, but will first have to check with @ArwinRalf tomorrow.

Of course, no worries at all, and thank you! I will get those changes in and tag you again for review.

@trianglegrrl trianglegrrl requested a review from dionzand April 24, 2025 12:58
@TCLamnidis
Copy link

Hi @dionzand, @ArwinRalf

@trianglegrrl is working hard on adding MT and Y haplotyping functionality for the upcoming 3.0.0 release of nf-core/eager (Github).
The addition of Yleaf on bioconda (#24 thanks!) and the added flexibility from this PR would help us greatly to include Yleaf in the pipeline.

Users of this functionality in nf-core/eager would then need cite your original paper.

Did you have a chance to discuss the changes? We're happy to make changes if necessary. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants