Skip to content

Files

Latest commit

 

History

History
122 lines (82 loc) · 3.45 KB

README.md

File metadata and controls

122 lines (82 loc) · 3.45 KB

VCboost - Improving the Precision of Variant Calling for Long-Read Sequencing by Filtering

Email: holyterror@163.com


Introduction

VCboost effectively filters out a substantial number of false positive sites, leading to a significant improvement in accuracy and F1 score with minimal loss of true positive sites.


Contents



Pre-trained Models

In a bioconda installation, models are in {CONDA_PREFIX}/bin/models/.

Model name Platform Training samples Basecaller
r941_g422_1245 ONT r9.4.1 HG001,2,4,5 Guppy4.2.2
r941_g422_1235 ONT r9.4.1 HG001,2,3,5 Guppy4.2.2

Installation

Build an anaconda virtual environment

# Clone the Repository
git clone https://github.com/oranges7/VCboost.git
# Navigate to the Project Directory
cd ${repository}
# You can use Conda, pip, or other tools for installation.
conda create --name vcboost --file requirements.txt
# Activate the virtual environment
conda activate vcboost
# Display help information for the vcboost
sh vcboost.sh -h

Usage

General Usage

sh vcboost.sh \
  -o ${OUTPUT_PATH} \
  -b ${BAM_FILE} \ 
  -v ${ORIGINAL_VCF_FILE} \
  -m ${MODEL_PREFIX} \ 
  -r ${REFERENCE}

## VCboost final output file: ${OUTPUT_PATH}/vc_boost.vcf

Options

Required parameters:

  Options:
  -o, -out_path        Output path.
  -b, -bam_file        BAM file path.
  -v, -vcf             VCF file path.
  -m, -model prefix    Model path.
  -r, -ref_path        Reference file path.

Other parameters:

  -t, -threads     Number of threads.The default is 32.
  -c, -contig      Contig to process.The default is chr1-22.
  -p, -phase       Disable phase.
  -h, -help        Display this help message.

Train parameters:

  -train   Train mode.
  -w, -work_path   Working directory path.
  -a, -aim_vcf     Aim VCF file path.
  -b, -bam_file    BAM file path.
  -r, -ref         Reference file path.
  -q, -vcf         Benchmark VCF file path.
  -m, -mode        Mode of operation.You can choose snp or indel. The default is both.
  -j, -object      Object to process.
  -p, -phase       Enable phase.

Training Data

The training and testing datasets were derived from the public Genome in a Bottle (GIAB) and Human Pangenome Reference Consortium (HPRC) from individuals HG001, HG002, HG003, HG004, and HG005. The original nanopore sequencing data generated using R9.4.1 flow cells and basecalled with Guppy v4.2.2 were aligned to the reference GRCh38 to map sequencing reads. Currently, two models have been trained and tested: one based on training with HG001, HG002, HG003, HG005, and their downsampled counterparts; and the other based on training with HG001, HG002, HG004, HG005, and their downsampled counterparts.