Similar to the case study on whole genome sequencing data, in this study we describe applying DeepVariant to a real exome sample using a single machine.
NOTE: This case study demonstrates how to run DeepVariant end-to-end on one machine. We report runtimes on a specific machine type so that the reported numbers are consistent. This is NOT the fastest or cheapest configuration. For more scalable execution of DeepVariant, see the External Solutions section.
The DeepVariant pipeline consists of 3 steps: `make_examples`, `call_variants`, and `postprocess_variants`. You can now run DeepVariant with one command using the `run_deepvariant` script.
Here is an example command:
sudo docker run \
-v "${DATA_DIR}":"/input" \
-v "${OUTPUT_DIR}:/output" \
gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type=WES \
--ref="/input/${REF}" \
--reads="/input/${BAM}" \
--regions="/input/${CAPTURE_BED}" \
--output_vcf=/output/HG002.output.vcf.gz \
--output_gvcf=/output/HG002.output.g.vcf.gz \
--num_shards=${N_SHARDS}
By specifying `--model_type=WES`, you'll be using a model that is best suited
for Illumina Whole Exome Sequencing data.
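The example command above relies on several shell variables that must be set before it runs. The following is a minimal sketch of how they might be defined. The BAM and capture BED file names and the shard count are taken from this case study; the release tag, the directory paths, and the reference file name are assumptions for illustration and should be replaced with your own values.

```shell
# Illustrative values only; adjust to your own setup.
BIN_VERSION="0.8.0"          # assumption: a release tag matching the r0.8 script branch
DATA_DIR="${PWD}/input"      # hypothetical host directory holding the input files
OUTPUT_DIR="${PWD}/output"   # hypothetical host directory for the results
REF="reference.fa.gz"        # hypothetical file name for the reference FASTA
BAM="151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bam"
CAPTURE_BED="agilent_sureselect_human_all_exon_v5_b37_targets.bed"
N_SHARDS=64                  # matches the 64 CPUs reported in this case study

# Report any input that is not where the docker command will look for it.
for f in "${REF}" "${BAM}" "${CAPTURE_BED}"; do
  [ -e "${DATA_DIR}/${f}" ] || echo "missing input: ${DATA_DIR}/${f}" >&2
done
```

Setting `N_SHARDS` to the number of available CPU cores is a common choice, since `make_examples` is the step that is split across shards.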
The script run_wes_case_study_docker.sh shows a full example of which data to download and how to run DeepVariant on that data.
Before you run the script, you can read through all sections to understand the details. Here is a quick way to get the script and run it:
curl https://raw.githubusercontent.com/google/deepvariant/r0.8/scripts/run_wes_case_study_docker.sh | bash
See this page for the commands used to obtain different machine types on Google Cloud.
Step | Hardware | Wall time |
---|---|---|
`make_examples` | 64 CPUs | ~ 8m |
`call_variants` | 64 CPUs | ~ 2m |
`postprocess_variants` (with gVCF) | 1 CPU | ~ 1m |
In this example, `call_variants` does not take much time on 64 CPUs, so running
with a GPU might be unnecessary. The case study on whole genome sequencing data
discusses the use of GPUs. If you want to use a GPU on the exome data in this
case study, run_wes_case_study_docker_gpu.sh shows a full example.
151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bam
Same as described in the case study for whole genome data
HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_*
are from NIST, as part of the
Genome in a Bottle project. They are
downloaded from
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh37/
According to the paper "Extensive sequencing of seven human genomes to
characterize benchmark reference
materials", the HG002 exome was
generated with Agilent SureSelect. In this case study we'll use the SureSelect
v5 BED (`agilent_sureselect_human_all_exon_v5_b37_targets.bed`) and intersect it
with the GIAB confident regions for evaluation.
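To make the intersection step concrete, here is a small sketch in plain awk of what intersecting two BED files means. It is not the method used in this case study: the function name `intersect_bed` is hypothetical, the approach compares every pair of intervals on a chromosome and is therefore only suitable for small files, and it assumes tab-separated BED input without header lines. For real data a dedicated tool such as bedtools would normally be used.

```shell
# Intersect two BED files (0-based, half-open intervals) with plain awk.
# Usage: intersect_bed first.bed second.bed
# Hypothetical helper for illustration; quadratic per chromosome.
intersect_bed() {
  awk 'BEGIN { FS = OFS = "\t" }
    # First file: remember every interval, grouped by chromosome.
    NR == FNR { n[$1]++; s[$1, n[$1]] = $2 + 0; e[$1, n[$1]] = $3 + 0; next }
    # Second file: print the overlap with each stored interval, if any.
    {
      for (i = 1; i <= n[$1]; i++) {
        lo = ($2 + 0 > s[$1, i]) ? $2 + 0 : s[$1, i]   # larger of the two starts
        hi = ($3 + 0 < e[$1, i]) ? $3 + 0 : e[$1, i]   # smaller of the two ends
        if (lo < hi) print $1, lo, hi
      }
    }' "$1" "$2"
}
```

Called with the capture BED and the confident-regions BED, the function prints only the stretches covered by both files, which is the region the evaluation is restricted to.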
We used the hap.py
(https://github.com/Illumina/hap.py)
program from Illumina to evaluate the resulting VCF file. This serves as a check
to ensure the three DeepVariant commands ran correctly and produced high-quality
results.
We evaluate against the capture region:
Type | # FN | # FP | Recall | Precision | F1_Score |
---|---|---|---|---|---|
INDEL | 117 | 55 | 0.955000 | 0.978507 | 0.966611 |
SNP | 37 | 23 | 0.998903 | 0.999318 | 0.999110 |
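The F1_Score column is the harmonic mean of the Precision and Recall columns, F1 = 2 * Precision * Recall / (Precision + Recall). As a quick sanity check, the sketch below recomputes it from the rounded values in the table; the helper name `f1_score` is hypothetical and is not part of hap.py or DeepVariant.

```shell
# Recompute F1 from precision and recall.
# Usage: f1_score PRECISION RECALL
f1_score() {
  awk -v p="$1" -v r="$2" 'BEGIN { printf "%.6f\n", 2 * p * r / (p + r) }'
}

f1_score 0.978507 0.955000   # INDEL row
f1_score 0.999318 0.998903   # SNP row
```

Because the inputs are already rounded to six decimals, small differences in the last digit are possible in general, although both rows here reproduce the reported values.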