User-friendly pipeline for mutational analysis of HIV using ultra-accurate maximum-depth sequencing

ljmills/MDS_pipeline_HIV

MDS_pipeline_HIV

User-friendly Pipeline for Analysis of Ultra-Accurate Maximum-Depth Sequencing

This repository contains the pipeline described in "Development of a user-friendly pipeline for mutational analyses of HIV using ultra-accurate maximum-depth sequencing".

There are two versions of the alignment and mutation calling pipeline. The first is a BASH script designed to be used with a Linux computing cluster with job submission. The second is a Galaxy pipeline that can be uploaded to any Galaxy instance. All tools are available in the Galaxy Toolshed.

The included R scripts take the output of bcftools mpileup from either the Galaxy workflow or the BASH scripts, perform the hotspot identification, and generate the data for the mutational profile analysis shown in the publication.

Reference Genome FASTA

You will need a FASTA-formatted version of the reference genome that you are working with. This could be the sequence of a specific plasmid or a reference genome downloaded from a repository such as Ensembl or the UCSC Genome Browser. For this paper we also hard-masked (replaced some bases with Ns) highly similar regions so that reads would not cross-map to regions we were not targeting. We used bedtools maskfasta with a BED file containing the regions we wanted masked, i.e. the 3' UTR region.
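The masking step described above could be sketched as follows. This is a minimal example, not the exact command used for the paper: the file names, the sequence name `HIV1_vector`, and the coordinates are all hypothetical placeholders. By default `bedtools maskfasta` hard-masks, replacing the bases in the BED intervals with N.

```shell
# Hypothetical file names; replace with your own reference and output paths.
ref="reference.fasta"
masked="reference.masked.fasta"
bed="mask_regions.bed"

# BED is tab-separated: sequence name, 0-based start, end (exclusive).
# The sequence name and coordinates here are placeholders, not the paper's.
printf 'HIV1_vector\t9000\t9600\n' > "$bed"

# Hard-mask the BED intervals in the reference.
mask_cmd=(bedtools maskfasta -fi "$ref" -bed "$bed" -fo "$masked")

if command -v bedtools >/dev/null 2>&1 && [ -f "$ref" ]; then
    "${mask_cmd[@]}"
else
    # Dry run when bedtools or the reference is not available.
    echo "Would run: ${mask_cmd[*]}"
fi
```

The sequence name in the first BED column must match the FASTA header of the reference, otherwise nothing is masked.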

Galaxy Workflow

Galaxy-Workflow-MDS_fgbio_workflow_single_sample.ga

This workflow takes a pair of FASTQ files (R1 and R2 from paired-end sequencing) and a reference genome FASTA file, and runs the entire pipeline using the same tools as the BASH scripts.

BASH scripts (least user-friendly)

Software Dependencies

  • Java 8 (aka Java 1.8) or later
  • trimmomatic version 0.33 or later
  • fgbio
  • seqtk
  • samtools 1.9 or later
  • bwa
  • picard tools
  • bcftools 1.9 or later
  • bedtools 2.29 or later
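
Before editing the scripts, it can help to confirm which of these tools are already on your PATH. The function below is a small sketch for that purpose and is not part of the repository. The executable names are assumptions: trimmomatic, fgbio and picard are often installed as jar files rather than as commands, so they are left out of the example call. It does not check version numbers.

```shell
# Report which of the given executables are missing from PATH.
# Returns 0 if all were found, 1 otherwise.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Executable names are assumptions; adjust them to your installation.
check_tools java seqtk samtools bwa bcftools bedtools \
    || echo "Install the missing tools or fix PATH before running the pipeline."
```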

Script overview

All scripts will need to be edited for your compute environment. Edits could include paths to software, paths to specific files, and how sample names are extracted from the FASTQ file names. The scripts also include SLURM submission information (sbatch), which you will need to change depending on whether you are running the SLURM scheduler.

indexNewGenome.sh

Helper script to create bwa, picard tools and samtools indices from FASTA reference genomes.
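
For orientation, the three indices are typically built with commands like the ones printed below. This is a sketch of the standard invocations for each tool, not a copy of indexNewGenome.sh; the reference file name and the location of `picard.jar` are hypothetical.

```shell
# Hypothetical reference name; the dictionary shares its base name.
ref="reference.masked.fasta"
dict="${ref%.*}.dict"

# One indexing command per tool. picard.jar is an assumed path.
index_cmds=(
    "bwa index $ref"
    "samtools faidx $ref"
    "java -jar picard.jar CreateSequenceDictionary R=$ref O=$dict"
)

# Print the commands for review; pipe the output to bash to run them.
printf '%s\n' "${index_cmds[@]}"
```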

by_Sample_run_MDS.sh

This script runs all of the major steps of the MDS analysis pipeline. After the trimming step it also downsamples each sample to 1.5 million sequences (seqtk sample); this step is included to speed up processing and can be excluded or changed as needed. The script generates two files: the first contains the commands needed to run the pipeline on each included sample, and the second wraps each sample in an sbatch submission line that can be submitted directly to SLURM.
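
The two-file pattern can be illustrated with the downsampling step alone. The function below is a simplified sketch, not the repository script: the `_R1.fastq.gz` / `_R2.fastq.gz` naming, the output names, the seed value, and the function name are assumptions, and the other pipeline steps are omitted. The same seed is used for both mates so that read pairs stay in sync.

```shell
# Write one downsampling command per sample to cmd_file and a matching
# sbatch line per sample to sbatch_file. Nothing is executed here.
write_downsample_cmds() {
    local fastq_dir="$1" cmd_file="$2" sbatch_file="$3"
    local n_reads=1500000 seed=100   # seed value is an arbitrary assumption
    local r1 r2 sample cmd

    : > "$cmd_file"
    : > "$sbatch_file"

    for r1 in "$fastq_dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue     # skip when the glob matches nothing
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        # Assumed convention: sample name is the file name before _R1.
        sample="$(basename "$r1" _R1.fastq.gz)"

        cmd="seqtk sample -s${seed} ${r1} ${n_reads} | gzip > ${sample}.sub_R1.fastq.gz"
        cmd="${cmd} && seqtk sample -s${seed} ${r2} ${n_reads} | gzip > ${sample}.sub_R2.fastq.gz"

        echo "$cmd" >> "$cmd_file"
        echo "sbatch --job-name=${sample} --wrap=\"${cmd}\"" >> "$sbatch_file"
    done
}
```

Calling `write_downsample_cmds fastq commands.txt sbatch_commands.txt` and then running `bash sbatch_commands.txt` would submit one job per sample. Resource options such as time, memory and partition are site-specific and would need to be added to the sbatch line.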

Hotspot Identification and Mutational Spectrum Analysis

HIV_MDS_SequenceAnalysis.R

This script takes the VCF files generated by either the Galaxy workflow or the BASH scripts and generates the mutational spectrum data. It requires a sample sheet (Excel file) that matches sample names to other sample information, such as the background genome (i.e. HIV-1) and the sample type (plasmid vs gDNA). The script assumes that all VCF files are in a single folder and end with .vcf. Further details can be found in the script.
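
A quick way to check the folder before running the R script is to list the VCF files it would pick up and compare the names against the sample sheet by eye. The helper below is a hypothetical convenience, not part of the repository, and it assumes that a sample's name is its file name without the .vcf extension, which may not match how the R script derives names.

```shell
# Print one sample name per .vcf file in the given folder.
# Assumption: sample name = file name minus the .vcf extension.
list_vcf_samples() {
    local vcf_dir="$1" vcf
    for vcf in "$vcf_dir"/*.vcf; do
        [ -e "$vcf" ] || continue    # skip when the glob matches nothing
        basename "$vcf" .vcf
    done
}
```

For example, `list_vcf_samples vcf_folder` prints the names to compare with the sample sheet. Compressed files ending in .vcf.gz are not listed, in line with the requirement that files end with .vcf.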
