This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory (by default) after the pipeline has finished. All paths are relative to the top-level results directory.
The pipeline is built using Nextflow and processes data using the following steps:
- Preprocessing
- Kraken2 - Taxonomic read classification
- Bracken - Species abundance estimation from kraken2 output
- Custom Abundance Check - Filter samples with < X% Legionella pneumophila reads
- Trimmomatic - Trim and crop Illumina reads
- FastQC - Trimmed read QC plots
- Sequence Typing
- el_gato Reads - Sequence type (ST) input sample reads
- el_gato Assembly - Sequence type input sample assemblies when reads fail to generate an ST
- el_gato Report - Create PDF summary el_gato report
- Pysamstats - Calculate positional depth, mapq, and baseq for each ST allele
- Allele Reports - Create per-sample ST allele report pdf
- Assembly
- cgMLST and Clustering
- chewBBACA - cgMLST results
- Final Quality Control
- QUAST Scoring Script - Simple assembly score of QUAST output based on established criteria
- Final QC Checks - Summary of pipeline QC metrics
Additionally, pipeline information, including report metrics generated during workflow execution, is also provided.
Initial processing steps and statistic gathering
Output files
kraken_bracken/
*-kreport.tsv
: Kraken2 taxonomic report
*-classified.tsv
: Kraken2 standard output
Kraken2 classifies input sequences based on a taxonomic k-mer database where the input sequences are mapped to the lowest common ancestor of all genomes known to contain the given k-mer.
In the pipeline, Kraken2 and Bracken are used together to determine whether a sample contains enough L. pneumophila data to continue through the pipeline.
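The lowest-common-ancestor idea can be illustrated with a toy sketch. The taxonomy below, the `lineage` and `lca` helpers, and the use of taxon names as keys are all assumptions made for illustration; this is not Kraken2's internal implementation or API.

```python
# Toy taxonomy as child -> parent links (hypothetical and heavily simplified).
PARENT = {
    "Legionella pneumophila": "Legionella",
    "Legionella longbeachae": "Legionella",
    "Legionella": "Legionellaceae",
    "Legionellaceae": "root",
}


def lineage(taxon):
    """Return the path from a taxon up to the root, starting with the taxon."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lca(taxa):
    """Lowest common ancestor of one or more taxa in the toy taxonomy."""
    taxa = list(taxa)
    common = set(lineage(taxa[0]))
    for taxon in taxa[1:]:
        common &= set(lineage(taxon))
    # Walking up one lineage, the first shared node is the lowest shared node.
    return next(node for node in lineage(taxa[0]) if node in common)
```

In this toy model, a k-mer seen only in L. pneumophila genomes maps to the species itself, while a k-mer shared with another Legionella species maps to the genus.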
Output files
kraken_bracken/
*-abundances.tsv
: Bracken abundance report
*-braken-breakdown.tsv
: Bracken taxonomic report that matches kraken2 report
Bracken reestimates species abundance from kraken2 output.
In the pipeline, Kraken2 and Bracken are used together to determine whether a sample contains enough L. pneumophila data to continue through the pipeline.
A simple Python program that takes the Bracken abundance report and determines whether a sample is above the threshold required to keep it in the pipeline (default 10.0%).
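A minimal sketch of such a check is shown here. It is not the pipeline's own script: it assumes the report uses the standard Bracken output columns (`name`, `fraction_total_reads`, and so on), the function names are hypothetical, and the example data is made up.

```python
import csv
import io

# Made-up example in the standard Bracken output layout (tab-separated).
EXAMPLE_REPORT = (
    "name\ttaxonomy_id\ttaxonomy_lvl\tkraken_assigned_reads\t"
    "added_reads\tnew_est_reads\tfraction_total_reads\n"
    "Legionella pneumophila\t446\tS\t900\t50\t950\t0.95000\n"
    "Escherichia coli\t562\tS\t40\t10\t50\t0.05000\n"
)


def lpn_abundance_percent(handle, species="Legionella pneumophila"):
    """Return the species' estimated abundance as a percentage (0.0 if absent)."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if row["name"].strip() == species:
            # Bracken reports a fraction between 0 and 1.
            return float(row["fraction_total_reads"]) * 100.0
    return 0.0


def passes_abundance_check(handle, threshold_percent=10.0):
    """True if the sample should stay in the pipeline.

    Treating a value exactly equal to the threshold as passing is an
    assumption of this sketch.
    """
    return lpn_abundance_percent(handle) >= threshold_percent


if __name__ == "__main__":
    print(passes_abundance_check(io.StringIO(EXAMPLE_REPORT)))
```

With a real report, an open file handle for the `*-abundances.tsv` file would take the place of the `io.StringIO` wrapper.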
Output files
trimmomatic/
*_paired_R1.fastq.gz
: Paired trimmed read 1 to be used in the following pipeline steps
*_paired_R2.fastq.gz
: Paired trimmed read 2 to be used in the following pipeline steps
*_unpaired_R1.fastq.gz
: Unpaired trimmed reads 1 to assist in SPAdes assembly
*_unpaired_R2.fastq.gz
: Unpaired trimmed reads 2 to assist in SPAdes assembly
*.summary.txt
: Trimmomatic output summary
Trimmomatic removes Illumina adapters and trims reads according to quality.
Output files
fastqc/
*_fastqc.html
: FastQC per read quality summary report
FastQC gives general quality metrics and plots for the input reads.
In silico sequence typing and allele reporting using el_gato
Output files
el_gato/reads/
*_possible_mlsts.txt
: All possible allele calls and sequence types seen in reads
*_reads.json
: Machine-readable summary for building el_gato pdf report
*_reads_vs_all_ref_filt_sorted.bam
: Pileup of reads for each ST allele used for building allele report
*_run.log
: Program logging info
*_ST.tsv
: Called Sequence Type
Sequence-based Typing (SBT) of Legionella pneumophila sequences using reads based on the identification and comparison of 7 loci (flaA, pilE, asd, mip, mompS, proA, neuA/neuAh) against an allele database.
Output files
el_gato/assemblies/
*_assembly.json
: Machine-readable summary for building el_gato pdf report
*_run.log
: Program logging info
*_ST.tsv
: Called Sequence Type
Sequence-based Typing (SBT) of Legionella pneumophila sequences using output assemblies based on the identification and comparison of 7 loci (flaA, pilE, asd, mip, mompS, proA, neuA/neuAh) against an allele database. The assemblies are only run when there is an inconclusive ST call as this was found to sometimes recover the ST.
Note: if the ST results are inconclusive after both approaches have been tried, users are encouraged to review the possible_mlsts.txt intermediate output for that sample in the pipeline results folder under el_gato/reads/
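The reads-first, assembly-as-fallback logic described above can be sketched as follows. The function names are hypothetical, and treating only a purely numeric ST as conclusive is an assumption of this sketch rather than el_gato's documented behaviour.

```python
def is_conclusive(st):
    """Treat a purely numeric ST as conclusive (an assumption of this sketch)."""
    return st.strip().isdigit()


def choose_st(reads_st, type_assembly):
    """Return (st, approach).

    `type_assembly` is a zero-argument callable that returns the ST called
    from the assembly. It is only invoked when the reads ST is inconclusive.
    """
    if is_conclusive(reads_st):
        return reads_st, "reads"
    assembly_st = type_assembly()
    if is_conclusive(assembly_st):
        return assembly_st, "assembly"
    # Both approaches failed: keep the reads call for manual review.
    return reads_st, "inconclusive"
```

Passing the assembly step as a callable keeps the more expensive typing run lazy, which mirrors the idea that assemblies are only typed when needed.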
Output files
el_gato/
el_gato_report.pdf
: Final el_gato summary report including reads and assembly approaches
Tabular summaries of locus information for all samples run through el_gato
Output files
el_gato/allele_stats/
*.allele_stats.tsv
: Per-sample summary of depth, map quality, and base quality
Pysamstats combined output containing summary of depth, map quality, and base quality for each allele
el_gato/plots/
*_allele_plots.pdf
: Per-sample plots of allele depth, map quality, and base quality
Custom report plotting of the seven ST alleles looking at depth, map quality, and base quality for each sample.
De novo assembly and quality assessment
spades/
*.contigs.fa
: SPAdes assembly contigs
*.scaffolds.fa
: SPAdes scaffold assembly
*.spades.log
: SPAdes logging information
SPAdes is a de novo, de Bruijn graph-based assembly toolkit containing various assembly pipelines. This pipeline runs SPAdes with the --careful flag and uses the resulting contigs in subsequent analysis steps.
quast/
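report.html
: QUAST HTML report covering all samples
transposed_report.tsv
: Transposed QUAST report, parsed downstream to produce the final quality score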
QUAST is used to generate a single report for evaluating the quality of the assembled sequences across all of the samples provided to the pipeline. Input genomes are compared to a Legionella pneumophila reference genome, and the transposed report is parsed downstream to report a final quality score.
Core genome multilocus sequence typing (cgMLST) using chewBBACA and the Ridom SeqSphere 1521-loci cgMLST schema, with results that can be used for follow-up clustering.
chewbbaca/allele_calls/
results_alleles.tsv
: Provides allele calling results including all allele classifications
results_statistics.tsv
: Per-sample summary of classification type counts
cgMLST/cgMLST.html
: Interactive line plot that displays the number of loci in the cgMLST per threshold value (95, 99, 100)
cgMLST/cgMLST###.tsv
: Allele calling results with all non-integer classifications masked, for use in downstream visualization
ChewBBACA cgMLST according to the published Ridom SeqSphere 1521-loci cgMLST schema for L. pneumophila.
The cgMLST allele calling results can be used downstream for clustering and visualization along with the STs.
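To illustrate what masking non-integer classifications and counting loci per threshold can look like, here is a small sketch. It is not chewBBACA code: the helper names are hypothetical, the mask value `"0"` is an assumption, and this version masks every call that is not purely an integer, which may differ from how the pipeline handles specific classifications.

```python
def mask_call(call):
    """Keep integer allele IDs; replace any other classification with "0"."""
    call = call.strip()
    return call if call.isdigit() else "0"


def mask_row(calls):
    """Mask one sample's allele calls."""
    return [mask_call(call) for call in calls]


def loci_at_threshold(matrix, loci, threshold):
    """Return loci with an integer call in at least `threshold` of samples.

    `matrix` is a list of per-sample call lists, ordered like `loci`.
    `threshold` is a fraction, for example 0.95 for 95%.
    """
    n_samples = len(matrix)
    kept = []
    for index, locus in enumerate(loci):
        called = sum(1 for row in matrix if mask_call(row[index]) != "0")
        if called / n_samples >= threshold:
            kept.append(locus)
    return kept
```

For example, `mask_row(["1", "LNF", "12", "NIPH"])` keeps the two integer allele IDs and masks the `LNF` and `NIPH` classifications.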
Final summary scoring and metrics
scored_quast_report.csv
: Scored quast report based on determined thresholds
Scored QUAST report based on adapted thresholds from Gorzynski et al. to determine if the sample has any metrics that significantly deviate from the expected results
overall.qc.csv
: Final collated overall summary report
The final collated summary report that is created using the outputs from the other pipeline steps and checks some final quality criteria.
The qc_status column will be any of the following statuses:
- Pass: The sample passes all checks!
- Warn: The sample was flagged for a specific warning
- Fail: The sample has failed out of the pipeline
The qc_message column contains the reason for the qc_status and includes:
Message | Associated Status | Flag Reason |
---|---|---|
low_lpn_abundance | WARN | Low L. pneumophila abundance (< 75%) is not expected with isolate sequencing and may signify a problem sample |
low_read_count | WARN | Low read count (< 150,000 reads default) has been shown to lead to poor, uninformative assemblies |
low_n50 | WARN | Low N50 (< 80,000) scores have been shown to very negatively affect clustering outputs |
low_exact_allele_calls | WARN | Low chewBBACA exact allele calls (< 90% called) show that there may be issues in the assembly |
low_qc_score | WARN | Low QUAST-Analyzer QC score (< 4) shows that there may be issues in the assembly |
no_lpn_detected | FAIL | Very low L. pneumophila abundance (< 10% default) flags that the sample may not be L. pneumophila, and the sample is removed from the pipeline |
failing_read_count | FAIL | Read count below failing threshold (< 60,000 reads default) has been shown to lead to poor, uninformative assemblies and sample is kicked out |
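
The thresholds in the table can be read as a simple rule set, sketched below. The metric key names, the `;` separator for multiple messages, and the uppercase status strings are assumptions for illustration; only the message names and default thresholds come from the table, and the pipeline's own implementation may differ.

```python
# (message, metric key, minimum acceptable value); metric keys are hypothetical.
FAIL_CHECKS = [
    ("no_lpn_detected", "lpn_abundance", 10.0),
    ("failing_read_count", "read_count", 60_000),
]
WARN_CHECKS = [
    ("low_lpn_abundance", "lpn_abundance", 75.0),
    ("low_read_count", "read_count", 150_000),
    ("low_n50", "n50", 80_000),
    ("low_exact_allele_calls", "exact_allele_percent", 90.0),
    ("low_qc_score", "qc_score", 4),
]


def triggered(checks, metrics):
    """Messages whose metric is present and below its minimum."""
    return [
        message
        for message, key, minimum in checks
        if key in metrics and metrics[key] < minimum
    ]


def qc_status(metrics):
    """Return (qc_status, qc_message) for a dict of sample metrics."""
    failed = triggered(FAIL_CHECKS, metrics)
    if failed:
        return "FAIL", ";".join(failed)
    warned = triggered(WARN_CHECKS, metrics)
    if warned:
        return "WARN", ";".join(warned)
    return "PASS", ""
```

Failing checks are evaluated first, and missing metrics are skipped, because a sample removed early in the pipeline would not have later metrics such as N50.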