Overview

jhensl19 edited this page Jul 18, 2024 · 2 revisions

MetaCerberus Workflow

General Info

  • MetaCerberus has three basic modes:
    1. Quality Control (QC) for raw reads
    2. Formatting/gene prediction
    3. Annotation
  • MetaCerberus can use three different input files:
    1. Raw read data from any sequencing platform (Illumina, PacBio, or Oxford Nanopore)
    2. Assembled contigs, as MAGs, vMAGs, isolate genomes, or a collection of contigs
    3. Amino acid fasta (.faa), previously called pORFs
  • We offer customization of the database selection: all databases can be run together, individually, or as a user-selected subset. For example, a user can run only the prokaryotic- or eukaryotic-specific KOfams, or a single database such as dbCAN alone; both are easily configured within MetaCerberus.
  • In QC mode, raw reads are checked with FastQC both before and after trimming. Trimming depends on the data type: fastp is called for Illumina or PacBio data; otherwise the data are assumed to be Oxford Nanopore and PoreChop is used.
  • For Illumina reads, an optional bbmap step is available to remove the phiX174 genome or a user-provided contaminant genome. Phage phiX174 is a common contaminant on the Illumina platform because it is used as the library spike-in control. We highly recommend this removal when viral analysis is conducted, as phiX174 reads would otherwise produce false positives for ssDNA microviruses within a sample.
  • We include a --skip_decon option to skip the phiX174 filtration, since that filtration may remove common k-mers that are shared with ssDNA phages.
  • In the formatting and gene prediction stage, contigs and genomes are checked for N repeats. These N repeats are removed by default.
  • We compute contig/genome statistics (e.g., N50, N90, max contig) via our custom module Metaome Stats.
  • Contigs can be converted to pORFs using Prodigal, FragGeneScanRs, or Prodigal-gv, as specified by user preference.
  • Scaffold annotation is not recommended, because runs of N's produce ambiguous annotations.
  • Both Prodigal and FragGeneScanRs can be used via our --super option, and we recommend using FragGeneScanRs for samples rich in eukaryotes.
  • FragGeneScanRs found more ORFs and KOs than Prodigal for a simulated eukaryote-rich metagenome.
  • HMMER searches against the above databases use either user-specified bitscore and e-value cutoffs or our minimum defaults (i.e., bitscore = 25, e-value = 1 x 10^-9).
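
The rules above for choosing a trimmer, summarizing contig lengths, and applying the default HMMER cutoffs can be sketched in a few lines of Python. This is only an illustration, not MetaCerberus's own code: the function names (choose_trimmer, nx, passes_thresholds) are hypothetical, the N50/N90 calculation uses the standard textbook definition rather than the Metaome Stats implementation, and treating the cutoffs as inclusive is an assumption.

```python
def choose_trimmer(platform: str) -> str:
    """Pick a trimmer by data type, following the QC rule described above.

    Illumina and PacBio data go to fastp; anything else is assumed to be
    Oxford Nanopore and goes to PoreChop.
    """
    return "fastp" if platform.lower() in {"illumina", "pacbio"} else "porechop"


def nx(lengths, fraction):
    """Return the Nx statistic (N50 for fraction=0.5, N90 for fraction=0.9).

    Standard definition: walk the contigs from longest to shortest and return
    the length of the contig at which the running total first reaches
    `fraction` of all assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return 0  # empty input


def passes_thresholds(bitscore, evalue, min_bitscore=25.0, max_evalue=1e-9):
    """Keep a hit only if it meets both cutoffs.

    Defaults mirror the minimum defaults quoted above (bitscore = 25,
    e-value = 1 x 10^-9). Inclusive comparison is an assumption.
    """
    return bitscore >= min_bitscore and evalue <= max_evalue


if __name__ == "__main__":
    contigs = [100, 80, 60, 40, 20]  # made-up contig lengths for illustration
    print(choose_trimmer("Illumina"))
    print(nx(contigs, 0.5), nx(contigs, 0.9), max(contigs))
    print(passes_thresholds(bitscore=30.0, evalue=1e-12))
```

Keeping each rule in its own small function makes it easy to swap in user-specified cutoffs in place of the defaults.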

Input File Formats

  • Data can come from any NextGen sequencing technology (Illumina, PacBio, or Oxford Nanopore)
  • Type 1: raw reads (.fastq format)
  • Type 2: nucleotide fasta (.fasta, .fa, .fna, .ffn format), i.e., raw reads assembled into contigs
  • Type 3: protein fasta (.faa format), i.e., assembled contigs whose genes have been translated into amino acid sequences
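
As a rough illustration of the three input types listed above, the following sketch maps a file extension to its type. The helper name classify_input is hypothetical, and MetaCerberus's own detection logic may differ (for example, in how it treats compressed files, which this sketch does not handle).

```python
from pathlib import Path

# Extension-to-type table taken from the list above.
INPUT_TYPES = {
    ".fastq": "raw reads",
    ".fasta": "nucleotide fasta",
    ".fa": "nucleotide fasta",
    ".fna": "nucleotide fasta",
    ".ffn": "nucleotide fasta",
    ".faa": "protein fasta",
}


def classify_input(path: str) -> str:
    """Return the input type implied by the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    try:
        return INPUT_TYPES[suffix]
    except KeyError:
        raise ValueError(f"Unrecognized input extension: {suffix!r}") from None


if __name__ == "__main__":
    for name in ("sample.fastq", "contigs.fna", "proteins.faa"):
        print(name, "->", classify_input(name))
```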

Output Files

  • If an output directory is given, that folder will be created where all files are stored.
  • If no output directory is specified, the 'results_metacerberus' subfolder will be created in the current directory.
  • GAGE/Pathview analysis is provided as separate R scripts.
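
The output-folder rule above amounts to a simple fallback, sketched here. The helper name resolve_output_dir and its parameters are hypothetical; only the default folder name 'results_metacerberus' comes from the text.

```python
from pathlib import Path


def resolve_output_dir(outdir=None, base=None):
    """Return the folder where results would be stored.

    If `outdir` is given, use it; otherwise fall back to a
    'results_metacerberus' subfolder of `base` (the current directory
    when `base` is not supplied).
    """
    if outdir is not None:
        return Path(outdir)
    root = Path(base) if base is not None else Path.cwd()
    return root / "results_metacerberus"


if __name__ == "__main__":
    print(resolve_output_dir("my_results"))
    print(resolve_output_dir())
```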

Visualization of Outputs

  • We use Plotly to visualize the data.
  • Once the program has finished running, the HTML reports containing the visuals are saved in the folder for the last step of the pipeline.
  • The HTML files require plotly.js to be present. A copy is provided in the package and is saved to the report folder.