Skip to content
Curro Campuzano edited this page Nov 8, 2023 · 6 revisions

Overview of the pipeline

We describe the workflow step by. For more information about the specific version we use, please refer to the workflow/envs directory, which contains one YAML file per program that we use to create a conda environment per step. The pipeline should work on any POSIX-compliant system, but we have only tested it on Ubuntu 18.04.6 LTS. If you want more detailed information about every rule, the flags we use, and how to extend the pipeline, please refer to the Developer documentation.

Trimming reads and removing adapters

We use Trim-Galore to trim reads and remove adapters, which combines Cutadapt and FastQC.

You will find the following files:

  • The trimmed reads: results/trimmed/*_R{1, 2}.fq.gz
  • The individual FastQC reports: results/trimmed/*_R{1, 2}__trimming_report.txt
  • The multiqc report (you probably want to check this file, as it agglomerates the individual ones): qc/trim_galore_multiqc.html with all data in qc/trim_galore_multiqc_data/.

Filtering rRNA reads

We use SortMeRNA to filter out rRNA reads with the SILVA SSURef_NR99 database.

You will find the following files:

  • The rRNA reads: results/rrna/*_{fwd, rev}.fq.gz
  • The paired SortMeRNA log files: results/rrna/*.log.
  • The multiqc report: qc/sortmerna_SSU_multiqc.html with all data in qc/sortmerna_SSU_multiqc.html_data/. You should check this file, as it agglomerates the individual ones.
  • The unaligned reads (we'll use them later): results/sortmerna/not_SSU/*_{fwd, rev}.fq.gz

Reconstructing ribosomal genes

We use MetaRib, a specific tool for reconstructing full-length ribosomal genes from meta-transcriptomics data. MetaRib is based on EMIRGE.

You will find the following files:

  • The reconstructed ribosomal genes after processing: results/all.dedup.filtered.fasta.
  • All the intermediate files and other outputs we don't need: results/MetaRib/MetaRib.

Evaluating the reconstruction

We use Quast to evaluate the reconstruction.

You will find the following files:

  • The Quast report in different formats: results/quast/report.txt, results/quast/report.pdf, results/quast/report.html
  • The multiqc report: qc/MetaRib_quast_multiqc.html with all data in qc/MetaRib_quast_multiqc.html_data/. I don't love this report, as I prefer results/quast/report.pdf, but I included it for consistency.

Estimating the abundance of the reconstructed ribosomal genes

We use BWA and Samtools to map the reads to the reconstructed ribosomal genes as a proxy for estimating their abundance. Check this script for more details in the post-processing of the mapping.

You will find the following files:

  • The BAM file with the mapping: results/rrna/bwa/*_{fwd, rev}_sorted.bam
  • The stats of the mapping: results/rrna/bwa/*_{fwd, rev}_sorted.bam.idxstats
  • The abundance table: results/rrna/mapped_reads_to_contigs.tsv

Taxonomic annotation of the reconstructed ribosomal genes

We annotate the reconstructed ribosomal genes using crest4. This tool uses the Lowest Common Ancestor (LCA) algorithm to assign taxonomy after blasting the contigs to a custom database. The database comprises a manually curated SILVA NR SSU Ref v.138 version for Bacteria, Archaea, Metazoa, and Fungi and the PR2 v4.13 for protists.

You will find the following files:

  • The blast output: results/crest4_results/search.hits.
  • The annotated abundance table: results/crest4_results/mapped_reads_to_contigs.tsv.
  • An physeq object you can open in Rstudio to start analyzing you data (check readRds documentation): results/crest4_results/physeq.Rds. You will probably need to add you own metadata (check the merge_phyloseq documentation).

Filtering mRNA reads

We use SortMeRNA to filter out mRNA reads with the SILVA LSURef_NR99 database. After filtering out the rRNA reads, we use the unaligned reads to filter out the mRNA reads.

You will find the following files:

  • The LSU reads: results/sortmerna/LSU/*_{fwd, rev}.fq.gz
  • The paired SortMeRNA log files: results/sortmerna/LSU/*.log
  • The not SSU and not LSU reads (the ones we are interested in): results/sortmerna/not_LSU/*_{fwd, rev}.fq.gz
  • The multiqc report: qc/sortmerna_LSU_multiqc.html with all data in qc/sortmerna_LSU_multiqc.html_data/. You should check this file, as it agglomerates the individual ones.

Trinity assembly

We use Trinity to do a de novo assembly of the reads.

You will find the following files:

  • The Trinity assembly: results/mRNA/trinity.Trinity.fasta

Mapping reads to the Trinity contigs

We use BWA and Samtools to map the reads to the mRNA contigs.

You will find the following files:

  • The BAM file with the mapping: results/mRNA/bwa/*_{fwd, rev}_sorted.bam
  • The stats of the mapping: results/mRNA/bwa/*_{fwd, rev}_sorted.bam.idxstats
  • The abundance table: results/mRNA/mapped_reads_to_contigs.tsv.

Filtering non-coding RNA

We exclude non-coding RNA by using cmsearch with the Rfam database. We claim that the remaining contigs of the assembly are mRNA contigs.

You will find the following files:

  • The mRNA contigs from the Trinity assembly (after filtering non-coding RNA): results/mRNA/Trinity_contigs_ncrna_filtered.fasta
  • The research output: results/mRNA/non_coding_rna/RFAM_cmsearch.{tbl, out}.

Filtering low-abundance contigs

We filter out low-abundance contigs using the R library Vegan. Check the script here for more details.

You will find the following files:

  • The Trinity assembly (after filtering low-abundance contigs & non-coding RNA): results/mRNA/abundance_filtered/Trinity_contigs_ncrna_filtered.fasta
  • The abundance table (after filtering low-abundance contigs): results/mRNA/abundance_filtered/mapped_reads_to_contigs.tsv.

Functional and taxonomic annotation of the mRNA contigs

Finally, we use Diamond and the AnnoTree database to annotate the mRNA contigs.

  1. We assign taxonomic annotation to the mRNA contigs using the Lowest Common Ancestor (LCA) algorithm (of course, only when the reference protein has a taxonomic annotation).
  2. We assign functional annotation to the mRNA contigs by using the best hit of the reference protein.

You will find the following files:

  • The Diamond output in tabular format: results/mRNA/diamond/annotree.tsv.
  • The notebook with the post-processing of the Diamond output: notebooks/annotree.ipynb. Notice that although snakemake ran the notebook with default parameters, you should re-rerun it interactively and change the parameters to your needs.
  • The annotated file with Brite, KEGG, INTERPRO2GO and taxonomic annotation: results/mRNA/diamond/annotree_ann.tsv.