We describe the workflow step by step. For more information about the specific versions we use, please refer to the workflow/envs
directory, which contains one YAML file per program that we use to create a conda environment per step. The pipeline should work on any POSIX-compliant system, but we have only tested it on Ubuntu 18.04.6 LTS. If you want more detailed information about every rule, the flags we use, and how to extend the pipeline, please refer to the Developer documentation.
We use Trim-Galore to trim reads and remove adapters, which combines Cutadapt and FastQC.
You will find the following files:
- The trimmed reads: `results/trimmed/*_R{1,2}.fq.gz`
- The individual trimming reports: `results/trimmed/*_R{1,2}__trimming_report.txt`
- The MultiQC report (you probably want to check this file, as it aggregates the individual ones): `qc/trim_galore_multiqc.html`, with all data in `qc/trim_galore_multiqc_data/`
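A quick sanity check on the trimmed reads is to compare read counts before and after trimming, and to confirm that R1 and R2 stay paired. A minimal sketch (this helper is not part of the pipeline; it just counts FASTQ records):

```python
import gzip

def count_reads(fastq_gz_path):
    """Count records in a gzipped FASTQ file (one record = 4 lines)."""
    with gzip.open(fastq_gz_path, "rt") as fh:
        return sum(1 for _ in fh) // 4
```

For properly paired output, the counts for the R1 and R2 files in `results/trimmed/` should match.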
We use SortMeRNA to filter out rRNA reads with the SILVA SSURef_NR99 database.
You will find the following files:
- The rRNA reads: `results/rrna/*_{fwd,rev}.fq.gz`
- The paired SortMeRNA log files: `results/rrna/*.log`
- The MultiQC report: `qc/sortmerna_SSU_multiqc.html`, with all data in `qc/sortmerna_SSU_multiqc.html_data/`. You should check this file, as it aggregates the individual ones.
- The unaligned reads (we'll use them later): `results/sortmerna/not_SSU/*_{fwd,rev}.fq.gz`
We use MetaRib, a tool designed to reconstruct full-length ribosomal genes from metatranscriptomic data. MetaRib is based on EMIRGE.
You will find the following files:
- The reconstructed ribosomal genes after processing: `results/all.dedup.filtered.fasta`
- All the intermediate files and other outputs we don't need: `results/MetaRib/MetaRib`
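The `dedup` part of the output name suggests that identical reconstructed sequences are collapsed during post-processing. A minimal sketch of what such a deduplication does (hypothetical helpers, not MetaRib's actual code):

```python
def parse_fasta(lines):
    """Yield (header, sequence) tuples from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def dedup(records):
    """Keep only the first record for each distinct sequence."""
    seen = set()
    for header, seq in records:
        if seq not in seen:
            seen.add(seq)
            yield header, seq
```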
We use Quast to evaluate the reconstruction.
You will find the following files:
- The Quast report in different formats: `results/quast/report.txt`, `results/quast/report.pdf`, `results/quast/report.html`
- The MultiQC report: `qc/MetaRib_quast_multiqc.html`, with all data in `qc/MetaRib_quast_multiqc.html_data/`. I don't love this report, as I prefer `results/quast/report.pdf`, but I included it for consistency.
We use BWA and Samtools to map the reads to the reconstructed ribosomal genes as a proxy for estimating their abundance. Check this script for more details on the post-processing of the mapping.
You will find the following files:
- The BAM files with the mapping: `results/rrna/bwa/*_{fwd,rev}_sorted.bam`
- The stats of the mapping: `results/rrna/bwa/*_{fwd,rev}_sorted.bam.idxstats`
- The abundance table: `results/rrna/mapped_reads_to_contigs.tsv`
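The abundance table is built from the per-sample `samtools idxstats` files (each line is `reference<TAB>length<TAB>#mapped<TAB>#unmapped`, with a final `*` line for unplaced reads). A minimal sketch of that merge; the sample-name handling and output layout here are assumptions, not the pipeline's exact script:

```python
import csv
from pathlib import Path

def idxstats_to_table(idxstats_paths, out_tsv):
    """Merge per-sample `samtools idxstats` outputs into one contig-by-sample TSV."""
    samples, counts, contigs, seen = [], {}, [], set()
    for path in idxstats_paths:
        # Assumed naming convention: <sample>_sorted.bam.idxstats
        sample = Path(path).name.replace("_sorted.bam.idxstats", "")
        samples.append(sample)
        counts[sample] = {}
        with open(path) as fh:
            for line in fh:
                ref, _length, mapped, _unmapped = line.rstrip("\n").split("\t")
                if ref == "*":  # unplaced-reads summary line, skip
                    continue
                counts[sample][ref] = int(mapped)
                if ref not in seen:
                    seen.add(ref)
                    contigs.append(ref)
    with open(out_tsv, "w", newline="") as out:
        w = csv.writer(out, delimiter="\t", lineterminator="\n")
        w.writerow(["contig"] + samples)
        for c in contigs:
            w.writerow([c] + [counts[s].get(c, 0) for s in samples])
```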
We annotate the reconstructed ribosomal genes using crest4. This tool uses the Lowest Common Ancestor (LCA) algorithm to assign taxonomy after blasting the contigs against a custom database. The database comprises a manually curated version of SILVA NR SSU Ref v.138 for Bacteria, Archaea, Metazoa, and Fungi, and PR2 v4.13 for protists.
You will find the following files:
- The BLAST output: `results/crest4_results/search.hits`
- The annotated abundance table: `results/crest4_results/mapped_reads_to_contigs.tsv`
- A phyloseq object you can open in RStudio to start analyzing your data (check the readRDS documentation): `results/crest4_results/physeq.Rds`. You will probably need to add your own metadata (check the merge_phyloseq documentation).
We use SortMeRNA again, this time with the SILVA LSURef_NR99 database, to filter out LSU rRNA reads. We run it on the reads that did not align to the SSU database in the previous step.
You will find the following files:
- The LSU reads: `results/sortmerna/LSU/*_{fwd,rev}.fq.gz`
- The paired SortMeRNA log files: `results/sortmerna/LSU/*.log`
- The not-SSU and not-LSU reads (the ones we are interested in): `results/sortmerna/not_LSU/*_{fwd,rev}.fq.gz`
- The MultiQC report: `qc/sortmerna_LSU_multiqc.html`, with all data in `qc/sortmerna_LSU_multiqc.html_data/`. You should check this file, as it aggregates the individual ones.
We use Trinity to do a de novo assembly of the reads.
You will find the following files:
- The Trinity assembly: `results/mRNA/trinity.Trinity.fasta`
We use BWA and Samtools to map the reads to the mRNA contigs.
You will find the following files:
- The BAM files with the mapping: `results/mRNA/bwa/*_{fwd,rev}_sorted.bam`
- The stats of the mapping: `results/mRNA/bwa/*_{fwd,rev}_sorted.bam.idxstats`
- The abundance table: `results/mRNA/mapped_reads_to_contigs.tsv`
We exclude non-coding RNA by running cmsearch against the Rfam database. We treat the remaining contigs of the assembly as mRNA contigs.
You will find the following files:
- The mRNA contigs from the Trinity assembly (after filtering non-coding RNA): `results/mRNA/Trinity_contigs_ncrna_filtered.fasta`
- The cmsearch output: `results/mRNA/non_coding_rna/RFAM_cmsearch.{tbl,out}`
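Conceptually, this step drops every contig that cmsearch flags as non-coding RNA. A minimal sketch (the first column of a cmsearch `--tblout` file is the target name; the helpers themselves are hypothetical):

```python
def tblout_hit_ids(tblout_path):
    """Collect target names (first column) from a cmsearch --tblout file."""
    hits = set()
    with open(tblout_path) as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue  # skip comments and blank lines
            hits.add(line.split()[0])
    return hits

def filter_fasta(in_fa, out_fa, exclude_ids):
    """Copy FASTA records whose ID is not in exclude_ids."""
    keep = False
    with open(in_fa) as fin, open(out_fa, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                keep = line[1:].split()[0] not in exclude_ids
            if keep:
                fout.write(line)
```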
We filter out low-abundance contigs using the R library Vegan. Check the script here for more details.
You will find the following files:
- The Trinity assembly (after filtering low-abundance contigs and non-coding RNA): `results/mRNA/abundance_filtered/Trinity_contigs_ncrna_filtered.fasta`
- The abundance table (after filtering low-abundance contigs): `results/mRNA/abundance_filtered/mapped_reads_to_contigs.tsv`
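The idea is to drop contigs whose counts are too low to be informative. A minimal sketch using a simple total-count cutoff (the real script uses the vegan R package and its criterion may differ; the threshold here is made up):

```python
def filter_low_abundance(table, min_total=10):
    """Drop contigs whose summed counts across samples fall below min_total.

    `table` maps contig id -> list of per-sample read counts.
    """
    return {c: counts for c, counts in table.items() if sum(counts) >= min_total}
```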
Finally, we use Diamond and the AnnoTree database to annotate the mRNA contigs.
- We assign taxonomic annotation to the mRNA contigs using the Lowest Common Ancestor (LCA) algorithm (of course, only when the reference protein has a taxonomic annotation).
- We assign functional annotation to the mRNA contigs by using the best hit of the reference protein.
You will find the following files:
- The Diamond output in tabular format: `results/mRNA/diamond/annotree.tsv`
- The notebook with the post-processing of the Diamond output: `notebooks/annotree.ipynb`. Note that although Snakemake ran the notebook with default parameters, you should re-run it interactively and adjust the parameters to your needs.
- The annotated file with BRITE, KEGG, INTERPRO2GO, and taxonomic annotation: `results/mRNA/diamond/annotree_ann.tsv`
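The LCA step for the taxonomic annotation can be sketched as taking the deepest rank on which all of a contig's hits agree (a generic illustration of the algorithm, not Diamond's internal implementation):

```python
def lowest_common_ancestor(lineages):
    """Return the shared root-to-leaf prefix of the given lineages.

    Each lineage is a list of ranks from root to leaf; zip() truncates
    the comparison at the shortest lineage.
    """
    lca = []
    for ranks in zip(*lineages):
        if all(r == ranks[0] for r in ranks):
            lca.append(ranks[0])
        else:
            break
    return lca
```

For example, two hits annotated as Gammaproteobacteria and Alphaproteobacteria would collapse to Proteobacteria.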