
Developer documentation

Curro Campuzano edited this page Nov 8, 2023 · 2 revisions

Regarding the pipeline structure, we use the snakemake-workflow template, simplified by removing unused GitHub Actions. Please refer to the official documentation for more information.

The entry point of the pipeline is the file workflow/Snakefile. At the start, we check for the existence of usearch, which is non-distributable software. Instead of using a metadata CSV file (arguably better), we symlink the input files (potentially renamed) into a results/renamed_raw_reads directory. We match those files with a regular expression to build the unique_samples variable, which we use throughout the pipeline. The rule "all" runs by default: the pipeline finishes when it has completed all files enumerated there. Thus, extending the pipeline (or removing steps) amounts to adding or removing files from that list. Notice that you don't have to specify intermediate files, even if you want to keep them.
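The sample-matching step can be sketched roughly as follows. The file names and the regular expression are illustrative assumptions, not the Snakefile's exact code:

```python
import re

# Hypothetical paired-end files as they might appear in results/renamed_raw_reads
files = [
    "results/renamed_raw_reads/sampleA_R1.fastq.gz",
    "results/renamed_raw_reads/sampleA_R2.fastq.gz",
    "results/renamed_raw_reads/sampleB_R1.fastq.gz",
    "results/renamed_raw_reads/sampleB_R2.fastq.gz",
]

# Extract the sample name from each read file and deduplicate across R1/R2
pattern = re.compile(r"renamed_raw_reads/(.+)_R[12]\.fastq\.gz$")
unique_samples = sorted({m.group(1) for f in files if (m := pattern.search(f))})
```

Every downstream rule then expands its outputs over `unique_samples`, which is why adding a new sample only requires placing (or symlinking) its renamed files into that directory.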

Below, we briefly document each step:

Trimming reads and removing adapters

We use the Trim Galore Snakemake wrapper; you can check the optional flags we use by editing the trim_galore section of the config/config.yaml file. For more information, check the file workflow/rules/trim_galore.smk.

Rule name: trim_galore_pe

Filtering rRNA reads

We use a custom script (workflow/scripts/sortmerna.py) to run SortMeRNA. You can check the optional flags we use by editing the sortmerna section of the config/config.yaml file, and specify the database in config/private_config.yaml. For more information, check the file workflow/rules/sortmerna.smk.

Rule name: sortmerna_ssu

Reconstructing ribosomal genes

We must decompress the forward and reverse reads, rename them, concatenate them into a new file, and create a configuration file for MetaRib. The MetaRib script is in workflow/external_scripts, as its legacy code is challenging to integrate into our pipeline. You can modify the EMIRGE database in config/private_config.yaml. You can check the flags we use by editing the metarib section of the config/config.yaml file. For more information, check the file workflow/rules/metarib.smk and workflow/rules/quast.smk.

Rule names: decompress_rrna, data_preparation, config_file_metarib, MetaRib, quast
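The data-preparation steps above can be sketched in Python. The merged file names and the configuration keys are assumptions for illustration; the real rules and MetaRib's expected configuration may differ:

```python
import gzip
import shutil
from pathlib import Path

def prepare_metarib_inputs(fwd_gz, rev_gz, out_dir):
    """Decompress paired reads and concatenate them into single forward and
    reverse files, then write a minimal INI-style configuration file.
    File names (all.1.fq/all.2.fq) and config keys are illustrative."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for gz_files, merged_name in ((fwd_gz, "all.1.fq"), (rev_gz, "all.2.fq")):
        with open(out / merged_name, "wb") as merged:
            for gz in gz_files:
                with gzip.open(gz, "rb") as handle:
                    shutil.copyfileobj(handle, merged)
    # Sketch of a MetaRib configuration file pointing at the merged reads
    (out / "MetaRib.cfg").write_text(
        "[BASE]\nPROJECT_DIR: {}\nDATA_LIST: all\n".format(out)
    )
```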

Estimating the abundance of the reconstructed ribosomal genes

We index the reconstructed ribosomal genes fasta file using BWA, map the reads to the contigs with BWA into SAM files, sort them into BAM files, index them using Samtools, get stats of the mapping using Samtools, and finally, we process the mapping using a custom script (workflow/scripts/process_mapped_reads_contigs.py). Notice that intermediate files are not stored. For more information, check the file workflow/rules/mapped_reads_to_metarib_contigs.smk.

Rule names: metarib_bwa_index, bwa_contig_rrna, sort_bwa_contig_rRNA, samtools_index_rrna, samtools_idxstats_rrna, processing_mapped_contigs_rrna
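The final processing step consumes the `samtools idxstats` output, which reports one contig per line (name, sequence length, mapped reads, unmapped reads). A minimal sketch of turning that into per-contig counts, not the actual logic of process_mapped_reads_contigs.py:

```python
def idxstats_to_counts(idxstats_text):
    """Parse `samtools idxstats` output into {contig: mapped_read_count}.
    Each tab-separated line holds: name, length, mapped, unmapped.
    The trailing '*' record (unplaced reads) is skipped."""
    counts = {}
    for line in idxstats_text.strip().splitlines():
        name, length, mapped, unmapped = line.split("\t")
        if name == "*":
            continue
        counts[name] = int(mapped)
    return counts
```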

Taxonomic annotation of the reconstructed ribosomal genes

We use CREST4 for taxonomic annotation. You can specify the database in config/private_config.yaml, although CREST4 can download it automatically for you. We join the annotated contigs table with our abundance table using a custom script (scripts/process_crest4_phyloseq.R), optionally add Greengenes prefixes, and create a basic phyloseq object. For more information, check the file workflow/rules/crest4.smk. You can modify the CREST4 flags in config/config.yaml.

Rule names: CREST4, process_crest4, edit_taxonomy
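The Greengenes-prefix step follows the usual convention of marking each taxonomic rank with a short prefix (k__ for kingdom, p__ for phylum, and so on). A sketch of the idea; the actual script's handling of missing ranks may differ:

```python
# Greengenes-style rank prefixes, from kingdom down to species
PREFIXES = ["k__", "p__", "c__", "o__", "f__", "g__", "s__"]

def add_greengenes_prefixes(lineage):
    """Prefix each rank of a taxonomy lineage with its Greengenes marker.
    A lineage shorter than seven ranks is prefixed as far as it goes."""
    return [prefix + rank for prefix, rank in zip(PREFIXES, lineage)]
```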

Filtering mRNA reads

We use SortMeRNA as specified above.

Rule name: sortmerna_lsu

Trinity assembly

We use the Snakemake wrapper for Trinity. You can check the flags we use by editing the trinity section of the config/config.yaml file. We don't store intermediate results. For more information, check the file workflow/rules/trinity.smk.

Rule name: trinity

Mapping reads to the Trinity contigs

We use the same procedure as for the ribosomal genes. For more information, check the file workflow/rules/mapped_reads_to_trinity_contigs.smk.

Rule names: trinity_bwa_index, bwa_contig_mrna, sort_bwa_contig_mRNA, samtools_index_mRNA, samtools_idxstats_mRNA, processing_mapped_contigs_mRNA

Filtering non-coding RNA

We use cmsearch to filter out non-coding RNA. You can modify the E-value threshold in config/config.yaml (evalue_non_coding_rna), as well as specify a different database in config/private_config.yaml (RFAM_DATABASE). We process the tabular output of cmsearch using a custom script (workflow/scripts/processing_cmsearch_tbl.py), and we exclude non-coding RNA from the fasta file using a custom script (workflow/scripts/filter_fasta.py). For more information, check the file workflow/rules/non_coding_RNA.smk.

Rule names: infernal_cmsearch, processing_cmsearch_tbl, non_coding_fasta_names, exclude_non_coding_rna
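Conceptually, the two custom scripts collect target names from the cmsearch table whose E-value passes the threshold, then drop those records from the FASTA. A rough sketch; the column position follows Infernal's `--tblout` format (E-value in column 16), but treat the details as assumptions rather than the scripts' exact behaviour:

```python
def noncoding_hits(tblout_text, evalue_cutoff):
    """Collect target names from cmsearch --tblout output whose E-value
    (column 16 of the whitespace-separated table) passes the cutoff."""
    hits = set()
    for line in tblout_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if float(fields[15]) <= evalue_cutoff:
            hits.add(fields[0])
    return hits

def filter_fasta(fasta_text, exclude):
    """Return FASTA text with records whose ID is in `exclude` removed."""
    kept, keep = [], True
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            keep = line[1:].split()[0] not in exclude
        if keep:
            kept.append(line)
    return "\n".join(kept) + "\n"
```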

Filtering low-abundance contigs

We use the R library vegan to filter out low-abundance contigs, and we filter the FASTA file using seqtk. You can modify the minimum abundance threshold in config/config.yaml (filter_mRNA_abundance_minimum). For more information, check the file workflow/rules/filter_mRNA_abundance.smk and the script workflow/scripts/vegan_filter_abundance.R.

Rule names: find_contigs_to_keep, filter_contigs
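In the pipeline this filter is implemented in R with vegan, but the underlying idea is simply to keep contigs whose total count across samples reaches the configured minimum. An equivalent Python sketch, assuming a plain {contig: counts-per-sample} table:

```python
def contigs_to_keep(abundance, minimum):
    """Given {contig: [counts per sample]}, return the (sorted) names of
    contigs whose total count across samples is at least `minimum`."""
    return sorted(c for c, counts in abundance.items() if sum(counts) >= minimum)
```

The surviving names are then handed to seqtk to subset the FASTA file.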

Functional and taxonomic annotation of the mRNA contigs

We use Diamond and AnnoTree for functional and taxonomic annotation. We use a template Jupyter notebook, workflow/notebooks/annotree.ipynb.py, to run the whole analysis. Please refer to the notebook and workflow/rules/diamond.smk for more information. This rule produces as output a notebook, notebook/annotree.ipynb, which the user can re-execute interactively with their own parameters.

Rule name: annotree