Developer documentation
Regarding the pipeline structure, we use the snakemake-workflow template, which we simplified by removing unused GitHub Actions. Please refer to the official documentation for more information.
The entry point of the pipeline is the file workflow/Snakefile. At the start, we check for the existence of usearch, which is non-distributable software. Instead of using a metadata CSV file (arguably a better approach), we symlink the input files (potentially renamed) into a results/renamed_raw_reads directory. We match those files with a regex to build the unique_samples variable used throughout the pipeline. The rule "all" is executed by default: the pipeline finishes when all files enumerated there have been produced. Thus, the way to extend the pipeline (or remove steps) is to add or remove files from that list. Note that you do not need to list intermediate files, even those you want to keep.
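For illustration, this pattern might look roughly as follows in a Snakefile (the read-naming scheme, the final target, and the exact matching pattern are assumptions; the authoritative code is in workflow/Snakefile):

```python
# Minimal sketch of the entry-point logic, not the actual workflow/Snakefile.
import shutil
import sys

# usearch is non-distributable software, so we verify it is installed up front.
if shutil.which("usearch") is None:
    sys.exit("usearch not found in PATH; please install it manually.")

# Derive sample names from the symlinked files (naming scheme assumed here).
unique_samples = glob_wildcards(
    "results/renamed_raw_reads/{sample}_R1.fastq.gz"
).sample

rule all:
    input:
        # List only the final targets: Snakemake infers every intermediate
        # file from the rule dependency graph.
        expand("results/phyloseq/{sample}.rds", sample=unique_samples),
```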
Below, we briefly document each step:
We use the Trim Galore Snakemake wrapper; you can check the optional flags we use by editing the trim_galore section of the config/config.yaml file. For more information, check the file workflow/rules/trim_galore.smk.
Rule name: trim_galore_pe
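As an illustration, a minimal sketch of such a rule, following the wrapper's documented interface (paths, output names, the config key layout, and the wrapper version are assumptions; see workflow/rules/trim_galore.smk for the real rule):

```python
rule trim_galore_pe:
    input:
        ["results/renamed_raw_reads/{sample}_R1.fastq.gz",
         "results/renamed_raw_reads/{sample}_R2.fastq.gz"],
    output:
        # Trim Galore derives the _val_* and report names from the input basenames
        "results/trimmed/{sample}_R1_val_1.fq.gz",
        "results/trimmed/{sample}_R1.fastq.gz_trimming_report.txt",
        "results/trimmed/{sample}_R2_val_2.fq.gz",
        "results/trimmed/{sample}_R2.fastq.gz_trimming_report.txt",
    params:
        # Optional flags from the trim_galore section of config/config.yaml
        extra=config["trim_galore"],
    log:
        "logs/trim_galore/{sample}.log",
    wrapper:
        "0.72.0/bio/trim_galore/pe"
```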
We use a custom script (workflow/scripts/sortmerna.py) to run SortMeRNA. You can check the optional flags we use by editing the sortmerna section of the config/config.yaml file; the database is specified in config/private_config.yaml. For more information, check the file workflow/rules/sortmerna.smk.
Rule name: sortmerna_ssu
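A hypothetical sketch of a script-based rule of this kind (all file names and config keys below are assumptions; the real rule is in workflow/rules/sortmerna.smk):

```python
rule sortmerna_ssu:
    input:
        fwd="results/trimmed/{sample}_R1_val_1.fq.gz",
        rev="results/trimmed/{sample}_R2_val_2.fq.gz",
    output:
        rrna="results/sortmerna/{sample}_rRNA.fq.gz",
        other="results/sortmerna/{sample}_non_rRNA.fq.gz",
    params:
        extra=config["sortmerna"],        # sortmerna section of config/config.yaml
        database=config["SSU_DATABASE"],  # assumed key in config/private_config.yaml
    log:
        "logs/sortmerna/{sample}.log",
    script:
        # Inside the script, Snakemake exposes a `snakemake` object with
        # .input, .output, .params, and .log attributes.
        "../scripts/sortmerna.py"
```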
We must decompress the forward and reverse reads, rename them, concatenate them into a new file, and create a configuration file for MetaRib. The MetaRib script is in workflow/external_scripts, as its legacy code is challenging to integrate into our pipeline. You can modify the EMIRGE database in config/private_config.yaml. You can check the flags we use by editing the metarib section of the config/config.yaml file. For more information, check the file workflow/rules/metarib.smk and workflow/rules/quast.smk.
Rule names: decompress_rrna, data_preparation, config_file_metarib, MetaRib, quast
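For example, the data-preparation step could be sketched like this (file locations are assumptions; see workflow/rules/metarib.smk for the real rules):

```python
rule data_preparation:
    input:
        fwd=expand("results/sortmerna/{sample}_rRNA_fwd.fq.gz", sample=unique_samples),
        rev=expand("results/sortmerna/{sample}_rRNA_rev.fq.gz", sample=unique_samples),
    output:
        fwd="results/metarib/data/all.1.fq",
        rev="results/metarib/data/all.2.fq",
    shell:
        # MetaRib expects plain (decompressed) FASTQ files holding the
        # concatenated reads of all samples.
        "zcat {input.fwd} > {output.fwd} && zcat {input.rev} > {output.rev}"
```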
We index the reconstructed ribosomal genes FASTA file using BWA, map the reads to the contigs with BWA into SAM files, sort them into BAM files, index them with Samtools, gather mapping statistics with Samtools, and finally process the mapping with a custom script (workflow/scripts/process_mapped_reads_contigs.py). Notice that intermediate files are not stored. For more information, check the file workflow/rules/mapped_reads_to_metarib_contigs.smk.
Rule names: metarib_bwa_index, bwa_contig_rrna, sort_bwa_contig_rRNA, samtools_index_rrna, samtools_idxstats_rrna, processing_mapped_contigs_rrna
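Two of these steps might look as follows; marking outputs with temp() is how Snakemake avoids storing intermediate files (paths are assumptions; see workflow/rules/mapped_reads_to_metarib_contigs.smk):

```python
rule bwa_contig_rrna:
    input:
        ref="results/metarib/contigs.fasta",
        idx=multiext("results/metarib/contigs.fasta",
                     ".amb", ".ann", ".bwt", ".pac", ".sa"),
        fwd="results/sortmerna/{sample}_rRNA_fwd.fq.gz",
        rev="results/sortmerna/{sample}_rRNA_rev.fq.gz",
    output:
        # temp(): deleted automatically once downstream rules have run
        temp("results/mapping_rrna/{sample}.sam"),
    threads: 8
    shell:
        "bwa mem -t {threads} {input.ref} {input.fwd} {input.rev} > {output}"

rule sort_bwa_contig_rRNA:
    input:
        "results/mapping_rrna/{sample}.sam",
    output:
        temp("results/mapping_rrna/{sample}.sorted.bam"),
    shell:
        "samtools sort -o {output} {input}"
```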
We use CREST4 for taxonomic annotation. You can specify the database in config/private_config.yaml, although CREST4 can download it automatically for you. We join the annotated contigs table with our abundance table using a custom script (scripts/process_crest4_phyloseq.R), optionally add Greengenes prefixes, and create a basic phyloseq object. You can modify the CREST4 flags in config/config.yaml. For more information, check the file workflow/rules/crest4.smk.
Rule names: CREST4, process_crest4, edit_taxonomy
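A hypothetical sketch of the joining step (the CREST4 output name and the abundance table are assumptions):

```python
rule process_crest4:
    input:
        taxonomy="results/crest4/{sample}/assignments.txt",   # assumed CREST4 output
        counts="results/mapping_rrna/{sample}_idxstats.tsv",  # assumed abundance table
    output:
        "results/phyloseq/{sample}.rds",
    script:
        # In R scripts, Snakemake exposes an S4 object: snakemake@input,
        # snakemake@output, snakemake@params, and so on.
        "../scripts/process_crest4_phyloseq.R"
```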
We use SortMeRNA as specified above.
Rule name: sortmerna_lsu
We use the Snakemake wrapper for Trinity. You can check the flags we use by editing the trinity section of the config/config.yaml file. We don't store intermediate results. For more information, check the file workflow/rules/trinity.smk.
Rule name: trinity
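A minimal sketch following the wrapper's documented interface (paths, the config key layout, and the wrapper version are assumptions; see workflow/rules/trinity.smk):

```python
rule trinity:
    input:
        left=["results/sortmerna/{sample}_non_rRNA_fwd.fq.gz"],
        right=["results/sortmerna/{sample}_non_rRNA_rev.fq.gz"],
    output:
        # Trinity requires "trinity" in its output directory name
        "results/trinity_{sample}/Trinity.fasta",
    params:
        extra=config["trinity"],   # trinity section of config/config.yaml
    threads: 8
    resources:
        mem_gb=10,
    log:
        "logs/trinity/{sample}.log",
    wrapper:
        "0.72.0/bio/trinity"
```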
We use the same procedure as for the ribosomal genes. For more information, check the file workflow/rules/mapped_reads_to_trinity_contigs.smk.
Rule names: trinity_bwa_index, bwa_contig_mrna, sort_bwa_contig_mRNA, samtools_index_mRNA, samtools_idxstats_mRNA, processing_mapped_contigs_mRNA
We use cmsearch to filter out non-coding RNA. You can modify the E-value threshold in config/config.yaml (evalue_non_coding_rna), as well as specify a different database in config/private_config.yaml (RFAM_DATABASE). We process the tabular output of cmsearch using a custom script (workflow/scripts/processing_cmsearch_tbl.py), and we exclude non-coding RNA from the fasta file using a custom script (workflow/scripts/filter_fasta.py). For more information, check the file workflow/rules/non_coding_RNA.smk.
Rule names: infernal_cmsearch, processing_cmsearch_tbl, non_coding_fasta_names, exclude_non_coding_rna
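The search step could be sketched like this; the cmsearch flags (--tblout, -E, --cpu) are standard Infernal options, while the file paths are assumptions:

```python
rule infernal_cmsearch:
    input:
        fasta="results/trinity_{sample}/Trinity.fasta",
        rfam=config["RFAM_DATABASE"],            # config/private_config.yaml
    output:
        tbl="results/non_coding_rna/{sample}.tbl",
    params:
        evalue=config["evalue_non_coding_rna"],  # config/config.yaml
    threads: 8
    shell:
        # --tblout writes the parseable table; stdout (the full report) is discarded.
        "cmsearch --tblout {output.tbl} -E {params.evalue} "
        "--cpu {threads} {input.rfam} {input.fasta} > /dev/null"
```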
We use the R library vegan to filter out low-abundance contigs, and we filter the FASTA file using seqtk. You can modify the minimum abundance threshold in config/config.yaml (filter_mRNA_abundance_minimum). For more information, check the file workflow/rules/filter_mRNA_abundance.smk and the script workflow/scripts/vegan_filter_abundance.R.
Rule names: find_contigs_to_keep, filter_contigs
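The FASTA filtering step boils down to seqtk subseq, which keeps only the sequences named in a list file (paths are assumptions):

```python
rule filter_contigs:
    input:
        fasta="results/non_coding_rna/{sample}_mRNA.fasta",
        keep="results/filter_mRNA/{sample}_contigs_to_keep.txt",  # from vegan_filter_abundance.R
    output:
        "results/filter_mRNA/{sample}_filtered.fasta",
    shell:
        # seqtk subseq extracts the sequences whose names appear in {input.keep}
        "seqtk subseq {input.fasta} {input.keep} > {output}"
```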
We use Diamond and AnnoTree for functional and taxonomic annotation. A template Jupyter notebook, workflow/notebooks/annotree.ipynb.py, runs the whole analysis. For more information, refer to the notebook and to workflow/rules/diamond.smk. This rule produces as output a notebook, notebook/annotree.ipynb, which the user can re-execute interactively with their own parameters.
Rule name: annotree
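A sketch of how a notebook-backed rule like this is typically wired up in Snakemake (the input and output names are assumptions):

```python
rule annotree:
    input:
        "results/diamond/diamond_matches.tsv",   # assumed Diamond output
    output:
        "results/annotree/annotations.tsv",      # assumed final table
    log:
        # Snakemake stores the executed notebook here; it can later be
        # re-run interactively with `snakemake --edit-notebook <target>`.
        notebook="notebook/annotree.ipynb",
    notebook:
        "../notebooks/annotree.ipynb.py"
```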