-
Notifications
You must be signed in to change notification settings - Fork 12
5. TAXONOMIC ASSIGNMENT
This section describes our approach to assigning taxonomy to sequences (not contigs) generated by hecatomb.
Hecatomb uses mmseqs2 to assign taxonomy to individual sequences. It does so through an iterative search approach, which is a large part of the heart of hecatomb.
The first search takes all query seqeunces (from seqtable.fasta, an output of 00_preprocessing.smk located in your RESTULTS directory) and queries them against a virus protein target database. This initial search is a translated (nt-to-aa) search. The target database is all UniProt viral protein entires clusterd at 99% identity. This clustering greatly reduces the size of the database with minimal sacrifice to information content. For example, there were 4,635,541 protein entires for viruses in UniProt on October 29, 2020. When clustered at 99% identity only 1,840,337 representative sequences remained, this is about a 60% reduction in target query space.
All of the reads assigned a viral taxonomy in this first search need to be confirmed to be truly viral in origin. The issue here is that querying a target database consisting solely of viral proteins does not permit each sequence to see if it really originates from a different domain of life. So while a sequence may be classified as viral at this stage, once queried against a more comprehensive sequence database (consisting of bacteria, plants, vertebrates, fungi, etc.) those sequences may have higher statistically similar sequences in those domains of life.
It is formally possible to just do this secondary search, skipping the first step querying each sequence against a virus-only protien databse. However, this can be computationally very expensive. For example, if you have 1e6 reads as your query input (sequences in your *seqtable.fasta) and you query these against the virus-only protien database with 1.8e6 entries that is over 25 orders of magnitude smaller than UniRef50 which has 49,410,134 entries when this was written. The primary search against a virus-only database serves as a 'net' to (relatively) quickly capture sequences of interest. In our experience. However, the secondary search against the larger database is still required as many of the sequences assigned a viral taxonomy when only querying the smaller virus-only sequence database will reveal themselves to be entirely something else upon this secondary search. Thus, the iterative approach will expedite sample processing by capturing only a small portion of the input seqeunces as 'potentially' viral and confirming them in the secondary search as 'likely' viral.
This process wherein a sequence is classified as viral in the first search against a virus-only database and then no longer classfied as viral after the second search against a more comprehensive database is where hecatomb gets it's namesake. Many investigators, particularly those lacking access to large computing infrastructure will rely on the results of the primary search which, due to the relatively small target database size, can be done on commodity hardware. Examining the results at this stage will regularly identify a large number of viruses. Basically anything with any level of sequence identity to a viral protein will be called viral even if the sequence originated from a bacteria, plant, fungi or other organism. These sequences are 'sacrificed' through the secondary search when many of the viral calls from the first search are lost.
The rationale for using a translated search first ...
INSERT TAXONOMIC ASSIGNMENT FLOWGRAM HERE
General file infomration with explanations.
-
seqtable.fasta: The input file for all taxonomic is the seqtable.fasta generated as output from 00_preprocessing.smk. This is every 'unique' sequence in your population of samples. Uniquenes is defined by default as representative cluster squences. The default clustering is based on 97% identity but can be modified in your config file.
-
MMSEQS_AA_PRIMARY_lca.tsv: Lowest Common Ancestor (LCA) taxonomic assignments for every sequence in seqtable.fasta including sequences that are unclassified. The number of rows should be identical to the number of sequences in seqtable.fasta.
Column identifiers: 1. Query sequence ID (fasta header from seqtable.fasta) 2. NCBI Taxonomy ID 3. Linnean classifcation rank (Kingdom, Phylum, Class, etc.) 4. Lowest level taxonomic name 5. 7-level (K, P, C, O, F, G, S) taxonomic lineage 6. Full taxonomic lineage with lineage prefixes
-
MMSEQS_AA_PRIMARY_tophit_aln: Tophit taxonomic assignments and alignment statistics. The number of rows in this file is likely to be much lower than the number of sequences in seqtable.fasta since no infomration is saved for seqeunces with no-hits (no significant alignments). So this excludes all unlcassifieds unlike MMSEQS_AA_PRIMARY_lca.tsv. However, MMSEQS_AA_PRIMARY_lca.tsv does not contain alignment statistics so these 2 tables need to be analyzed together to understand taxonomic assignments in context of alignment statistics.
Column identifiers: 1. Query sequence ID (fasta header from seqtable.fasta) 2. Target (in this case, UniProt) ID 3. e-value (evalue) 4. Percent identity (pident) 5. Fractional identity (fident) 6. Number of identical matches (nident) 7. Number of mismatches (mismatch) 8. Percent of query covered (qcov) 9. Percent of target covered (tcov) 10. Query alignment start (qstart) 11. Query alignment end (qend) 12. Query sequence length (qlen) (note: this is the query length in nucleotides) 13. Target alignment start (tstart) 14. Target alignment end (tend) 15. Target sequence length (tlen) (note: this is the target length in amino acids) 16. Alignment length (alnlen) (note: this is the alignment length in amino acids. The query is translated similar to blastx) 17. Bit score (bits) 18. Query header (qheader) (note: should be the same as column 1) 19. Target header (theader) 20. NCBI taxonomy ID (taxid) 21. Taxonomy name (taxname) (note: species level) 22. Full taxonomic lineage (lineage) (note: contains prefixes as d_, p_, etc.)
- MMSEQS_AA_PRIMARY_report