| 1 | +# LABEL, Lineage Assignment by Extended Learning |
| 2 | + |
| 3 | +*By: Sam Shepard (vfn4@cdc.gov), CDC/NCIRD* |
| 4 | + |
| 5 | +LABEL’s purpose is to quickly, automatically, and correctly assign clades or lineages to nucleotide sequences. Automated lineage assignment has applications in surveillance, research, and high-throughput database annotation. Currently, LABEL supports lineage assignment of hemagglutinins for influenza A subtypes H5N1 and H9N2. |
| 6 | + |
| 7 | +## METHOD |
| 8 | + |
| 9 | +Lineage Assignment By Extended Learning (LABEL) uses hidden Markov model (HMM) profiles of clade alignments--or groups of clades--to score query sequences and then classifies them via machine learning techniques. The HMM scoring step is performed via SAM v3.5 (see [*compbio.soe.ucsc.edu/sam.html*](http://compbio.soe.ucsc.edu/sam.html) for more). Prediction is performed hierarchically--usually starting at a more general level (e.g., a group of clades) and proceeding to a very specific terminal level (a particular clade). This roughly corresponds to the hierarchical structure of phylogenetic trees and the H5N1 nomenclature system. The prediction phase of LABEL is done via support vector machines (SVM) using the free SHOGUN Machine Learning Toolbox v1.1.0 (multi-class GMNP SVM with polynomial kernel of degree 20, *www.shogun-toolbox.org*). Optional sequence alignment (MUSCLE v3.8.31, see [*www.drive5.com/muscle*](http://www.drive5.com/mus); MAFFT if available, see [mafft.cbrc.jp/alignment/software](http://mafft.cbrc.jp/alignment/software); or via SAM's *align2model* program) and tree-building functions are available to validate LABEL’s predictions (GTR+GAMMA, 1000 local support bootstraps, maximum-likelihood tree using FastTreeMP v2.1.4, see [*www.microbesonline.org/fasttree*](http://www.microbesonline.org/fasttree)). |
| 10 | + |
| 11 | +## BROADER IMPACT |
| 12 | + |
| 13 | +Although we have only constructed modules for H5 and H9, LABEL's methodology need not be limited to influenza A or even to viral sequences. Given any phylogenetic tree with defined families or clades, one can train a LABEL module for automated lineage assignment. Training combines support scripts with manually applied expert knowledge. |
| 14 | + |
| 15 | +## ACCURACY & PERFORMANCE |
| 16 | + |
| 17 | +On full-length H5v2011 and H9v2011 sequences, LABEL performs with 100% accuracy on the tested datasets, and runtime scales linearly at about half a second per hemagglutinin sequence on a four-core machine. Full results are in pre-publication drafting and are available upon request. Choosing alignment options may increase the runtime significantly; however, guide sequence libraries never exceed 200 sequences. For the best results with the alignment options, split your query sequence file into smaller files. |
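
The advice to break a large query file into smaller ones can be scripted. The sketch below is one possible approach and is not part of LABEL: `split_fasta` is a hypothetical helper, and the `PREFIX_N.fa` naming is an assumption chosen for illustration.

```shell
#!/bin/sh
# split_fasta FILE N PREFIX
# Write consecutive batches of at most N sequences from FILE to
# PREFIX_1.fa, PREFIX_2.fa, ... (hypothetical naming, not a LABEL convention).
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # Open a new output file every n sequences.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s_%d.fa", prefix, int(count / n) + 1)
            }
            count++
        }
        # Ignore anything before the first header; copy everything else.
        out != "" { print > out }
    ' "$1"
}

# Example (assumes LABEL is on PATH; file and project names are made up):
#   split_fasta queries.fa 500 batch
#   for f in batch_*.fa; do LABEL -A "$f" "$(basename "$f" .fa)" H5; done
```

Each batch keeps the original sequence order and line layout, so concatenating the batch files reproduces the input.
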
| 18 | + |
| 19 | +## USAGE |
| 20 | + |
| 21 | +```{bash} |
| 22 | +Usage: |
| 23 | + LABEL [-P MAX_PROC] [-E C_OPT] [-W WRK_PATH|-O OUT_PATH] [-G|-TACRD|-S] [-L LIN_PATH] <nts.fasta> <project> <Module:H5,H9,etc.> |
| 24 | + -T Do TRAINING again instead of using classifier files. |
| 25 | + -A Do ALIGNMENT of re-annotated fasta file (sorted by clade) & build its ML tree. |
| 26 | + -C Do CONTROL alignment & ML tree construction. |
| 27 | + -E SGE clustering option. Use 1 or 2 for SGE with array jobs, else local. |
| 28 | + -R No RECURSIVE prediction. Limits scope, useful with -L option. |
| 29 | + -D No DELETION of extra intermediary files. |
| 30 | + -S Show available protein modules. |
| 31 | + -W Web-server mode: requires ABSOLUTE path to WRITABLE working directory. |
| 32 | + -O Output directory path, do not use with web mode. |
| 33 | + -G Create a scoring matrix using given header annotations for Graphing. (removed) |
| 34 | +Example: ./LABEL -C gisaid_H5N1.fa Bird_Flu H5 |
| 35 | +``` |
| 36 | + |
| 37 | +## DATA |
| 38 | + |
| 39 | +- LABEL takes FASTA-formatted nucleotide sequences. The FASTA may be single- or multi-line and may contain any number of sequences. Sequences with duplicate headers are removed (the first one read is kept). Commas and apostrophes are removed from headers, and internal spaces are replaced with underscores. |
| 40 | + |
| 41 | +- LABEL generates re-annotated FASTA sequences, scoring data, Newick files, alignments, tab-delimited files, and miscellaneous text files. LABEL writes only text and creates no binaries or images. Output is confined to the specified output directory (or to a default working directory within the package) and to the current working directory of the calling user. |
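
To preview how headers may look after LABEL's clean-up, the rules above can be approximated as follows. This is a minimal sketch, not LABEL's own code: `clean_fasta` is a hypothetical name, and it assumes that duplicates are detected on the cleaned header and that leading and trailing whitespace is trimmed.

```shell
#!/bin/sh
# clean_fasta: read FASTA on stdin, write cleaned FASTA on stdout.
# Approximates the header rules described above (assumptions noted inline).
clean_fasta() {
    awk -v q="'" '
        /^>/ {
            h = substr($0, 2)
            gsub("[," q "]", "", h)          # drop commas and apostrophes
            gsub(/^[ \t]+|[ \t]+$/, "", h)   # assumption: trim outer whitespace
            gsub(/ /, "_", h)                # internal spaces become underscores
            keep = !(h in seen)              # assumption: compare cleaned headers
            seen[h] = 1
            if (keep) print ">" h
            next
        }
        # Sequence lines follow the fate of their header.
        keep { print }
    '
}

# Example: clean_fasta < queries.fa > queries_clean.fa
```

Running this on a copy of the input before submission makes it easier to match LABEL's output headers back to the original records.
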
| 42 | + |
| 43 | +## FILES GENERATED |
| 44 | + |
| 45 | +| File | Type | Description | |
| 46 | +| :------------------------- | :-------- | :-------------------------------------------------------------------------- | |
| 47 | +| PROJ_final.tab | Standard. | Tab-delimited headers & predicted clades. | |
| 48 | +| PROJ_final.txt | Standard. | A prettier output of the above. | |
| 49 | +| LEVEL_trace.tab | Standard. | Table of HMM scores at each level, suitable for visualization in R. | |
| 50 | +| LEVEL_result.tab | Standard. | For the current prediction level, tab-delimited headers & predicted clades. | |
| 51 | +| LEVEL_result.txt | Standard. | For the current prediction level, a prettier output of the above. |
| 52 | +| FASTA/ | Standard. | Folder containing fasta files and newick trees. | |
| 53 | +| FASTA/PROJ_predictions.fas | Standard. | Query sequence file with predictions added like: _{PRED:CLAD} | |
| 54 | +| FASTA/MOD_control.fasta | Optional. | Alignment of predictions fasta file and guide sequences. | |
| 55 | +| FASTA/MOD_control.nwk | Optional. | Maximum likelihood tree of the above. | |
| 56 | +| FASTA/PROJ_reannotated.fas | Default. | Query file with annotations replaced with predicted ones, ordered by clade. | |
| 57 | +| FASTA/PROJ_ordered.fasta | Optional. | Aligned version of the above, still ordered by clade. | |
| 58 | +| FASTA/PROJ_tree.nwk | Optional. | Maximum likelihood tree of the above. | |
| 59 | +| FASTA/PROJ_clade_CLAD.fas | Standard. | The re-annotated file partitioned into separate clade files. | |
| 60 | +| c-*/ | Standard. | Clade/lineage subfolder for the hierarchical predictions. | |
| 61 | + |
| 62 | +*The project name is denoted "PROJ", the lineage or clade "CLAD", and the module of interest "MOD".* |
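
A quick summary of a finished run can be obtained by tallying the predicted clades in *PROJ_final.tab*. The sketch below is illustrative only: `count_clades` is a hypothetical helper, and it assumes that the file has no header row and that the predicted clade is the last tab-delimited column.

```shell
#!/bin/sh
# count_clades FILE: print "clade<TAB>count", sorted by clade name.
# Assumes one record per line with the predicted clade in the last column.
count_clades() {
    awk -F '\t' '
        NF >= 2 { n[$NF]++ }                               # tally each clade
        END { for (c in n) printf "%s\t%d\n", c, n[c] }    # emit the totals
    ' "$1" | LC_ALL=C sort
}

# Example (project name is made up): count_clades Bird_Flu_final.tab
```

If the real file carries a header row or extra columns, the column choice in the `awk` program would need adjusting.
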
| 63 | + |
| 64 | +## MODULES |
| 65 | + |
| 66 | +LABEL modules are simply directories within the *LABEL\_RES/training\_data* folder; each contains all associated pHMMs as well as SVM training data. Extensions such as *x-filter.txt* guard against inappropriate data input. The guide tree for positive control (if desired) is listed as *MOD\_downsample.fa* for MAFFT/MUSCLE alignment or in the *x-control* folder for faster pHMM alignment. See website for more information or use: `./LABEL -S` |
| 67 | + |
| 68 | +## HARDWARE |
| 69 | + |
| 70 | +We recommend a single multi-core machine with no fewer than 2 cores (8 or more threads work best) and at least 2 GB of RAM. LABEL runtime depends on the number of cores available on the machine. Use with Mac OS X requires a 64-bit chipset. |
| 71 | + |
| 72 | +## SOFTWARE PRE-REQUISITES |
| 73 | + |
| 74 | +See "QUICK_INSTALL.txt". |
| 75 | + |
| 76 | +## PACKAGED SOFTWARE |
| 77 | + |
| 78 | +- SHOGUN version 1.0.0 or later (tested 1.1.0) |
| 79 | + - Purpose: executes the SVM decision phase. |
| 80 | + - License: GPL v3 |
| 81 | +- MUSCLE 3.8 or later (tested 3.8.11) |
| 82 | + - Purpose: optionally align output or control |
| 83 | + - License: Public Domain |
| 84 | +- FastTreeMP 2.1.4 or later |
| 85 | + - Purpose: optionally build trees |
| 86 | + - License: GPL (any) |
| 87 | +- SAM version 3.5 or later |
| 88 | + - Purpose: build HMM profiles, score sequences for evaluation |
| 89 | + - License: Academic/Government, not-for-profit, redistributed with permission |
| 90 | +- BASH scripts |
| 91 | + - Purpose: assist installation, main pipeline for LABEL |
| 92 | + - License: owner, GPL |
| 93 | +- Perl scripts |
| 94 | + - Purpose: data manipulation and formatting; calls SHOGUN for SVM use. |
| 95 | + - License: owner, GPL. |
| 96 | + |
| 97 | +## INSTALLATION |
| 98 | + |
| 99 | +1) Unzip the archive containing LABEL. |
| 100 | +2) Move the file “LABEL” and the directory “*LABEL\_RES*” to a directory listed in your PATH environment variable. Alternatively, add the directory containing LABEL and *LABEL\_RES* to your PATH. |
| 101 | +3) Restart your terminal emulator. Note: *LABEL\_RES* and LABEL must be in the same folder. |
| 102 | +4) LABEL is now installed. To test it, execute: `LABEL test.fa test_proj H9v2011` |
| 103 | +5) The file “*test.fa*” is given in the deployment archive. To access LABEL without using the PATH variable, cd to your extracted directory & substitute “./LABEL” for “LABEL” above. |
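
For step 2, adding the extracted directory to PATH might look like the sketch below. The location `$HOME/opt/label` and the names `LABEL_HOME` and `add_to_path` are assumptions for illustration; substitute the directory that actually holds LABEL and *LABEL\_RES*.

```shell
#!/bin/sh
# Hypothetical install location; use wherever the archive was extracted.
LABEL_HOME="${LABEL_HOME:-$HOME/opt/label}"

# Append a directory to PATH only if it is not already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                          # already present, nothing to do
        *) PATH="$PATH:$1"; export PATH ;;
    esac
}

add_to_path "$LABEL_HOME"

# Report whether the shell can now find the launcher.
if command -v LABEL >/dev/null 2>&1; then
    echo "LABEL found at $(command -v LABEL)"
else
    echo "LABEL not found; check that $LABEL_HOME holds LABEL and LABEL_RES" >&2
fi
```

To make the change persist across sessions, the same lines can be placed in a shell start-up file such as `~/.profile`, which is why step 3 asks for a terminal restart.
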
| 104 | + |
| 105 | +## LICENSE |
| 106 | + |
| 107 | +GPL version 3. This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. |
| 108 | + |
| 109 | +This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. |
| 110 | + |
| 111 | +You should have received a copy of the GNU General Public License along with this program. If not, see [*www.gnu.org/licenses/*](http://www.gnu.org/licenses/). |
| 112 | + |
| 113 | +## DISCLAIMER & LIMITATION OF LIABILITY |
| 114 | + |
| 115 | +[SAM (align2model,hmmscore,modelfromalign)](http://compbio.soe.ucsc.edu/sam2src/) binaries may be used within LABEL for government and/or academic use only. Commercial use and redistribution for commercial use are excluded. Use of SAM implies acceptance of this [license](http://compbio.soe.ucsc.edu/sam-lic/obj.0). |
| 116 | + |
| 117 | +The materials embodied in this software are "as-is" and without warranty of any kind, express, implied or otherwise, including without limitation, any warranty of fitness for a particular purpose. In no event shall the Centers for Disease Control and Prevention (CDC) or the United States (U.S.) Government be liable to you or anyone else for any direct, special, incidental, indirect or consequential damages of any kind, or any damages whatsoever, including without limitation, loss of profit, loss of use, savings or revenue, or the claims of third parties, whether or not CDC or the U.S. Government has been advised of the possibility of such loss, however caused and on any theory of liability, arising out of or in connection with the possession, use or performance of this software. In no event shall any other party who modifies and/or conveys the program as permitted according to GPL license [[*www.gnu.org/licenses/*](http://www.gnu.org/licenses/)], make CDC or the U.S. government liable for damages, including any general, special, incidental or consequential damages arising out of the use or inability to use the program, including but not limited to loss of data or data being rendered inaccurate or losses sustained by third parties or a failure of the program to operate with any other programs. Any views, prepared by individuals as part of their official duties as United States government employees or as contractors of the United States government and expressed herein, do not necessarily represent the views of the United States government. Such individuals’ participation in any part of the associated work is not meant to serve as an official endorsement of the software. The CDC and the U.S. government shall not be held liable for damages resulting from any statements arising from use of or promotion of the software that may conflict with any official position of the United States government. |