Commit d65480d

Updated README
1 parent 4e68e60 commit d65480d

2 files changed

+117
-1
lines changed

LABEL_RES/README.rtf

Lines changed: 0 additions & 1 deletion
This file was deleted.

README.md

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
1+
# LABEL, Lineage Assignment by Extended Learning
2+
3+
*By: Sam Shepard (vfn4@cdc.gov), CDC/NCIRD*
4+
5+
LABEL’s purpose is to quickly, automatically, and correctly assign clades or lineages to nucleotide sequences. Automated lineage assignment has applications in surveillance, research, and high-throughput database annotation. Currently LABEL supports the lineage assignment of hemagglutinins for influenza A subtypes H5N1 and H9N2.
6+
7+
## METHOD
8+
9+
Lineage Assignment By Extended Learning (LABEL) uses hidden Markov model (HMM) profiles of clade alignments--or groups of clades--to analyze query sequences and then classify them via machine learning techniques. The HMM scoring step is performed via SAM v3.5 (see [*compbio.soe.ucsc.edu/sam.html*](http://compbio.soe.ucsc.edu/sam.html) for more). Prediction is performed hierarchically--usually starting at a more general level (e.g., a group of clades) and proceeding to a very specific terminal level (a particular clade). This roughly corresponds to the hierarchical structure of phylogenetic trees and the H5N1 nomenclature system. The prediction phase of LABEL is done via support vector machines (SVM) using the free SHOGUN Machine Learning Toolbox v1.1.0 (multi-class GMNP SVM with polynomial kernel of degree 20, *www.shogun-toolbox.org*). Optional sequence alignment (MUSCLE v3.8.31, see [*www.drive5.com/muscle*](http://www.drive5.com/mus); MAFFT if available, see [mafft.cbrc.jp/alignment/software](http://mafft.cbrc.jp/alignment/software); or via SAM's *align2model* program) and tree-building functions are available to validate LABEL’s predictions (GTR+GAMMA, 1000 local support bootstraps, maximum-likelihood tree using FastTreeMP v2.1.4, see [*www.microbesonline.org/fasttree*](http://www.microbesonline.org/fasttree)).
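For reference, a polynomial kernel of degree $d$ has the general form shown here, with $d = 20$ as stated above. The offset $c$ and any normalization are not given in this README, so they are left symbolic. Treating the inputs $\mathbf{x}$ and $\mathbf{y}$ as vectors of HMM scores is an assumption based on the scoring step described above.

$$
K(\mathbf{x}, \mathbf{y}) = \left(\mathbf{x}^{\top}\mathbf{y} + c\right)^{d}, \qquad d = 20,\; c \ge 0
$$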
10+
11+
## BROADER IMPACT
12+
13+
Although we have only constructed modules for H5 and H9, LABEL's methodology need not be limited to influenza A or even just viral sequences. Given any phylogenetic tree with defined families or clades, one can train a LABEL module for automated lineage assignment. Training combines support scripts with manually applied expert knowledge.
14+
15+
## ACCURACY & PERFORMANCE
16+
17+
On full-length H5v2011 and H9v2011 sequences, LABEL achieved 100% accuracy on the tested datasets, and runtime scales linearly at about half a second per hemagglutinin sequence on a four-core machine. Full results are in pre-publication drafting and available upon request. Choosing alignment options may increase the runtime significantly; however, guide sequence libraries are never more than 200 sequences in size. For the best results using the alignment options, break down your query sequence file into smaller files.
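One way to break a large query file into smaller ones before using the alignment options is sketched below. `split_fasta` is a hypothetical helper, not part of LABEL; the chunk size and the `PREFIX_NNN.fa` naming are arbitrary choices for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with LABEL): split a multi-FASTA file
# into chunks holding at most PER_FILE sequences each.
# Usage: split_fasta <input.fasta> <PER_FILE> <output prefix>
split_fasta() {
    local fasta=$1 per_file=$2 prefix=$3
    awk -v n="$per_file" -v p="$prefix" '
        /^>/ {
            # Start a new chunk file every n headers.
            if (count % n == 0) {
                if (out) close(out)
                out = sprintf("%s_%03d.fa", p, count / n + 1)
            }
            count++
        }
        # Copy header and sequence lines into the current chunk.
        out { print > out }
    ' "$fasta"
}

# Example (file names are assumptions):
#   split_fasta gisaid_H5N1.fa 500 chunk
#   creates chunk_001.fa, chunk_002.fa, ... in the current directory
```

Each chunk can then be submitted to LABEL as its own project.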
18+
19+
## USAGE
20+
21+
```{bash}
Usage:
    LABEL [-P MAX_PROC] [-E C_OPT] [-W WRK_PATH|-O OUT_PATH] [-G|-TACRD|-S] [-L LIN_PATH] <nts.fasta> <project> <Module:H5,H9,etc.>
        -T  Do TRAINING again instead of using classifier files.
        -A  Do ALIGNMENT of re-annotated fasta file (sorted by clade) & build its ML tree.
        -C  Do CONTROL alignment & ML tree construction.
        -E  SGE clustering option. Use 1 or 2 for SGE with array jobs, else local.
        -R  No RECURSIVE prediction. Limits scope, useful with -L option.
        -D  No DELETION of extra intermediary files.
        -S  Show available protein modules.
        -W  Web-server mode: requires ABSOLUTE path to WRITABLE working directory.
        -O  Output directory path, do not use with web mode.
        -G  Create a scoring matrix using given header annotations for Graphing. (removed)

Example: ./LABEL -C gisaid_H5N1.fa Bird_Flu H5
```
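The synopsis above can be assembled programmatically when LABEL is called from another script. The sketch below only prints the command line it would run. `build_label_cmd` is a hypothetical wrapper, and reading `-P` as a cap on the number of processes is an assumption based on the `MAX_PROC` placeholder, since the option list does not describe it.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of LABEL): print a LABEL command line
# built from the documented positional arguments and two optional flags.
# Usage: build_label_cmd <nts.fasta> <project> <module> [out_dir] [max_proc]
build_label_cmd() {
    local fasta=$1 project=$2 module=$3 out_dir=${4:-} max_proc=${5:-}
    local -a cmd=(./LABEL)
    # -P MAX_PROC: assumed to limit the number of processes.
    if [ -n "$max_proc" ]; then cmd+=(-P "$max_proc"); fi
    # -O OUT_PATH: output directory; per the usage text, not for web mode (-W).
    if [ -n "$out_dir" ]; then cmd+=(-O "$out_dir"); fi
    cmd+=("$fasta" "$project" "$module")
    printf '%s\n' "${cmd[*]}"
}

# Example:
#   build_label_cmd gisaid_H5N1.fa Bird_Flu H5 results 4
```

Printing the command first makes it easy to inspect before executing it.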
36+
37+
## DATA
38+
39+
- LABEL takes FASTA-formatted nucleotide sequences. The FASTA may be single or multi-line and may contain any number of sequences. Sequences with redundant headers are removed (first read, first kept)! Commas and apostrophes are removed from headers, while internal spaces are replaced with underscores.
40+
41+
- LABEL generates re-annotated FASTA sequences, scoring data, Newick files, alignments, tab-delimited files, and miscellaneous text files. LABEL's output is limited to text and creates no binaries or images. LABEL's output is limited to a specified output directory (or to a default working directory within the package) and to the current working directory of the calling user.
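The header handling described in the first bullet can be approximated with a short awk filter, which is useful for previewing what LABEL will see. This is only a sketch: `clean_fasta_headers` is a hypothetical helper, not LABEL's own code, and the order of operations (clean the header first, then drop duplicates) is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not LABEL's implementation): approximate the
# header clean-up described above and write the result to stdout.
# Usage: clean_fasta_headers <input.fasta>
clean_fasta_headers() {
    awk -v q="'" '
        /^>/ {
            h = substr($0, 2)
            gsub(/,/, "", h)       # drop commas
            gsub(q, "", h)         # drop apostrophes
            gsub(/ /, "_", h)      # spaces become underscores
            keep = !(h in seen)    # first read, first kept
            seen[h] = 1
            if (keep) print ">" h
            next
        }
        # Sequence lines follow the fate of their header.
        keep { print }
    ' "$1"
}

# Example (file name is an assumption):
#   clean_fasta_headers gisaid_H5N1.fa > preview.fa
```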
42+
43+
## FILES GENERATED
44+
45+
| File                       | Type      | Description                                                                 |
| :------------------------- | :-------- | :-------------------------------------------------------------------------- |
| PROJ_final.tab             | Standard. | Tab-delimited headers & predicted clades.                                   |
| PROJ_final.txt             | Standard. | A prettier output of the above.                                             |
| LEVEL_trace.tab            | Standard. | Table of HMM scores at each level, suitable for visualization in R.         |
| LEVEL_result.tab           | Standard. | For the current prediction level, tab-delimited headers & predicted clades. |
| LEVEL_result.txt           | Standard. | For the current prediction level, a prettier output of the above.           |
| FASTA/                     | Standard. | Folder containing fasta files and newick trees.                             |
| FASTA/PROJ_predictions.fas | Standard. | Query sequence file with predictions added like: _{PRED:CLAD}               |
| FASTA/MOD_control.fasta    | Optional. | Alignment of predictions fasta file and guide sequences.                    |
| FASTA/MOD_control.nwk      | Optional. | Maximum likelihood tree of the above.                                       |
| FASTA/PROJ_reannotated.fas | Default.  | Query file with annotations replaced with predicted ones, ordered by clade. |
| FASTA/PROJ_ordered.fasta   | Optional. | Aligned version of the above, still ordered by clade.                       |
| FASTA/PROJ_tree.nwk        | Optional. | Maximum likelihood tree of the above.                                       |
| FASTA/PROJ_clade_CLAD.fas  | Standard. | The re-annotated file partitioned into separate clade files.                |
| c-*/                       | Standard. | Clade/lineage subfolder for the hierarchical predictions.                   |
61+
62+
*The project name is denoted "PROJ", the lineage or clade "CLAD", and the module of interest "MOD".*
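The tab-delimited outputs lend themselves to quick summaries on the command line. The sketch below assumes that `PROJ_final.tab` has the sequence header in column 1 and the predicted clade in column 2 with no title row, and that the `_{PRED:CLAD}` tag in `FASTA/PROJ_predictions.fas` is appended to the original header. Neither layout is spelled out in this README, and both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of LABEL) for summarizing its output.

# Count sequences per predicted clade.
# Assumes: column 1 = header, column 2 = clade, no title row.
count_clades() {
    awk -F'\t' 'NF >= 2 { n[$2]++ }
                END { for (c in n) printf "%s\t%d\n", c, n[c] }' "$1" |
        LC_ALL=C sort
}

# Print "original_header<TAB>clade" for each tagged header in a
# predictions FASTA. Assumes the tag looks like _{PRED:CLAD}.
extract_predictions() {
    awk '/^>/ && match($0, /_\{PRED:[^}]*\}/) {
        clade = substr($0, RSTART + 7, RLENGTH - 8)   # text between "PRED:" and "}"
        name  = substr($0, 2, RSTART - 2)             # header text before the tag
        print name "\t" clade
    }' "$1"
}

# Example (project name is an assumption):
#   count_clades Bird_Flu_final.tab
#   extract_predictions FASTA/Bird_Flu_predictions.fas
```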
63+
64+
## MODULES
65+
66+
LABEL modules are merely directories within the *LABEL\_RES/training\_data* folder and contain all associated pHMMs as well as SVM training data. Extensions such as *x-filter.txt* control against inappropriate data input. The guide tree for positive control (if desired) is listed as *MOD\_downsample.fa* for MAFFT/MUSCLE alignment or in the *x-control* folder for faster pHMM alignment. See website for more information or use: `./LABEL -S`
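Because modules are plain directories, they can also be listed without running LABEL. The sketch below assumes that each subdirectory of *LABEL\_RES/training\_data* corresponds to one module and that the directory name is the module name. `list_modules` is a hypothetical helper, and `./LABEL -S` remains the documented way to see what is available.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of LABEL): list the subdirectories of
# <LABEL_RES>/training_data, assuming one directory per module.
# Usage: list_modules <path to LABEL_RES>
list_modules() {
    local res_dir=$1 d
    for d in "$res_dir"/training_data/*/; do
        # Skip the unexpanded pattern when no directories exist.
        [ -d "$d" ] || continue
        basename "$d"
    done
}

# Example:
#   list_modules ./LABEL_RES
```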
67+
68+
## HARDWARE
69+
70+
We recommend a single multi-core machine with no fewer than 2 cores (8 or more threads work best) and at least 2 GB of RAM. LABEL runtime depends on the number of cores available on the machine. Use with Mac OS X requires a 64-bit chipset.
71+
72+
## SOFTWARE PRE-REQUISITES
73+
74+
See "QUICK_INSTALL.txt".
75+
76+
## PACKAGED SOFTWARE
77+
78+
- SHOGUN version 1.0.0 or later (tested 1.1.0)
    - Purpose: executes the SVM decision phase.
    - License: GPL v3
- MUSCLE 3.8 or later (tested 3.8.11)
    - Purpose: optionally align output or control
    - License: Public Domain
- FastTreeMP 2.1.4 or later
    - Purpose: optionally build trees
    - License: GPL (any)
- SAM version 3.5 or later
    - Purpose: build HMM profiles, score sequences for evaluation
    - License: Academic/Government, not-for-profit, redistributed with permission
- BASH scripts
    - Purpose: assist installation, main pipeline for LABEL
    - License: owner, GPL
- Perl scripts
    - Purpose: data manipulation and formatting; calls SHOGUN for SVM use.
    - License: owner, GPL.
96+
97+
## INSTALLATION
98+
99+
1) Unzip the archive containing LABEL.
2) Move the file “LABEL” and the directory “*LABEL\_RES*” to a directory listed in your PATH environment variable. Alternatively, add the directory containing LABEL and *LABEL\_RES* to your PATH.
3) Restart your terminal emulator. Note: *LABEL\_RES* and LABEL must be in the same folder.
4) LABEL is now installed. To test it, execute: `LABEL test.fa test_proj H9v2011`
5) The file “*test.fa*” is given in the deployment archive. To access LABEL without using the PATH variable, cd to your extracted directory & substitute “./LABEL” for “LABEL” above.
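For step 2, adding the extracted directory to PATH might look like the sketch below. `add_to_path` is a hypothetical helper and `$HOME/label` is a placeholder for wherever the archive was unzipped. To make the change permanent, the call would go in a shell start-up file such as `~/.bashrc`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of LABEL): prepend a directory to PATH
# unless it is already listed.
# Usage: add_to_path <directory containing LABEL and LABEL_RES>
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already on PATH, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Example (the install location is an assumption):
#   add_to_path "$HOME/label"
#   LABEL test.fa test_proj H9v2011
```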
104+
105+
## LICENSE
106+
107+
GPL version 3. This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
108+
109+
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
110+
111+
You should have received a copy of the GNU General Public License along with this program. If not, see [*www.gnu.org/licenses/*](http://www.gnu.org/licenses/).
112+
113+
## DISCLAIMER & LIMITATION OF LIABILITY
114+
115+
[SAM (align2model,hmmscore,modelfromalign)](http://compbio.soe.ucsc.edu/sam2src/) binaries may be used within LABEL for government and/or academic use only. Commercial use and redistribution for commercial use are excluded. Use of SAM implies this [license](http://compbio.soe.ucsc.edu/sam-lic/obj.0).
116+
117+
The materials embodied in this software are "as-is" and without warranty of any kind, express, implied or otherwise, including without limitation, any warranty of fitness for a particular purpose. In no event shall the Centers for Disease Control and Prevention (CDC) or the United States (U.S.) Government be liable to you or anyone else for any direct, special, incidental, indirect or consequential damages of any kind, or any damages whatsoever, including without limitation, loss of profit, loss of use, savings or revenue, or the claims of third parties, whether or not CDC or the U.S. Government has been advised of the possibility of such loss, however caused and on any theory of liability, arising out of or in connection with the possession, use or performance of this software.  In no event shall any other party who modifies and/or conveys the program as permitted according to GPL license [[*www.gnu.org/licenses/*](http://www.gnu.org/licenses/)], make CDC or the U.S. government liable for damages, including any general, special, incidental or consequential damages arising out of the use or inability to use the program, including but not limited to loss of data or data being rendered inaccurate or losses sustained by third parties or a failure of the program to operate with any other programs.  Any views, prepared by individuals as part of their official duties as United States government employees or as contractors of the United States government and expressed herein, do not necessarily represent the views of the United States government. Such individuals’ participation in any part of the associated work is not meant to serve as an official endorsement of the software. The CDC and the U.S. government shall not be held liable for damages resulting from any statements arising from use of or promotion of the software that may conflict with any official position of the United States government.
