update lineage section with CLI; fix #60
tavareshugo committed Feb 1, 2024
1 parent 579653b commit 4155e98
Showing 2 changed files with 51 additions and 21 deletions.
8 changes: 4 additions & 4 deletions materials/02-isolates/02-qc.md
@@ -223,7 +223,7 @@ This should give us 7 as the result (which makes sense, since we have 7 samples)
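The counting command itself is collapsed in this diff view, but one common way to count FASTA records is to count the header lines, which each begin with `>` (a sketch, assuming the consensus files live at the path used below):

```shell
# Count the number of sequences across all consensus FASTA files by
# counting header lines (each sequence has exactly one '>' header).
cat results/viralrecon/variants/ivar/consensus/bcftools/*.fa | grep -c ">"
```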
We can now proceed with cleaning the names of the sequences, by using `sed`:

```bash
-cat results/viralrecon/variants/ivar/consensus/bcftools/*.fa | sed 's| MN908947.3||' > results/viralrecon/clean_sequences.fa
cat results/viralrecon/variants/ivar/consensus/bcftools/*.fa | sed 's| MN908947.3||' > results/clean_sequences.fa
```

Notice that in this last command we make sure to redirect the result to a new file using `>`.
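To see what the substitution does, here is the same `sed` expression applied to a single made-up header line (the sample name `SAMPLE1` is hypothetical):

```shell
# The pattern ' MN908947.3' (note the leading space) is deleted from the
# sequence name; '|' is used as the sed delimiter instead of the usual '/'.
echo ">SAMPLE1 MN908947.3" | sed 's| MN908947.3||'
# prints: >SAMPLE1
```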
@@ -247,7 +247,7 @@ To obtain a list of missing intervals in a sequence, we can use the software `seqkit`.
In particular, we can use the following command:

```bash
-seqkit locate -i -P -G -M -r -p "N+" results/viralrecon/clean_sequences.fa
seqkit locate -i -P -G -M -r -p "N+" results/clean_sequences.fa
```

```
@@ -401,15 +401,15 @@ The following `sed` command can be used to _substitute_ the text "/ARTIC/medaka MN908947.3":
sed 's|/ARTIC/medaka MN908947.3||'
```

-- Pipe the tools `cat` and `sed` to construct a command that generates a new file called `results/consensus/clean_sequences.fa` containing all the sequences with "clean" sequence names.
- Pipe the tools `cat` and `sed` to construct a command that generates a new file called `results/clean_sequences.fa` containing all the sequences with "clean" sequence names.


:::{.callout-answer}

The complete code to achieve the desired outcome is:

```bash
-cat results/viralrecon/medaka/*.consensus.fasta | sed 's|/ARTIC/medaka MN908947.3||' > results/viralrecon/clean_sequences.fa
cat results/viralrecon/medaka/*.consensus.fasta | sed 's|/ARTIC/medaka MN908947.3||' > results/clean_sequences.fa
```

Look at our [companion Unix course materials](https://cambiotraining.github.io/unix-shell/materials/04-misc/01-sed.html) for more information about how the `sed` command works.
64 changes: 47 additions & 17 deletions materials/02-isolates/03-lineages.md
@@ -89,24 +89,42 @@ The main steps performed by this tool are:
- _UShER_ uses a more classic parsimony-based method, but highly optimised for working with large numbers of sequences.
- Classifying our sequences according to the WHO nomenclature of variants of interest/concern using the `scorpio` software.

-_Pangolin_ runs as part of the `nf-core/viralrecon` pipeline we used.
-The model to classify variants may change over time as more public sequences become available and nomenclature rules are updated.
-Therefore, it important to know which version was used, as the results may change in the future if we were to run it again.
-This information is provided in the `multiqc_report.html` generated by the pipeline, under the section "nf-core/viralrecon Software Versions".
Although _Pangolin_ can run as part of the `nf-core/viralrecon` pipeline we used, we recommend turning this option off.
The reason is that the model used to classify variants changes regularly over time, as more public sequences become available and the nomenclature rules are updated.
Therefore, it is important to always run the samples through the latest _Pangolin_ version available.

-The detailed results of the _Pangolin_ analysis are detailed in the `multiqc_report.html` file under the section named "Pangolin".
-The table contains information about the lineage assignment analysis from _Pangolin_, including WHO variants of concern identified using the _Scorpio_ software.
-A detailed explanation of the columns of this file is given in the [Pangolin documentation page](https://cov-lineages.org/resources/pangolin/output.html).
_Pangolin_ can be run from the command line, using two steps:

-![Snapshot of the **Pangolin** section of the `nf-core/viralrecon` _MultiQC_ report. The table shows which **Pango lineage** each sample was assigned to, the **WHO variant** classification ("S Call" column, which stands for _Scorpio_ call, the software used to do this classification) and several quality metrics. An explanation of these columns is given in the <kbd><i class="fa-solid fa-circle-question"></i> Help</kbd> button on the top-right of the table or on the [Pangolin documentation page](https://cov-lineages.org/resources/pangolin/output.html). Any samples that fail basic QC checks from Pangolin are marked as "Fail" on the "**QC Status**" column.](images/variants_pangolin_multiqc.png)
- Updating the data used for lineage/variant classification.
- Running the actual lineage assignment step.

On our example data, these would be the commands:

```bash
# update pangolin data
pangolin --update-data

# run pangolin
pangolin --outdir results/pangolin/ --outfile pango_report.csv results/clean_sequences.fa
```

- The first command downloads the latest version of the lineages and their characteristic mutations from the Pango classification system.
- The second command runs the FASTA file of consensus genomes through _Pangolin_'s classification algorithm.
- `--outdir` is used to define the directory to save the results in.
- `--outfile` is used to give the file a name of our choice.

The output is a CSV file, with several columns of interest, including WHO variants of concern identified using the _Scorpio_ software.
A detailed explanation of the columns of this file is given in the [_Pangolin_ documentation page](https://cov-lineages.org/resources/pangolin/output.html).
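Once the report is generated, standard Unix tools give a quick overview of the assignments. A minimal sketch, assuming the report is at the path used above and that the second CSV column is `lineage` (as in recent _Pangolin_ versions); check the header line of your own file first:

```shell
# Tally how many samples were assigned to each lineage.
# Column 2 is assumed to be 'lineage'; 'tail -n +2' skips the header row,
# and the final sort puts the most common lineages first.
cut -d"," -f2 results/pangolin/pango_report.csv | tail -n +2 | sort | uniq -c | sort -rn
```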

More information about running _Pangolin_ from the command line can be found in [its online documentation](https://cov-lineages.org/resources/pangolin.html).


### Web Application

This tool can also be run separately using a [web application](https://pangolin.cog-uk.io/), which only requires us to provide a FASTA file of consensus sequences.
This may be desirable to re-run samples using the latest version of the _Pangolin_ software and SARS-CoV-2 variant databases.

-The results from the web application can be downloaded as a CSV file, which contains a table similar to the one obtained from our pipeline.
The results from the web application can be downloaded as a CSV file, which contains a table similar to the one obtained from our command above (some of the column names are different, but their meaning is the same).


## Nextclade
@@ -124,20 +142,32 @@ _Nextclade_ performs similar steps to _Pangolin_, with some differences in the a
You can find more details about _Nextclade_'s methods on [its documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/user/algorithm/index.html).
_Nextclade_ also provides several quality control metrics, which are very useful to identify problematic samples.

-_Nextclade_ also runs as part of the `nf-core/viralrecon` pipeline we used.
-The clade assignment can be seen in the `multiqc_report.html` under the first section of the report named "Variant calling metrics".
As we discussed above, the models and clades are regularly updated, so we also skipped this step when we ran the `nf-core/viralrecon` pipeline.
Instead, we can run this tool directly from the command line, by first making sure to download the most up-to-date clade information.
Here are the commands:

```bash
# get nextclade data
nextclade dataset get --name sars-cov-2 --output-dir resources/nextclade_background_data

# run nextclade
nextclade run --input-dataset resources/nextclade_background_data/ --output-all results/nextclade results/clean_sequences.fa
```

- The first command (`nextclade dataset get`) downloads the latest version of the Nextclade dataset for SARS-CoV-2 (option `--name sars-cov-2`).
We define the directory to store this information with `--output-dir`.
- The next step is to run the actual clade assignment on our data.
We use the dataset from the previous step as `--input-dataset`, define a directory to output all the results with `--output-all`, and give our clean FASTA consensus genomes as input at the end of the command.
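After the run, the clade calls can be inspected with standard Unix tools. A minimal sketch, assuming `--output-all` wrote a tab-separated `nextclade.tsv` into the output directory (the default behaviour in recent _Nextclade_ versions) with `seqName` and `clade` columns:

```shell
# Print the clade assigned to each sample.
# The awk script reads the header row to find the column positions,
# so it does not depend on the exact column order of the file.
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) col[$i]=i; next}
            {print $col["seqName"], $col["clade"]}' results/nextclade/nextclade.tsv
```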

-However, _Nextclade_ provides more information that can be useful to analyse our consensus sequences, including an evaluation of the quality of our sequences based on how many missing bases ('N') we have and the mutations that are present in the sequence.
-These results are provided as a series of CSV files by the pipeline.
-However, _Nextclade_ provides a web application that makes this analysis easier.
More information about running _Nextclade_ from the command line can be found in [its online documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/user/nextclade-cli/index.html).

### Web Application

_Nextclade_ offers an interactive application, which can be used to run its analysis on a FASTA file with sequences:

- Go to [nextclade.org](https://clades.nextstrain.org/).
- Select **SARS-CoV-2** and click **Next**.
-- Click **Select a file** to browse your computer and upload the FASTA file with the cleaned consensus sequences (`results/clean_consensus_sequences.fa`).
- Click **Select a file** to browse your computer and upload the FASTA file with the cleaned consensus sequences (`results/clean_sequences.fa`).
- _Nextclade_ will automatically detect your data are from SARS-CoV-2, but if not you can select this organism.
- Click **Run**.

Nextclade will show the progress of its analysis at the top of the page, and the results of several quality control metrics in the main panel (see Figure).
Expand All @@ -158,7 +188,7 @@ When using the _Nextclade_ web application, the data does not leave your compute

In this exercise we will work with 48 consensus sequences from the UK, processed with the `nf-core/viralrecon` pipeline and covered in the [previous section](01-consensus.md).

-Go to [nextclade.org](https://clades.nextstrain.org/) and load the sequences provided in `uk_illumina/preprocessed/clean_consensus_sequences.fa`.
Go to [nextclade.org](https://clades.nextstrain.org/) and load the sequences provided in `uk_illumina/preprocessed/clean_sequences.fa`.

1. Are there any samples that were classified as "bad" quality? If so, what is the main reason?
2. Sort the table by the "Clade" column. Looking at the mutations in gene S on the right, you can see that all sequences classified as "Alpha" have a deletion in positions 21992-21994. Samples classified as "Delta" do not have this deletion and instead have a deletion in positions 22029-22034. However, there is one exception: sample GB39, classified as "Delta" has both deletions. Investigate if this mutation is accurate using IGV: