SerotypeR performs pneumococcus serotyping based on the PneumoCaT and SeroBA libraries (Kapatai 2016, Epping 2018) and Group B Streptococcus serotyping (Kapatai 2017).
Using BLAST to interrogate assembled genomes, SerotypeR interprets SNPs across whole CPS regions to resolve isolates to the serotype level. There are four stages of locus analysis:
1. Determine the presence or absence of the locus
2. Determine whether the gene is intact or a pseudogene
3. Identify serotype-determining amino acid substitutions
4. Determine the allele based on a match to the entire gene sequence of conserved serotype-determining genes
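To make stage 3 concrete, here is a minimal sketch in base R of how substitutions between a reference and a query protein could be reported. The function name `find_substitutions` and the toy sequences are hypothetical; this is an illustration under the assumption that the two sequences are already aligned and of equal length, not SerotypeR's own implementation.

```r
# Hypothetical helper: list amino acid substitutions between two aligned,
# equal-length protein sequences (reference vs. query).
find_substitutions <- function(ref_aa, query_aa) {
  stopifnot(nchar(ref_aa) == nchar(query_aa))
  ref <- strsplit(ref_aa, "")[[1]]
  qry <- strsplit(query_aa, "")[[1]]
  pos <- which(ref != qry)
  # Format each difference as <reference residue><position><query residue>
  paste0(ref[pos], pos, qry[pos])
}

# Toy sequences, not real serotype-determining genes
subs <- find_substitutions("MKTLLVA", "MKSLLVG")
print(subs)
```

A lookup table of known serotype-determining substitutions could then be matched against the returned labels.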
This tool can be run using RStudio (available at https://www.rstudio.com/).
This tool requires the R packages plyr, dplyr, tidyverse, tidyselect, stringr and Biostrings, which can be loaded using:
library(plyr)
library(dplyr)
library(tidyverse)
library(tidyselect)
library(stringr)
library(Biostrings)
SerotypeR also requires the BLAST+ executable from NCBI: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download
- Install R from https://www.r-project.org/
- Install RStudio from https://www.rstudio.com
- Install required packages
  install.packages("plyr")
  install.packages("dplyr")
  install.packages("tidyverse")
  install.packages("tidyselect")
  install.packages("stringr")
- Install Biostrings (https://bioconductor.org/install/)
  if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
  BiocManager::install(c("ggtree", "Biostrings"))
- Install the BLAST+ executable from NCBI: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download
- Ensure SerotypeR is located in a folder with the following subfolders:
  a. allele_lkup_dna - contains reference FASTA files
  b. output - results can be found here after running SerotypeR
  c. reference - contains loci lists and lookup tables for SNP-based serotyping
  d. temp - storage of temporary files generated by this program
  e. wildgenes - contains reference FASTA files for SNP-based serotyping
- Set the working directory to the folder where SerotypeR is located.
  line 20: curr_work_dir <- "C:\\SerotypeR\\"
- This molecular analysis tool queries pre-assembled FASTA files. Assign the folder containing the contig files to ContigsDir; each contig file must have the extension ".fasta" (e.g. MySampleNo_contig.fasta).
  line 21: ContigsDir <- "C:\\SerotypeR\\contigs\\"
- Set the organism type to "PNEUMO" or "GBS".
  line 19: Org_id <- "PNEUMO"
- To use the multiple sample list option, a sample list file must be located in the SerotypeR working directory (e.g. C:/SerotypeR/list.csv).
  list.csv must have the following structure:

| SampleNo | Variable |
|----------|----------|
| 12345 | Quellung_23F |
| 12346 | Quellung_3 |

The "Variable" column can hold anything you wish to include in the outputs.
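Before a first run it can help to confirm that the setup steps above are in place. The following is a minimal pre-flight sketch in base R, not part of SerotypeR itself. The function name `preflight` is hypothetical, checking for `blastn` alongside `makeblastdb` is an assumption about which BLAST+ programs are called, and the sample list is assumed to be comma-separated.

```r
# Hypothetical pre-flight helper; returns a character vector of problems found.
preflight <- function(work_dir, contigs_dir, sample_list = NULL) {
  problems <- character(0)

  # R packages named in this README
  pkgs <- c("plyr", "dplyr", "tidyverse", "tidyselect", "stringr", "Biostrings")
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  if (any(!ok)) problems <- c(problems, paste("missing package:", pkgs[!ok]))

  # BLAST+ programs expected on the PATH (blastn is an assumption)
  tools <- Sys.which(c("makeblastdb", "blastn"))
  if (any(tools == "")) {
    problems <- c(problems, paste("not on PATH:", names(tools)[tools == ""]))
  }

  # Subfolders listed in this README
  subdirs <- c("allele_lkup_dna", "output", "reference", "temp", "wildgenes")
  found <- dir.exists(file.path(work_dir, subdirs))
  if (any(!found)) {
    problems <- c(problems, paste("missing subfolder:", subdirs[!found]))
  }

  # Contig files must end in .fasta
  contigs <- list.files(contigs_dir, pattern = "\\.fasta$")
  if (length(contigs) == 0) {
    problems <- c(problems, paste("no .fasta files found in", contigs_dir))
  }

  # Optional sample list with SampleNo and Variable columns (assumed comma-separated)
  if (!is.null(sample_list)) {
    if (!file.exists(sample_list)) {
      problems <- c(problems, paste("sample list not found:", sample_list))
    } else {
      cols <- names(read.csv(sample_list, nrows = 1, check.names = FALSE))
      absent <- setdiff(c("SampleNo", "Variable"), cols)
      if (length(absent) > 0) {
        problems <- c(problems, paste("sample list lacks column:", absent))
      }
    }
  }
  problems
}

# Example paths taken from this README; adjust to your own installation
issues <- preflight("C:\\SerotypeR\\", "C:\\SerotypeR\\contigs\\", "C:/SerotypeR/list.csv")
if (length(issues) > 0) print(issues)
```

An empty result means none of these checks failed; it does not guarantee that the reference files inside each subfolder are complete.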
When running this program on some Windows machines, the makeblastdb program can give an error. If this happens, change the environment variable settings as follows:
- Go to Windows Settings and search for "Environment Variables"
- In the System Properties dialogue box, click on the "Environment Variables" button
- In the "User Variables for..." box, click the "New..." button
- Input the following:
  Variable Name: BLASTDB_LMDB_MAP_SIZE
  Variable Value: 1000000
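As a possible alternative to the Windows dialogue, the same variable can be set for the current R session only. This sketch assumes that SerotypeR launches makeblastdb from the same R session, so that the child process inherits the variable; it has not been verified against SerotypeR and does not persist after R is closed.

```r
# Set the variable for this R session and any programs it launches
Sys.setenv(BLASTDB_LMDB_MAP_SIZE = "1000000")

# Confirm the value that child processes would inherit
map_size <- Sys.getenv("BLASTDB_LMDB_MAP_SIZE")
print(map_size)
```

Run these lines before starting SerotypeR in the same session.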
Copyright Government of Canada 2022.
Written by: National Microbiology Laboratory, Public Health Agency of Canada.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Walter Demczuk: Walter.Demczuk@phac-aspc.gc.ca
Shelley Peterson: Shelley.Peterson@phac-aspc.gc.ca