single-cell Alternative sPlicing Analyzer (scAPA) is a bioinformatic program to detect and quantify alternative splicing events (ASEs) from 10x Genome single-cell RNAseq data. scAPA detects reads spanning exon-exon junctions within a gene, while discovering all types of ASEs without relying entirely on transcript annotation, quantifying ASEs on cell level based on barcodes, adopting the uneven read coverages, and incorporating unique molecular identifier (UMI) information to handle single-cell data (Figure 1d; see Methods). scAPA only considers ASEs supported by at least 10 junction reads, allowing both a reference and alternative AS isoform to be defined.
scAPA was designed to run under the command line of Linux environment compatible with Julia language and with at least 64G memory.
Running scAPA relies on Julia language compiler on Linux OS. First of all, the user needs to install Julia (v1.6+). A binary version of Julia program can be downloaded from the official website.
scAPA can be installed by running the below code in bash:
$ julia -e 'using Pkg; Pkg.add(PackageSpec(url="https://github.com/lambrechtslab/scAPA.git"))'
Or install under julia package environment (press key ]
in Julia REPL):
(v1.9) pkg> add https://github.com/lambrechtslab/scAPA.git
(v1.9) pkg> build scAPA
The dependent packages and the external tool StringTie
will be installed automatically. A softlink ~/scAPA
will be created at home directory and you can move it to where you want.
To uninstall scAPA, you need to remove the scAPA package in julia and clean the softlink.
$ julia -e 'using Pkg; Pkg.rm("scAPA")'
$ unlink ~/scAPA
Before analysis by scAPA, fastq files generated from 10x Genome scRNAseq platform should be mapped by cellranger
, and optionnaly but highly recommended cell types should be identified. Three types of files are needed for scAPA:
- The bam files output from
cellranger
. - A sample sheet tsv file with two columns: sample ID (unique and no blanks), and bam file path. See example in
test/10x/samplesheet.tsv
. - A genome annotation file (gtf). This file is suggested to use the same version as in the previous
cellranger
. For example, the Ensembl v93 human annotation gtf file can be downloaded from here..gz
compressed file is accepted. - (Optional but recommended) A tsv file for the cell group information, which should be composited by three columns: sample ID (identical to file #2), cell barcode (do not include
-1
at the end of barcode), and name of cell group (like cell type as an example). This file can be in compressed in .gz format. If this file is provided, scAPA will calculate PSI and counts for each cell groups. Considering read coverage on single-cell level is limited, appropriately grouping cells could largely facilitate downstream analysis. - (Optional) A de-nova assembled GTF file. If missing, scAPA will generate one based on the input bam files and annotation files using
StringTie
.
$ ~/scAPA -g ANNOTATION.gtf(.gz) -s SAMPLESHEET.tsv [-p THREAD] \
[-c CELL_BARCODE.tsv(.gz)] [-o OUTPUT-DIR] \
[--assembled-annotation ASSEMBLED_ANNOTATION.gtf(.gz)] \
[--stringtie STRINGTIE_PROGRAM] [--temp-dir TEMP_DIR]
If julia
is not in $PATH
environment, run the below command instead:
$ /path/to/julia ~/scAPA ...
-g
,--ref-annotation
ANNOTATION.gtf GTF files of public reference annotation.-s
,--samplesheet
SAMPLESHEET.tsv A tsv file with columns of sample id and bam files. Sample ID should be unique and without internal blanks.-c
,--cell-barcode
CELL_BARCODE.tsv(.gz) A tsv(.gz) file with columns of sample id, cell barcode and cell group.-o
,--output-dir
OUTPUT_DIR Directory of the outputs. scAPA will create one if not exists. (default: ".")-p
,--thread
THREAD Number of threads to run scAPA. An efficient thread number is no larger than the number of input bam files. (default: 1)
--assembled-annotation
ASSEMBLED_ANNOTATION.gtf GTF files of de-noval assembled annotation. If missing, scAPA will assemble it from bam files using StringTie.--stringtie
STRINGTIE_CMD Command ofStringTie
software. When missing, the internalStringTie
will be used.--temp-dir
TEMP_DIR Tempary directionary. (default: "_tmp")
- If
_tmp/
already exists, scAPA will consider it as an interrupted task and try to recover from the last calculation. If you want to recalculate from scratch, please remove this folder first.
detected_ASE_info.tsv
Features of each detected ASEs, including gene name, splicing type, inclusion/exclusion junction coordinates, etc. The second exon junctionE_junc_B
is only for mutiple-exclusive-exon (MXE).[SampleID]_count_per_ASE_perCell.tsv.gz
The junction read numbers per cell. For MXE, the columnE_junc_count
is the sum of two exclusion junction counts.[SampleID]_count_perASE_perCellGroup.tsv.gz
The junction read number and PSI per cell type. This file is only available when--cell-barcode
CELL_BARCODE.tsv(.gz) was given.
Below chart indicates in each of splicing types, which reads are counted as exclusion reads (E) and inclusion reads (I), and how the PSI value is calculated.
This demo shows how to use scAPA to detect and quantify ASEs on single-cell data of 1000 Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor, which is public available on the website of 10x Genomics as an example. For this data, libraries were generated with 10x Genomics 3' Single Cell Gene Expression v3 chemistry and data were pre-analyzed with Cell Ranger version 3.0 for the aligned bam file and the cell clustering result.
First, download the example bam file of 10x single-cell data (4.5G) into working directory:
$ wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_possorted_genome_bam.bam
And prepare a sample sheet file:
$ echo -e "sample1\tpbmc_1k_v3_possorted_genome_bam.bam" > samplesheet.tsv
Next, download human genome annotation file into working directory:
$ wget https://ftp.ensembl.org/pub/release-93/gtf/homo_sapiens/Homo_sapiens.GRCh38.93.gtf.gz
$ gunzip Homo_sapiens.GRCh38.93.gtf.gz
Then we download the cell clustering result and use the major clusters as cell groups:
$ wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_analysis.tar.gz
$ tar zxf pbmc_1k_v3_analysis.tar.gz analysis/clustering/graphclust/clusters.csv
$ cat analysis/clustering/graphclust/clusters.csv | sed 1d | awk -F'-1,' '{print "sample1\t"$1"\tcellgroup"$2}' > cell_barcode.tsv
At last, run scAPA. It takes approximately one hour.
$ ~/scAPA -g Homo_sapiens.GRCh38.93.gtf -s samplesheet.tsv -c cell_barcode.tsv -o scAPA_output
The outputs will be in folder scAPA_output/
.
Jieyi Xiong (jieyi.xiong[at]kuleuven.be); Diether Lambrechts (diether.lambrechts[at]kuleuven.be)