diff --git a/docs/home_contents.html b/docs/home_contents.html
index ea0e9c12..81f9e258 100644
--- a/docs/home_contents.html
+++ b/docs/home_contents.html
@@ -120,6 +120,11 @@
 Info
+
diff --git a/docs/home_info.html b/docs/home_info.html
index d26673b0..a55c2794 100644
--- a/docs/home_info.html
+++ b/docs/home_info.html
@@ -135,6 +135,11 @@
 Info
+
@@ -457,7 +462,7 @@

Contact

-
+
diff --git a/docs/home_precourse.html b/docs/home_precourse.html
index 8aa6767c..0a204619 100644
--- a/docs/home_precourse.html
+++ b/docs/home_precourse.html
@@ -122,6 +122,11 @@
 Info
+
diff --git a/docs/home_schedule.html b/docs/home_schedule.html
index 872858b3..f4e31af2 100644
--- a/docs/home_schedule.html
+++ b/docs/home_schedule.html
@@ -123,6 +123,11 @@
 Info
+
diff --git a/docs/home_syllabus.html b/docs/home_syllabus.html
index 67b1ae20..8da6ace5 100644
--- a/docs/home_syllabus.html
+++ b/docs/home_syllabus.html
@@ -117,6 +117,11 @@
 Info
+
diff --git a/docs/index.html b/docs/index.html
index b3b9f939..c6d8e458 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -114,6 +114,11 @@
 Info
+
@@ -165,7 +170,7 @@

Single Cell R
-

Updated: 18-01-2024 at 17:22:21.

+

Updated: 23-01-2024 at 11:35:27.

diff --git a/docs/labs/bioc/bioc_01_qc.html b/docs/labs/bioc/bioc_01_qc.html
index c7de2761..c1a828cd 100644
--- a/docs/labs/bioc/bioc_01_qc.html
+++ b/docs/labs/bioc/bioc_01_qc.html
@@ -165,6 +165,11 @@
 Info
+
@@ -202,7 +207,7 @@

Published
-

16-Jan-2024

+

23-Jan-2024

@@ -330,7 +335,7 @@

2 Collate

-

We can now load the expression matrices and merge them into a single object. Each analysis workflow (Seurat, Scater, Scanpy, etc) has its own way of storing data. We will add dataset labels as cell.ids just in case you have overlapping barcodes between the datasets. After that we add a column Chemistry in the metadata for plotting later on.

+

We can now merge the objects into a single object. Each analysis workflow (Seurat, Scater, Scanpy, etc.) has its own way of storing data. We will add dataset labels as cell.ids in case barcodes overlap between the datasets. After that, we add a column type to the metadata to define covid and ctrl samples.

sce <- SingleCellExperiment(assays = list(counts = cbind(cov.1, cov.15, cov.17, ctrl.5, ctrl.13, ctrl.14)))
 dim(sce)
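
Reviewer note: the barcode-labelling step described in this paragraph can be sketched in base R. This is a minimal illustration on toy data, not the lab's code. The matrices `m1` and `m2`, the sample labels, and the underscore separator are all hypothetical.

```r
# Two toy count matrices that share the barcode "AAAC-1" (hypothetical data)
genes <- c("geneA", "geneB")
m1 <- matrix(1:4, nrow = 2, dimnames = list(genes, c("AAAC-1", "AAAG-1")))
m2 <- matrix(5:8, nrow = 2, dimnames = list(genes, c("AAAC-1", "AAAT-1")))
samples <- list(covid_1 = m1, ctrl_5 = m2)

# Prefix every barcode with its sample label so that cell ids stay unique
for (s in names(samples)) {
  colnames(samples[[s]]) <- paste(s, colnames(samples[[s]]), sep = "_")
}

# Merge column-wise; unname() keeps cbind from using the list names
merged <- do.call(cbind, unname(samples))

# Recover per-cell metadata from the prefix
sample_id <- sub("_[^_]+$", "", colnames(merged))  # e.g. "covid_1"
type <- sub("_.*$", "", sample_id)                 # e.g. "covid"

print(colnames(merged))
print(table(type))
```

Without the prefix, the two `AAAC-1` columns would be indistinguishable after merging. With it, the sample and type of each cell can be derived from the column name alone.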
@@ -349,8 +354,8 @@

gc()

           used  (Mb) gc trigger  (Mb) max used  (Mb)
-Ncells 10216587 545.7   17147474 915.8 13915408 743.2
-Vcells 44612623 340.4   94446392 720.6 83350999 636.0
+Ncells 10216383 545.7   17147170 915.8 13915194 743.2
+Vcells 44612100 340.4   94440822 720.6 83350476 636.0

Here is what the count matrix and the metadata look like for every cell.

@@ -475,7 +480,7 @@

-

As you can see, there is quite some difference in quality for the 4 datasets, with for instance the covid_15 sample having fewer cells with many detected genes and more mitochondrial content. As the ribosomal proteins are highly expressed they will make up a larger proportion of the transcriptional landscape when fewer of the lowly expressed genes are detected. And we can plot the different QC-measures as scatter plots.

+

As you can see, there is quite some difference in quality for the 4 datasets, with, for instance, the covid_15 and covid_16 samples having fewer cells with many detected genes and more mitochondrial content. As the ribosomal proteins are highly expressed, they will make up a larger proportion of the transcriptional landscape when fewer of the lowly expressed genes are detected. We can also plot the different QC measures as scatter plots.

plotColData(sce, x = "total", y = "detected", colour_by = "sample")
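
Reviewer note: the point about ribosomal proportions can be checked on a toy example. The sketch below is base R on an invented two-cell matrix, not output from the lab. The `^MT-` prefix for mitochondrial genes is an assumption (a common convention for human gene symbols), while `^RP[SL]` is the pattern used later in this file.

```r
# Toy counts: cell2 is a low-quality cell in which the lowly expressed
# genes (CD3D, LYZ) were not detected at all (hypothetical numbers)
counts <- matrix(
  c(5, 20, 20, 30, 25,   # cell1
    10, 20, 20, 0, 0),   # cell2
  nrow = 5,
  dimnames = list(c("MT-CO1", "RPS6", "RPL13", "CD3D", "LYZ"),
                  c("cell1", "cell2"))
)

is_mito <- grepl("^MT-", rownames(counts))      # assumed naming convention
is_ribo <- grepl("^RP[SL]", rownames(counts))

total    <- colSums(counts)
detected <- colSums(counts > 0)
pct_mito <- 100 * colSums(counts[is_mito, , drop = FALSE]) / total
pct_ribo <- 100 * colSums(counts[is_ribo, , drop = FALSE]) / total

print(data.frame(total, detected, pct_mito, pct_ribo))
```

Both cells carry the same ribosomal counts, but the ribosomal share doubles in the cell with fewer detected genes because the denominator shrinks.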
@@ -594,7 +599,7 @@

5.4 Filter genes

-

As the level of expression of mitochondrial and MALAT1 genes are judged as mainly technical, it can be wise to remove them from the dataset before any further analysis.

+

As the expression levels of mitochondrial and MALAT1 genes are judged to be mainly technical, it can be wise to remove them from the dataset before any further analysis. In this case we will also remove the HB genes.

dim(sce.filt)
@@ -610,11 +615,11 @@

# sce.filt <- sce.filt[ ! grepl("^RP[SL]", rownames(sce.filt)), ]
 # Filter Hemoglobin gene
-sce.filt <- sce.filt[!grepl("^HB[^(P)]", rownames(sce.filt)), ]
+sce.filt <- sce.filt[!grepl("^HB[^(PES)]", rownames(sce.filt)), ]
 dim(sce.filt)

-
[1] 18183  6023
+
[1] 18186  6023
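
Reviewer note: the effect of the pattern change can be seen on a handful of gene symbols. This is a base R sketch. The symbols are real human gene names chosen for illustration, and whether each one is present in this dataset is an assumption. Note that the parentheses inside `[^...]` are literal characters in a bracket expression, not a group.

```r
# Illustrative gene symbols (presence in the lab dataset is assumed)
genes <- c("HBA1", "HBB", "HBP1", "HBEGF", "HBS1L", "HBE1", "CD3D")

old_hit <- grepl("^HB[^(P)]", genes)    # pattern before this change
new_hit <- grepl("^HB[^(PES)]", genes)  # pattern after this change

print(data.frame(gene = genes, removed_before = old_hit, removed_after = new_hit))

# Genes that the old pattern removed but the new pattern keeps
kept_now <- genes[old_hit & !new_hit]
print(kept_now)
```

Symbols whose third letter is E or S are no longer matched, so they stay in the object. That is consistent with the row count rising from 18183 to 18186, although which three genes account for the difference is not shown in this diff.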
@@ -636,7 +641,7 @@

"description", "gene_biotype", "chromosome_name", "start_position" ), mart = mart, useCache = F)) -write.csv(genes_table, file = "data/results/genes_table.csv")

+write.csv(genes_table, file = "data/covid/results/genes_table.csv")
genes_file <- file.path(path_results, "genes_table.csv")
@@ -679,7 +684,19 @@ 

+
+
+ +
+
+Discuss +
+
+
+

Here, we can clearly see that we have three males and five females. Can you see which samples they are? Do you think this will cause any problems for downstream analysis? Discuss with your group: what would be the best way to deal with this type of sex bias?

+
+

7 Cell cycle state

@@ -790,7 +807,7 @@

sce.filt <- sce.filt[, sce.filt$scDblFinder.score < 2]
 dim(sce.filt)

-
[1] 18183  6023
+
[1] 18186  6023
@@ -1173,7 +1190,7 @@

Published with Quarto v1.3.450 - + diff --git a/docs/labs/bioc/bioc_01_qc_files/figure-html/unnamed-chunk-22-1.png b/docs/labs/bioc/bioc_01_qc_files/figure-html/unnamed-chunk-22-1.png index 4bc1b671..cfef237e 100644 Binary files a/docs/labs/bioc/bioc_01_qc_files/figure-html/unnamed-chunk-22-1.png and b/docs/labs/bioc/bioc_01_qc_files/figure-html/unnamed-chunk-22-1.png differ diff --git a/docs/labs/bioc/bioc_01_qc_files/figure-html/unnamed-chunk-23-1.png b/docs/labs/bioc/bioc_01_qc_files/figure-html/unnamed-chunk-23-1.png index 9da57a1c..0f0cf6f3 100644 Binary files a/docs/labs/bioc/bioc_01_qc_files/figure-html/unnamed-chunk-23-1.png and b/docs/labs/bioc/bioc_01_qc_files/figure-html/unnamed-chunk-23-1.png differ diff --git a/docs/labs/bioc/bioc_01_qc_files/figure-html/unnamed-chunk-25-1.png b/docs/labs/bioc/bioc_01_qc_files/figure-html/unnamed-chunk-25-1.png index cf819ce7..1049d2e4 100644 Binary files a/docs/labs/bioc/bioc_01_qc_files/figure-html/unnamed-chunk-25-1.png and b/docs/labs/bioc/bioc_01_qc_files/figure-html/unnamed-chunk-25-1.png differ diff --git a/docs/labs/bioc/bioc_01_qc_files/figure-html/unnamed-chunk-28-1.png b/docs/labs/bioc/bioc_01_qc_files/figure-html/unnamed-chunk-28-1.png index 2434a502..ec902d96 100644 Binary files a/docs/labs/bioc/bioc_01_qc_files/figure-html/unnamed-chunk-28-1.png and b/docs/labs/bioc/bioc_01_qc_files/figure-html/unnamed-chunk-28-1.png differ diff --git a/docs/labs/bioc/bioc_02_dimred.html b/docs/labs/bioc/bioc_02_dimred.html index bbae2d93..7b5aadaa 100644 --- a/docs/labs/bioc/bioc_02_dimred.html +++ b/docs/labs/bioc/bioc_02_dimred.html @@ -165,6 +165,11 @@ Info +
@@ -905,7 +910,7 @@

Published with Quarto v1.3.450 - + diff --git a/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-10-1.png b/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-10-1.png index 2b2e511f..46f62f61 100644 Binary files a/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-10-1.png and b/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-10-1.png differ diff --git a/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-13-1.png b/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-13-1.png index b929325e..d2a371d3 100644 Binary files a/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-13-1.png and b/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-13-1.png differ diff --git a/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-16-1.png b/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-16-1.png index 4a1f81c9..851c1cbc 100644 Binary files a/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-16-1.png and b/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-16-1.png differ diff --git a/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-17-1.png b/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-17-1.png index 2d37a72b..7d126c2b 100644 Binary files a/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-17-1.png and b/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-17-1.png differ diff --git a/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-3-1.png b/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-3-1.png index bc5a74c1..45ffb9fc 100644 Binary files a/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-3-1.png and b/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-3-1.png differ diff --git a/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-6-1.png b/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-6-1.png index 
6d014f9d..6f0c0d43 100644 Binary files a/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-6-1.png and b/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-6-1.png differ diff --git a/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-7-1.png b/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-7-1.png index 893fd592..34d47fd9 100644 Binary files a/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-7-1.png and b/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-7-1.png differ diff --git a/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-8-1.png b/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-8-1.png index 850ba350..bbdb61d1 100644 Binary files a/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-8-1.png and b/docs/labs/bioc/bioc_02_dimred_files/figure-html/unnamed-chunk-8-1.png differ diff --git a/docs/labs/bioc/bioc_03_integration.html b/docs/labs/bioc/bioc_03_integration.html index f8823e52..977291c6 100644 --- a/docs/labs/bioc/bioc_03_integration.html +++ b/docs/labs/bioc/bioc_03_integration.html @@ -165,6 +165,11 @@ Info +
@@ -509,22 +514,22 @@

lapply(scelist, dim)
[[1]]
-[1] 923 454
+[1] 923 500
 
 [[2]]
-[1] 611 454
+[1] 611 500
 
 [[3]]
-[1] 1111  454
+[1] 1111  500
 
 [[4]]
-[1] 1067  454
+[1] 1067  500
 
 [[5]]
-[1] 1203  454
+[1] 1203  500
 
 [[6]]
-[1] 1108  454
+[1] 1108  500

INTEG_R5:

@@ -901,7 +906,7 @@

Published with Quarto v1.3.450 - + diff --git a/docs/labs/bioc/bioc_03_integration_files/figure-html/unnamed-chunk-10-1.png b/docs/labs/bioc/bioc_03_integration_files/figure-html/unnamed-chunk-10-1.png index 80e1916d..a2603678 100644 Binary files a/docs/labs/bioc/bioc_03_integration_files/figure-html/unnamed-chunk-10-1.png and b/docs/labs/bioc/bioc_03_integration_files/figure-html/unnamed-chunk-10-1.png differ diff --git a/docs/labs/bioc/bioc_03_integration_files/figure-html/unnamed-chunk-4-1.png b/docs/labs/bioc/bioc_03_integration_files/figure-html/unnamed-chunk-4-1.png index 9e2f0dde..9cbdeabe 100644 Binary files a/docs/labs/bioc/bioc_03_integration_files/figure-html/unnamed-chunk-4-1.png and b/docs/labs/bioc/bioc_03_integration_files/figure-html/unnamed-chunk-4-1.png differ diff --git a/docs/labs/bioc/bioc_03_integration_files/figure-html/unnamed-chunk-9-1.png b/docs/labs/bioc/bioc_03_integration_files/figure-html/unnamed-chunk-9-1.png index 03298e09..fd0c6aab 100644 Binary files a/docs/labs/bioc/bioc_03_integration_files/figure-html/unnamed-chunk-9-1.png and b/docs/labs/bioc/bioc_03_integration_files/figure-html/unnamed-chunk-9-1.png differ diff --git a/docs/labs/bioc/bioc_04_clustering.html b/docs/labs/bioc/bioc_04_clustering.html index c508e947..4ee965d8 100644 --- a/docs/labs/bioc/bioc_04_clustering.html +++ b/docs/labs/bioc/bioc_04_clustering.html @@ -165,6 +165,11 @@ Info +
@@ -830,7 +835,7 @@

Published with Quarto v1.3.450 - + diff --git a/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-12-1.png b/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-12-1.png index 0fb4439f..e4d9a9e9 100644 Binary files a/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-12-1.png and b/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-12-1.png differ diff --git a/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-4-1.png b/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-4-1.png index 55c9f1f8..967a6e3c 100644 Binary files a/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-4-1.png and b/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-4-1.png differ diff --git a/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-4-2.png b/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-4-2.png index a05d20dd..a73494e9 100644 Binary files a/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-4-2.png and b/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-4-2.png differ diff --git a/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-5-1.png b/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-5-1.png index 673c29fe..5181801b 100644 Binary files a/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-5-1.png and b/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-5-1.png differ diff --git a/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-6-1.png b/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-6-1.png index 983c41e3..90cac67d 100644 Binary files a/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-6-1.png and b/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-6-1.png differ diff --git a/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-7-1.png 
b/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-7-1.png index c612fc5b..18da6fb7 100644 Binary files a/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-7-1.png and b/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-7-1.png differ diff --git a/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-8-1.png b/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-8-1.png index 20494371..a4c61e94 100644 Binary files a/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-8-1.png and b/docs/labs/bioc/bioc_04_clustering_files/figure-html/unnamed-chunk-8-1.png differ diff --git a/docs/labs/bioc/bioc_05_dge.html b/docs/labs/bioc/bioc_05_dge.html index bde43e92..25d1c9ac 100644 --- a/docs/labs/bioc/bioc_05_dge.html +++ b/docs/labs/bioc/bioc_05_dge.html @@ -165,6 +165,11 @@ Info +
@@ -297,46 +302,46 @@

# Visualizing the expression of one
 markers_genes[["1"]]
-
DataFrame with 18183 rows and 11 columns
-               p.value         FDR summary.logFC     logFC.2      logFC.3
-             <numeric>   <numeric>     <numeric>   <numeric>    <numeric>
-S100A8     5.01536e-64 9.11942e-60      6.628056     7.76621     2.340367
-S100A12    5.88901e-53 5.35399e-49      1.787072     4.27763     1.787072
-S100A9     2.54322e-28 1.54144e-24      1.421390     7.39019     1.421390
-CXCL8      5.98014e-15 2.71842e-11      1.102967     1.58992     1.102967
-PLBD1      2.42988e-14 8.83649e-11      0.987264     2.43642     0.987264
-...                ...         ...           ...         ...          ...
-AC007325.4           1           1    0.01104654  0.01104654 -0.004812566
-AL354822.1           1           1   -0.00785244 -0.00785244  0.000868684
-AC004556.1           1           1    0.02294381 -0.02462402 -0.124791403
-AC233755.1           1           1   -0.00670799 -0.00670799  0.000000000
-AC240274.1           1           1   -0.00724362 -0.00724362 -0.007032607
-               logFC.4     logFC.5     logFC.6     logFC.7    logFC.8
-             <numeric>   <numeric>   <numeric>   <numeric>  <numeric>
-S100A8         7.89619     7.78462     7.94406     7.88144    6.62806
-S100A12        4.31600     4.28998     4.31586     4.31295    4.26648
-S100A9         7.50841     7.42086     7.55250     7.55379    6.29102
-CXCL8          1.68719     1.54233     1.63129     1.63792    1.53139
-PLBD1          2.43135     2.44121     2.44252     2.44082    2.40550
-...                ...         ...         ...         ...        ...
-AC007325.4 -0.00271371  0.00667792  0.00417983  0.00809222  0.0110465
-AL354822.1 -0.01036855 -0.00936705 -0.00928158 -0.01539009 -0.0490755
-AC004556.1 -0.04927666 -0.01090129 -0.05200271 -0.04487633  0.0229438
-AC233755.1  0.00000000  0.00000000  0.00000000  0.00000000  0.0000000
-AC240274.1 -0.01510737 -0.01125536 -0.00103067 -0.00380232 -0.0143902
-               logFC.9
-             <numeric>
-S100A8         6.27635
-S100A12        3.88182
-S100A9         4.81815
-CXCL8          1.54518
-PLBD1          1.81260
-...                ...
-AC007325.4 -0.00652380
-AL354822.1 -0.00783011
-AC004556.1 -0.14233685
-AC233755.1  0.00000000
-AC240274.1 -0.01826009
+
DataFrame with 18186 rows and 11 columns
+                p.value          FDR summary.logFC     logFC.2      logFC.3
+              <numeric>    <numeric>     <numeric>   <numeric>    <numeric>
+S100A12    1.57321e-139 2.86104e-135       2.34116     4.13134      2.34116
+S100A8      1.35706e-64  1.23397e-60       6.52478     7.66360      3.33664
+S100A9      1.40449e-61  8.51405e-58       6.19181     7.33443      2.41140
+PLBD1       3.89784e-49  1.77215e-45       1.28043     2.32483      1.28043
+NAMPT       7.45257e-38  2.71065e-34       1.27817     2.67891      1.27817
+...                 ...          ...           ...         ...          ...
+AC007325.4            1            1    0.00966451  0.00966451  0.000585433
+AL354822.1            1            1   -0.00710162 -0.00710162  0.000697440
+AC004556.1            1            1   -0.04593904 -0.05277778 -0.107041903
+AC233755.1            1            1   -0.00643585 -0.00643585  0.000000000
+AC240274.1            1            1   -0.00464419 -0.00464419 -0.003523507
+               logFC.4     logFC.5      logFC.6      logFC.7     logFC.8
+             <numeric>   <numeric>    <numeric>    <numeric>   <numeric>
+S100A12        4.17448     4.16371      4.16654      4.16271     4.11649
+S100A8         7.78141     7.69505      7.80820      7.76011     6.52478
+S100A9         7.42537     7.40085      7.47624      7.47041     6.19181
+PLBD1          2.32076     2.33067      2.33183      2.32822     2.28943
+NAMPT          2.76442     2.68668      2.75854      2.86208     2.81797
+...                ...         ...          ...          ...         ...
+AC007325.4 -0.00472268  0.00533155  0.003317156  0.007609982  0.00966451
+AL354822.1 -0.00667383 -0.00850239 -0.008013634 -0.012927623 -0.01634707
+AC004556.1 -0.04331115 -0.01255718 -0.045939045 -0.042512552  0.01815203
+AC233755.1  0.00000000  0.00000000  0.000000000  0.000000000  0.00000000
+AC240274.1 -0.00702685 -0.00810242  0.000772945 -0.000256299 -0.01576446
+              logFC.9
+            <numeric>
+S100A12       4.12539
+S100A8        6.89910
+S100A9        5.25571
+PLBD1         1.85508
+NAMPT         1.62395
+...               ...
+AC007325.4 -0.0146342
+AL354822.1 -0.0131441
+AC004556.1 -0.1608256
+AC233755.1  0.0000000
+AC240274.1 -0.0229031

We can now select the top 25 upregulated genes for plotting.
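
Reviewer note: one plausible way to make this selection is sketched below in base R. It is not the code from the lab. A plain `data.frame` stands in for the `DataFrame` shown above, the values are invented, and only the column names `p.value` and `summary.logFC` are taken from the printed output.

```r
# Toy marker table: 60 genes, odd-numbered genes up, even-numbered genes down
markers <- data.frame(
  p.value       = (1:60) / 1000,
  summary.logFC = ifelse(seq_len(60) %% 2 == 0, -1.5, 1.5),
  row.names     = paste0("gene", 1:60)
)

# Keep upregulated genes only, then take the 25 with the smallest p-values
up    <- markers[markers$summary.logFC > 0, ]
top25 <- head(up[order(up$p.value), ], 25)

print(rownames(top25))
```

Filtering on the sign of the log fold change before ranking keeps strongly downregulated genes from taking up slots in the top 25.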

@@ -355,7 +360,7 @@

@@ -444,7 +449,7 @@

@@ -946,7 +951,7 @@

Published with Quarto v1.3.450 - + diff --git a/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-10-1.png b/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-10-1.png index 1da84797..42ef8b85 100644 Binary files a/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-10-1.png and b/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-10-1.png differ diff --git a/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-12-1.png b/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-12-1.png index ea328201..acf9d3a0 100644 Binary files a/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-12-1.png and b/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-12-1.png differ diff --git a/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-15-1.png b/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-15-1.png index 248b55bc..450b65a0 100644 Binary files a/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-15-1.png and b/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-15-1.png differ diff --git a/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-5-1.png b/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-5-1.png index 86880eae..e8c0d53a 100644 Binary files a/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-5-1.png and b/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-5-1.png differ diff --git a/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-5-2.png b/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-5-2.png index d0008e15..d4536a27 100644 Binary files a/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-5-2.png and b/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-5-2.png differ diff --git a/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-6-1.png b/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-6-1.png index 2e2fc535..fd6a4d2c 100644 Binary files 
a/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-6-1.png and b/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-6-1.png differ diff --git a/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-7-1.png b/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-7-1.png index d5462ce4..22af23cb 100644 Binary files a/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-7-1.png and b/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-7-1.png differ diff --git a/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-9-1.png b/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-9-1.png index b332de20..f651369a 100644 Binary files a/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-9-1.png and b/docs/labs/bioc/bioc_05_dge_files/figure-html/unnamed-chunk-9-1.png differ diff --git a/docs/labs/bioc/bioc_06_celltyping.html b/docs/labs/bioc/bioc_06_celltyping.html index 69b9d442..041510de 100644 --- a/docs/labs/bioc/bioc_06_celltyping.html +++ b/docs/labs/bioc/bioc_06_celltyping.html @@ -165,6 +165,11 @@ Info +
@@ -409,9 +414,9 @@


      B cell  CD4 T cell  CD8 T cell         cDC       cMono      ncMono 
-         70         104         125          38         215         160 
+         69         105         124          38         215         160 
     NK cell         pDC Plasma cell  unassigned 
-        294           2           1         194 
+        294           2           1         195 

Then add the predictions to metadata and plot UMAP.

@@ -451,10 +456,10 @@

table(cell_type_pred)
cell_type_pred
-     B cell  CD4 T cell  CD8 T cell         cDC       cMono      ncMono 
-        101         161         293          37         241         164 
-    NK cell         pDC Plasma cell 
-        203           2           1 
+    B cell CD4 T cell CD8 T cell        cDC      cMono     ncMono    NK cell
+       102        176        300         65        187        189        182
+       pDC
+         2

Then add the predictions to metadata and plot UMAP.

@@ -594,7 +599,7 @@

@@ -670,164 +675,162 @@

res
$`1`
-       pathway         pval         padj         ES       NES nMoreExtreme size
-1:       cMono 0.0001612123 0.0005946089  0.9477365  1.935642            0   47
-2:      ncMono 0.0001611344 0.0005946089  0.8883004  1.824343            0   49
-3:         cDC 0.0581929556 0.0654670750 -0.7642090 -1.413663          265   17
-4: Plasma cell 0.0263583815 0.0338893476 -0.7559870 -1.492311          113   24
-5:     NK cell 0.0018440464 0.0027660695 -0.7327226 -1.663502            6   49
-6:  CD8 T cell 0.0011008366 0.0019815059 -0.8963974 -1.673679            4   18
-7:      B cell 0.0002632272 0.0005946089 -0.9032392 -2.032917            0   47
-8:  CD4 T cell 0.0002642706 0.0005946089 -0.9254862 -2.108715            0   50
-                                                leadingEdge
-1:                  S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...
-2:             S100A11,AIF1,S100A4,FCER1G,MAFB,SERPINA1,...
-3: HLA-DPB1,HLA-DPA1,HLA-DQB1,HLA-DRB1,HLA-DMA,HLA-DRB5,...
-4:                    ISG20,PEBP1,CYCS,MIF,FKBP11,SPCS2,...
-5:                       GNLY,NKG7,B2M,CTSW,GZMA,FGFBP2,...
-6:                         IL32,CCL5,GZMH,CD3D,CD2,CD8A,...
-7:                 RPS5,CXCR4,RPL23A,CD52,RPL18A,RPL13A,...
-8:                  RPL3,RPS4X,RPS27A,RPL5,EEF1A1,RPL14,...
+      pathway         pval         padj         ES       NES nMoreExtreme size
+1:      cMono 0.0001327492 0.0007607777  0.9515770  1.824481            0   47
+2:     ncMono 0.0003952048 0.0007607777  0.8775149  1.692495            2   49
+3:    NK cell 0.0070510162 0.0105765243 -0.6936107 -1.614830           16   49
+4: CD8 T cell 0.0002904444 0.0007607777 -0.9042254 -1.737436            0   18
+5:     B cell 0.0004050223 0.0007607777 -0.9085213 -2.107981            0   47
+6: CD4 T cell 0.0004226543 0.0007607777 -0.9246391 -2.155367            0   50
+                                leadingEdge
+1:  S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...
+2: S100A11,AIF1,S100A4,FCER1G,MAFB,SAT1,...
+3:         GNLY,NKG7,CTSW,GZMA,B2M,GZMM,...
+4:         IL32,CCL5,GZMH,CD3D,CD2,CD8A,...
+5: RPS5,CXCR4,RPL23A,CD52,RPL18A,RPL13A,...
+6:  RPL3,RPS4X,RPS27A,RPL5,EEF1A1,RPL14,...
 
 $`2`
-      pathway         pval         padj         ES       NES nMoreExtreme size
-1:     B cell 0.0002041650 0.0003700658  0.9650595  2.060454            0   47
-2: CD4 T cell 0.0002055921 0.0003700658  0.8591045  1.846955            0   50
-3:        cDC 0.0004203447 0.0006305170  0.9445632  1.709807            1   17
-4: CD8 T cell 0.0021048603 0.0027062490 -0.8921894 -1.641239           10   18
-5:      cMono 0.0001959248 0.0003700658 -0.8185447 -1.761319            0   47
-6:     ncMono 0.0001940994 0.0003700658 -0.8829761 -1.915489            0   49
-7:    NK cell 0.0001940994 0.0003700658 -0.9127279 -1.980031            0   49
+      pathway         pval        padj         ES       NES nMoreExtreme size
+1:     B cell 0.0001973554 0.000365408  0.9639203  2.032368            0   47
+2: CD4 T cell 0.0001970055 0.000365408  0.8696162  1.846240            0   50
+3:        cDC 0.0001979806 0.000365408  0.9506666  1.711665            0   17
+4: CD8 T cell 0.0016083635 0.002067896 -0.8930677 -1.641590            7   18
+5:      cMono 0.0008105370 0.001215805 -0.7964559 -1.721447            3   47
+6:     ncMono 0.0002030045 0.000365408 -0.8998544 -1.960178            0   49
+7:    NK cell 0.0002030045 0.000365408 -0.9119291 -1.986480            0   49
                                                leadingEdge
-1:              MS4A1,CD37,TNFRSF13C,CXCR4,BANK1,CD79B,...
+1:              MS4A1,CD37,CXCR4,TNFRSF13C,BANK1,CD79B,...
 2:                   RPS6,RPL13,RPL32,RPS3A,RPS29,RPL3,...
 3: HLA-DRA,HLA-DPB1,HLA-DQB1,HLA-DRB1,HLA-DPA1,HLA-DMA,...
 4:                        CCL5,IL32,GZMH,CD3D,CD2,LYAR,...
-5:                S100A6,S100A9,LYZ,S100A8,TYROBP,FCN1,...
-6:              S100A4,FCER1G,S100A11,AIF1,IFITM3,LST1,...
+5:                S100A6,S100A9,TYROBP,LYZ,S100A8,FCN1,...
+6:              S100A4,FCER1G,S100A11,AIF1,LST1,IFITM3,...
 7:                     HCST,NKG7,ITGB2,GNLY,MYO1F,CST7,...
 
 $`3`
-      pathway         pval         padj         ES       NES nMoreExtreme size
-1:     ncMono 0.0001041124 0.0004694836  0.9309715  1.625137            0   49
-2:      cMono 0.0001043297 0.0004694836  0.9315183  1.624154            0   47
-3:        cDC 0.0168105930 0.0216136195  0.8590261  1.386413          145   17
-4: CD4 T cell 0.0026666667 0.0040000000 -0.7020776 -1.886878            0   50
-5:    NK cell 0.0025188917 0.0040000000 -0.7120017 -1.914447            0   49
-6: CD8 T cell 0.0007980846 0.0023942538 -0.9359176 -2.017558            0   18
-7:     B cell 0.0023980815 0.0040000000 -0.8774013 -2.326466            0   47
-                                               leadingEdge
-1:            AIF1,PSAP,S100A11,FCER1G,S100A4,SERPINA1,...
-2:                S100A9,LYZ,S100A8,FCN1,TYROBP,S100A6,...
-3: HLA-DRA,HLA-DRB1,HLA-DRB5,HLA-DQB1,HLA-DPA1,HLA-DMA,...
-4:                 RPL3,PIK3IP1,IL7R,RPS29,RPS3,RPS27A,...
-5:                       NKG7,GNLY,CST7,GZMA,CTSW,GZMM,...
-6:                        CCL5,IL32,GZMH,CD3D,CD2,CD8A,...
-7:              CXCR4,MS4A1,TNFRSF13C,CD79B,BANK1,RPS5,...
+       pathway         pval         padj         ES       NES nMoreExtreme size
+1:       cMono 0.0001162115 0.0005229518  0.9366715  1.754906            0   47
+2:      ncMono 0.0001152206 0.0005229518  0.9282959  1.748653            0   49
+3:         cDC 0.0058249797 0.0074892596  0.8938512  1.504106           42   17
+4: Plasma cell 0.0304961311 0.0343081475 -0.7003557 -1.517735           66   24
+5:     NK cell 0.0007558579 0.0011583012 -0.7163352 -1.783214            0   49
+6:  CD8 T cell 0.0003958828 0.0011583012 -0.9181589 -1.871343            0   18
+7:  CD4 T cell 0.0007722008 0.0011583012 -0.7661201 -1.917423            0   50
+8:      B cell 0.0007158196 0.0011583012 -0.8988784 -2.222805            0   47
+                                                leadingEdge
+1:                 LYZ,S100A9,S100A8,FCN1,TYROBP,S100A6,...
+2:                AIF1,PSAP,S100A4,FCER1G,S100A11,COTL1,...
+3: HLA-DRA,HLA-DRB1,HLA-DRB5,HLA-DPA1,HLA-DQB1,HLA-DPB1,...
+4:                  ISG20,CYCS,FKBP11,JCHAIN,MZB1,PEBP1,...
+5:                      NKG7,CST7,GZMM,CTSW,GZMA,FGFBP2,...
+6:                         CCL5,IL32,CD3D,GZMH,CD2,CD8A,...
+7:                  PIK3IP1,RPS29,IL7R,RPS27A,RPL3,RPS3,...
+8:               CXCR4,MS4A1,CD79B,RPS5,TNFRSF13C,BANK1,...
 
 $`4`
       pathway         pval         padj         ES       NES nMoreExtreme size
-1: CD4 T cell 0.0001930875 0.0004653568  0.9803622  2.131622            0   50
-2:    NK cell 0.0275077559 0.0412616339 -0.6668272 -1.466630          132   49
-3:        cDC 0.0001991239 0.0004653568 -0.9322686 -1.728863            0   17
-4:        pDC 0.0006202191 0.0011163945 -0.8171519 -1.789912            2   47
-5:      cMono 0.0002067397 0.0004653568 -0.9186945 -2.012333            0   47
-6:     ncMono 0.0002068252 0.0004653568 -0.9263802 -2.037495            0   49
+1: CD4 T cell 0.0002101723 0.0004728878  0.9821321  2.134294            0   50
+2:    NK cell 0.0339112212 0.0508668318 -0.6722891 -1.452462          177   49
+3:        cDC 0.0001847404 0.0004728878 -0.9340033 -1.702542            0   17
+4:        pDC 0.0005699088 0.0010258359 -0.8226825 -1.767306            2   47
+5:      cMono 0.0001899696 0.0004728878 -0.9129405 -1.961201            0   47
+6:     ncMono 0.0001905125 0.0004728878 -0.9484681 -2.049139            0   49
                                                leadingEdge
-1:                  IL7R,LDHB,PIK3IP1,NOSIP,RPL3,RPS12,...
-2:                    NKG7,GNLY,FGFBP2,MYO1F,CST7,GZMA,...
+1:                  IL7R,LDHB,PIK3IP1,RPL3,RPS12,RPL13,...
+2:                   NKG7,GNLY,MYO1F,FGFBP2,CST7,ITGB2,...
 3: HLA-DRA,HLA-DRB1,HLA-DPA1,HLA-DPB1,HLA-DQB1,HLA-DMA,...
-4:                     PLEK,NPC2,IRF8,PLAC8,PTPRE,CTSB,...
-5:                 S100A9,S100A8,LYZ,TYROBP,FCN1,APLP2,...
-6:                    FCER1G,PSAP,IFITM3,LYN,SAT1,LST1,...
+4:                     PLEK,NPC2,PLAC8,IRF8,CTSB,PTPRE,...
+5:                 S100A9,TYROBP,S100A8,LYZ,FCN1,APLP2,...
+6:                   FCER1G,PSAP,IFITM3,LST1,SAT1,AIF1,...
 
 $`5`
       pathway         pval         padj         ES       NES nMoreExtreme size
-1:     B cell 0.0001818182 0.0004016064  0.9624502  2.052882            0   47
-2: CD4 T cell 0.0001812251 0.0004016064  0.8762641  1.886926            0   50
-3:        cDC 0.0001904399 0.0004016064  0.9538608  1.738185            0   17
-4: CD8 T cell 0.0004203447 0.0006305170 -0.9046911 -1.711837            1   18
-5:      cMono 0.0008884940 0.0011423494 -0.7954796 -1.765723            3   47
-6:     ncMono 0.0002231147 0.0004016064 -0.8859954 -1.977394            0   49
-7:    NK cell 0.0002231147 0.0004016064 -0.9087684 -2.028219            0   49
+1:     B cell 0.0001963479 0.0003684749  0.9641767  2.042223            0   47
+2: CD4 T cell 0.0001947799 0.0003684749  0.8825397  1.887378            0   50
+3:        cDC 0.0002023063 0.0003684749  0.9586399  1.728487            0   17
+4: CD8 T cell 0.0013938670 0.0020908005 -0.8995956 -1.657233            6   18
+5:      cMono 0.0018333673 0.0023571865 -0.7742663 -1.687829            8   47
+6:     ncMono 0.0002047083 0.0003684749 -0.9119300 -1.999937            0   49
+7:    NK cell 0.0002047083 0.0003684749 -0.9130713 -2.002440            0   49
                                                leadingEdge
 1:          MS4A1,CD37,CXCR4,TNFRSF13C,BANK1,LINC00926,...
-2:                    RPS6,RPL13,RPL32,RPS3A,RPL9,RPL3,...
+2:                   RPS6,RPL13,RPL32,RPS3A,RPL9,RPS29,...
 3: HLA-DRA,HLA-DQB1,HLA-DRB1,HLA-DPB1,HLA-DPA1,HLA-DMA,...
 4:                        CCL5,IL32,GZMH,CD3D,CD2,LYAR,...
-5:                S100A6,S100A9,LYZ,S100A8,TYROBP,FCN1,...
-6:              S100A4,FCER1G,S100A11,AIF1,PSAP,IFITM3,...
-7:                     HCST,NKG7,ITGB2,GNLY,MYO1F,CST7,...
+5:                S100A6,S100A9,LYZ,TYROBP,S100A8,FCN1,...
+6:                S100A4,FCER1G,S100A11,AIF1,PSAP,LST1,...
+7:                     ITGB2,NKG7,HCST,GNLY,MYO1F,CST7,...
 
 $`6`
       pathway         pval         padj         ES       NES nMoreExtreme size
-1:    NK cell 0.0001968117 0.0003660024  0.9357367  2.012182            0   49
-2: CD4 T cell 0.0001970443 0.0003660024  0.8648575  1.865254            0   50
-3: CD8 T cell 0.0002002804 0.0003660024  0.9667190  1.776197            0   18
-4:        cDC 0.0047732697 0.0071599045 -0.8811814 -1.612760           23   17
-5:     ncMono 0.0002032107 0.0003660024 -0.8655401 -1.895657            0   49
-6:      cMono 0.0002033347 0.0003660024 -0.9182094 -1.999151            0   47
+1:    NK cell 0.0001863586 0.0003882657  0.9383295  1.976633            0   49
+2: CD4 T cell 0.0001846381 0.0003882657  0.8789241  1.861605            0   50
+3: CD8 T cell 0.0001971220 0.0003882657  0.9670054  1.759568            0   18
+4:        pDC 0.0952073931 0.1224095054 -0.6041872 -1.319390          442   47
+5:        cDC 0.0034246575 0.0051369863 -0.8829998 -1.620873           16   17
+6:     ncMono 0.0002157032 0.0003882657 -0.8872842 -1.952006            0   49
+7:      cMono 0.0002149151 0.0003882657 -0.9085014 -1.983934            0   47
                                             leadingEdge
 1:                    NKG7,GNLY,CST7,GZMA,CTSW,GZMM,...
-2:                IL7R,RPS3,RPS29,RPL3,MGAT4A,RPS4X,...
+2:                  IL7R,RPS3,RPS29,RPL3,RPL13,RPS6,...
 3:                    CCL5,IL32,GZMH,CD3D,LYAR,CD8A,...
-4: HLA-DRA,HLA-DMA,HLA-DQB1,HLA-DRB5,BASP1,HLA-DRB1,...
-5:                 FCER1G,AIF1,LST1,FTH1,COTL1,PSAP,...
-6:               S100A9,S100A8,LYZ,TYROBP,FCN1,VCAN,...
+4:                 NPC2,CTSB,IRF8,UNC93B1,PLEK,TCF4,...
+5: HLA-DRA,HLA-DQB1,HLA-DRB5,HLA-DMA,HLA-DRB1,BASP1,...
+6:                 FCER1G,AIF1,LST1,FTH1,COTL1,PSAP,...
+7:                S100A9,S100A8,LYZ,TYROBP,FCN1,TKT,...
 
 $`7`
       pathway         pval         padj         ES       NES nMoreExtreme size
-1:    NK cell 0.0002246686 0.0006740058  0.9822433  2.117581            0   49
-2: CD8 T cell 0.0052356021 0.0067314884  0.8934917  1.648012           23   18
-3:        cDC 0.0007408779 0.0016233766 -0.9096050 -1.649017            3   17
-4:     ncMono 0.0025220681 0.0037831021 -0.7690981 -1.653101           13   49
-5: CD4 T cell 0.0009018759 0.0016233766 -0.8069090 -1.736711            4   50
-6:      cMono 0.0001806685 0.0006740058 -0.8740244 -1.867198            0   47
-7:     B cell 0.0001806685 0.0006740058 -0.8943406 -1.910600            0   47
+1:    NK cell 0.0002319109 0.0006957328  0.9845619  2.113823            0   49
+2: CD8 T cell 0.0048098946 0.0061841503  0.9021784  1.664902           20   18
+3:        cDC 0.0005337129 0.0009606832 -0.9182682 -1.651815            2   17
+4: CD4 T cell 0.0008767315 0.0013150973 -0.7915340 -1.672841            4   50
+5:     ncMono 0.0005272408 0.0009606832 -0.8127856 -1.712117            2   49
+6:      cMono 0.0001757469 0.0006957328 -0.8702759 -1.823002            0   47
+7:     B cell 0.0001757469 0.0006957328 -0.8859815 -1.855901            0   47
                                                leadingEdge
-1:                     GNLY,NKG7,FGFBP2,CST7,PRF1,CTSW,...
+1:                     GNLY,NKG7,CTSW,FGFBP2,CST7,PRF1,...
 2:                   CCL5,GZMH,IL32,LYAR,CD2,LINC01871,...
-3: HLA-DRA,HLA-DRB1,HLA-DQB1,HLA-DPA1,HLA-DMA,HLA-DRB5,...
-4:                      COTL1,FTH1,AIF1,LST1,SAT1,SPI1,...
-5:              TMEM123,RPS13,RPL22,RPS28,RPL35A,RPL36,...
-6:                     S100A9,S100A8,LYZ,FCN1,TKT,VCAN,...
-7:               CD37,RPS11,MS4A1,CD52,BANK1,TNFRSF13C,...
+3: HLA-DRA,HLA-DRB1,HLA-DQB1,HLA-DPA1,HLA-DPB1,HLA-DMA,...
+4:               RPS28,TMEM123,RPL35A,RPS13,RPL9,RPS12,...
+5:                    COTL1,FTH1,AIF1,LST1,SAT1,NAP1L1,...
+6:                     S100A9,S100A8,LYZ,FCN1,TKT,MNDA,...
+7:               CD37,CD52,MS4A1,BANK1,CD79B,TNFRSF13C,...
 
 $`8`
-      pathway         pval        padj         ES       NES nMoreExtreme size
-1:     ncMono 0.0021600605 0.003240091 -0.7537958 -1.411206           19   49
-2:    NK cell 0.0006480181 0.001166433 -0.7784508 -1.457363            5   49
-3:     B cell 0.0004329004 0.001166433 -0.7863661 -1.466871            3   47
-4:        cDC 0.0005745145 0.001166433 -0.8884593 -1.499586            4   17
-5:      cMono 0.0001082251 0.000487013 -0.8319138 -1.551835            0   47
-6: CD4 T cell 0.0001077702 0.000487013 -0.9066494 -1.701390            0   50
+       pathway         pval         padj         ES       NES nMoreExtreme size
+1: Plasma cell 0.0497362472 0.0639466035  0.6759316  1.456101           65   24
+2:     NK cell 0.0014337708 0.0021506562 -0.7686920 -1.449205           12   49
+3:      ncMono 0.0014337708 0.0021506562 -0.7689308 -1.449655           12   49
+4:      B cell 0.0004431642 0.0013294926 -0.7964359 -1.495521            3   47
+5:         cDC 0.0006997901 0.0015745276 -0.8968690 -1.514856            5   17
+6:       cMono 0.0001107910 0.0004985597 -0.8330905 -1.564350            0   47
+7:  CD4 T cell 0.0001100837 0.0004985597 -0.9094132 -1.719188            0   50
                                                leadingEdge
-1:           S100A4,S100A11,AIF1,IFITM2,CEBPB,SERPINA1,...
+1:               JCHAIN,MZB1,DAD1,DERL3,TNFRSF17,MYDGF,...
 2:                   ITGB2,NKG7,GNLY,MYO1F,IFITM1,JAK1,...
-3:                   CD52,RPS23,RPL13A,RPS11,RPL12,FAU,...
-4: HLA-DRA,HLA-DRB1,HLA-DPB1,HLA-DPA1,HLA-DQB1,HLA-DMA,...
-5:                   JUND,S100A6,NFKBIA,TYROBP,LYZ,FOS,...
-6:                RPL34,RPS13,RPL13,EEF1A1,RPS3A,RPL32,...
+3:           S100A4,S100A11,AIF1,IFITM2,CEBPB,SERPINA1,...
+4:                   CD52,RPS23,RPL13A,RPS11,RPL12,FAU,...
+5: HLA-DRA,HLA-DRB1,HLA-DPB1,HLA-DPA1,HLA-DQB1,HLA-DMA,...
+6:                   JUND,S100A6,TYROBP,NFKBIA,LYZ,FOS,...
+7:                 RPL34,EEF1A1,RPL13,RPS13,RPS3A,RPS6,...
 
 $`9`
-       pathway         pval        padj         ES       NES nMoreExtreme size
-1:      ncMono 0.0001191611 0.001072450  0.9705242  1.879820            0   49
-2:         cDC 0.0061555680 0.011080022  0.8911415  1.525520           43   17
-3:       cMono 0.0129496403 0.016649538  0.7656658  1.476902          107   47
-4: Plasma cell 0.0330511890 0.037182588 -0.7002547 -1.547523           81   24
-5:     NK cell 0.0105590062 0.015838509 -0.6315449 -1.603456           16   49
-6:  CD8 T cell 0.0007165890 0.001612325 -0.8974765 -1.869886            1   18
-7:  CD4 T cell 0.0006422608 0.001612325 -0.8507977 -2.161552            0   50
-8:      B cell 0.0006016847 0.001612325 -0.8721690 -2.198723            0   47
-                                               leadingEdge
-1:                  AIF1,LST1,COTL1,FCER1G,PSAP,FCGR3A,...
-2: HLA-DPA1,HLA-DRA,HLA-DPB1,HLA-DRB1,HLA-DRB5,HLA-DMA,...
-3:                   LYZ,TYROBP,S100A6,FCN1,TKT,S100A9,...
-4:                 ISG20,CYCS,FKBP11,PEBP1,JCHAIN,MZB1,...
-5:                    CST7,IFITM1,GZMM,CCL4,CD247,HOPX,...
-6:                        CCL5,IL32,CD3D,GZMH,CD2,LYAR,...
-7:                   RPL31,RPS29,IL7R,RPS3,RPS27A,CCR7,...
-8:       CXCR4,MS4A1,BANK1,TNFRSF13C,LINC00926,RALGPS2,...
+      pathway         pval        padj         ES       NES nMoreExtreme size
+1:     ncMono 0.0001131990 0.001018791  0.9741332  1.797890            0   49
+2:        cDC 0.0419888030 0.062983204  0.8400218  1.386229          314   17
+3: CD8 T cell 0.0004108463 0.001562500 -0.9139791 -1.881373            0   18
+4:    NK cell 0.0008561644 0.001562500 -0.7548756 -1.882005            0   49
+5:     B cell 0.0008244023 0.001562500 -0.7643028 -1.891838            0   47
+6: CD4 T cell 0.0008680556 0.001562500 -0.8712990 -2.177877            0   50
+                                              leadingEdge
+1:               LST1,AIF1,COTL1,FCER1G,FCGR3A,IFITM3,...
+2: HLA-DPA1,HLA-DRA,HLA-DPB1,HLA-DRB1,HLA-DRB5,MTMR14,...
+3:                       CCL5,IL32,GZMH,CD3D,CD2,CD8A,...
+4:                     NKG7,GNLY,CST7,CTSW,GZMA,CD247,...
+5:       CXCR4,MS4A1,BANK1,TNFRSF13C,LINC00926,RPL13A,...
+6:                  RPL31,LDHB,RPS3,IL7R,RPS29,RPS27A,...

Selecting the top significant overlap per cluster, we can now rename the clusters according to the predicted labels. Note: if a cluster has non-significant p-values for all the gene sets, its label will not be very reliable. Also, the gene sets you are using may not cover all the cell types in your dataset, so a prediction may simply be the most similar cell type available. Finally, some clusters have very similar p-values for multiple cell types; for instance, ncMono and cMono fit some clusters equally well.
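The caveat about non-significant clusters can be made concrete with a small sketch. It is written in plain Python rather than the R used in this lab, and it is not the lab's own code: the `results` dictionary, the `pick_label` helper, the `alpha` cutoff of 0.05 and the `"unassigned"` fallback are all assumptions chosen for illustration. The numbers are copied from a few rows of the cluster 8 and 9 tables above.

```python
# Hypothetical per-cluster enrichment results; a subset of the rows printed
# above for clusters 8 and 9 (pathway, adjusted p-value, NES).
results = {
    "8": [
        {"pathway": "Plasma cell", "padj": 0.0639466035, "NES": 1.456101},
        {"pathway": "NK cell", "padj": 0.0021506562, "NES": -1.449205},
        {"pathway": "CD4 T cell", "padj": 0.0004985597, "NES": -1.719188},
    ],
    "9": [
        {"pathway": "ncMono", "padj": 0.001018791, "NES": 1.797890},
        {"pathway": "cDC", "padj": 0.062983204, "NES": 1.386229},
        {"pathway": "CD4 T cell", "padj": 0.001562500, "NES": -2.177877},
    ],
}


def pick_label(rows, alpha=0.05):
    """Return the positively enriched pathway with the highest NES.

    Only rows with padj below alpha are considered (alpha is an assumed
    cutoff); if none qualify, the cluster is reported as "unassigned".
    """
    hits = [r for r in rows if r["padj"] < alpha and r["NES"] > 0]
    if not hits:
        return "unassigned"
    return max(hits, key=lambda r: r["NES"])["pathway"]


labels = {cluster: pick_label(rows) for cluster, rows in results.items()}
print(labels)
```

With these rows, cluster 9 is labelled ncMono, while cluster 8 stays unassigned because its only positively enriched gene set (Plasma cell) has an adjusted p-value above the assumed cutoff. Simply taking the first row of each table would instead label cluster 8 as Plasma cell, which is exactly the kind of weakly supported label the note above warns about.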

@@ -929,93 +932,93 @@

$`1`
                   pathway         pval       padj        ES      NES
-1:             Neutrophil 0.0001507613 0.01493723 0.9197310 2.010307
-2: CD1C+_B dendritic cell 0.0001589067 0.01493723 0.9293164 1.931839
-3:           Stromal cell 0.0013311148 0.05004992 0.8544544 1.696909
+1:             Neutrophil 0.0001222195 0.01215255 0.9203456 1.876178
+2: CD1C+_B dendritic cell 0.0001292825 0.01215255 0.9243123 1.809278
+3:           Stromal cell 0.0011025358 0.04145535 0.8693509 1.626355
    nMoreExtreme size                                  leadingEdge
-1:            0   80 S100A8,S100A9,S100A12,MNDA,S100A11,NAMPT,...
-2:            0   53      S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...
-3:            7   38          VIM,TIMP2,BST1,TIMP1,ANPEP,CD44,...
+1:            0   80 S100A8,S100A9,S100A12,MNDA,NAMPT,S100A11,...
+2:            0   54      S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...
+3:            7   38          VIM,TIMP2,BST1,TIMP1,CD44,ANPEP,...
 
 $`2`
                        pathway        pval       padj         ES       NES
-1:           Follicular B cell 0.006354586 0.05430282  0.8587199  1.627043
-2:              Pyramidal cell 0.003853565 0.04168250 -0.9722789 -1.490874
-3: CD4+CD25+ regulatory T cell 0.001541426 0.02414900 -0.9799548 -1.502644
+1:           Follicular B cell 0.008464329 0.06630391  0.8526465  1.600418
+2:              Pyramidal cell 0.004198321 0.04582724 -0.9744811 -1.516437
+3: CD4+CD25+ regulatory T cell 0.002199120 0.03445289 -0.9799105 -1.524886
    nMoreExtreme size                         leadingEdge
-1:           29   22 MS4A1,CD69,CD22,FCER2,CD40,PAX5,...
-2:           19    6                           NRGN,CD3E
-3:            7    6            CD3E,CD3D,CD3G,PTPRC,CD4
+1:           41   22 MS4A1,CD69,CD22,FCER2,CD40,PAX5,...
+2:           20    6                           NRGN,CD3E
+3:           10    6            CD3E,CD3D,CD3G,PTPRC,CD4
 
 $`3`
-                           pathway         pval        padj        ES      NES
-1:                      Neutrophil 0.0001011327 0.007217168 0.8809821 1.569285
-2:          CD1C+_B dendritic cell 0.0001033271 0.007217168 0.8836167 1.550651
-3: Monocyte derived dendritic cell 0.0001151676 0.007217168 0.9481164 1.532539
-   nMoreExtreme size                              leadingEdge
-1:            0   80 S100A9,S100A8,S100A11,CD14,LST1,MNDA,...
-2:            0   53     S100A9,LYZ,S100A8,FCN1,VCAN,CD14,...
-3:            0   17   S100A9,S100A8,CST3,CD14,CD33,ITGAX,...
+                  pathway         pval       padj        ES      NES
+1:             Neutrophil 0.0001081315 0.01063709 0.8977016 1.749619
+2: CD1C+_B dendritic cell 0.0001131606 0.01063709 0.8981095 1.699327
+3:           Stromal cell 0.0003583801 0.02245849 0.8818002 1.619610
+   nMoreExtreme size                                 leadingEdge
+1:            0   80 S100A9,S100A8,S100A11,LST1,CD14,S100A12,...
+2:            0   54        LYZ,S100A9,S100A8,FCN1,VCAN,CD14,...
+3:            2   38       VIM,CD44,TIMP2,TIMP1,ICAM1,PECAM1,...
 
 $`4`
              pathway         pval        padj        ES      NES nMoreExtreme
-1: Naive CD8+ T cell 0.0001888218 0.005616299 0.8620656 2.045525            0
-2: Naive CD4+ T cell 0.0002017756 0.005616299 0.9214751 1.879833            0
-3:       CD4+ T cell 0.0002022654 0.005616299 0.9193037 1.787130            0
+1: Naive CD8+ T cell 0.0002157497 0.006783575 0.8599144 2.048419            0
+2: Naive CD4+ T cell 0.0002164971 0.006783575 0.9296309 1.895090            0
+3:       CD4+ T cell 0.0002150538 0.006783575 0.9271953 1.799035            0
    size                            leadingEdge
-1:   91 LDHB,PIK3IP1,NOSIP,TCF7,RCAN3,NPM1,...
+1:   91 LDHB,PIK3IP1,NOSIP,TCF7,NPM1,RCAN3,...
 2:   34    IL7R,NOSIP,TCF7,EEF1B2,RPS5,MAL,...
 3:   25        IL7R,LTB,CD3E,CD3D,CD3G,CD2,...
 
 $`5`
-                        pathway        pval       padj         ES       NES
-1:            Follicular B cell 0.005346572 0.04188148  0.8501224  1.610208
-2: Hematopoietic precursor cell 0.008534851 0.06171354 -0.9521366 -1.493451
-3:               Pyramidal cell 0.003048161 0.03581589 -0.9725160 -1.525417
-   nMoreExtreme size                         leadingEdge
-1:           27   22 MS4A1,CD69,CD22,CD40,FCER2,PAX5,...
-2:           41    6                          CD14,PTPRC
-3:           14    6                           CD3E,NRGN
+              pathway        pval       padj         ES       NES nMoreExtreme
+1:  Follicular B cell 0.008289527 0.05993966  0.8517164  1.606149           40
+2: Myoepithelial cell 0.008235294 0.05993966 -0.9398262 -1.486394           41
+3:     Pyramidal cell 0.002341463 0.03160284 -0.9730353 -1.502327           11
+   size                         leadingEdge
+1:   22 MS4A1,CD69,CD22,CD40,FCER2,PAX5,...
+2:    7                  ITGB1,BHLHE40,CD44
+3:    6                           CD3E,NRGN
 
 $`6`
                              pathway         pval        padj        ES
-1:             CD4+ cytotoxic T cell 0.0001908761 0.007875995 0.8929282
-2:               Natural killer cell 0.0003821899 0.009483454 0.7967208
-3: Effector CD8+ memory T (Tem) cell 0.0003824092 0.009483454 0.7969411
+1:             CD4+ cytotoxic T cell 0.0001825484 0.008283763 0.8850534
+2:               Natural killer cell 0.0001830831 0.008283763 0.8009472
+3: Effector CD8+ memory T (Tem) cell 0.0003665689 0.011485826 0.7818876
         NES nMoreExtreme size                           leadingEdge
-1: 2.063730            0   86     CCL5,NKG7,GZMH,GNLY,CST7,GZMA,...
-2: 1.835585            1   84     NKG7,GNLY,CD3D,CD3E,GZMA,CD3G,...
-3: 1.824241            1   79 GZMH,GNLY,ARL4C,GZMB,FGFBP2,KLRD1,...
+1: 2.020483            0   86     CCL5,NKG7,GNLY,GZMH,CST7,GZMA,...
+2: 1.824333            0   84     NKG7,GNLY,CD3D,CD3E,GZMA,CD3G,...
+3: 1.765723            1   79 GNLY,GZMH,ARL4C,GZMB,FGFBP2,KLRD1,...
 
 $`7`
                              pathway         pval       padj        ES      NES
-1:             CD4+ cytotoxic T cell 0.0002165909 0.01025753 0.9480244 2.205220
-2: Effector CD8+ memory T (Tem) cell 0.0002165909 0.01025753 0.8968211 2.068348
-3:               Natural killer cell 0.0002182453 0.01025753 0.8507701 1.972715
+1:             CD4+ cytotoxic T cell 0.0002382087 0.01130895 0.9485749 2.191249
+2: Effector CD8+ memory T (Tem) cell 0.0002406160 0.01130895 0.8946982 2.041811
+3:               Natural killer cell 0.0002387205 0.01130895 0.8572499 1.974925
    nMoreExtreme size                           leadingEdge
-1:            0   86   GNLY,NKG7,GZMB,FGFBP2,CCL5,CST7,...
+1:            0   86   GNLY,NKG7,CCL5,GZMB,CTSW,FGFBP2,...
 2:            0   79 GNLY,GZMB,FGFBP2,KLRD1,SPON2,GZMH,...
-3:            0   84   GNLY,NKG7,GZMB,GZMA,CD247,KLRD1,...
+3:            0   84   GNLY,NKG7,GZMB,CD247,GZMA,KLRD1,...
 
 $`8`
-            pathway        pval       padj         ES       NES nMoreExtreme
-1:    Megakaryocyte 0.002577320 0.08490323  0.7934901  1.757021            2
-2:       Neutrophil 0.008846794 0.11600655 -0.6842598 -1.340588           84
-3: Mesenchymal cell 0.009494346 0.11600655 -0.7144618 -1.363128           88
-   size                                 leadingEdge
-1:   25         PPBP,PF4,GP9,ITGA2B,CD9,RASGRP2,...
-2:   80 PTPRC,ITGB2,S100A11,CD44,IFITM2,S100A12,...
-3:   58         S100A4,PTPRC,VIM,CD44,ZEB2,CTSC,...
+               pathway        pval      padj         ES       NES nMoreExtreme
+1:       Megakaryocyte 0.008771930 0.1649123  0.8128385  1.763957           10
+2:          Eosinophil 0.007063238 0.1480465 -0.7453288 -1.396081           63
+3: Natural killer cell 0.003492433 0.1480465 -0.7084403 -1.396899           32
+   size                          leadingEdge
+1:   25  PPBP,PF4,GP9,ITGA2B,CD9,RASGRP2,...
+2:   47   CD52,PTPRC,CD48,CD44,CD53,CD69,...
+3:   84 PTPRC,NKG7,GNLY,CD69,CD81,FCGR3A,...
 
 $`9`
-                 pathway         pval       padj        ES      NES
-1:      Mesenchymal cell 0.0001175917 0.02210724 0.8495970 1.678997
-2:          Stromal cell 0.0007528231 0.04569762 0.8602790 1.630578
-3: Endometrial stem cell 0.0029594138 0.06821588 0.9013667 1.560572
-   nMoreExtreme size                           leadingEdge
-1:            0   58   COTL1,S100A4,VIM,CTSC,HES4,ZEB2,...
-2:            5   38 VIM,PECAM1,TIMP1,CD44,TIMP2,ICAM3,...
-3:           20   18 PECAM1,CD44,PTPRC,ITGA4,ITGB1,ENG,...
+            pathway         pval       padj        ES      NES nMoreExtreme
+1: Mesenchymal cell 0.0001106929 0.02081027 0.8631721 1.606935            0
+2:     Stromal cell 0.0009342520 0.04648863 0.8552856 1.544942            7
+3:    Hemangioblast 0.0003017502 0.02836451 0.9907663 1.516461            1
+   size                           leadingEdge
+1:   58   COTL1,S100A4,CTSC,HES4,VIM,ZEB2,...
+2:   38 PECAM1,TIMP1,VIM,TIMP2,PTPRC,CD44,...
+3:    8                           PECAM1,CD34

#CT_GSEA8:

@@ -1435,7 +1438,7 @@

Published with Quarto v1.3.450 - + diff --git a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-14-1.png b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-14-1.png index d72d6244..f933ffaf 100644 Binary files a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-14-1.png and b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-14-1.png differ diff --git a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-18-1.png b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-18-1.png index 64a22c8f..b446723a 100644 Binary files a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-18-1.png and b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-18-1.png differ diff --git a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-19-1.png b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-19-1.png index 3e373e71..db1f57b7 100644 Binary files a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-19-1.png and b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-19-1.png differ diff --git a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-25-1.png b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-25-1.png index 9debb8ac..a46a2f78 100644 Binary files a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-25-1.png and b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-25-1.png differ diff --git a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-26-1.png b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-26-1.png index 599b9203..5ce28e50 100644 Binary files a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-26-1.png and b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-26-1.png differ diff --git a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-32-1.png 
b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-32-1.png index 40f5217f..fd62c637 100644 Binary files a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-32-1.png and b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-32-1.png differ diff --git a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-33-1.png b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-33-1.png index ffb05198..b07bccaa 100644 Binary files a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-33-1.png and b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-33-1.png differ diff --git a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-37-1.png b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-37-1.png index a598bbab..065ef701 100644 Binary files a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-37-1.png and b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-37-1.png differ diff --git a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-7-1.png b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-7-1.png index a8ef4fa9..b962c618 100644 Binary files a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-7-1.png and b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-7-1.png differ diff --git a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-9-1.png b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-9-1.png index ba0fdb9a..d51957b0 100644 Binary files a/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-9-1.png and b/docs/labs/bioc/bioc_06_celltyping_files/figure-html/unnamed-chunk-9-1.png differ diff --git a/docs/labs/bioc/bioc_08_spatial.html b/docs/labs/bioc/bioc_08_spatial.html index ac246fd9..466bd197 100644 --- a/docs/labs/bioc/bioc_08_spatial.html +++ b/docs/labs/bioc/bioc_08_spatial.html @@ -165,6 +165,11 
@@ Info +
@@ -310,13 +315,9 @@

system(paste0("tar -xvzf ", file.path("data", i), " -C ", dirname(file.path("data", i)))) }
-
Downloading https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq/spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_filtered_feature_bc_matrix.tar.gz to data/spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_filtered_feature_bc_matrix.tar.gz
-Uncompressing data/spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_filtered_feature_bc_matrix.tar.gz
-Downloading https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq/spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_spatial.tar.gz to data/spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_spatial.tar.gz
+
Uncompressing data/spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_filtered_feature_bc_matrix.tar.gz
 Uncompressing data/spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_spatial.tar.gz
-Downloading https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq/spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_filtered_feature_bc_matrix.tar.gz to data/spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_filtered_feature_bc_matrix.tar.gz
 Uncompressing data/spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_filtered_feature_bc_matrix.tar.gz
-Downloading https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq/spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_spatial.tar.gz to data/spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_spatial.tar.gz
 Uncompressing data/spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_spatial.tar.gz
@@ -659,8 +660,8 @@

gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
-Ncells  10071341  537.9   14514548  775.2  14514548  775.2
-Vcells 191849982 1463.7  373707381 2851.2 373703568 2851.2
+Ncells  10077614  538.3   14514560  775.2  14514560  775.2
+Vcells 191871231 1463.9  373705667 2851.2 373705055 2851.2

Then we run dimensionality reduction and clustering as before.

@@ -748,7 +749,7 @@

@@ -799,9 +800,9 @@

rm(ar)
gc()
-
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
-Ncells  10176004  543.5   18544292  990.4  18544292  990.4
-Vcells 577825608 4408.5  833436874 6358.7 578228452 4411.6
+
            used   (Mb) gc trigger   (Mb)  max used (Mb)
+Ncells  10176084  543.5   18536281  990.0  18536281  990
+Vcells 576826421 4400.9  831998051 6347.7 577229270 4404
# check number of cells per subclass
 ar_sce$subclass <- sub("/", "_", sub(" ", "_", ar_sce$subclass))
@@ -1310,7 +1311,7 @@ 

Published with Quarto v1.3.450

- + diff --git a/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-17-1.png b/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-17-1.png index 3551233d..799f3315 100644 Binary files a/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-17-1.png and b/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-17-1.png differ diff --git a/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-20-1.png b/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-20-1.png index 423318ec..e07658dd 100644 Binary files a/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-20-1.png and b/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-20-1.png differ diff --git a/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-26-1.png b/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-26-1.png index ddad1d5e..f42859d7 100644 Binary files a/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-26-1.png and b/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-26-1.png differ diff --git a/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-27-1.png b/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-27-1.png index 5b204128..7075af19 100644 Binary files a/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-27-1.png and b/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-27-1.png differ diff --git a/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-28-1.png b/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-28-1.png index e7032722..2d9e2d0f 100644 Binary files a/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-28-1.png and b/docs/labs/bioc/bioc_08_spatial_files/figure-html/unnamed-chunk-28-1.png differ diff --git a/docs/labs/index.html b/docs/labs/index.html index bcd4d18a..03e4fe72 100644 --- a/docs/labs/index.html +++ b/docs/labs/index.html @@ -212,6 +212,11 @@ Info +
@@ -256,7 +261,7 @@

Labs

Seurat

-
+ -
+ -
+

@@ -378,7 +383,7 @@

Bioconductor

-
+

@@ -486,7 +491,7 @@

Scanpy

-
+ -
+ -
+ -
+ -
+ -
+
- + diff --git a/docs/labs/scanpy/scanpy_01_qc.html b/docs/labs/scanpy/scanpy_01_qc.html index a95de639..7ffd6a1f 100644 --- a/docs/labs/scanpy/scanpy_01_qc.html +++ b/docs/labs/scanpy/scanpy_01_qc.html @@ -165,6 +165,11 @@ Info +
@@ -905,7 +910,7 @@

@@ -1081,9 +1086,9 @@

+Session information updated at 2024-01-23 11:22

@@ -1335,7 +1340,7 @@

Published with Quarto v1.3.450

- + diff --git a/docs/labs/scanpy/scanpy_01_qc_files/figure-html/cell-11-output-1.png b/docs/labs/scanpy/scanpy_01_qc_files/figure-html/cell-11-output-1.png index 77ff024a..1173ffc4 100644 Binary files a/docs/labs/scanpy/scanpy_01_qc_files/figure-html/cell-11-output-1.png and b/docs/labs/scanpy/scanpy_01_qc_files/figure-html/cell-11-output-1.png differ diff --git a/docs/labs/scanpy/scanpy_01_qc_files/figure-html/cell-17-output-1.png b/docs/labs/scanpy/scanpy_01_qc_files/figure-html/cell-17-output-1.png index c9606aa6..c97313c9 100644 Binary files a/docs/labs/scanpy/scanpy_01_qc_files/figure-html/cell-17-output-1.png and b/docs/labs/scanpy/scanpy_01_qc_files/figure-html/cell-17-output-1.png differ diff --git a/docs/labs/scanpy/scanpy_01_qc_files/figure-html/cell-22-output-1.png b/docs/labs/scanpy/scanpy_01_qc_files/figure-html/cell-22-output-1.png index eeb25667..c3f7950f 100644 Binary files a/docs/labs/scanpy/scanpy_01_qc_files/figure-html/cell-22-output-1.png and b/docs/labs/scanpy/scanpy_01_qc_files/figure-html/cell-22-output-1.png differ diff --git a/docs/labs/scanpy/scanpy_02_dimred.html b/docs/labs/scanpy/scanpy_02_dimred.html index d03c1a8e..212993b3 100644 --- a/docs/labs/scanpy/scanpy_02_dimred.html +++ b/docs/labs/scanpy/scanpy_02_dimred.html @@ -162,6 +162,11 @@ Info +
@@ -365,7 +370,7 @@

regressing out ['total_counts', 'pct_counts_mt']
     sparse input is densified and may lead to high memory use
-    finished (0:00:46)
+    finished (0:00:50)

@@ -825,9 +830,9 @@

Published with Quarto v1.3.450

- + diff --git a/docs/labs/scanpy/scanpy_03_integration.html b/docs/labs/scanpy/scanpy_03_integration.html index cef2edce..351449e3 100644 --- a/docs/labs/scanpy/scanpy_03_integration.html +++ b/docs/labs/scanpy/scanpy_03_integration.html @@ -162,6 +162,11 @@ Info +
@@ -440,7 +445,7 @@

@@ -746,14 +751,14 @@

 computing neighbors
     finished: added to `.uns['neighbors']`
     `.obsp['distances']`, distances for each pair of neighbors
-    `.obsp['connectivities']`, weighted adjacency matrix (0:00:01)
+    `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
 computing UMAP
     finished: added
-    'X_umap', UMAP coordinates (adata.obsm) (0:00:11)
+    'X_umap', UMAP coordinates (adata.obsm) (0:00:10)
 computing tSNE
     using sklearn.manifold.TSNE
     finished: added
-    'X_tsne', tSNE coordinates (adata.obsm) (0:00:13)
+    'X_tsne', tSNE coordinates (adata.obsm) (0:00:12)

We can now plot the unintegrated and the integrated space reduced dimensions.

@@ -952,9 +957,9 @@

Published with Quarto v1.3.450

- + diff --git a/docs/labs/scanpy/scanpy_04_clustering.html b/docs/labs/scanpy/scanpy_04_clustering.html index 4570fb41..0ab70270 100644 --- a/docs/labs/scanpy/scanpy_04_clustering.html +++ b/docs/labs/scanpy/scanpy_04_clustering.html @@ -162,6 +162,11 @@ Info +
@@ -501,7 +506,7 @@

tmp = pd.crosstab(adata.obs['leiden_0.6'],adata.obs['sample'], normalize='index')
 tmp.plot.bar(stacked=True).legend(bbox_to_anchor=(1.4, 1),loc='upper right')

-
<matplotlib.legend.Legend at 0x7fff4670faf0>
+
<matplotlib.legend.Legend at 0x7fff4680fbe0>
@@ -524,7 +529,7 @@

tmp = pd.crosstab(adata.obs['sample'],adata.obs['leiden_0.6'], normalize='index')
 tmp.plot.bar(stacked=True).legend(bbox_to_anchor=(1.4, 1), loc='upper right')

-
<matplotlib.legend.Legend at 0x7fff46ac6dd0>
+
<matplotlib.legend.Legend at 0x7fff46bb3be0>
@@ -669,9 +674,9 @@

Published with Quarto v1.3.450

- + diff --git a/docs/labs/scanpy/scanpy_05_dge.html b/docs/labs/scanpy/scanpy_05_dge.html index 6d89e888..e9c349ba 100644 --- a/docs/labs/scanpy/scanpy_05_dge.html +++ b/docs/labs/scanpy/scanpy_05_dge.html @@ -165,6 +165,11 @@ Info +
@@ -391,7 +396,7 @@

sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False, key="wilcoxon")

ranking genes
-    finished (0:00:12)
+    finished (0:00:09)
@@ -410,7 +415,7 @@

sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False, key = "logreg")

ranking genes
-    finished (0:00:31)
+    finished (0:00:20)
@@ -1170,9 +1175,9 @@

+Session information updated at 2024-01-23 11:29

@@ -1424,7 +1429,7 @@

Published with Quarto v1.3.450

- + diff --git a/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-14-output-1.png b/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-14-output-1.png index bd92d0b2..e62f734d 100644 Binary files a/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-14-output-1.png and b/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-14-output-1.png differ diff --git a/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-16-output-1.png b/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-16-output-1.png index df06c98a..80c59f83 100644 Binary files a/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-16-output-1.png and b/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-16-output-1.png differ diff --git a/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-16-output-2.png b/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-16-output-2.png index 68238c73..775fc0c2 100644 Binary files a/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-16-output-2.png and b/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-16-output-2.png differ diff --git a/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-20-output-1.png b/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-20-output-1.png index 32a92ab4..3349a0c7 100644 Binary files a/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-20-output-1.png and b/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-20-output-1.png differ diff --git a/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-20-output-2.png b/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-20-output-2.png index 6ce7bd3c..4dd05465 100644 Binary files a/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-20-output-2.png and b/docs/labs/scanpy/scanpy_05_dge_files/figure-html/cell-20-output-2.png differ diff --git a/docs/labs/scanpy/scanpy_06_celltyping.html b/docs/labs/scanpy/scanpy_06_celltyping.html index a25dcf8a..509c3ea9 100644 --- a/docs/labs/scanpy/scanpy_06_celltyping.html +++ 
b/docs/labs/scanpy/scanpy_06_celltyping.html @@ -165,6 +165,11 @@ Info +
@@ -681,7 +686,7 @@

tmp = pd.crosstab(adata.obs['louvain_0.6'],adata.obs['predicted'], normalize='index')
 tmp.plot.bar(stacked=True).legend(bbox_to_anchor=(1.8, 1),loc='upper right')

-
<matplotlib.legend.Legend at 0x7fff4ef2ba30>
+
<matplotlib.legend.Legend at 0x7fff4ec62b60>
@@ -701,7 +706,7 @@

sc.pl.umap(adata, color=['louvain','louvain_0.6'], wspace=0.5)

running ingest
-    finished (0:00:22)
+ finished (0:00:20)
@@ -716,7 +721,7 @@

tmp = pd.crosstab(adata.obs['louvain_0.6'],adata.obs['louvain'], normalize='index')
 tmp.plot.bar(stacked=True).legend(bbox_to_anchor=(1.8, 1),loc='upper right')

-
<matplotlib.legend.Legend at 0x7fff310cc460>
+
<matplotlib.legend.Legend at 0x7fff4e07aa40>
@@ -1070,9 +1075,9 @@

Published with Quarto v1.3.450

- + diff --git a/docs/labs/scanpy/scanpy_07_trajectory.html b/docs/labs/scanpy/scanpy_07_trajectory.html index 4139a35c..6450ff39 100644 --- a/docs/labs/scanpy/scanpy_07_trajectory.html +++ b/docs/labs/scanpy/scanpy_07_trajectory.html @@ -165,6 +165,11 @@ Info +
@@ -992,7 +997,7 @@

+Session information updated at 2024-01-23 11:32

@@ -1244,7 +1249,7 @@

Published with Quarto v1.3.450

- + diff --git a/docs/labs/scanpy/scanpy_08_spatial.html b/docs/labs/scanpy/scanpy_08_spatial.html index 40eaf948..40f74d85 100644 --- a/docs/labs/scanpy/scanpy_08_spatial.html +++ b/docs/labs/scanpy/scanpy_08_spatial.html @@ -165,6 +165,11 @@ Info +
@@ -657,7 +662,7 @@

@@ -736,7 +741,7 @@

sc.pl.umap(adata_cortex, color="subclass", legend_loc='on data')

normalizing counts per cell
-    finished (0:00:01)
+    finished (0:00:00)
 extracting highly variable genes
-    finished (0:00:06)
+    finished (0:00:03)
 --> added
     'highly_variable', boolean vector (adata.var)
     'means', float vector (adata.var)
@@ -1442,7 +1447,7 @@ 

@@ -2054,7 +2059,7 @@

Published with Quarto v1.3.450

- + diff --git a/docs/labs/seurat/seurat_01_qc.html b/docs/labs/seurat/seurat_01_qc.html index 90dc5c22..65efb3a4 100644 --- a/docs/labs/seurat/seurat_01_qc.html +++ b/docs/labs/seurat/seurat_01_qc.html @@ -165,6 +165,11 @@ Info +
@@ -202,7 +207,7 @@

Published

-

16-Jan-2024

+

23-Jan-2024

@@ -332,7 +337,8 @@

2 Collate

-

We can now load the expression matrices and merge them into a single object. Each analysis workflow (Seurat, Scater, Scanpy, etc) has its own way of storing data. We will add dataset labels as cell.ids just in case you have overlapping barcodes between the datasets. After that we add a column Chemistry in the metadata for plotting later on.

+

We can now merge the objects into a single object. Each analysis workflow (Seurat, Scater, Scanpy, etc.) has its own way of storing data. We will add dataset labels as cell.ids in case the datasets have overlapping barcodes. After that, we add a column type to the metadata to define covid and ctrl samples.

+

But first, we need to create Seurat objects using each of the expression matrices we loaded. We define each sample in the project slot, so in each object, the sample id can be found in the metadata slot orig.ident.

sdata.cov1 <- CreateSeuratObject(cov.1, project = "covid_1")
 sdata.cov15 <- CreateSeuratObject(cov.15, project = "covid_15")
@@ -366,8 +372,8 @@ 

gc()

           used  (Mb) gc trigger (Mb)  max used   (Mb)
-Ncells  3325459 177.6    4998412  267   4998412  267.0
-Vcells 58182395 443.9  150859912 1151 136166604 1038.9
+Ncells  3325437 177.6    4998403  267   4998403  267.0
+Vcells 58182452 443.9  150860051 1151 136166661 1038.9

Here is what the count matrix and the metadata look like for every cell.

@@ -442,7 +448,7 @@

-

As you can see, there is quite some difference in quality for the 4 datasets, with for instance the covid_15 sample having fewer cells with many detected genes and more mitochondrial content. As the ribosomal proteins are highly expressed they will make up a larger proportion of the transcriptional landscape when fewer of the lowly expressed genes are detected. And we can plot the different QC-measures as scatter plots.

+

As you can see, there is quite some difference in quality for the 4 datasets, with, for instance, the covid_15 and covid_16 samples having fewer cells with many detected genes and more mitochondrial content. As the ribosomal proteins are highly expressed, they will make up a larger proportion of the transcriptional landscape when fewer of the lowly expressed genes are detected. We can also plot the different QC measures as scatter plots.

FeatureScatter(alldata, "nCount_RNA", "nFeature_RNA", group.by = "orig.ident", pt.size = .5)
@@ -555,7 +561,7 @@

5.4 Filter genes

-

As the level of expression of mitochondrial and MALAT1 genes are judged as mainly technical, it can be wise to remove them from the dataset before any further analysis.

+

As the expression levels of mitochondrial and MALAT1 genes are judged to be mainly technical, it can be wise to remove them from the dataset before any further analysis. In this case we will also remove the HB genes.

dim(data.filt)
@@ -571,11 +577,11 @@

# data.filt <- data.filt[ ! grepl("^RP[SL]", rownames(data.filt)), ]
# Filter Hemoglobin gene (optional if that is a problem on your data)
-data.filt <- data.filt[!grepl("^HB[^(P)]", rownames(data.filt)), ]
+data.filt <- data.filt[!grepl("^HB[^(PES)]", rownames(data.filt)), ]
dim(data.filt)

-
[1] 18851  7431
+
[1] 18854  7431
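
The effect of widening the character class from `[^(P)]` to `[^(PES)]` can be checked on a handful of gene symbols. The sketch below is written in Python rather than R, and the symbol list is an illustrative assumption, not read from the dataset. Inside a bracket expression the parentheses are literal characters, so the new pattern spares any name whose third character is `(`, `)`, `P`, `E` or `S`.

```python
import re

# Patterns from the filtering step; parentheses inside [...] are literal characters.
old_pattern = re.compile(r"^HB[^(P)]")
new_pattern = re.compile(r"^HB[^(PES)]")

# Illustrative gene symbols (an assumption, not read from the dataset).
genes = ["HBA1", "HBB", "HBD", "HBP1", "HBEGF", "HBS1L", "MALAT1"]


def kept(pattern, names):
    """Return the names that survive the filter, i.e. do NOT match the pattern."""
    return [g for g in names if not pattern.search(g)]


kept_old = kept(old_pattern, genes)
kept_new = kept(new_pattern, genes)
print(kept_old)
print(kept_new)
```

With these symbols, the wider class keeps HBEGF and HBS1L in addition to HBP1, which is consistent with the gene count rising from 18851 to 18854, although the diff does not show which genes account for that difference in the real data. Note that a hemoglobin gene whose symbol starts with HBE or HBS would also be kept by the new pattern.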
@@ -597,7 +603,7 @@

"description", "gene_biotype", "chromosome_name", "start_position" ), mart = mart, useCache = F))
-write.csv(genes_table, file = "data/results/genes_table.csv")

+write.csv(genes_table, file = "data/covid/results/genes_table.csv")
genes_file <- file.path(path_results, "genes_table.csv")
@@ -634,7 +640,19 @@ 

+
+
+ +
+
+Discuss +
+
+
+

Here, we can clearly see that we have three males and five females. Can you see which samples they are? Do you think this will cause any problems for downstream analysis? Discuss with your group: what would be the best way to deal with this type of sex bias?

+
+

7 Cell cycle state

@@ -755,7 +773,7 @@

data.filt <- data.filt[, data.filt@meta.data[, DF.name] == "Singlet"]
 dim(data.filt)

-
[1] 18851  7134
+
[1] 18854  7134
@@ -1092,7 +1110,7 @@

Published with Quarto v1.3.450 - + diff --git a/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-23-1.png b/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-23-1.png index 5c5c6b40..d3af4cd8 100644 Binary files a/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-23-1.png and b/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-23-1.png differ diff --git a/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-24-1.png b/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-24-1.png index a0619e9b..ca31ef29 100644 Binary files a/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-24-1.png and b/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-24-1.png differ diff --git a/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-26-1.png b/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-26-1.png index b3c6d8d8..800681ae 100644 Binary files a/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-26-1.png and b/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-26-1.png differ diff --git a/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-27-1.png b/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-27-1.png index 11ea75bc..a1d6f0c9 100644 Binary files a/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-27-1.png and b/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-27-1.png differ diff --git a/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-30-1.png b/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-30-1.png index df40fd9e..6ccff8cc 100644 Binary files a/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-30-1.png and b/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-30-1.png differ diff --git a/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-31-1.png b/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-31-1.png index 
6a8cc92d..c55ac0a6 100644 Binary files a/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-31-1.png and b/docs/labs/seurat/seurat_01_qc_files/figure-html/unnamed-chunk-31-1.png differ diff --git a/docs/labs/seurat/seurat_02_dimred.html b/docs/labs/seurat/seurat_02_dimred.html index c8bc6458..85623c41 100644 --- a/docs/labs/seurat/seurat_02_dimred.html +++ b/docs/labs/seurat/seurat_02_dimred.html @@ -165,6 +165,11 @@ Info +
@@ -960,7 +965,7 @@

Published with Quarto v1.3.450 - + diff --git a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-10-1.png b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-10-1.png index f11490e0..855ad6f2 100644 Binary files a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-10-1.png and b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-10-1.png differ diff --git a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-13-1.png b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-13-1.png index 76431794..00c8025f 100644 Binary files a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-13-1.png and b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-13-1.png differ diff --git a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-14-1.png b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-14-1.png index 9d6eec21..14e1443d 100644 Binary files a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-14-1.png and b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-14-1.png differ diff --git a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-17-1.png b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-17-1.png index 2d23fe3e..cbb535e4 100644 Binary files a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-17-1.png and b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-17-1.png differ diff --git a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-18-1.png b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-18-1.png index fe762382..5e3b67d2 100644 Binary files a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-18-1.png and b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-18-1.png differ diff --git a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-3-1.png 
b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-3-1.png index be981ab4..ab3bff8a 100644 Binary files a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-3-1.png and b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-3-1.png differ diff --git a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-6-1.png b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-6-1.png index 71e4c841..a2c80289 100644 Binary files a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-6-1.png and b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-6-1.png differ diff --git a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-7-1.png b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-7-1.png index bc5ed0e6..e0992051 100644 Binary files a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-7-1.png and b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-7-1.png differ diff --git a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-8-1.png b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-8-1.png index 4c9b362b..cc7bd688 100644 Binary files a/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-8-1.png and b/docs/labs/seurat/seurat_02_dimred_files/figure-html/unnamed-chunk-8-1.png differ diff --git a/docs/labs/seurat/seurat_03_integration.html b/docs/labs/seurat/seurat_03_integration.html index a8c3e7d2..5cda348e 100644 --- a/docs/labs/seurat/seurat_03_integration.html +++ b/docs/labs/seurat/seurat_03_integration.html @@ -165,6 +165,11 @@ Info +
@@ -442,8 +447,8 @@

gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
-Ncells   3414313  182.4    4989403  266.5   4989403  266.5
-Vcells 203222859 1550.5  564336378 4305.6 879335547 6708.8
+Ncells   3414481  182.4    4989417  266.5   4989417  266.5
+Vcells 203242618 1550.7  556191914 4243.5 868979728 6629.8

Let’s plot some marker genes for different cell types onto the embedding.

@@ -561,7 +566,7 @@

Published with Quarto v1.3.450 - + diff --git a/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-11-1.png b/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-11-1.png index 3413c9aa..3c1a9fe6 100644 Binary files a/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-11-1.png and b/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-11-1.png differ diff --git a/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-14-1.png b/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-14-1.png index 0b64ee1d..05366d60 100644 Binary files a/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-14-1.png and b/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-14-1.png differ diff --git a/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-3-1.png b/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-3-1.png index f63e3198..6b2c3287 100644 Binary files a/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-3-1.png and b/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-3-1.png differ diff --git a/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-4-1.png b/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-4-1.png index 6cd233fd..044f0e1e 100644 Binary files a/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-4-1.png and b/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-4-1.png differ diff --git a/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-9-1.png b/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-9-1.png index cf42f399..fe0c44f5 100644 Binary files a/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-9-1.png and b/docs/labs/seurat/seurat_03_integration_files/figure-html/unnamed-chunk-9-1.png differ 
diff --git a/docs/labs/seurat/seurat_04_clustering.html b/docs/labs/seurat/seurat_04_clustering.html index 64baeb8b..81b5c138 100644 --- a/docs/labs/seurat/seurat_04_clustering.html +++ b/docs/labs/seurat/seurat_04_clustering.html @@ -165,6 +165,11 @@ Info +
@@ -202,7 +207,7 @@

Published
-

16-Jan-2024

+

19-Jan-2024

@@ -228,9 +233,9 @@

On this page

-
  • 4 Session info
  • +
  • 4 Distribution of clusters
  • +
  • 5 Session info
  • @@ -472,8 +477,9 @@

    saveRDS(alldata, "data/covid/results/seurat_covid_qc_dr_int_cl.rds")
    -
    -

    3.3 Distribution of clusters

    +
    +
    +

    4 Distribution of clusters

    Now, we can select one of our clustering methods and compare the proportion of samples across the clusters.

    Select the “CCA_snn_res.0.5” and plot proportion of samples per cluster and also proportion covid vs ctrl.
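
A proportion table of this kind is a cross-tabulation normalized per cluster, the same quantity the Scanpy lab obtains with `pd.crosstab(..., normalize='index')`. The sketch below shows the computation in plain Python; the cluster and sample labels are hypothetical and the helper name `proportions_per_cluster` is introduced here only for illustration.

```python
from collections import Counter, defaultdict


def proportions_per_cluster(clusters, samples):
    """Fraction of each sample within every cluster (each row sums to 1)."""
    counts = defaultdict(Counter)
    for cluster, sample in zip(clusters, samples):
        counts[cluster][sample] += 1
    return {
        cluster: {s: n / sum(c.values()) for s, n in c.items()}
        for cluster, c in counts.items()
    }


# Hypothetical labels for six cells; not taken from the course data.
clusters = ["0", "0", "0", "1", "1", "1"]
samples = ["covid_1", "covid_1", "ctrl_5", "ctrl_5", "ctrl_5", "ctrl_5"]

table = proportions_per_cluster(clusters, samples)
for cluster, row in sorted(table.items()):
    print(cluster, row)
```

A cluster dominated by a single sample, like cluster "1" in this toy input, is the pattern to look for when judging whether a cluster reflects biology or a batch effect.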

    @@ -518,9 +524,8 @@

    - -
    -

    4 Session info

    +
    +

    5 Session info

    Click here @@ -844,7 +849,7 @@

    Published with Quarto v1.3.450 - + diff --git a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-13-1.png b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-13-1.png index 34ef6a8d..33e07142 100644 Binary files a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-13-1.png and b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-13-1.png differ diff --git a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-15-1.png b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-15-1.png index 41ec9701..153b8d2b 100644 Binary files a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-15-1.png and b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-15-1.png differ diff --git a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-16-1.png b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-16-1.png index 9afa6c80..7721d718 100644 Binary files a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-16-1.png and b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-16-1.png differ diff --git a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-4-1.png b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-4-1.png index c02a8fa1..3ee5c2c5 100644 Binary files a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-4-1.png and b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-4-1.png differ diff --git a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-4-2.png b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-4-2.png index 393f3d87..5537d6f2 100644 Binary files a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-4-2.png and b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-4-2.png differ diff --git 
a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-6-1.png b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-6-1.png index 53da4ce5..6f37879e 100644 Binary files a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-6-1.png and b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-6-1.png differ diff --git a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-7-1.png b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-7-1.png index 5367ec66..5264477a 100644 Binary files a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-7-1.png and b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-7-1.png differ diff --git a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-8-1.png b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-8-1.png index 3fc43163..83e75da6 100644 Binary files a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-8-1.png and b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-8-1.png differ diff --git a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-9-1.png b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-9-1.png index 3bd99bdc..efb724f4 100644 Binary files a/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-9-1.png and b/docs/labs/seurat/seurat_04_clustering_files/figure-html/unnamed-chunk-9-1.png differ diff --git a/docs/labs/seurat/seurat_05_dge.html b/docs/labs/seurat/seurat_05_dge.html index f03318d3..f3920c60 100644 --- a/docs/labs/seurat/seurat_05_dge.html +++ b/docs/labs/seurat/seurat_05_dge.html @@ -165,6 +165,11 @@ Info +
    @@ -287,7 +292,7 @@

    On this page

    
        0    1    2    3    4    5    6    7    8 
    -2056 1259 1113  646  535  494  365  337  329 
    +2063 1297 1073  642  546  489  368  336  320
    @@ -332,7 +337,7 @@

    @@ -559,7 +564,7 @@

    
      covid_1 covid_15 covid_16 covid_17  ctrl_13  ctrl_14  ctrl_19   ctrl_5 
    -      95       32       37      173       64       62       37      146 
    +      93       32       37      173       62       62       37      146

    @@ -661,17 +666,17 @@

    topTags(qlf)
    Coefficient:  bulk.labelsCtrl 
    -            logFC   logCPM        F       PValue        FDR
    -S100A8  -2.672605 6.972711 37.41996 6.779653e-06 0.01083389
    -S100A9  -2.512717 7.374885 27.28588 5.193871e-05 0.04149903
    -STAG3   -3.378653 7.540873 24.35275 8.987020e-05 0.04787086
    -PIM3    -1.412489 7.839512 17.02383 6.030641e-04 0.23537510
    -IGHA1   -2.676072 6.965149 16.09405 7.364678e-04 0.23537510
    -DYNC1H1  1.279395 6.711434 12.94684 1.976508e-03 0.52641010
    -PHACTR1 -1.207474 7.908323 11.47741 3.176723e-03 0.67568316
    -CCR7    -1.301642 8.017766 11.28727 3.382644e-03 0.67568316
    -WDFY2    1.172984 7.133247 10.76672 4.049332e-03 0.69392707
    -MOB3A   -1.128665 7.131236 10.56187 4.342472e-03 0.69392707
    +             logFC   logCPM        F       PValue         FDR
    +S100A8   -2.769215 6.963840 45.76310 1.792203e-06 0.002996563
    +S100A9   -2.605746 7.463864 29.05267 3.622977e-05 0.030288086
    +STAG3    -3.130834 7.358135 20.80773 2.141285e-04 0.119340964
    +IGHA1    -2.777404 6.965359 18.84381 3.484837e-04 0.145666204
    +DYNC1H1   1.371425 6.575657 14.60978 1.187609e-03 0.397136505
    +PIM3     -1.391713 7.788553 12.91552 1.991325e-03 0.499573279
    +TLE1     -1.135713 7.356197 12.76593 2.091515e-03 0.499573279
    +TRAF3IP3  1.206566 7.489238 11.85600 2.799146e-03 0.585021456
    +WDFY2     1.178937 7.063276 11.38622 3.275591e-03 0.608532060
    +AHNAK     1.163878 7.833990 11.02406 3.680067e-03 0.615307195

    As you can see, we have very few significant genes. Since we only have 4 vs 4 samples, we should not expect too many genes with this method.
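
The FDR column reported by `topTags` is, by default, a Benjamini-Hochberg adjustment of the p-values over all tested genes. The following Python sketch of that adjustment runs on made-up p-values (hypothetical, not the course data), and the function name `benjamini_hochberg` is introduced here only for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(pvals)
significant = [p for p, q in zip(pvals, fdr) if q < 0.05]
print(fdr)
print(len(significant))
```

Each p-value is scaled by the number of tests divided by its rank, so with thousands of genes and only 4 vs 4 samples, a raw p-value has to be very small before its adjusted value drops below 0.05. That is why only the first few rows of the table pass that threshold.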

    @@ -1050,33 +1055,32 @@

    Published with Quarto v1.3.450 - + diff --git a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-10-1.png b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-10-1.png index 00167c65..c290b7d7 100644 Binary files a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-10-1.png and b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-10-1.png differ diff --git a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-12-1.png b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-12-1.png index 2921128b..24a86fb2 100644 Binary files a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-12-1.png and b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-12-1.png differ diff --git a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-13-1.png b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-13-1.png index 2ae0ede9..f7c724da 100644 Binary files a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-13-1.png and b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-13-1.png differ diff --git a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-17-1.png b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-17-1.png index 9efdac33..be428cf2 100644 Binary files a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-17-1.png and b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-17-1.png differ diff --git a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-18-1.png b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-18-1.png index 689f4dd9..7cf35521 100644 Binary files a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-18-1.png and b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-18-1.png differ diff --git a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-19-1.png 
b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-19-1.png index 8b130243..3e86fff0 100644 Binary files a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-19-1.png and b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-19-1.png differ diff --git a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-20-1.png b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-20-1.png index 45443bbc..5f11df24 100644 Binary files a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-20-1.png and b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-20-1.png differ diff --git a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-23-1.png b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-23-1.png index bf3c24a7..e9c976bb 100644 Binary files a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-23-1.png and b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-23-1.png differ diff --git a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-24-1.png b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-24-1.png index 5d1777a9..4b8b9e86 100644 Binary files a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-24-1.png and b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-24-1.png differ diff --git a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-27-1.png b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-27-1.png index 485de53d..1746038e 100644 Binary files a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-27-1.png and b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-27-1.png differ diff --git a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-34-1.png b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-34-1.png index b26c1364..bbc146fb 100644 Binary files 
a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-34-1.png and b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-34-1.png differ diff --git a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-39-1.png b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-39-1.png index 7499072a..37c9f710 100644 Binary files a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-39-1.png and b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-39-1.png differ diff --git a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-4-1.png b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-4-1.png index 254068b3..404e26b2 100644 Binary files a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-4-1.png and b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-4-1.png differ diff --git a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-7-1.png b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-7-1.png index a095db4a..b98421b9 100644 Binary files a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-7-1.png and b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-7-1.png differ diff --git a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-8-1.png b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-8-1.png index 97131516..16366cea 100644 Binary files a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-8-1.png and b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-8-1.png differ diff --git a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-9-1.png b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-9-1.png index f56a4879..838002b1 100644 Binary files a/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-9-1.png and b/docs/labs/seurat/seurat_05_dge_files/figure-html/unnamed-chunk-9-1.png differ diff --git 
a/docs/labs/seurat/seurat_06_celltyping.html b/docs/labs/seurat/seurat_06_celltyping.html index 68f619cf..ee0ff7a5 100644 --- a/docs/labs/seurat/seurat_06_celltyping.html +++ b/docs/labs/seurat/seurat_06_celltyping.html @@ -165,6 +165,11 @@ Info +
    @@ -289,8 +294,8 @@

    ctrl
    An object of class Seurat 
    -18851 features across 1126 samples within 1 assay 
    -Active assay: RNA (18851 features, 2000 variable features)
    +18854 features across 1126 samples within 1 assay 
    +Active assay: RNA (18854 features, 2000 variable features)
      6 dimensional reductions calculated: umap, tsne, umap_raw, pca_harmony, harmony, umap_harmony
    @@ -432,8 +437,8 @@

    ●  Matching reference with new dataset...
          ─ 2000 features present in reference loadings
    -     ─ 1782 features shared between reference and new dataset
    -     ─ 89.1% of features in the reference are present in new dataset
    +     ─ 1783 features shared between reference and new dataset
    +     ─ 89.15% of features in the reference are present in new dataset
     ●  Aligning new data to reference...
     ●  Classifying cells...
     DONE!
    @@ -472,7 +477,7 @@

    @@ -505,7 +510,7 @@

    unlist(lapply(DGE_list, nrow))
       0    1    2    3    4    5    6    7    8 
    -3349 4118 3271 2504 2061 2581 2426 3487 2355 
    +3307 4102 3289 2478 2017 2522 2483 3513 2298
    @@ -564,97 +569,95 @@

    res

    $`0`
    -   pathway       pval        padj        ES      NES nMoreExtreme size
    -1:   cMono 0.00009999 0.000299970 0.9594422 2.067666            0   48
    -2:  ncMono 0.00009999 0.000299970 0.8385199 1.797428            0   43
    -3:     cDC 0.00009999 0.000299970 0.8394045 1.795307            0   41
    -4:     pDC 0.00180415 0.004059336 0.7492218 1.535717           17   21
    -5: NK cell 0.02711970 0.048815461 0.7545862 1.436992          260   10
    -6:  B cell 0.06447382 0.096710725 0.6666689 1.329777          638   15
    +   pathway        pval        padj        ES      NES nMoreExtreme size
    +1:   cMono 0.000099990 0.000299970 0.9596774 2.060568            0   48
    +2:     cDC 0.000099990 0.000299970 0.8397658 1.790162            0   41
    +3:  ncMono 0.000099990 0.000299970 0.8371549 1.787799            0   43
    +4:     pDC 0.001203369 0.002707581 0.7436632 1.519896           11   21
    +5: NK cell 0.029631165 0.053336096 0.7493872 1.424591          285   10
    +6:  B cell 0.059484067 0.089226100 0.6673689 1.326559          587   15
                                         leadingEdge
     1:      S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...
    -2:     CTSS,TYMP,CST3,S100A11,AIF1,SERPINA1,...
    -3:            LYZ,GRN,TYMP,CST3,AIF1,LGALS2,...
    +2:            LYZ,GRN,TYMP,CST3,AIF1,LGALS2,...
    +3:     CTSS,TYMP,CST3,S100A11,AIF1,SERPINA1,...
     4:         GRN,MS4A6A,CST3,MPEG1,CTSB,TGFBI,...
     5:      TYROBP,FCER1G,SRGN,CCL3,MYO1F,ITGB2,...
     6: NCF1,LY86,MARCH1,POU2F2,HLA-DMB,HLA-DRB5,...
     
     $`1`
            pathway         pval         padj        ES      NES nMoreExtreme size
    -1:     NK cell 0.0000999900 0.0004007213 0.9459800 2.369826            0   48
    -2:  CD8 T cell 0.0001001803 0.0004007213 0.9230826 2.201075            0   25
    -3:      ncMono 0.0008014655 0.0016029311 0.9101411 1.755775            6    6
    -4:         pDC 0.0078939059 0.0126302494 0.7711439 1.640731           74   10
    -5: Plasma cell 0.0007002101 0.0016029311 0.6711407 1.625909            6   30
    -                                     leadingEdge
    -1:          GNLY,GZMB,FGFBP2,PRF1,NKG7,SPON2,...
    -2:           GNLY,GZMB,FGFBP2,PRF1,NKG7,CTSW,...
    -3:                            FCGR3A,IFITM2,RHOC
    -4: GZMB,C12orf75,HSP90B1,ALOX5AP,PLAC8,RRBP1,...
    -5:      FKBP11,CD38,SDF2L1,PRDM1,PPIB,SLAMF7,...
    +1:     NK cell 0.0000999900 0.0004014049 0.9481456 2.370524            0   48
    +2:  CD8 T cell 0.0001003512 0.0004014049 0.9281929 2.209379            0   25
    +3:      ncMono 0.0004561524 0.0012164063 0.9189116 1.778055            3    6
    +4:         pDC 0.0096296296 0.0154074074 0.7758141 1.647353           90   10
    +5: Plasma cell 0.0013024747 0.0026049494 0.6721222 1.626799           12   30
    +                                 leadingEdge
    +1:      GNLY,GZMB,FGFBP2,PRF1,NKG7,SPON2,...
    +2:       GNLY,GZMB,FGFBP2,PRF1,NKG7,CTSW,...
    +3:                        FCGR3A,IFITM2,RHOC
    +4: GZMB,C12orf75,HSP90B1,ALOX5AP,PLAC8,RRBP1
    +5:  FKBP11,PRDM1,CD38,SDF2L1,PPIB,SLAMF7,...
     
     $`2`
            pathway         pval         padj        ES      NES nMoreExtreme size
    -1:  CD8 T cell 0.0001001101 0.0003503854 0.9406368 2.161149            0   29
    -2:     NK cell 0.0001000500 0.0003503854 0.8208967 1.898566            0   32
    -3:  CD4 T cell 0.0014347202 0.0033476805 0.8706473 1.681457           12    7
    -4: Plasma cell 0.0744595677 0.1042433947 0.5638039 1.298014          743   30
    +1:  CD8 T cell 0.0001000801 0.0003502802 0.9332541 2.141759            0   29
    +2:     NK cell 0.0001000600 0.0003502802 0.8217604 1.895062            0   31
    +3:  CD4 T cell 0.0019867550 0.0046357616 0.8693316 1.678458           17    7
    +4: Plasma cell 0.0817572301 0.1144601221 0.5564210 1.279938          816   30
                                    leadingEdge
    -1:       CD3D,CD8A,CD3G,CCL5,CD8B,GZMH,...
    -2:       CCL5,GZMA,CCL4,NKG7,GZMM,CST7,...
    +1:       CD3D,CD8A,CD3G,CD8B,CCL5,GZMH,...
    +2:       CCL5,GZMA,CCL4,GZMM,NKG7,CST7,...
     3:        CD3D,CD3G,CD3E,IL7R,PIK3IP1,TCF7
     4: FKBP11,PRDM1,PEBP1,PPIB,SEC11C,SUB1,...
     
     $`3`
            pathway         pval         padj        ES      NES nMoreExtreme size
    -1:      B cell 0.0000999900 0.0002706726 0.9072478 1.989836            0   46
    -2:         cDC 0.0001015022 0.0002706726 0.8950426 1.806256            0   14
    -3:         pDC 0.0001008878 0.0002706726 0.8292186 1.700818            0   17
    -4: Plasma cell 0.0348925962 0.0558281540 0.7900880 1.456388          319    7
    +1:      B cell 0.0000999900 0.0004060914 0.9070112 2.004342            0   46
    +2:         cDC 0.0001015228 0.0004060914 0.8951817 1.814950            0   14
    +3:         pDC 0.0004026170 0.0010736454 0.7937887 1.648175            3   18
    +4: Plasma cell 0.0800554312 0.1280886899 0.7211352 1.360348          750    8
                                                 leadingEdge
     1:      CD79A,LINC00926,TCL1A,MS4A1,TNFRSF13C,CD79B,...
    -2: CD74,HLA-DQB1,HLA-DRA,HLA-DPB1,HLA-DRB1,HLA-DQA1,...
    +2: CD74,HLA-DQB1,HLA-DRA,HLA-DRB1,HLA-DPB1,HLA-DQA1,...
     3:            CD74,BCL11A,TCF4,IRF8,HERPUD1,TSPAN13,...
    -4:                       PLPP5,ISG20,HERPUD1,MZB1,ITM2C
    +4:                 PLPP5,ISG20,HERPUD1,MZB1,ITM2C,DERL3
     
     $`4`
           pathway         pval         padj        ES      NES nMoreExtreme size
    -1: CD4 T cell 0.0001015744 0.0003199659 0.9121965 1.771474            0   14
    -2: CD8 T cell 0.0001066553 0.0003199659 0.9014219 1.638647            0    8
    +1: CD4 T cell 0.0001014610 0.0006087662 0.9093092 1.771592            0   14
    +2: CD8 T cell 0.0007438104 0.0022314313 0.8911450 1.626159            6    8
                              leadingEdge
     1: IL7R,LTB,LDHB,MAL,RCAN3,NOSIP,...
     2:      CD3D,IL32,CD3G,CD2,CD3E,CD8B
     
     $`5`
        pathway       pval      padj        ES      NES nMoreExtreme size
    -1:  B cell 0.07818977 0.2503001 0.8293407 1.392212          678    5
    -2:     pDC 0.04285714 0.2503001 0.7176562 1.385862          425   18
    -3:  ncMono 0.08343337 0.2503001 0.6474875 1.279034          833   28
    +1:     pDC 0.03817025 0.2740774 0.7278455 1.398044          377   17
    +2:  ncMono 0.06090609 0.2740774 0.6549736 1.297265          608   30
                                   leadingEdge
    -1:                   PDLIM1,HLA-DRB5,STX7
    -2:   PTCRA,TXN,C12orf75,CST3,CTSB,APP,...
    -3: OAZ1,TIMP1,IFITM3,FKBP1A,CD68,CST3,...
    +1:   PTCRA,TXN,C12orf75,CST3,APP,CTSB,...
    +2: OAZ1,TIMP1,IFITM3,FKBP1A,CD68,CST3,...
     
     $`6`
            pathway         pval         padj        ES      NES nMoreExtreme size
    -1:      B cell 0.0000999900 0.0005417852 0.8919905 1.838712            0   45
    -2:         cDC 0.0002031694 0.0005417852 0.8894057 1.705469            1   14
    -3:         pDC 0.0002015316 0.0005417852 0.8313241 1.622237            1   17
    -4: Plasma cell 0.0232629013 0.0281224853 0.7396460 1.418299          228   14
    +1:      B cell 0.0000999900 0.0004067521 0.8905126 1.833414            0   45
    +2:         cDC 0.0001016880 0.0004067521 0.8877832 1.700394            0   14
    +3:         pDC 0.0003024803 0.0008066142 0.8341772 1.624004            2   17
    +4: Plasma cell 0.0277580792 0.0341582581 0.7291313 1.403876          273   15
                                                 leadingEdge
     1:        CD79A,MS4A1,BANK1,HLA-DQA1,CD74,TNFRSF13C,...
     2: HLA-DQA1,CD74,HLA-DRA,HLA-DPB1,HLA-DQB1,HLA-DPA1,...
    -3:             CD74,JCHAIN,SPIB,TCF4,CCDC50,HERPUD1,...
    -4:                JCHAIN,HERPUD1,ISG20,PEBP1,MZB1,ITM2C
    +3:             CD74,JCHAIN,SPIB,TCF4,HERPUD1,CCDC50,...
    +4:                JCHAIN,HERPUD1,ISG20,ITM2C,PEBP1,MZB1
     
     $`7`
        pathway       pval         padj        ES      NES nMoreExtreme size
    -1:  ncMono 0.00009999 0.0002666667 0.9644737 2.033813            0   49
    -2:   cMono 0.00010000 0.0002666667 0.8854337 1.838288            0   36
    -3:     cDC 0.00009999 0.0002666667 0.8309648 1.730082            0   38
    -4: NK cell 0.01025485 0.0205096964 0.7621593 1.478759          100   14
    -5:     pDC 0.02631313 0.0421010019 0.7165790 1.398343          259   15
    -6:  B cell 0.05732420 0.0764322654 0.6694322 1.321810          568   17
    +1:  ncMono 0.00009999 0.0002667734 0.9653377 2.038133            0   49
    +2:   cMono 0.00010004 0.0002667734 0.8842478 1.838505            0   36
    +3:     cDC 0.00010002 0.0002667734 0.8287084 1.729668            0   38
    +4: NK cell 0.00700721 0.0140144206 0.7660007 1.492316           68   14
    +5:     pDC 0.02330058 0.0372809239 0.7210229 1.413360          229   15
    +6:  B cell 0.05925627 0.0790083644 0.6660721 1.322621          587   17
                                                   leadingEdge
     1:                CDKN1C,LST1,FCGR3A,MS4A7,AIF1,COTL1,...
     2:               LST1,AIF1,COTL1,SERPINA1,FCER1G,CST3,...
    @@ -665,10 +668,10 @@ 

    @@ -920,9 +923,9 @@

    $`0`
           pathway      pval        padj        ES      NES nMoreExtreme size
    -1: Neutrophil 9.999e-05 0.002274773 0.8596747 1.768180            0   22
    -2:   Monocyte 9.999e-05 0.002274773 0.8152552 1.737522            0   40
    -3: Eosinophil 9.999e-05 0.002274773 0.8683785 1.723714            0   13
    +1: Neutrophil 9.999e-05 0.001819818 0.8582790 1.761773            0   22
    +2:   Monocyte 9.999e-05 0.001819818 0.8133774 1.731423            0   40
    +3: Eosinophil 9.999e-05 0.001819818 0.8660814 1.711202            0   13
                                      leadingEdge
     1: S100A8,S100A9,CD14,CSF3R,S100A6,PLAUR,...
     2:   S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...
    @@ -934,19 +937,19 @@ 

    +1: 0 12 CCR7,TCF7,IL7R,LEF1,TSHZ2,RCAN3,...
    +2: 0 15 CCR7,TCF7,LEF1,TSHZ2,RCAN3,MAL,...
    +3: 0 10 CCR7,TCF7,LEF1,LTB,TSHZ2,MAL,...

    #CT_GSEA8:

    @@ -1386,7 +1389,7 @@

    Published with Quarto v1.3.450

    - +
    diff --git a/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-10-1.png b/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-10-1.png
    index 14ce79a9..de851287 100644
    Binary files a/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-10-1.png and b/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-10-1.png differ
    diff --git a/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-11-1.png b/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-11-1.png
    index 6e16ceba..a597bf70 100644
    Binary files a/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-11-1.png and b/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-11-1.png differ
    diff --git a/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-15-1.png b/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-15-1.png
    index 651d132b..8e236f50 100644
    Binary files a/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-15-1.png and b/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-15-1.png differ
    diff --git a/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-16-1.png b/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-16-1.png
    index 19b65de7..dcbe9be7 100644
    Binary files a/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-16-1.png and b/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-16-1.png differ
    diff --git a/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-21-1.png b/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-21-1.png
    index 0e41d27a..38032726 100644
    Binary files a/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-21-1.png and b/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-21-1.png differ
    diff --git a/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-22-1.png b/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-22-1.png
    index 35b587fe..2e7b1403 100644
    Binary files a/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-22-1.png and b/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-22-1.png differ
    diff --git a/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-26-1.png b/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-26-1.png
    index 203a27b0..82131c4f 100644
    Binary files a/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-26-1.png and b/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-26-1.png differ
    diff --git a/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-8-1.png b/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-8-1.png
    index 24003659..af4b8f94 100644
    Binary files a/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-8-1.png and b/docs/labs/seurat/seurat_06_celltyping_files/figure-html/unnamed-chunk-8-1.png differ
    diff --git a/docs/labs/seurat/seurat_07_trajectory.html b/docs/labs/seurat/seurat_07_trajectory.html
    index 2d5d0e96..4f6c5785 100644
    --- a/docs/labs/seurat/seurat_07_trajectory.html
    +++ b/docs/labs/seurat/seurat_07_trajectory.html
    @@ -173,6 +173,11 @@ Info +
    @@ -266,14 +271,14 @@

    On this page

    1 Loading libraries

    suppressPackageStartupMessages({
    -    library(Seurat)
    -    library(plotly)
    -    options(rgl.printRglwidget = TRUE)
    -    library(Matrix)
    -    library(sparseMatrixStats)
    -    library(slingshot)
    -    library(tradeSeq)
    -    library(patchwork)
    +  library(Seurat)
    +  library(plotly)
    +  options(rgl.printRglwidget = TRUE)
    +  library(Matrix)
    +  library(sparseMatrixStats)
    +  library(slingshot)
    +  library(tradeSeq)
    +  library(patchwork)
     })
     
     # Define some color palette
    @@ -285,11 +290,11 @@ 

    # Add graph to the base R graphics plot
     draw_graph <- function(layout, graph, lwd = 0.2, col = "grey") {
    -    res <- rep(x = 1:(length(graph@p) - 1), times = (graph@p[-1] - graph@p[-length(graph@p)]))
    -    segments(
    -        x0 = layout[graph@i + 1, 1], x1 = layout[res, 1],
    -        y0 = layout[graph@i + 1, 2], y1 = layout[res, 2], lwd = lwd, col = col
    -    )
    +  res <- rep(x = 1:(length(graph@p) - 1), times = (graph@p[-1] - graph@p[-length(graph@p)]))
    +  segments(
    +    x0 = layout[graph@i + 1, 1], x1 = layout[res, 1],
    +    y0 = layout[graph@i + 1, 2], y1 = layout[res, 2], lwd = lwd, col = col
    +  )
     }
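
The `res` vector in `draw_graph` expands the column pointer `graph@p` of a column-compressed sparse matrix into one column index per stored entry, so that `graph@i + 1` (row) and `res` (column) together give the two endpoints of every edge. A minimal sketch on a toy matrix, assuming `graph` is a `dgCMatrix` from the Matrix package; the matrix `g` and its edges are made up for illustration:

```r
library(Matrix)

# Toy 4 x 4 adjacency matrix with edges 1-2, 2-3, 3-4 (hypothetical data)
g <- sparseMatrix(
  i = c(1, 2, 3),
  j = c(2, 3, 4),
  x = c(1, 1, 1),
  dims = c(4, 4)
)

# g@p holds cumulative entry counts per column; diff() gives entries per column
n_per_col <- diff(g@p)

# Repeat each column index once per stored entry (same idea as `res` in draw_graph)
res <- rep(seq_len(ncol(g)), times = n_per_col)

# Row indices in g@i are zero-based, hence the + 1
edges <- cbind(from = g@i + 1, to = res)
print(edges)
```

Each row of `edges` corresponds to one call-site pair `(x0, y0)` to `(x1, y1)` that `segments()` draws once the indices are looked up in `layout`.
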

    @@ -332,7 +337,7 @@

    pl <- list()
     for (i in vars) {
    -    pl[[i]] <- DimPlot(obj, group.by = i, label = T) + theme_void() + NoLegend()
    +  pl[[i]] <- DimPlot(obj, group.by = i, label = T) + theme_void() + NoLegend()
     }
     wrap_plots(pl)
    @@ -416,7 +421,7 @@

    pl <- list(DimPlot(obj, group.by = "clusters_use", label = T) + theme_void() + NoLegend())
     for (i in vars) {
    -    pl[[i]] <- FeaturePlot(obj, features = i, order = T) + theme_void() + NoLegend()
    +  pl[[i]] <- FeaturePlot(obj, features = i, order = T) + theme_void() + NoLegend()
     }
     wrap_plots(pl)

    @@ -449,7 +454,7 @@

    p_State

    - +
    @@ -464,11 +469,11 @@

    set.seed(1)
     lineages <- as.SlingshotDataSet(getLineages(
    -    data = obj@reductions$umap3d@cell.embeddings,
    -    clusterLabels = obj$clusters_use,
    -    dist.method = "mnn", # It can be: "simple", "scaled.full", "scaled.diag", "slingshot" or "mnn"
    -    end.clus = ENDS, # You can also define the ENDS!
    -    start.clus = "34"
    +  data = obj@reductions$umap3d@cell.embeddings,
    +  clusterLabels = obj$clusters_use,
    +  dist.method = "mnn", # It can be: "simple", "scaled.full", "scaled.diag", "slingshot" or "mnn"
    +  end.clus = ENDS, # You can also define the ENDS!
    +  start.clus = "34"
     )) # define where to START the trajectories
    @@ -483,9 +488,9 @@

    lineages@reducedDim <- obj@reductions$umap@cell.embeddings
     {
    -    plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16)
    -    lines(lineages, lwd = 1, col = "black", cex = 2)
    -    text(centroids2d, labels = rownames(centroids2d), cex = 0.8, font = 2, col = "white")
    +  plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16)
    +  lines(lineages, lwd = 1, col = "black", cex = 2)
    +  text(centroids2d, labels = rownames(centroids2d), cex = 0.8, font = 2, col = "white")
     }

    @@ -508,11 +513,11 @@

    # Define curves
     curves <- as.SlingshotDataSet(getCurves(
    -    data          = lineages,
    -    thresh        = 1e-1,
    -    stretch       = 1e-1,
    -    allow.breaks  = F,
    -    approx_points = 100
    +  data          = lineages,
    +  thresh        = 1e-1,
    +  stretch       = 1e-1,
    +  allow.breaks  = F,
    +  approx_points = 100
     ))
     
     curves
    @@ -542,9 +547,9 @@

    # Plots
     {
    -    plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], pch = 16)
    -    lines(curves, lwd = 2, col = "black")
    -    text(centroids2d, labels = rownames(centroids2d), cex = 1, font = 2)
    +  plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], pch = 16)
    +  lines(curves, lwd = 2, col = "black")
    +  text(centroids2d, labels = rownames(centroids2d), cex = 1, font = 2)
     }
    @@ -564,12 +569,12 @@

    o <- order(x)
     {
    -    plot(obj@reductions$umap@cell.embeddings[o, ],
    -        main = paste0("pseudotime"), pch = 16, cex = 0.4, axes = F, xlab = "", ylab = "",
    -        col = colorRampPalette(c("grey70", "orange3", "firebrick", "purple4"))(99)[x[o] * 98 + 1]
    -    )
    -    points(centroids2d, cex = 2.5, pch = 16, col = "#FFFFFF99")
    -    text(centroids2d, labels = rownames(centroids2d), cex = 1, font = 2)
    +  plot(obj@reductions$umap@cell.embeddings[o, ],
    +    main = paste0("pseudotime"), pch = 16, cex = 0.4, axes = F, xlab = "", ylab = "",
    +    col = colorRampPalette(c("grey70", "orange3", "firebrick", "purple4"))(99)[x[o] * 98 + 1]
    +  )
    +  points(centroids2d, cex = 2.5, pch = 16, col = "#FFFFFF99")
    +  text(centroids2d, labels = rownames(centroids2d), cex = 1, font = 2)
     }

    @@ -624,8 +629,8 @@

    sel_cells <- split(colnames(obj@assays$RNA@data), obj$clusters_use)
     sel_cells <- unlist(lapply(sel_cells, function(x) {
    -    set.seed(1)
    -    return(sample(x, 20))
    +  set.seed(1)
    +  return(sample(x, 20))
     }))
     
     gv <- as.data.frame(na.omit(scran::modelGeneVar(obj@assays$RNA@data[, sel_cells])))
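
Note that `sample(x, 20)` raises an error when a cluster holds fewer than 20 cells. A hedged sketch of a guarded variant follows; it uses made-up cell names and cluster labels rather than the course object, and `n_per_cluster` and `sample_cells` are hypothetical names introduced only for this illustration:

```r
# Hypothetical cell barcodes and cluster labels (not the course data)
cells <- paste0("cell_", 1:50)
clusters <- rep(c("A", "B", "C"), times = c(30, 15, 5))

n_per_cluster <- 20

# Draw at most n cells from one cluster; small clusters are kept whole
sample_cells <- function(x, n) {
  set.seed(1) # same seed for every cluster, as in the lab code
  x[sample.int(length(x), size = min(n, length(x)))]
}

sel_cells <- unlist(lapply(split(cells, clusters), sample_cells, n = n_per_cluster))

# Number of sampled cells per cluster
print(table(clusters[match(sel_cells, cells)]))
```

Whether small clusters should be kept whole or dropped depends on the analysis; this sketch keeps them.
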
    @@ -648,11 +653,11 @@ 

    sceGAM <- fitGAM(
    -    counts = drop0(obj@assays$RNA@data[sel_genes, sel_cells]),
    -    pseudotime = pseudotime[sel_cells, ],
    -    cellWeights = cellWeights[sel_cells, ],
    -    nknots = 5, verbose = T, parallel = T, sce = TRUE,
    -    BPPARAM = BiocParallel::MulticoreParam()
    +  counts = drop0(obj@assays$RNA@data[sel_genes, sel_cells]),
    +  pseudotime = pseudotime[sel_cells, ],
    +  cellWeights = cellWeights[sel_cells, ],
    +  nknots = 5, verbose = T, parallel = T, sce = TRUE,
    +  BPPARAM = BiocParallel::MulticoreParam()
     )

    Download the precomputed file.

    @@ -691,15 +696,15 @@

    lc <- sapply(lineages@lineages, function(x) {
    -    rev(x)[1]
    +  rev(x)[1]
     })
     names(lc) <- gsub("Lineage", "L", names(lc))
     
     {
    -    plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], pch = 16)
    -    lines(curves, lwd = 2, col = "black")
    -    points(centroids2d[lc, ], col = "black", pch = 16, cex = 4)
    -    text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = "white")
    +  plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], pch = 16)
    +  lines(curves, lwd = 2, col = "black")
    +  points(centroids2d[lc, ], col = "black", pch = 16, cex = 4)
    +  text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = "white")
     }
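
The `lc` vector above picks the last cluster of each lineage, that is, its end point, and shortens the lineage names for plotting. A small sketch with a plain list standing in for `lineages@lineages`; the cluster names here are hypothetical:

```r
# Hypothetical lineages: each is an ordered vector of cluster names
lineage_list <- list(
  Lineage1 = c("34", "8", "12"),
  Lineage2 = c("34", "8", "5", "17")
)

# rev(x)[1] returns the last element, i.e. the terminal cluster of the lineage
lc <- sapply(lineage_list, function(x) rev(x)[1])

# Shorten "Lineage1" to "L1" so the label fits inside a plotted point
names(lc) <- gsub("Lineage", "L", names(lc))
print(lc)
```

Because `lc` holds cluster names, `centroids2d[lc, ]` selects rows by name, which assumes the centroid matrix has the cluster names as row names.
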
    @@ -731,23 +736,23 @@

    par(mfrow = c(4, 4), mar = c(.1, .1, 2, 1))
     {
    -    plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16, axes = F, xlab = "", ylab = "")
    -    lines(curves, lwd = 2, col = "black")
    -    points(centroids2d[lc, ], col = "black", pch = 15, cex = 3, xpd = T)
    -    text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = "white", xpd = T)
    +  plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16, axes = F, xlab = "", ylab = "")
    +  lines(curves, lwd = 2, col = "black")
    +  points(centroids2d[lc, ], col = "black", pch = 15, cex = 3, xpd = T)
    +  text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = "white", xpd = T)
     }
     
     vars <- rownames(res[1:15, ])
     vars <- na.omit(vars[vars != "NA"])
     
     for (i in vars) {
    -    x <- drop0(obj@assays$RNA@data)[i, ]
    -    x <- (x - min(x)) / (max(x) - min(x))
    -    o <- order(x)
    -    plot(obj@reductions$umap@cell.embeddings[o, ],
    -        main = paste0(i), pch = 16, cex = 0.5, axes = F, xlab = "", ylab = "",
    -        col = colorRampPalette(c("lightgray", "grey60", "navy"))(99)[x[o] * 98 + 1]
    -    )
    +  x <- drop0(obj@assays$RNA@data)[i, ]
    +  x <- (x - min(x)) / (max(x) - min(x))
    +  o <- order(x)
    +  plot(obj@reductions$umap@cell.embeddings[o, ],
    +    main = paste0(i), pch = 16, cex = 0.5, axes = F, xlab = "", ylab = "",
    +    col = colorRampPalette(c("lightgray", "grey60", "navy"))(99)[x[o] * 98 + 1]
    +  )
     }
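
The loop above rescales each gene to the range 0 to 1 and uses the result to index a 99-colour ramp. A minimal base R sketch of that mapping, with made-up expression values; the guard for constant vectors and the use of `round()` are additions for illustration, not part of the lab code:

```r
# Hypothetical expression values for one gene across six cells
x <- c(0, 0.5, 2, 4, 1, 3)

# Rescale to [0, 1]; a constant vector would divide by zero, so guard against it
rng <- range(x)
x_scaled <- if (diff(rng) > 0) (x - rng[1]) / diff(rng) else rep(0, length(x))

# 99 colours from light grey to navy, as in the lab code
pal99 <- colorRampPalette(c("lightgray", "grey60", "navy"))(99)

# Map [0, 1] onto indices 1..99; round() makes the otherwise implicit truncation explicit
idx <- round(x_scaled * 98) + 1
cols <- pal99[idx]

# Plot low values first so that high-expressing cells end up on top
o <- order(x_scaled)
print(data.frame(value = x[o], colour = cols[o]))
```

Ordering by expression before plotting is what keeps a few high-expressing cells from being hidden under many grey points.
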
    @@ -786,20 +791,20 @@

    par(mfrow = c(4, 4), mar = c(.1, .1, 2, 1))
     {
    -    plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16, axes = F, xlab = "", ylab = "")
    -    lines(curves, lwd = 2, col = "black")
    -    points(centroids2d[lc, ], col = "black", pch = 15, cex = 3, xpd = T)
    -    text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = "white", xpd = T)
    +  plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16, axes = F, xlab = "", ylab = "")
    +  lines(curves, lwd = 2, col = "black")
    +  points(centroids2d[lc, ], col = "black", pch = 15, cex = 3, xpd = T)
    +  text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = "white", xpd = T)
     }
     for (i in vars) {
    -    x <- drop0(obj@assays$RNA@data)[i, ]
    -    x <- (x - min(x)) / (max(x) - min(x))
    -    o <- order(x)
    -    plot(obj@reductions$umap@cell.embeddings[o, ],
    -        main = paste0(i), pch = 16, cex = 0.5, axes = F, xlab = "", ylab = "",
    -        col = colorRampPalette(c("lightgray", "grey60", "navy"))(99)[x[o] * 98 + 1]
    -    )
    +  x <- drop0(obj@assays$RNA@data)[i, ]
    +  x <- (x - min(x)) / (max(x) - min(x))
    +  o <- order(x)
    +  plot(obj@reductions$umap@cell.embeddings[o, ],
    +    main = paste0(i), pch = 16, cex = 0.5, axes = F, xlab = "", ylab = "",
    +    col = colorRampPalette(c("lightgray", "grey60", "navy"))(99)[x[o] * 98 + 1]
    +  )
     }

    @@ -843,20 +848,20 @@

    par(mfrow = c(4, 4), mar = c(.1, .1, 2, 1))
     {
    -    plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16, axes = F, xlab = "", ylab = "")
    -    lines(curves, lwd = 2, col = "black")
    -    points(centroids2d[lc, ], col = "black", pch = 15, cex = 3, xpd = T)
    -    text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = "white", xpd = T)
    +  plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16, axes = F, xlab = "", ylab = "")
    +  lines(curves, lwd = 2, col = "black")
    +  points(centroids2d[lc, ], col = "black", pch = 15, cex = 3, xpd = T)
    +  text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = "white", xpd = T)
     }
     for (i in vars) {
    -    x <- drop0(obj@assays$RNA@data)[i, ]
    -    x <- (x - min(x)) / (max(x) - min(x))
    -    o <- order(x)
    -    plot(obj@reductions$umap@cell.embeddings[o, ],
    -        main = paste0(i), pch = 16, cex = 0.5, axes = F, xlab = "", ylab = "",
    -        col = colorRampPalette(c("lightgray", "grey60", "navy"))(99)[x[o] * 98 + 1]
    -    )
    +  x <- drop0(obj@assays$RNA@data)[i, ]
    +  x <- (x - min(x)) / (max(x) - min(x))
    +  o <- order(x)
    +  plot(obj@reductions$umap@cell.embeddings[o, ],
    +    main = paste0(i), pch = 16, cex = 0.5, axes = F, xlab = "", ylab = "",
    +    col = colorRampPalette(c("lightgray", "grey60", "navy"))(99)[x[o] * 98 + 1]
    +  )
     }

    @@ -1237,7 +1242,7 @@

    Published with Quarto v1.3.450

    - +
    diff --git a/docs/labs/seurat/seurat_08_spatial.html b/docs/labs/seurat/seurat_08_spatial.html
    index 69efbcc5..f92c2ddf 100644
    --- a/docs/labs/seurat/seurat_08_spatial.html
    +++ b/docs/labs/seurat/seurat_08_spatial.html
    @@ -165,6 +165,11 @@ Info +
    @@ -393,8 +398,8 @@

    gc()

                used   (Mb) gc trigger   (Mb)  max used   (Mb)
    -Ncells   3360689  179.5    5248232  280.3   5248232  280.3
    -Vcells 189921766 1449.0  375078860 2861.7 357748466 2729.5
    +Ncells   3360672  179.5    5248490  280.3   5248490  280.3
    +Vcells 189921808 1449.0  375078910 2861.7 357748714 2729.5

    As you can see, the mitochondrial genes are among the top expressed genes, along with the lncRNA gene Bc1 (brain cytoplasmic RNA 1) and one hemoglobin gene.

    @@ -521,8 +526,8 @@

    gc()

                used   (Mb) gc trigger   (Mb)   max used   (Mb)
    -Ncells   3530604  188.6    5248232  280.3    5248232  280.3
    -Vcells 546165895 4167.0 1148293022 8760.8 1147468533 8754.5
    +Ncells   3530587  188.6    5248490  280.3    5248490  280.3
    +Vcells 546165937 4167.0 1148293145 8760.8 1147467666 8754.5

    Then we run dimensionality reduction and clustering as before.

    @@ -1257,7 +1262,7 @@

    Published with Quarto v1.3.450

    - +
    diff --git a/docs/other/containers.html b/docs/other/containers.html
    index 459e76f0..e25e60fa 100644
    --- a/docs/other/containers.html
    +++ b/docs/other/containers.html
    @@ -157,6 +157,11 @@ Info +
    @@ -878,7 +883,7 @@

    Published with Quarto v1.3.450

    - +
    diff --git a/docs/other/docker.html b/docs/other/docker.html
    index db6ad63c..c19bfca3 100644
    --- a/docs/other/docker.html
    +++ b/docs/other/docker.html
    @@ -156,6 +156,11 @@ Info +
    diff --git a/docs/other/faq.html b/docs/other/faq.html
    index 94610fc3..5bad3676 100644
    --- a/docs/other/faq.html
    +++ b/docs/other/faq.html
    @@ -157,6 +157,11 @@ Info +
    diff --git a/docs/other/uppmax.html b/docs/other/uppmax.html
    index 8a436a39..df75c40c 100644
    --- a/docs/other/uppmax.html
    +++ b/docs/other/uppmax.html
    @@ -157,6 +157,11 @@ Info +
    @@ -562,7 +567,7 @@

    Published with Quarto v1.3.450

    - + diff --git a/docs/search.json b/docs/search.json index 0ed0d6d2..90153000 100644 --- a/docs/search.json +++ b/docs/search.json @@ -46,7 +46,7 @@ "href": "labs/seurat/seurat_01_qc.html#meta-qc_collate", "title": " Quality Control", "section": "2 Collate", - "text": "2 Collate\nWe can now load the expression matrices and merge them into a single object. Each analysis workflow (Seurat, Scater, Scanpy, etc) has its own way of storing data. We will add dataset labels as cell.ids just in case you have overlapping barcodes between the datasets. After that we add a column Chemistry in the metadata for plotting later on.\n\nsdata.cov1 <- CreateSeuratObject(cov.1, project = \"covid_1\")\nsdata.cov15 <- CreateSeuratObject(cov.15, project = \"covid_15\")\nsdata.cov17 <- CreateSeuratObject(cov.17, project = \"covid_17\")\nsdata.cov16 <- CreateSeuratObject(cov.16, project = \"covid_16\")\nsdata.ctrl5 <- CreateSeuratObject(ctrl.5, project = \"ctrl_5\")\nsdata.ctrl13 <- CreateSeuratObject(ctrl.13, project = \"ctrl_13\")\nsdata.ctrl14 <- CreateSeuratObject(ctrl.14, project = \"ctrl_14\")\nsdata.ctrl19 <- CreateSeuratObject(ctrl.19, project = \"ctrl_19\")\n\n\n# add metadata\nsdata.cov1$type <- \"Covid\"\nsdata.cov15$type <- \"Covid\"\nsdata.cov16$type <- \"Covid\"\nsdata.cov17$type <- \"Covid\"\n\nsdata.ctrl5$type <- \"Ctrl\"\nsdata.ctrl13$type <- \"Ctrl\"\nsdata.ctrl14$type <- \"Ctrl\"\nsdata.ctrl19$type <- \"Ctrl\"\n\n# Merge datasets into one single seurat object\nalldata <- merge(sdata.cov1, c(sdata.cov15, sdata.cov16, sdata.cov17, sdata.ctrl5, sdata.ctrl13, sdata.ctrl14, sdata.ctrl19), add.cell.ids = c(\"covid_1\", \"covid_15\", \"covid_16\", \"covid_17\", \"ctrl_5\", \"ctrl_13\", \"ctrl_14\", \"ctrl_19\"))\n\nOnce you have created the merged Seurat object, the count matrices and individual count matrices and objects are not needed anymore. 
It is a good idea to remove them and run garbage collect to free up some memory.\n\n# remove all objects that will not be used.\nrm(cov.1, cov.15, cov.16, cov.17, ctrl.5, ctrl.13, ctrl.14, ctrl.19, sdata.cov1, sdata.cov15, sdata.cov16, sdata.cov17, sdata.ctrl5, sdata.ctrl13, sdata.ctrl14, sdata.ctrl19)\n# run garbage collect to free up memory\ngc()\n\n used (Mb) gc trigger (Mb) max used (Mb)\nNcells 3325459 177.6 4998412 267 4998412 267.0\nVcells 58182395 443.9 150859912 1151 136166604 1038.9\n\n\nHere is how the count matrix and the metadata look like for every cell.\n\nas.data.frame(alldata@assays$RNA@counts[1:10, 1:2])\n\n\n\n \n\n\nhead(alldata@meta.data, 10)" + "text": "2 Collate\nWe can now merge them objects into a single object. Each analysis workflow (Seurat, Scater, Scanpy, etc) has its own way of storing data. We will add dataset labels as cell.ids just in case you have overlapping barcodes between the datasets. After that we add a column type in the metadata to define covid and ctrl samples.\nBut first, we need to create Seurat objects using each of the expression matrices we loaded. 
We define each sample in the project slot, so in each object, the sample id can be found in the metadata slot orig.ident.\n\nsdata.cov1 <- CreateSeuratObject(cov.1, project = \"covid_1\")\nsdata.cov15 <- CreateSeuratObject(cov.15, project = \"covid_15\")\nsdata.cov17 <- CreateSeuratObject(cov.17, project = \"covid_17\")\nsdata.cov16 <- CreateSeuratObject(cov.16, project = \"covid_16\")\nsdata.ctrl5 <- CreateSeuratObject(ctrl.5, project = \"ctrl_5\")\nsdata.ctrl13 <- CreateSeuratObject(ctrl.13, project = \"ctrl_13\")\nsdata.ctrl14 <- CreateSeuratObject(ctrl.14, project = \"ctrl_14\")\nsdata.ctrl19 <- CreateSeuratObject(ctrl.19, project = \"ctrl_19\")\n\n\n# add metadata\nsdata.cov1$type <- \"Covid\"\nsdata.cov15$type <- \"Covid\"\nsdata.cov16$type <- \"Covid\"\nsdata.cov17$type <- \"Covid\"\n\nsdata.ctrl5$type <- \"Ctrl\"\nsdata.ctrl13$type <- \"Ctrl\"\nsdata.ctrl14$type <- \"Ctrl\"\nsdata.ctrl19$type <- \"Ctrl\"\n\n# Merge datasets into one single seurat object\nalldata <- merge(sdata.cov1, c(sdata.cov15, sdata.cov16, sdata.cov17, sdata.ctrl5, sdata.ctrl13, sdata.ctrl14, sdata.ctrl19), add.cell.ids = c(\"covid_1\", \"covid_15\", \"covid_16\", \"covid_17\", \"ctrl_5\", \"ctrl_13\", \"ctrl_14\", \"ctrl_19\"))\n\nOnce you have created the merged Seurat object, the count matrices and individual count matrices and objects are not needed anymore. 
It is a good idea to remove them and run garbage collect to free up some memory.\n\n# remove all objects that will not be used.\nrm(cov.1, cov.15, cov.16, cov.17, ctrl.5, ctrl.13, ctrl.14, ctrl.19, sdata.cov1, sdata.cov15, sdata.cov16, sdata.cov17, sdata.ctrl5, sdata.ctrl13, sdata.ctrl14, sdata.ctrl19)\n# run garbage collect to free up memory\ngc()\n\n used (Mb) gc trigger (Mb) max used (Mb)\nNcells 3325437 177.6 4998403 267 4998403 267.0\nVcells 58182452 443.9 150860051 1151 136166661 1038.9\n\n\nHere is how the count matrix and the metadata look like for every cell.\n\nas.data.frame(alldata@assays$RNA@counts[1:10, 1:2])\n\n\n\n \n\n\nhead(alldata@meta.data, 10)" }, { "objectID": "labs/seurat/seurat_01_qc.html#meta-qc_calqc", @@ -60,21 +60,21 @@ "href": "labs/seurat/seurat_01_qc.html#meta-qc_plotqc", "title": " Quality Control", "section": "4 Plot QC", - "text": "4 Plot QC\nNow we can plot some of the QC variables as violin plots.\n\nfeats <- c(\"nFeature_RNA\", \"nCount_RNA\", \"percent_mito\", \"percent_ribo\", \"percent_hb\", \"percent_plat\")\nVlnPlot(alldata, group.by = \"orig.ident\", features = feats, pt.size = 0.1, ncol = 3) + NoLegend()\n\n\n\n\n\n\n\n\nAs you can see, there is quite some difference in quality for the 4 datasets, with for instance the covid_15 sample having fewer cells with many detected genes and more mitochondrial content. As the ribosomal proteins are highly expressed they will make up a larger proportion of the transcriptional landscape when fewer of the lowly expressed genes are detected. And we can plot the different QC-measures as scatter plots.\n\nFeatureScatter(alldata, \"nCount_RNA\", \"nFeature_RNA\", group.by = \"orig.ident\", pt.size = .5)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nPlot additional QC stats that we have calculated as scatter plots. How are the different measures correlated? Can you explain why?" 
+ "text": "4 Plot QC\nNow we can plot some of the QC variables as violin plots.\n\nfeats <- c(\"nFeature_RNA\", \"nCount_RNA\", \"percent_mito\", \"percent_ribo\", \"percent_hb\", \"percent_plat\")\nVlnPlot(alldata, group.by = \"orig.ident\", features = feats, pt.size = 0.1, ncol = 3) + NoLegend()\n\n\n\n\n\n\n\n\nAs you can see, there is quite some difference in quality for the 4 datasets, with for instance the covid_15 and covid_16 samples having fewer cells with many detected genes and more mitochondrial content. As the ribosomal proteins are highly expressed they will make up a larger proportion of the transcriptional landscape when fewer of the lowly expressed genes are detected. We can also plot the different QC-measures as scatter plots.\n\nFeatureScatter(alldata, \"nCount_RNA\", \"nFeature_RNA\", group.by = \"orig.ident\", pt.size = .5)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nPlot additional QC stats that we have calculated as scatter plots. How are the different measures correlated? Can you explain why?" }, { "objectID": "labs/seurat/seurat_01_qc.html#meta-qc_filter", "href": "labs/seurat/seurat_01_qc.html#meta-qc_filter", "title": " Quality Control", "section": "5 Filtering", - "text": "5 Filtering\n\n5.1 Detection-based filtering\nA standard approach is to filter cells with low amount of reads as well as genes that are present in at least a certain amount of cells. Here we will only consider cells with at least 200 detected genes and genes need to be expressed in at least 3 cells. 
Please note that those values are highly dependent on the library preparation method used.\n\nselected_c <- WhichCells(alldata, expression = nFeature_RNA > 200)\nselected_f <- rownames(alldata)[Matrix::rowSums(alldata) > 3]\n\ndata.filt <- subset(alldata, features = selected_f, cells = selected_c)\ndim(data.filt)\n\n[1] 18877 10656\n\ntable(data.filt$orig.ident)\n\n\n covid_1 covid_15 covid_16 covid_17 ctrl_13 ctrl_14 ctrl_19 ctrl_5 \n 1254 1283 1127 1371 1417 1399 1434 1371 \n\n\nExtremely high number of detected genes could indicate doublets. However, depending on the cell type composition in your sample, you may have cells with higher number of genes (and also higher counts) from one cell type. In this case, we will run doublet prediction further down, so we will skip this step now, but the code below is an example of how it can be run:\n\n# skip and run DoubletFinder instead\n# data.filt <- subset(data.filt, cells=WhichCells(data.filt, expression = nFeature_RNA < 4100))\n\nAdditionally, we can also see which genes contribute the most to such reads. We can for instance plot the percentage of counts per gene.\n\n# Compute the proportion of counts of each gene per cell\n# Use sparse matrix operations, if your dataset is large, doing matrix devisions the regular way will take a very long time.\n\nC <- data.filt@assays$RNA@counts\nC@x <- C@x / rep.int(colSums(C), diff(C@p)) * 100\nmost_expressed <- order(Matrix::rowSums(C), decreasing = T)[20:1]\nboxplot(as.matrix(t(C[most_expressed, ])),\n cex = 0.1, las = 1, xlab = \"Percent counts per cell\",\n col = (scales::hue_pal())(20)[20:1], horizontal = TRUE\n)\n\n\n\n\n\n\n\n\nAs you can see, MALAT1 constitutes up to 30% of the UMIs from a single cell and the other top genes are mitochondrial and ribosomal genes. It is quite common that nuclear lincRNAs have correlation with quality and mitochondrial reads, so high detection of MALAT1 may be a technical issue. 
Let us assemble some information about such genes, which are important for quality control and downstream filtering.\n\n\n5.2 Mito/Ribo filtering\nWe also have quite a lot of cells with high proportion of mitochondrial and low proportion of ribosomal reads. It could be wise to remove those cells, if we have enough cells left after filtering. Another option would be to either remove all mitochondrial reads from the dataset and hope that the remaining genes still have enough biological signal. A third option would be to just regress out the percent_mito variable during scaling. In this case we had as much as 99.7% mitochondrial reads in some of the cells, so it is quite unlikely that there is much cell type signature left in those. Looking at the plots, make reasonable decisions on where to draw the cutoff. In this case, the bulk of the cells are below 20% mitochondrial reads and that will be used as a cutoff. We will also remove cells with less than 5% ribosomal reads.\n\nselected_mito <- WhichCells(data.filt, expression = percent_mito < 20)\nselected_ribo <- WhichCells(data.filt, expression = percent_ribo > 5)\n\n# and subset the object to only keep those cells\ndata.filt <- subset(data.filt, cells = selected_mito)\ndata.filt <- subset(data.filt, cells = selected_ribo)\ndim(data.filt)\n\n[1] 18877 7431\n\ntable(data.filt$orig.ident)\n\n\n covid_1 covid_15 covid_16 covid_17 ctrl_13 ctrl_14 ctrl_19 ctrl_5 \n 900 599 373 1101 1173 1063 1170 1052 \n\n\nAs you can see, a large proportion of sample covid_15 is filtered out. Also, there is still quite a lot of variation in percent_mito, so it will have to be dealt with in the data analysis step. 
We can also notice that the percent_ribo are also highly variable, but that is expected since different cell types have different proportions of ribosomal content, according to their function.\n\n\n5.3 Plot filtered QC\nLets plot the same QC-stats another time.\n\nfeats <- c(\"nFeature_RNA\", \"nCount_RNA\", \"percent_mito\", \"percent_ribo\", \"percent_hb\")\nVlnPlot(data.filt, group.by = \"orig.ident\", features = feats, pt.size = 0.1, ncol = 3) + NoLegend()\n\n\n\n\n\n\n\n\n\n\n5.4 Filter genes\nAs the level of expression of mitochondrial and MALAT1 genes are judged as mainly technical, it can be wise to remove them from the dataset before any further analysis.\n\ndim(data.filt)\n\n[1] 18877 7431\n\n# Filter MALAT1\ndata.filt <- data.filt[!grepl(\"MALAT1\", rownames(data.filt)), ]\n\n# Filter Mitocondrial\ndata.filt <- data.filt[!grepl(\"^MT-\", rownames(data.filt)), ]\n\n# Filter Ribossomal gene (optional if that is a problem on your data)\n# data.filt <- data.filt[ ! grepl(\"^RP[SL]\", rownames(data.filt)), ]\n\n# Filter Hemoglobin gene (optional if that is a problem on your data)\ndata.filt <- data.filt[!grepl(\"^HB[^(P)]\", rownames(data.filt)), ]\n\ndim(data.filt)\n\n[1] 18851 7431" + "text": "5 Filtering\n\n5.1 Detection-based filtering\nA standard approach is to filter cells with low amount of reads as well as genes that are present in at least a certain amount of cells. Here we will only consider cells with at least 200 detected genes and genes need to be expressed in at least 3 cells. 
Please note that those values are highly dependent on the library preparation method used.\n\nselected_c <- WhichCells(alldata, expression = nFeature_RNA > 200)\nselected_f <- rownames(alldata)[Matrix::rowSums(alldata) > 3]\n\ndata.filt <- subset(alldata, features = selected_f, cells = selected_c)\ndim(data.filt)\n\n[1] 18877 10656\n\ntable(data.filt$orig.ident)\n\n\n covid_1 covid_15 covid_16 covid_17 ctrl_13 ctrl_14 ctrl_19 ctrl_5 \n 1254 1283 1127 1371 1417 1399 1434 1371 \n\n\nExtremely high number of detected genes could indicate doublets. However, depending on the cell type composition in your sample, you may have cells with higher number of genes (and also higher counts) from one cell type. In this case, we will run doublet prediction further down, so we will skip this step now, but the code below is an example of how it can be run:\n\n# skip and run DoubletFinder instead\n# data.filt <- subset(data.filt, cells=WhichCells(data.filt, expression = nFeature_RNA < 4100))\n\nAdditionally, we can also see which genes contribute the most to such reads. We can for instance plot the percentage of counts per gene.\n\n# Compute the proportion of counts of each gene per cell\n# Use sparse matrix operations, if your dataset is large, doing matrix devisions the regular way will take a very long time.\n\nC <- data.filt@assays$RNA@counts\nC@x <- C@x / rep.int(colSums(C), diff(C@p)) * 100\nmost_expressed <- order(Matrix::rowSums(C), decreasing = T)[20:1]\nboxplot(as.matrix(t(C[most_expressed, ])),\n cex = 0.1, las = 1, xlab = \"Percent counts per cell\",\n col = (scales::hue_pal())(20)[20:1], horizontal = TRUE\n)\n\n\n\n\n\n\n\n\nAs you can see, MALAT1 constitutes up to 30% of the UMIs from a single cell and the other top genes are mitochondrial and ribosomal genes. It is quite common that nuclear lincRNAs have correlation with quality and mitochondrial reads, so high detection of MALAT1 may be a technical issue. 
Let us assemble some information about such genes, which are important for quality control and downstream filtering.\n\n\n5.2 Mito/Ribo filtering\nWe also have quite a lot of cells with high proportion of mitochondrial and low proportion of ribosomal reads. It could be wise to remove those cells, if we have enough cells left after filtering. Another option would be to either remove all mitochondrial reads from the dataset and hope that the remaining genes still have enough biological signal. A third option would be to just regress out the percent_mito variable during scaling. In this case we had as much as 99.7% mitochondrial reads in some of the cells, so it is quite unlikely that there is much cell type signature left in those. Looking at the plots, make reasonable decisions on where to draw the cutoff. In this case, the bulk of the cells are below 20% mitochondrial reads and that will be used as a cutoff. We will also remove cells with less than 5% ribosomal reads.\n\nselected_mito <- WhichCells(data.filt, expression = percent_mito < 20)\nselected_ribo <- WhichCells(data.filt, expression = percent_ribo > 5)\n\n# and subset the object to only keep those cells\ndata.filt <- subset(data.filt, cells = selected_mito)\ndata.filt <- subset(data.filt, cells = selected_ribo)\ndim(data.filt)\n\n[1] 18877 7431\n\ntable(data.filt$orig.ident)\n\n\n covid_1 covid_15 covid_16 covid_17 ctrl_13 ctrl_14 ctrl_19 ctrl_5 \n 900 599 373 1101 1173 1063 1170 1052 \n\n\nAs you can see, a large proportion of sample covid_15 is filtered out. Also, there is still quite a lot of variation in percent_mito, so it will have to be dealt with in the data analysis step. 
We can also notice that the percent_ribo are also highly variable, but that is expected since different cell types have different proportions of ribosomal content, according to their function.\n\n\n5.3 Plot filtered QC\nLets plot the same QC-stats another time.\n\nfeats <- c(\"nFeature_RNA\", \"nCount_RNA\", \"percent_mito\", \"percent_ribo\", \"percent_hb\")\nVlnPlot(data.filt, group.by = \"orig.ident\", features = feats, pt.size = 0.1, ncol = 3) + NoLegend()\n\n\n\n\n\n\n\n\n\n\n5.4 Filter genes\nAs the level of expression of mitochondrial and MALAT1 genes are judged as mainly technical, it can be wise to remove them from the dataset before any further analysis. In this case we will also remove the HB genes.\n\ndim(data.filt)\n\n[1] 18877 7431\n\n# Filter MALAT1\ndata.filt <- data.filt[!grepl(\"MALAT1\", rownames(data.filt)), ]\n\n# Filter Mitocondrial\ndata.filt <- data.filt[!grepl(\"^MT-\", rownames(data.filt)), ]\n\n# Filter Ribossomal gene (optional if that is a problem on your data)\n# data.filt <- data.filt[ ! grepl(\"^RP[SL]\", rownames(data.filt)), ]\n\n# Filter Hemoglobin gene (optional if that is a problem on your data)\ndata.filt <- data.filt[!grepl(\"^HB[^(PES)]\", rownames(data.filt)), ]\n\ndim(data.filt)\n\n[1] 18854 7431" }, { "objectID": "labs/seurat/seurat_01_qc.html#meta-qc_sex", "href": "labs/seurat/seurat_01_qc.html#meta-qc_sex", "title": " Quality Control", "section": "6 Sample sex", - "text": "6 Sample sex\nWhen working with human or animal samples, you should ideally constrain you experiments to a single sex to avoid including sex bias in the conclusions. However this may not always be possible. By looking at reads from chromosomeY (males) and XIST (X-inactive specific transcript) expression (mainly female) it is quite easy to determine per sample which sex it is. 
It can also bee a good way to detect if there has been any sample mixups, if the sample metadata sex does not agree with the computational predictions.\nTo get chromosome information for all genes, you should ideally parse the information from the gtf file that you used in the mapping pipeline as it has the exact same annotation version/gene naming. However, it may not always be available, as in this case where we have downloaded public data. R package biomaRt can be used to fetch annotation information. The code to run biomaRt is provided. As the biomart instances quite often are unresponsive, we will download and use a file that was created in advance.\n\n# this code chunk is not executed\nsuppressMessages(library(biomaRt))\n\n# initialize connection to mart, may take some time if the sites are unresponsive.\nmart <- useMart(\"ENSEMBL_MART_ENSEMBL\", dataset = \"hsapiens_gene_ensembl\")\n\n# fetch chromosome info plus some other annotations\ngenes_table <- try(biomaRt::getBM(attributes = c(\n \"ensembl_gene_id\", \"external_gene_name\",\n \"description\", \"gene_biotype\", \"chromosome_name\", \"start_position\"\n), mart = mart, useCache = F))\n\nwrite.csv(genes_table, file = \"data/results/genes_table.csv\")\n\n\ngenes_file <- file.path(path_results, \"genes_table.csv\")\n\nif (!file.exists(genes_file)) download.file(file.path(path_data, \"covid/results/genes_table.csv\"), destfile = genes_file)\ngenes.table <- read.csv(genes_file)\n\ngenes.table <- genes.table[genes.table$external_gene_name %in% rownames(data.filt), ]\n\nNow that we have the chromosome information, we can calculate per cell the proportion of reads that comes from chromosome Y.\n\nchrY.gene <- genes.table$external_gene_name[genes.table$chromosome_name == \"Y\"]\ndata.filt$pct_chrY <- colSums(data.filt@assays$RNA@counts[chrY.gene, ]) / colSums(data.filt@assays$RNA@counts)\n\nThen plot XIST expression vs chrY proportion. 
As you can see, the samples are clearly on either side, even if some cells do not have detection of either.\n\nFeatureScatter(data.filt, feature1 = \"XIST\", feature2 = \"pct_chrY\")\n\n\n\n\n\n\n\n\nPlot as violins.\n\nVlnPlot(data.filt, features = c(\"XIST\", \"pct_chrY\"))\n\n\n\n\n\n\n\n\nHere, we can see clearly that we have two males and 4 females, can you see which samples they are? Do you think this will cause any problems for downstream analysis? Discuss with your group: what would be the best way to deal with this type of sex bias?" + "text": "6 Sample sex\nWhen working with human or animal samples, you should ideally constrain you experiments to a single sex to avoid including sex bias in the conclusions. However this may not always be possible. By looking at reads from chromosomeY (males) and XIST (X-inactive specific transcript) expression (mainly female) it is quite easy to determine per sample which sex it is. It can also bee a good way to detect if there has been any sample mixups, if the sample metadata sex does not agree with the computational predictions.\nTo get chromosome information for all genes, you should ideally parse the information from the gtf file that you used in the mapping pipeline as it has the exact same annotation version/gene naming. However, it may not always be available, as in this case where we have downloaded public data. R package biomaRt can be used to fetch annotation information. The code to run biomaRt is provided. 
As the biomart instances quite often are unresponsive, we will download and use a file that was created in advance.\n\n# this code chunk is not executed\nsuppressMessages(library(biomaRt))\n\n# initialize connection to mart, may take some time if the sites are unresponsive.\nmart <- useMart(\"ENSEMBL_MART_ENSEMBL\", dataset = \"hsapiens_gene_ensembl\")\n\n# fetch chromosome info plus some other annotations\ngenes_table <- try(biomaRt::getBM(attributes = c(\n \"ensembl_gene_id\", \"external_gene_name\",\n \"description\", \"gene_biotype\", \"chromosome_name\", \"start_position\"\n), mart = mart, useCache = F))\n\nwrite.csv(genes_table, file = \"data/covid/results/genes_table.csv\")\n\n\ngenes_file <- file.path(path_results, \"genes_table.csv\")\n\nif (!file.exists(genes_file)) download.file(file.path(path_data, \"covid/results/genes_table.csv\"), destfile = genes_file)\ngenes.table <- read.csv(genes_file)\n\ngenes.table <- genes.table[genes.table$external_gene_name %in% rownames(data.filt), ]\n\nNow that we have the chromosome information, we can calculate per cell the proportion of reads that comes from chromosome Y.\n\nchrY.gene <- genes.table$external_gene_name[genes.table$chromosome_name == \"Y\"]\ndata.filt$pct_chrY <- colSums(data.filt@assays$RNA@counts[chrY.gene, ]) / colSums(data.filt@assays$RNA@counts)\n\nThen plot XIST expression vs chrY proportion. As you can see, the samples are clearly on either side, even if some cells do not have detection of either.\n\nFeatureScatter(data.filt, feature1 = \"XIST\", feature2 = \"pct_chrY\")\n\n\n\n\n\n\n\n\nPlot as violins.\n\nVlnPlot(data.filt, features = c(\"XIST\", \"pct_chrY\"))\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nHere, we can see clearly that we have three males and five females, can you see which samples they are? Do you think this will cause any problems for downstream analysis? Discuss with your group: what would be the best way to deal with this type of sex bias?" 
}, { "objectID": "labs/seurat/seurat_01_qc.html#meta-qc_cellcycle", @@ -88,7 +88,7 @@ "href": "labs/seurat/seurat_01_qc.html#meta-qc_doublet", "title": " Quality Control", "section": "8 Predict doublets", - "text": "8 Predict doublets\nDoublets/Multiples of cells in the same well/droplet is a common issue in scRNAseq protocols. Especially in droplet-based methods with overloading of cells. In a typical 10x experiment the proportion of doublets is linearly dependent on the amount of loaded cells. As indicated from the Chromium user guide, doublet rates are about as follows:\n\nMost doublet detectors simulates doublets by merging cell counts and predicts doublets as cells that have similar embeddings as the simulated doublets. Most such packages need an assumption about the number/proportion of expected doublets in the dataset. The data you are using is subsampled, but the original datasets contained about 5 000 cells per sample, hence we can assume that they loaded about 9 000 cells and should have a doublet rate at about 4%.\n\n\n\n\n\n\nCaution\n\n\n\nIdeally doublet prediction should be run on each sample separately, especially if your different samples have different proportions of cell types. In this case, the data is subsampled so we have very few cells per sample and all samples are sorted PBMCs so it is okay to run them together.\n\n\nHere, we will use DoubletFinder to predict doublet cells. But before doing doublet detection we need to run scaling, variable gene selection and pca, as well as UMAP for visualization. These steps will be explored in more detail in coming exercises.\n\ndata.filt <- FindVariableFeatures(data.filt, verbose = F)\ndata.filt <- ScaleData(data.filt, vars.to.regress = c(\"nFeature_RNA\", \"percent_mito\"), verbose = F)\ndata.filt <- RunPCA(data.filt, verbose = F, npcs = 20)\ndata.filt <- RunUMAP(data.filt, dims = 1:10, verbose = F)\n\nThen we run doubletFinder, selecting first 10 PCs and a pK value of 0.9. 
To optimize the parameters, you can run the paramSweep function in the package.\n\nsuppressMessages(library(DoubletFinder))\n# Can run parameter optimization with paramSweep\n\n# sweep.res <- paramSweep_v3(data.filt)\n# sweep.stats <- summarizeSweep(sweep.res, GT = FALSE)\n# bcmvn <- find.pK(sweep.stats)\n# barplot(bcmvn$BCmetric, names.arg = bcmvn$pK, las=2)\n\n# define the expected number of doublet cellscells.\nnExp <- round(ncol(data.filt) * 0.04) # expect 4% doublets\ndata.filt <- doubletFinder_v3(data.filt, pN = 0.25, pK = 0.09, nExp = nExp, PCs = 1:10)\n\n[1] \"Creating 2477 artificial doublets...\"\n[1] \"Creating Seurat object...\"\n[1] \"Normalizing Seurat object...\"\n[1] \"Finding variable genes...\"\n[1] \"Scaling data...\"\n[1] \"Running PCA...\"\n[1] \"Calculating PC distance matrix...\"\n[1] \"Computing pANN...\"\n[1] \"Classifying doublets..\"\n\n\n\n# name of the DF prediction can change, so extract the correct column name.\nDF.name <- colnames(data.filt@meta.data)[grepl(\"DF.classification\", colnames(data.filt@meta.data))]\n\nwrap_plots(\n DimPlot(data.filt, group.by = \"orig.ident\") + NoAxes(),\n DimPlot(data.filt, group.by = DF.name) + NoAxes(),\n ncol = 2\n)\n\n\n\n\n\n\n\n\nWe should expect that two cells have more detected genes than a single cell, lets check if our predicted doublets also have more detected genes in general.\n\nVlnPlot(data.filt, features = \"nFeature_RNA\", group.by = DF.name, pt.size = .1)\n\n\n\n\n\n\n\n\nNow, lets remove all predicted doublets from our data.\n\ndata.filt <- data.filt[, data.filt@meta.data[, DF.name] == \"Singlet\"]\ndim(data.filt)\n\n[1] 18851 7134" + "text": "8 Predict doublets\nDoublets/Multiples of cells in the same well/droplet is a common issue in scRNAseq protocols. Especially in droplet-based methods with overloading of cells. In a typical 10x experiment the proportion of doublets is linearly dependent on the amount of loaded cells. 
As indicated from the Chromium user guide, doublet rates are about as follows:\n\nMost doublet detectors simulate doublets by merging cell counts and predict doublets as cells that have similar embeddings as the simulated doublets. Most such packages need an assumption about the number/proportion of expected doublets in the dataset. The data you are using is subsampled, but the original datasets contained about 5 000 cells per sample, hence we can assume that they loaded about 9 000 cells and should have a doublet rate of about 4%.\n\n\n\n\n\n\nCaution\n\n\n\nIdeally doublet prediction should be run on each sample separately, especially if your different samples have different proportions of cell types. In this case, the data is subsampled so we have very few cells per sample and all samples are sorted PBMCs so it is okay to run them together.\n\n\nHere, we will use DoubletFinder to predict doublet cells. But before doing doublet detection we need to run scaling, variable gene selection and pca, as well as UMAP for visualization. These steps will be explored in more detail in coming exercises.\n\ndata.filt <- FindVariableFeatures(data.filt, verbose = F)\ndata.filt <- ScaleData(data.filt, vars.to.regress = c(\"nFeature_RNA\", \"percent_mito\"), verbose = F)\ndata.filt <- RunPCA(data.filt, verbose = F, npcs = 20)\ndata.filt <- RunUMAP(data.filt, dims = 1:10, verbose = F)\n\nThen we run doubletFinder, selecting the first 10 PCs and a pK value of 0.09. 
To optimize the parameters, you can run the paramSweep function in the package.\n\nsuppressMessages(library(DoubletFinder))\n# Can run parameter optimization with paramSweep\n\n# sweep.res <- paramSweep_v3(data.filt)\n# sweep.stats <- summarizeSweep(sweep.res, GT = FALSE)\n# bcmvn <- find.pK(sweep.stats)\n# barplot(bcmvn$BCmetric, names.arg = bcmvn$pK, las=2)\n\n# define the expected number of doublet cellscells.\nnExp <- round(ncol(data.filt) * 0.04) # expect 4% doublets\ndata.filt <- doubletFinder_v3(data.filt, pN = 0.25, pK = 0.09, nExp = nExp, PCs = 1:10)\n\n[1] \"Creating 2477 artificial doublets...\"\n[1] \"Creating Seurat object...\"\n[1] \"Normalizing Seurat object...\"\n[1] \"Finding variable genes...\"\n[1] \"Scaling data...\"\n[1] \"Running PCA...\"\n[1] \"Calculating PC distance matrix...\"\n[1] \"Computing pANN...\"\n[1] \"Classifying doublets..\"\n\n\n\n# name of the DF prediction can change, so extract the correct column name.\nDF.name <- colnames(data.filt@meta.data)[grepl(\"DF.classification\", colnames(data.filt@meta.data))]\n\nwrap_plots(\n DimPlot(data.filt, group.by = \"orig.ident\") + NoAxes(),\n DimPlot(data.filt, group.by = DF.name) + NoAxes(),\n ncol = 2\n)\n\n\n\n\n\n\n\n\nWe should expect that two cells have more detected genes than a single cell, lets check if our predicted doublets also have more detected genes in general.\n\nVlnPlot(data.filt, features = \"nFeature_RNA\", group.by = DF.name, pt.size = .1)\n\n\n\n\n\n\n\n\nNow, lets remove all predicted doublets from our data.\n\ndata.filt <- data.filt[, data.filt@meta.data[, DF.name] == \"Singlet\"]\ndim(data.filt)\n\n[1] 18854 7134" }, { "objectID": "labs/seurat/seurat_01_qc.html#meta-qc_save", @@ -200,7 +200,7 @@ "href": "labs/seurat/seurat_03_integration.html#cca", "title": " Data Integration", "section": "2 CCA", - "text": "2 CCA\nWe identify anchors using the FindIntegrationAnchors function, which takes a list of Seurat objects as input.\n\nalldata.anchors <- 
FindIntegrationAnchors(object.list = alldata.list, dims = 1:30,reduction = \"cca\", anchor.features = hvgs_all)\n\nWe then pass these anchors to the IntegrateData function, which returns a Seurat object.\n\nalldata.int <- IntegrateData(anchorset = alldata.anchors, dims = 1:30, new.assay.name = \"CCA\")\n\nWe can observe that a new assay slot is now created under the name CCA. If you do not specify the assay name the default will be integrated.\n\nnames(alldata.int@assays)\n\n[1] \"RNA\" \"CCA\"\n\n# by default, Seurat now sets the integrated assay as the default assay, so any operation you now perform will be on the integrated data.\nalldata.int@active.assay\n\n[1] \"CCA\"\n\n\nAfter running IntegrateData, the Seurat object will contain a new Assay with the integrated (or batch-corrected) expression matrix. Note that the original (uncorrected values) are still stored in the object in the “RNA” assay, so you can switch back and forth. We can then use this new integrated matrix for downstream analysis and visualization. Here we scale the integrated data, run PCA, and visualize the results with UMAP and TSNE. The integrated datasets cluster by cell type, instead of by technology.\nAs CCA is the active.assay now the functions will by default run on the data in that assay. 
But you could also specify in each of the functions to run them in a specific assay with the parameter assay = \"CCA\".\n\n#Run Dimensionality reduction on integrated space\nalldata.int <- ScaleData(alldata.int, verbose = FALSE)\nalldata.int <- RunPCA(alldata.int, npcs = 30, verbose = FALSE)\nalldata.int <- RunUMAP(alldata.int, dims = 1:30)\nalldata.int <- RunTSNE(alldata.int, dims = 1:30)\n\nWe can now plot the unintegrated and the integrated space reduced dimensions.\n\nwrap_plots(\n DimPlot(alldata, reduction = \"pca\", group.by = \"orig.ident\")+NoAxes()+ggtitle(\"PCA raw_data\"),\n DimPlot(alldata, reduction = \"tsne\", group.by = \"orig.ident\")+NoAxes()+ggtitle(\"tSNE raw_data\"),\n DimPlot(alldata, reduction = \"umap\", group.by = \"orig.ident\")+NoAxes()+ggtitle(\"UMAP raw_data\"),\n \n DimPlot(alldata.int, reduction = \"pca\", group.by = \"orig.ident\")+NoAxes()+ggtitle(\"PCA integrated\"),\n DimPlot(alldata.int, reduction = \"tsne\", group.by = \"orig.ident\")+NoAxes()+ggtitle(\"tSNE integrated\"),\n DimPlot(alldata.int, reduction = \"umap\", group.by = \"orig.ident\")+NoAxes()+ggtitle(\"UMAP integrated\"),\n ncol = 3\n) + plot_layout(guides = \"collect\")\n\n\n\n\n\n\n\n\n\n2.1 Clean memory\nAgain we have a lot of large objects in the memory. We have the original data alldata but also the integrated data in alldata.int. We also have the split objects in alldata.list and the anchors in alldata.anchors. 
In principle we only need the integrated object for now, but we will also keep the list for running Scanorama further down in the tutorial.\nWe also want to keep is the orignial umap for visualization purposes, so we copy it over to the integrated object.\n\nalldata.int@reductions$umap_raw = alldata@reductions$umap\n\n# remove all objects that will not be used.\nrm(alldata, alldata.anchors)\n# run garbage collect to free up memory\ngc()\n\n used (Mb) gc trigger (Mb) max used (Mb)\nNcells 3414313 182.4 4989403 266.5 4989403 266.5\nVcells 203222859 1550.5 564336378 4305.6 879335547 6708.8\n\n\nLet’s plot some marker genes for different cell types onto the embedding.\n\n\n\nMarkers\nCell Type\n\n\n\n\nCD3E\nT cells\n\n\nCD3E CD4\nCD4+ T cells\n\n\nCD3E CD8A\nCD8+ T cells\n\n\nGNLY, NKG7\nNK cells\n\n\nMS4A1\nB cells\n\n\nCD14, LYZ, CST3, MS4A7\nCD14+ Monocytes\n\n\nFCGR3A, LYZ, CST3, MS4A7\nFCGR3A+ Monocytes\n\n\nFCER1A, CST3\nDCs\n\n\n\n\nmyfeatures <- c(\"CD3E\", \"CD4\", \"CD8A\", \"NKG7\", \"GNLY\", \"MS4A1\", \"CD14\", \"LYZ\", \"MS4A7\", \"FCGR3A\", \"CST3\", \"FCER1A\")\nFeaturePlot(alldata.int, reduction = \"umap\", dims = 1:2, features = myfeatures, ncol = 4, order = T) + NoLegend() + NoAxes() + NoGrid()" + "text": "2 CCA\nWe identify anchors using the FindIntegrationAnchors function, which takes a list of Seurat objects as input.\n\nalldata.anchors <- FindIntegrationAnchors(object.list = alldata.list, dims = 1:30,reduction = \"cca\", anchor.features = hvgs_all)\n\nWe then pass these anchors to the IntegrateData function, which returns a Seurat object.\n\nalldata.int <- IntegrateData(anchorset = alldata.anchors, dims = 1:30, new.assay.name = \"CCA\")\n\nWe can observe that a new assay slot is now created under the name CCA. 
If you do not specify the assay name the default will be integrated.\n\nnames(alldata.int@assays)\n\n[1] \"RNA\" \"CCA\"\n\n# by default, Seurat now sets the integrated assay as the default assay, so any operation you now perform will be on the integrated data.\nalldata.int@active.assay\n\n[1] \"CCA\"\n\n\nAfter running IntegrateData, the Seurat object will contain a new Assay with the integrated (or batch-corrected) expression matrix. Note that the original (uncorrected values) are still stored in the object in the “RNA” assay, so you can switch back and forth. We can then use this new integrated matrix for downstream analysis and visualization. Here we scale the integrated data, run PCA, and visualize the results with UMAP and TSNE. The integrated datasets cluster by cell type, instead of by technology.\nAs CCA is the active.assay now the functions will by default run on the data in that assay. But you could also specify in each of the functions to run them in a specific assay with the parameter assay = \"CCA\".\n\n#Run Dimensionality reduction on integrated space\nalldata.int <- ScaleData(alldata.int, verbose = FALSE)\nalldata.int <- RunPCA(alldata.int, npcs = 30, verbose = FALSE)\nalldata.int <- RunUMAP(alldata.int, dims = 1:30)\nalldata.int <- RunTSNE(alldata.int, dims = 1:30)\n\nWe can now plot the unintegrated and the integrated space reduced dimensions.\n\nwrap_plots(\n DimPlot(alldata, reduction = \"pca\", group.by = \"orig.ident\")+NoAxes()+ggtitle(\"PCA raw_data\"),\n DimPlot(alldata, reduction = \"tsne\", group.by = \"orig.ident\")+NoAxes()+ggtitle(\"tSNE raw_data\"),\n DimPlot(alldata, reduction = \"umap\", group.by = \"orig.ident\")+NoAxes()+ggtitle(\"UMAP raw_data\"),\n \n DimPlot(alldata.int, reduction = \"pca\", group.by = \"orig.ident\")+NoAxes()+ggtitle(\"PCA integrated\"),\n DimPlot(alldata.int, reduction = \"tsne\", group.by = \"orig.ident\")+NoAxes()+ggtitle(\"tSNE integrated\"),\n DimPlot(alldata.int, reduction = \"umap\", group.by = 
\"orig.ident\")+NoAxes()+ggtitle(\"UMAP integrated\"),\n ncol = 3\n) + plot_layout(guides = \"collect\")\n\n\n\n\n\n\n\n\n\n2.1 Clean memory\nAgain we have a lot of large objects in the memory. We have the original data alldata but also the integrated data in alldata.int. We also have the split objects in alldata.list and the anchors in alldata.anchors. In principle we only need the integrated object for now, but we will also keep the list for running Scanorama further down in the tutorial.\nWe also want to keep is the orignial umap for visualization purposes, so we copy it over to the integrated object.\n\nalldata.int@reductions$umap_raw = alldata@reductions$umap\n\n# remove all objects that will not be used.\nrm(alldata, alldata.anchors)\n# run garbage collect to free up memory\ngc()\n\n used (Mb) gc trigger (Mb) max used (Mb)\nNcells 3414481 182.4 4989417 266.5 4989417 266.5\nVcells 203242618 1550.7 556191914 4243.5 868979728 6629.8\n\n\nLet’s plot some marker genes for different cell types onto the embedding.\n\n\n\nMarkers\nCell Type\n\n\n\n\nCD3E\nT cells\n\n\nCD3E CD4\nCD4+ T cells\n\n\nCD3E CD8A\nCD8+ T cells\n\n\nGNLY, NKG7\nNK cells\n\n\nMS4A1\nB cells\n\n\nCD14, LYZ, CST3, MS4A7\nCD14+ Monocytes\n\n\nFCGR3A, LYZ, CST3, MS4A7\nFCGR3A+ Monocytes\n\n\nFCER1A, CST3\nDCs\n\n\n\n\nmyfeatures <- c(\"CD3E\", \"CD4\", \"CD8A\", \"NKG7\", \"GNLY\", \"MS4A1\", \"CD14\", \"LYZ\", \"MS4A7\", \"FCGR3A\", \"CST3\", \"FCER1A\")\nFeaturePlot(alldata.int, reduction = \"umap\", dims = 1:2, features = myfeatures, ncol = 4, order = T) + NoLegend() + NoAxes() + NoGrid()" }, { "objectID": "labs/seurat/seurat_03_integration.html#harmony", @@ -214,7 +214,7 @@ "href": "labs/seurat/seurat_03_integration.html#scanorama", "title": " Data Integration", "section": "4 Scanorama", - "text": "4 Scanorama\nAnother integration method is Scanorama (see Nat. Biotech.). 
This method is implemented in python, but we can run it through the Reticulate package.\n\nassaylist <- list()\ngenelist <- list()\nfor(i in 1:length(alldata.list)) {\n assaylist[[i]] <- t(as.matrix(GetAssayData(alldata.list[[i]], \"data\")[hvgs_all,]))\n genelist[[i]] <- hvgs_all\n}\n\nlapply(assaylist,dim)\n\n[[1]]\n[1] 873 2000\n\n[[2]]\n[1] 556 2000\n\n[[3]]\n[1] 358 2000\n\n[[4]]\n[1] 1050 2000\n\n[[5]]\n[1] 1034 2000\n\n[[6]]\n[1] 1126 2000\n\n[[7]]\n[1] 998 2000\n\n[[8]]\n[1] 1139 2000\n\n\n\n# Activate scanorama Python venv\nscanorama <- reticulate::import(\"scanorama\")\n\nintegrated.data <- scanorama$integrate(datasets_full = assaylist,\n genes_list = genelist )\n\n# Now we create a new dim reduction object in the format that Seurat uses\nintdimred <- do.call(rbind, integrated.data[[1]])\ncolnames(intdimred) <- paste0(\"PC_\", 1:100)\nrownames(intdimred) <- colnames(alldata.int)\n\n# Add standard deviations in order to draw Elbow Plots in Seurat\nstdevs <- apply(intdimred, MARGIN = 2, FUN = sd)\n\n# Create a new dim red object.\nalldata.int[[\"scanorama\"]] <- CreateDimReducObject(\n embeddings = intdimred,\n stdev = stdevs,\n key = \"PC_\",\n assay = \"RNA\")\n\n\n#Here we use all PCs computed from Scanorama for UMAP calculation\nalldata.int <- RunUMAP(alldata.int, dims = 1:100, reduction = \"scanorama\",reduction.name = \"umap_scanorama\")\n\nDimPlot(alldata.int, reduction = \"umap_scanorama\", group.by = \"orig.ident\") + NoAxes() + ggtitle(\"Harmony UMAP\")" + "text": "4 Scanorama\nAnother integration method is Scanorama (see Nat. Biotech.). 
This method is implemented in python, but we can run it through the Reticulate package.\n\nassaylist <- list()\ngenelist <- list()\nfor(i in 1:length(alldata.list)) {\n assaylist[[i]] <- t(as.matrix(GetAssayData(alldata.list[[i]], \"data\")[hvgs_all,]))\n genelist[[i]] <- hvgs_all\n}\n\nlapply(assaylist,dim)\n\n[[1]]\n[1] 873 2000\n\n[[2]]\n[1] 556 2000\n\n[[3]]\n[1] 358 2000\n\n[[4]]\n[1] 1051 2000\n\n[[5]]\n[1] 1034 2000\n\n[[6]]\n[1] 1126 2000\n\n[[7]]\n[1] 997 2000\n\n[[8]]\n[1] 1139 2000\n\n\n\n# Activate scanorama Python venv\nscanorama <- reticulate::import(\"scanorama\")\n\nintegrated.data <- scanorama$integrate(datasets_full = assaylist,\n genes_list = genelist )\n\n# Now we create a new dim reduction object in the format that Seurat uses\nintdimred <- do.call(rbind, integrated.data[[1]])\ncolnames(intdimred) <- paste0(\"PC_\", 1:100)\nrownames(intdimred) <- colnames(alldata.int)\n\n# Add standard deviations in order to draw Elbow Plots in Seurat\nstdevs <- apply(intdimred, MARGIN = 2, FUN = sd)\n\n# Create a new dim red object.\nalldata.int[[\"scanorama\"]] <- CreateDimReducObject(\n embeddings = intdimred,\n stdev = stdevs,\n key = \"PC_\",\n assay = \"RNA\")\n\n\n#Here we use all PCs computed from Scanorama for UMAP calculation\nalldata.int <- RunUMAP(alldata.int, dims = 1:100, reduction = \"scanorama\",reduction.name = \"umap_scanorama\")\n\nDimPlot(alldata.int, reduction = \"umap_scanorama\", group.by = \"orig.ident\") + NoAxes() + ggtitle(\"Harmony UMAP\")" }, { "objectID": "labs/seurat/seurat_03_integration.html#overview-all-methods", @@ -263,21 +263,21 @@ "href": "labs/seurat/seurat_04_clustering.html#meta-clust_hier", "title": " Clustering", "section": "3 Hierarchical clustering", - "text": "3 Hierarchical clustering\n\n3.1 Defining distance between cells\nThe base R stats package already contains a function dist that calculates distances between all pairs of samples. 
Since we want to compute distances between samples, rather than among genes, we need to transpose the data before applying it to the dist function. This can be done by simply adding the transpose function t() to the data. The distance methods available in dist are: ‘euclidean’, ‘maximum’, ‘manhattan’, ‘canberra’, ‘binary’ or ‘minkowski’.\n\nd <- dist(alldata@reductions[[\"pca\"]]@cell.embeddings, method = \"euclidean\")\n\nAs you might have realized, correlation is not a method implemented in the dist() function. However, we can create our own distances and transform them to a distance object. We can first compute sample correlations using the cor function.\nAs you already know, correlation range from -1 to 1, where 1 indicates that two samples are closest, -1 indicates that two samples are the furthest and 0 is somewhat in between. This, however, creates a problem in defining distances because a distance of 0 indicates that two samples are closest, 1 indicates that two samples are the furthest and distance of -1 is not meaningful. We thus need to transform the correlations to a positive scale (a.k.a. adjacency):\n[adj = ]\nOnce we transformed the correlations to a 0-1 scale, we can simply convert it to a distance object using as.dist function. The transformation does not need to have a maximum of 1, but it is more intuitive to have it at 1, rather than at any other number.\n\n# Compute sample correlations\nsample_cor <- cor(Matrix::t(alldata@reductions[[\"pca\"]]@cell.embeddings))\n\n# Transform the scale from correlations\nsample_cor <- (1 - sample_cor) / 2\n\n# Convert it to a distance object\nd2 <- as.dist(sample_cor)\n\n\n\n3.2 Clustering cells\nAfter having calculated the distances between samples calculated, we can now proceed with the hierarchical clustering per-se. We will use the function hclust for this purpose, in which we can simply run it with the distance objects created above. 
The methods available are: ‘ward.D’, ‘ward.D2’, ‘single’, ‘complete’, ‘average’, ‘mcquitty’, ‘median’ or ‘centroid’. It is possible to plot the dendrogram for all cells, but this is very time consuming and we will omit for this tutorial.\n\n# euclidean\nh_euclidean <- hclust(d, method = \"ward.D2\")\n\n# correlation\nh_correlation <- hclust(d2, method = \"ward.D2\")\n\nOnce your dendrogram is created, the next step is to define which samples belong to a particular cluster. After identifying the dendrogram, we can now literally cut the tree at a fixed threshold (with cutree) at different levels to define the clusters. We can either define the number of clusters or decide on a height. We can simply try different clustering levels.\n\n# euclidean distance\nalldata$hc_euclidean_5 <- cutree(h_euclidean, k = 5)\nalldata$hc_euclidean_10 <- cutree(h_euclidean, k = 10)\nalldata$hc_euclidean_15 <- cutree(h_euclidean, k = 15)\n\n# correlation distance\nalldata$hc_corelation_5 <- cutree(h_correlation, k = 5)\nalldata$hc_corelation_10 <- cutree(h_correlation, k = 10)\nalldata$hc_corelation_15 <- cutree(h_correlation, k = 15)\n\nwrap_plots(\n DimPlot(alldata, reduction = \"umap\", group.by = \"hc_euclidean_5\") + ggtitle(\"hc_euc_5\"),\n DimPlot(alldata, reduction = \"umap\", group.by = \"hc_euclidean_10\") + ggtitle(\"hc_euc_10\"),\n DimPlot(alldata, reduction = \"umap\", group.by = \"hc_euclidean_15\") + ggtitle(\"hc_euc_15\"),\n DimPlot(alldata, reduction = \"umap\", group.by = \"hc_corelation_5\") + ggtitle(\"hc_cor_5\"),\n DimPlot(alldata, reduction = \"umap\", group.by = \"hc_corelation_10\") + ggtitle(\"hc_cor_10\"),\n DimPlot(alldata, reduction = \"umap\", group.by = \"hc_corelation_15\") + ggtitle(\"hc_cor_15\"),\n ncol = 3\n) + plot_layout(guides = \"collect\")\n\n\n\n\n\n\n\n\nFinally, lets save the clustered data for further analysis.\n\nsaveRDS(alldata, \"data/covid/results/seurat_covid_qc_dr_int_cl.rds\")\n\n\n\n3.3 Distribution of clusters\nNow, we can select one 
of our clustering methods and compare the proportion of samples across the clusters.\nSelect the “CCA_snn_res.0.5” and plot proportion of samples per cluster and also proportion covid vs ctrl.\n\np1 <- ggplot(alldata@meta.data, aes(x = CCA_snn_res.0.5, fill = orig.ident)) +\n geom_bar(position = \"fill\")\np2 <- ggplot(alldata@meta.data, aes(x = CCA_snn_res.0.5, fill = type)) +\n geom_bar(position = \"fill\")\n\np1 + p2\n\n\n\n\n\n\n\n\nIn this case we have quite good representation of each sample in each cluster. But there are clearly some biases with more cells from one sample in some clusters and also more covid cells in some of the clusters.\nWe can also plot it in the other direction, the proportion of each cluster per sample.\n\nggplot(alldata@meta.data, aes(x = orig.ident, fill = CCA_snn_res.0.5)) +\n geom_bar(position = \"fill\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nBy now you should know how to plot different features onto your data. Take the QC metrics that were calculated in the first exercise, that should be stored in your data object, and plot it as violin plots per cluster using the clustering method of your choice. For example, plot number of UMIS, detected genes, percent mitochondrial reads. Then, check carefully if there is any bias in how your data is separated due to quality metrics. Could it be explained biologically, or could you have technical bias there?" + "text": "3 Hierarchical clustering\n\n3.1 Defining distance between cells\nThe base R stats package already contains a function dist that calculates distances between all pairs of samples. Since we want to compute distances between samples, rather than among genes, we need to transpose the data before applying it to the dist function. This can be done by simply adding the transpose function t() to the data. 
The distance methods available in dist are: ‘euclidean’, ‘maximum’, ‘manhattan’, ‘canberra’, ‘binary’ or ‘minkowski’.\n\nd <- dist(alldata@reductions[[\"pca\"]]@cell.embeddings, method = \"euclidean\")\n\nAs you might have realized, correlation is not a method implemented in the dist() function. However, we can create our own distances and transform them to a distance object. We can first compute sample correlations using the cor function.\nAs you already know, correlations range from -1 to 1, where 1 indicates that two samples are closest, -1 indicates that two samples are the furthest and 0 is somewhat in between. This, however, creates a problem in defining distances because a distance of 0 indicates that two samples are closest, 1 indicates that two samples are the furthest and a distance of -1 is not meaningful. We thus need to transform the correlations to a positive scale (a.k.a. adjacency):\n\\[adj = \\frac{1 - cor}{2}\\]\nOnce we have transformed the correlations to a 0-1 scale, we can simply convert them to a distance object using the as.dist function. The transformation does not need to have a maximum of 1, but it is more intuitive to have it at 1, rather than at any other number.\n\n# Compute sample correlations\nsample_cor <- cor(Matrix::t(alldata@reductions[[\"pca\"]]@cell.embeddings))\n\n# Transform the scale from correlations\nsample_cor <- (1 - sample_cor) / 2\n\n# Convert it to a distance object\nd2 <- as.dist(sample_cor)\n\n\n\n3.2 Clustering cells\nAfter having calculated the distances between samples, we can now proceed with the hierarchical clustering per se. We will use the function hclust for this purpose, which we can simply run with the distance objects created above. The methods available are: ‘ward.D’, ‘ward.D2’, ‘single’, ‘complete’, ‘average’, ‘mcquitty’, ‘median’ or ‘centroid’. 
It is possible to plot the dendrogram for all cells, but this is very time-consuming, so we will omit it in this tutorial.\n\n# euclidean\nh_euclidean <- hclust(d, method = \"ward.D2\")\n\n# correlation\nh_correlation <- hclust(d2, method = \"ward.D2\")\n\nOnce your dendrogram is created, the next step is to define which samples belong to a particular cluster. After building the dendrogram, we can now literally cut the tree at a fixed threshold (with cutree) at different levels to define the clusters. We can either define the number of clusters or decide on a height. We can simply try different clustering levels.\n\n# euclidean distance\nalldata$hc_euclidean_5 <- cutree(h_euclidean, k = 5)\nalldata$hc_euclidean_10 <- cutree(h_euclidean, k = 10)\nalldata$hc_euclidean_15 <- cutree(h_euclidean, k = 15)\n\n# correlation distance\nalldata$hc_corelation_5 <- cutree(h_correlation, k = 5)\nalldata$hc_corelation_10 <- cutree(h_correlation, k = 10)\nalldata$hc_corelation_15 <- cutree(h_correlation, k = 15)\n\nwrap_plots(\n DimPlot(alldata, reduction = \"umap\", group.by = \"hc_euclidean_5\") + ggtitle(\"hc_euc_5\"),\n DimPlot(alldata, reduction = \"umap\", group.by = \"hc_euclidean_10\") + ggtitle(\"hc_euc_10\"),\n DimPlot(alldata, reduction = \"umap\", group.by = \"hc_euclidean_15\") + ggtitle(\"hc_euc_15\"),\n DimPlot(alldata, reduction = \"umap\", group.by = \"hc_corelation_5\") + ggtitle(\"hc_cor_5\"),\n DimPlot(alldata, reduction = \"umap\", group.by = \"hc_corelation_10\") + ggtitle(\"hc_cor_10\"),\n DimPlot(alldata, reduction = \"umap\", group.by = \"hc_corelation_15\") + ggtitle(\"hc_cor_15\"),\n ncol = 3\n) + plot_layout(guides = \"collect\")\n\n\n\n\n\n\n\n\nFinally, let’s save the clustered data for further analysis.\n\nsaveRDS(alldata, \"data/covid/results/seurat_covid_qc_dr_int_cl.rds\")" }, { "objectID": "labs/seurat/seurat_04_clustering.html#meta-session", "href": "labs/seurat/seurat_04_clustering.html#meta-session", "title": " Clustering", - "section": "4 
Session info", - "text": "4 Session info\n\n\nClick here\n\n\nsessionInfo()\n\nR version 4.3.0 (2023-04-21)\nPlatform: x86_64-pc-linux-gnu (64-bit)\nRunning under: Ubuntu 22.04.3 LTS\n\nMatrix products: default\nBLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 \nLAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0\n\nlocale:\n [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C \n [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 \n [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 \n [7] LC_PAPER=en_US.UTF-8 LC_NAME=C \n [9] LC_ADDRESS=C LC_TELEPHONE=C \n[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C \n\ntime zone: Etc/UTC\ntzcode source: system (glibc)\n\nattached base packages:\n[1] stats graphics grDevices utils datasets methods base \n\nother attached packages:\n[1] clustree_0.5.0 ggraph_2.1.0 pheatmap_1.0.12 ggplot2_3.4.2 \n[5] patchwork_1.1.2 SeuratObject_4.1.3 Seurat_4.3.0 \n\nloaded via a namespace (and not attached):\n [1] RColorBrewer_1.1-3 rstudioapi_0.14 jsonlite_1.8.5 \n [4] magrittr_2.0.3 spatstat.utils_3.0-3 farver_2.1.1 \n [7] rmarkdown_2.22 vctrs_0.6.2 ROCR_1.0-11 \n [10] spatstat.explore_3.2-1 htmltools_0.5.5 sctransform_0.3.5 \n [13] parallelly_1.36.0 KernSmooth_2.23-20 htmlwidgets_1.6.2 \n [16] ica_1.0-3 plyr_1.8.8 plotly_4.10.2 \n [19] zoo_1.8-12 igraph_1.4.3 mime_0.12 \n [22] lifecycle_1.0.3 pkgconfig_2.0.3 Matrix_1.5-4 \n [25] R6_2.5.1 fastmap_1.1.1 fitdistrplus_1.1-11 \n [28] future_1.32.0 shiny_1.7.4 digest_0.6.31 \n [31] colorspace_2.1-0 tensor_1.5 irlba_2.3.5.1 \n [34] labeling_0.4.2 progressr_0.13.0 fansi_1.0.4 \n [37] spatstat.sparse_3.0-1 httr_1.4.6 polyclip_1.10-4 \n [40] abind_1.4-5 compiler_4.3.0 withr_2.5.0 \n [43] backports_1.4.1 viridis_0.6.3 ggforce_0.4.1 \n [46] MASS_7.3-58.4 tools_4.3.0 lmtest_0.9-40 \n [49] httpuv_1.6.11 future.apply_1.11.0 goftest_1.2-3 \n [52] glue_1.6.2 nlme_3.1-162 promises_1.2.0.1 \n [55] grid_4.3.0 checkmate_2.2.0 Rtsne_0.16 \n [58] cluster_2.1.4 
reshape2_1.4.4 generics_0.1.3 \n [61] gtable_0.3.3 spatstat.data_3.0-1 tidyr_1.3.0 \n [64] data.table_1.14.8 tidygraph_1.2.3 sp_1.6-1 \n [67] utf8_1.2.3 spatstat.geom_3.2-1 RcppAnnoy_0.0.20 \n [70] ggrepel_0.9.3 RANN_2.6.1 pillar_1.9.0 \n [73] stringr_1.5.0 later_1.3.1 splines_4.3.0 \n [76] dplyr_1.1.2 tweenr_2.0.2 lattice_0.21-8 \n [79] survival_3.5-5 deldir_1.0-9 tidyselect_1.2.0 \n [82] miniUI_0.1.1.1 pbapply_1.7-0 knitr_1.43 \n [85] gridExtra_2.3 scattermore_1.2 xfun_0.39 \n [88] graphlayouts_1.0.0 matrixStats_1.0.0 stringi_1.7.12 \n [91] lazyeval_0.2.2 yaml_2.3.7 evaluate_0.21 \n [94] codetools_0.2-19 tibble_3.2.1 cli_3.6.1 \n [97] uwot_0.1.14 xtable_1.8-4 reticulate_1.30 \n[100] munsell_0.5.0 Rcpp_1.0.10 globals_0.16.2 \n[103] spatstat.random_3.1-5 png_0.1-8 parallel_4.3.0 \n[106] ellipsis_0.3.2 listenv_0.9.0 viridisLite_0.4.2 \n[109] scales_1.2.1 ggridges_0.5.4 leiden_0.4.3 \n[112] purrr_1.0.1 rlang_1.1.1 cowplot_1.1.1" + "section": "5 Session info", + "text": "5 Session info\n\n\nClick here\n\n\nsessionInfo()\n\nR version 4.3.0 (2023-04-21)\nPlatform: x86_64-pc-linux-gnu (64-bit)\nRunning under: Ubuntu 22.04.3 LTS\n\nMatrix products: default\nBLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 \nLAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0\n\nlocale:\n [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C \n [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 \n [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 \n [7] LC_PAPER=en_US.UTF-8 LC_NAME=C \n [9] LC_ADDRESS=C LC_TELEPHONE=C \n[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C \n\ntime zone: Etc/UTC\ntzcode source: system (glibc)\n\nattached base packages:\n[1] stats graphics grDevices utils datasets methods base \n\nother attached packages:\n[1] clustree_0.5.0 ggraph_2.1.0 pheatmap_1.0.12 ggplot2_3.4.2 \n[5] patchwork_1.1.2 SeuratObject_4.1.3 Seurat_4.3.0 \n\nloaded via a namespace (and not attached):\n [1] RColorBrewer_1.1-3 rstudioapi_0.14 
jsonlite_1.8.5 \n [4] magrittr_2.0.3 spatstat.utils_3.0-3 farver_2.1.1 \n [7] rmarkdown_2.22 vctrs_0.6.2 ROCR_1.0-11 \n [10] spatstat.explore_3.2-1 htmltools_0.5.5 sctransform_0.3.5 \n [13] parallelly_1.36.0 KernSmooth_2.23-20 htmlwidgets_1.6.2 \n [16] ica_1.0-3 plyr_1.8.8 plotly_4.10.2 \n [19] zoo_1.8-12 igraph_1.4.3 mime_0.12 \n [22] lifecycle_1.0.3 pkgconfig_2.0.3 Matrix_1.5-4 \n [25] R6_2.5.1 fastmap_1.1.1 fitdistrplus_1.1-11 \n [28] future_1.32.0 shiny_1.7.4 digest_0.6.31 \n [31] colorspace_2.1-0 tensor_1.5 irlba_2.3.5.1 \n [34] labeling_0.4.2 progressr_0.13.0 fansi_1.0.4 \n [37] spatstat.sparse_3.0-1 httr_1.4.6 polyclip_1.10-4 \n [40] abind_1.4-5 compiler_4.3.0 withr_2.5.0 \n [43] backports_1.4.1 viridis_0.6.3 ggforce_0.4.1 \n [46] MASS_7.3-58.4 tools_4.3.0 lmtest_0.9-40 \n [49] httpuv_1.6.11 future.apply_1.11.0 goftest_1.2-3 \n [52] glue_1.6.2 nlme_3.1-162 promises_1.2.0.1 \n [55] grid_4.3.0 checkmate_2.2.0 Rtsne_0.16 \n [58] cluster_2.1.4 reshape2_1.4.4 generics_0.1.3 \n [61] gtable_0.3.3 spatstat.data_3.0-1 tidyr_1.3.0 \n [64] data.table_1.14.8 tidygraph_1.2.3 sp_1.6-1 \n [67] utf8_1.2.3 spatstat.geom_3.2-1 RcppAnnoy_0.0.20 \n [70] ggrepel_0.9.3 RANN_2.6.1 pillar_1.9.0 \n [73] stringr_1.5.0 later_1.3.1 splines_4.3.0 \n [76] dplyr_1.1.2 tweenr_2.0.2 lattice_0.21-8 \n [79] survival_3.5-5 deldir_1.0-9 tidyselect_1.2.0 \n [82] miniUI_0.1.1.1 pbapply_1.7-0 knitr_1.43 \n [85] gridExtra_2.3 scattermore_1.2 xfun_0.39 \n [88] graphlayouts_1.0.0 matrixStats_1.0.0 stringi_1.7.12 \n [91] lazyeval_0.2.2 yaml_2.3.7 evaluate_0.21 \n [94] codetools_0.2-19 tibble_3.2.1 cli_3.6.1 \n [97] uwot_0.1.14 xtable_1.8-4 reticulate_1.30 \n[100] munsell_0.5.0 Rcpp_1.0.10 globals_0.16.2 \n[103] spatstat.random_3.1-5 png_0.1-8 parallel_4.3.0 \n[106] ellipsis_0.3.2 listenv_0.9.0 viridisLite_0.4.2 \n[109] scales_1.2.1 ggridges_0.5.4 leiden_0.4.3 \n[112] purrr_1.0.1 rlang_1.1.1 cowplot_1.1.1" }, { "objectID": "labs/seurat/seurat_05_dge.html", "href": "labs/seurat/seurat_05_dge.html", 
"title": " Differential gene expression", "section": "", - "text": "Note\n\n\n\nCode chunks run R commands unless otherwise specified.\nIn this tutorial we will cover about Differetial gene expression, which comprises an extensive range of topics and methods. In single cell, differential expresison can have multiple functionalities such as of identifying marker genes for cell populations, as well as differentially regulated genes across conditions (healthy vs control). We will also exercise on how to account the batch information in your test.\nWe can first load the data from the clustering session. Moreover, we can already decide which clustering resolution to use. First let’s define using the louvain clustering to identifying differentially expressed genes.\nsuppressPackageStartupMessages({\n library(Seurat)\n library(dplyr)\n library(patchwork)\n library(ggplot2)\n library(pheatmap)\n library(enrichR)\n library(Matrix)\n library(edgeR)\n library(MAST)\n})\npath_data <- \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\npath_file <- \"data/covid/results/seurat_covid_qc_dr_int_cl.rds\"\nif (!dir.exists(dirname(path_file))) dir.create(dirname(path_file), recursive = TRUE)\nif (!file.exists(path_file)) download.file(url = file.path(path_data, \"covid/results/seurat_covid_qc_dr_int_cl.rds\"), destfile = path_file)\nalldata <- readRDS(path_file)\n# Set the identity as louvain with resolution 0.5\nsel.clust <- \"CCA_snn_res.0.5\"\n\nalldata <- SetIdent(alldata, value = sel.clust)\ntable(alldata@active.ident)\n\n\n 0 1 2 3 4 5 6 7 8 \n2056 1259 1113 646 535 494 365 337 329\n# plot this clustering\nwrap_plots(\n DimPlot(alldata, label = T) + NoAxes(),\n DimPlot(alldata, group.by = \"orig.ident\") + NoAxes(),\n DimPlot(alldata, group.by = \"type\") + NoAxes(),\n ncol = 3\n)" + "text": "Note\n\n\n\nCode chunks run R commands unless otherwise specified.\nIn this tutorial we will cover about Differetial gene expression, which comprises an extensive 
range of topics and methods. In single cell, differential expression can serve multiple purposes, such as identifying marker genes for cell populations, as well as differentially regulated genes across conditions (e.g. covid vs control). We will also practice how to account for batch information in your test.\nWe can first load the data from the clustering session. Moreover, we can already decide which clustering resolution to use. First, let’s use the Louvain clustering to identify differentially expressed genes.\nsuppressPackageStartupMessages({\n library(Seurat)\n library(dplyr)\n library(patchwork)\n library(ggplot2)\n library(pheatmap)\n library(enrichR)\n library(Matrix)\n library(edgeR)\n library(MAST)\n})\npath_data <- \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\npath_file <- \"data/covid/results/seurat_covid_qc_dr_int_cl.rds\"\nif (!dir.exists(dirname(path_file))) dir.create(dirname(path_file), recursive = TRUE)\nif (!file.exists(path_file)) download.file(url = file.path(path_data, \"covid/results/seurat_covid_qc_dr_int_cl.rds\"), destfile = path_file)\nalldata <- readRDS(path_file)\n# Set the identity as louvain with resolution 0.5\nsel.clust <- \"CCA_snn_res.0.5\"\n\nalldata <- SetIdent(alldata, value = sel.clust)\ntable(alldata@active.ident)\n\n\n 0 1 2 3 4 5 6 7 8 \n2063 1297 1073 642 546 489 368 336 320\n# plot this clustering\nwrap_plots(\n DimPlot(alldata, label = T) + NoAxes(),\n DimPlot(alldata, group.by = \"orig.ident\") + NoAxes(),\n DimPlot(alldata, group.by = \"type\") + NoAxes(),\n ncol = 3\n)" }, { "objectID": "labs/seurat/seurat_05_dge.html#meta-dge_cmg", @@ -298,7 +298,7 @@ "href": "labs/seurat/seurat_05_dge.html#patient-batch-effects", "title": " Differential gene expression", "section": "3 Patient Batch effects", - "text": "3 Patient Batch effects\nWhen we are testing for Covid vs Control we are running a DGE test for 4 vs 4 individuals. 
That will be very sensitive to sample differences unless we find a way to control for it. So first, lets check how the top DGEs are expressed across the individuals within cluster 3:\n\nVlnPlot(cell_selection, group.by = \"orig.ident\", features = as.character(unique(top5_cell_selection$gene)), ncol = 4, assay = \"RNA\", pt.size = 0)\n\n\n\n\n\n\n\n\nAs you can see, many of the genes detected as DGE in Covid are unique to one or 2 patients.\nWe can examine more genes with a DotPlot:\n\nDGE_cell_selection %>%\n group_by(direction) %>%\n top_n(-20, p_val) -> top20_cell_selection\nDotPlot(cell_selection, features = rev(as.character(unique(top20_cell_selection$gene))), group.by = \"orig.ident\", assay = \"RNA\") + coord_flip() + RotatedAxis()\n\n\n\n\n\n\n\n\nAs you can see, most of the DGEs are driven by the covid_17 patient. It is also a sample with very high number of cells:\n\ntable(cell_selection$orig.ident)\n\n\n covid_1 covid_15 covid_16 covid_17 ctrl_13 ctrl_14 ctrl_19 ctrl_5 \n 95 32 37 173 64 62 37 146" + "text": "3 Patient Batch effects\nWhen we are testing for Covid vs Control we are running a DGE test for 4 vs 4 individuals. That will be very sensitive to sample differences unless we find a way to control for it. So first, lets check how the top DGEs are expressed across the individuals within cluster 3:\n\nVlnPlot(cell_selection, group.by = \"orig.ident\", features = as.character(unique(top5_cell_selection$gene)), ncol = 4, assay = \"RNA\", pt.size = 0)\n\n\n\n\n\n\n\n\nAs you can see, many of the genes detected as DGE in Covid are unique to one or 2 patients.\nWe can examine more genes with a DotPlot:\n\nDGE_cell_selection %>%\n group_by(direction) %>%\n top_n(-20, p_val) -> top20_cell_selection\nDotPlot(cell_selection, features = rev(as.character(unique(top20_cell_selection$gene))), group.by = \"orig.ident\", assay = \"RNA\") + coord_flip() + RotatedAxis()\n\n\n\n\n\n\n\n\nAs you can see, most of the DGEs are driven by the covid_17 patient. 
It is also a sample with very high number of cells:\n\ntable(cell_selection$orig.ident)\n\n\n covid_1 covid_15 covid_16 covid_17 ctrl_13 ctrl_14 ctrl_19 ctrl_5 \n 93 32 37 173 62 62 37 146" }, { "objectID": "labs/seurat/seurat_05_dge.html#subsample", @@ -312,7 +312,7 @@ "href": "labs/seurat/seurat_05_dge.html#pseudobulk", "title": " Differential gene expression", "section": "5 Pseudobulk", - "text": "5 Pseudobulk\nOne option is to treat the samples as pseudobulks and do differential expression for the 4 patients vs 4 controls. You do lose some information about cell variability within each patient, but instead you gain the advantage of mainly looking for effects that are seen in multiple patients.\nHowever, having only 4 patients is perhaps too low, with many more patients it will work better to run pseudobulk analysis.\nFor a fair comparison we should have equal number of cells per sample when we create the pseudobulk, so we will use the subsampled object.\n\n# get the count matrix for all cells\nDGE_DATA <- sub_data@assays$RNA@counts\n\n# Compute pseudobulk\nmm <- Matrix::sparse.model.matrix(~ 0 + sub_data$orig.ident)\npseudobulk <- DGE_DATA %*% mm\n\nThen run edgeR:\n\n# define the groups\nbulk.labels <- c(\"Covid\", \"Covid\", \"Covid\", \"Covid\", \"Ctrl\", \"Ctrl\", \"Ctrl\", \"Ctrl\")\n\ndge.list <- DGEList(counts = pseudobulk, group = factor(bulk.labels))\nkeep <- filterByExpr(dge.list)\ndge.list <- dge.list[keep, , keep.lib.sizes = FALSE]\n\ndge.list <- calcNormFactors(dge.list)\ndesign <- model.matrix(~bulk.labels)\n\ndge.list <- estimateDisp(dge.list, design)\n\nfit <- glmQLFit(dge.list, design)\nqlf <- glmQLFTest(fit, coef = 2)\ntopTags(qlf)\n\nCoefficient: bulk.labelsCtrl \n logFC logCPM F PValue FDR\nS100A8 -2.672605 6.972711 37.41996 6.779653e-06 0.01083389\nS100A9 -2.512717 7.374885 27.28588 5.193871e-05 0.04149903\nSTAG3 -3.378653 7.540873 24.35275 8.987020e-05 0.04787086\nPIM3 -1.412489 7.839512 17.02383 6.030641e-04 0.23537510\nIGHA1 -2.676072 
6.965149 16.09405 7.364678e-04 0.23537510\nDYNC1H1 1.279395 6.711434 12.94684 1.976508e-03 0.52641010\nPHACTR1 -1.207474 7.908323 11.47741 3.176723e-03 0.67568316\nCCR7 -1.301642 8.017766 11.28727 3.382644e-03 0.67568316\nWDFY2 1.172984 7.133247 10.76672 4.049332e-03 0.69392707\nMOB3A -1.128665 7.131236 10.56187 4.342472e-03 0.69392707\n\n\nAs you can see, we have very few significant genes. Since we only have 4 vs 4 samples, we should not expect too many genes with this method.\nAgain as dotplot including top 10 genes:\n\nres.edgeR <- topTags(qlf, 100)$table\nres.edgeR$dir <- ifelse(res.edgeR$logFC > 0, \"Covid\", \"Ctrl\")\nres.edgeR$gene <- rownames(res.edgeR)\n\nres.edgeR %>%\n group_by(dir) %>%\n top_n(-10, PValue) %>%\n arrange(dir) -> top.edgeR\n\nDotPlot(cell_selection,\n features = as.character(unique(top.edgeR$gene)), group.by = \"orig.ident\",\n assay = \"RNA\"\n) + coord_flip() + ggtitle(\"EdgeR pseudobulk\") + RotatedAxis()\n\n\n\n\n\n\n\n\nAs you can see, even if we get few genes detected the seem to make sense across all the patients." + "text": "5 Pseudobulk\nOne option is to treat the samples as pseudobulks and do differential expression for the 4 patients vs 4 controls. 
You do lose some information about cell variability within each patient, but instead you gain the advantage of mainly looking for effects that are seen in multiple patients.\nHowever, having only 4 patients is perhaps too few; pseudobulk analysis works better with many more patients.\nFor a fair comparison we should have an equal number of cells per sample when we create the pseudobulk, so we will use the subsampled object.\n\n# get the count matrix for all cells\nDGE_DATA <- sub_data@assays$RNA@counts\n\n# Compute pseudobulk\nmm <- Matrix::sparse.model.matrix(~ 0 + sub_data$orig.ident)\npseudobulk <- DGE_DATA %*% mm\n\nThen run edgeR:\n\n# define the groups\nbulk.labels <- c(\"Covid\", \"Covid\", \"Covid\", \"Covid\", \"Ctrl\", \"Ctrl\", \"Ctrl\", \"Ctrl\")\n\ndge.list <- DGEList(counts = pseudobulk, group = factor(bulk.labels))\nkeep <- filterByExpr(dge.list)\ndge.list <- dge.list[keep, , keep.lib.sizes = FALSE]\n\ndge.list <- calcNormFactors(dge.list)\ndesign <- model.matrix(~bulk.labels)\n\ndge.list <- estimateDisp(dge.list, design)\n\nfit <- glmQLFit(dge.list, design)\nqlf <- glmQLFTest(fit, coef = 2)\ntopTags(qlf)\n\nCoefficient: bulk.labelsCtrl \n logFC logCPM F PValue FDR\nS100A8 -2.769215 6.963840 45.76310 1.792203e-06 0.002996563\nS100A9 -2.605746 7.463864 29.05267 3.622977e-05 0.030288086\nSTAG3 -3.130834 7.358135 20.80773 2.141285e-04 0.119340964\nIGHA1 -2.777404 6.965359 18.84381 3.484837e-04 0.145666204\nDYNC1H1 1.371425 6.575657 14.60978 1.187609e-03 0.397136505\nPIM3 -1.391713 7.788553 12.91552 1.991325e-03 0.499573279\nTLE1 -1.135713 7.356197 12.76593 2.091515e-03 0.499573279\nTRAF3IP3 1.206566 7.489238 11.85600 2.799146e-03 0.585021456\nWDFY2 1.178937 7.063276 11.38622 3.275591e-03 0.608532060\nAHNAK 1.163878 7.833990 11.02406 3.680067e-03 0.615307195\n\n\nAs you can see, we have very few significant genes. 
Since we only have 4 vs 4 samples, we should not expect too many genes with this method.\nAgain, as a dotplot including the top 10 genes:\n\nres.edgeR <- topTags(qlf, 100)$table\n# coef = 2 is bulk.labelsCtrl, so a positive logFC means higher in Ctrl\nres.edgeR$dir <- ifelse(res.edgeR$logFC > 0, \"Ctrl\", \"Covid\")\nres.edgeR$gene <- rownames(res.edgeR)\n\nres.edgeR %>%\n group_by(dir) %>%\n top_n(-10, PValue) %>%\n arrange(dir) -> top.edgeR\n\nDotPlot(cell_selection,\n features = as.character(unique(top.edgeR$gene)), group.by = \"orig.ident\",\n assay = \"RNA\"\n) + coord_flip() + ggtitle(\"EdgeR pseudobulk\") + RotatedAxis()\n\n\n\n\n\n\n\n\nAs you can see, even though we detect few genes, they seem to make sense across all the patients." }, { "objectID": "labs/seurat/seurat_05_dge.html#mast-random-effect", @@ -340,7 +340,7 @@ "href": "labs/seurat/seurat_05_dge.html#meta-session", "title": " Differential gene expression", "section": "9 Session info", - "text": "9 Session info\n\n\nClick here\n\n\nsessionInfo()\n\nR version 4.3.0 (2023-04-21)\nPlatform: x86_64-pc-linux-gnu (64-bit)\nRunning under: Ubuntu 22.04.3 LTS\n\nMatrix products: default\nBLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 \nLAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0\n\nlocale:\n [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C \n [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 \n [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 \n [7] LC_PAPER=en_US.UTF-8 LC_NAME=C \n [9] LC_ADDRESS=C LC_TELEPHONE=C \n[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C \n\ntime zone: Etc/UTC\ntzcode source: system (glibc)\n\nattached base packages:\n[1] stats4 stats graphics grDevices utils datasets methods \n[8] base \n\nother attached packages:\n [1] fgsea_1.28.0 msigdbr_7.5.1 \n [3] lme4_1.1-33 MAST_1.28.0 \n [5] SingleCellExperiment_1.24.0 SummarizedExperiment_1.32.0\n [7] Biobase_2.62.0 GenomicRanges_1.54.1 \n [9] GenomeInfoDb_1.38.5 IRanges_2.36.0 \n[11] S4Vectors_0.40.2 BiocGenerics_0.48.1 \n[13] MatrixGenerics_1.14.0 
matrixStats_1.0.0 \n[15] edgeR_4.0.7 limma_3.58.1 \n[17] Matrix_1.5-4 enrichR_3.2 \n[19] pheatmap_1.0.12 ggplot2_3.4.2 \n[21] patchwork_1.1.2 dplyr_1.1.2 \n[23] SeuratObject_4.1.3 Seurat_4.3.0 \n\nloaded via a namespace (and not attached):\n [1] RcppAnnoy_0.0.20 splines_4.3.0 later_1.3.1 \n [4] bitops_1.0-7 tibble_3.2.1 polyclip_1.10-4 \n [7] lifecycle_1.0.3 globals_0.16.2 lattice_0.21-8 \n [10] MASS_7.3-58.4 magrittr_2.0.3 plotly_4.10.2 \n [13] rmarkdown_2.22 yaml_2.3.7 httpuv_1.6.11 \n [16] sctransform_0.3.5 sp_1.6-1 spatstat.sparse_3.0-1 \n [19] reticulate_1.30 cowplot_1.1.1 pbapply_1.7-0 \n [22] minqa_1.2.5 RColorBrewer_1.1-3 abind_1.4-5 \n [25] zlibbioc_1.48.0 Rtsne_0.16 purrr_1.0.1 \n [28] RCurl_1.98-1.12 WriteXLS_6.4.0 GenomeInfoDbData_1.2.11\n [31] ggrepel_0.9.3 irlba_2.3.5.1 listenv_0.9.0 \n [34] spatstat.utils_3.0-3 goftest_1.2-3 spatstat.random_3.1-5 \n [37] fitdistrplus_1.1-11 parallelly_1.36.0 leiden_0.4.3 \n [40] codetools_0.2-19 DelayedArray_0.28.0 tidyselect_1.2.0 \n [43] farver_2.1.1 spatstat.explore_3.2-1 jsonlite_1.8.5 \n [46] ellipsis_0.3.2 progressr_0.13.0 ggridges_0.5.4 \n [49] survival_3.5-5 tools_4.3.0 progress_1.2.2 \n [52] ica_1.0-3 Rcpp_1.0.10 glue_1.6.2 \n [55] gridExtra_2.3 SparseArray_1.2.3 xfun_0.39 \n [58] withr_2.5.0 fastmap_1.1.1 boot_1.3-28.1 \n [61] fansi_1.0.4 digest_0.6.31 R6_2.5.1 \n [64] mime_0.12 colorspace_2.1-0 scattermore_1.2 \n [67] tensor_1.5 spatstat.data_3.0-1 utf8_1.2.3 \n [70] tidyr_1.3.0 generics_0.1.3 data.table_1.14.8 \n [73] prettyunits_1.1.1 httr_1.4.6 htmlwidgets_1.6.2 \n [76] S4Arrays_1.2.0 uwot_0.1.14 pkgconfig_2.0.3 \n [79] gtable_0.3.3 lmtest_0.9-40 XVector_0.42.0 \n [82] htmltools_0.5.5 scales_1.2.1 png_0.1-8 \n [85] knitr_1.43 rstudioapi_0.14 reshape2_1.4.4 \n [88] rjson_0.2.21 nlme_3.1-162 curl_5.0.1 \n [91] nloptr_2.0.3 zoo_1.8-12 stringr_1.5.0 \n [94] KernSmooth_2.23-20 parallel_4.3.0 miniUI_0.1.1.1 \n [97] pillar_1.9.0 grid_4.3.0 vctrs_0.6.2 \n[100] RANN_2.6.1 promises_1.2.0.1 xtable_1.8-4 \n[103] 
cluster_2.1.4 evaluate_0.21 cli_3.6.1 \n[106] locfit_1.5-9.8 compiler_4.3.0 rlang_1.1.1 \n[109] crayon_1.5.2 future.apply_1.11.0 labeling_0.4.2 \n[112] plyr_1.8.8 stringi_1.7.12 BiocParallel_1.36.0 \n[115] viridisLite_0.4.2 deldir_1.0-9 babelgene_22.9 \n[118] munsell_0.5.0 lazyeval_0.2.2 spatstat.geom_3.2-1 \n[121] hms_1.1.3 future_1.32.0 statmod_1.5.0 \n[124] shiny_1.7.4 ROCR_1.0-11 igraph_1.4.3 \n[127] fastmatch_1.1-3" + "text": "9 Session info\n\n\nClick here\n\n\nsessionInfo()\n\nR version 4.3.0 (2023-04-21)\nPlatform: x86_64-pc-linux-gnu (64-bit)\nRunning under: Ubuntu 22.04.3 LTS\n\nMatrix products: default\nBLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 \nLAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0\n\nlocale:\n [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C \n [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 \n [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 \n [7] LC_PAPER=en_US.UTF-8 LC_NAME=C \n [9] LC_ADDRESS=C LC_TELEPHONE=C \n[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C \n\ntime zone: Etc/UTC\ntzcode source: system (glibc)\n\nattached base packages:\n[1] stats4 stats graphics grDevices utils datasets methods \n[8] base \n\nother attached packages:\n [1] fgsea_1.28.0 msigdbr_7.5.1 \n [3] lme4_1.1-33 MAST_1.28.0 \n [5] SingleCellExperiment_1.24.0 SummarizedExperiment_1.32.0\n [7] Biobase_2.62.0 GenomicRanges_1.54.1 \n [9] GenomeInfoDb_1.38.5 IRanges_2.36.0 \n[11] S4Vectors_0.40.2 BiocGenerics_0.48.1 \n[13] MatrixGenerics_1.14.0 matrixStats_1.0.0 \n[15] edgeR_4.0.7 limma_3.58.1 \n[17] Matrix_1.5-4 enrichR_3.2 \n[19] pheatmap_1.0.12 ggplot2_3.4.2 \n[21] patchwork_1.1.2 dplyr_1.1.2 \n[23] SeuratObject_4.1.3 Seurat_4.3.0 \n\nloaded via a namespace (and not attached):\n [1] RcppAnnoy_0.0.20 splines_4.3.0 later_1.3.1 \n [4] bitops_1.0-7 tibble_3.2.1 polyclip_1.10-4 \n [7] lifecycle_1.0.3 globals_0.16.2 lattice_0.21-8 \n [10] MASS_7.3-58.4 magrittr_2.0.3 plotly_4.10.2 \n [13] rmarkdown_2.22 
yaml_2.3.7 httpuv_1.6.11 \n [16] sctransform_0.3.5 sp_1.6-1 spatstat.sparse_3.0-1 \n [19] reticulate_1.30 cowplot_1.1.1 pbapply_1.7-0 \n [22] minqa_1.2.5 RColorBrewer_1.1-3 abind_1.4-5 \n [25] zlibbioc_1.48.0 Rtsne_0.16 purrr_1.0.1 \n [28] RCurl_1.98-1.12 WriteXLS_6.4.0 GenomeInfoDbData_1.2.11\n [31] ggrepel_0.9.3 irlba_2.3.5.1 listenv_0.9.0 \n [34] spatstat.utils_3.0-3 goftest_1.2-3 spatstat.random_3.1-5 \n [37] fitdistrplus_1.1-11 parallelly_1.36.0 leiden_0.4.3 \n [40] codetools_0.2-19 DelayedArray_0.28.0 tidyselect_1.2.0 \n [43] farver_2.1.1 spatstat.explore_3.2-1 jsonlite_1.8.5 \n [46] ellipsis_0.3.2 progressr_0.13.0 ggridges_0.5.4 \n [49] survival_3.5-5 tools_4.3.0 ica_1.0-3 \n [52] Rcpp_1.0.10 glue_1.6.2 gridExtra_2.3 \n [55] SparseArray_1.2.3 xfun_0.39 withr_2.5.0 \n [58] fastmap_1.1.1 boot_1.3-28.1 fansi_1.0.4 \n [61] digest_0.6.31 R6_2.5.1 mime_0.12 \n [64] colorspace_2.1-0 scattermore_1.2 tensor_1.5 \n [67] spatstat.data_3.0-1 utf8_1.2.3 tidyr_1.3.0 \n [70] generics_0.1.3 data.table_1.14.8 httr_1.4.6 \n [73] htmlwidgets_1.6.2 S4Arrays_1.2.0 uwot_0.1.14 \n [76] pkgconfig_2.0.3 gtable_0.3.3 lmtest_0.9-40 \n [79] XVector_0.42.0 htmltools_0.5.5 scales_1.2.1 \n [82] png_0.1-8 knitr_1.43 rstudioapi_0.14 \n [85] reshape2_1.4.4 rjson_0.2.21 nlme_3.1-162 \n [88] curl_5.0.1 nloptr_2.0.3 zoo_1.8-12 \n [91] stringr_1.5.0 KernSmooth_2.23-20 parallel_4.3.0 \n [94] miniUI_0.1.1.1 pillar_1.9.0 grid_4.3.0 \n [97] vctrs_0.6.2 RANN_2.6.1 promises_1.2.0.1 \n[100] xtable_1.8-4 cluster_2.1.4 evaluate_0.21 \n[103] cli_3.6.1 locfit_1.5-9.8 compiler_4.3.0 \n[106] rlang_1.1.1 crayon_1.5.2 future.apply_1.11.0 \n[109] labeling_0.4.2 plyr_1.8.8 stringi_1.7.12 \n[112] BiocParallel_1.36.0 viridisLite_0.4.2 deldir_1.0-9 \n[115] babelgene_22.9 munsell_0.5.0 lazyeval_0.2.2 \n[118] spatstat.geom_3.2-1 future_1.32.0 statmod_1.5.0 \n[121] shiny_1.7.4 ROCR_1.0-11 igraph_1.4.3 \n[124] fastmatch_1.1-3" }, { "objectID": "labs/seurat/seurat_06_celltyping.html", @@ -354,7 +354,7 @@ "href": 
"labs/seurat/seurat_06_celltyping.html#meta-ct_read", "title": " Celltype prediction", "section": "1 Read data", - "text": "1 Read data\nFirst, lets load required libraries\n\nsuppressPackageStartupMessages({\n library(Seurat)\n library(dplyr)\n library(patchwork)\n library(ggplot2)\n library(pheatmap)\n # remotes::install_github(\"powellgenomicslab/scPred\")\n library(scPred)\n})\n\nLet’s read in the saved Covid-19 data object from the clustering step.\n\npath_data <- \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\npath_file <- \"data/covid/results/seurat_covid_qc_dr_int_cl.rds\"\nif (!dir.exists(dirname(path_file))) dir.create(dirname(path_file), recursive = TRUE)\nif (!file.exists(path_file)) download.file(url = file.path(path_data, \"covid/results/seurat_covid_qc_dr_int_cl.rds\"), destfile = path_file)\nalldata <- readRDS(path_file)\n\nSubset one patient.\n\nctrl <- alldata[, alldata$orig.ident == \"ctrl_13\"]\n\n# set active assay to RNA and remove the CCA assay\nctrl@active.assay <- \"RNA\"\nctrl[[\"CCA\"]] <- NULL\nctrl\n\nAn object of class Seurat \n18851 features across 1126 samples within 1 assay \nActive assay: RNA (18851 features, 2000 variable features)\n 6 dimensional reductions calculated: umap, tsne, umap_raw, pca_harmony, harmony, umap_harmony" + "text": "1 Read data\nFirst, lets load required libraries\n\nsuppressPackageStartupMessages({\n library(Seurat)\n library(dplyr)\n library(patchwork)\n library(ggplot2)\n library(pheatmap)\n # remotes::install_github(\"powellgenomicslab/scPred\")\n library(scPred)\n})\n\nLet’s read in the saved Covid-19 data object from the clustering step.\n\npath_data <- \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\npath_file <- \"data/covid/results/seurat_covid_qc_dr_int_cl.rds\"\nif (!dir.exists(dirname(path_file))) dir.create(dirname(path_file), recursive = TRUE)\nif (!file.exists(path_file)) download.file(url = file.path(path_data, 
\"covid/results/seurat_covid_qc_dr_int_cl.rds\"), destfile = path_file)\nalldata <- readRDS(path_file)\n\nSubset one patient.\n\nctrl <- alldata[, alldata$orig.ident == \"ctrl_13\"]\n\n# set active assay to RNA and remove the CCA assay\nctrl@active.assay <- \"RNA\"\nctrl[[\"CCA\"]] <- NULL\nctrl\n\nAn object of class Seurat \n18854 features across 1126 samples within 1 assay \nActive assay: RNA (18854 features, 2000 variable features)\n 6 dimensional reductions calculated: umap, tsne, umap_raw, pca_harmony, harmony, umap_harmony" }, { "objectID": "labs/seurat/seurat_06_celltyping.html#meta-ct_ref", @@ -375,7 +375,7 @@ "href": "labs/seurat/seurat_06_celltyping.html#meta-ct_scpred", "title": " Celltype prediction", "section": "4 scPred", - "text": "4 scPred\nscPred will train a classifier based on all principal components. First, getFeatureSpace will create a scPred object stored in the @misc slot where it extracts the PCs that best separates the different celltypes. Then trainModel will do the actual training for each celltype.\n\nreference <- getFeatureSpace(reference, \"cell_type\")\n\n● Extracting feature space for each cell type...\nDONE!\n\nreference <- trainModel(reference)\n\n● Training models for each cell type...\nmaximum number of iterations reached 0.0001152056 -0.0001143117DONE!\n\n\nWe can then print how well the training worked for the different celltypes by printing the number of PCs used for each, the ROC value and Sensitivity/Specificity. 
Which celltypes do you think are harder to classify based on this dataset?\n\nget_scpred(reference)\n\n'scPred' object\n✔ Prediction variable = cell_type \n✔ Discriminant features per cell type\n✔ Training model(s)\nSummary\n\n|Cell type | n| Features|Method | ROC| Sens| Spec|\n|:-----------|----:|--------:|:---------|-----:|-----:|-----:|\n|B cell | 280| 50|svmRadial | 1.000| 0.964| 1.000|\n|CD4 T cell | 1620| 50|svmRadial | 0.997| 0.972| 0.975|\n|CD8 T cell | 945| 50|svmRadial | 0.985| 0.899| 0.978|\n|cDC | 26| 50|svmRadial | 0.995| 0.547| 1.000|\n|cMono | 212| 50|svmRadial | 0.994| 0.958| 0.970|\n|ncMono | 79| 50|svmRadial | 0.998| 0.570| 1.000|\n|NK cell | 312| 50|svmRadial | 0.999| 0.933| 0.996|\n|pDC | 20| 50|svmRadial | 1.000| 0.700| 1.000|\n|Plasma cell | 6| 50|svmRadial | 1.000| 0.800| 1.000|\n\n\nYou can optimize parameters for each dataset by chaning parameters and testing different types of models, see more at: https://powellgenomicslab.github.io/scPred/articles/introduction.html. But for now, we will continue with this model. Now, lets predict celltypes on our data, where scPred will align the two datasets with Harmony and then perform classification.\n\nctrl <- scPredict(ctrl, reference)\n\n● Matching reference with new dataset...\n ─ 2000 features present in reference loadings\n ─ 1782 features shared between reference and new dataset\n ─ 89.1% of features in the reference are present in new dataset\n● Aligning new data to reference...\n● Classifying cells...\nDONE!\n\n\n\nDimPlot(ctrl, group.by = \"scpred_prediction\", label = T, repel = T) + NoAxes()\n\n\n\n\n\n\n\n\nNow plot how many cells of each celltypes can be found in each cluster.\n\nggplot(ctrl@meta.data, aes(x = CCA_snn_res.0.5, fill = scpred_prediction)) +\n geom_bar() +\n theme_classic()" + "text": "4 scPred\nscPred will train a classifier based on all principal components. 
First, getFeatureSpace will create a scPred object stored in the @misc slot where it extracts the PCs that best separates the different celltypes. Then trainModel will do the actual training for each celltype.\n\nreference <- getFeatureSpace(reference, \"cell_type\")\n\n● Extracting feature space for each cell type...\nDONE!\n\nreference <- trainModel(reference)\n\n● Training models for each cell type...\nmaximum number of iterations reached 0.0001152056 -0.0001143117DONE!\n\n\nWe can then print how well the training worked for the different celltypes by printing the number of PCs used for each, the ROC value and Sensitivity/Specificity. Which celltypes do you think are harder to classify based on this dataset?\n\nget_scpred(reference)\n\n'scPred' object\n✔ Prediction variable = cell_type \n✔ Discriminant features per cell type\n✔ Training model(s)\nSummary\n\n|Cell type | n| Features|Method | ROC| Sens| Spec|\n|:-----------|----:|--------:|:---------|-----:|-----:|-----:|\n|B cell | 280| 50|svmRadial | 1.000| 0.964| 1.000|\n|CD4 T cell | 1620| 50|svmRadial | 0.997| 0.972| 0.975|\n|CD8 T cell | 945| 50|svmRadial | 0.985| 0.899| 0.978|\n|cDC | 26| 50|svmRadial | 0.995| 0.547| 1.000|\n|cMono | 212| 50|svmRadial | 0.994| 0.958| 0.970|\n|ncMono | 79| 50|svmRadial | 0.998| 0.570| 1.000|\n|NK cell | 312| 50|svmRadial | 0.999| 0.933| 0.996|\n|pDC | 20| 50|svmRadial | 1.000| 0.700| 1.000|\n|Plasma cell | 6| 50|svmRadial | 1.000| 0.800| 1.000|\n\n\nYou can optimize parameters for each dataset by chaning parameters and testing different types of models, see more at: https://powellgenomicslab.github.io/scPred/articles/introduction.html. But for now, we will continue with this model. 
Now, lets predict celltypes on our data, where scPred will align the two datasets with Harmony and then perform classification.\n\nctrl <- scPredict(ctrl, reference)\n\n● Matching reference with new dataset...\n ─ 2000 features present in reference loadings\n ─ 1783 features shared between reference and new dataset\n ─ 89.15% of features in the reference are present in new dataset\n● Aligning new data to reference...\n● Classifying cells...\nDONE!\n\n\n\nDimPlot(ctrl, group.by = \"scpred_prediction\", label = T, repel = T) + NoAxes()\n\n\n\n\n\n\n\n\nNow plot how many cells of each celltypes can be found in each cluster.\n\nggplot(ctrl@meta.data, aes(x = CCA_snn_res.0.5, fill = scpred_prediction)) +\n geom_bar() +\n theme_classic()" }, { "objectID": "labs/seurat/seurat_06_celltyping.html#meta-ct_compare", @@ -389,7 +389,7 @@ "href": "labs/seurat/seurat_06_celltyping.html#meta-ct_gsea", "title": " Celltype prediction", "section": "6 GSEA with celltype markers", - "text": "6 GSEA with celltype markers\nAnother option, where celltype can be classified on cluster level is to use gene set enrichment among the DEGs with known markers for different celltypes. Similar to how we did functional enrichment for the DEGs in the Differential expression exercise. There are some resources for celltype gene sets that can be used. Such as CellMarker, PanglaoDB or celltype gene sets at MSigDB. We can also look at overlap between DEGs in a reference dataset and the dataset you are analysing.\n\n6.1 DEG overlap\nFirst, lets extract top DEGs for our Covid-19 dataset and the reference dataset. 
When we run differential expression for our dataset, we want to report as many genes as possible, hence we set the cutoffs quite lenient.\n\n# run differential expression in our dataset, using clustering at resolution 0.5\nalldata <- SetIdent(alldata, value = \"CCA_snn_res.0.5\")\nDGE_table <- FindAllMarkers(\n alldata,\n logfc.threshold = 0,\n test.use = \"wilcox\",\n min.pct = 0.1,\n min.diff.pct = 0,\n only.pos = TRUE,\n max.cells.per.ident = 20,\n return.thresh = 1,\n assay = \"RNA\"\n)\n\n# split into a list\nDGE_list <- split(DGE_table, DGE_table$cluster)\n\nunlist(lapply(DGE_list, nrow))\n\n 0 1 2 3 4 5 6 7 8 \n3349 4118 3271 2504 2061 2581 2426 3487 2355 \n\n\n\n# Compute differential gene expression in reference dataset (that has cell annotation)\nreference <- SetIdent(reference, value = \"cell_type\")\nreference_markers <- FindAllMarkers(\n reference,\n min.pct = .1,\n min.diff.pct = .2,\n only.pos = T,\n max.cells.per.ident = 20,\n return.thresh = 1\n)\n\n# Identify the top cell marker genes in reference dataset\n# select top 50 with hihgest foldchange among top 100 signifcant genes.\nreference_markers <- reference_markers[order(reference_markers$avg_log2FC, decreasing = T), ]\nreference_markers %>%\n group_by(cluster) %>%\n top_n(-100, p_val) %>%\n top_n(50, avg_log2FC) -> top50_cell_selection\n\n# Transform the markers into a list\nref_list <- split(top50_cell_selection$gene, top50_cell_selection$cluster)\n\nunlist(lapply(ref_list, length))\n\n CD8 T cell CD4 T cell cMono B cell NK cell pDC \n 30 15 50 50 50 50 \n ncMono cDC Plasma cell \n 50 50 50 \n\n\nNow we can run GSEA for the DEGs from our dataset and check for enrichment of top DEGs in the reference dataset.\n\nsuppressPackageStartupMessages(library(fgsea))\n\n# run fgsea for each of the clusters in the list\nres <- lapply(DGE_list, function(x) {\n gene_rank <- setNames(x$avg_log2FC, x$gene)\n fgseaRes <- fgsea(pathways = ref_list, stats = gene_rank, nperm = 10000)\n 
return(fgseaRes)\n})\nnames(res) <- names(DGE_list)\n\n# You can filter and resort the table based on ES, NES or pvalue\nres <- lapply(res, function(x) {\n x[x$pval < 0.1, ]\n})\nres <- lapply(res, function(x) {\n x[x$size > 2, ]\n})\nres <- lapply(res, function(x) {\n x[order(x$NES, decreasing = T), ]\n})\nres\n\n$`0`\n pathway pval padj ES NES nMoreExtreme size\n1: cMono 0.00009999 0.000299970 0.9594422 2.067666 0 48\n2: ncMono 0.00009999 0.000299970 0.8385199 1.797428 0 43\n3: cDC 0.00009999 0.000299970 0.8394045 1.795307 0 41\n4: pDC 0.00180415 0.004059336 0.7492218 1.535717 17 21\n5: NK cell 0.02711970 0.048815461 0.7545862 1.436992 260 10\n6: B cell 0.06447382 0.096710725 0.6666689 1.329777 638 15\n leadingEdge\n1: S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...\n2: CTSS,TYMP,CST3,S100A11,AIF1,SERPINA1,...\n3: LYZ,GRN,TYMP,CST3,AIF1,LGALS2,...\n4: GRN,MS4A6A,CST3,MPEG1,CTSB,TGFBI,...\n5: TYROBP,FCER1G,SRGN,CCL3,MYO1F,ITGB2,...\n6: NCF1,LY86,MARCH1,POU2F2,HLA-DMB,HLA-DRB5,...\n\n$`1`\n pathway pval padj ES NES nMoreExtreme size\n1: NK cell 0.0000999900 0.0004007213 0.9459800 2.369826 0 48\n2: CD8 T cell 0.0001001803 0.0004007213 0.9230826 2.201075 0 25\n3: ncMono 0.0008014655 0.0016029311 0.9101411 1.755775 6 6\n4: pDC 0.0078939059 0.0126302494 0.7711439 1.640731 74 10\n5: Plasma cell 0.0007002101 0.0016029311 0.6711407 1.625909 6 30\n leadingEdge\n1: GNLY,GZMB,FGFBP2,PRF1,NKG7,SPON2,...\n2: GNLY,GZMB,FGFBP2,PRF1,NKG7,CTSW,...\n3: FCGR3A,IFITM2,RHOC\n4: GZMB,C12orf75,HSP90B1,ALOX5AP,PLAC8,RRBP1,...\n5: FKBP11,CD38,SDF2L1,PRDM1,PPIB,SLAMF7,...\n\n$`2`\n pathway pval padj ES NES nMoreExtreme size\n1: CD8 T cell 0.0001001101 0.0003503854 0.9406368 2.161149 0 29\n2: NK cell 0.0001000500 0.0003503854 0.8208967 1.898566 0 32\n3: CD4 T cell 0.0014347202 0.0033476805 0.8706473 1.681457 12 7\n4: Plasma cell 0.0744595677 0.1042433947 0.5638039 1.298014 743 30\n leadingEdge\n1: CD3D,CD8A,CD3G,CCL5,CD8B,GZMH,...\n2: CCL5,GZMA,CCL4,NKG7,GZMM,CST7,...\n3: 
CD3D,CD3G,CD3E,IL7R,PIK3IP1,TCF7\n4: FKBP11,PRDM1,PEBP1,PPIB,SEC11C,SUB1,...\n\n$`3`\n pathway pval padj ES NES nMoreExtreme size\n1: B cell 0.0000999900 0.0002706726 0.9072478 1.989836 0 46\n2: cDC 0.0001015022 0.0002706726 0.8950426 1.806256 0 14\n3: pDC 0.0001008878 0.0002706726 0.8292186 1.700818 0 17\n4: Plasma cell 0.0348925962 0.0558281540 0.7900880 1.456388 319 7\n leadingEdge\n1: CD79A,LINC00926,TCL1A,MS4A1,TNFRSF13C,CD79B,...\n2: CD74,HLA-DQB1,HLA-DRA,HLA-DPB1,HLA-DRB1,HLA-DQA1,...\n3: CD74,BCL11A,TCF4,IRF8,HERPUD1,TSPAN13,...\n4: PLPP5,ISG20,HERPUD1,MZB1,ITM2C\n\n$`4`\n pathway pval padj ES NES nMoreExtreme size\n1: CD4 T cell 0.0001015744 0.0003199659 0.9121965 1.771474 0 14\n2: CD8 T cell 0.0001066553 0.0003199659 0.9014219 1.638647 0 8\n leadingEdge\n1: IL7R,LTB,LDHB,MAL,RCAN3,NOSIP,...\n2: CD3D,IL32,CD3G,CD2,CD3E,CD8B\n\n$`5`\n pathway pval padj ES NES nMoreExtreme size\n1: B cell 0.07818977 0.2503001 0.8293407 1.392212 678 5\n2: pDC 0.04285714 0.2503001 0.7176562 1.385862 425 18\n3: ncMono 0.08343337 0.2503001 0.6474875 1.279034 833 28\n leadingEdge\n1: PDLIM1,HLA-DRB5,STX7\n2: PTCRA,TXN,C12orf75,CST3,CTSB,APP,...\n3: OAZ1,TIMP1,IFITM3,FKBP1A,CD68,CST3,...\n\n$`6`\n pathway pval padj ES NES nMoreExtreme size\n1: B cell 0.0000999900 0.0005417852 0.8919905 1.838712 0 45\n2: cDC 0.0002031694 0.0005417852 0.8894057 1.705469 1 14\n3: pDC 0.0002015316 0.0005417852 0.8313241 1.622237 1 17\n4: Plasma cell 0.0232629013 0.0281224853 0.7396460 1.418299 228 14\n leadingEdge\n1: CD79A,MS4A1,BANK1,HLA-DQA1,CD74,TNFRSF13C,...\n2: HLA-DQA1,CD74,HLA-DRA,HLA-DPB1,HLA-DQB1,HLA-DPA1,...\n3: CD74,JCHAIN,SPIB,TCF4,CCDC50,HERPUD1,...\n4: JCHAIN,HERPUD1,ISG20,PEBP1,MZB1,ITM2C\n\n$`7`\n pathway pval padj ES NES nMoreExtreme size\n1: ncMono 0.00009999 0.0002666667 0.9644737 2.033813 0 49\n2: cMono 0.00010000 0.0002666667 0.8854337 1.838288 0 36\n3: cDC 0.00009999 0.0002666667 0.8309648 1.730082 0 38\n4: NK cell 0.01025485 0.0205096964 0.7621593 1.478759 100 14\n5: pDC 
0.02631313 0.0421010019 0.7165790 1.398343 259 15\n6: B cell 0.05732420 0.0764322654 0.6694322 1.321810 568 17\n leadingEdge\n1: CDKN1C,LST1,FCGR3A,MS4A7,AIF1,COTL1,...\n2: LST1,AIF1,COTL1,SERPINA1,FCER1G,CST3,...\n3: LST1,AIF1,COTL1,FCER1G,CST3,SPI1,...\n4: FCGR3A,FCER1G,RHOC,TYROBP,IFITM2,CCL3,...\n5: CST3,NPC2,CTSB,PLD4,MPEG1,TGFBI,...\n6: HLA-DPA1,POU2F2,HLA-DRB5,HLA-DRB1,HLA-DRA,HLA-DPB1,...\n\n$`8`\n pathway pval padj ES NES nMoreExtreme size\n1: CD4 T cell 0.0001015744 0.0006094464 0.9413572 2.012718 0 14\n2: CD8 T cell 0.0234593838 0.0703781513 0.8283043 1.494351 200 5\n leadingEdge\n1: IL7R,TCF7,PIK3IP1,LTB,LEF1,TSHZ2,...\n2: CD3G,CD3D,CD3E,CD2\n\n\nSelecing top significant overlap per cluster, we can now rename the clusters according to the predicted labels. OBS! Be aware that if you have some clusters that have non-significant p-values for all the gene sets, the cluster label will not be very reliable. Also, the gene sets you are using may not cover all the celltypes you have in your dataset and hence predictions may just be the most similar celltype. 
Also, some of the clusters have very similar p-values to multiple celltypes, for instance the ncMono and cMono celltypes are equally good for some clusters.\n\nnew.cluster.ids <- unlist(lapply(res, function(x) {\n as.data.frame(x)[1, 1]\n}))\n\nalldata$ref_gsea <- new.cluster.ids[as.character(alldata@active.ident)]\n\nwrap_plots(\n DimPlot(alldata, label = T, group.by = \"CCA_snn_res.0.5\") + NoAxes(),\n DimPlot(alldata, label = T, group.by = \"ref_gsea\") + NoAxes(),\n ncol = 2\n)\n\n\n\n\n\n\n\n\nCompare to results with the other celltype prediction methods in the ctrl_13 sample.\n\nctrl$ref_gsea <- alldata$ref_gsea[alldata$orig.ident == \"ctrl_13\"]\n\nwrap_plots(\n DimPlot(ctrl, label = T, group.by = \"ref_gsea\") + NoAxes() + ggtitle(\"GSEA\"),\n DimPlot(ctrl, label = T, group.by = \"predicted.id\") + NoAxes() + ggtitle(\"LabelTransfer\"),\n DimPlot(ctrl, label = T, group.by = \"scpred_prediction\") + NoAxes() + ggtitle(\"scPred\"),\n ncol = 3\n) + plot_layout(guides = \"collect\")\n\n\n\n\n\n\n\n\n\n\n6.2 With annotated gene sets\nWe have dowloaded the celltype gene lists from http://bio-bigdata.hrbmu.edu.cn/CellMarker/CellMarker_download.html and converted the excel file to a csv for you. 
Read in the gene lists and do some filtering.\n\npath_file <- file.path(\"data/cell_marker_human.csv\")\nif (!file.exists(path_file)) download.file(file.path(path_data, \"cell_marker_human.csv\"), destfile = path_file)\n\n\n# Load the human marker table\nmarkers <- read.delim(\"data/cell_marker_human.csv\", sep = \";\")\nmarkers <- markers[markers$species == \"Human\", ]\nmarkers <- markers[markers$cancer_type == \"Normal\", ]\n\n# Filter by tissue (to reduce computational time and have tissue-specific classification)\nsort(unique(markers$tissue_type))\n\n [1] \"Abdomen\" \"Abdominal adipose tissue\" \n [3] \"Abdominal fat pad\" \"Acinus\" \n [5] \"Adipose tissue\" \"Adrenal gland\" \n [7] \"Adventitia\" \"Airway\" \n [9] \"Airway epithelium\" \"Allocortex\" \n [11] \"Alveolus\" \"Amniotic fluid\" \n [13] \"Amniotic membrane\" \"Ampullary\" \n [15] \"Anogenital tract\" \"Antecubital vein\" \n [17] \"Anterior cruciate ligament\" \"Anterior presomitic mesoderm\" \n [19] \"Aorta\" \"Aortic valve\" \n [21] \"Artery\" \"Arthrosis\" \n [23] \"Articular Cartilage\" \"Ascites\" \n [25] \"Atrium\" \"Auditory cortex\" \n [27] \"Basilar membrane\" \"Beige Fat\" \n [29] \"Bile duct\" \"Biliary tract\" \n [31] \"Bladder\" \"Blood\" \n [33] \"Blood vessel\" \"Bone\" \n [35] \"Bone marrow\" \"Brain\" \n [37] \"Breast\" \"Bronchial vessel\" \n [39] \"Bronchiole\" \"Bronchoalveolar lavage\" \n [41] \"Bronchoalveolar system\" \"Bronchus\" \n [43] \"Brown adipose tissue\" \"Calvaria\" \n [45] \"Capillary\" \"Cardiac atrium\" \n [47] \"Cardiovascular system\" \"Carotid artery\" \n [49] \"Carotid plaque\" \"Cartilage\" \n [51] \"Caudal cortex\" \"Caudal forebrain\" \n [53] \"Caudal ganglionic eminence\" \"Cavernosum\" \n [55] \"Central amygdala\" \"Central nervous system\" \n [57] \"Central Nervous System\" \"Cerebellum\" \n [59] \"Cerebral organoid\" \"Cerebrospinal fluid\" \n [61] \"Choriocapillaris\" \"Chorionic villi\" \n [63] \"Chorionic villus\" \"Choroid\" \n [65] \"Choroid 
plexus\" \"Colon\" \n [67] \"Colon epithelium\" \"Colorectum\" \n [69] \"Cornea\" \"Corneal endothelium\" \n [71] \"Corneal epithelium\" \"Coronary artery\" \n [73] \"Corpus callosum\" \"Corpus luteum\" \n [75] \"Cortex\" \"Cortical layer\" \n [77] \"Cortical thymus\" \"Decidua\" \n [79] \"Deciduous tooth\" \"Dental pulp\" \n [81] \"Dermis\" \"Diencephalon\" \n [83] \"Distal airway\" \"Dorsal forebrain\" \n [85] \"Dorsal root ganglion\" \"Dorsolateral prefrontal cortex\"\n [87] \"Ductal tissue\" \"Duodenum\" \n [89] \"Ectocervix\" \"Ectoderm\" \n [91] \"Embryo\" \"Embryoid body\" \n [93] \"Embryonic brain\" \"Embryonic heart\" \n [95] \"Embryonic Kidney\" \"Embryonic prefrontal cortex\" \n [97] \"Embryonic stem cell\" \"Endocardium\" \n [99] \"Endocrine\" \"Endoderm\" \n[101] \"Endometrium\" \"Endometrium stroma\" \n[103] \"Entorhinal cortex\" \"Epidermis\" \n[105] \"Epithelium\" \"Esophagus\" \n[107] \"Eye\" \"Fat pad\" \n[109] \"Fetal brain\" \"Fetal gonad\" \n[111] \"Fetal heart\" \"Fetal ileums\" \n[113] \"Fetal kidney\" \"Fetal Leydig\" \n[115] \"Fetal liver\" \"Fetal lung\" \n[117] \"Fetal pancreas\" \"Fetal thymus\" \n[119] \"Fetal umbilical cord\" \"Fetus\" \n[121] \"Foreskin\" \"Frontal cortex\" \n[123] \"Fundic gland\" \"Gall bladder\" \n[125] \"Gastric corpus\" \"Gastric epithelium\" \n[127] \"Gastric gland\" \"Gastrointestinal tract\" \n[129] \"Germ\" \"Gingiva\" \n[131] \"Gonad\" \"Gut\" \n[133] \"Hair follicle\" \"Heart\" \n[135] \"Heart muscle\" \"Hippocampus\" \n[137] \"Ileum\" \"Inferior colliculus\" \n[139] \"Interfollicular epidermis\" \"Intervertebral disc\" \n[141] \"Intestinal crypt\" \"Intestine\" \n[143] \"Intrahepatic cholangio\" \"Jejunum\" \n[145] \"Kidney\" \"Lacrimal gland\" \n[147] \"Large intestine\" \"Laryngeal squamous epithelium\" \n[149] \"Lateral ganglionic eminence\" \"Ligament\" \n[151] \"Limb bud\" \"Limbal epithelium\" \n[153] \"Liver\" \"Lumbar vertebra\" \n[155] \"Lung\" \"Lymph\" \n[157] \"Lymph node\" \"Lymphatic vessel\" 
\n[159] \"Lymphoid tissue\" \"Malignant pleural effusion\" \n[161] \"Mammary epithelium\" \"Mammary gland\" \n[163] \"Medial ganglionic eminence\" \"Medullary thymus\" \n[165] \"Meniscus\" \"Mesoblast\" \n[167] \"Mesoderm\" \"Microvascular endothelium\" \n[169] \"Microvessel\" \"Midbrain\" \n[171] \"Middle temporal gyrus\" \"Milk\" \n[173] \"Molar\" \"Muscle\" \n[175] \"Myenteric plexus\" \"Myocardium\" \n[177] \"Myometrium\" \"Nasal concha\" \n[179] \"Nasal epithelium\" \"Nasal mucosa\" \n[181] \"Nasal polyp\" \"Neocortex\" \n[183] \"Nerve\" \"Nose\" \n[185] \"Nucleus pulposus\" \"Olfactory neuroepithelium\" \n[187] \"Optic nerve\" \"Oral cavity\" \n[189] \"Oral mucosa\" \"Osteoarthritic cartilage\" \n[191] \"Ovarian cortex\" \"Ovarian follicle\" \n[193] \"Ovary\" \"Oviduct\" \n[195] \"Pancreas\" \"Pancreatic acinar tissue\" \n[197] \"Pancreatic duct\" \"Pancreatic islet\" \n[199] \"Periodontal ligament\" \"Periodontium\" \n[201] \"Periosteum\" \"Peripheral blood\" \n[203] \"Peritoneal fluid\" \"Peritoneum\" \n[205] \"Pituitary\" \"Placenta\" \n[207] \"Plasma\" \"Pluripotent stem cell\" \n[209] \"Polyp\" \"Posterior presomitic mesoderm\" \n[211] \"Prefrontal cortex\" \"Premolar\" \n[213] \"Presomitic mesoderm\" \"Primitive streak\" \n[215] \"Prostate\" \"Pulmonary arteriy\" \n[217] \"Pyloric gland\" \"Rectum\" \n[219] \"Renal glomerulus\" \"Respiratory tract\" \n[221] \"Retina\" \"Retinal organoid\" \n[223] \"Retinal pigment epithelium\" \"Right ventricle\" \n[225] \"Saliva\" \"Salivary gland\" \n[227] \"Scalp\" \"Sclerocorneal tissue\" \n[229] \"Seminal plasma\" \"Septum transversum\" \n[231] \"Serum\" \"Sinonasal mucosa\" \n[233] \"Sinus tissue\" \"Skeletal muscle\" \n[235] \"Skin\" \"Small intestinal crypt\" \n[237] \"Small intestine\" \"Soft tissue\" \n[239] \"Sperm\" \"Spinal cord\" \n[241] \"Spleen\" \"Splenic red pulp\" \n[243] \"Sputum\" \"Stomach\" \n[245] \"Subcutaneous adipose tissue\" \"Submandibular gland\" \n[247] \"Subpallium\" \"Subplate\" \n[249] 
\"Subventricular zone\" \"Superior frontal gyrus\" \n[251] \"Sympathetic ganglion\" \"Synovial fluid\" \n[253] \"Synovium\" \"Taste bud\" \n[255] \"Tendon\" \"Testis\" \n[257] \"Thalamus\" \"Thymus\" \n[259] \"Thyroid\" \"Tonsil\" \n[261] \"Tooth\" \"Trachea\" \n[263] \"Tracheal airway epithelium\" \"Transformed artery\" \n[265] \"Trophoblast\" \"Umbilical cord\" \n[267] \"Umbilical cord blood\" \"Umbilical vein\" \n[269] \"Undefined\" \"Urine\" \n[271] \"Urothelium\" \"Uterine cervix\" \n[273] \"Uterus\" \"Vagina\" \n[275] \"Vein\" \"Venous blood\" \n[277] \"Ventral thalamus\" \"Ventricle\" \n[279] \"Ventricular and atrial\" \"Ventricular zone\" \n[281] \"Visceral adipose tissue\" \"Vocal fold\" \n[283] \"Whartons jelly\" \"White adipose tissue\" \n[285] \"White matter\" \"Yolk sac\" \n\ngrep(\"blood\", unique(markers$tissue_type), value = T)\n\n[1] \"Peripheral blood\" \"Umbilical cord blood\" \"Venous blood\" \n\nmarkers <- markers[markers$tissue_type %in% c(\n \"Blood\", \"Venous blood\",\n \"Serum\", \"Plasma\",\n \"Spleen\", \"Bone marrow\", \"Lymph node\"\n), ]\n\n# remove strange characters etc.\ncelltype_list <- lapply(unique(markers$cell_name), function(x) {\n x <- paste(markers$Symbol[markers$cell_name == x], sep = \",\")\n x <- gsub(\"[[]|[]]| |-\", \",\", x)\n x <- unlist(strsplit(x, split = \",\"))\n x <- unique(x[!x %in% c(\"\", \"NA\", \"family\")])\n x <- casefold(x, upper = T)\n})\nnames(celltype_list) <- unique(markers$cell_name)\n\ncelltype_list <- celltype_list[unlist(lapply(celltype_list, length)) < 100]\ncelltype_list <- celltype_list[unlist(lapply(celltype_list, length)) > 5]\n\n\n# run fgsea for each of the clusters in the list\nres <- lapply(DGE_list, function(x) {\n gene_rank <- setNames(x$avg_log2FC, x$gene)\n fgseaRes <- fgsea(pathways = celltype_list, stats = gene_rank, nperm = 10000, scoreType = \"pos\")\n return(fgseaRes)\n})\nnames(res) <- names(DGE_list)\n\n# You can filter and resort the table based on ES, NES or pvalue\nres <- 
lapply(res, function(x) {\n x[x$pval < 0.01, ]\n})\nres <- lapply(res, function(x) {\n x[x$size > 5, ]\n})\nres <- lapply(res, function(x) {\n x[order(x$NES, decreasing = T), ]\n})\n\n# show top 3 for each cluster.\nlapply(res, head, 3)\n\n$`0`\n pathway pval padj ES NES nMoreExtreme size\n1: Neutrophil 9.999e-05 0.002274773 0.8596747 1.768180 0 22\n2: Monocyte 9.999e-05 0.002274773 0.8152552 1.737522 0 40\n3: Eosinophil 9.999e-05 0.002274773 0.8683785 1.723714 0 13\n leadingEdge\n1: S100A8,S100A9,CD14,CSF3R,S100A6,PLAUR,...\n2: S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...\n3: S100A8,S100A9,LYZ,RETN,CSF3R,ICAM1,...\n\n$`1`\n pathway pval\n1: Natural killer cell 9.999e-05\n2: Finally highly effector (TEMRA) memory T cell 9.999e-05\n3: CD4+ recently activated effector memory or effector T cell (CTL) 9.999e-05\n padj ES NES nMoreExtreme size\n1: 0.00076356 0.9028054 2.222363 0 38\n2: 0.00076356 0.9919728 2.101706 0 7\n3: 0.00076356 0.9500974 2.081966 0 10\n leadingEdge\n1: GNLY,GZMB,FGFBP2,PRF1,NKG7,SPON2,...\n2: GNLY,NKG7,CST7,GZMA,GZMH,CCL5,...\n3: GNLY,PRF1,NKG7,CTSW,GZMH,S1PR5,...\n\n$`2`\n pathway pval padj ES NES nMoreExtreme size\n1: CD8+ T cell 9.999e-05 0.00049995 0.9582694 2.068813 0 12\n2: CD4+ T cell 9.999e-05 0.00049995 0.9437698 2.022236 0 11\n3: T cell 9.999e-05 0.00049995 0.8614252 2.005057 0 32\n leadingEdge\n1: CD3D,CD8A,CD8B,TRGC2,TRAC,CD3E,...\n2: CD3D,CD8A,TRAC,CD3E,IL32,IL7R,...\n3: GZMK,CD3D,CD8A,CD3G,CD8B,GZMH,...\n\n$`3`\n pathway pval padj ES NES nMoreExtreme size\n1: B cell 9.999e-05 0.001849815 0.9003237 1.923899 0 29\n2: Naive B cell 9.999e-05 0.001849815 0.9502942 1.919618 0 13\n3: Follicular B cell 9.999e-05 0.001849815 0.9530874 1.889348 0 10\n leadingEdge\n1: IGHM,IGHD,CD79A,IGKC,TCL1A,MS4A1,...\n2: IGHM,IGHD,TCL1A,MS4A1,FCER2,YBX3,...\n3: IGHD,CD79A,TCL1A,MS4A1,CD79B,CD74,...\n\n$`4`\n pathway pval padj ES NES nMoreExtreme\n1: Naive T(Th0) cell 0.00009999 0.004099590 0.9024674 1.739839 0\n2: CD8+ T cell 0.00029997 0.006149385 0.8842482 
1.716703 2\n3: B cell 0.00029997 0.006149385 0.8703159 1.677856 2\n size leadingEdge\n1: 11 IL7R,CD3D,IL32,TCF7,CD3E,NPM1,...\n2: 12 IL7R,TRAC,CD3D,IL32,CD28,CD3E,...\n3: 11 IL7R,CD5,CD28,JUNB,CD27,BCL2,...\n\n$`5`\n pathway pval padj ES NES nMoreExtreme size\n1: Megakaryocyte 0.00009999 0.00649935 0.9892473 1.865505 0 11\n2: Plasmablast 0.00209979 0.03412159 0.9327396 1.721114 20 6\n3: Memory B cell 0.00959904 0.11482185 0.8912334 1.644526 95 6\n leadingEdge\n1: PPBP,PF4,NRGN,MYL9,GNG11,GP9,...\n2: IGHA1,IGLC2,TUBA1B,IGKC,GAPDH\n3: IGHA1,KLF10\n\n$`6`\n pathway pval padj ES NES nMoreExtreme\n1: B cell 0.00009999 0.002533080 0.8810749 1.800871 0\n2: Plasma cell 0.00009999 0.002533080 0.9278632 1.783346 0\n3: Marginal zone B cell 0.00029997 0.004559544 0.9495868 1.747841 2\n size leadingEdge\n1: 37 IGKC,CD79A,MS4A1,BANK1,IGHM,CD74,...\n2: 13 IGKC,CD79A,IGLC3,IGLC2,IGHM,IGHG1,...\n3: 6 CD79A,MS4A1,CD79B,TNFRSF13B,CD19,CD27\n\n$`7`\n pathway pval padj ES NES nMoreExtreme size\n1: CD16+ monocyte 0.00029997 0.0089991 0.9435730 1.753112 2 6\n2: Monocyte 0.00009999 0.0089991 0.8135685 1.662756 0 27\n3: Macrophage 0.00019998 0.0089991 0.8087559 1.641009 1 24\n leadingEdge\n1: FCGR3A,TCF7L2,HES4,LYN,MTSS1\n2: LST1,FCGR3A,MS4A7,CST3,PECAM1,CD68,...\n3: FCGR3A,MS4A7,FCER1G,CD68,FTL,C1QA,...\n\n$`8`\n pathway pval padj ES NES\n1: Central memory CD8+ T cell 9.999e-05 0.00109989 0.9139722 1.941589\n2: Naive CD8+ T cell 9.999e-05 0.00109989 0.8685820 1.893681\n3: Naive CD8 T cell 9.999e-05 0.00109989 0.9024094 1.879507\n nMoreExtreme size leadingEdge\n1: 0 12 CCR7,IL7R,TCF7,LEF1,TSHZ2,RCAN3,...\n2: 0 16 CCR7,TCF7,LEF1,TSHZ2,RCAN3,MAL,...\n3: 0 10 CCR7,TCF7,LEF1,TSHZ2,RCAN3,MAL,...\n\n\n#CT_GSEA8:\n\nnew.cluster.ids <- unlist(lapply(res, function(x) {\n as.data.frame(x)[1, 1]\n}))\nalldata$cellmarker_gsea <- new.cluster.ids[as.character(alldata@active.ident)]\n\nwrap_plots(\n DimPlot(alldata, label = T, group.by = \"ref_gsea\") + NoAxes(),\n DimPlot(alldata, label = T, group.by = 
\"cellmarker_gsea\") + NoAxes(),\n ncol = 2\n)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nDo you think that the methods overlap well? Where do you see the most inconsistencies?\n\n\nIn this case we do not have any ground truth, and we cannot say which method performs best. You should keep in mind, that any celltype classification method is just a prediction, and you still need to use your common sense and knowledge of the biological system to judge if the results make sense.\nFinally, lets save the data with predictions.\n\nsaveRDS(ctrl, \"data/covid/results/seurat_covid_qc_dr_int_cl_ct-ctrl13.rds\")" + "text": "6 GSEA with celltype markers\nAnother option, where celltype can be classified on cluster level is to use gene set enrichment among the DEGs with known markers for different celltypes. Similar to how we did functional enrichment for the DEGs in the Differential expression exercise. There are some resources for celltype gene sets that can be used. Such as CellMarker, PanglaoDB or celltype gene sets at MSigDB. We can also look at overlap between DEGs in a reference dataset and the dataset you are analysing.\n\n6.1 DEG overlap\nFirst, lets extract top DEGs for our Covid-19 dataset and the reference dataset. 
When we run differential expression for our dataset, we want to report as many genes as possible, hence we set the cutoffs quite lenient.\n\n# run differential expression in our dataset, using clustering at resolution 0.5\nalldata <- SetIdent(alldata, value = \"CCA_snn_res.0.5\")\nDGE_table <- FindAllMarkers(\n alldata,\n logfc.threshold = 0,\n test.use = \"wilcox\",\n min.pct = 0.1,\n min.diff.pct = 0,\n only.pos = TRUE,\n max.cells.per.ident = 20,\n return.thresh = 1,\n assay = \"RNA\"\n)\n\n# split into a list\nDGE_list <- split(DGE_table, DGE_table$cluster)\n\nunlist(lapply(DGE_list, nrow))\n\n 0 1 2 3 4 5 6 7 8 \n3307 4102 3289 2478 2017 2522 2483 3513 2298 \n\n\n\n# Compute differential gene expression in reference dataset (that has cell annotation)\nreference <- SetIdent(reference, value = \"cell_type\")\nreference_markers <- FindAllMarkers(\n reference,\n min.pct = .1,\n min.diff.pct = .2,\n only.pos = T,\n max.cells.per.ident = 20,\n return.thresh = 1\n)\n\n# Identify the top cell marker genes in reference dataset\n# select top 50 with hihgest foldchange among top 100 signifcant genes.\nreference_markers <- reference_markers[order(reference_markers$avg_log2FC, decreasing = T), ]\nreference_markers %>%\n group_by(cluster) %>%\n top_n(-100, p_val) %>%\n top_n(50, avg_log2FC) -> top50_cell_selection\n\n# Transform the markers into a list\nref_list <- split(top50_cell_selection$gene, top50_cell_selection$cluster)\n\nunlist(lapply(ref_list, length))\n\n CD8 T cell CD4 T cell cMono B cell NK cell pDC \n 30 15 50 50 50 50 \n ncMono cDC Plasma cell \n 50 50 50 \n\n\nNow we can run GSEA for the DEGs from our dataset and check for enrichment of top DEGs in the reference dataset.\n\nsuppressPackageStartupMessages(library(fgsea))\n\n# run fgsea for each of the clusters in the list\nres <- lapply(DGE_list, function(x) {\n gene_rank <- setNames(x$avg_log2FC, x$gene)\n fgseaRes <- fgsea(pathways = ref_list, stats = gene_rank, nperm = 10000)\n 
return(fgseaRes)\n})\nnames(res) <- names(DGE_list)\n\n# You can filter and resort the table based on ES, NES or pvalue\nres <- lapply(res, function(x) {\n x[x$pval < 0.1, ]\n})\nres <- lapply(res, function(x) {\n x[x$size > 2, ]\n})\nres <- lapply(res, function(x) {\n x[order(x$NES, decreasing = T), ]\n})\nres\n\n$`0`\n pathway pval padj ES NES nMoreExtreme size\n1: cMono 0.000099990 0.000299970 0.9596774 2.060568 0 48\n2: cDC 0.000099990 0.000299970 0.8397658 1.790162 0 41\n3: ncMono 0.000099990 0.000299970 0.8371549 1.787799 0 43\n4: pDC 0.001203369 0.002707581 0.7436632 1.519896 11 21\n5: NK cell 0.029631165 0.053336096 0.7493872 1.424591 285 10\n6: B cell 0.059484067 0.089226100 0.6673689 1.326559 587 15\n leadingEdge\n1: S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...\n2: LYZ,GRN,TYMP,CST3,AIF1,LGALS2,...\n3: CTSS,TYMP,CST3,S100A11,AIF1,SERPINA1,...\n4: GRN,MS4A6A,CST3,MPEG1,CTSB,TGFBI,...\n5: TYROBP,FCER1G,SRGN,CCL3,MYO1F,ITGB2,...\n6: NCF1,LY86,MARCH1,POU2F2,HLA-DMB,HLA-DRB5,...\n\n$`1`\n pathway pval padj ES NES nMoreExtreme size\n1: NK cell 0.0000999900 0.0004014049 0.9481456 2.370524 0 48\n2: CD8 T cell 0.0001003512 0.0004014049 0.9281929 2.209379 0 25\n3: ncMono 0.0004561524 0.0012164063 0.9189116 1.778055 3 6\n4: pDC 0.0096296296 0.0154074074 0.7758141 1.647353 90 10\n5: Plasma cell 0.0013024747 0.0026049494 0.6721222 1.626799 12 30\n leadingEdge\n1: GNLY,GZMB,FGFBP2,PRF1,NKG7,SPON2,...\n2: GNLY,GZMB,FGFBP2,PRF1,NKG7,CTSW,...\n3: FCGR3A,IFITM2,RHOC\n4: GZMB,C12orf75,HSP90B1,ALOX5AP,PLAC8,RRBP1\n5: FKBP11,PRDM1,CD38,SDF2L1,PPIB,SLAMF7,...\n\n$`2`\n pathway pval padj ES NES nMoreExtreme size\n1: CD8 T cell 0.0001000801 0.0003502802 0.9332541 2.141759 0 29\n2: NK cell 0.0001000600 0.0003502802 0.8217604 1.895062 0 31\n3: CD4 T cell 0.0019867550 0.0046357616 0.8693316 1.678458 17 7\n4: Plasma cell 0.0817572301 0.1144601221 0.5564210 1.279938 816 30\n leadingEdge\n1: CD3D,CD8A,CD3G,CD8B,CCL5,GZMH,...\n2: CCL5,GZMA,CCL4,GZMM,NKG7,CST7,...\n3: 
CD3D,CD3G,CD3E,IL7R,PIK3IP1,TCF7\n4: FKBP11,PRDM1,PEBP1,PPIB,SEC11C,SUB1,...\n\n$`3`\n pathway pval padj ES NES nMoreExtreme size\n1: B cell 0.0000999900 0.0004060914 0.9070112 2.004342 0 46\n2: cDC 0.0001015228 0.0004060914 0.8951817 1.814950 0 14\n3: pDC 0.0004026170 0.0010736454 0.7937887 1.648175 3 18\n4: Plasma cell 0.0800554312 0.1280886899 0.7211352 1.360348 750 8\n leadingEdge\n1: CD79A,LINC00926,TCL1A,MS4A1,TNFRSF13C,CD79B,...\n2: CD74,HLA-DQB1,HLA-DRA,HLA-DRB1,HLA-DPB1,HLA-DQA1,...\n3: CD74,BCL11A,TCF4,IRF8,HERPUD1,TSPAN13,...\n4: PLPP5,ISG20,HERPUD1,MZB1,ITM2C,DERL3\n\n$`4`\n pathway pval padj ES NES nMoreExtreme size\n1: CD4 T cell 0.0001014610 0.0006087662 0.9093092 1.771592 0 14\n2: CD8 T cell 0.0007438104 0.0022314313 0.8911450 1.626159 6 8\n leadingEdge\n1: IL7R,LTB,LDHB,MAL,RCAN3,NOSIP,...\n2: CD3D,IL32,CD3G,CD2,CD3E,CD8B\n\n$`5`\n pathway pval padj ES NES nMoreExtreme size\n1: pDC 0.03817025 0.2740774 0.7278455 1.398044 377 17\n2: ncMono 0.06090609 0.2740774 0.6549736 1.297265 608 30\n leadingEdge\n1: PTCRA,TXN,C12orf75,CST3,APP,CTSB,...\n2: OAZ1,TIMP1,IFITM3,FKBP1A,CD68,CST3,...\n\n$`6`\n pathway pval padj ES NES nMoreExtreme size\n1: B cell 0.0000999900 0.0004067521 0.8905126 1.833414 0 45\n2: cDC 0.0001016880 0.0004067521 0.8877832 1.700394 0 14\n3: pDC 0.0003024803 0.0008066142 0.8341772 1.624004 2 17\n4: Plasma cell 0.0277580792 0.0341582581 0.7291313 1.403876 273 15\n leadingEdge\n1: CD79A,MS4A1,BANK1,HLA-DQA1,CD74,TNFRSF13C,...\n2: HLA-DQA1,CD74,HLA-DRA,HLA-DPB1,HLA-DQB1,HLA-DPA1,...\n3: CD74,JCHAIN,SPIB,TCF4,HERPUD1,CCDC50,...\n4: JCHAIN,HERPUD1,ISG20,ITM2C,PEBP1,MZB1\n\n$`7`\n pathway pval padj ES NES nMoreExtreme size\n1: ncMono 0.00009999 0.0002667734 0.9653377 2.038133 0 49\n2: cMono 0.00010004 0.0002667734 0.8842478 1.838505 0 36\n3: cDC 0.00010002 0.0002667734 0.8287084 1.729668 0 38\n4: NK cell 0.00700721 0.0140144206 0.7660007 1.492316 68 14\n5: pDC 0.02330058 0.0372809239 0.7210229 1.413360 229 15\n6: B cell 0.05925627 
0.0790083644 0.6660721 1.322621 587 17\n leadingEdge\n1: CDKN1C,LST1,FCGR3A,MS4A7,AIF1,COTL1,...\n2: LST1,AIF1,COTL1,SERPINA1,FCER1G,CST3,...\n3: LST1,AIF1,COTL1,FCER1G,CST3,SPI1,...\n4: FCGR3A,FCER1G,RHOC,TYROBP,IFITM2,CCL3,...\n5: CST3,NPC2,CTSB,PLD4,MPEG1,TGFBI,...\n6: HLA-DPA1,POU2F2,HLA-DRB5,HLA-DRB1,HLA-DRA,HLA-DPB1,...\n\n$`8`\n pathway pval padj ES NES nMoreExtreme size\n1: CD4 T cell 0.0001015435 0.0006092608 0.9399826 2.027358 0 14\n2: CD8 T cell 0.0253356684 0.0760070053 0.8168162 1.475031 216 5\n leadingEdge\n1: TCF7,IL7R,PIK3IP1,LEF1,LTB,TSHZ2,...\n2: CD3G,CD3D,CD3E,CD2\n\n\nSelecing top significant overlap per cluster, we can now rename the clusters according to the predicted labels. OBS! Be aware that if you have some clusters that have non-significant p-values for all the gene sets, the cluster label will not be very reliable. Also, the gene sets you are using may not cover all the celltypes you have in your dataset and hence predictions may just be the most similar celltype. 
Also, some of the clusters have very similar p-values to multiple celltypes, for instance the ncMono and cMono celltypes are equally good for some clusters.\n\nnew.cluster.ids <- unlist(lapply(res, function(x) {\n as.data.frame(x)[1, 1]\n}))\n\nalldata$ref_gsea <- new.cluster.ids[as.character(alldata@active.ident)]\n\nwrap_plots(\n DimPlot(alldata, label = T, group.by = \"CCA_snn_res.0.5\") + NoAxes(),\n DimPlot(alldata, label = T, group.by = \"ref_gsea\") + NoAxes(),\n ncol = 2\n)\n\n\n\n\n\n\n\n\nCompare to results with the other celltype prediction methods in the ctrl_13 sample.\n\nctrl$ref_gsea <- alldata$ref_gsea[alldata$orig.ident == \"ctrl_13\"]\n\nwrap_plots(\n DimPlot(ctrl, label = T, group.by = \"ref_gsea\") + NoAxes() + ggtitle(\"GSEA\"),\n DimPlot(ctrl, label = T, group.by = \"predicted.id\") + NoAxes() + ggtitle(\"LabelTransfer\"),\n DimPlot(ctrl, label = T, group.by = \"scpred_prediction\") + NoAxes() + ggtitle(\"scPred\"),\n ncol = 3\n) + plot_layout(guides = \"collect\")\n\n\n\n\n\n\n\n\n\n\n6.2 With annotated gene sets\nWe have dowloaded the celltype gene lists from http://bio-bigdata.hrbmu.edu.cn/CellMarker/CellMarker_download.html and converted the excel file to a csv for you. 
Read in the gene lists and do some filtering.\n\npath_file <- file.path(\"data/cell_marker_human.csv\")\nif (!file.exists(path_file)) download.file(file.path(path_data, \"cell_marker_human.csv\"), destfile = path_file)\n\n\n# Load the human marker table\nmarkers <- read.delim(\"data/cell_marker_human.csv\", sep = \";\")\nmarkers <- markers[markers$species == \"Human\", ]\nmarkers <- markers[markers$cancer_type == \"Normal\", ]\n\n# Filter by tissue (to reduce computational time and have tissue-specific classification)\nsort(unique(markers$tissue_type))\n\n [1] \"Abdomen\" \"Abdominal adipose tissue\" \n [3] \"Abdominal fat pad\" \"Acinus\" \n [5] \"Adipose tissue\" \"Adrenal gland\" \n [7] \"Adventitia\" \"Airway\" \n [9] \"Airway epithelium\" \"Allocortex\" \n [11] \"Alveolus\" \"Amniotic fluid\" \n [13] \"Amniotic membrane\" \"Ampullary\" \n [15] \"Anogenital tract\" \"Antecubital vein\" \n [17] \"Anterior cruciate ligament\" \"Anterior presomitic mesoderm\" \n [19] \"Aorta\" \"Aortic valve\" \n [21] \"Artery\" \"Arthrosis\" \n [23] \"Articular Cartilage\" \"Ascites\" \n [25] \"Atrium\" \"Auditory cortex\" \n [27] \"Basilar membrane\" \"Beige Fat\" \n [29] \"Bile duct\" \"Biliary tract\" \n [31] \"Bladder\" \"Blood\" \n [33] \"Blood vessel\" \"Bone\" \n [35] \"Bone marrow\" \"Brain\" \n [37] \"Breast\" \"Bronchial vessel\" \n [39] \"Bronchiole\" \"Bronchoalveolar lavage\" \n [41] \"Bronchoalveolar system\" \"Bronchus\" \n [43] \"Brown adipose tissue\" \"Calvaria\" \n [45] \"Capillary\" \"Cardiac atrium\" \n [47] \"Cardiovascular system\" \"Carotid artery\" \n [49] \"Carotid plaque\" \"Cartilage\" \n [51] \"Caudal cortex\" \"Caudal forebrain\" \n [53] \"Caudal ganglionic eminence\" \"Cavernosum\" \n [55] \"Central amygdala\" \"Central nervous system\" \n [57] \"Central Nervous System\" \"Cerebellum\" \n [59] \"Cerebral organoid\" \"Cerebrospinal fluid\" \n [61] \"Choriocapillaris\" \"Chorionic villi\" \n [63] \"Chorionic villus\" \"Choroid\" \n [65] \"Choroid 
plexus\" \"Colon\" \n [67] \"Colon epithelium\" \"Colorectum\" \n [69] \"Cornea\" \"Corneal endothelium\" \n [71] \"Corneal epithelium\" \"Coronary artery\" \n [73] \"Corpus callosum\" \"Corpus luteum\" \n [75] \"Cortex\" \"Cortical layer\" \n [77] \"Cortical thymus\" \"Decidua\" \n [79] \"Deciduous tooth\" \"Dental pulp\" \n [81] \"Dermis\" \"Diencephalon\" \n [83] \"Distal airway\" \"Dorsal forebrain\" \n [85] \"Dorsal root ganglion\" \"Dorsolateral prefrontal cortex\"\n [87] \"Ductal tissue\" \"Duodenum\" \n [89] \"Ectocervix\" \"Ectoderm\" \n [91] \"Embryo\" \"Embryoid body\" \n [93] \"Embryonic brain\" \"Embryonic heart\" \n [95] \"Embryonic Kidney\" \"Embryonic prefrontal cortex\" \n [97] \"Embryonic stem cell\" \"Endocardium\" \n [99] \"Endocrine\" \"Endoderm\" \n[101] \"Endometrium\" \"Endometrium stroma\" \n[103] \"Entorhinal cortex\" \"Epidermis\" \n[105] \"Epithelium\" \"Esophagus\" \n[107] \"Eye\" \"Fat pad\" \n[109] \"Fetal brain\" \"Fetal gonad\" \n[111] \"Fetal heart\" \"Fetal ileums\" \n[113] \"Fetal kidney\" \"Fetal Leydig\" \n[115] \"Fetal liver\" \"Fetal lung\" \n[117] \"Fetal pancreas\" \"Fetal thymus\" \n[119] \"Fetal umbilical cord\" \"Fetus\" \n[121] \"Foreskin\" \"Frontal cortex\" \n[123] \"Fundic gland\" \"Gall bladder\" \n[125] \"Gastric corpus\" \"Gastric epithelium\" \n[127] \"Gastric gland\" \"Gastrointestinal tract\" \n[129] \"Germ\" \"Gingiva\" \n[131] \"Gonad\" \"Gut\" \n[133] \"Hair follicle\" \"Heart\" \n[135] \"Heart muscle\" \"Hippocampus\" \n[137] \"Ileum\" \"Inferior colliculus\" \n[139] \"Interfollicular epidermis\" \"Intervertebral disc\" \n[141] \"Intestinal crypt\" \"Intestine\" \n[143] \"Intrahepatic cholangio\" \"Jejunum\" \n[145] \"Kidney\" \"Lacrimal gland\" \n[147] \"Large intestine\" \"Laryngeal squamous epithelium\" \n[149] \"Lateral ganglionic eminence\" \"Ligament\" \n[151] \"Limb bud\" \"Limbal epithelium\" \n[153] \"Liver\" \"Lumbar vertebra\" \n[155] \"Lung\" \"Lymph\" \n[157] \"Lymph node\" \"Lymphatic vessel\" 
\n[159] \"Lymphoid tissue\" \"Malignant pleural effusion\" \n[161] \"Mammary epithelium\" \"Mammary gland\" \n[163] \"Medial ganglionic eminence\" \"Medullary thymus\" \n[165] \"Meniscus\" \"Mesoblast\" \n[167] \"Mesoderm\" \"Microvascular endothelium\" \n[169] \"Microvessel\" \"Midbrain\" \n[171] \"Middle temporal gyrus\" \"Milk\" \n[173] \"Molar\" \"Muscle\" \n[175] \"Myenteric plexus\" \"Myocardium\" \n[177] \"Myometrium\" \"Nasal concha\" \n[179] \"Nasal epithelium\" \"Nasal mucosa\" \n[181] \"Nasal polyp\" \"Neocortex\" \n[183] \"Nerve\" \"Nose\" \n[185] \"Nucleus pulposus\" \"Olfactory neuroepithelium\" \n[187] \"Optic nerve\" \"Oral cavity\" \n[189] \"Oral mucosa\" \"Osteoarthritic cartilage\" \n[191] \"Ovarian cortex\" \"Ovarian follicle\" \n[193] \"Ovary\" \"Oviduct\" \n[195] \"Pancreas\" \"Pancreatic acinar tissue\" \n[197] \"Pancreatic duct\" \"Pancreatic islet\" \n[199] \"Periodontal ligament\" \"Periodontium\" \n[201] \"Periosteum\" \"Peripheral blood\" \n[203] \"Peritoneal fluid\" \"Peritoneum\" \n[205] \"Pituitary\" \"Placenta\" \n[207] \"Plasma\" \"Pluripotent stem cell\" \n[209] \"Polyp\" \"Posterior presomitic mesoderm\" \n[211] \"Prefrontal cortex\" \"Premolar\" \n[213] \"Presomitic mesoderm\" \"Primitive streak\" \n[215] \"Prostate\" \"Pulmonary arteriy\" \n[217] \"Pyloric gland\" \"Rectum\" \n[219] \"Renal glomerulus\" \"Respiratory tract\" \n[221] \"Retina\" \"Retinal organoid\" \n[223] \"Retinal pigment epithelium\" \"Right ventricle\" \n[225] \"Saliva\" \"Salivary gland\" \n[227] \"Scalp\" \"Sclerocorneal tissue\" \n[229] \"Seminal plasma\" \"Septum transversum\" \n[231] \"Serum\" \"Sinonasal mucosa\" \n[233] \"Sinus tissue\" \"Skeletal muscle\" \n[235] \"Skin\" \"Small intestinal crypt\" \n[237] \"Small intestine\" \"Soft tissue\" \n[239] \"Sperm\" \"Spinal cord\" \n[241] \"Spleen\" \"Splenic red pulp\" \n[243] \"Sputum\" \"Stomach\" \n[245] \"Subcutaneous adipose tissue\" \"Submandibular gland\" \n[247] \"Subpallium\" \"Subplate\" \n[249] 
\"Subventricular zone\" \"Superior frontal gyrus\" \n[251] \"Sympathetic ganglion\" \"Synovial fluid\" \n[253] \"Synovium\" \"Taste bud\" \n[255] \"Tendon\" \"Testis\" \n[257] \"Thalamus\" \"Thymus\" \n[259] \"Thyroid\" \"Tonsil\" \n[261] \"Tooth\" \"Trachea\" \n[263] \"Tracheal airway epithelium\" \"Transformed artery\" \n[265] \"Trophoblast\" \"Umbilical cord\" \n[267] \"Umbilical cord blood\" \"Umbilical vein\" \n[269] \"Undefined\" \"Urine\" \n[271] \"Urothelium\" \"Uterine cervix\" \n[273] \"Uterus\" \"Vagina\" \n[275] \"Vein\" \"Venous blood\" \n[277] \"Ventral thalamus\" \"Ventricle\" \n[279] \"Ventricular and atrial\" \"Ventricular zone\" \n[281] \"Visceral adipose tissue\" \"Vocal fold\" \n[283] \"Whartons jelly\" \"White adipose tissue\" \n[285] \"White matter\" \"Yolk sac\" \n\ngrep(\"blood\", unique(markers$tissue_type), value = T)\n\n[1] \"Peripheral blood\" \"Umbilical cord blood\" \"Venous blood\" \n\nmarkers <- markers[markers$tissue_type %in% c(\n \"Blood\", \"Venous blood\",\n \"Serum\", \"Plasma\",\n \"Spleen\", \"Bone marrow\", \"Lymph node\"\n), ]\n\n# remove strange characters etc.\ncelltype_list <- lapply(unique(markers$cell_name), function(x) {\n x <- paste(markers$Symbol[markers$cell_name == x], sep = \",\")\n x <- gsub(\"[[]|[]]| |-\", \",\", x)\n x <- unlist(strsplit(x, split = \",\"))\n x <- unique(x[!x %in% c(\"\", \"NA\", \"family\")])\n x <- casefold(x, upper = T)\n})\nnames(celltype_list) <- unique(markers$cell_name)\n\ncelltype_list <- celltype_list[unlist(lapply(celltype_list, length)) < 100]\ncelltype_list <- celltype_list[unlist(lapply(celltype_list, length)) > 5]\n\n\n# run fgsea for each of the clusters in the list\nres <- lapply(DGE_list, function(x) {\n gene_rank <- setNames(x$avg_log2FC, x$gene)\n fgseaRes <- fgsea(pathways = celltype_list, stats = gene_rank, nperm = 10000, scoreType = \"pos\")\n return(fgseaRes)\n})\nnames(res) <- names(DGE_list)\n\n# You can filter and resort the table based on ES, NES or pvalue\nres <- 
lapply(res, function(x) {\n x[x$pval < 0.01, ]\n})\nres <- lapply(res, function(x) {\n x[x$size > 5, ]\n})\nres <- lapply(res, function(x) {\n x[order(x$NES, decreasing = T), ]\n})\n\n# show top 3 for each cluster.\nlapply(res, head, 3)\n\n$`0`\n pathway pval padj ES NES nMoreExtreme size\n1: Neutrophil 9.999e-05 0.001819818 0.8582790 1.761773 0 22\n2: Monocyte 9.999e-05 0.001819818 0.8133774 1.731423 0 40\n3: Eosinophil 9.999e-05 0.001819818 0.8660814 1.711202 0 13\n leadingEdge\n1: S100A8,S100A9,CD14,CSF3R,S100A6,PLAUR,...\n2: S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...\n3: S100A8,S100A9,LYZ,RETN,CSF3R,ICAM1,...\n\n$`1`\n pathway pval\n1: Natural killer cell 9.999e-05\n2: Finally highly effector (TEMRA) memory T cell 9.999e-05\n3: CD4+ recently activated effector memory or effector T cell (CTL) 9.999e-05\n padj ES NES nMoreExtreme size\n1: 0.00075447 0.9007366 2.226831 0 38\n2: 0.00075447 0.9919414 2.108994 0 7\n3: 0.00075447 0.9555230 2.105707 0 10\n leadingEdge\n1: GNLY,GZMB,FGFBP2,PRF1,NKG7,SPON2,...\n2: GNLY,NKG7,CST7,GZMH,GZMA,CCL5,...\n3: GNLY,PRF1,NKG7,CTSW,GZMH,S1PR5,...\n\n$`2`\n pathway pval padj ES NES nMoreExtreme size\n1: CD8+ T cell 9.999e-05 0.0007082625 0.9475062 2.019496 0 12\n2: CD4+ T cell 9.999e-05 0.0007082625 0.9424177 2.008651 0 12\n3: T cell 9.999e-05 0.0007082625 0.8696672 2.003155 0 31\n leadingEdge\n1: CD3D,CD8A,CD8B,TRGC2,TRAC,CD3E,...\n2: CD3D,CD8A,TRAC,CD3E,IL32,IL7R,...\n3: GZMK,CD3D,CD8A,CD3G,CD8B,GZMH,...\n\n$`3`\n pathway pval padj ES NES nMoreExtreme size\n1: B cell 9.999e-05 0.001519848 0.8920923 1.925485 0 30\n2: Naive B cell 9.999e-05 0.001519848 0.9488513 1.922935 0 13\n3: Follicular B cell 9.999e-05 0.001519848 0.9542139 1.893081 0 10\n leadingEdge\n1: IGHM,IGHD,CD79A,IGKC,TCL1A,MS4A1,...\n2: IGHM,IGHD,TCL1A,MS4A1,FCER2,YBX3,...\n3: IGHD,CD79A,TCL1A,MS4A1,CD79B,CD74,...\n\n$`4`\n pathway pval padj ES NES nMoreExtreme\n1: Naive T(Th0) cell 0.00019998 0.004149585 0.8986344 1.735172 1\n2: CD8+ T cell 0.00019998 0.004149585 
0.8914387 1.730773 1\n3: B cell 0.00019998 0.004149585 0.8774922 1.703695 1\n size leadingEdge\n1: 11 IL7R,CD3D,IL32,TCF7,CD3E,NPM1,...\n2: 12 IL7R,TRAC,CD3D,IL32,CD28,CD3E,...\n3: 12 IL7R,CD5,CD28,JUNB,CD27,BCL2,...\n\n$`5`\n pathway pval padj ES NES nMoreExtreme size\n1: Megakaryocyte 0.00009999 0.00659934 0.9892774 1.873193 0 11\n2: Plasmablast 0.00129987 0.02144786 0.9319394 1.715182 12 6\n3: Platelet 0.00029997 0.00989901 0.8513473 1.650589 2 18\n leadingEdge\n1: PPBP,PF4,NRGN,MYL9,GNG11,GP9,...\n2: IGHA1,IGLC2,TUBA1B,IGKC,GAPDH\n3: PPBP,PF4,GP9,ITGA2B,CD9,CD151,...\n\n$`6`\n pathway pval padj ES NES nMoreExtreme\n1: B cell 0.00009999 0.00189981 0.8773325 1.783824 0\n2: Plasma cell 0.00009999 0.00189981 0.9270013 1.779601 0\n3: Marginal zone B cell 0.00049995 0.00633270 0.9487283 1.749577 4\n size leadingEdge\n1: 37 IGKC,CD79A,MS4A1,IGHM,BANK1,CD74,...\n2: 13 IGKC,CD79A,IGLC3,IGLC2,IGHM,IGHG1,...\n3: 6 CD79A,MS4A1,CD79B,TNFRSF13B,CD19,CD27\n\n$`7`\n pathway pval padj ES NES nMoreExtreme size\n1: CD16+ monocyte 0.00039996 0.01199880 0.9435439 1.753885 3 6\n2: Monocyte 0.00009999 0.00449955 0.8159420 1.667233 0 27\n3: Macrophage 0.00009999 0.00449955 0.8103755 1.643732 0 24\n leadingEdge\n1: FCGR3A,TCF7L2,HES4,LYN,MTSS1\n2: LST1,FCGR3A,MS4A7,CST3,PECAM1,CD68,...\n3: FCGR3A,MS4A7,FCER1G,CD68,FTL,C1QA,...\n\n$`8`\n pathway pval padj ES NES\n1: Central memory CD8+ T cell 9.999e-05 0.001028469 0.9137572 1.956574\n2: Naive CD8+ T cell 9.999e-05 0.001028469 0.8754154 1.916439\n3: Naive CD4 T cell 9.999e-05 0.001028469 0.9031989 1.901508\n nMoreExtreme size leadingEdge\n1: 0 12 CCR7,TCF7,IL7R,LEF1,TSHZ2,RCAN3,...\n2: 0 15 CCR7,TCF7,LEF1,TSHZ2,RCAN3,MAL,...\n3: 0 10 CCR7,TCF7,LEF1,LTB,TSHZ2,MAL,...\n\n\n#CT_GSEA8:\n\nnew.cluster.ids <- unlist(lapply(res, function(x) {\n as.data.frame(x)[1, 1]\n}))\nalldata$cellmarker_gsea <- new.cluster.ids[as.character(alldata@active.ident)]\n\nwrap_plots(\n DimPlot(alldata, label = T, group.by = \"ref_gsea\") + NoAxes(),\n 
DimPlot(alldata, label = T, group.by = \"cellmarker_gsea\") + NoAxes(),\n ncol = 2\n)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nDo you think that the methods overlap well? Where do you see the most inconsistencies?\n\n\nIn this case we do not have any ground truth, and we cannot say which method performs best. You should keep in mind, that any celltype classification method is just a prediction, and you still need to use your common sense and knowledge of the biological system to judge if the results make sense.\nFinally, lets save the data with predictions.\n\nsaveRDS(ctrl, \"data/covid/results/seurat_covid_qc_dr_int_cl_ct-ctrl13.rds\")" }, { "objectID": "labs/seurat/seurat_06_celltyping.html#meta-session", @@ -410,7 +410,7 @@ "href": "labs/seurat/seurat_07_trajectory.html#loading-libraries", "title": " Trajectory inference using Slingshot", "section": "1 Loading libraries", - "text": "1 Loading libraries\n\nsuppressPackageStartupMessages({\n library(Seurat)\n library(plotly)\n options(rgl.printRglwidget = TRUE)\n library(Matrix)\n library(sparseMatrixStats)\n library(slingshot)\n library(tradeSeq)\n library(patchwork)\n})\n\n# Define some color palette\npal <- c(scales::hue_pal()(8), RColorBrewer::brewer.pal(9, \"Set1\"), RColorBrewer::brewer.pal(8, \"Set2\"))\nset.seed(1)\npal <- rep(sample(pal, length(pal)), 200)\n\nNice function to easily draw a graph:\n\n# Add graph to the base R graphics plot\ndraw_graph <- function(layout, graph, lwd = 0.2, col = \"grey\") {\n res <- rep(x = 1:(length(graph@p) - 1), times = (graph@p[-1] - graph@p[-length(graph@p)]))\n segments(\n x0 = layout[graph@i + 1, 1], x1 = layout[res, 1],\n y0 = layout[graph@i + 1, 2], y1 = layout[res, 2], lwd = lwd, col = col\n )\n}" + "text": "1 Loading libraries\n\nsuppressPackageStartupMessages({\n library(Seurat)\n library(plotly)\n options(rgl.printRglwidget = TRUE)\n library(Matrix)\n library(sparseMatrixStats)\n library(slingshot)\n library(tradeSeq)\n library(patchwork)\n})\n\n# Define 
some color palette\npal <- c(scales::hue_pal()(8), RColorBrewer::brewer.pal(9, \"Set1\"), RColorBrewer::brewer.pal(8, \"Set2\"))\nset.seed(1)\npal <- rep(sample(pal, length(pal)), 200)\n\nNice function to easily draw a graph:\n\n# Add graph to the base R graphics plot\ndraw_graph <- function(layout, graph, lwd = 0.2, col = \"grey\") {\n res <- rep(x = 1:(length(graph@p) - 1), times = (graph@p[-1] - graph@p[-length(graph@p)]))\n segments(\n x0 = layout[graph@i + 1, 1], x1 = layout[res, 1],\n y0 = layout[graph@i + 1, 2], y1 = layout[res, 2], lwd = lwd, col = col\n )\n}" }, { "objectID": "labs/seurat/seurat_07_trajectory.html#preparing-data", @@ -424,28 +424,28 @@ "href": "labs/seurat/seurat_07_trajectory.html#reading-data", "title": " Trajectory inference using Slingshot", "section": "3 Reading data", - "text": "3 Reading data\nWe already have pre-computed and subsetted the dataset (with 6688 cells and 3585 genes) following the analysis steps in this course. We then saved the objects, so you can use common tools to open and start to work with them (either in R or Python).\n\nobj <- readRDS(\"data/trajectory/trajectory_seurat_filtered.rds\")\n\n# Calculate cluster centroids (for plotting the labels later)\nmm <- sparse.model.matrix(~ 0 + factor(obj$clusters_use))\ncolnames(mm) <- levels(factor(obj$clusters_use))\ncentroids2d <- as.matrix(t(t(obj@reductions$umap@cell.embeddings) %*% mm) / Matrix::colSums(mm))\n\nLets visualize which clusters we have in our dataset:\n\nvars <- c(\"batches\", \"dataset\", \"clusters_use\", \"Phase\")\npl <- list()\n\nfor (i in vars) {\n pl[[i]] <- DimPlot(obj, group.by = i, label = T) + theme_void() + NoLegend()\n}\nwrap_plots(pl)\n\n\n\n\n\n\n\n\nYou can check, for example how many cells are in each cluster:\n\ntable(obj$clusters)\n\n\n 1 2 5 6 7 8 9 11 12 13 14 15 16 17 18 19 20 21 22 23 \n128 71 90 160 147 120 160 130 132 78 90 150 140 76 141 90 98 149 90 10 \n 25 26 27 28 29 32 33 34 35 36 37 38 41 43 44 45 46 47 49 50 \n 56 154 98 
76 125 150 150 146 150 148 135 128 145 134 110 149 140 113 132 85 \n 52 53 54 55 57 58 59 60 61 \n126 129 57 129 147 127 118 120 101" + "text": "3 Reading data\nWe already have pre-computed and subsetted the dataset (with 6688 cells and 3585 genes) following the analysis steps in this course. We then saved the objects, so you can use common tools to open and start to work with them (either in R or Python).\n\nobj <- readRDS(\"data/trajectory/trajectory_seurat_filtered.rds\")\n\n# Calculate cluster centroids (for plotting the labels later)\nmm <- sparse.model.matrix(~ 0 + factor(obj$clusters_use))\ncolnames(mm) <- levels(factor(obj$clusters_use))\ncentroids2d <- as.matrix(t(t(obj@reductions$umap@cell.embeddings) %*% mm) / Matrix::colSums(mm))\n\nLets visualize which clusters we have in our dataset:\n\nvars <- c(\"batches\", \"dataset\", \"clusters_use\", \"Phase\")\npl <- list()\n\nfor (i in vars) {\n pl[[i]] <- DimPlot(obj, group.by = i, label = T) + theme_void() + NoLegend()\n}\nwrap_plots(pl)\n\n\n\n\n\n\n\n\nYou can check, for example how many cells are in each cluster:\n\ntable(obj$clusters)\n\n\n 1 2 5 6 7 8 9 11 12 13 14 15 16 17 18 19 20 21 22 23 \n128 71 90 160 147 120 160 130 132 78 90 150 140 76 141 90 98 149 90 10 \n 25 26 27 28 29 32 33 34 35 36 37 38 41 43 44 45 46 47 49 50 \n 56 154 98 76 125 150 150 146 150 148 135 128 145 134 110 149 140 113 132 85 \n 52 53 54 55 57 58 59 60 61 \n126 129 57 129 147 127 118 120 101" }, { "objectID": "labs/seurat/seurat_07_trajectory.html#exploring-the-data", "href": "labs/seurat/seurat_07_trajectory.html#exploring-the-data", "title": " Trajectory inference using Slingshot", "section": "4 Exploring the data", - "text": "4 Exploring the data\nIt is crucial that you performing analysis of a dataset understands what is going on, what are the clusters you see in your data and most importantly How are the clusters related to each other?. Well, let’s explore the data a bit. 
With the help of this table, write down which cluster numbers in your dataset express these key markers.\n\n\n\nMarker\nCell Type\n\n\n\n\nCd34\nHSC progenitor\n\n\nMs4a1\nB cell lineage\n\n\nCd3e\nT cell lineage\n\n\nLtf\nGranulocyte lineage\n\n\nCst3\nMonocyte lineage\n\n\nMcpt8\nMast Cell lineage\n\n\nAlas2\nRBC lineage\n\n\nSiglech\nDendritic cell lineage\n\n\nC1qc\nMacrophage cell lineage\n\n\nPf4\nMegakaryocyte cell lineage\n\n\n\n\nvars <- c(\"Cd34\", \"Ms4a1\", \"Cd3e\", \"Ltf\", \"Cst3\", \"Mcpt8\", \"Alas2\", \"Siglech\", \"C1qc\", \"Pf4\")\npl <- list()\n\npl <- list(DimPlot(obj, group.by = \"clusters_use\", label = T) + theme_void() + NoLegend())\nfor (i in vars) {\n pl[[i]] <- FeaturePlot(obj, features = i, order = T) + theme_void() + NoLegend()\n}\nwrap_plots(pl)\n\n\n\n\n\n\n\n\nAnother way to better explore your data is to look in higher dimensions, to really get a sense for what is right or wrong. As mentioned in the dimensionality reduction exercises, here we ran UMAP with 3 dimensions.\n\n\n\n\n\n\nImportant\n\n\n\nThe UMAP needs to be computed to results in exactly 3 dimensions\n\n\nSince the steps below are identical to both Seurat and Scran pipelines, we will extract the matrices from both, so it is clear what is being used where and to remove long lines of code used to get those matrices. We will use them all. 
Plot in 3D with Plotly:\n\ndf <- data.frame(obj@reductions$umap3d@cell.embeddings, variable = factor(obj$clusters_use))\ncolnames(df)[1:3] <- c(\"UMAP_1\", \"UMAP_2\", \"UMAP_3\")\np_State <- plot_ly(df, x = ~UMAP_1, y = ~UMAP_2, z = ~UMAP_3, color = ~variable, colors = pal, size = .5)\np_State\n\n\n\n\n\n\n# to save interactive plot and open in a new tab\ntry(htmlwidgets::saveWidget(p_State, selfcontained = T, \"umap_3d_clustering_plotly.html\"), silent = T)\nutils::browseURL(\"umap_3d_clustering_plotly.html\")\n\nWe can now compute the lineages on these dataset.\n\n# Define lineage ends\nENDS <- c(\"17\", \"27\", \"25\", \"16\", \"26\", \"53\", \"49\")\n\nset.seed(1)\nlineages <- as.SlingshotDataSet(getLineages(\n data = obj@reductions$umap3d@cell.embeddings,\n clusterLabels = obj$clusters_use,\n dist.method = \"mnn\", # It can be: \"simple\", \"scaled.full\", \"scaled.diag\", \"slingshot\" or \"mnn\"\n end.clus = ENDS, # You can also define the ENDS!\n start.clus = \"34\"\n)) # define where to START the trajectories\n\n\n# IF NEEDED, ONE CAN ALSO MANULALLY EDIT THE LINEAGES, FOR EXAMPLE:\n# sel <- sapply( lineages@lineages, function(x){rev(x)[1]} ) %in% ENDS\n# lineages@lineages <- lineages@lineages[ sel ]\n# names(lineages@lineages) <- paste0(\"Lineage\",1:length(lineages@lineages))\n# lineages\n\n\n# Change the reduction to our \"fixed\" UMAP2d (FOR VISUALISATION ONLY)\nlineages@reducedDim <- obj@reductions$umap@cell.embeddings\n\n{\n plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16)\n lines(lineages, lwd = 1, col = \"black\", cex = 2)\n text(centroids2d, labels = rownames(centroids2d), cex = 0.8, font = 2, col = \"white\")\n}\n\n\n\n\n\n\n\n\nMuch better!" + "text": "4 Exploring the data\nIt is crucial that you performing analysis of a dataset understands what is going on, what are the clusters you see in your data and most importantly How are the clusters related to each other?. Well, let’s explore the data a bit. 
With the help of this table, write down which cluster numbers in your dataset express these key markers.\n\n\n\nMarker\nCell Type\n\n\n\n\nCd34\nHSC progenitor\n\n\nMs4a1\nB cell lineage\n\n\nCd3e\nT cell lineage\n\n\nLtf\nGranulocyte lineage\n\n\nCst3\nMonocyte lineage\n\n\nMcpt8\nMast Cell lineage\n\n\nAlas2\nRBC lineage\n\n\nSiglech\nDendritic cell lineage\n\n\nC1qc\nMacrophage cell lineage\n\n\nPf4\nMegakaryocyte cell lineage\n\n\n\n\nvars <- c(\"Cd34\", \"Ms4a1\", \"Cd3e\", \"Ltf\", \"Cst3\", \"Mcpt8\", \"Alas2\", \"Siglech\", \"C1qc\", \"Pf4\")\npl <- list()\n\npl <- list(DimPlot(obj, group.by = \"clusters_use\", label = T) + theme_void() + NoLegend())\nfor (i in vars) {\n pl[[i]] <- FeaturePlot(obj, features = i, order = T) + theme_void() + NoLegend()\n}\nwrap_plots(pl)\n\n\n\n\n\n\n\n\nAnother way to better explore your data is to look in higher dimensions, to really get a sense for what is right or wrong. As mentioned in the dimensionality reduction exercises, here we ran UMAP with 3 dimensions.\n\n\n\n\n\n\nImportant\n\n\n\nThe UMAP needs to be computed to results in exactly 3 dimensions\n\n\nSince the steps below are identical to both Seurat and Scran pipelines, we will extract the matrices from both, so it is clear what is being used where and to remove long lines of code used to get those matrices. We will use them all. 
Plot in 3D with Plotly:\n\ndf <- data.frame(obj@reductions$umap3d@cell.embeddings, variable = factor(obj$clusters_use))\ncolnames(df)[1:3] <- c(\"UMAP_1\", \"UMAP_2\", \"UMAP_3\")\np_State <- plot_ly(df, x = ~UMAP_1, y = ~UMAP_2, z = ~UMAP_3, color = ~variable, colors = pal, size = .5)\np_State\n\n\n\n\n\n\n# to save interactive plot and open in a new tab\ntry(htmlwidgets::saveWidget(p_State, selfcontained = T, \"umap_3d_clustering_plotly.html\"), silent = T)\nutils::browseURL(\"umap_3d_clustering_plotly.html\")\n\nWe can now compute the lineages on these dataset.\n\n# Define lineage ends\nENDS <- c(\"17\", \"27\", \"25\", \"16\", \"26\", \"53\", \"49\")\n\nset.seed(1)\nlineages <- as.SlingshotDataSet(getLineages(\n data = obj@reductions$umap3d@cell.embeddings,\n clusterLabels = obj$clusters_use,\n dist.method = \"mnn\", # It can be: \"simple\", \"scaled.full\", \"scaled.diag\", \"slingshot\" or \"mnn\"\n end.clus = ENDS, # You can also define the ENDS!\n start.clus = \"34\"\n)) # define where to START the trajectories\n\n\n# IF NEEDED, ONE CAN ALSO MANULALLY EDIT THE LINEAGES, FOR EXAMPLE:\n# sel <- sapply( lineages@lineages, function(x){rev(x)[1]} ) %in% ENDS\n# lineages@lineages <- lineages@lineages[ sel ]\n# names(lineages@lineages) <- paste0(\"Lineage\",1:length(lineages@lineages))\n# lineages\n\n\n# Change the reduction to our \"fixed\" UMAP2d (FOR VISUALISATION ONLY)\nlineages@reducedDim <- obj@reductions$umap@cell.embeddings\n\n{\n plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16)\n lines(lineages, lwd = 1, col = \"black\", cex = 2)\n text(centroids2d, labels = rownames(centroids2d), cex = 0.8, font = 2, col = \"white\")\n}\n\n\n\n\n\n\n\n\nMuch better!" 
}, { "objectID": "labs/seurat/seurat_07_trajectory.html#defining-principal-curves", "href": "labs/seurat/seurat_07_trajectory.html#defining-principal-curves", "title": " Trajectory inference using Slingshot", "section": "5 Defining Principal Curves", - "text": "5 Defining Principal Curves\nOnce the clusters are connected, Slingshot allows you to transform them to a smooth trajectory using principal curves. This is an algorithm that iteratively changes an initial curve to better match the data points. It was developed for linear data. To apply it to single-cell data, slingshot adds two enhancements:\n\nIt will run principal curves for each ‘lineage’, which is a set of clusters that go from a defined start cluster to some end cluster\nLineages sharing a set of clusters will be constrained so that their principal curves remain bundled around the overlapping clusters\n\nSince the function getCurves() takes some time to run, we can speed up the convergence of the curve fitting process by reducing the number of cells to use in each lineage. Ideally you would use all cells, but here we set approx_points to 100 to speed things up. 
Feel free to adjust that for your dataset.\n\n# Define curves\ncurves <- as.SlingshotDataSet(getCurves(\n data = lineages,\n thresh = 1e-1,\n stretch = 1e-1,\n allow.breaks = F,\n approx_points = 100\n))\n\ncurves\n\nclass: SlingshotDataSet \n\n Samples Dimensions\n 5828 2\n\nlineages: 7 \nLineage1: 34 18 36 33 55 59 44 60 58 29 8 43 47 49 \nLineage2: 34 18 11 15 46 9 1 2 5 13 28 17 \nLineage3: 34 18 11 15 35 7 32 6 54 25 \nLineage4: 34 18 11 15 35 7 32 6 27 \nLineage5: 34 18 36 21 12 20 16 \nLineage6: 34 18 36 33 55 38 53 \nLineage7: 34 18 36 26 \n\ncurves: 7 \nCurve1: Length: 6.7241 Samples: 2161.05\nCurve2: Length: 7.3487 Samples: 2097.02\nCurve3: Length: 3.5349 Samples: 1502.25\nCurve4: Length: 2.5623 Samples: 1387.66\nCurve5: Length: 2.9268 Samples: 979.78\nCurve6: Length: 2.8976 Samples: 1086.34\nCurve7: Length: 2.1323 Samples: 644.86\n\n# Plots\n{\n plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], pch = 16)\n lines(curves, lwd = 2, col = \"black\")\n text(centroids2d, labels = rownames(centroids2d), cex = 1, font = 2)\n}\n\n\n\n\n\n\n\n\nWith those results in hand, we can now compute the differentiation pseudotime.\n\npseudotime <- slingPseudotime(curves, na = FALSE)\ncellWeights <- slingCurveWeights(curves)\n\nx <- rowMeans(pseudotime)\nx <- x / max(x)\no <- order(x)\n\n{\n plot(obj@reductions$umap@cell.embeddings[o, ],\n main = paste0(\"pseudotime\"), pch = 16, cex = 0.4, axes = F, xlab = \"\", ylab = \"\",\n col = colorRampPalette(c(\"grey70\", \"orange3\", \"firebrick\", \"purple4\"))(99)[x[o] * 98 + 1]\n )\n points(centroids2d, cex = 2.5, pch = 16, col = \"#FFFFFF99\")\n text(centroids2d, labels = rownames(centroids2d), cex = 1, font = 2)\n}\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCaution\n\n\n\nThe pseudotime represents the distance of every cell to the starting cluster!" + "text": "5 Defining Principal Curves\nOnce the clusters are connected, Slingshot allows you to transform them to a smooth trajectory using principal curves. 
This is an algorithm that iteratively changes an initial curve to better match the data points. It was developed for linear data. To apply it to single-cell data, slingshot adds two enhancements:\n\nIt will run principal curves for each ‘lineage’, which is a set of clusters that go from a defined start cluster to some end cluster\nLineages sharing a set of clusters will be constrained so that their principal curves remain bundled around the overlapping clusters\n\nSince the function getCurves() takes some time to run, we can speed up the convergence of the curve fitting process by reducing the number of cells to use in each lineage. Ideally you would use all cells, but here we set approx_points to 100 to speed things up. Feel free to adjust that for your dataset.\n\n# Define curves\ncurves <- as.SlingshotDataSet(getCurves(\n data = lineages,\n thresh = 1e-1,\n stretch = 1e-1,\n allow.breaks = F,\n approx_points = 100\n))\n\ncurves\n\nclass: SlingshotDataSet \n\n Samples Dimensions\n 5828 2\n\nlineages: 7 \nLineage1: 34 18 36 33 55 59 44 60 58 29 8 43 47 49 \nLineage2: 34 18 11 15 46 9 1 2 5 13 28 17 \nLineage3: 34 18 11 15 35 7 32 6 54 25 \nLineage4: 34 18 11 15 35 7 32 6 27 \nLineage5: 34 18 36 21 12 20 16 \nLineage6: 34 18 36 33 55 38 53 \nLineage7: 34 18 36 26 \n\ncurves: 7 \nCurve1: Length: 6.7241 Samples: 2161.05\nCurve2: Length: 7.3487 Samples: 2097.02\nCurve3: Length: 3.5349 Samples: 1502.25\nCurve4: Length: 2.5623 Samples: 1387.66\nCurve5: Length: 2.9268 Samples: 979.78\nCurve6: Length: 2.8976 Samples: 1086.34\nCurve7: Length: 2.1323 Samples: 644.86\n\n# Plots\n{\n plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], pch = 16)\n lines(curves, lwd = 2, col = \"black\")\n text(centroids2d, labels = rownames(centroids2d), cex = 1, font = 2)\n}\n\n\n\n\n\n\n\n\nWith those results in hand, we can now compute the differentiation pseudotime.\n\npseudotime <- slingPseudotime(curves, na = FALSE)\ncellWeights <- slingCurveWeights(curves)\n\nx <- 
rowMeans(pseudotime)\nx <- x / max(x)\no <- order(x)\n\n{\n plot(obj@reductions$umap@cell.embeddings[o, ],\n main = paste0(\"pseudotime\"), pch = 16, cex = 0.4, axes = F, xlab = \"\", ylab = \"\",\n col = colorRampPalette(c(\"grey70\", \"orange3\", \"firebrick\", \"purple4\"))(99)[x[o] * 98 + 1]\n )\n points(centroids2d, cex = 2.5, pch = 16, col = \"#FFFFFF99\")\n text(centroids2d, labels = rownames(centroids2d), cex = 1, font = 2)\n}\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCaution\n\n\n\nThe pseudotime represents the distance of every cell to the starting cluster!" }, { "objectID": "labs/seurat/seurat_07_trajectory.html#finding-differentially-expressed-genes", "href": "labs/seurat/seurat_07_trajectory.html#finding-differentially-expressed-genes", "title": " Trajectory inference using Slingshot", "section": "6 Finding differentially expressed genes", - "text": "6 Finding differentially expressed genes\nThe main way to interpret a trajectory is to find genes that change along the trajectory. There are many ways to define differential expression along a trajectory:\n\nExpression changes along a particular path (i.e. change with pseudotime)\nExpression differences between branches\nExpression changes at branch points\nExpression changes somewhere along the trajectory\n…\n\ntradeSeq is a recently proposed algorithm to find trajectory differentially expressed genes. 
It works by smoothing the gene expression along the trajectory by fitting a smoother using generalized additive models (GAMs), and testing whether certain coefficients are statistically different between points in the trajectory.\n\nBiocParallel::register(BiocParallel::MulticoreParam())\n\nThe fitting of GAMs can take quite a while, so for demonstration purposes we first do a very stringent filtering of the genes.\n\n\n\n\n\n\nCaution\n\n\n\nIn an ideal experiment, you would use all the genes, or at least those defined as being variable.\n\n\n\nsel_cells <- split(colnames(obj@assays$RNA@data), obj$clusters_use)\nsel_cells <- unlist(lapply(sel_cells, function(x) {\n set.seed(1)\n return(sample(x, 20))\n}))\n\ngv <- as.data.frame(na.omit(scran::modelGeneVar(obj@assays$RNA@data[, sel_cells])))\ngv <- gv[order(gv$bio, decreasing = T), ]\nsel_genes <- sort(rownames(gv)[1:500])\n\nFitting the model:\n\n\n\n\n\n\nCaution\n\n\n\nThis step is slow to run, so it’s better to skip it for now and use the pre-computed file in the next step.\n\n\n\nsceGAM <- fitGAM(\n counts = drop0(obj@assays$RNA@data[sel_genes, sel_cells]),\n pseudotime = pseudotime[sel_cells, ],\n cellWeights = cellWeights[sel_cells, ],\n nknots = 5, verbose = T, parallel = T, sce = TRUE,\n BPPARAM = BiocParallel::MulticoreParam()\n)\n\nDownload the precomputed file.\n\npath_file <- \"data/trajectory/seurat_scegam.rds\"\nif (!file.exists(path_file)) download.file(url = file.path(path_data, \"trajectory/results/seurat_scegam.rds\"), destfile = path_file)\nsceGAM <- readRDS(path_file)\n\n\nplotGeneCount(curves, clusters = obj$clusters_use, models = sceGAM)\n\n\n\n\n\n\n\nlineages\n\nclass: SlingshotDataSet \n\n Samples Dimensions\n 5828 2\n\nlineages: 7 \nLineage1: 34 18 36 33 55 59 44 60 58 29 8 43 47 49 \nLineage2: 34 18 11 15 46 9 1 2 5 13 28 17 \nLineage3: 34 18 11 15 35 7 32 6 54 25 \nLineage4: 34 18 11 15 35 7 32 6 27 \nLineage5: 34 18 36 21 12 20 16 \nLineage6: 34 18 36 33 55 38 53 \nLineage7: 34 18 36 26 
\n\ncurves: 0 \n\n\n\nlc <- sapply(lineages@lineages, function(x) {\n rev(x)[1]\n})\nnames(lc) <- gsub(\"Lineage\", \"L\", names(lc))\n\n{\n plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], pch = 16)\n lines(curves, lwd = 2, col = \"black\")\n points(centroids2d[lc, ], col = \"black\", pch = 16, cex = 4)\n text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = \"white\")\n}\n\n\n\n\n\n\n\n\n\n6.1 Genes that change with pseudotime\nWe can first look at general trends of gene expression across pseudotime.\n\nres <- na.omit(associationTest(sceGAM, contrastType = \"consecutive\"))\nres <- res[res$pvalue < 1e-3, ]\nres <- res[res$waldStat > mean(res$waldStat), ]\nres <- res[order(res$waldStat, decreasing = T), ]\nres[1:10, ]\n\n\n\n \n\n\n\nWe can plot their expression:\n\npar(mfrow = c(4, 4), mar = c(.1, .1, 2, 1))\n{\n plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16, axes = F, xlab = \"\", ylab = \"\")\n lines(curves, lwd = 2, col = \"black\")\n points(centroids2d[lc, ], col = \"black\", pch = 15, cex = 3, xpd = T)\n text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = \"white\", xpd = T)\n}\n\nvars <- rownames(res[1:15, ])\nvars <- na.omit(vars[vars != \"NA\"])\n\nfor (i in vars) {\n x <- drop0(obj@assays$RNA@data)[i, ]\n x <- (x - min(x)) / (max(x) - min(x))\n o <- order(x)\n plot(obj@reductions$umap@cell.embeddings[o, ],\n main = paste0(i), pch = 16, cex = 0.5, axes = F, xlab = \"\", ylab = \"\",\n col = colorRampPalette(c(\"lightgray\", \"grey60\", \"navy\"))(99)[x[o] * 98 + 1]\n )\n}\n\n\n\n\n\n\n\n\n\n\n6.2 Genes that change between two pseudotime points\nWe can define custom pseudotime values of interest if we’re interested in genes that change between particular point in pseudotime. 
By default, we can look at differences between start and end:\n\nres <- na.omit(startVsEndTest(sceGAM, pseudotimeValues = c(0, 1)))\nres <- res[res$pvalue < 1e-3, ]\nres <- res[res$waldStat > mean(res$waldStat), ]\nres <- res[order(res$waldStat, decreasing = T), ]\nres[1:10, 1:6]\n\n\n\n \n\n\n\nYou can see now that there are several more columns, one for each lineage. This table represents the differential expression within each lineage, to identify which genes go up or down. Let’s check lineage 1:\n\n# Get the top UP and Down regulated in lineage 1\nres_lin1 <- sort(setNames(res$logFClineage1, rownames(res)))\nvars <- names(c(rev(res_lin1)[1:7], res_lin1[1:8]))\nvars <- na.omit(vars[vars != \"NA\"])\n\npar(mfrow = c(4, 4), mar = c(.1, .1, 2, 1))\n\n{\n plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16, axes = F, xlab = \"\", ylab = \"\")\n lines(curves, lwd = 2, col = \"black\")\n points(centroids2d[lc, ], col = \"black\", pch = 15, cex = 3, xpd = T)\n text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = \"white\", xpd = T)\n}\n\nfor (i in vars) {\n x <- drop0(obj@assays$RNA@data)[i, ]\n x <- (x - min(x)) / (max(x) - min(x))\n o <- order(x)\n plot(obj@reductions$umap@cell.embeddings[o, ],\n main = paste0(i), pch = 16, cex = 0.5, axes = F, xlab = \"\", ylab = \"\",\n col = colorRampPalette(c(\"lightgray\", \"grey60\", \"navy\"))(99)[x[o] * 98 + 1]\n )\n}\n\n\n\n\n\n\n\n\n\n\n6.3 Genes that are different between lineages\nMore interesting are genes that are different between two branches. We may have seen some of these genes already pop up in previous analyses of pseudotime. 
There are several ways to define “different between branches”, and each have their own functions:\n\nDifferent at the end points, using diffEndTest\nDifferent at the branching point, using earlyDETest\nDifferent somewhere in pseudotime the branching point, using patternTest\nNote that the last function requires that the pseudotimes between two lineages are aligned.\n\n\nres <- na.omit(diffEndTest(sceGAM))\nres <- res[res$pvalue < 1e-3, ]\nres <- res[res$waldStat > mean(res$waldStat), ]\nres <- res[order(res$waldStat, decreasing = T), ]\nres[1:10, ]\n\n\n\n \n\n\n\nYou can see now that there are even more columns, one for the pair-wise comparison between each lineage. Let’s check lineage 1 vs lineage 2:\n\n# Get the top UP and Down regulated in lineage 1 vs 2\nres_lin1_2 <- sort(setNames(res$logFC1_2, rownames(res)))\nvars <- names(c(rev(res_lin1_2)[1:7], res_lin1_2[1:8]))\nvars <- na.omit(vars[vars != \"NA\"])\n\npar(mfrow = c(4, 4), mar = c(.1, .1, 2, 1))\n{\n plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16, axes = F, xlab = \"\", ylab = \"\")\n lines(curves, lwd = 2, col = \"black\")\n points(centroids2d[lc, ], col = \"black\", pch = 15, cex = 3, xpd = T)\n text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = \"white\", xpd = T)\n}\n\nfor (i in vars) {\n x <- drop0(obj@assays$RNA@data)[i, ]\n x <- (x - min(x)) / (max(x) - min(x))\n o <- order(x)\n plot(obj@reductions$umap@cell.embeddings[o, ],\n main = paste0(i), pch = 16, cex = 0.5, axes = F, xlab = \"\", ylab = \"\",\n col = colorRampPalette(c(\"lightgray\", \"grey60\", \"navy\"))(99)[x[o] * 98 + 1]\n )\n}\n\n\n\n\n\n\n\n\nCheck out this vignette for a more in-depth overview of tradeSeq and many other differential expression tests." + "text": "6 Finding differentially expressed genes\nThe main way to interpret a trajectory is to find genes that change along the trajectory. 
There are many ways to define differential expression along a trajectory:\n\nExpression changes along a particular path (i.e. change with pseudotime)\nExpression differences between branches\nExpression changes at branch points\nExpression changes somewhere along the trajectory\n…\n\ntradeSeq is a recently proposed algorithm to find trajectory differentially expressed genes. It works by smoothing the gene expression along the trajectory by fitting a smoother using generalized additive models (GAMs), and testing whether certain coefficients are statistically different between points in the trajectory.\n\nBiocParallel::register(BiocParallel::MulticoreParam())\n\nThe fitting of GAMs can take quite a while, so for demonstration purposes we first do a very stringent filtering of the genes.\n\n\n\n\n\n\nCaution\n\n\n\nIn an ideal experiment, you would use all the genes, or at least those defined as being variable.\n\n\n\nsel_cells <- split(colnames(obj@assays$RNA@data), obj$clusters_use)\nsel_cells <- unlist(lapply(sel_cells, function(x) {\n set.seed(1)\n return(sample(x, 20))\n}))\n\ngv <- as.data.frame(na.omit(scran::modelGeneVar(obj@assays$RNA@data[, sel_cells])))\ngv <- gv[order(gv$bio, decreasing = T), ]\nsel_genes <- sort(rownames(gv)[1:500])\n\nFitting the model:\n\n\n\n\n\n\nCaution\n\n\n\nThis step is slow to run, so it’s better to skip it for now and use the pre-computed file in the next step.\n\n\n\nsceGAM <- fitGAM(\n counts = drop0(obj@assays$RNA@data[sel_genes, sel_cells]),\n pseudotime = pseudotime[sel_cells, ],\n cellWeights = cellWeights[sel_cells, ],\n nknots = 5, verbose = T, parallel = T, sce = TRUE,\n BPPARAM = BiocParallel::MulticoreParam()\n)\n\nDownload the precomputed file.\n\npath_file <- \"data/trajectory/seurat_scegam.rds\"\nif (!file.exists(path_file)) download.file(url = file.path(path_data, \"trajectory/results/seurat_scegam.rds\"), destfile = path_file)\nsceGAM <- readRDS(path_file)\n\n\nplotGeneCount(curves, clusters = obj$clusters_use, 
models = sceGAM)\n\n\n\n\n\n\n\nlineages\n\nclass: SlingshotDataSet \n\n Samples Dimensions\n 5828 2\n\nlineages: 7 \nLineage1: 34 18 36 33 55 59 44 60 58 29 8 43 47 49 \nLineage2: 34 18 11 15 46 9 1 2 5 13 28 17 \nLineage3: 34 18 11 15 35 7 32 6 54 25 \nLineage4: 34 18 11 15 35 7 32 6 27 \nLineage5: 34 18 36 21 12 20 16 \nLineage6: 34 18 36 33 55 38 53 \nLineage7: 34 18 36 26 \n\ncurves: 0 \n\n\n\nlc <- sapply(lineages@lineages, function(x) {\n rev(x)[1]\n})\nnames(lc) <- gsub(\"Lineage\", \"L\", names(lc))\n\n{\n plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], pch = 16)\n lines(curves, lwd = 2, col = \"black\")\n points(centroids2d[lc, ], col = \"black\", pch = 16, cex = 4)\n text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = \"white\")\n}\n\n\n\n\n\n\n\n\n\n6.1 Genes that change with pseudotime\nWe can first look at general trends of gene expression across pseudotime.\n\nres <- na.omit(associationTest(sceGAM, contrastType = \"consecutive\"))\nres <- res[res$pvalue < 1e-3, ]\nres <- res[res$waldStat > mean(res$waldStat), ]\nres <- res[order(res$waldStat, decreasing = T), ]\nres[1:10, ]\n\n\n\n \n\n\n\nWe can plot their expression:\n\npar(mfrow = c(4, 4), mar = c(.1, .1, 2, 1))\n{\n plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16, axes = F, xlab = \"\", ylab = \"\")\n lines(curves, lwd = 2, col = \"black\")\n points(centroids2d[lc, ], col = \"black\", pch = 15, cex = 3, xpd = T)\n text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = \"white\", xpd = T)\n}\n\nvars <- rownames(res[1:15, ])\nvars <- na.omit(vars[vars != \"NA\"])\n\nfor (i in vars) {\n x <- drop0(obj@assays$RNA@data)[i, ]\n x <- (x - min(x)) / (max(x) - min(x))\n o <- order(x)\n plot(obj@reductions$umap@cell.embeddings[o, ],\n main = paste0(i), pch = 16, cex = 0.5, axes = F, xlab = \"\", ylab = \"\",\n col = colorRampPalette(c(\"lightgray\", \"grey60\", \"navy\"))(99)[x[o] * 98 + 1]\n 
)\n}\n\n\n\n\n\n\n\n\n\n\n6.2 Genes that change between two pseudotime points\nWe can define custom pseudotime values of interest if we’re interested in genes that change between particular point in pseudotime. By default, we can look at differences between start and end:\n\nres <- na.omit(startVsEndTest(sceGAM, pseudotimeValues = c(0, 1)))\nres <- res[res$pvalue < 1e-3, ]\nres <- res[res$waldStat > mean(res$waldStat), ]\nres <- res[order(res$waldStat, decreasing = T), ]\nres[1:10, 1:6]\n\n\n\n \n\n\n\nYou can see now that there are several more columns, one for each lineage. This table represents the differential expression within each lineage, to identify which genes go up or down. Let’s check lineage 1:\n\n# Get the top UP and Down regulated in lineage 1\nres_lin1 <- sort(setNames(res$logFClineage1, rownames(res)))\nvars <- names(c(rev(res_lin1)[1:7], res_lin1[1:8]))\nvars <- na.omit(vars[vars != \"NA\"])\n\npar(mfrow = c(4, 4), mar = c(.1, .1, 2, 1))\n\n{\n plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16, axes = F, xlab = \"\", ylab = \"\")\n lines(curves, lwd = 2, col = \"black\")\n points(centroids2d[lc, ], col = \"black\", pch = 15, cex = 3, xpd = T)\n text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = \"white\", xpd = T)\n}\n\nfor (i in vars) {\n x <- drop0(obj@assays$RNA@data)[i, ]\n x <- (x - min(x)) / (max(x) - min(x))\n o <- order(x)\n plot(obj@reductions$umap@cell.embeddings[o, ],\n main = paste0(i), pch = 16, cex = 0.5, axes = F, xlab = \"\", ylab = \"\",\n col = colorRampPalette(c(\"lightgray\", \"grey60\", \"navy\"))(99)[x[o] * 98 + 1]\n )\n}\n\n\n\n\n\n\n\n\n\n\n6.3 Genes that are different between lineages\nMore interesting are genes that are different between two branches. We may have seen some of these genes already pop up in previous analyses of pseudotime. 
There are several ways to define “different between branches”, and each have their own functions:\n\nDifferent at the end points, using diffEndTest\nDifferent at the branching point, using earlyDETest\nDifferent somewhere in pseudotime the branching point, using patternTest\nNote that the last function requires that the pseudotimes between two lineages are aligned.\n\n\nres <- na.omit(diffEndTest(sceGAM))\nres <- res[res$pvalue < 1e-3, ]\nres <- res[res$waldStat > mean(res$waldStat), ]\nres <- res[order(res$waldStat, decreasing = T), ]\nres[1:10, ]\n\n\n\n \n\n\n\nYou can see now that there are even more columns, one for the pair-wise comparison between each lineage. Let’s check lineage 1 vs lineage 2:\n\n# Get the top UP and Down regulated in lineage 1 vs 2\nres_lin1_2 <- sort(setNames(res$logFC1_2, rownames(res)))\nvars <- names(c(rev(res_lin1_2)[1:7], res_lin1_2[1:8]))\nvars <- na.omit(vars[vars != \"NA\"])\n\npar(mfrow = c(4, 4), mar = c(.1, .1, 2, 1))\n{\n plot(obj@reductions$umap@cell.embeddings, col = pal[obj$clusters_use], cex = .5, pch = 16, axes = F, xlab = \"\", ylab = \"\")\n lines(curves, lwd = 2, col = \"black\")\n points(centroids2d[lc, ], col = \"black\", pch = 15, cex = 3, xpd = T)\n text(centroids2d[lc, ], labels = names(lc), cex = 1, font = 2, col = \"white\", xpd = T)\n}\n\nfor (i in vars) {\n x <- drop0(obj@assays$RNA@data)[i, ]\n x <- (x - min(x)) / (max(x) - min(x))\n o <- order(x)\n plot(obj@reductions$umap@cell.embeddings[o, ],\n main = paste0(i), pch = 16, cex = 0.5, axes = F, xlab = \"\", ylab = \"\",\n col = colorRampPalette(c(\"lightgray\", \"grey60\", \"navy\"))(99)[x[o] * 98 + 1]\n )\n}\n\n\n\n\n\n\n\n\nCheck out this vignette for a more in-depth overview of tradeSeq and many other differential expression tests." 
}, { "objectID": "labs/seurat/seurat_07_trajectory.html#references", @@ -480,14 +480,14 @@ "href": "labs/seurat/seurat_08_spatial.html#meta-st_qc", "title": " Spatial Transcriptomics", "section": "2 Quality control", - "text": "2 Quality control\nSimilar to scRNA-seq, we use statistics on the number of counts, number of features and percent mitochondria for quality control.\nNow the counts and feature counts are calculated on the Spatial assay, so they are named “nCount_Spatial” and “nFeature_Spatial”.\n\nbrain <- PercentageFeatureSet(brain, \"^mt-\", col.name = \"percent_mito\")\nbrain <- PercentageFeatureSet(brain, \"^Hb.*-\", col.name = \"percent_hb\")\n\nVlnPlot(brain, features = c(\"nCount_Spatial\", \"nFeature_Spatial\", \"percent_mito\", \"percent_hb\"), pt.size = 0.1, ncol = 2) + NoLegend()\n\n\n\n\n\n\n\n\nWe can also plot the same data onto the tissue section.\n\nSpatialFeaturePlot(brain, features = c(\"nCount_Spatial\", \"nFeature_Spatial\", \"percent_mito\", \"percent_hb\"))\n\n\n\n\n\n\n\n\nAs you can see, the spots with a low number of counts/features and high mitochondrial content are mainly towards the edges of the tissue. It is quite likely that these regions are damaged tissue. You may also see regions within a tissue with low quality if you have tears or folds in your section.\nBut remember, for some tissue types, the number of genes expressed and the proportion of mitochondrial reads may also be biological features, so bear in mind what tissue you are working on and what these features mean.\n\n2.1 Filter spots\nSelect all spots with less than 25% mitochondrial reads, less than 20% hb-reads and more than 500 detected genes. 
You must judge for yourself based on your knowledge of the tissue what are appropriate filtering criteria for your dataset.\n\nbrain <- brain[, brain$nFeature_Spatial > 500 & brain$percent_mito < 25 & brain$percent_hb < 20]\n\nAnd replot onto tissue section:\n\nSpatialFeaturePlot(brain, features = c(\"nCount_Spatial\", \"nFeature_Spatial\", \"percent_mito\"))\n\n\n\n\n\n\n\n\n\n\n2.2 Top expressed genes\nAs for scRNA-seq data, we will look at what the top expressed genes are.\n\nC <- GetAssayData(brain, assay = \"Spatial\", slot = \"counts\")\nC@x <- C@x / rep.int(colSums(C), diff(C@p))\nmost_expressed <- order(Matrix::rowSums(C), decreasing = T)[20:1]\nboxplot(as.matrix(t(C[most_expressed, ])),\n cex = 0.1, las = 1, xlab = \"% total count per cell\",\n col = (scales::hue_pal())(20)[20:1], horizontal = TRUE\n)\n\n\n\n\n\n\n\nrm(C)\ngc()\n\n used (Mb) gc trigger (Mb) max used (Mb)\nNcells 3360689 179.5 5248232 280.3 5248232 280.3\nVcells 189921766 1449.0 375078860 2861.7 357748466 2729.5\n\n\nAs you can see, the mitochondrial genes are among the top expressed genes. Also the lncRNA gene Bc1 (brain cytoplasmic RNA 1). 
Also one hemoglobin gene.\n\n\n2.3 Filter genes\nWe will remove the Bc1 gene, hemoglobin genes (blood contamination) and the mitochondrial genes.\n\ndim(brain)\n\n[1] 31053 5789\n\n# Filter Bc1\nbrain <- brain[!grepl(\"Bc1\", rownames(brain)), ]\n\n# Filter Mitochondrial\nbrain <- brain[!grepl(\"^mt-\", rownames(brain)), ]\n\n# Filter Hemoglobin gene (optional if that is a problem in your data)\nbrain <- brain[!grepl(\"^Hb.*-\", rownames(brain)), ]\n\ndim(brain)\n\n[1] 31031 5789" + "text": "2 Quality control\nSimilar to scRNA-seq, we use statistics on the number of counts, number of features and percent mitochondria for quality control.\nNow the counts and feature counts are calculated on the Spatial assay, so they are named “nCount_Spatial” and “nFeature_Spatial”.\n\nbrain <- PercentageFeatureSet(brain, \"^mt-\", col.name = \"percent_mito\")\nbrain <- PercentageFeatureSet(brain, \"^Hb.*-\", col.name = \"percent_hb\")\n\nVlnPlot(brain, features = c(\"nCount_Spatial\", \"nFeature_Spatial\", \"percent_mito\", \"percent_hb\"), pt.size = 0.1, ncol = 2) + NoLegend()\n\n\n\n\n\n\n\n\nWe can also plot the same data onto the tissue section.\n\nSpatialFeaturePlot(brain, features = c(\"nCount_Spatial\", \"nFeature_Spatial\", \"percent_mito\", \"percent_hb\"))\n\n\n\n\n\n\n\n\nAs you can see, the spots with a low number of counts/features and high mitochondrial content are mainly towards the edges of the tissue. It is quite likely that these regions are damaged tissue. 
You may also see regions within a tissue with low quality if you have tears or folds in your section.\nBut remember, for some tissue types, the number of genes expressed and the proportion of mitochondrial reads may also be biological features, so bear in mind what tissue you are working on and what these features mean.\n\n2.1 Filter spots\nSelect all spots with less than 25% mitochondrial reads, less than 20% hb-reads and more than 500 detected genes. 
You must judge for yourself based on your knowledge of the tissue what are appropriate filtering criteria for your dataset.\n\nbrain <- brain[, brain$nFeature_Spatial > 500 & brain$percent_mito < 25 & brain$percent_hb < 20]\n\nAnd replot onto tissue section:\n\nSpatialFeaturePlot(brain, features = c(\"nCount_Spatial\", \"nFeature_Spatial\", \"percent_mito\"))\n\n\n\n\n\n\n\n\n\n\n2.2 Top expressed genes\nAs for scRNA-seq data, we will look at what the top expressed genes are.\n\nC <- GetAssayData(brain, assay = \"Spatial\", slot = \"counts\")\nC@x <- C@x / rep.int(colSums(C), diff(C@p))\nmost_expressed <- order(Matrix::rowSums(C), decreasing = T)[20:1]\nboxplot(as.matrix(t(C[most_expressed, ])),\n cex = 0.1, las = 1, xlab = \"% total count per cell\",\n col = (scales::hue_pal())(20)[20:1], horizontal = TRUE\n)\n\n\n\n\n\n\n\nrm(C)\ngc()\n\n used (Mb) gc trigger (Mb) max used (Mb)\nNcells 3360672 179.5 5248490 280.3 5248490 280.3\nVcells 189921808 1449.0 375078910 2861.7 357748714 2729.5\n\n\nAs you can see, the mitochondrial genes are among the top expressed genes. Also the lncRNA gene Bc1 (brain cytoplasmic RNA 1). 
Also one hemoglobin gene.\n\n\n2.3 Filter genes\nWe will remove the Bc1 gene, hemoglobin genes (blood contamination) and the mitochondrial genes.\n\ndim(brain)\n\n[1] 31053 5789\n\n# Filter Bc1\nbrain <- brain[!grepl(\"Bc1\", rownames(brain)), ]\n\n# Filter Mitochondrial\nbrain <- brain[!grepl(\"^mt-\", rownames(brain)), ]\n\n# Filter Hemoglobin gene (optional if that is a problem in your data)\nbrain <- brain[!grepl(\"^Hb.*-\", rownames(brain)), ]\n\ndim(brain)\n\n[1] 31031 5789" }, { "objectID": "labs/seurat/seurat_08_spatial.html#meta-st_analysis", "href": "labs/seurat/seurat_08_spatial.html#meta-st_analysis", "title": " Spatial Transcriptomics", "section": "3 Analysis", - "text": "3 Analysis\nWe will proceed with the data in a very similar manner to scRNA-seq data.\nFor ST data, the Seurat team recommends using SCTransform for normalization, so we will do that. SCTransform will select variable genes and normalize in one step.\n\nbrain <- SCTransform(brain, assay = \"Spatial\", method = \"poisson\", verbose = TRUE)\n\nNow we can plot gene expression of individual genes. 
The gene Hpca is a strong hippocampal marker and Ttr is a marker of the choroid plexus.\n\nSpatialFeaturePlot(brain, features = c(\"Hpca\", \"Ttr\"))\n\n\n\n\n\n\n\n\nIf you want to see the tissue better you can modify point size and transparency of the points.\n\nSpatialFeaturePlot(brain, features = \"Ttr\", pt.size.factor = 1, alpha = c(0.1, 1))\n\n\n\n\n\n\n\n\n\n3.1 Dimensionality reduction and clustering\nWe can now run dimensionality reduction and clustering using the same workflow as we use for scRNA-seq analysis.\nBut make sure you run it on the SCT assay.\n\nbrain <- RunPCA(brain, assay = \"SCT\", verbose = FALSE)\nbrain <- FindNeighbors(brain, reduction = \"pca\", dims = 1:30)\nbrain <- FindClusters(brain, verbose = FALSE)\nbrain <- RunUMAP(brain, reduction = \"pca\", dims = 1:30)\n\nWe can then plot clusters onto the UMAP or onto the tissue section.\n\nDimPlot(brain, reduction = 
\"umap\", group.by = c(\"ident\", \"orig.ident\"))\n\n\n\n\n\n\n\n\n\nSpatialDimPlot(brain)\n\n\n\n\n\n\n\n\nWe can also plot each cluster separately\n\nSpatialDimPlot(brain, cells.highlight = CellsByIdentities(brain), facet.highlight = TRUE, ncol = 5)\n\n\n\n\n\n\n\n\n\n\n3.2 Integration\nQuite often there are strong batch effects between different ST sections, so it may be a good idea to integrate the data across sections.\nWe will do a similar integration as in the Data Integration lab, but this time we will use the SCT assay for integration. Therefore we need to run PrepSCTIntegration which will compute the sctransform residuals for all genes in both the datasets.\n\n# create a list of the original data that we loaded to start with\nst.list <- list(anterior1 = brain1, posterior1 = brain2)\n\n# run SCT on both datasets\nst.list <- lapply(st.list, SCTransform, assay = \"Spatial\", method = \"poisson\")\n\n# need to set maxSize for PrepSCTIntegration to work\noptions(future.globals.maxSize = 2000 * 1024^2) # set allowed size to 2K MiB\n\nst.features <- SelectIntegrationFeatures(st.list, nfeatures = 3000, verbose = FALSE)\nst.list <- PrepSCTIntegration(object.list = st.list, anchor.features = st.features, verbose = FALSE)\n\nNow we can perform the actual integration.\n\nint.anchors <- FindIntegrationAnchors(object.list = st.list, normalization.method = \"SCT\", verbose = FALSE, anchor.features = st.features)\nbrain.integrated <- IntegrateData(anchorset = int.anchors, normalization.method = \"SCT\", verbose = FALSE)\n\nrm(int.anchors, st.list)\ngc()\n\n used (Mb) gc trigger (Mb) max used (Mb)\nNcells 3530604 188.6 5248232 280.3 5248232 280.3\nVcells 546165895 4167.0 1148293022 8760.8 1147468533 8754.5\n\n\nThen we run dimensionality reduction and clustering as before.\n\nbrain.integrated <- RunPCA(brain.integrated, verbose = FALSE)\nbrain.integrated <- FindNeighbors(brain.integrated, dims = 1:30)\nbrain.integrated <- FindClusters(brain.integrated, verbose = 
FALSE)\nbrain.integrated <- RunUMAP(brain.integrated, dims = 1:30)\n\n\nDimPlot(brain.integrated, reduction = \"umap\", group.by = c(\"ident\", \"orig.ident\"))\n\n\n\n\n\n\n\n\n\nSpatialDimPlot(brain.integrated)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nDo you see any differences between the integrated and non-integrated clustering? Judge for yourself, which of the clusterings do you think looks best? As a reference, you can compare to brain regions in the Allen brain atlas.\n\n\n\n\n3.3 Spatially Variable Features\nThere are two main workflows to identify molecular features that correlate with spatial location within a tissue. The first is to perform differential expression based on spatially distinct clusters, the other is to find features that have spatial patterning without taking clusters or spatial annotation into account. First, we will do differential expression between clusters just as we did for the scRNAseq data before.\n\n# differential expression between cluster 1 and cluster 6\nde_markers <- FindMarkers(brain.integrated, ident.1 = 5, ident.2 = 6)\n\n# plot top markers\nSpatialFeaturePlot(object = brain.integrated, features = rownames(de_markers)[1:3], alpha = c(0.1, 1), ncol = 3)\n\n\n\n\n\n\n\n\nSpatial transcriptomics allows researchers to investigate how gene expression trends varies in space, thus identifying spatial patterns of gene expression. For this purpose there are multiple methods, such as SpatailDE, SPARK, Trendsceek, HMRF and Splotch.\nIn FindSpatiallyVariables the default method in Seurat (method = ‘markvariogram), is inspired by the Trendsceek, which models spatial transcriptomics data as a mark point process and computes a ’variogram’, which identifies genes whose expression level is dependent on their spatial location. More specifically, this process calculates gamma(r) values measuring the dependence between two spots a certain “r” distance apart. 
By default, we use an r-value of ‘5’ in these analyses, and only compute these values for variable genes (where variation is calculated independently of spatial location) to save time.\n\n\n\n\n\n\nCaution\n\n\n\nTakes a long time to run, so skip this step for now!\n\n\n\n# brain <- FindSpatiallyVariableFeatures(brain, assay = \"SCT\", features = VariableFeatures(brain)[1:1000],\n# selection.method = \"markvariogram\")\n\n# We would get top features from SpatiallyVariableFeatures\n# top.features <- head(SpatiallyVariableFeatures(brain, selection.method = \"markvariogram\"), 6)" + "text": "3 Analysis\nWe will proceed with the data in a very similar manner to scRNA-seq data.\nFor ST data, the Seurat team recommends using SCTransform for normalization, so we will do that. SCTransform will select variable genes and normalize in one step.\n\nbrain <- SCTransform(brain, assay = \"Spatial\", method = \"poisson\", verbose = TRUE)\n\nNow we can plot gene expression of individual genes. The gene Hpca is a strong hippocampal marker and Ttr is a marker of the choroid plexus.\n\nSpatialFeaturePlot(brain, features = c(\"Hpca\", \"Ttr\"))\n\n\n\n\n\n\n\n\nIf you want to see the tissue better you can modify point size and transparency of the points.\n\nSpatialFeaturePlot(brain, features = \"Ttr\", pt.size.factor = 1, alpha = c(0.1, 1))\n\n\n\n\n\n\n\n\n\n3.1 Dimensionality reduction and clustering\nWe can now run dimensionality reduction and clustering using the same workflow as we use for scRNA-seq analysis.\nBut make sure you run it on the SCT assay.\n\nbrain <- RunPCA(brain, assay = \"SCT\", verbose = FALSE)\nbrain <- FindNeighbors(brain, reduction = \"pca\", dims = 1:30)\nbrain <- FindClusters(brain, verbose = FALSE)\nbrain <- RunUMAP(brain, reduction = \"pca\", dims = 1:30)\n\nWe can then plot clusters onto the UMAP or onto the tissue section.\n\nDimPlot(brain, reduction = \"umap\", group.by = c(\"ident\", 
\"orig.ident\"))\n\n\n\n\n\n\n\n\n\nSpatialDimPlot(brain)\n\n\n\n\n\n\n\n\nWe can also plot each cluster separately\n\nSpatialDimPlot(brain, cells.highlight = CellsByIdentities(brain), facet.highlight = TRUE, ncol = 5)\n\n\n\n\n\n\n\n\n\n\n3.2 Integration\nQuite often there are strong batch effects between different ST sections, so it may be a good idea to integrate the data across sections.\nWe will do a similar integration as in the Data Integration lab, but this time we will use the SCT assay for integration. Therefore we need to run PrepSCTIntegration which will compute the sctransform residuals for all genes in both the datasets.\n\n# create a list of the original data that we loaded to start with\nst.list <- list(anterior1 = brain1, posterior1 = brain2)\n\n# run SCT on both datasets\nst.list <- lapply(st.list, SCTransform, assay = \"Spatial\", method = \"poisson\")\n\n# need to set maxSize for PrepSCTIntegration to work\noptions(future.globals.maxSize = 2000 * 1024^2) # set allowed size to 2K MiB\n\nst.features <- SelectIntegrationFeatures(st.list, nfeatures = 3000, verbose = FALSE)\nst.list <- PrepSCTIntegration(object.list = st.list, anchor.features = st.features, verbose = FALSE)\n\nNow we can perform the actual integration.\n\nint.anchors <- FindIntegrationAnchors(object.list = st.list, normalization.method = \"SCT\", verbose = FALSE, anchor.features = st.features)\nbrain.integrated <- IntegrateData(anchorset = int.anchors, normalization.method = \"SCT\", verbose = FALSE)\n\nrm(int.anchors, st.list)\ngc()\n\n used (Mb) gc trigger (Mb) max used (Mb)\nNcells 3530587 188.6 5248490 280.3 5248490 280.3\nVcells 546165937 4167.0 1148293145 8760.8 1147467666 8754.5\n\n\nThen we run dimensionality reduction and clustering as before.\n\nbrain.integrated <- RunPCA(brain.integrated, verbose = FALSE)\nbrain.integrated <- FindNeighbors(brain.integrated, dims = 1:30)\nbrain.integrated <- FindClusters(brain.integrated, verbose = FALSE)\nbrain.integrated <- 
RunUMAP(brain.integrated, dims = 1:30)\n\n\nDimPlot(brain.integrated, reduction = \"umap\", group.by = c(\"ident\", \"orig.ident\"))\n\n\n\n\n\n\n\n\n\nSpatialDimPlot(brain.integrated)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nDo you see any differences between the integrated and non-integrated clustering? Judge for yourself, which of the clusterings do you think looks best? As a reference, you can compare to brain regions in the Allen brain atlas.\n\n\n\n\n3.3 Spatially Variable Features\nThere are two main workflows to identify molecular features that correlate with spatial location within a tissue. The first is to perform differential expression based on spatially distinct clusters, the other is to find features that have spatial patterning without taking clusters or spatial annotation into account. First, we will do differential expression between clusters just as we did for the scRNAseq data before.\n\n# differential expression between cluster 5 and cluster 6\nde_markers <- FindMarkers(brain.integrated, ident.1 = 5, ident.2 = 6)\n\n# plot top markers\nSpatialFeaturePlot(object = brain.integrated, features = rownames(de_markers)[1:3], alpha = c(0.1, 1), ncol = 3)\n\n\n\n\n\n\n\n\nSpatial transcriptomics allows researchers to investigate how gene expression trends vary in space, thus identifying spatial patterns of gene expression. For this purpose there are multiple methods, such as SpatialDE, SPARK, Trendsceek, HMRF and Splotch.\nIn FindSpatiallyVariableFeatures, the default method in Seurat (method = ‘markvariogram’) is inspired by Trendsceek, which models spatial transcriptomics data as a mark point process and computes a ‘variogram’, which identifies genes whose expression level is dependent on their spatial location. More specifically, this process calculates gamma(r) values measuring the dependence between two spots a certain “r” distance apart. 
By default, we use an r-value of ‘5’ in these analyses, and only compute these values for variable genes (where variation is calculated independently of spatial location) to save time.\n\n\n\n\n\n\nCaution\n\n\n\nTakes a long time to run, so skip this step for now!\n\n\n\n# brain <- FindSpatiallyVariableFeatures(brain, assay = \"SCT\", features = VariableFeatures(brain)[1:1000],\n# selection.method = \"markvariogram\")\n\n# We would get top features from SpatiallyVariableFeatures\n# top.features <- head(SpatiallyVariableFeatures(brain, selection.method = \"markvariogram\"), 6)" }, { "objectID": "labs/seurat/seurat_08_spatial.html#meta-st_ss", @@ -536,7 +536,7 @@ "href": "labs/bioc/bioc_01_qc.html#meta-qc_collate", "title": " Quality Control", "section": "2 Collate", - "text": "2 Collate\nWe can now load the expression matrices and merge them into a single object. Each analysis workflow (Seurat, Scater, Scanpy, etc) has its own way of storing data. We will add dataset labels as cell.ids just in case you have overlapping barcodes between the datasets. After that we add a column Chemistry in the metadata for plotting later on.\n\nsce <- SingleCellExperiment(assays = list(counts = cbind(cov.1, cov.15, cov.17, ctrl.5, ctrl.13, ctrl.14)))\ndim(sce)\n\n[1] 33538 9000\n\n# Adding metadata\nsce@colData$sample <- unlist(sapply(c(\"cov.1\", \"cov.15\", \"cov.17\", \"ctrl.5\", \"ctrl.13\", \"ctrl.14\"), function(x) rep(x, ncol(get(x)))))\nsce@colData$type <- ifelse(grepl(\"cov\", sce@colData$sample), \"Covid\", \"Control\")\n\nOnce you have created the merged Seurat object, the count matrices and individual count matrices and objects are not needed anymore. 
It is a good idea to remove them and run garbage collect to free up some memory.\n\n# remove all objects that will not be used.\nrm(cov.15, cov.1, cov.17, ctrl.5, ctrl.13, ctrl.14)\n# run garbage collect to free up memory\ngc()\n\n used (Mb) gc trigger (Mb) max used (Mb)\nNcells 10216587 545.7 17147474 915.8 13915408 743.2\nVcells 44612623 340.4 94446392 720.6 83350999 636.0\n\n\nHere is how the count matrix and the metadata look like for every cell.\n\nhead(counts(sce)[, 1:10])\n\n6 x 10 sparse Matrix of class \"dgCMatrix\"\n \nMIR1302-2HG . . . . . . . . . .\nFAM138A . . . . . . . . . .\nOR4F5 . . . . . . . . . .\nAL627309.1 . . . . . . . . . .\nAL627309.3 . . . . . . . . . .\nAL627309.2 . . . . . . . . . .\n\nhead(sce@colData, 10)\n\nDataFrame with 10 rows and 2 columns\n sample type\n <character> <character>\nAGGTAGGTCGTTGTTT-1 cov.1 Covid\nTAGAGTCGTCCTCCAT-1 cov.1 Covid\nCCCTGATAGCGAACTG-1 cov.1 Covid\nTCATCATTCCACGTAA-1 cov.1 Covid\nATTTACCCAAGCCTGC-1 cov.1 Covid\nGTTGTCCTCTAGAACC-1 cov.1 Covid\nCCTCCAACAAGAGATT-1 cov.1 Covid\nAATAGAGGTGTGAGCA-1 cov.1 Covid\nGGTGGCTAGCGAATGC-1 cov.1 Covid\nTCGGGCACAGTGTGGA-1 cov.1 Covid" + "text": "2 Collate\nWe can now merge the objects into a single object. Each analysis workflow (Seurat, Scater, Scanpy, etc) has its own way of storing data. We will add dataset labels as cell.ids just in case you have overlapping barcodes between the datasets. 
After that we add a column type in the metadata to define covid and ctrl samples.\n\nsce <- SingleCellExperiment(assays = list(counts = cbind(cov.1, cov.15, cov.17, ctrl.5, ctrl.13, ctrl.14)))\ndim(sce)\n\n[1] 33538 9000\n\n# Adding metadata\nsce@colData$sample <- unlist(sapply(c(\"cov.1\", \"cov.15\", \"cov.17\", \"ctrl.5\", \"ctrl.13\", \"ctrl.14\"), function(x) rep(x, ncol(get(x)))))\nsce@colData$type <- ifelse(grepl(\"cov\", sce@colData$sample), \"Covid\", \"Control\")\n\nOnce you have created the merged Seurat object, the count matrices and individual count matrices and objects are not needed anymore. It is a good idea to remove them and run garbage collect to free up some memory.\n\n# remove all objects that will not be used.\nrm(cov.15, cov.1, cov.17, ctrl.5, ctrl.13, ctrl.14)\n# run garbage collect to free up memory\ngc()\n\n used (Mb) gc trigger (Mb) max used (Mb)\nNcells 10216383 545.7 17147170 915.8 13915194 743.2\nVcells 44612100 340.4 94440822 720.6 83350476 636.0\n\n\nHere is how the count matrix and the metadata look like for every cell.\n\nhead(counts(sce)[, 1:10])\n\n6 x 10 sparse Matrix of class \"dgCMatrix\"\n \nMIR1302-2HG . . . . . . . . . .\nFAM138A . . . . . . . . . .\nOR4F5 . . . . . . . . . .\nAL627309.1 . . . . . . . . . .\nAL627309.3 . . . . . . . . . .\nAL627309.2 . . . . . . . . . 
.\n\nhead(sce@colData, 10)\n\nDataFrame with 10 rows and 2 columns\n sample type\n <character> <character>\nAGGTAGGTCGTTGTTT-1 cov.1 Covid\nTAGAGTCGTCCTCCAT-1 cov.1 Covid\nCCCTGATAGCGAACTG-1 cov.1 Covid\nTCATCATTCCACGTAA-1 cov.1 Covid\nATTTACCCAAGCCTGC-1 cov.1 Covid\nGTTGTCCTCTAGAACC-1 cov.1 Covid\nCCTCCAACAAGAGATT-1 cov.1 Covid\nAATAGAGGTGTGAGCA-1 cov.1 Covid\nGGTGGCTAGCGAATGC-1 cov.1 Covid\nTCGGGCACAGTGTGGA-1 cov.1 Covid" }, { "objectID": "labs/bioc/bioc_01_qc.html#meta-qc_calqc", @@ -550,21 +550,21 @@ "href": "labs/bioc/bioc_01_qc.html#meta-qc_plotqc", "title": " Quality Control", "section": "4 Plot QC", - "text": "4 Plot QC\nNow we can plot some of the QC variables as violin plots.\n\n# total is total UMIs per cell\n# detected is number of detected genes.\n# the different gene subset percentages are listed as subsets_mt_percent etc.\n\nwrap_plots(\n plotColData(sce, y = \"detected\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce, y = \"total\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce, y = \"subsets_mt_percent\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce, y = \"subsets_ribo_percent\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce, y = \"subsets_hb_percent\", x = \"sample\", colour_by = \"sample\"),\n ncol = 3\n) + plot_layout(guides = \"collect\")\n\n\n\n\n\n\n\n\nAs you can see, there is quite some difference in quality for the 4 datasets, with for instance the covid_15 sample having fewer cells with many detected genes and more mitochondrial content. As the ribosomal proteins are highly expressed they will make up a larger proportion of the transcriptional landscape when fewer of the lowly expressed genes are detected. And we can plot the different QC-measures as scatter plots.\n\nplotColData(sce, x = \"total\", y = \"detected\", colour_by = \"sample\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nPlot additional QC stats that we have calculated as scatter plots. 
How are the different measures correlated? Can you explain why?" + "text": "4 Plot QC\nNow we can plot some of the QC variables as violin plots.\n\n# total is total UMIs per cell\n# detected is number of detected genes.\n# the different gene subset percentages are listed as subsets_mt_percent etc.\n\nwrap_plots(\n plotColData(sce, y = \"detected\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce, y = \"total\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce, y = \"subsets_mt_percent\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce, y = \"subsets_ribo_percent\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce, y = \"subsets_hb_percent\", x = \"sample\", colour_by = \"sample\"),\n ncol = 3\n) + plot_layout(guides = \"collect\")\n\n\n\n\n\n\n\n\nAs you can see, there is quite some difference in quality for the 4 datasets, with for instance the covid_15 and covid_16 samples having fewer cells with many detected genes and more mitochondrial content. As the ribosomal proteins are highly expressed they will make up a larger proportion of the transcriptional landscape when fewer of the lowly expressed genes are detected. We can also plot the different QC-measures as scatter plots.\n\nplotColData(sce, x = \"total\", y = \"detected\", colour_by = \"sample\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nPlot additional QC stats that we have calculated as scatter plots. How are the different measures correlated? Can you explain why?" }, { "objectID": "labs/bioc/bioc_01_qc.html#meta-qc_filter", "href": "labs/bioc/bioc_01_qc.html#meta-qc_filter", "title": " Quality Control", "section": "5 Filtering", - "text": "5 Filtering\n\n5.1 Detection-based filtering\nA standard approach is to filter cells with low amount of reads as well as genes that are present in at least a certain amount of cells. Here we will only consider cells with at least 200 detected genes and genes need to be expressed in at least 3 cells. 
Please note that those values are highly dependent on the library preparation method used.\nIn Scran, we can use the function quickPerCellQC to filter out outliers from distributions of qc stats, such as detected genes, gene subsets etc. But in this case, we will take one setting at a time and run through the steps of filtering cells.\n\ndim(sce)\n\n[1] 33538 9000\n\nselected_c <- colnames(sce)[sce$detected > 200]\nselected_f <- rownames(sce)[Matrix::rowSums(counts(sce)) > 3]\n\nsce.filt <- sce[selected_f, selected_c]\ndim(sce.filt)\n\n[1] 18209 8095\n\n\nExtremely high number of detected genes could indicate doublets. However, depending on the cell type composition in your sample, you may have cells with higher number of genes (and also higher counts) from one cell type. In this case, we will run doublet prediction further down, so we will skip this step now, but the code below is an example of how it can be run:\n\n# skip for now and run doublet detection instead...\n\n# high.det.v3 <- sce.filt$nFeatures > 4100\n# high.det.v2 <- (sce.filt$nFeatures > 2000) & (sce.filt$sample_id == \"v2.1k\")\n\n# remove these cells\n# sce.filt <- sce.filt[ , (!high.det.v3) & (!high.det.v2)]\n\n# check number of cells\n# ncol(sce.filt)\n\nAdditionally, we can also see which genes contribute the most to such reads. 
We can for instance plot the percentage of counts per gene.\nIn Scater, you can also use the function plotHighestExprs() to plot the gene contribution, but the function is quite slow.\n\n# Compute the relative expression of each gene per cell\n# Use sparse matrix operations, if your dataset is large, doing matrix devisions the regular way will take a very long time.\nC <- counts(sce)\nC@x <- C@x / rep.int(colSums(C), diff(C@p))\nmost_expressed <- order(Matrix::rowSums(C), decreasing = T)[20:1]\nboxplot(as.matrix(t(C[most_expressed, ])), cex = .1, las = 1, xlab = \"% total count per cell\", col = scales::hue_pal()(20)[20:1], horizontal = TRUE)\n\n\n\n\n\n\n\nrm(C)\n\n# also, there is the option of running the function \"plotHighestExprs\" in the scater package, however, this function takes very long to execute.\n\nAs you can see, MALAT1 constitutes up to 30% of the UMIs from a single cell and the other top genes are mitochondrial and ribosomal genes. It is quite common that nuclear lincRNAs have correlation with quality and mitochondrial reads, so high detection of MALAT1 may be a technical issue. Let us assemble some information about such genes, which are important for quality control and downstream filtering.\n\n\n5.2 Mito/Ribo filtering\nWe also have quite a lot of cells with high proportion of mitochondrial and low proportion of ribosomal reads. It could be wise to remove those cells, if we have enough cells left after filtering. Another option would be to either remove all mitochondrial reads from the dataset and hope that the remaining genes still have enough biological signal. A third option would be to just regress out the percent_mito variable during scaling. In this case we had as much as 99.7% mitochondrial reads in some of the cells, so it is quite unlikely that there is much cell type signature left in those. Looking at the plots, make reasonable decisions on where to draw the cutoff. 
In this case, the bulk of the cells are below 20% mitochondrial reads and that will be used as a cutoff. We will also remove cells with less than 5% ribosomal reads.\n\nselected_mito <- sce.filt$subsets_mt_percent < 30\nselected_ribo <- sce.filt$subsets_ribo_percent > 5\n\n# and subset the object to only keep those cells\nsce.filt <- sce.filt[, selected_mito & selected_ribo]\ndim(sce.filt)\n\n[1] 18209 6023\n\n\nAs you can see, a large proportion of sample covid_15 is filtered out. Also, there is still quite a lot of variation in percent_mito, so it will have to be dealt with in the data analysis step. We can also notice that the percent_ribo are also highly variable, but that is expected since different cell types have different proportions of ribosomal content, according to their function.\n\n\n5.3 Plot filtered QC\nLets plot the same QC-stats another time.\n\nwrap_plots(\n plotColData(sce, y = \"detected\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce, y = \"total\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce, y = \"subsets_mt_percent\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce, y = \"subsets_ribo_percent\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce, y = \"subsets_hb_percent\", x = \"sample\", colour_by = \"sample\"),\n ncol = 3\n) + plot_layout(guides = \"collect\")\n\n\n\n\n\n\n\n\n\n\n5.4 Filter genes\nAs the level of expression of mitochondrial and MALAT1 genes are judged as mainly technical, it can be wise to remove them from the dataset before any further analysis.\n\ndim(sce.filt)\n\n[1] 18209 6023\n\n# Filter MALAT1\nsce.filt <- sce.filt[!grepl(\"MALAT1\", rownames(sce.filt)), ]\n\n# Filter Mitocondrial\nsce.filt <- sce.filt[!grepl(\"^MT-\", rownames(sce.filt)), ]\n\n# Filter Ribossomal gene (optional if that is a problem on your data)\n# sce.filt <- sce.filt[ ! 
grepl(\"^RP[SL]\", rownames(sce.filt)), ]\n\n# Filter Hemoglobin gene\nsce.filt <- sce.filt[!grepl(\"^HB[^(P)]\", rownames(sce.filt)), ]\n\ndim(sce.filt)\n\n[1] 18183 6023" + "text": "5 Filtering\n\n5.1 Detection-based filtering\nA standard approach is to filter cells with low amount of reads as well as genes that are present in at least a certain amount of cells. Here we will only consider cells with at least 200 detected genes and genes need to be expressed in at least 3 cells. Please note that those values are highly dependent on the library preparation method used.\nIn Scran, we can use the function quickPerCellQC to filter out outliers from distributions of qc stats, such as detected genes, gene subsets etc. But in this case, we will take one setting at a time and run through the steps of filtering cells.\n\ndim(sce)\n\n[1] 33538 9000\n\nselected_c <- colnames(sce)[sce$detected > 200]\nselected_f <- rownames(sce)[Matrix::rowSums(counts(sce)) > 3]\n\nsce.filt <- sce[selected_f, selected_c]\ndim(sce.filt)\n\n[1] 18209 8095\n\n\nExtremely high number of detected genes could indicate doublets. However, depending on the cell type composition in your sample, you may have cells with higher number of genes (and also higher counts) from one cell type. In this case, we will run doublet prediction further down, so we will skip this step now, but the code below is an example of how it can be run:\n\n# skip for now and run doublet detection instead...\n\n# high.det.v3 <- sce.filt$nFeatures > 4100\n# high.det.v2 <- (sce.filt$nFeatures > 2000) & (sce.filt$sample_id == \"v2.1k\")\n\n# remove these cells\n# sce.filt <- sce.filt[ , (!high.det.v3) & (!high.det.v2)]\n\n# check number of cells\n# ncol(sce.filt)\n\nAdditionally, we can also see which genes contribute the most to such reads. 
We can for instance plot the percentage of counts per gene.\nIn Scater, you can also use the function plotHighestExprs() to plot the gene contribution, but the function is quite slow.\n\n# Compute the relative expression of each gene per cell\n# Use sparse matrix operations; if your dataset is large, doing matrix divisions the regular way will take a very long time.\nC <- counts(sce)\nC@x <- C@x / rep.int(colSums(C), diff(C@p))\nmost_expressed <- order(Matrix::rowSums(C), decreasing = T)[20:1]\nboxplot(as.matrix(t(C[most_expressed, ])), cex = .1, las = 1, xlab = \"% total count per cell\", col = scales::hue_pal()(20)[20:1], horizontal = TRUE)\n\n\n\n\n\n\n\nrm(C)\n\n# also, there is the option of running the function \"plotHighestExprs\" in the scater package, however, this function takes very long to execute.\n\nAs you can see, MALAT1 constitutes up to 30% of the UMIs from a single cell and the other top genes are mitochondrial and ribosomal genes. It is quite common that nuclear lincRNAs have correlation with quality and mitochondrial reads, so high detection of MALAT1 may be a technical issue. Let us assemble some information about such genes, which are important for quality control and downstream filtering.\n\n\n5.2 Mito/Ribo filtering\nWe also have quite a lot of cells with high proportion of mitochondrial and low proportion of ribosomal reads. It could be wise to remove those cells, if we have enough cells left after filtering. Another option would be to remove all mitochondrial reads from the dataset and hope that the remaining genes still have enough biological signal. A third option would be to just regress out the percent_mito variable during scaling. In this case we had as much as 99.7% mitochondrial reads in some of the cells, so it is quite unlikely that there is much cell type signature left in those. Looking at the plots, make reasonable decisions on where to draw the cutoff. 
In this case, the bulk of the cells are below 20% mitochondrial reads and that will be used as a cutoff. We will also remove cells with less than 5% ribosomal reads.\n\nselected_mito <- sce.filt$subsets_mt_percent < 30\nselected_ribo <- sce.filt$subsets_ribo_percent > 5\n\n# and subset the object to only keep those cells\nsce.filt <- sce.filt[, selected_mito & selected_ribo]\ndim(sce.filt)\n\n[1] 18209 6023\n\n\nAs you can see, a large proportion of sample covid_15 is filtered out. Also, there is still quite a lot of variation in percent_mito, so it will have to be dealt with in the data analysis step. We can also notice that the percent_ribo are also highly variable, but that is expected since different cell types have different proportions of ribosomal content, according to their function.\n\n\n5.3 Plot filtered QC\nLets plot the same QC-stats another time.\n\nwrap_plots(\n plotColData(sce, y = \"detected\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce, y = \"total\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce, y = \"subsets_mt_percent\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce, y = \"subsets_ribo_percent\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce, y = \"subsets_hb_percent\", x = \"sample\", colour_by = \"sample\"),\n ncol = 3\n) + plot_layout(guides = \"collect\")\n\n\n\n\n\n\n\n\n\n\n5.4 Filter genes\nAs the level of expression of mitochondrial and MALAT1 genes are judged as mainly technical, it can be wise to remove them from the dataset before any further analysis. In this case we will also remove the HB genes.\n\ndim(sce.filt)\n\n[1] 18209 6023\n\n# Filter MALAT1\nsce.filt <- sce.filt[!grepl(\"MALAT1\", rownames(sce.filt)), ]\n\n# Filter Mitocondrial\nsce.filt <- sce.filt[!grepl(\"^MT-\", rownames(sce.filt)), ]\n\n# Filter Ribossomal gene (optional if that is a problem on your data)\n# sce.filt <- sce.filt[ ! 
grepl(\"^RP[SL]\", rownames(sce.filt)), ]\n\n# Filter Hemoglobin gene\nsce.filt <- sce.filt[!grepl(\"^HB[^(PES)]\", rownames(sce.filt)), ]\n\ndim(sce.filt)\n\n[1] 18186 6023" }, { "objectID": "labs/bioc/bioc_01_qc.html#meta-qc_sex", "href": "labs/bioc/bioc_01_qc.html#meta-qc_sex", "title": " Quality Control", "section": "6 Sample sex", - "text": "6 Sample sex\nWhen working with human or animal samples, you should ideally constrain you experiments to a single sex to avoid including sex bias in the conclusions. However this may not always be possible. By looking at reads from chromosomeY (males) and XIST (X-inactive specific transcript) expression (mainly female) it is quite easy to determine per sample which sex it is. It can also bee a good way to detect if there has been any sample mixups, if the sample metadata sex does not agree with the computational predictions.\nTo get chromosome information for all genes, you should ideally parse the information from the gtf file that you used in the mapping pipeline as it has the exact same annotation version/gene naming. However, it may not always be available, as in this case where we have downloaded public data. R package biomaRt can be used to fetch annotation information. The code to run biomaRt is provided. 
As the biomart instances quite often are unresponsive, we will download and use a file that was created in advance.\n\n# this code chunk is not executed\nsuppressMessages(library(biomaRt))\n\n# initialize connection to mart, may take some time if the sites are unresponsive.\nmart <- useMart(\"ENSEMBL_MART_ENSEMBL\", dataset = \"hsapiens_gene_ensembl\")\n\n# fetch chromosome info plus some other annotations\ngenes_table <- try(biomaRt::getBM(attributes = c(\n \"ensembl_gene_id\", \"external_gene_name\",\n \"description\", \"gene_biotype\", \"chromosome_name\", \"start_position\"\n), mart = mart, useCache = F))\n\nwrite.csv(genes_table, file = \"data/results/genes_table.csv\")\n\n\ngenes_file <- file.path(path_results, \"genes_table.csv\")\n\nif (!file.exists(genes_file)) download.file(file.path(path_data, \"covid/results/genes_table.csv\"), destfile = genes_file)\ngenes.table <- read.csv(genes_file)\n\ngenes.table <- genes.table[genes.table$external_gene_name %in% rownames(sce.filt), ]\n\nNow that we have the chromosome information, we can calculate per cell the proportion of reads that comes from chromosome Y.\n\nchrY.gene <- genes.table$external_gene_name[genes.table$chromosome_name == \"Y\"]\nsce.filt@colData$pct_chrY <- Matrix::colSums(counts(sce.filt)[chrY.gene, ]) / colSums(counts(sce.filt))\n\nThen plot XIST expression vs chrY proportion. 
As you can see, the samples are clearly on either side, even if some cells do not have detection of either.\n\n# as plotColData cannot take an expression vs metadata, we need to add in XIST expression to colData\nsce.filt@colData$XIST <- counts(sce.filt)[\"XIST\", ] / colSums(counts(sce.filt)) * 10000\nplotColData(sce.filt, \"XIST\", \"pct_chrY\")\n\n\n\n\n\n\n\n\nPlot as violins.\n\nwrap_plots(\n plotColData(sce.filt, y = \"XIST\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce.filt, y = \"pct_chrY\", x = \"sample\", colour_by = \"sample\"),\n ncol = 2\n) + plot_layout(guides = \"collect\")\n\n\n\n\n\n\n\n\nHere, we can see clearly that we have two males and 4 females, can you see which samples they are? Do you think this will cause any problems for downstream analysis? Discuss with your group: what would be the best way to deal with this type of sex bias?" + "text": "6 Sample sex\nWhen working with human or animal samples, you should ideally constrain you experiments to a single sex to avoid including sex bias in the conclusions. However this may not always be possible. By looking at reads from chromosomeY (males) and XIST (X-inactive specific transcript) expression (mainly female) it is quite easy to determine per sample which sex it is. It can also bee a good way to detect if there has been any sample mixups, if the sample metadata sex does not agree with the computational predictions.\nTo get chromosome information for all genes, you should ideally parse the information from the gtf file that you used in the mapping pipeline as it has the exact same annotation version/gene naming. However, it may not always be available, as in this case where we have downloaded public data. R package biomaRt can be used to fetch annotation information. The code to run biomaRt is provided. 
As the biomart instances quite often are unresponsive, we will download and use a file that was created in advance.\n\n# this code chunk is not executed\nsuppressMessages(library(biomaRt))\n\n# initialize connection to mart, may take some time if the sites are unresponsive.\nmart <- useMart(\"ENSEMBL_MART_ENSEMBL\", dataset = \"hsapiens_gene_ensembl\")\n\n# fetch chromosome info plus some other annotations\ngenes_table <- try(biomaRt::getBM(attributes = c(\n \"ensembl_gene_id\", \"external_gene_name\",\n \"description\", \"gene_biotype\", \"chromosome_name\", \"start_position\"\n), mart = mart, useCache = F))\n\nwrite.csv(genes_table, file = \"data/covid/results/genes_table.csv\")\n\n\ngenes_file <- file.path(path_results, \"genes_table.csv\")\n\nif (!file.exists(genes_file)) download.file(file.path(path_data, \"covid/results/genes_table.csv\"), destfile = genes_file)\ngenes.table <- read.csv(genes_file)\n\ngenes.table <- genes.table[genes.table$external_gene_name %in% rownames(sce.filt), ]\n\nNow that we have the chromosome information, we can calculate per cell the proportion of reads that comes from chromosome Y.\n\nchrY.gene <- genes.table$external_gene_name[genes.table$chromosome_name == \"Y\"]\nsce.filt@colData$pct_chrY <- Matrix::colSums(counts(sce.filt)[chrY.gene, ]) / colSums(counts(sce.filt))\n\nThen plot XIST expression vs chrY proportion. 
As you can see, the samples are clearly on either side, even if some cells do not have detection of either.\n\n# as plotColData cannot take an expression vs metadata, we need to add in XIST expression to colData\nsce.filt@colData$XIST <- counts(sce.filt)[\"XIST\", ] / colSums(counts(sce.filt)) * 10000\nplotColData(sce.filt, \"XIST\", \"pct_chrY\")\n\n\n\n\n\n\n\n\nPlot as violins.\n\nwrap_plots(\n plotColData(sce.filt, y = \"XIST\", x = \"sample\", colour_by = \"sample\"),\n plotColData(sce.filt, y = \"pct_chrY\", x = \"sample\", colour_by = \"sample\"),\n ncol = 2\n) + plot_layout(guides = \"collect\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nHere, we can see clearly that we have three males and five females, can you see which samples they are? Do you think this will cause any problems for downstream analysis? Discuss with your group: what would be the best way to deal with this type of sex bias?" }, { "objectID": "labs/bioc/bioc_01_qc.html#meta-qc_cellcycle", @@ -578,7 +578,7 @@ "href": "labs/bioc/bioc_01_qc.html#meta-qc_doublet", "title": " Quality Control", "section": "8 Predict doublets", - "text": "8 Predict doublets\nDoublets/Multiples of cells in the same well/droplet is a common issue in scRNAseq protocols. Especially in droplet-based methods with overloading of cells. In a typical 10x experiment the proportion of doublets is linearly dependent on the amount of loaded cells. As indicated from the Chromium user guide, doublet rates are about as follows:\n\nMost doublet detectors simulates doublets by merging cell counts and predicts doublets as cells that have similar embeddings as the simulated doublets. Most such packages need an assumption about the number/proportion of expected doublets in the dataset. 
The data you are using is subsampled, but the original datasets contained about 5 000 cells per sample, hence we can assume that they loaded about 9 000 cells and should have a doublet rate at about 4%.\n\n\n\n\n\n\nCaution\n\n\n\nIdeally doublet prediction should be run on each sample separately, especially if your different samples have different proportions of cell types. In this case, the data is subsampled so we have very few cells per sample and all samples are sorted PBMCs so it is okay to run them together.\n\n\nThere is a method to predict if a cluster consists of mainly doublets findDoubletClusters(), but we can also predict individual cells based on simulations using the function computeDoubletDensity() which we will do here. Doublet detection will be performed using PCA, so we need to first normalize the data and run variable gene detection, as well as UMAP for visualization. These steps will be explored in more detail in coming exercises.\n\nsce.filt <- logNormCounts(sce.filt)\ndec <- modelGeneVar(sce.filt, block = sce.filt$sample)\nhvgs <- getTopHVGs(dec, n = 2000)\n\nsce.filt <- runPCA(sce.filt, subset_row = hvgs)\n\nsce.filt <- runUMAP(sce.filt, pca = 10)\n\n\nsuppressPackageStartupMessages(library(scDblFinder))\n\n# run computeDoubletDensity with 10 principal components.\nsce.filt <- scDblFinder(sce.filt, dims = 10)\n\n\nwrap_plots(\n plotUMAP(sce.filt, colour_by = \"scDblFinder.score\"),\n plotUMAP(sce.filt, colour_by = \"scDblFinder.class\"),\n plotUMAP(sce.filt, colour_by = \"sample\"),\n ncol = 3\n)\n\n\n\n\n\n\n\n\nNow, lets remove all predicted doublets from our data.\n\nsce.filt <- sce.filt[, sce.filt$scDblFinder.score < 2]\ndim(sce.filt)\n\n[1] 18183 6023" + "text": "8 Predict doublets\nDoublets/Multiples of cells in the same well/droplet is a common issue in scRNAseq protocols. Especially in droplet-based methods with overloading of cells. 
In a typical 10x experiment the proportion of doublets is linearly dependent on the number of loaded cells. As indicated in the Chromium user guide, doublet rates are about as follows:\n\nMost doublet detectors simulate doublets by merging cell counts and predict doublets as cells that have embeddings similar to the simulated doublets. Most such packages need an assumption about the number/proportion of expected doublets in the dataset. The data you are using is subsampled, but the original datasets contained about 5 000 cells per sample, hence we can assume that they loaded about 9 000 cells and should have a doublet rate of about 4%.\n\n\n\n\n\n\nCaution\n\n\n\nIdeally, doublet prediction should be run on each sample separately, especially if your samples have different proportions of cell types. In this case, the data is subsampled so we have very few cells per sample and all samples are sorted PBMCs, so it is okay to run them together.\n\n\nThere is a method to predict if a cluster consists mainly of doublets, findDoubletClusters(), but we can also predict individual cells based on simulations, for example with computeDoubletDensity() or, as we will do here, scDblFinder(). Doublet detection will be performed using PCA, so we first need to normalize the data and run variable gene detection, as well as UMAP for visualization. 
These steps will be explored in more detail in the coming exercises.\n\nsce.filt <- logNormCounts(sce.filt)\ndec <- modelGeneVar(sce.filt, block = sce.filt$sample)\nhvgs <- getTopHVGs(dec, n = 2000)\n\nsce.filt <- runPCA(sce.filt, subset_row = hvgs)\n\nsce.filt <- runUMAP(sce.filt, pca = 10)\n\n\nsuppressPackageStartupMessages(library(scDblFinder))\n\n# run scDblFinder with 10 principal components.\nsce.filt <- scDblFinder(sce.filt, dims = 10)\n\n\nwrap_plots(\n plotUMAP(sce.filt, colour_by = \"scDblFinder.score\"),\n plotUMAP(sce.filt, colour_by = \"scDblFinder.class\"),\n plotUMAP(sce.filt, colour_by = \"sample\"),\n ncol = 3\n)\n\n\n\n\n\n\n\n\nNow, let’s remove all predicted doublets from our data.\n\nsce.filt <- sce.filt[, sce.filt$scDblFinder.score < 2]\ndim(sce.filt)\n\n[1] 18186 6023" }, { "objectID": "labs/bioc/bioc_01_qc.html#meta-qc_save", @@ -683,7 +683,7 @@ "href": "labs/bioc/bioc_03_integration.html#meta-int_prep", "title": " Data Integration", "section": "1 Data preparation", - "text": "1 Data preparation\nLet’s first load necessary libraries and the data saved in the previous lab.\n\nsuppressPackageStartupMessages({\n library(scater)\n library(scran)\n library(patchwork)\n library(ggplot2)\n library(batchelor)\n library(harmony)\n library(reticulate)\n})\n\n# Activate scanorama Python venv\nreticulate::use_virtualenv(\"/opt/venv/scanorama\")\nreticulate::py_discover_config()\n\npython: /opt/venv/scanorama/bin/python\nlibpython: /usr/lib/python3.10/config-3.10-x86_64-linux-gnu/libpython3.10.so\npythonhome: /opt/venv/scanorama:/opt/venv/scanorama\nversion: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]\nnumpy: /opt/venv/scanorama/lib/python3.10/site-packages/numpy\nnumpy_version: 1.26.3\n\nNOTE: Python version was forced by use_python function\n\n\n\npath_data <- \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\npath_file <- \"data/covid/results/bioc_covid_qc_dr.rds\"\nif (!dir.exists(dirname(path_file))) 
dir.create(dirname(path_file), recursive = TRUE)\nif (!file.exists(path_file)) download.file(url = file.path(path_data, \"covid/results/bioc_covid_qc_dr.rds\"), destfile = path_file)\nsce <- readRDS(path_file)\nprint(reducedDims(sce))\n\nList of length 8\nnames(8): PCA UMAP tSNE_on_PCA ... UMAP_on_ScaleData KNN UMAP_on_Graph\n\n\nWe split the combined object into a list, with each dataset as an element. We perform standard preprocessing (log-normalization), and identify variable features individually for each dataset based on a variance stabilizing transformation (vst).\n\nsce.list <- lapply(unique(sce$sample), function(x) {\n x <- sce[, sce$sample == x]\n})\n\nhvgs_per_dataset <- lapply(sce.list, function(x) {\n x <- computeSumFactors(x, sizes = c(20, 40, 60, 80))\n x <- logNormCounts(x)\n var.out <- modelGeneVar(x, method = \"loess\")\n hvg.out <- var.out[which(var.out$FDR <= 0.05 & var.out$bio >= 0.2), ]\n hvg.out <- hvg.out[order(hvg.out$bio, decreasing = TRUE), ]\n return(rownames(hvg.out))\n})\nnames(hvgs_per_dataset) <- unique(sce$sample)\n\n# venn::venn(hvgs_per_dataset,opacity = .4,zcolor = scales::hue_pal()(3),cexsn = 1,cexil = 1,lwd=1,col=\"white\",borders = NA)\n\ntemp <- unique(unlist(hvgs_per_dataset))\noverlap <- sapply(hvgs_per_dataset, function(x) {\n temp %in% x\n})\n\n\npheatmap::pheatmap(t(overlap * 1), cluster_rows = F, color = c(\"grey90\", \"grey20\")) ## MNN\n\n\n\n\n\n\n\n\nThe mutual nearest neighbors (MNN) approach within the scran package utilizes a novel approach to adjust for batch effects. The fastMNN() function returns a representation of the data with reduced dimensionality, which can be used in a similar fashion to other lower-dimensional representations such as PCA. In particular, this representation can be used for downstream methods such as clustering. The BNPARAM can be used to specify the specific nearest neighbors method to use from the BiocNeighbors package. 
Here we make use of the Annoy library via the BiocNeighbors::AnnoyParam() argument. We save the reduced-dimension MNN representation into the reducedDims slot of our sce object.\n\nmnn_out <- batchelor::fastMNN(sce, subset.row = unique(unlist(hvgs_per_dataset)), batch = factor(sce$sample), k = 20, d = 50)\n\n\n\n\n\n\n\nCaution\n\n\n\nfastMNN() does not produce a batch-corrected expression matrix.\n\n\n\nmnn_out <- t(reducedDim(mnn_out, \"corrected\"))\ncolnames(mnn_out) <- unlist(lapply(sce.list, function(x) {\n colnames(x)\n}))\nmnn_out <- mnn_out[, colnames(sce)]\nrownames(mnn_out) <- paste0(\"dim\", 1:50)\nreducedDim(sce, \"MNN\") <- t(mnn_out)\n\nWe can observe that a new assay slot is now created under the name MNN.\n\nreducedDims(sce)\n\nList of length 9\nnames(9): PCA UMAP tSNE_on_PCA UMAP_on_PCA ... KNN UMAP_on_Graph MNN\n\n\nThus, the result from fastMNN() should solely be treated as a reduced dimensionality representation, suitable for direct plotting, TSNE/UMAP, clustering, and trajectory analysis that relies on such results.\n\nset.seed(42)\nsce <- runTSNE(sce, dimred = \"MNN\", n_dimred = 50, perplexity = 30, name = \"tSNE_on_MNN\")\nsce <- runUMAP(sce, dimred = \"MNN\", n_dimred = 50, ncomponents = 2, name = \"UMAP_on_MNN\")\n\nWe can now plot the unintegrated and the integrated space reduced dimensions.\n\nwrap_plots(\n plotReducedDim(sce, dimred = \"PCA\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"PCA\"),\n plotReducedDim(sce, dimred = \"tSNE_on_PCA\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"tSNE_on_PCA\"),\n plotReducedDim(sce, dimred = \"UMAP_on_PCA\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"UMAP_on_PCA\"),\n plotReducedDim(sce, dimred = \"MNN\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"MNN\"),\n plotReducedDim(sce, dimred = \"tSNE_on_MNN\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = 
\"tSNE_on_MNN\"),\n plotReducedDim(sce, dimred = \"UMAP_on_MNN\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"UMAP_on_MNN\"),\n ncol = 3\n) + plot_layout(guides = \"collect\")\n\n\n\n\n\n\n\n\nLet’s plot some marker genes for different cell types onto the embedding.\n\n\n\nMarkers\nCell Type\n\n\n\n\nCD3E\nT cells\n\n\nCD3E CD4\nCD4+ T cells\n\n\nCD3E CD8A\nCD8+ T cells\n\n\nGNLY, NKG7\nNK cells\n\n\nMS4A1\nB cells\n\n\nCD14, LYZ, CST3, MS4A7\nCD14+ Monocytes\n\n\nFCGR3A, LYZ, CST3, MS4A7\nFCGR3A+ Monocytes\n\n\nFCER1A, CST3\nDCs\n\n\n\n\nplotlist <- list()\nfor (i in c(\"CD3E\", \"CD4\", \"CD8A\", \"NKG7\", \"GNLY\", \"MS4A1\", \"CD14\", \"LYZ\", \"MS4A7\", \"FCGR3A\", \"CST3\", \"FCER1A\")) {\n plotlist[[i]] <- plotReducedDim(sce, dimred = \"UMAP_on_MNN\", colour_by = i, by_exprs_values = \"logcounts\", point_size = 0.6) +\n scale_fill_gradientn(colours = colorRampPalette(c(\"grey90\", \"orange3\", \"firebrick\", \"firebrick\", \"red\", \"red\"))(10)) +\n ggtitle(label = i) + theme(plot.title = element_text(size = 20))\n}\nwrap_plots(plotlist = plotlist, ncol = 3)\n\n\n\n\n\n\n\n\nINTEG_R1:\nINTEG_R2:\n\nlibrary(harmony)\n\nreducedDimNames(sce)\n\n [1] \"PCA\" \"UMAP\" \"tSNE_on_PCA\" \n [4] \"UMAP_on_PCA\" \"UMAP10_on_PCA\" \"UMAP_on_ScaleData\"\n [7] \"KNN\" \"UMAP_on_Graph\" \"MNN\" \n[10] \"tSNE_on_MNN\" \"UMAP_on_MNN\" \n\nsce <- RunHarmony(\n sce,\n group.by.vars = \"sample\",\n reduction.save = \"harmony\",\n reduction = \"PCA\",\n dims.use = 1:50\n)\n\n# Here we use all PCs computed from Harmony for UMAP calculation\nsce <- runUMAP(sce, dimred = \"harmony\", n_dimred = 50, ncomponents = 2, name = \"UMAP_on_Harmony\")\n\nINTEG_R3:\nINTEG_R4:\n\nhvgs <- unique(unlist(hvgs_per_dataset))\n\nscelist <- list()\ngenelist <- list()\nfor (i in 1:length(sce.list)) {\n scelist[[i]] <- t(as.matrix(logcounts(sce.list[[i]])[hvgs, ]))\n genelist[[i]] <- hvgs\n}\n\nlapply(scelist, dim)\n\n[[1]]\n[1] 923 454\n\n[[2]]\n[1] 611 454\n\n[[3]]\n[1] 
1111 454\n\n[[4]]\n[1] 1067 454\n\n[[5]]\n[1] 1203 454\n\n[[6]]\n[1] 1108 454\n\n\nINTEG_R5:\n\nscanorama <- reticulate::import(\"scanorama\")\n\nintegrated.data <- scanorama$integrate(datasets_full = scelist, genes_list = genelist)\n\nintdimred <- do.call(rbind, integrated.data[[1]])\ncolnames(intdimred) <- paste0(\"PC_\", 1:100)\nrownames(intdimred) <- colnames(logcounts(sce))\n\n# Add standard deviations in order to draw Elbow Plots in Seurat\nstdevs <- apply(intdimred, MARGIN = 2, FUN = sd)\nattr(intdimred, \"varExplained\") <- stdevs\n\nreducedDim(sce, \"Scanorama_PCA\") <- intdimred\n\n# Here we use all PCs computed from Scanorama for UMAP calculation\nsce <- runUMAP(sce, dimred = \"Scanorama_PCA\", n_dimred = 50, ncomponents = 2, name = \"UMAP_on_Scanorama\")\n\nINTEG_R6:\n\np1 <- plotReducedDim(sce, dimred = \"UMAP_on_PCA\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"UMAP_on_PCA\")\np2 <- plotReducedDim(sce, dimred = \"UMAP_on_MNN\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"UMAP_on_MNN\")\np3 <- plotReducedDim(sce, dimred = \"UMAP_on_Harmony\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"UMAP_on_Harmony\")\np4 <- plotReducedDim(sce, dimred = \"UMAP_on_Scanorama\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"UMAP_on_Scanorama\")\n\nwrap_plots(p1, p2, p3, p4, nrow = 2) +\n plot_layout(guides = \"collect\")\n\nINTEG_R7:\nLet’s save the integrated data for further analysis.\n\nsaveRDS(sce, \"data/covid/results/bioc_covid_qc_dr_int.rds\")" + "text": "1 Data preparation\nLet’s first load necessary libraries and the data saved in the previous lab.\n\nsuppressPackageStartupMessages({\n library(scater)\n library(scran)\n library(patchwork)\n library(ggplot2)\n library(batchelor)\n library(harmony)\n library(reticulate)\n})\n\n# Activate scanorama Python venv\nreticulate::use_virtualenv(\"/opt/venv/scanorama\")\nreticulate::py_discover_config()\n\npython: 
/opt/venv/scanorama/bin/python\nlibpython: /usr/lib/python3.10/config-3.10-x86_64-linux-gnu/libpython3.10.so\npythonhome: /opt/venv/scanorama:/opt/venv/scanorama\nversion: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]\nnumpy: /opt/venv/scanorama/lib/python3.10/site-packages/numpy\nnumpy_version: 1.26.3\n\nNOTE: Python version was forced by use_python function\n\n\n\npath_data <- \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\npath_file <- \"data/covid/results/bioc_covid_qc_dr.rds\"\nif (!dir.exists(dirname(path_file))) dir.create(dirname(path_file), recursive = TRUE)\nif (!file.exists(path_file)) download.file(url = file.path(path_data, \"covid/results/bioc_covid_qc_dr.rds\"), destfile = path_file)\nsce <- readRDS(path_file)\nprint(reducedDims(sce))\n\nList of length 8\nnames(8): PCA UMAP tSNE_on_PCA ... UMAP_on_ScaleData KNN UMAP_on_Graph\n\n\nWe split the combined object into a list, with each dataset as an element. We perform standard preprocessing (log-normalization), and identify variable features individually for each dataset based on a variance stabilizing transformation (vst).\n\nsce.list <- lapply(unique(sce$sample), function(x) {\n x <- sce[, sce$sample == x]\n})\n\nhvgs_per_dataset <- lapply(sce.list, function(x) {\n x <- computeSumFactors(x, sizes = c(20, 40, 60, 80))\n x <- logNormCounts(x)\n var.out <- modelGeneVar(x, method = \"loess\")\n hvg.out <- var.out[which(var.out$FDR <= 0.05 & var.out$bio >= 0.2), ]\n hvg.out <- hvg.out[order(hvg.out$bio, decreasing = TRUE), ]\n return(rownames(hvg.out))\n})\nnames(hvgs_per_dataset) <- unique(sce$sample)\n\n# venn::venn(hvgs_per_dataset,opacity = .4,zcolor = scales::hue_pal()(3),cexsn = 1,cexil = 1,lwd=1,col=\"white\",borders = NA)\n\ntemp <- unique(unlist(hvgs_per_dataset))\noverlap <- sapply(hvgs_per_dataset, function(x) {\n temp %in% x\n})\n\n\npheatmap::pheatmap(t(overlap * 1), cluster_rows = F, color = c(\"grey90\", \"grey20\")) ## MNN\n\n\n\n\n\n\n\n\nThe mutual nearest 
neighbors (MNN) approach within the batchelor package uses a novel strategy to adjust for batch effects. The fastMNN() function returns a representation of the data with reduced dimensionality, which can be used in a similar fashion to other lower-dimensional representations such as PCA. In particular, this representation can be used for downstream methods such as clustering. The BNPARAM argument can be used to specify which nearest neighbors method to use from the BiocNeighbors package. Here we make use of the Annoy library via the BiocNeighbors::AnnoyParam() argument. We save the reduced-dimension MNN representation into the reducedDims slot of our sce object.\n\nmnn_out <- batchelor::fastMNN(sce, subset.row = unique(unlist(hvgs_per_dataset)), batch = factor(sce$sample), k = 20, d = 50)\n\n\n\n\n\n\n\nCaution\n\n\n\nfastMNN() does not produce a batch-corrected expression matrix.\n\n\n\nmnn_out <- t(reducedDim(mnn_out, \"corrected\"))\ncolnames(mnn_out) <- unlist(lapply(sce.list, function(x) {\n colnames(x)\n}))\nmnn_out <- mnn_out[, colnames(sce)]\nrownames(mnn_out) <- paste0(\"dim\", 1:50)\nreducedDim(sce, \"MNN\") <- t(mnn_out)\n\nWe can observe that a new reducedDims entry is now created under the name MNN.\n\nreducedDims(sce)\n\nList of length 9\nnames(9): PCA UMAP tSNE_on_PCA UMAP_on_PCA ... 
KNN UMAP_on_Graph MNN\n\n\nThus, the result from fastMNN() should solely be treated as a reduced dimensionality representation, suitable for direct plotting, TSNE/UMAP, clustering, and trajectory analysis that relies on such results.\n\nset.seed(42)\nsce <- runTSNE(sce, dimred = \"MNN\", n_dimred = 50, perplexity = 30, name = \"tSNE_on_MNN\")\nsce <- runUMAP(sce, dimred = \"MNN\", n_dimred = 50, ncomponents = 2, name = \"UMAP_on_MNN\")\n\nWe can now plot the unintegrated and the integrated space reduced dimensions.\n\nwrap_plots(\n plotReducedDim(sce, dimred = \"PCA\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"PCA\"),\n plotReducedDim(sce, dimred = \"tSNE_on_PCA\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"tSNE_on_PCA\"),\n plotReducedDim(sce, dimred = \"UMAP_on_PCA\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"UMAP_on_PCA\"),\n plotReducedDim(sce, dimred = \"MNN\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"MNN\"),\n plotReducedDim(sce, dimred = \"tSNE_on_MNN\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"tSNE_on_MNN\"),\n plotReducedDim(sce, dimred = \"UMAP_on_MNN\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"UMAP_on_MNN\"),\n ncol = 3\n) + plot_layout(guides = \"collect\")\n\n\n\n\n\n\n\n\nLet’s plot some marker genes for different cell types onto the embedding.\n\n\n\nMarkers\nCell Type\n\n\n\n\nCD3E\nT cells\n\n\nCD3E CD4\nCD4+ T cells\n\n\nCD3E CD8A\nCD8+ T cells\n\n\nGNLY, NKG7\nNK cells\n\n\nMS4A1\nB cells\n\n\nCD14, LYZ, CST3, MS4A7\nCD14+ Monocytes\n\n\nFCGR3A, LYZ, CST3, MS4A7\nFCGR3A+ Monocytes\n\n\nFCER1A, CST3\nDCs\n\n\n\n\nplotlist <- list()\nfor (i in c(\"CD3E\", \"CD4\", \"CD8A\", \"NKG7\", \"GNLY\", \"MS4A1\", \"CD14\", \"LYZ\", \"MS4A7\", \"FCGR3A\", \"CST3\", \"FCER1A\")) {\n plotlist[[i]] <- plotReducedDim(sce, dimred = \"UMAP_on_MNN\", colour_by = i, by_exprs_values = 
\"logcounts\", point_size = 0.6) +\n scale_fill_gradientn(colours = colorRampPalette(c(\"grey90\", \"orange3\", \"firebrick\", \"firebrick\", \"red\", \"red\"))(10)) +\n ggtitle(label = i) + theme(plot.title = element_text(size = 20))\n}\nwrap_plots(plotlist = plotlist, ncol = 3)\n\n\n\n\n\n\n\n\nINTEG_R1:\nINTEG_R2:\n\nlibrary(harmony)\n\nreducedDimNames(sce)\n\n [1] \"PCA\" \"UMAP\" \"tSNE_on_PCA\" \n [4] \"UMAP_on_PCA\" \"UMAP10_on_PCA\" \"UMAP_on_ScaleData\"\n [7] \"KNN\" \"UMAP_on_Graph\" \"MNN\" \n[10] \"tSNE_on_MNN\" \"UMAP_on_MNN\" \n\nsce <- RunHarmony(\n sce,\n group.by.vars = \"sample\",\n reduction.save = \"harmony\",\n reduction = \"PCA\",\n dims.use = 1:50\n)\n\n# Here we use all PCs computed from Harmony for UMAP calculation\nsce <- runUMAP(sce, dimred = \"harmony\", n_dimred = 50, ncomponents = 2, name = \"UMAP_on_Harmony\")\n\nINTEG_R3:\nINTEG_R4:\n\nhvgs <- unique(unlist(hvgs_per_dataset))\n\nscelist <- list()\ngenelist <- list()\nfor (i in 1:length(sce.list)) {\n scelist[[i]] <- t(as.matrix(logcounts(sce.list[[i]])[hvgs, ]))\n genelist[[i]] <- hvgs\n}\n\nlapply(scelist, dim)\n\n[[1]]\n[1] 923 500\n\n[[2]]\n[1] 611 500\n\n[[3]]\n[1] 1111 500\n\n[[4]]\n[1] 1067 500\n\n[[5]]\n[1] 1203 500\n\n[[6]]\n[1] 1108 500\n\n\nINTEG_R5:\n\nscanorama <- reticulate::import(\"scanorama\")\n\nintegrated.data <- scanorama$integrate(datasets_full = scelist, genes_list = genelist)\n\nintdimred <- do.call(rbind, integrated.data[[1]])\ncolnames(intdimred) <- paste0(\"PC_\", 1:100)\nrownames(intdimred) <- colnames(logcounts(sce))\n\n# Add standard deviations in order to draw Elbow Plots in Seurat\nstdevs <- apply(intdimred, MARGIN = 2, FUN = sd)\nattr(intdimred, \"varExplained\") <- stdevs\n\nreducedDim(sce, \"Scanorama_PCA\") <- intdimred\n\n# Here we use all PCs computed from Scanorama for UMAP calculation\nsce <- runUMAP(sce, dimred = \"Scanorama_PCA\", n_dimred = 50, ncomponents = 2, name = \"UMAP_on_Scanorama\")\n\nINTEG_R6:\n\np1 <- plotReducedDim(sce, dimred = 
\"UMAP_on_PCA\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"UMAP_on_PCA\")\np2 <- plotReducedDim(sce, dimred = \"UMAP_on_MNN\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"UMAP_on_MNN\")\np3 <- plotReducedDim(sce, dimred = \"UMAP_on_Harmony\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"UMAP_on_Harmony\")\np4 <- plotReducedDim(sce, dimred = \"UMAP_on_Scanorama\", colour_by = \"sample\", point_size = 0.6) + ggplot2::ggtitle(label = \"UMAP_on_Scanorama\")\n\nwrap_plots(p1, p2, p3, p4, nrow = 2) +\n plot_layout(guides = \"collect\")\n\nINTEG_R7:\nLet’s save the integrated data for further analysis.\n\nsaveRDS(sce, \"data/covid/results/bioc_covid_qc_dr_int.rds\")" }, { "objectID": "labs/bioc/bioc_03_integration.html#meta-session", @@ -739,7 +739,7 @@ "href": "labs/bioc/bioc_05_dge.html#meta-dge_cmg", "title": " Differential gene expression", "section": "1 Cell marker genes", - "text": "1 Cell marker genes\nLet us first compute a ranking for the highly differential genes in each cluster. There are many different tests and parameters to be chosen that can be used to refine your results. 
When looking for marker genes, we want genes that are positivelly expressed in a cell type and possibly not expressed in the others.\n\n# Compute differentiall expression\nmarkers_genes <- scran::findMarkers(\n x = sce,\n groups = as.character(sce$louvain_SNNk15),\n lfc = .5,\n pval.type = \"all\",\n direction = \"up\"\n)\n\n# List of dataFrames with the results for each cluster\nmarkers_genes\n\nList of length 9\nnames(9): 1 2 3 4 5 6 7 8 9\n\n# Visualizing the expression of one\nmarkers_genes[[\"1\"]]\n\nDataFrame with 18183 rows and 11 columns\n p.value FDR summary.logFC logFC.2 logFC.3\n <numeric> <numeric> <numeric> <numeric> <numeric>\nS100A8 5.01536e-64 9.11942e-60 6.628056 7.76621 2.340367\nS100A12 5.88901e-53 5.35399e-49 1.787072 4.27763 1.787072\nS100A9 2.54322e-28 1.54144e-24 1.421390 7.39019 1.421390\nCXCL8 5.98014e-15 2.71842e-11 1.102967 1.58992 1.102967\nPLBD1 2.42988e-14 8.83649e-11 0.987264 2.43642 0.987264\n... ... ... ... ... ...\nAC007325.4 1 1 0.01104654 0.01104654 -0.004812566\nAL354822.1 1 1 -0.00785244 -0.00785244 0.000868684\nAC004556.1 1 1 0.02294381 -0.02462402 -0.124791403\nAC233755.1 1 1 -0.00670799 -0.00670799 0.000000000\nAC240274.1 1 1 -0.00724362 -0.00724362 -0.007032607\n logFC.4 logFC.5 logFC.6 logFC.7 logFC.8\n <numeric> <numeric> <numeric> <numeric> <numeric>\nS100A8 7.89619 7.78462 7.94406 7.88144 6.62806\nS100A12 4.31600 4.28998 4.31586 4.31295 4.26648\nS100A9 7.50841 7.42086 7.55250 7.55379 6.29102\nCXCL8 1.68719 1.54233 1.63129 1.63792 1.53139\nPLBD1 2.43135 2.44121 2.44252 2.44082 2.40550\n... ... ... ... ... 
...\nAC007325.4 -0.00271371 0.00667792 0.00417983 0.00809222 0.0110465\nAL354822.1 -0.01036855 -0.00936705 -0.00928158 -0.01539009 -0.0490755\nAC004556.1 -0.04927666 -0.01090129 -0.05200271 -0.04487633 0.0229438\nAC233755.1 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000\nAC240274.1 -0.01510737 -0.01125536 -0.00103067 -0.00380232 -0.0143902\n logFC.9\n <numeric>\nS100A8 6.27635\nS100A12 3.88182\nS100A9 4.81815\nCXCL8 1.54518\nPLBD1 1.81260\n... ...\nAC007325.4 -0.00652380\nAL354822.1 -0.00783011\nAC004556.1 -0.14233685\nAC233755.1 0.00000000\nAC240274.1 -0.01826009\n\n\nWe can now select the top 25 up regulated genes for plotting.\n\n# Colect the top 25 genes for each cluster and put the into a single table\ntop25 <- lapply(names(markers_genes), function(x) {\n temp <- markers_genes[[x]][1:25, 1:2]\n temp$gene <- rownames(markers_genes[[x]])[1:25]\n temp$cluster <- x\n return(temp)\n})\ntop25 <- as_tibble(do.call(rbind, top25))\ntop25$p.value[top25$p.value == 0] <- 1e-300\ntop25\n\n\n\n \n\n\n\n\npar(mfrow = c(1, 5), mar = c(4, 6, 3, 1))\nfor (i in unique(top25$cluster)) {\n barplot(sort(setNames(-log10(top25$p.value), top25$gene)[top25$cluster == i], F),\n horiz = T, las = 1, main = paste0(i, \" vs. rest\"), border = \"white\", yaxs = \"i\", xlab = \"-log10FC\"\n )\n abline(v = c(0, -log10(0.05)), lty = c(1, 2))\n}\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWe can visualize them as a heatmap. 
Here we are selecting the top 5.\n\nas_tibble(top25) %>%\n group_by(cluster) %>%\n top_n(-5, p.value) -> top5\n\nscater::plotHeatmap(sce[, order(sce$louvain_SNNk15)],\n features = unique(top5$gene),\n center = T, zlim = c(-3, 3),\n colour_columns_by = \"louvain_SNNk15\",\n show_colnames = F, cluster_cols = F,\n fontsize_row = 6,\n color = colorRampPalette(c(\"purple\", \"black\", \"yellow\"))(90)\n)\n\n\n\n\n\n\n\n\nWe can also plot a violin plot for each gene.\n\nscater::plotExpression(sce, features = unique(top5$gene), x = \"louvain_SNNk15\", ncol = 5, colour_by = \"louvain_SNNk15\", scales = \"free\")" + "text": "1 Cell marker genes\nLet us first compute a ranking for the highly differential genes in each cluster. There are many different tests and parameters to be chosen that can be used to refine your results. When looking for marker genes, we want genes that are positively expressed in a cell type and possibly not expressed in the others.\n\n# Compute differential expression\nmarkers_genes <- scran::findMarkers(\n x = sce,\n groups = as.character(sce$louvain_SNNk15),\n lfc = .5,\n pval.type = \"all\",\n direction = \"up\"\n)\n\n# List of dataFrames with the results for each cluster\nmarkers_genes\n\nList of length 9\nnames(9): 1 2 3 4 5 6 7 8 9\n\n# Inspect the results for one cluster\nmarkers_genes[[\"1\"]]\n\nDataFrame with 18186 rows and 11 columns\n p.value FDR summary.logFC logFC.2 logFC.3\n <numeric> <numeric> <numeric> <numeric> <numeric>\nS100A12 1.57321e-139 2.86104e-135 2.34116 4.13134 2.34116\nS100A8 1.35706e-64 1.23397e-60 6.52478 7.66360 3.33664\nS100A9 1.40449e-61 8.51405e-58 6.19181 7.33443 2.41140\nPLBD1 3.89784e-49 1.77215e-45 1.28043 2.32483 1.28043\nNAMPT 7.45257e-38 2.71065e-34 1.27817 2.67891 1.27817\n... ... ... ... ... 
...\nAC007325.4 1 1 0.00966451 0.00966451 0.000585433\nAL354822.1 1 1 -0.00710162 -0.00710162 0.000697440\nAC004556.1 1 1 -0.04593904 -0.05277778 -0.107041903\nAC233755.1 1 1 -0.00643585 -0.00643585 0.000000000\nAC240274.1 1 1 -0.00464419 -0.00464419 -0.003523507\n logFC.4 logFC.5 logFC.6 logFC.7 logFC.8\n <numeric> <numeric> <numeric> <numeric> <numeric>\nS100A12 4.17448 4.16371 4.16654 4.16271 4.11649\nS100A8 7.78141 7.69505 7.80820 7.76011 6.52478\nS100A9 7.42537 7.40085 7.47624 7.47041 6.19181\nPLBD1 2.32076 2.33067 2.33183 2.32822 2.28943\nNAMPT 2.76442 2.68668 2.75854 2.86208 2.81797\n... ... ... ... ... ...\nAC007325.4 -0.00472268 0.00533155 0.003317156 0.007609982 0.00966451\nAL354822.1 -0.00667383 -0.00850239 -0.008013634 -0.012927623 -0.01634707\nAC004556.1 -0.04331115 -0.01255718 -0.045939045 -0.042512552 0.01815203\nAC233755.1 0.00000000 0.00000000 0.000000000 0.000000000 0.00000000\nAC240274.1 -0.00702685 -0.00810242 0.000772945 -0.000256299 -0.01576446\n logFC.9\n <numeric>\nS100A12 4.12539\nS100A8 6.89910\nS100A9 5.25571\nPLBD1 1.85508\nNAMPT 1.62395\n... ...\nAC007325.4 -0.0146342\nAL354822.1 -0.0131441\nAC004556.1 -0.1608256\nAC233755.1 0.0000000\nAC240274.1 -0.0229031\n\n\nWe can now select the top 25 upregulated genes for plotting.\n\n# Collect the top 25 genes for each cluster and put them into a single table\ntop25 <- lapply(names(markers_genes), function(x) {\n temp <- markers_genes[[x]][1:25, 1:2]\n temp$gene <- rownames(markers_genes[[x]])[1:25]\n temp$cluster <- x\n return(temp)\n})\ntop25 <- as_tibble(do.call(rbind, top25))\ntop25$p.value[top25$p.value == 0] <- 1e-300\ntop25\n\n\n\n \n\n\n\n\npar(mfrow = c(1, 5), mar = c(4, 6, 3, 1))\nfor (i in unique(top25$cluster)) {\n barplot(sort(setNames(-log10(top25$p.value), top25$gene)[top25$cluster == i], F),\n horiz = T, las = 1, main = paste0(i, \" vs. 
rest\"), border = \"white\", yaxs = \"i\", xlab = \"-log10(p-value)\"\n )\n abline(v = c(0, -log10(0.05)), lty = c(1, 2))\n}\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWe can visualize them as a heatmap. Here we are selecting the top 5.\n\nas_tibble(top25) %>%\n group_by(cluster) %>%\n top_n(-5, p.value) -> top5\n\nscater::plotHeatmap(sce[, order(sce$louvain_SNNk15)],\n features = unique(top5$gene),\n center = T, zlim = c(-3, 3),\n colour_columns_by = \"louvain_SNNk15\",\n show_colnames = F, cluster_cols = F,\n fontsize_row = 6,\n color = colorRampPalette(c(\"purple\", \"black\", \"yellow\"))(90)\n)\n\n\n\n\n\n\n\n\nWe can also plot a violin plot for each gene.\n\nscater::plotExpression(sce, features = unique(top5$gene), x = \"louvain_SNNk15\", ncol = 5, colour_by = \"louvain_SNNk15\", scales = \"free\")" }, { "objectID": "labs/bioc/bioc_05_dge.html#meta-dge_cond", @@ -795,14 +795,14 @@ "href": "labs/bioc/bioc_06_celltyping.html#scmap", "title": " Celltype prediction", "section": "3 scMap", - "text": "3 scMap\nThe scMap package is one method for projecting cells from a scRNA-seq experiment on to the cell-types or individual cells identified in a different experiment. 
It can be run on different levels, either projecting by cluster or by single cell, here we will try out both.\nFor scmap cell type labels must be stored in the cell_type1 column of the colData slots, and gene ids that are consistent across both datasets must be stored in the feature_symbol column of the rowData slots.\n\n3.1 scMap cluster\n\n# add in slot cell_type1\nref.sce@colData$cell_type1 <- ref.sce@colData$cell_type\n# create a rowData slot with feature_symbol\nrd <- data.frame(feature_symbol = rownames(ref.sce))\nrownames(rd) <- rownames(ref.sce)\nrowData(ref.sce) <- rd\n\n# same for the ctrl dataset\n# create a rowData slot with feature_symbol\nrd <- data.frame(feature_symbol = rownames(ctrl.sce))\nrownames(rd) <- rownames(ctrl.sce)\nrowData(ctrl.sce) <- rd\n\nThen we can select variable features in both datasets.\n\n# select features\ncounts(ctrl.sce) <- as.matrix(counts(ctrl.sce))\nlogcounts(ctrl.sce) <- as.matrix(logcounts(ctrl.sce))\nctrl.sce <- selectFeatures(ctrl.sce, suppress_plot = TRUE)\n\ncounts(ref.sce) <- as.matrix(counts(ref.sce))\nlogcounts(ref.sce) <- as.matrix(logcounts(ref.sce))\nref.sce <- selectFeatures(ref.sce, suppress_plot = TRUE)\n\nThen we need to index the reference dataset by cluster, default is the clusters in cell_type1.\n\nref.sce <- indexCluster(ref.sce)\n\nNow we project the Covid-19 dataset onto that index.\n\nproject_cluster <- scmapCluster(\n projection = ctrl.sce,\n index_list = list(\n ref = metadata(ref.sce)$scmap_cluster_index\n )\n)\n\n# projected labels\ntable(project_cluster$scmap_cluster_labs)\n\n\n B cell CD4 T cell CD8 T cell cDC cMono ncMono \n 70 104 125 38 215 160 \n NK cell pDC Plasma cell unassigned \n 294 2 1 194 \n\n\nThen add the predictions to metadata and plot UMAP.\n\n# add in predictions\nctrl.sce@colData$scmap_cluster <- project_cluster$scmap_cluster_labs\n\nplotReducedDim(ctrl.sce, dimred = \"UMAP\", colour_by = \"scmap_cluster\")" + "text": "3 scMap\nThe scMap package is one method for projecting 
cells from a scRNA-seq experiment on to the cell-types or individual cells identified in a different experiment. It can be run on different levels, either projecting by cluster or by single cell, here we will try out both.\nFor scmap cell type labels must be stored in the cell_type1 column of the colData slots, and gene ids that are consistent across both datasets must be stored in the feature_symbol column of the rowData slots.\n\n3.1 scMap cluster\n\n# add in slot cell_type1\nref.sce@colData$cell_type1 <- ref.sce@colData$cell_type\n# create a rowData slot with feature_symbol\nrd <- data.frame(feature_symbol = rownames(ref.sce))\nrownames(rd) <- rownames(ref.sce)\nrowData(ref.sce) <- rd\n\n# same for the ctrl dataset\n# create a rowData slot with feature_symbol\nrd <- data.frame(feature_symbol = rownames(ctrl.sce))\nrownames(rd) <- rownames(ctrl.sce)\nrowData(ctrl.sce) <- rd\n\nThen we can select variable features in both datasets.\n\n# select features\ncounts(ctrl.sce) <- as.matrix(counts(ctrl.sce))\nlogcounts(ctrl.sce) <- as.matrix(logcounts(ctrl.sce))\nctrl.sce <- selectFeatures(ctrl.sce, suppress_plot = TRUE)\n\ncounts(ref.sce) <- as.matrix(counts(ref.sce))\nlogcounts(ref.sce) <- as.matrix(logcounts(ref.sce))\nref.sce <- selectFeatures(ref.sce, suppress_plot = TRUE)\n\nThen we need to index the reference dataset by cluster, default is the clusters in cell_type1.\n\nref.sce <- indexCluster(ref.sce)\n\nNow we project the Covid-19 dataset onto that index.\n\nproject_cluster <- scmapCluster(\n projection = ctrl.sce,\n index_list = list(\n ref = metadata(ref.sce)$scmap_cluster_index\n )\n)\n\n# projected labels\ntable(project_cluster$scmap_cluster_labs)\n\n\n B cell CD4 T cell CD8 T cell cDC cMono ncMono \n 69 105 124 38 215 160 \n NK cell pDC Plasma cell unassigned \n 294 2 1 195 \n\n\nThen add the predictions to metadata and plot UMAP.\n\n# add in predictions\nctrl.sce@colData$scmap_cluster <- project_cluster$scmap_cluster_labs\n\nplotReducedDim(ctrl.sce, dimred 
= \"UMAP\", colour_by = \"scmap_cluster\")" }, { "objectID": "labs/bioc/bioc_06_celltyping.html#scmap-cell", "href": "labs/bioc/bioc_06_celltyping.html#scmap-cell", "title": " Celltype prediction", "section": "4 scMap cell", - "text": "4 scMap cell\nWe can instead index the reference data based on each single cell and project our data onto the closest neighbor in that dataset.\n\nref.sce <- indexCell(ref.sce)\n\nAgain we need to index the reference dataset.\n\nproject_cell <- scmapCell(\n projection = ctrl.sce,\n index_list = list(\n ref = metadata(ref.sce)$scmap_cell_index\n )\n)\n\nWe now get a table with index for the 5 nearest neighbors in the reference dataset for each cell in our dataset. We will select the celltype of the closest neighbor and assign it to the data.\n\ncell_type_pred <- colData(ref.sce)$cell_type1[project_cell$ref[[1]][1, ]]\ntable(cell_type_pred)\n\ncell_type_pred\n B cell CD4 T cell CD8 T cell cDC cMono ncMono \n 101 161 293 37 241 164 \n NK cell pDC Plasma cell \n 203 2 1 \n\n\nThen add the predictions to metadata and plot umap.\n\n# add in predictions\nctrl.sce@colData$scmap_cell <- cell_type_pred\n\nplotReducedDim(ctrl.sce, dimred = \"UMAP\", colour_by = \"scmap_cell\")\n\n\n\n\n\n\n\n\nPlot both:\n\nwrap_plots(\n plotReducedDim(ctrl.sce, dimred = \"UMAP\", colour_by = \"scmap_cluster\"),\n plotReducedDim(ctrl.sce, dimred = \"UMAP\", colour_by = \"scmap_cell\"),\n ncol = 2\n)" + "text": "4 scMap cell\nWe can instead index the reference data based on each single cell and project our data onto the closest neighbor in that dataset.\n\nref.sce <- indexCell(ref.sce)\n\nAgain we need to index the reference dataset.\n\nproject_cell <- scmapCell(\n projection = ctrl.sce,\n index_list = list(\n ref = metadata(ref.sce)$scmap_cell_index\n )\n)\n\nWe now get a table with index for the 5 nearest neighbors in the reference dataset for each cell in our dataset. 
We will select the celltype of the closest neighbor and assign it to the data.\n\ncell_type_pred <- colData(ref.sce)$cell_type1[project_cell$ref[[1]][1, ]]\ntable(cell_type_pred)\n\ncell_type_pred\n B cell CD4 T cell CD8 T cell cDC cMono ncMono NK cell \n 102 176 300 65 187 189 182 \n pDC \n 2 \n\n\nThen add the predictions to metadata and plot umap.\n\n# add in predictions\nctrl.sce@colData$scmap_cell <- cell_type_pred\n\nplotReducedDim(ctrl.sce, dimred = \"UMAP\", colour_by = \"scmap_cell\")\n\n\n\n\n\n\n\n\nPlot both:\n\nwrap_plots(\n plotReducedDim(ctrl.sce, dimred = \"UMAP\", colour_by = \"scmap_cluster\"),\n plotReducedDim(ctrl.sce, dimred = \"UMAP\", colour_by = \"scmap_cell\"),\n ncol = 2\n)" }, { "objectID": "labs/bioc/bioc_06_celltyping.html#meta-ct_scpred", @@ -823,7 +823,7 @@ "href": "labs/bioc/bioc_06_celltyping.html#meta-ct_gsea", "title": " Celltype prediction", "section": "7 GSEA with celltype markers", - "text": "7 GSEA with celltype markers\nAnother option, where celltype can be classified on cluster level is to use gene set enrichment among the DEGs with known markers for different celltypes. Similar to how we did functional enrichment for the DEGs in the Differential expression exercise. There are some resources for celltype gene sets that can be used. Such as CellMarker, PanglaoDB or celltype gene sets at MSigDB. We can also look at overlap between DEGs in a reference dataset and the dataset you are analysing.\n\n7.1 DEG overlap\nFirst, lets extract top DEGs for our Covid-19 dataset and the reference dataset. 
When we run differential expression for our dataset, we want to report as many genes as possible, hence we set the cutoffs quite lenient.\n\n# run differential expression in our dataset, using clustering at resolution 0.3\nDGE_list <- scran::findMarkers(\n x = alldata,\n groups = as.character(alldata@colData$louvain_SNNk15),\n pval.type = \"all\",\n min.prop = 0\n)\n\n\n# Compute differential gene expression in reference dataset (that has cell annotation)\nref_DGE <- scran::findMarkers(\n x = ref.sce,\n groups = as.character(ref.sce@colData$cell_type),\n pval.type = \"all\",\n direction = \"up\"\n)\n\n# Identify the top cell marker genes in reference dataset\n# select top 50 with hihgest foldchange among top 100 signifcant genes.\nref_list <- lapply(ref_DGE, function(x) {\n x$logFC <- rowSums(as.matrix(x[, grep(\"logFC\", colnames(x))]))\n x %>%\n as.data.frame() %>%\n filter(p.value < 0.01) %>%\n top_n(-100, p.value) %>%\n top_n(50, logFC) %>%\n rownames()\n})\n\nunlist(lapply(ref_list, length))\n\n B cell CD4 T cell CD8 T cell cDC cMono ncMono \n 50 50 19 17 50 50 \n NK cell pDC Plasma cell \n 50 50 24 \n\n\nNow we can run GSEA for the DEGs from our dataset and check for enrichment of top DEGs in the reference dataset.\n\nsuppressPackageStartupMessages(library(fgsea))\n\n# run fgsea for each of the clusters in the list\nres <- lapply(DGE_list, function(x) {\n x$logFC <- rowSums(as.matrix(x[, grep(\"logFC\", colnames(x))]))\n gene_rank <- setNames(x$logFC, rownames(x))\n fgseaRes <- fgsea(pathways = ref_list, stats = gene_rank, nperm = 10000)\n return(fgseaRes)\n})\nnames(res) <- names(DGE_list)\n\n# You can filter and resort the table based on ES, NES or pvalue\nres <- lapply(res, function(x) {\n x[x$pval < 0.1, ]\n})\nres <- lapply(res, function(x) {\n x[x$size > 2, ]\n})\nres <- lapply(res, function(x) {\n x[order(x$NES, decreasing = T), ]\n})\nres\n\n$`1`\n pathway pval padj ES NES nMoreExtreme size\n1: cMono 0.0001612123 0.0005946089 0.9477365 1.935642 0 
47\n2: ncMono 0.0001611344 0.0005946089 0.8883004 1.824343 0 49\n3: cDC 0.0581929556 0.0654670750 -0.7642090 -1.413663 265 17\n4: Plasma cell 0.0263583815 0.0338893476 -0.7559870 -1.492311 113 24\n5: NK cell 0.0018440464 0.0027660695 -0.7327226 -1.663502 6 49\n6: CD8 T cell 0.0011008366 0.0019815059 -0.8963974 -1.673679 4 18\n7: B cell 0.0002632272 0.0005946089 -0.9032392 -2.032917 0 47\n8: CD4 T cell 0.0002642706 0.0005946089 -0.9254862 -2.108715 0 50\n leadingEdge\n1: S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...\n2: S100A11,AIF1,S100A4,FCER1G,MAFB,SERPINA1,...\n3: HLA-DPB1,HLA-DPA1,HLA-DQB1,HLA-DRB1,HLA-DMA,HLA-DRB5,...\n4: ISG20,PEBP1,CYCS,MIF,FKBP11,SPCS2,...\n5: GNLY,NKG7,B2M,CTSW,GZMA,FGFBP2,...\n6: IL32,CCL5,GZMH,CD3D,CD2,CD8A,...\n7: RPS5,CXCR4,RPL23A,CD52,RPL18A,RPL13A,...\n8: RPL3,RPS4X,RPS27A,RPL5,EEF1A1,RPL14,...\n\n$`2`\n pathway pval padj ES NES nMoreExtreme size\n1: B cell 0.0002041650 0.0003700658 0.9650595 2.060454 0 47\n2: CD4 T cell 0.0002055921 0.0003700658 0.8591045 1.846955 0 50\n3: cDC 0.0004203447 0.0006305170 0.9445632 1.709807 1 17\n4: CD8 T cell 0.0021048603 0.0027062490 -0.8921894 -1.641239 10 18\n5: cMono 0.0001959248 0.0003700658 -0.8185447 -1.761319 0 47\n6: ncMono 0.0001940994 0.0003700658 -0.8829761 -1.915489 0 49\n7: NK cell 0.0001940994 0.0003700658 -0.9127279 -1.980031 0 49\n leadingEdge\n1: MS4A1,CD37,TNFRSF13C,CXCR4,BANK1,CD79B,...\n2: RPS6,RPL13,RPL32,RPS3A,RPS29,RPL3,...\n3: HLA-DRA,HLA-DPB1,HLA-DQB1,HLA-DRB1,HLA-DPA1,HLA-DMA,...\n4: CCL5,IL32,GZMH,CD3D,CD2,LYAR,...\n5: S100A6,S100A9,LYZ,S100A8,TYROBP,FCN1,...\n6: S100A4,FCER1G,S100A11,AIF1,IFITM3,LST1,...\n7: HCST,NKG7,ITGB2,GNLY,MYO1F,CST7,...\n\n$`3`\n pathway pval padj ES NES nMoreExtreme size\n1: ncMono 0.0001041124 0.0004694836 0.9309715 1.625137 0 49\n2: cMono 0.0001043297 0.0004694836 0.9315183 1.624154 0 47\n3: cDC 0.0168105930 0.0216136195 0.8590261 1.386413 145 17\n4: CD4 T cell 0.0026666667 0.0040000000 -0.7020776 -1.886878 0 50\n5: NK cell 0.0025188917 0.0040000000 
-0.7120017 -1.914447 0 49\n6: CD8 T cell 0.0007980846 0.0023942538 -0.9359176 -2.017558 0 18\n7: B cell 0.0023980815 0.0040000000 -0.8774013 -2.326466 0 47\n leadingEdge\n1: AIF1,PSAP,S100A11,FCER1G,S100A4,SERPINA1,...\n2: S100A9,LYZ,S100A8,FCN1,TYROBP,S100A6,...\n3: HLA-DRA,HLA-DRB1,HLA-DRB5,HLA-DQB1,HLA-DPA1,HLA-DMA,...\n4: RPL3,PIK3IP1,IL7R,RPS29,RPS3,RPS27A,...\n5: NKG7,GNLY,CST7,GZMA,CTSW,GZMM,...\n6: CCL5,IL32,GZMH,CD3D,CD2,CD8A,...\n7: CXCR4,MS4A1,TNFRSF13C,CD79B,BANK1,RPS5,...\n\n$`4`\n pathway pval padj ES NES nMoreExtreme size\n1: CD4 T cell 0.0001930875 0.0004653568 0.9803622 2.131622 0 50\n2: NK cell 0.0275077559 0.0412616339 -0.6668272 -1.466630 132 49\n3: cDC 0.0001991239 0.0004653568 -0.9322686 -1.728863 0 17\n4: pDC 0.0006202191 0.0011163945 -0.8171519 -1.789912 2 47\n5: cMono 0.0002067397 0.0004653568 -0.9186945 -2.012333 0 47\n6: ncMono 0.0002068252 0.0004653568 -0.9263802 -2.037495 0 49\n leadingEdge\n1: IL7R,LDHB,PIK3IP1,NOSIP,RPL3,RPS12,...\n2: NKG7,GNLY,FGFBP2,MYO1F,CST7,GZMA,...\n3: HLA-DRA,HLA-DRB1,HLA-DPA1,HLA-DPB1,HLA-DQB1,HLA-DMA,...\n4: PLEK,NPC2,IRF8,PLAC8,PTPRE,CTSB,...\n5: S100A9,S100A8,LYZ,TYROBP,FCN1,APLP2,...\n6: FCER1G,PSAP,IFITM3,LYN,SAT1,LST1,...\n\n$`5`\n pathway pval padj ES NES nMoreExtreme size\n1: B cell 0.0001818182 0.0004016064 0.9624502 2.052882 0 47\n2: CD4 T cell 0.0001812251 0.0004016064 0.8762641 1.886926 0 50\n3: cDC 0.0001904399 0.0004016064 0.9538608 1.738185 0 17\n4: CD8 T cell 0.0004203447 0.0006305170 -0.9046911 -1.711837 1 18\n5: cMono 0.0008884940 0.0011423494 -0.7954796 -1.765723 3 47\n6: ncMono 0.0002231147 0.0004016064 -0.8859954 -1.977394 0 49\n7: NK cell 0.0002231147 0.0004016064 -0.9087684 -2.028219 0 49\n leadingEdge\n1: MS4A1,CD37,CXCR4,TNFRSF13C,BANK1,LINC00926,...\n2: RPS6,RPL13,RPL32,RPS3A,RPL9,RPL3,...\n3: HLA-DRA,HLA-DQB1,HLA-DRB1,HLA-DPB1,HLA-DPA1,HLA-DMA,...\n4: CCL5,IL32,GZMH,CD3D,CD2,LYAR,...\n5: S100A6,S100A9,LYZ,S100A8,TYROBP,FCN1,...\n6: S100A4,FCER1G,S100A11,AIF1,PSAP,IFITM3,...\n7: 
HCST,NKG7,ITGB2,GNLY,MYO1F,CST7,...\n\n$`6`\n pathway pval padj ES NES nMoreExtreme size\n1: NK cell 0.0001968117 0.0003660024 0.9357367 2.012182 0 49\n2: CD4 T cell 0.0001970443 0.0003660024 0.8648575 1.865254 0 50\n3: CD8 T cell 0.0002002804 0.0003660024 0.9667190 1.776197 0 18\n4: cDC 0.0047732697 0.0071599045 -0.8811814 -1.612760 23 17\n5: ncMono 0.0002032107 0.0003660024 -0.8655401 -1.895657 0 49\n6: cMono 0.0002033347 0.0003660024 -0.9182094 -1.999151 0 47\n leadingEdge\n1: NKG7,GNLY,CST7,GZMA,CTSW,GZMM,...\n2: IL7R,RPS3,RPS29,RPL3,MGAT4A,RPS4X,...\n3: CCL5,IL32,GZMH,CD3D,LYAR,CD8A,...\n4: HLA-DRA,HLA-DMA,HLA-DQB1,HLA-DRB5,BASP1,HLA-DRB1,...\n5: FCER1G,AIF1,LST1,FTH1,COTL1,PSAP,...\n6: S100A9,S100A8,LYZ,TYROBP,FCN1,VCAN,...\n\n$`7`\n pathway pval padj ES NES nMoreExtreme size\n1: NK cell 0.0002246686 0.0006740058 0.9822433 2.117581 0 49\n2: CD8 T cell 0.0052356021 0.0067314884 0.8934917 1.648012 23 18\n3: cDC 0.0007408779 0.0016233766 -0.9096050 -1.649017 3 17\n4: ncMono 0.0025220681 0.0037831021 -0.7690981 -1.653101 13 49\n5: CD4 T cell 0.0009018759 0.0016233766 -0.8069090 -1.736711 4 50\n6: cMono 0.0001806685 0.0006740058 -0.8740244 -1.867198 0 47\n7: B cell 0.0001806685 0.0006740058 -0.8943406 -1.910600 0 47\n leadingEdge\n1: GNLY,NKG7,FGFBP2,CST7,PRF1,CTSW,...\n2: CCL5,GZMH,IL32,LYAR,CD2,LINC01871,...\n3: HLA-DRA,HLA-DRB1,HLA-DQB1,HLA-DPA1,HLA-DMA,HLA-DRB5,...\n4: COTL1,FTH1,AIF1,LST1,SAT1,SPI1,...\n5: TMEM123,RPS13,RPL22,RPS28,RPL35A,RPL36,...\n6: S100A9,S100A8,LYZ,FCN1,TKT,VCAN,...\n7: CD37,RPS11,MS4A1,CD52,BANK1,TNFRSF13C,...\n\n$`8`\n pathway pval padj ES NES nMoreExtreme size\n1: ncMono 0.0021600605 0.003240091 -0.7537958 -1.411206 19 49\n2: NK cell 0.0006480181 0.001166433 -0.7784508 -1.457363 5 49\n3: B cell 0.0004329004 0.001166433 -0.7863661 -1.466871 3 47\n4: cDC 0.0005745145 0.001166433 -0.8884593 -1.499586 4 17\n5: cMono 0.0001082251 0.000487013 -0.8319138 -1.551835 0 47\n6: CD4 T cell 0.0001077702 0.000487013 -0.9066494 -1.701390 0 50\n 
leadingEdge\n1: S100A4,S100A11,AIF1,IFITM2,CEBPB,SERPINA1,...\n2: ITGB2,NKG7,GNLY,MYO1F,IFITM1,JAK1,...\n3: CD52,RPS23,RPL13A,RPS11,RPL12,FAU,...\n4: HLA-DRA,HLA-DRB1,HLA-DPB1,HLA-DPA1,HLA-DQB1,HLA-DMA,...\n5: JUND,S100A6,NFKBIA,TYROBP,LYZ,FOS,...\n6: RPL34,RPS13,RPL13,EEF1A1,RPS3A,RPL32,...\n\n$`9`\n pathway pval padj ES NES nMoreExtreme size\n1: ncMono 0.0001191611 0.001072450 0.9705242 1.879820 0 49\n2: cDC 0.0061555680 0.011080022 0.8911415 1.525520 43 17\n3: cMono 0.0129496403 0.016649538 0.7656658 1.476902 107 47\n4: Plasma cell 0.0330511890 0.037182588 -0.7002547 -1.547523 81 24\n5: NK cell 0.0105590062 0.015838509 -0.6315449 -1.603456 16 49\n6: CD8 T cell 0.0007165890 0.001612325 -0.8974765 -1.869886 1 18\n7: CD4 T cell 0.0006422608 0.001612325 -0.8507977 -2.161552 0 50\n8: B cell 0.0006016847 0.001612325 -0.8721690 -2.198723 0 47\n leadingEdge\n1: AIF1,LST1,COTL1,FCER1G,PSAP,FCGR3A,...\n2: HLA-DPA1,HLA-DRA,HLA-DPB1,HLA-DRB1,HLA-DRB5,HLA-DMA,...\n3: LYZ,TYROBP,S100A6,FCN1,TKT,S100A9,...\n4: ISG20,CYCS,FKBP11,PEBP1,JCHAIN,MZB1,...\n5: CST7,IFITM1,GZMM,CCL4,CD247,HOPX,...\n6: CCL5,IL32,CD3D,GZMH,CD2,LYAR,...\n7: RPL31,RPS29,IL7R,RPS3,RPS27A,CCR7,...\n8: CXCR4,MS4A1,BANK1,TNFRSF13C,LINC00926,RALGPS2,...\n\n\nSelecing top significant overlap per cluster, we can now rename the clusters according to the predicted labels. OBS! Be aware that if you have some clusters that have non-significant p-values for all the gene sets, the cluster label will not be very reliable. Also, the gene sets you are using may not cover all the celltypes you have in your dataset and hence predictions may just be the most similar celltype. 
Also, some of the clusters have very similar p-values to multiple celltypes, for instance the ncMono and cMono celltypes are equally good for some clusters.\n\nnew.cluster.ids <- unlist(lapply(res, function(x) {\n as.data.frame(x)[1, 1]\n}))\n\nalldata@colData$ref_gsea <- new.cluster.ids[as.character(alldata@colData$louvain_SNNk15)]\n\nwrap_plots(\n plotReducedDim(alldata, dimred = \"UMAP\", colour_by = \"louvain_SNNk15\"),\n plotReducedDim(alldata, dimred = \"UMAP\", colour_by = \"ref_gsea\"),\n ncol = 2\n)\n\n\n\n\n\n\n\n\nCompare to results with the other celltype prediction methods in the ctrl_13 sample.\n\nctrl.sce@colData$ref_gsea <- alldata@colData$ref_gsea[alldata@colData$sample == \"ctrl.13\"]\n\nwrap_plots(\n plotReducedDim(ctrl.sce, dimred = \"UMAP\", colour_by = \"ref_gsea\"),\n plotReducedDim(ctrl.sce, dimred = \"UMAP\", colour_by = \"scmap_cell\"),\n plotReducedDim(ctrl.sce, dimred = \"UMAP\", colour_by = \"scpred_prediction\"),\n ncol = 3\n)\n\n\n\n\n\n\n\n\n\n\n7.2 With annotated gene sets\nWe have downloaded the celltype gene lists from http://bio-bigdata.hrbmu.edu.cn/CellMarker/CellMarker_download.html and converted the Excel file to a CSV for you. 
Read in the gene lists and do some filtering.\n\npath_file <- file.path(\"data/human_cell_markers.txt\")\nif (!file.exists(path_file)) download.file(file.path(path_data, \"human_cell_markers.txt\"), destfile = path_file)\n\n\nmarkers <- read.delim(\"data/human_cell_markers.txt\")\nmarkers <- markers[markers$speciesType == \"Human\", ]\nmarkers <- markers[markers$cancerType == \"Normal\", ]\n\n# Filter by tissue (to reduce computational time and have tissue-specific classification)\n# sort(unique(markers$tissueType))\n# grep(\"blood\",unique(markers$tissueType),value = T)\n# markers <- markers [ markers$tissueType %in% c(\"Blood\",\"Venous blood\",\n# \"Serum\",\"Plasma\",\n# \"Spleen\",\"Bone marrow\",\"Lymph node\"), ]\n\n\n# remove strange characters etc.\ncelltype_list <- lapply(unique(markers$cellName), function(x) {\n x <- paste(markers$geneSymbol[markers$cellName == x], sep = \",\")\n x <- gsub(\"[[]|[]]| |-\", \",\", x)\n x <- unlist(strsplit(x, split = \",\"))\n x <- unique(x[!x %in% c(\"\", \"NA\", \"family\")])\n x <- casefold(x, upper = T)\n})\nnames(celltype_list) <- unique(markers$cellName)\n# celltype_list <- lapply(celltype_list , function(x) {x[1:min(length(x),50)]} )\ncelltype_list <- celltype_list[unlist(lapply(celltype_list, length)) < 100]\ncelltype_list <- celltype_list[unlist(lapply(celltype_list, length)) > 5]\n\n\n# run fgsea for each of the clusters in the list\nres <- lapply(DGE_list, function(x) {\n x$logFC <- rowSums(as.matrix(x[, grep(\"logFC\", colnames(x))]))\n gene_rank <- setNames(x$logFC, rownames(x))\n fgseaRes <- fgsea(pathways = celltype_list, stats = gene_rank, nperm = 10000)\n return(fgseaRes)\n})\nnames(res) <- names(DGE_list)\n\n# You can filter and resort the table based on ES, NES or pvalue\nres <- lapply(res, function(x) {\n x[x$pval < 0.01, ]\n})\nres <- lapply(res, function(x) {\n x[x$size > 5, ]\n})\nres <- lapply(res, function(x) {\n x[order(x$NES, decreasing = T), ]\n})\n\n# show top 3 for each cluster.\nlapply(res, 
head, 3)\n\n$`1`\n pathway pval padj ES NES\n1: Neutrophil 0.0001507613 0.01493723 0.9197310 2.010307\n2: CD1C+_B dendritic cell 0.0001589067 0.01493723 0.9293164 1.931839\n3: Stromal cell 0.0013311148 0.05004992 0.8544544 1.696909\n nMoreExtreme size leadingEdge\n1: 0 80 S100A8,S100A9,S100A12,MNDA,S100A11,NAMPT,...\n2: 0 53 S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...\n3: 7 38 VIM,TIMP2,BST1,TIMP1,ANPEP,CD44,...\n\n$`2`\n pathway pval padj ES NES\n1: Follicular B cell 0.006354586 0.05430282 0.8587199 1.627043\n2: Pyramidal cell 0.003853565 0.04168250 -0.9722789 -1.490874\n3: CD4+CD25+ regulatory T cell 0.001541426 0.02414900 -0.9799548 -1.502644\n nMoreExtreme size leadingEdge\n1: 29 22 MS4A1,CD69,CD22,FCER2,CD40,PAX5,...\n2: 19 6 NRGN,CD3E\n3: 7 6 CD3E,CD3D,CD3G,PTPRC,CD4\n\n$`3`\n pathway pval padj ES NES\n1: Neutrophil 0.0001011327 0.007217168 0.8809821 1.569285\n2: CD1C+_B dendritic cell 0.0001033271 0.007217168 0.8836167 1.550651\n3: Monocyte derived dendritic cell 0.0001151676 0.007217168 0.9481164 1.532539\n nMoreExtreme size leadingEdge\n1: 0 80 S100A9,S100A8,S100A11,CD14,LST1,MNDA,...\n2: 0 53 S100A9,LYZ,S100A8,FCN1,VCAN,CD14,...\n3: 0 17 S100A9,S100A8,CST3,CD14,CD33,ITGAX,...\n\n$`4`\n pathway pval padj ES NES nMoreExtreme\n1: Naive CD8+ T cell 0.0001888218 0.005616299 0.8620656 2.045525 0\n2: Naive CD4+ T cell 0.0002017756 0.005616299 0.9214751 1.879833 0\n3: CD4+ T cell 0.0002022654 0.005616299 0.9193037 1.787130 0\n size leadingEdge\n1: 91 LDHB,PIK3IP1,NOSIP,TCF7,RCAN3,NPM1,...\n2: 34 IL7R,NOSIP,TCF7,EEF1B2,RPS5,MAL,...\n3: 25 IL7R,LTB,CD3E,CD3D,CD3G,CD2,...\n\n$`5`\n pathway pval padj ES NES\n1: Follicular B cell 0.005346572 0.04188148 0.8501224 1.610208\n2: Hematopoietic precursor cell 0.008534851 0.06171354 -0.9521366 -1.493451\n3: Pyramidal cell 0.003048161 0.03581589 -0.9725160 -1.525417\n nMoreExtreme size leadingEdge\n1: 27 22 MS4A1,CD69,CD22,CD40,FCER2,PAX5,...\n2: 41 6 CD14,PTPRC\n3: 14 6 CD3E,NRGN\n\n$`6`\n pathway pval padj ES\n1: CD4+ cytotoxic 
T cell 0.0001908761 0.007875995 0.8929282\n2: Natural killer cell 0.0003821899 0.009483454 0.7967208\n3: Effector CD8+ memory T (Tem) cell 0.0003824092 0.009483454 0.7969411\n NES nMoreExtreme size leadingEdge\n1: 2.063730 0 86 CCL5,NKG7,GZMH,GNLY,CST7,GZMA,...\n2: 1.835585 1 84 NKG7,GNLY,CD3D,CD3E,GZMA,CD3G,...\n3: 1.824241 1 79 GZMH,GNLY,ARL4C,GZMB,FGFBP2,KLRD1,...\n\n$`7`\n pathway pval padj ES NES\n1: CD4+ cytotoxic T cell 0.0002165909 0.01025753 0.9480244 2.205220\n2: Effector CD8+ memory T (Tem) cell 0.0002165909 0.01025753 0.8968211 2.068348\n3: Natural killer cell 0.0002182453 0.01025753 0.8507701 1.972715\n nMoreExtreme size leadingEdge\n1: 0 86 GNLY,NKG7,GZMB,FGFBP2,CCL5,CST7,...\n2: 0 79 GNLY,GZMB,FGFBP2,KLRD1,SPON2,GZMH,...\n3: 0 84 GNLY,NKG7,GZMB,GZMA,CD247,KLRD1,...\n\n$`8`\n pathway pval padj ES NES nMoreExtreme\n1: Megakaryocyte 0.002577320 0.08490323 0.7934901 1.757021 2\n2: Neutrophil 0.008846794 0.11600655 -0.6842598 -1.340588 84\n3: Mesenchymal cell 0.009494346 0.11600655 -0.7144618 -1.363128 88\n size leadingEdge\n1: 25 PPBP,PF4,GP9,ITGA2B,CD9,RASGRP2,...\n2: 80 PTPRC,ITGB2,S100A11,CD44,IFITM2,S100A12,...\n3: 58 S100A4,PTPRC,VIM,CD44,ZEB2,CTSC,...\n\n$`9`\n pathway pval padj ES NES\n1: Mesenchymal cell 0.0001175917 0.02210724 0.8495970 1.678997\n2: Stromal cell 0.0007528231 0.04569762 0.8602790 1.630578\n3: Endometrial stem cell 0.0029594138 0.06821588 0.9013667 1.560572\n nMoreExtreme size leadingEdge\n1: 0 58 COTL1,S100A4,VIM,CTSC,HES4,ZEB2,...\n2: 5 38 VIM,PECAM1,TIMP1,CD44,TIMP2,ICAM3,...\n3: 20 18 PECAM1,CD44,PTPRC,ITGA4,ITGB1,ENG,...\n\n\n#CT_GSEA8:\n\nnew.cluster.ids <- unlist(lapply(res, function(x) {\n as.data.frame(x)[1, 1]\n}))\nalldata@colData$cellmarker_gsea <- new.cluster.ids[as.character(alldata@colData$louvain_SNNk15)]\n\nwrap_plots(\n plotReducedDim(alldata, dimred = \"UMAP\", colour_by = \"cellmarker_gsea\"),\n plotReducedDim(alldata, dimred = \"UMAP\", colour_by = \"ref_gsea\"),\n ncol = 
2\n)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nDo you think that the methods overlap well? Where do you see the most inconsistencies?\n\n\nIn this case we do not have any ground truth, and we cannot say which method performs best. You should keep in mind, that any celltype classification method is just a prediction, and you still need to use your common sense and knowledge of the biological system to judge if the results make sense.\nFinally, lets save the data with predictions.\n\nsaveRDS(ctrl.sce, \"data/covid/results/bioc_covid_qc_dr_int_cl_ct-ctrl13.rds\")" + "text": "7 GSEA with celltype markers\nAnother option, where celltype can be classified on cluster level is to use gene set enrichment among the DEGs with known markers for different celltypes. Similar to how we did functional enrichment for the DEGs in the Differential expression exercise. There are some resources for celltype gene sets that can be used. Such as CellMarker, PanglaoDB or celltype gene sets at MSigDB. We can also look at overlap between DEGs in a reference dataset and the dataset you are analysing.\n\n7.1 DEG overlap\nFirst, lets extract top DEGs for our Covid-19 dataset and the reference dataset. 
When we run differential expression for our dataset, we want to report as many genes as possible, hence we set the cutoffs quite lenient.\n\n# run differential expression in our dataset, using clustering at resolution 0.3\nDGE_list <- scran::findMarkers(\n x = alldata,\n groups = as.character(alldata@colData$louvain_SNNk15),\n pval.type = \"all\",\n min.prop = 0\n)\n\n\n# Compute differential gene expression in reference dataset (that has cell annotation)\nref_DGE <- scran::findMarkers(\n x = ref.sce,\n groups = as.character(ref.sce@colData$cell_type),\n pval.type = \"all\",\n direction = \"up\"\n)\n\n# Identify the top cell marker genes in reference dataset\n# select top 50 with hihgest foldchange among top 100 signifcant genes.\nref_list <- lapply(ref_DGE, function(x) {\n x$logFC <- rowSums(as.matrix(x[, grep(\"logFC\", colnames(x))]))\n x %>%\n as.data.frame() %>%\n filter(p.value < 0.01) %>%\n top_n(-100, p.value) %>%\n top_n(50, logFC) %>%\n rownames()\n})\n\nunlist(lapply(ref_list, length))\n\n B cell CD4 T cell CD8 T cell cDC cMono ncMono \n 50 50 19 17 50 50 \n NK cell pDC Plasma cell \n 50 50 24 \n\n\nNow we can run GSEA for the DEGs from our dataset and check for enrichment of top DEGs in the reference dataset.\n\nsuppressPackageStartupMessages(library(fgsea))\n\n# run fgsea for each of the clusters in the list\nres <- lapply(DGE_list, function(x) {\n x$logFC <- rowSums(as.matrix(x[, grep(\"logFC\", colnames(x))]))\n gene_rank <- setNames(x$logFC, rownames(x))\n fgseaRes <- fgsea(pathways = ref_list, stats = gene_rank, nperm = 10000)\n return(fgseaRes)\n})\nnames(res) <- names(DGE_list)\n\n# You can filter and resort the table based on ES, NES or pvalue\nres <- lapply(res, function(x) {\n x[x$pval < 0.1, ]\n})\nres <- lapply(res, function(x) {\n x[x$size > 2, ]\n})\nres <- lapply(res, function(x) {\n x[order(x$NES, decreasing = T), ]\n})\nres\n\n$`1`\n pathway pval padj ES NES nMoreExtreme size\n1: cMono 0.0001327492 0.0007607777 0.9515770 1.824481 0 
47\n2: ncMono 0.0003952048 0.0007607777 0.8775149 1.692495 2 49\n3: NK cell 0.0070510162 0.0105765243 -0.6936107 -1.614830 16 49\n4: CD8 T cell 0.0002904444 0.0007607777 -0.9042254 -1.737436 0 18\n5: B cell 0.0004050223 0.0007607777 -0.9085213 -2.107981 0 47\n6: CD4 T cell 0.0004226543 0.0007607777 -0.9246391 -2.155367 0 50\n leadingEdge\n1: S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...\n2: S100A11,AIF1,S100A4,FCER1G,MAFB,SAT1,...\n3: GNLY,NKG7,CTSW,GZMA,B2M,GZMM,...\n4: IL32,CCL5,GZMH,CD3D,CD2,CD8A,...\n5: RPS5,CXCR4,RPL23A,CD52,RPL18A,RPL13A,...\n6: RPL3,RPS4X,RPS27A,RPL5,EEF1A1,RPL14,...\n\n$`2`\n pathway pval padj ES NES nMoreExtreme size\n1: B cell 0.0001973554 0.000365408 0.9639203 2.032368 0 47\n2: CD4 T cell 0.0001970055 0.000365408 0.8696162 1.846240 0 50\n3: cDC 0.0001979806 0.000365408 0.9506666 1.711665 0 17\n4: CD8 T cell 0.0016083635 0.002067896 -0.8930677 -1.641590 7 18\n5: cMono 0.0008105370 0.001215805 -0.7964559 -1.721447 3 47\n6: ncMono 0.0002030045 0.000365408 -0.8998544 -1.960178 0 49\n7: NK cell 0.0002030045 0.000365408 -0.9119291 -1.986480 0 49\n leadingEdge\n1: MS4A1,CD37,CXCR4,TNFRSF13C,BANK1,CD79B,...\n2: RPS6,RPL13,RPL32,RPS3A,RPS29,RPL3,...\n3: HLA-DRA,HLA-DPB1,HLA-DQB1,HLA-DRB1,HLA-DPA1,HLA-DMA,...\n4: CCL5,IL32,GZMH,CD3D,CD2,LYAR,...\n5: S100A6,S100A9,TYROBP,LYZ,S100A8,FCN1,...\n6: S100A4,FCER1G,S100A11,AIF1,LST1,IFITM3,...\n7: HCST,NKG7,ITGB2,GNLY,MYO1F,CST7,...\n\n$`3`\n pathway pval padj ES NES nMoreExtreme size\n1: cMono 0.0001162115 0.0005229518 0.9366715 1.754906 0 47\n2: ncMono 0.0001152206 0.0005229518 0.9282959 1.748653 0 49\n3: cDC 0.0058249797 0.0074892596 0.8938512 1.504106 42 17\n4: Plasma cell 0.0304961311 0.0343081475 -0.7003557 -1.517735 66 24\n5: NK cell 0.0007558579 0.0011583012 -0.7163352 -1.783214 0 49\n6: CD8 T cell 0.0003958828 0.0011583012 -0.9181589 -1.871343 0 18\n7: CD4 T cell 0.0007722008 0.0011583012 -0.7661201 -1.917423 0 50\n8: B cell 0.0007158196 0.0011583012 -0.8988784 -2.222805 0 47\n leadingEdge\n1: 
LYZ,S100A9,S100A8,FCN1,TYROBP,S100A6,...\n2: AIF1,PSAP,S100A4,FCER1G,S100A11,COTL1,...\n3: HLA-DRA,HLA-DRB1,HLA-DRB5,HLA-DPA1,HLA-DQB1,HLA-DPB1,...\n4: ISG20,CYCS,FKBP11,JCHAIN,MZB1,PEBP1,...\n5: NKG7,CST7,GZMM,CTSW,GZMA,FGFBP2,...\n6: CCL5,IL32,CD3D,GZMH,CD2,CD8A,...\n7: PIK3IP1,RPS29,IL7R,RPS27A,RPL3,RPS3,...\n8: CXCR4,MS4A1,CD79B,RPS5,TNFRSF13C,BANK1,...\n\n$`4`\n pathway pval padj ES NES nMoreExtreme size\n1: CD4 T cell 0.0002101723 0.0004728878 0.9821321 2.134294 0 50\n2: NK cell 0.0339112212 0.0508668318 -0.6722891 -1.452462 177 49\n3: cDC 0.0001847404 0.0004728878 -0.9340033 -1.702542 0 17\n4: pDC 0.0005699088 0.0010258359 -0.8226825 -1.767306 2 47\n5: cMono 0.0001899696 0.0004728878 -0.9129405 -1.961201 0 47\n6: ncMono 0.0001905125 0.0004728878 -0.9484681 -2.049139 0 49\n leadingEdge\n1: IL7R,LDHB,PIK3IP1,RPL3,RPS12,RPL13,...\n2: NKG7,GNLY,MYO1F,FGFBP2,CST7,ITGB2,...\n3: HLA-DRA,HLA-DRB1,HLA-DPA1,HLA-DPB1,HLA-DQB1,HLA-DMA,...\n4: PLEK,NPC2,PLAC8,IRF8,CTSB,PTPRE,...\n5: S100A9,TYROBP,S100A8,LYZ,FCN1,APLP2,...\n6: FCER1G,PSAP,IFITM3,LST1,SAT1,AIF1,...\n\n$`5`\n pathway pval padj ES NES nMoreExtreme size\n1: B cell 0.0001963479 0.0003684749 0.9641767 2.042223 0 47\n2: CD4 T cell 0.0001947799 0.0003684749 0.8825397 1.887378 0 50\n3: cDC 0.0002023063 0.0003684749 0.9586399 1.728487 0 17\n4: CD8 T cell 0.0013938670 0.0020908005 -0.8995956 -1.657233 6 18\n5: cMono 0.0018333673 0.0023571865 -0.7742663 -1.687829 8 47\n6: ncMono 0.0002047083 0.0003684749 -0.9119300 -1.999937 0 49\n7: NK cell 0.0002047083 0.0003684749 -0.9130713 -2.002440 0 49\n leadingEdge\n1: MS4A1,CD37,CXCR4,TNFRSF13C,BANK1,LINC00926,...\n2: RPS6,RPL13,RPL32,RPS3A,RPL9,RPS29,...\n3: HLA-DRA,HLA-DQB1,HLA-DRB1,HLA-DPB1,HLA-DPA1,HLA-DMA,...\n4: CCL5,IL32,GZMH,CD3D,CD2,LYAR,...\n5: S100A6,S100A9,LYZ,TYROBP,S100A8,FCN1,...\n6: S100A4,FCER1G,S100A11,AIF1,PSAP,LST1,...\n7: ITGB2,NKG7,HCST,GNLY,MYO1F,CST7,...\n\n$`6`\n pathway pval padj ES NES nMoreExtreme size\n1: NK cell 0.0001863586 0.0003882657 
0.9383295 1.976633 0 49\n2: CD4 T cell 0.0001846381 0.0003882657 0.8789241 1.861605 0 50\n3: CD8 T cell 0.0001971220 0.0003882657 0.9670054 1.759568 0 18\n4: pDC 0.0952073931 0.1224095054 -0.6041872 -1.319390 442 47\n5: cDC 0.0034246575 0.0051369863 -0.8829998 -1.620873 16 17\n6: ncMono 0.0002157032 0.0003882657 -0.8872842 -1.952006 0 49\n7: cMono 0.0002149151 0.0003882657 -0.9085014 -1.983934 0 47\n leadingEdge\n1: NKG7,GNLY,CST7,GZMA,CTSW,GZMM,...\n2: IL7R,RPS3,RPS29,RPL3,RPL13,RPS6,...\n3: CCL5,IL32,GZMH,CD3D,LYAR,CD8A,...\n4: NPC2,CTSB,IRF8,UNC93B1,PLEK,TCF4,...\n5: HLA-DRA,HLA-DQB1,HLA-DRB5,HLA-DMA,HLA-DRB1,BASP1,...\n6: FCER1G,AIF1,LST1,FTH1,COTL1,PSAP,...\n7: S100A9,S100A8,LYZ,TYROBP,FCN1,TKT,...\n\n$`7`\n pathway pval padj ES NES nMoreExtreme size\n1: NK cell 0.0002319109 0.0006957328 0.9845619 2.113823 0 49\n2: CD8 T cell 0.0048098946 0.0061841503 0.9021784 1.664902 20 18\n3: cDC 0.0005337129 0.0009606832 -0.9182682 -1.651815 2 17\n4: CD4 T cell 0.0008767315 0.0013150973 -0.7915340 -1.672841 4 50\n5: ncMono 0.0005272408 0.0009606832 -0.8127856 -1.712117 2 49\n6: cMono 0.0001757469 0.0006957328 -0.8702759 -1.823002 0 47\n7: B cell 0.0001757469 0.0006957328 -0.8859815 -1.855901 0 47\n leadingEdge\n1: GNLY,NKG7,CTSW,FGFBP2,CST7,PRF1,...\n2: CCL5,GZMH,IL32,LYAR,CD2,LINC01871,...\n3: HLA-DRA,HLA-DRB1,HLA-DQB1,HLA-DPA1,HLA-DPB1,HLA-DMA,...\n4: RPS28,TMEM123,RPL35A,RPS13,RPL9,RPS12,...\n5: COTL1,FTH1,AIF1,LST1,SAT1,NAP1L1,...\n6: S100A9,S100A8,LYZ,FCN1,TKT,MNDA,...\n7: CD37,CD52,MS4A1,BANK1,CD79B,TNFRSF13C,...\n\n$`8`\n pathway pval padj ES NES nMoreExtreme size\n1: Plasma cell 0.0497362472 0.0639466035 0.6759316 1.456101 65 24\n2: NK cell 0.0014337708 0.0021506562 -0.7686920 -1.449205 12 49\n3: ncMono 0.0014337708 0.0021506562 -0.7689308 -1.449655 12 49\n4: B cell 0.0004431642 0.0013294926 -0.7964359 -1.495521 3 47\n5: cDC 0.0006997901 0.0015745276 -0.8968690 -1.514856 5 17\n6: cMono 0.0001107910 0.0004985597 -0.8330905 -1.564350 0 47\n7: CD4 T cell 0.0001100837 
0.0004985597 -0.9094132 -1.719188 0 50\n leadingEdge\n1: JCHAIN,MZB1,DAD1,DERL3,TNFRSF17,MYDGF,...\n2: ITGB2,NKG7,GNLY,MYO1F,IFITM1,JAK1,...\n3: S100A4,S100A11,AIF1,IFITM2,CEBPB,SERPINA1,...\n4: CD52,RPS23,RPL13A,RPS11,RPL12,FAU,...\n5: HLA-DRA,HLA-DRB1,HLA-DPB1,HLA-DPA1,HLA-DQB1,HLA-DMA,...\n6: JUND,S100A6,TYROBP,NFKBIA,LYZ,FOS,...\n7: RPL34,EEF1A1,RPL13,RPS13,RPS3A,RPS6,...\n\n$`9`\n pathway pval padj ES NES nMoreExtreme size\n1: ncMono 0.0001131990 0.001018791 0.9741332 1.797890 0 49\n2: cDC 0.0419888030 0.062983204 0.8400218 1.386229 314 17\n3: CD8 T cell 0.0004108463 0.001562500 -0.9139791 -1.881373 0 18\n4: NK cell 0.0008561644 0.001562500 -0.7548756 -1.882005 0 49\n5: B cell 0.0008244023 0.001562500 -0.7643028 -1.891838 0 47\n6: CD4 T cell 0.0008680556 0.001562500 -0.8712990 -2.177877 0 50\n leadingEdge\n1: LST1,AIF1,COTL1,FCER1G,FCGR3A,IFITM3,...\n2: HLA-DPA1,HLA-DRA,HLA-DPB1,HLA-DRB1,HLA-DRB5,MTMR14,...\n3: CCL5,IL32,GZMH,CD3D,CD2,CD8A,...\n4: NKG7,GNLY,CST7,CTSW,GZMA,CD247,...\n5: CXCR4,MS4A1,BANK1,TNFRSF13C,LINC00926,RPL13A,...\n6: RPL31,LDHB,RPS3,IL7R,RPS29,RPS27A,...\n\n\nSelecing top significant overlap per cluster, we can now rename the clusters according to the predicted labels. OBS! Be aware that if you have some clusters that have non-significant p-values for all the gene sets, the cluster label will not be very reliable. Also, the gene sets you are using may not cover all the celltypes you have in your dataset and hence predictions may just be the most similar celltype. 
Also, some of the clusters have very similar p-values to multiple celltypes, for instance the ncMono and cMono celltypes are equally good for some clusters.\n\nnew.cluster.ids <- unlist(lapply(res, function(x) {\n as.data.frame(x)[1, 1]\n}))\n\nalldata@colData$ref_gsea <- new.cluster.ids[as.character(alldata@colData$louvain_SNNk15)]\n\nwrap_plots(\n plotReducedDim(alldata, dimred = \"UMAP\", colour_by = \"louvain_SNNk15\"),\n plotReducedDim(alldata, dimred = \"UMAP\", colour_by = \"ref_gsea\"),\n ncol = 2\n)\n\n\n\n\n\n\n\n\nCompare to results with the other celltype prediction methods in the ctrl_13 sample.\n\nctrl.sce@colData$ref_gsea <- alldata@colData$ref_gsea[alldata@colData$sample == \"ctrl.13\"]\n\nwrap_plots(\n plotReducedDim(ctrl.sce, dimred = \"UMAP\", colour_by = \"ref_gsea\"),\n plotReducedDim(ctrl.sce, dimred = \"UMAP\", colour_by = \"scmap_cell\"),\n plotReducedDim(ctrl.sce, dimred = \"UMAP\", colour_by = \"scpred_prediction\"),\n ncol = 3\n)\n\n\n\n\n\n\n\n\n\n\n7.2 With annotated gene sets\nWe have downloaded the celltype gene lists from http://bio-bigdata.hrbmu.edu.cn/CellMarker/CellMarker_download.html and converted the Excel file to a CSV for you. 
Read in the gene lists and do some filtering.\n\npath_file <- file.path(\"data/human_cell_markers.txt\")\nif (!file.exists(path_file)) download.file(file.path(path_data, \"human_cell_markers.txt\"), destfile = path_file)\n\n\nmarkers <- read.delim(\"data/human_cell_markers.txt\")\nmarkers <- markers[markers$speciesType == \"Human\", ]\nmarkers <- markers[markers$cancerType == \"Normal\", ]\n\n# Filter by tissue (to reduce computational time and have tissue-specific classification)\n# sort(unique(markers$tissueType))\n# grep(\"blood\",unique(markers$tissueType),value = T)\n# markers <- markers [ markers$tissueType %in% c(\"Blood\",\"Venous blood\",\n# \"Serum\",\"Plasma\",\n# \"Spleen\",\"Bone marrow\",\"Lymph node\"), ]\n\n\n# remove strange characters etc.\ncelltype_list <- lapply(unique(markers$cellName), function(x) {\n x <- paste(markers$geneSymbol[markers$cellName == x], sep = \",\")\n x <- gsub(\"[[]|[]]| |-\", \",\", x)\n x <- unlist(strsplit(x, split = \",\"))\n x <- unique(x[!x %in% c(\"\", \"NA\", \"family\")])\n x <- casefold(x, upper = T)\n})\nnames(celltype_list) <- unique(markers$cellName)\n# celltype_list <- lapply(celltype_list , function(x) {x[1:min(length(x),50)]} )\ncelltype_list <- celltype_list[unlist(lapply(celltype_list, length)) < 100]\ncelltype_list <- celltype_list[unlist(lapply(celltype_list, length)) > 5]\n\n\n# run fgsea for each of the clusters in the list\nres <- lapply(DGE_list, function(x) {\n x$logFC <- rowSums(as.matrix(x[, grep(\"logFC\", colnames(x))]))\n gene_rank <- setNames(x$logFC, rownames(x))\n fgseaRes <- fgsea(pathways = celltype_list, stats = gene_rank, nperm = 10000)\n return(fgseaRes)\n})\nnames(res) <- names(DGE_list)\n\n# You can filter and resort the table based on ES, NES or pvalue\nres <- lapply(res, function(x) {\n x[x$pval < 0.01, ]\n})\nres <- lapply(res, function(x) {\n x[x$size > 5, ]\n})\nres <- lapply(res, function(x) {\n x[order(x$NES, decreasing = T), ]\n})\n\n# show top 3 for each cluster.\nlapply(res, 
head, 3)\n\n$`1`\n pathway pval padj ES NES\n1: Neutrophil 0.0001222195 0.01215255 0.9203456 1.876178\n2: CD1C+_B dendritic cell 0.0001292825 0.01215255 0.9243123 1.809278\n3: Stromal cell 0.0011025358 0.04145535 0.8693509 1.626355\n nMoreExtreme size leadingEdge\n1: 0 80 S100A8,S100A9,S100A12,MNDA,NAMPT,S100A11,...\n2: 0 54 S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...\n3: 7 38 VIM,TIMP2,BST1,TIMP1,CD44,ANPEP,...\n\n$`2`\n pathway pval padj ES NES\n1: Follicular B cell 0.008464329 0.06630391 0.8526465 1.600418\n2: Pyramidal cell 0.004198321 0.04582724 -0.9744811 -1.516437\n3: CD4+CD25+ regulatory T cell 0.002199120 0.03445289 -0.9799105 -1.524886\n nMoreExtreme size leadingEdge\n1: 41 22 MS4A1,CD69,CD22,FCER2,CD40,PAX5,...\n2: 20 6 NRGN,CD3E\n3: 10 6 CD3E,CD3D,CD3G,PTPRC,CD4\n\n$`3`\n pathway pval padj ES NES\n1: Neutrophil 0.0001081315 0.01063709 0.8977016 1.749619\n2: CD1C+_B dendritic cell 0.0001131606 0.01063709 0.8981095 1.699327\n3: Stromal cell 0.0003583801 0.02245849 0.8818002 1.619610\n nMoreExtreme size leadingEdge\n1: 0 80 S100A9,S100A8,S100A11,LST1,CD14,S100A12,...\n2: 0 54 LYZ,S100A9,S100A8,FCN1,VCAN,CD14,...\n3: 2 38 VIM,CD44,TIMP2,TIMP1,ICAM1,PECAM1,...\n\n$`4`\n pathway pval padj ES NES nMoreExtreme\n1: Naive CD8+ T cell 0.0002157497 0.006783575 0.8599144 2.048419 0\n2: Naive CD4+ T cell 0.0002164971 0.006783575 0.9296309 1.895090 0\n3: CD4+ T cell 0.0002150538 0.006783575 0.9271953 1.799035 0\n size leadingEdge\n1: 91 LDHB,PIK3IP1,NOSIP,TCF7,NPM1,RCAN3,...\n2: 34 IL7R,NOSIP,TCF7,EEF1B2,RPS5,MAL,...\n3: 25 IL7R,LTB,CD3E,CD3D,CD3G,CD2,...\n\n$`5`\n pathway pval padj ES NES nMoreExtreme\n1: Follicular B cell 0.008289527 0.05993966 0.8517164 1.606149 40\n2: Myoepithelial cell 0.008235294 0.05993966 -0.9398262 -1.486394 41\n3: Pyramidal cell 0.002341463 0.03160284 -0.9730353 -1.502327 11\n size leadingEdge\n1: 22 MS4A1,CD69,CD22,CD40,FCER2,PAX5,...\n2: 7 ITGB1,BHLHE40,CD44\n3: 6 CD3E,NRGN\n\n$`6`\n pathway pval padj ES\n1: CD4+ cytotoxic T cell 0.0001825484 
0.008283763 0.8850534\n2: Natural killer cell 0.0001830831 0.008283763 0.8009472\n3: Effector CD8+ memory T (Tem) cell 0.0003665689 0.011485826 0.7818876\n NES nMoreExtreme size leadingEdge\n1: 2.020483 0 86 CCL5,NKG7,GNLY,GZMH,CST7,GZMA,...\n2: 1.824333 0 84 NKG7,GNLY,CD3D,CD3E,GZMA,CD3G,...\n3: 1.765723 1 79 GNLY,GZMH,ARL4C,GZMB,FGFBP2,KLRD1,...\n\n$`7`\n pathway pval padj ES NES\n1: CD4+ cytotoxic T cell 0.0002382087 0.01130895 0.9485749 2.191249\n2: Effector CD8+ memory T (Tem) cell 0.0002406160 0.01130895 0.8946982 2.041811\n3: Natural killer cell 0.0002387205 0.01130895 0.8572499 1.974925\n nMoreExtreme size leadingEdge\n1: 0 86 GNLY,NKG7,CCL5,GZMB,CTSW,FGFBP2,...\n2: 0 79 GNLY,GZMB,FGFBP2,KLRD1,SPON2,GZMH,...\n3: 0 84 GNLY,NKG7,GZMB,CD247,GZMA,KLRD1,...\n\n$`8`\n pathway pval padj ES NES nMoreExtreme\n1: Megakaryocyte 0.008771930 0.1649123 0.8128385 1.763957 10\n2: Eosinophil 0.007063238 0.1480465 -0.7453288 -1.396081 63\n3: Natural killer cell 0.003492433 0.1480465 -0.7084403 -1.396899 32\n size leadingEdge\n1: 25 PPBP,PF4,GP9,ITGA2B,CD9,RASGRP2,...\n2: 47 CD52,PTPRC,CD48,CD44,CD53,CD69,...\n3: 84 PTPRC,NKG7,GNLY,CD69,CD81,FCGR3A,...\n\n$`9`\n pathway pval padj ES NES nMoreExtreme\n1: Mesenchymal cell 0.0001106929 0.02081027 0.8631721 1.606935 0\n2: Stromal cell 0.0009342520 0.04648863 0.8552856 1.544942 7\n3: Hemangioblast 0.0003017502 0.02836451 0.9907663 1.516461 1\n size leadingEdge\n1: 58 COTL1,S100A4,CTSC,HES4,VIM,ZEB2,...\n2: 38 PECAM1,TIMP1,VIM,TIMP2,PTPRC,CD44,...\n3: 8 PECAM1,CD34\n\n\n#CT_GSEA8:\n\nnew.cluster.ids <- unlist(lapply(res, function(x) {\n as.data.frame(x)[1, 1]\n}))\nalldata@colData$cellmarker_gsea <- new.cluster.ids[as.character(alldata@colData$louvain_SNNk15)]\n\nwrap_plots(\n plotReducedDim(alldata, dimred = \"UMAP\", colour_by = \"cellmarker_gsea\"),\n plotReducedDim(alldata, dimred = \"UMAP\", colour_by = \"ref_gsea\"),\n ncol = 2\n)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nDo you think that the methods overlap well? 
Where do you see the most inconsistencies?\n\n\nIn this case we do not have any ground truth, and we cannot say which method performs best. You should keep in mind, that any celltype classification method is just a prediction, and you still need to use your common sense and knowledge of the biological system to judge if the results make sense.\nFinally, lets save the data with predictions.\n\nsaveRDS(ctrl.sce, \"data/covid/results/bioc_covid_qc_dr_int_cl_ct-ctrl13.rds\")" }, { "objectID": "labs/bioc/bioc_06_celltyping.html#meta-session", @@ -844,7 +844,7 @@ "href": "labs/bioc/bioc_08_spatial.html#meta-st_prep", "title": " Spatial Transcriptomics", "section": "1 Preparation", - "text": "1 Preparation\nLoad packages\n\n# BiocManager::install('DropletUtils',update = F)\n# BiocManager::install(\"Spaniel\",update = F)\n# remotes::install_github(\"RachelQueen1/Spaniel\", ref = \"Development\" ,upgrade = F,dependencies = F)\n# remotes::install_github(\"renozao/xbioc\")\n# remotes::install_github(\"meichendong/SCDC\")\n\nsuppressPackageStartupMessages({\n library(Spaniel)\n # library(biomaRt)\n library(SingleCellExperiment)\n library(Matrix)\n library(dplyr)\n library(scran)\n library(SingleR)\n library(scater)\n library(ggplot2)\n library(patchwork)\n})\n\nLoad ST data\n\npath_data <- \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\n\nif (!dir.exists(\"data/spatial/visium/Anterior\")) dir.create(\"data/spatial/visium/Anterior\", recursive = T)\nif (!dir.exists(\"data/spatial/visium/Posterior\")) dir.create(\"data/spatial/visium/Posterior\", recursive = T)\n\nfile_list <- c(\n \"spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_filtered_feature_bc_matrix.tar.gz\",\n \"spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_spatial.tar.gz\",\n \"spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_filtered_feature_bc_matrix.tar.gz\",\n \"spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_spatial.tar.gz\"\n)\n\nfor (i in 
file_list) {\n if (!file.exists(file.path(\"data\", i))) {\n cat(paste0(\"Downloading \", file.path(path_data, i), \" to \", file.path(\"data\", i), \"\\n\"))\n download.file(url = file.path(path_data, i), destfile = file.path(\"data\", i))\n }\n cat(paste0(\"Uncompressing \", file.path(\"data\", i), \"\\n\"))\n system(paste0(\"tar -xvzf \", file.path(\"data\", i), \" -C \", dirname(file.path(\"data\", i))))\n}\n\nDownloading https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq/spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_filtered_feature_bc_matrix.tar.gz to data/spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_filtered_feature_bc_matrix.tar.gz\nUncompressing data/spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_filtered_feature_bc_matrix.tar.gz\nDownloading https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq/spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_spatial.tar.gz to data/spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_spatial.tar.gz\nUncompressing data/spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_spatial.tar.gz\nDownloading https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq/spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_filtered_feature_bc_matrix.tar.gz to data/spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_filtered_feature_bc_matrix.tar.gz\nUncompressing data/spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_filtered_feature_bc_matrix.tar.gz\nDownloading https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq/spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_spatial.tar.gz to data/spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_spatial.tar.gz\nUncompressing data/spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_spatial.tar.gz\n\n\nMerge the objects into one SCE object.\n\nsce.a <- Spaniel::createVisiumSCE(tenXDir = \"data/spatial/visium/Anterior\", resolution = 
\"Low\")\nsce.p <- Spaniel::createVisiumSCE(tenXDir = \"data/spatial/visium/Posterior\", resolution = \"Low\")\nsce <- cbind(sce.a, sce.p)\n\nsce$Sample <- basename(sub(\"/filtered_feature_bc_matrix\", \"\", sce$Sample))\n\nlll <- list(sce.a, sce.p)\nlll <- lapply(lll, function(x) x@metadata)\nnames(lll) <- c(\"Anterior\", \"Posterior\")\nsce@metadata <- lll\n\nWe can further convert the gene ensembl IDs to gene names using biomaRt.\n\nmart <- biomaRt::useMart(biomart = \"ENSEMBL_MART_ENSEMBL\", dataset = \"mmusculus_gene_ensembl\")\nannot <- biomaRt::getBM(attributes = c(\"ensembl_gene_id\", \"external_gene_name\", \"gene_biotype\"), mart = mart, useCache = F)\nsaveRDS(annot, \"data/spatial/visium/annot.rds\")\n\nWe will use a file that was created in advance.\n\npath_file <- \"data/spatial/visium/annot.rds\"\nif (!file.exists(path_file)) download.file(url = file.path(path_data, \"spatial/visium/annot.rds\"), destfile = path_file)\nannot <- readRDS(path_file)\n\n\ngene_names <- as.character(annot[match(rownames(sce), annot[, \"ensembl_gene_id\"]), \"external_gene_name\"])\ngene_names[is.na(gene_names)] <- \"\"\n\nsce <- sce[gene_names != \"\", ]\nrownames(sce) <- gene_names[gene_names != \"\"]\ndim(sce)\n\n[1] 32053 6050" + "text": "1 Preparation\nLoad packages\n\n# BiocManager::install('DropletUtils',update = F)\n# BiocManager::install(\"Spaniel\",update = F)\n# remotes::install_github(\"RachelQueen1/Spaniel\", ref = \"Development\" ,upgrade = F,dependencies = F)\n# remotes::install_github(\"renozao/xbioc\")\n# remotes::install_github(\"meichendong/SCDC\")\n\nsuppressPackageStartupMessages({\n library(Spaniel)\n # library(biomaRt)\n library(SingleCellExperiment)\n library(Matrix)\n library(dplyr)\n library(scran)\n library(SingleR)\n library(scater)\n library(ggplot2)\n library(patchwork)\n})\n\nLoad ST data\n\npath_data <- \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\n\nif (!dir.exists(\"data/spatial/visium/Anterior\")) 
dir.create(\"data/spatial/visium/Anterior\", recursive = T)\nif (!dir.exists(\"data/spatial/visium/Posterior\")) dir.create(\"data/spatial/visium/Posterior\", recursive = T)\n\nfile_list <- c(\n \"spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_filtered_feature_bc_matrix.tar.gz\",\n \"spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_spatial.tar.gz\",\n \"spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_filtered_feature_bc_matrix.tar.gz\",\n \"spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_spatial.tar.gz\"\n)\n\nfor (i in file_list) {\n if (!file.exists(file.path(\"data\", i))) {\n cat(paste0(\"Downloading \", file.path(path_data, i), \" to \", file.path(\"data\", i), \"\\n\"))\n download.file(url = file.path(path_data, i), destfile = file.path(\"data\", i))\n }\n cat(paste0(\"Uncompressing \", file.path(\"data\", i), \"\\n\"))\n system(paste0(\"tar -xvzf \", file.path(\"data\", i), \" -C \", dirname(file.path(\"data\", i))))\n}\n\nUncompressing data/spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_filtered_feature_bc_matrix.tar.gz\nUncompressing data/spatial/visium/Anterior/V1_Mouse_Brain_Sagittal_Anterior_spatial.tar.gz\nUncompressing data/spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_filtered_feature_bc_matrix.tar.gz\nUncompressing data/spatial/visium/Posterior/V1_Mouse_Brain_Sagittal_Posterior_spatial.tar.gz\n\n\nMerge the objects into one SCE object.\n\nsce.a <- Spaniel::createVisiumSCE(tenXDir = \"data/spatial/visium/Anterior\", resolution = \"Low\")\nsce.p <- Spaniel::createVisiumSCE(tenXDir = \"data/spatial/visium/Posterior\", resolution = \"Low\")\nsce <- cbind(sce.a, sce.p)\n\nsce$Sample <- basename(sub(\"/filtered_feature_bc_matrix\", \"\", sce$Sample))\n\nlll <- list(sce.a, sce.p)\nlll <- lapply(lll, function(x) x@metadata)\nnames(lll) <- c(\"Anterior\", \"Posterior\")\nsce@metadata <- lll\n\nWe can further convert the gene ensembl IDs to gene names using biomaRt.\n\nmart <- 
biomaRt::useMart(biomart = \"ENSEMBL_MART_ENSEMBL\", dataset = \"mmusculus_gene_ensembl\")\nannot <- biomaRt::getBM(attributes = c(\"ensembl_gene_id\", \"external_gene_name\", \"gene_biotype\"), mart = mart, useCache = F)\nsaveRDS(annot, \"data/spatial/visium/annot.rds\")\n\nWe will use a file that was created in advance.\n\npath_file <- \"data/spatial/visium/annot.rds\"\nif (!file.exists(path_file)) download.file(url = file.path(path_data, \"spatial/visium/annot.rds\"), destfile = path_file)\nannot <- readRDS(path_file)\n\n\ngene_names <- as.character(annot[match(rownames(sce), annot[, \"ensembl_gene_id\"]), \"external_gene_name\"])\ngene_names[is.na(gene_names)] <- \"\"\n\nsce <- sce[gene_names != \"\", ]\nrownames(sce) <- gene_names[gene_names != \"\"]\ndim(sce)\n\n[1] 32053 6050" }, { "objectID": "labs/bioc/bioc_08_spatial.html#meta-st_qc", @@ -858,14 +858,14 @@ "href": "labs/bioc/bioc_08_spatial.html#meta-st_analysis", "title": " Spatial Transcriptomics", "section": "3 Analysis", - "text": "3 Analysis\nWe will proceed with the data in a very similar manner to scRNA-seq data.\n\nsce <- computeSumFactors(sce, sizes = c(20, 40, 60, 80))\nsce <- logNormCounts(sce)\n\nNow we can plot gene expression of individual genes, the gene Hpca is a strong hippocampal marker and Ttr is a marker of the choroid plexus.\n\nsamples <- c(\"Anterior\", \"Posterior\")\nto_plot <- c(\"Hpca\", \"Ttr\")\n\nplist <- list()\nn <- 1\nfor (j in to_plot) {\n for (i in samples) {\n temp <- sce[, sce$Sample == i]\n temp@metadata <- temp@metadata[[i]]\n plist[[n]] <- spanielPlot(\n object = temp,\n plotType = \"Gene\",\n gene = j,\n customTitle = j,\n techType = \"Visium\",\n ptSizeMax = 1, ptSizeMin = .1\n )\n n <- n + 1\n }\n}\n\nwrap_plots(plist, ncol = 2)\n\n\n\n\n\n\n\n\n\n3.1 Dimensionality reduction and clustering\nWe can then now run dimensionality reduction and clustering using the same workflow as we use for scRNA-seq analysis.\nBut make sure you run it on the SCT assay.\n\nvar.out 
<- modelGeneVar(sce, method = \"loess\")\nhvgs <- getTopHVGs(var.out, n = 2000)\nsce <- runPCA(sce,\n exprs_values = \"logcounts\",\n subset_row = hvgs,\n ncomponents = 50,\n ntop = 100,\n scale = T\n)\ng <- buildSNNGraph(sce, k = 5, use.dimred = \"PCA\")\nsce$louvain_SNNk5 <- factor(igraph::cluster_louvain(g)$membership)\nsce <- runUMAP(sce,\n dimred = \"PCA\", n_dimred = 50, ncomponents = 2, min_dist = 0.1, spread = .3,\n metric = \"correlation\", name = \"UMAP_on_PCA\"\n)\n\nWe can then plot clusters onto umap or onto the tissue section.\n\nsamples <- c(\"Anterior\", \"Posterior\")\nto_plot <- c(\"louvain_SNNk5\")\n\nplist <- list()\nn <- 1\nfor (j in to_plot) {\n for (i in samples) {\n temp <- sce[, sce$Sample == i]\n temp@metadata <- temp@metadata[[i]]\n plist[[n]] <- spanielPlot(\n object = temp,\n plotType = \"Cluster\", clusterRes = j,\n customTitle = j,\n techType = \"Visium\",\n ptSizeMax = 1, ptSizeMin = .1\n )\n n <- n + 1\n }\n}\n\nplist[[3]] <- plotReducedDim(sce, dimred = \"UMAP_on_PCA\", colour_by = \"louvain_SNNk5\")\nplist[[4]] <- plotReducedDim(sce, dimred = \"UMAP_on_PCA\", colour_by = \"Sample\")\n\nwrap_plots(plist, ncol = 2)\n\n\n\n\n\n\n\n\n\n\n3.2 Integration\nQuite often there are strong batch effects between different ST sections, so it may be a good idea to integrate the data across sections.\nWe will do a similar integration as in the Data Integration lab.\n\nmnn_out <- batchelor::fastMNN(sce, subset.row = hvgs, batch = factor(sce$Sample), k = 20, d = 50)\n\nreducedDim(sce, \"MNN\") <- reducedDim(mnn_out, \"corrected\")\nrm(mnn_out)\ngc()\n\n used (Mb) gc trigger (Mb) max used (Mb)\nNcells 10071341 537.9 14514548 775.2 14514548 775.2\nVcells 191849982 1463.7 373707381 2851.2 373703568 2851.2\n\n\nThen we run dimensionality reduction and clustering as before.\n\ng <- buildSNNGraph(sce, k = 5, use.dimred = \"MNN\")\nsce$louvain_SNNk5 <- factor(igraph::cluster_louvain(g)$membership)\nsce <- runUMAP(sce,\n dimred = \"MNN\", n_dimred = 50, 
ncomponents = 2, min_dist = 0.1, spread = .3,\n metric = \"correlation\", name = \"UMAP_on_MNN\"\n)\n\n\nsamples <- c(\"Anterior\", \"Posterior\")\nto_plot <- c(\"louvain_SNNk5\")\n\nplist <- list()\nn <- 1\nfor (j in to_plot) {\n for (i in samples) {\n temp <- sce[, sce$Sample == i]\n temp@metadata <- temp@metadata[[i]]\n plist[[n]] <- spanielPlot(\n object = temp,\n plotType = \"Cluster\", clusterRes = j,\n customTitle = j,\n techType = \"Visium\",\n ptSizeMax = 1, ptSizeMin = .1\n )\n n <- n + 1\n }\n}\n\nplist[[3]] <- plotReducedDim(sce, dimred = \"UMAP_on_MNN\", colour_by = \"louvain_SNNk5\")\nplist[[4]] <- plotReducedDim(sce, dimred = \"UMAP_on_MNN\", colour_by = \"Sample\")\n\nwrap_plots(plist, ncol = 2)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nDo you see any differences between the integrated and non-integrated clustering? Judge for yourself, which of the clusterings do you think looks best? As a reference, you can compare to brain regions in the Allen brain atlas.\n\n\n\n\n3.3 Spatially Variable Features\nThere are two main workflows to identify molecular features that correlate with spatial location within a tissue. The first is to perform differential expression based on spatially distinct clusters, the other is to find features that have spatial patterning without taking clusters or spatial annotation into account. 
First, we will do differential expression between clusters just as we did for the scRNAseq data before.\n\n# differential expression between cluster 4 and cluster 6\ncell_selection <- sce[, sce$louvain_SNNk5 %in% c(4, 6)]\ncell_selection$louvain_SNNk5 <- factor(cell_selection$louvain_SNNk5)\n\nmarkers_genes <- scran::findMarkers(\n x = cell_selection,\n groups = cell_selection$louvain_SNNk5,\n lfc = .25,\n pval.type = \"all\",\n direction = \"up\"\n)\n\n# List of dataFrames with the results for each cluster\ntop5_cell_selection <- lapply(names(markers_genes), function(x) {\n temp <- markers_genes[[x]][1:5, 1:2]\n temp$gene <- rownames(markers_genes[[x]])[1:5]\n temp$cluster <- x\n return(temp)\n})\ntop5_cell_selection <- as_tibble(do.call(rbind, top5_cell_selection))\ntop5_cell_selection\n\n\n\n \n\n\n# plot top markers\nsamples <- c(\"Anterior\", \"Posterior\")\nto_plot <- top5_cell_selection$gene[1:5]\n\nplist <- list()\nn <- 1\nfor (j in to_plot) {\n for (i in samples) {\n temp <- sce[, sce$Sample == i]\n temp@metadata <- temp@metadata[[i]]\n plist[[n]] <- spanielPlot(\n object = temp,\n plotType = \"Gene\",\n gene = j,\n customTitle = j,\n techType = \"Visium\",\n ptSizeMax = 1, ptSizeMin = .1\n )\n n <- n + 1\n }\n}\nwrap_plots(plist, ncol = 2)" + "text": "3 Analysis\nWe will proceed with the data in a very similar manner to scRNA-seq data.\n\nsce <- computeSumFactors(sce, sizes = c(20, 40, 60, 80))\nsce <- logNormCounts(sce)\n\nNow we can plot gene expression of individual genes, the gene Hpca is a strong hippocampal marker and Ttr is a marker of the choroid plexus.\n\nsamples <- c(\"Anterior\", \"Posterior\")\nto_plot <- c(\"Hpca\", \"Ttr\")\n\nplist <- list()\nn <- 1\nfor (j in to_plot) {\n for (i in samples) {\n temp <- sce[, sce$Sample == i]\n temp@metadata <- temp@metadata[[i]]\n plist[[n]] <- spanielPlot(\n object = temp,\n plotType = \"Gene\",\n gene = j,\n customTitle = j,\n techType = \"Visium\",\n ptSizeMax = 1, ptSizeMin = .1\n )\n n <- n + 1\n 
}\n}\n\nwrap_plots(plist, ncol = 2)\n\n\n\n\n\n\n\n\n\n3.1 Dimensionality reduction and clustering\nWe can then now run dimensionality reduction and clustering using the same workflow as we use for scRNA-seq analysis.\nBut make sure you run it on the SCT assay.\n\nvar.out <- modelGeneVar(sce, method = \"loess\")\nhvgs <- getTopHVGs(var.out, n = 2000)\nsce <- runPCA(sce,\n exprs_values = \"logcounts\",\n subset_row = hvgs,\n ncomponents = 50,\n ntop = 100,\n scale = T\n)\ng <- buildSNNGraph(sce, k = 5, use.dimred = \"PCA\")\nsce$louvain_SNNk5 <- factor(igraph::cluster_louvain(g)$membership)\nsce <- runUMAP(sce,\n dimred = \"PCA\", n_dimred = 50, ncomponents = 2, min_dist = 0.1, spread = .3,\n metric = \"correlation\", name = \"UMAP_on_PCA\"\n)\n\nWe can then plot clusters onto umap or onto the tissue section.\n\nsamples <- c(\"Anterior\", \"Posterior\")\nto_plot <- c(\"louvain_SNNk5\")\n\nplist <- list()\nn <- 1\nfor (j in to_plot) {\n for (i in samples) {\n temp <- sce[, sce$Sample == i]\n temp@metadata <- temp@metadata[[i]]\n plist[[n]] <- spanielPlot(\n object = temp,\n plotType = \"Cluster\", clusterRes = j,\n customTitle = j,\n techType = \"Visium\",\n ptSizeMax = 1, ptSizeMin = .1\n )\n n <- n + 1\n }\n}\n\nplist[[3]] <- plotReducedDim(sce, dimred = \"UMAP_on_PCA\", colour_by = \"louvain_SNNk5\")\nplist[[4]] <- plotReducedDim(sce, dimred = \"UMAP_on_PCA\", colour_by = \"Sample\")\n\nwrap_plots(plist, ncol = 2)\n\n\n\n\n\n\n\n\n\n\n3.2 Integration\nQuite often there are strong batch effects between different ST sections, so it may be a good idea to integrate the data across sections.\nWe will do a similar integration as in the Data Integration lab.\n\nmnn_out <- batchelor::fastMNN(sce, subset.row = hvgs, batch = factor(sce$Sample), k = 20, d = 50)\n\nreducedDim(sce, \"MNN\") <- reducedDim(mnn_out, \"corrected\")\nrm(mnn_out)\ngc()\n\n used (Mb) gc trigger (Mb) max used (Mb)\nNcells 10077614 538.3 14514560 775.2 14514560 775.2\nVcells 191871231 1463.9 373705667 
2851.2 373705055 2851.2\n\n\nThen we run dimensionality reduction and clustering as before.\n\ng <- buildSNNGraph(sce, k = 5, use.dimred = \"MNN\")\nsce$louvain_SNNk5 <- factor(igraph::cluster_louvain(g)$membership)\nsce <- runUMAP(sce,\n dimred = \"MNN\", n_dimred = 50, ncomponents = 2, min_dist = 0.1, spread = .3,\n metric = \"correlation\", name = \"UMAP_on_MNN\"\n)\n\n\nsamples <- c(\"Anterior\", \"Posterior\")\nto_plot <- c(\"louvain_SNNk5\")\n\nplist <- list()\nn <- 1\nfor (j in to_plot) {\n for (i in samples) {\n temp <- sce[, sce$Sample == i]\n temp@metadata <- temp@metadata[[i]]\n plist[[n]] <- spanielPlot(\n object = temp,\n plotType = \"Cluster\", clusterRes = j,\n customTitle = j,\n techType = \"Visium\",\n ptSizeMax = 1, ptSizeMin = .1\n )\n n <- n + 1\n }\n}\n\nplist[[3]] <- plotReducedDim(sce, dimred = \"UMAP_on_MNN\", colour_by = \"louvain_SNNk5\")\nplist[[4]] <- plotReducedDim(sce, dimred = \"UMAP_on_MNN\", colour_by = \"Sample\")\n\nwrap_plots(plist, ncol = 2)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nDo you see any differences between the integrated and non-integrated clustering? Judge for yourself, which of the clusterings do you think looks best? As a reference, you can compare to brain regions in the Allen brain atlas.\n\n\n\n\n3.3 Spatially Variable Features\nThere are two main workflows to identify molecular features that correlate with spatial location within a tissue. The first is to perform differential expression based on spatially distinct clusters, the other is to find features that have spatial patterning without taking clusters or spatial annotation into account. 
First, we will do differential expression between clusters just as we did for the scRNAseq data before.\n\n# differential expression between cluster 4 and cluster 6\ncell_selection <- sce[, sce$louvain_SNNk5 %in% c(4, 6)]\ncell_selection$louvain_SNNk5 <- factor(cell_selection$louvain_SNNk5)\n\nmarkers_genes <- scran::findMarkers(\n x = cell_selection,\n groups = cell_selection$louvain_SNNk5,\n lfc = .25,\n pval.type = \"all\",\n direction = \"up\"\n)\n\n# List of dataFrames with the results for each cluster\ntop5_cell_selection <- lapply(names(markers_genes), function(x) {\n temp <- markers_genes[[x]][1:5, 1:2]\n temp$gene <- rownames(markers_genes[[x]])[1:5]\n temp$cluster <- x\n return(temp)\n})\ntop5_cell_selection <- as_tibble(do.call(rbind, top5_cell_selection))\ntop5_cell_selection\n\n\n\n \n\n\n# plot top markers\nsamples <- c(\"Anterior\", \"Posterior\")\nto_plot <- top5_cell_selection$gene[1:5]\n\nplist <- list()\nn <- 1\nfor (j in to_plot) {\n for (i in samples) {\n temp <- sce[, sce$Sample == i]\n temp@metadata <- temp@metadata[[i]]\n plist[[n]] <- spanielPlot(\n object = temp,\n plotType = \"Gene\",\n gene = j,\n customTitle = j,\n techType = \"Visium\",\n ptSizeMax = 1, ptSizeMin = .1\n )\n n <- n + 1\n }\n}\nwrap_plots(plist, ncol = 2)" }, { "objectID": "labs/bioc/bioc_08_spatial.html#meta-st_ss", "href": "labs/bioc/bioc_08_spatial.html#meta-st_ss", "title": " Spatial Transcriptomics", "section": "4 Single cell data", - "text": "4 Single cell data\nWe can use a scRNA-seq dataset as a reference to predict the proportion of different celltypes in the Visium spots. Keep in mind that it is important to have a reference that contains all the celltypes you expect to find in your spots. Ideally it should be a scRNA-seq reference from the exact same tissue. 
We will use a reference scRNA-seq dataset of ~14,000 adult mouse cortical cell taxonomy from the Allen Institute, generated with the SMART-Seq2 protocol.\nFirst dowload the seurat data:\n\npath_file <- \"data/spatial/visium/allen_cortex.rds\"\nif (!file.exists(path_file)) download.file(url = file.path(path_data, \"spatial/visium/allen_cortex.rds\"), destfile = path_file)\n\nFor speed, and for a more fair comparison of the celltypes, we will subsample all celltypes to a maximum of 200 cells per class (subclass).\n\nar <- readRDS(path_file)\nar_sce <- Seurat::as.SingleCellExperiment(ar)\nrm(ar)\ngc()\n\n used (Mb) gc trigger (Mb) max used (Mb)\nNcells 10176004 543.5 18544292 990.4 18544292 990.4\nVcells 577825608 4408.5 833436874 6358.7 578228452 4411.6\n\n# check number of cells per subclass\nar_sce$subclass <- sub(\"/\", \"_\", sub(\" \", \"_\", ar_sce$subclass))\ntable(ar_sce$subclass)\n\n\n Astro CR Endo L2_3_IT L4 L5_IT L5_PT \n 368 7 94 982 1401 880 544 \n L6_CT L6_IT L6b Lamp5 Macrophage Meis2 NP \n 960 1872 358 1122 51 45 362 \n Oligo Peri Pvalb Serpinf1 SMC Sncg Sst \n 91 32 1337 27 55 125 1741 \n Vip VLMC \n 1728 67 \n\n# select 20 cells per subclass, fist set subclass as active.ident\nsubset_cells <- lapply(unique(ar_sce$subclass), function(x) {\n if (sum(ar_sce$subclass == x) > 20) {\n temp <- sample(colnames(ar_sce)[ar_sce$subclass == x], size = 20)\n } else {\n temp <- colnames(ar_sce)[ar_sce$subclass == x]\n }\n})\nar_sce <- ar_sce[, unlist(subset_cells)]\n\n# check again number of cells per subclass\ntable(ar_sce$subclass)\n\n\n Astro CR Endo L2_3_IT L4 L5_IT L5_PT \n 20 7 20 20 20 20 20 \n L6_CT L6_IT L6b Lamp5 Macrophage Meis2 NP \n 20 20 20 20 20 20 20 \n Oligo Peri Pvalb Serpinf1 SMC Sncg Sst \n 20 20 20 20 20 20 20 \n Vip VLMC \n 20 20 \n\n\nThen run normalization and dimensionality reduction.\n\nar_sce <- computeSumFactors(ar_sce, sizes = c(20, 40, 60, 80))\nar_sce <- logNormCounts(ar_sce)\nallen.var.out <- modelGeneVar(ar_sce, method = 
\"loess\")\nallen.hvgs <- getTopHVGs(allen.var.out, n = 2000)" + "text": "4 Single cell data\nWe can use a scRNA-seq dataset as a reference to predict the proportion of different celltypes in the Visium spots. Keep in mind that it is important to have a reference that contains all the celltypes you expect to find in your spots. Ideally it should be a scRNA-seq reference from the exact same tissue. We will use a reference scRNA-seq dataset of ~14,000 adult mouse cortical cell taxonomy from the Allen Institute, generated with the SMART-Seq2 protocol.\nFirst dowload the seurat data:\n\npath_file <- \"data/spatial/visium/allen_cortex.rds\"\nif (!file.exists(path_file)) download.file(url = file.path(path_data, \"spatial/visium/allen_cortex.rds\"), destfile = path_file)\n\nFor speed, and for a more fair comparison of the celltypes, we will subsample all celltypes to a maximum of 200 cells per class (subclass).\n\nar <- readRDS(path_file)\nar_sce <- Seurat::as.SingleCellExperiment(ar)\nrm(ar)\ngc()\n\n used (Mb) gc trigger (Mb) max used (Mb)\nNcells 10176084 543.5 18536281 990.0 18536281 990\nVcells 576826421 4400.9 831998051 6347.7 577229270 4404\n\n# check number of cells per subclass\nar_sce$subclass <- sub(\"/\", \"_\", sub(\" \", \"_\", ar_sce$subclass))\ntable(ar_sce$subclass)\n\n\n Astro CR Endo L2_3_IT L4 L5_IT L5_PT \n 368 7 94 982 1401 880 544 \n L6_CT L6_IT L6b Lamp5 Macrophage Meis2 NP \n 960 1872 358 1122 51 45 362 \n Oligo Peri Pvalb Serpinf1 SMC Sncg Sst \n 91 32 1337 27 55 125 1741 \n Vip VLMC \n 1728 67 \n\n# select 20 cells per subclass, fist set subclass as active.ident\nsubset_cells <- lapply(unique(ar_sce$subclass), function(x) {\n if (sum(ar_sce$subclass == x) > 20) {\n temp <- sample(colnames(ar_sce)[ar_sce$subclass == x], size = 20)\n } else {\n temp <- colnames(ar_sce)[ar_sce$subclass == x]\n }\n})\nar_sce <- ar_sce[, unlist(subset_cells)]\n\n# check again number of cells per subclass\ntable(ar_sce$subclass)\n\n\n Astro CR Endo L2_3_IT L4 L5_IT 
L5_PT \n 20 7 20 20 20 20 20 \n L6_CT L6_IT L6b Lamp5 Macrophage Meis2 NP \n 20 20 20 20 20 20 20 \n Oligo Peri Pvalb Serpinf1 SMC Sncg Sst \n 20 20 20 20 20 20 20 \n Vip VLMC \n 20 20 \n\n\nThen run normalization and dimensionality reduction.\n\nar_sce <- computeSumFactors(ar_sce, sizes = c(20, 40, 60, 80))\nar_sce <- logNormCounts(ar_sce)\nallen.var.out <- modelGeneVar(ar_sce, method = \"loess\")\nallen.hvgs <- getTopHVGs(allen.var.out, n = 2000)" }, { "objectID": "labs/bioc/bioc_08_spatial.html#meta-st_sub", @@ -893,56 +893,56 @@ "href": "labs/scanpy/scanpy_01_qc.html#meta-qc_data", "title": " Quality Control", "section": "1 Get data", - "text": "1 Get data\nIn this tutorial, we will run all tutorials with a set of 8 PBMC 10x datasets from 4 covid-19 patients and 4 healthy controls, the samples have been subsampled to 1500 cells per sample. We can start by defining our paths.\n\nimport os\n\npath_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\npath_covid = \"./data/covid\"\nif not os.path.exists(path_covid):\n os.makedirs(path_covid, exist_ok=True)\n\npath_results = \"data/covid/results\"\nif not os.path.exists(path_results):\n os.makedirs(path_results, exist_ok=True)\n\n\nimport urllib.request\n\nfile_list = [\n \"normal_pbmc_13.h5\", \"normal_pbmc_14.h5\", \"normal_pbmc_19.h5\", \"normal_pbmc_5.h5\",\n \"ncov_pbmc_15.h5\", \"ncov_pbmc_16.h5\", \"ncov_pbmc_17.h5\", \"ncov_pbmc_1.h5\"\n]\n\nfor i in file_list:\n path_file = os.path.join(path_covid, i)\n if not os.path.exists(path_file):\n file_url = os.path.join(path_data, \"covid\", i)\n urllib.request.urlretrieve(file_url, path_file)\n\nWith data in place, now we can start loading libraries we will use in this tutorial.\n\nimport numpy as np\nimport pandas as pd\nimport scanpy as sc\nimport warnings\n\nwarnings.simplefilter(action='ignore', category=Warning)\n\n# verbosity: errors (0), warnings (1), info (2), hints (3)\nsc.settings.verbosity = 
3\nsc.settings.set_figure_params(dpi=80)\n\nWe can first load the data individually by reading directly from HDF5 file format (.h5).\n\ndata_cov1 = sc.read_10x_h5(os.path.join(path_covid,'ncov_pbmc_1.h5'))\ndata_cov1.var_names_make_unique()\ndata_cov15 = sc.read_10x_h5(os.path.join(path_covid,'ncov_pbmc_15.h5'))\ndata_cov15.var_names_make_unique()\ndata_cov17 = sc.read_10x_h5(os.path.join(path_covid,'ncov_pbmc_17.h5'))\ndata_cov17.var_names_make_unique()\ndata_ctrl5 = sc.read_10x_h5(os.path.join(path_covid,'normal_pbmc_5.h5'))\ndata_ctrl5.var_names_make_unique()\ndata_ctrl13 = sc.read_10x_h5(os.path.join(path_covid,'normal_pbmc_13.h5'))\ndata_ctrl13.var_names_make_unique()\ndata_ctrl14 = sc.read_10x_h5(os.path.join(path_covid,'normal_pbmc_14.h5'))\ndata_ctrl14.var_names_make_unique()\n\nreading ./data/covid/ncov_pbmc_1.h5\n (0:00:00)\nreading ./data/covid/ncov_pbmc_15.h5\n (0:00:00)\nreading ./data/covid/ncov_pbmc_17.h5\n (0:00:00)\nreading ./data/covid/normal_pbmc_5.h5\n (0:00:00)\nreading ./data/covid/normal_pbmc_13.h5\n (0:00:00)\nreading ./data/covid/normal_pbmc_14.h5\n (0:00:00)" + "text": "1 Get data\nIn this tutorial, we will run all tutorials with a set of 8 PBMC 10x datasets from 4 covid-19 patients and 4 healthy controls, the samples have been subsampled to 1500 cells per sample. 
We can start by defining our paths.\n\nimport os\n\npath_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\npath_covid = \"./data/covid\"\nif not os.path.exists(path_covid):\n os.makedirs(path_covid, exist_ok=True)\n\npath_results = \"data/covid/results\"\nif not os.path.exists(path_results):\n os.makedirs(path_results, exist_ok=True)\n\n\nimport urllib.request\n\nfile_list = [\n \"normal_pbmc_13.h5\", \"normal_pbmc_14.h5\", \"normal_pbmc_19.h5\", \"normal_pbmc_5.h5\",\n \"ncov_pbmc_15.h5\", \"ncov_pbmc_16.h5\", \"ncov_pbmc_17.h5\", \"ncov_pbmc_1.h5\"\n]\n\nfor i in file_list:\n path_file = os.path.join(path_covid, i)\n if not os.path.exists(path_file):\n file_url = os.path.join(path_data, \"covid\", i)\n urllib.request.urlretrieve(file_url, path_file)\n\nWith data in place, now we can start loading libraries we will use in this tutorial.\n\nimport numpy as np\nimport pandas as pd\nimport scanpy as sc\nimport warnings\n\nwarnings.simplefilter(action='ignore', category=Warning)\n\n# verbosity: errors (0), warnings (1), info (2), hints (3)\nsc.settings.verbosity = 3\nsc.settings.set_figure_params(dpi=80)\n\nWe can first load the data individually by reading directly from HDF5 file format (.h5).\nIn Scanpy we read them into an Anndata object with the the function read_10x_h5\n\ndata_cov1 = sc.read_10x_h5(os.path.join(path_covid,'ncov_pbmc_1.h5'))\ndata_cov1.var_names_make_unique()\ndata_cov15 = sc.read_10x_h5(os.path.join(path_covid,'ncov_pbmc_15.h5'))\ndata_cov15.var_names_make_unique()\ndata_cov16 = sc.read_10x_h5(os.path.join(path_covid,'ncov_pbmc_16.h5'))\ndata_cov16.var_names_make_unique()\ndata_cov17 = sc.read_10x_h5(os.path.join(path_covid,'ncov_pbmc_17.h5'))\ndata_cov17.var_names_make_unique()\ndata_ctrl5 = sc.read_10x_h5(os.path.join(path_covid,'normal_pbmc_5.h5'))\ndata_ctrl5.var_names_make_unique()\ndata_ctrl13 = sc.read_10x_h5(os.path.join(path_covid,'normal_pbmc_13.h5'))\ndata_ctrl13.var_names_make_unique()\ndata_ctrl14 = 
sc.read_10x_h5(os.path.join(path_covid,'normal_pbmc_14.h5'))\ndata_ctrl14.var_names_make_unique()\ndata_ctrl19 = sc.read_10x_h5(os.path.join(path_covid,'normal_pbmc_19.h5'))\ndata_ctrl19.var_names_make_unique()\n\nreading ./data/covid/ncov_pbmc_1.h5\n (0:00:00)\nreading ./data/covid/ncov_pbmc_15.h5\n (0:00:00)\nreading ./data/covid/ncov_pbmc_16.h5\n (0:00:00)\nreading ./data/covid/ncov_pbmc_17.h5\n (0:00:00)\nreading ./data/covid/normal_pbmc_5.h5\n (0:00:00)\nreading ./data/covid/normal_pbmc_13.h5\n (0:00:00)\nreading ./data/covid/normal_pbmc_14.h5\n (0:00:00)\nreading ./data/covid/normal_pbmc_19.h5\n (0:00:00)" }, { "objectID": "labs/scanpy/scanpy_01_qc.html#meta-qc_collate", "href": "labs/scanpy/scanpy_01_qc.html#meta-qc_collate", "title": " Quality Control", "section": "2 Collate", - "text": "2 Collate\n\n# add some metadata\ndata_cov1.obs['type']=\"Covid\"\ndata_cov1.obs['sample']=\"covid_1\"\ndata_cov15.obs['type']=\"Covid\"\ndata_cov15.obs['sample']=\"covid_15\"\ndata_cov17.obs['type']=\"Covid\"\ndata_cov17.obs['sample']=\"covid_17\"\ndata_ctrl5.obs['type']=\"Ctrl\"\ndata_ctrl5.obs['sample']=\"ctrl_5\"\ndata_ctrl13.obs['type']=\"Ctrl\"\ndata_ctrl13.obs['sample']=\"ctrl_13\"\ndata_ctrl14.obs['type']=\"Ctrl\"\ndata_ctrl14.obs['sample']=\"ctrl_14\"\n\n# merge into one object.\nadata = data_cov1.concatenate(data_cov15, data_cov17, data_ctrl5, data_ctrl13, data_ctrl14)\n\n# and delete individual datasets to save space\ndel(data_cov1, data_cov15, data_cov17)\ndel(data_ctrl5, data_ctrl13, data_ctrl14)\n\nYou can print a summary of the datasets in the Scanpy object, or a summary of the whole object.\n\nprint(adata.obs['sample'].value_counts())\nadata\n\nsample\ncovid_1 1500\ncovid_15 1500\ncovid_17 1500\nctrl_5 1500\nctrl_13 1500\nctrl_14 1500\nName: count, dtype: int64\n\n\nAnnData object with n_obs × n_vars = 9000 × 33538\n obs: 'type', 'sample', 'batch'\n var: 'gene_ids', 'feature_types', 'genome'" + "text": "2 Collate\n\n\n# add some 
metadata\ndata_cov1.obs['type']=\"Covid\"\ndata_cov1.obs['sample']=\"covid_1\"\ndata_cov15.obs['type']=\"Covid\"\ndata_cov15.obs['sample']=\"covid_15\"\ndata_cov16.obs['type']=\"Covid\"\ndata_cov16.obs['sample']=\"covid_16\"\ndata_cov17.obs['type']=\"Covid\"\ndata_cov17.obs['sample']=\"covid_17\"\ndata_ctrl5.obs['type']=\"Ctrl\"\ndata_ctrl5.obs['sample']=\"ctrl_5\"\ndata_ctrl13.obs['type']=\"Ctrl\"\ndata_ctrl13.obs['sample']=\"ctrl_13\"\ndata_ctrl14.obs['type']=\"Ctrl\"\ndata_ctrl14.obs['sample']=\"ctrl_14\"\ndata_ctrl19.obs['type']=\"Ctrl\"\ndata_ctrl19.obs['sample']=\"ctrl_19\"\n\n# merge into one object.\nadata = data_cov1.concatenate(data_cov15, data_cov16, data_cov17, data_ctrl5, data_ctrl13, data_ctrl14, data_ctrl19)\n\n# and delete individual datasets to save space\ndel(data_cov1, data_cov15, data_cov16, data_cov17)\ndel(data_ctrl5, data_ctrl13, data_ctrl14, data_ctrl19)\n\nYou can print a summary of the datasets in the Scanpy object, or a summary of the whole object.\n\nprint(adata.obs['sample'].value_counts())\nadata\n\nsample\ncovid_1 1500\ncovid_15 1500\ncovid_16 1500\ncovid_17 1500\nctrl_5 1500\nctrl_13 1500\nctrl_14 1500\nctrl_19 1500\nName: count, dtype: int64\n\n\nAnnData object with n_obs × n_vars = 12000 × 33538\n obs: 'type', 'sample', 'batch'\n var: 'gene_ids', 'feature_types', 'genome'" }, { "objectID": "labs/scanpy/scanpy_01_qc.html#meta-qc_calqc", "href": "labs/scanpy/scanpy_01_qc.html#meta-qc_calqc", "title": " Quality Control", "section": "3 Calculate QC", - "text": "3 Calculate QC\nHaving the data in a suitable format, we can start calculating some quality metrics. We can for example calculate the percentage of mitochondrial and ribosomal genes per cell and add to the metadata. The proportion hemoglobin genes can give an indication of red blood cell contamination. This will be helpful to visualize them across different metadata parameteres (i.e. datasetID and chemistry version). There are several ways of doing this. 
The QC metrics are finally added to the metadata table.\nCiting from Simple Single Cell workflows (Lun, McCarthy & Marioni, 2017): High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.\nFirst, let Scanpy calculate some general qc-stats for genes and cells with the function sc.pp.calculate_qc_metrics, similar to calculateQCmetrics() in Scater. It can also calculate proportion of counts for specific gene populations, so first we need to define which genes are mitochondrial, ribosomal and hemoglobin.\n\n# mitochondrial genes\nadata.var['mt'] = adata.var_names.str.startswith('MT-') \n# ribosomal genes\nadata.var['ribo'] = adata.var_names.str.startswith((\"RPS\",\"RPL\"))\n# hemoglobin genes.\nadata.var['hb'] = adata.var_names.str.contains((\"^HB[^(P)]\"))\n\nadata.var\n\n\n\n\n\n\n\n\ngene_ids\nfeature_types\ngenome\nmt\nribo\nhb\n\n\n\n\nMIR1302-2HG\nENSG00000243485\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\nFAM138A\nENSG00000237613\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\nOR4F5\nENSG00000186092\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\nAL627309.1\nENSG00000238009\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\nAL627309.3\nENSG00000239945\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\n...\n...\n...\n...\n...\n...\n...\n\n\nAC233755.2\nENSG00000277856\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\nAC233755.1\nENSG00000275063\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\nAC240274.1\nENSG00000271254\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\nAC213203.1\nENSG00000277475\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\nFAM231C\nENSG00000268674\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\n\n\n33538 rows × 6 columns\n\n\n\n\nsc.pp.calculate_qc_metrics(adata, 
qc_vars=['mt','ribo','hb'], percent_top=None, log1p=False, inplace=True)\n\nNow you can see that we have additional data in the metadata slot.\n\nmito_genes = adata.var_names.str.startswith('MT-')\n# for each cell compute fraction of counts in mito genes vs. all genes\n# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)\nadata.obs['percent_mt2'] = np.sum(\n adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1\n# add the total counts per cell as observations-annotation to adata\nadata.obs['n_counts'] = adata.X.sum(axis=1).A1\n\nadata\n\nAnnData object with n_obs × n_vars = 9000 × 33538\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'" + "text": "3 Calculate QC\nHaving the data in a suitable format, we can start calculating some quality metrics. We can for example calculate the percentage of mitochondrial and ribosomal genes per cell and add to the metadata. The proportion hemoglobin genes can give an indication of red blood cell contamination. This will be helpful to visualize them across different metadata parameteres (i.e. datasetID and chemistry version). There are several ways of doing this. The QC metrics are finally added to the metadata table.\nCiting from Simple Single Cell workflows (Lun, McCarthy & Marioni, 2017): High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. 
The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.\nFirst, let Scanpy calculate some general qc-stats for genes and cells with the function sc.pp.calculate_qc_metrics, similar to calculateQCmetrics() in Scater. It can also calculate proportion of counts for specific gene populations, so first we need to define which genes are mitochondrial, ribosomal and hemoglobin.\n\n# mitochondrial genes\nadata.var['mt'] = adata.var_names.str.startswith('MT-') \n# ribosomal genes\nadata.var['ribo'] = adata.var_names.str.startswith((\"RPS\",\"RPL\"))\n# hemoglobin genes.\nadata.var['hb'] = adata.var_names.str.contains((\"^HB[^(P|E|S)]\"))\n\nadata.var\n\n\n\n\n\n\n\n\ngene_ids\nfeature_types\ngenome\nmt\nribo\nhb\n\n\n\n\nMIR1302-2HG\nENSG00000243485\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\nFAM138A\nENSG00000237613\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\nOR4F5\nENSG00000186092\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\nAL627309.1\nENSG00000238009\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\nAL627309.3\nENSG00000239945\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\n...\n...\n...\n...\n...\n...\n...\n\n\nAC233755.2\nENSG00000277856\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\nAC233755.1\nENSG00000275063\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\nAC240274.1\nENSG00000271254\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\nAC213203.1\nENSG00000277475\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\nFAM231C\nENSG00000268674\nGene Expression\nGRCh38\nFalse\nFalse\nFalse\n\n\n\n\n33538 rows × 6 columns\n\n\n\n\nsc.pp.calculate_qc_metrics(adata, qc_vars=['mt','ribo','hb'], percent_top=None, log1p=False, inplace=True)\n\nNow you can see that we have additional data in the metadata slot.\nAnother opition to using the calculate_qc_metrics function is to calculate the values on your own and add to a metadata slot. 
An example for mito genes can be found below:\n\nmito_genes = adata.var_names.str.startswith('MT-')\n# for each cell compute fraction of counts in mito genes vs. all genes\n# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)\nadata.obs['percent_mt2'] = np.sum(\n adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1\n# add the total counts per cell as observations-annotation to adata\nadata.obs['n_counts'] = adata.X.sum(axis=1).A1\n\nadata\n\nAnnData object with n_obs × n_vars = 12000 × 33538\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'" }, { "objectID": "labs/scanpy/scanpy_01_qc.html#meta-qc_plotqc", "href": "labs/scanpy/scanpy_01_qc.html#meta-qc_plotqc", "title": " Quality Control", "section": "4 Plot QC", - "text": "4 Plot QC\nNow we can plot some of the QC variables as violin plots.\n\nsc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt', 'pct_counts_ribo', 'pct_counts_hb'], jitter=0.4, groupby = 'sample', rotation= 45)\n\n\n\n\n\n\n\n\nAs you can see, there is quite some difference in quality for the 4 datasets, with for instance the covid_15 sample having fewer cells with many detected genes and more mitochondrial content. As the ribosomal proteins are highly expressed they will make up a larger proportion of the transcriptional landscape when fewer of the lowly expressed genes are detected. And we can plot the different QC-measures as scatter plots.\n\nsc.pl.scatter(adata, x='total_counts', y='pct_counts_mt', color=\"sample\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nPlot additional QC stats that we have calculated as scatter plots. How are the different measures correlated? 
Can you explain why?" + "text": "4 Plot QC\nNow we can plot some of the QC variables as violin plots.\n\nsc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt', 'pct_counts_ribo', 'pct_counts_hb'], jitter=0.4, groupby = 'sample', rotation= 45)\n\n\n\n\n\n\n\n\nAs you can see, there is quite some difference in quality for the 4 datasets, with for instance the covid_15 and covid_16 samples having fewer cells with many detected genes and more mitochondrial content. As the ribosomal proteins are highly expressed they will make up a larger proportion of the transcriptional landscape when fewer of the lowly expressed genes are detected. We can also plot the different QC-measures as scatter plots.\n\nsc.pl.scatter(adata, x='total_counts', y='pct_counts_mt', color=\"sample\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nPlot additional QC stats that we have calculated as scatter plots. How are the different measures correlated? Can you explain why?" }, { "objectID": "labs/scanpy/scanpy_01_qc.html#meta-qc_filter", "href": "labs/scanpy/scanpy_01_qc.html#meta-qc_filter", "title": " Quality Control", "section": "5 Filtering", - "text": "5 Filtering\n\n5.1 Detection-based filtering\nA standard approach is to filter cells with low amount of reads as well as genes that are present in at least a certain amount of cells. Here we will only consider cells with at least 200 detected genes and genes need to be expressed in at least 3 cells. Please note that those values are highly dependent on the library preparation method used.\n\nsc.pp.filter_cells(adata, min_genes=200)\nsc.pp.filter_genes(adata, min_cells=3)\n\nprint(adata.n_obs, adata.n_vars)\n\nfiltered out 897 cells that have less than 200 genes expressed\nfiltered out 14683 genes that are detected in less than 3 cells\n8103 18855\n\n\nExtremely high number of detected genes could indicate doublets. 
However, depending on the cell type composition in your sample, you may have cells with higher number of genes (and also higher counts) from one cell type. In this case, we will run doublet prediction further down, so we will skip this step now, but the code below is an example of how it can be run:\n\n# skip for now as we are doing doublet prediction\n#keep_v2 = (adata.obs['n_genes_by_counts'] < 2000) & (adata.obs['n_genes_by_counts'] > 500) & (adata.obs['lib_prep'] == 'v2')\n#print(sum(keep_v2))\n\n# filter for gene detection for v3\n#keep_v3 = (adata.obs['n_genes_by_counts'] < 4100) & (adata.obs['n_genes_by_counts'] > 1000) & (adata.obs['lib_prep'] != 'v2')\n#print(sum(keep_v3))\n\n# keep both sets of cells\n#keep = (keep_v2) | (keep_v3)\n#print(sum(keep))\n#adata = adata[keep, :]\n\n#print(\"Remaining cells %d\"%adata.n_obs)\n\nAdditionally, we can also see which genes contribute the most to such reads. We can for instance plot the percentage of counts per gene.\n\nsc.pl.highest_expr_genes(adata, n_top=20)\n\nnormalizing counts per cell\n finished (0:00:00)\n\n\n\n\n\n\n\n\n\nAs you can see, MALAT1 constitutes up to 30% of the UMIs from a single cell and the other top genes are mitochondrial and ribosomal genes. It is quite common that nuclear lincRNAs have correlation with quality and mitochondrial reads, so high detection of MALAT1 may be a technical issue. Let us assemble some information about such genes, which are important for quality control and downstream filtering.\n\n\n5.2 Mito/Ribo filtering\nWe also have quite a lot of cells with high proportion of mitochondrial and low proportion of ribosomal reads. It could be wise to remove those cells, if we have enough cells left after filtering. Another option would be to either remove all mitochondrial reads from the dataset and hope that the remaining genes still have enough biological signal. A third option would be to just regress out the percent_mito variable during scaling. 
In this case we had as much as 99.7% mitochondrial reads in some of the cells, so it is quite unlikely that there is much cell type signature left in those. Looking at the plots, make reasonable decisions on where to draw the cutoff. In this case, the bulk of the cells are below 20% mitochondrial reads and that will be used as a cutoff. We will also remove cells with less than 5% ribosomal reads.\n\n# filter for percent mito\nadata = adata[adata.obs['pct_counts_mt'] < 20, :]\n\n# filter for percent ribo > 0.05\nadata = adata[adata.obs['pct_counts_ribo'] > 5, :]\n\nprint(\"Remaining cells %d\"%adata.n_obs)\n\nRemaining cells 5888\n\n\nAs you can see, a large proportion of sample covid_15 is filtered out. Also, there is still quite a lot of variation in percent_mito, so it will have to be dealt with in the data analysis step. We can also notice that the percent_ribo are also highly variable, but that is expected since different cell types have different proportions of ribosomal content, according to their function.\n\n\n5.3 Plot filtered QC\nLets plot the same QC-stats another time.\n\nsc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo', 'pct_counts_hb'], jitter=0.4, groupby = 'sample', rotation = 45)\n\n\n\n\n\n\n\n\n\n\n5.4 Filter genes\nAs the level of expression of mitochondrial and MALAT1 genes are judged as mainly technical, it can be wise to remove them from the dataset before any further analysis.\n\nmalat1 = adata.var_names.str.startswith('MALAT1')\n# we need to redefine the mito_genes since they were first \n# calculated on the full object before removing low expressed genes.\nmito_genes = adata.var_names.str.startswith('MT-')\nhb_genes = adata.var_names.str.contains('^HB[^(P)]')\n\nremove = np.add(mito_genes, malat1)\nremove = np.add(remove, hb_genes)\nkeep = np.invert(remove)\n\nadata = adata[:,keep]\n\nprint(adata.n_obs, adata.n_vars)\n\n5888 18830" + "text": "5 Filtering\n\n5.1 Detection-based filtering\nA 
standard approach is to filter cells with low amount of reads as well as genes that are present in at least a certain amount of cells. Here we will only consider cells with at least 200 detected genes and genes need to be expressed in at least 3 cells. Please note that those values are highly dependent on the library preparation method used.\n\nsc.pp.filter_cells(adata, min_genes=200)\nsc.pp.filter_genes(adata, min_cells=3)\n\nprint(adata.n_obs, adata.n_vars)\n\nfiltered out 1336 cells that have less than 200 genes expressed\nfiltered out 14047 genes that are detected in less than 3 cells\n10664 19491\n\n\nExtremely high number of detected genes could indicate doublets. However, depending on the cell type composition in your sample, you may have cells with higher number of genes (and also higher counts) from one cell type. In this case, we will run doublet prediction further down, so we will skip this step now, but the code below is an example of how it can be run:\n\n# skip for now as we are doing doublet prediction\n#keep_v2 = (adata.obs['n_genes_by_counts'] < 2000) & (adata.obs['n_genes_by_counts'] > 500) & (adata.obs['lib_prep'] == 'v2')\n#print(sum(keep_v2))\n\n# filter for gene detection for v3\n#keep_v3 = (adata.obs['n_genes_by_counts'] < 4100) & (adata.obs['n_genes_by_counts'] > 1000) & (adata.obs['lib_prep'] != 'v2')\n#print(sum(keep_v3))\n\n# keep both sets of cells\n#keep = (keep_v2) | (keep_v3)\n#print(sum(keep))\n#adata = adata[keep, :]\n\n#print(\"Remaining cells %d\"%adata.n_obs)\n\nAdditionally, we can also see which genes contribute the most to such reads. We can for instance plot the percentage of counts per gene.\n\nsc.pl.highest_expr_genes(adata, n_top=20)\n\nnormalizing counts per cell\n finished (0:00:00)\n\n\n\n\n\n\n\n\n\nAs you can see, MALAT1 constitutes up to 30% of the UMIs from a single cell and the other top genes are mitochondrial and ribosomal genes. 
It is quite common that nuclear lincRNAs have correlation with quality and mitochondrial reads, so high detection of MALAT1 may be a technical issue. Let us assemble some information about such genes, which are important for quality control and downstream filtering.\n\n\n5.2 Mito/Ribo filtering\nWe also have quite a lot of cells with high proportion of mitochondrial and low proportion of ribosomal reads. It could be wise to remove those cells, if we have enough cells left after filtering. Another option would be to either remove all mitochondrial reads from the dataset and hope that the remaining genes still have enough biological signal. A third option would be to just regress out the percent_mito variable during scaling. In this case we had as much as 99.7% mitochondrial reads in some of the cells, so it is quite unlikely that there is much cell type signature left in those. Looking at the plots, make reasonable decisions on where to draw the cutoff. In this case, the bulk of the cells are below 20% mitochondrial reads and that will be used as a cutoff. We will also remove cells with less than 5% ribosomal reads.\n\n# filter for percent mito\nadata = adata[adata.obs['pct_counts_mt'] < 20, :]\n\n# filter for percent ribo > 0.05\nadata = adata[adata.obs['pct_counts_ribo'] > 5, :]\n\nprint(\"Remaining cells %d\"%adata.n_obs)\n\nRemaining cells 7431\n\n\nAs you can see, a large proportion of sample covid_15 is filtered out. Also, there is still quite a lot of variation in percent_mito, so it will have to be dealt with in the data analysis step. 
We can also notice that the percent_ribo are also highly variable, but that is expected since different cell types have different proportions of ribosomal content, according to their function.\n\n\n5.3 Plot filtered QC\nLets plot the same QC-stats another time.\n\nsc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo', 'pct_counts_hb'], jitter=0.4, groupby = 'sample', rotation = 45)\n\n\n\n\n\n\n\n\n\n\n5.4 Filter genes\nAs the level of expression of mitochondrial and MALAT1 genes are judged as mainly technical, it can be wise to remove them from the dataset before any further analysis. In this case we will also remove the HB genes.\n\nmalat1 = adata.var_names.str.startswith('MALAT1')\n# we need to redefine the mito_genes since they were first \n# calculated on the full object before removing low expressed genes.\nmito_genes = adata.var_names.str.startswith('MT-')\nhb_genes = adata.var_names.str.contains('^HB[^(P|E|S)]')\n\nremove = np.add(mito_genes, malat1)\nremove = np.add(remove, hb_genes)\nkeep = np.invert(remove)\n\nadata = adata[:,keep]\n\nprint(adata.n_obs, adata.n_vars)\n\n7431 19468" }, { "objectID": "labs/scanpy/scanpy_01_qc.html#meta-qc_sex", "href": "labs/scanpy/scanpy_01_qc.html#meta-qc_sex", "title": " Quality Control", "section": "6 Sample sex", - "text": "6 Sample sex\nWhen working with human or animal samples, you should ideally constrain you experiments to a single sex to avoid including sex bias in the conclusions. However this may not always be possible. By looking at reads from chromosomeY (males) and XIST (X-inactive specific transcript) expression (mainly female) it is quite easy to determine per sample which sex it is. 
It can also bee a good way to detect if there has been any sample mixups, if the sample metadata sex does not agree with the computational predictions.\nTo get choromosome information for all genes, you should ideally parse the information from the gtf file that you used in the mapping pipeline as it has the exact same annotation version/gene naming. However, it may not always be available, as in this case where we have downloaded public data. Hence, we will use biomart to fetch chromosome information.\n\n# requires pybiomart\nannot = sc.queries.biomart_annotations(\"hsapiens\", [\"ensembl_gene_id\", \"external_gene_name\", \"start_position\", \"end_position\", \"chromosome_name\"], ).set_index(\"external_gene_name\")\n# adata.var[annot.columns] = annot\n\nNow that we have the chromosome information, we can calculate per cell the proportion of reads that comes from chromosome Y.\n\nchrY_genes = adata.var_names.intersection(annot.index[annot.chromosome_name == \"Y\"])\nchrY_genes\n\nadata.obs['percent_chrY'] = np.sum(\n adata[:, chrY_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1 * 100\n\nThen plot XIST expression vs chrY proportion. As you can see, the samples are clearly on either side, even if some cells do not have detection of either.\n\n# color inputs must be from either .obs or .var, so add in XIST expression to obs.\nadata.obs[\"XIST-counts\"] = adata.X[:,adata.var_names.str.match('XIST')].toarray()\n\nsc.pl.scatter(adata, x='XIST-counts', y='percent_chrY', color=\"sample\")\n\n\n\n\n\n\n\n\nPlot as violins.\n\nsc.pl.violin(adata, [\"XIST-counts\", \"percent_chrY\"], jitter=0.4, groupby = 'sample', rotation= 45)\n\n\n\n\n\n\n\n\nHere, we can see clearly that we have two males and 4 females, can you see which samples they are? Do you think this will cause any problems for downstream analysis? Discuss with your group: what would be the best way to deal with this type of sex bias?" 
+ "text": "6 Sample sex\nWhen working with human or animal samples, you should ideally constrain you experiments to a single sex to avoid including sex bias in the conclusions. However this may not always be possible. By looking at reads from chromosomeY (males) and XIST (X-inactive specific transcript) expression (mainly female) it is quite easy to determine per sample which sex it is. It can also bee a good way to detect if there has been any sample mixups, if the sample metadata sex does not agree with the computational predictions.\nTo get choromosome information for all genes, you should ideally parse the information from the gtf file that you used in the mapping pipeline as it has the exact same annotation version/gene naming. However, it may not always be available, as in this case where we have downloaded public data. Hence, we will use biomart to fetch chromosome information.\n\n# requires pybiomart\nannot = sc.queries.biomart_annotations(\"hsapiens\", [\"ensembl_gene_id\", \"external_gene_name\", \"start_position\", \"end_position\", \"chromosome_name\"], ).set_index(\"external_gene_name\")\n# adata.var[annot.columns] = annot\n\nNow that we have the chromosome information, we can calculate per cell the proportion of reads that comes from chromosome Y.\n\nchrY_genes = adata.var_names.intersection(annot.index[annot.chromosome_name == \"Y\"])\nchrY_genes\n\nadata.obs['percent_chrY'] = np.sum(\n adata[:, chrY_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1 * 100\n\nThen plot XIST expression vs chrY proportion. 
As you can see, the samples are clearly on either side, even if some cells do not have detection of either.\n\n# color inputs must be from either .obs or .var, so add in XIST expression to obs.\nadata.obs[\"XIST-counts\"] = adata.X[:,adata.var_names.str.match('XIST')].toarray()\n\nsc.pl.scatter(adata, x='XIST-counts', y='percent_chrY', color=\"sample\")\n\n\n\n\n\n\n\n\nPlot as violins.\n\nsc.pl.violin(adata, [\"XIST-counts\", \"percent_chrY\"], jitter=0.4, groupby = 'sample', rotation= 45)\n\n\n\n\n\n\n\n\nHere, we can see clearly that we have three males and five females, can you see which samples they are? Do you think this will cause any problems for downstream analysis? Discuss with your group: what would be the best way to deal with this type of sex bias?" }, { "objectID": "labs/scanpy/scanpy_01_qc.html#meta-qc_cellcycle", "href": "labs/scanpy/scanpy_01_qc.html#meta-qc_cellcycle", "title": " Quality Control", "section": "7 Cell cycle state", - "text": "7 Cell cycle state\nWe here perform cell cycle scoring. To score a gene list, the algorithm calculates the difference of mean expression of the given list and the mean expression of reference genes. To build the reference, the function randomly chooses a bunch of genes matching the distribution of the expression of the given list. Cell cycle scoring adds three slots in data, a score for S phase, a score for G2M phase and the predicted cell cycle phase.\nFirst read the file with cell cycle genes, from Regev lab and split into S and G2M phase genes. 
We first download the file.\n\npath_file = os.path.join(path_results, 'regev_lab_cell_cycle_genes.txt')\nif not os.path.exists(path_file):\n urllib.request.urlretrieve(os.path.join(path_data, 'regev_lab_cell_cycle_genes.txt'), path_file)\n\n\ncell_cycle_genes = [x.strip() for x in open('./data/covid/results/regev_lab_cell_cycle_genes.txt')]\nprint(len(cell_cycle_genes))\n\n# Split into 2 lists\ns_genes = cell_cycle_genes[:43]\ng2m_genes = cell_cycle_genes[43:]\n\ncell_cycle_genes = [x for x in cell_cycle_genes if x in adata.var_names]\nprint(len(cell_cycle_genes))\n\n97\n94\n\n\nBefore running cell cycle we have to normalize the data. In the scanpy object, the data slot will be overwritten with the normalized data. So first, save the raw data into the slot raw. Then run normalization, log transformation and scale the data.\n\n# save normalized counts in raw slot.\nadata.raw = adata\n\n# normalize to depth 10 000\nsc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)\n\n# logaritmize\nsc.pp.log1p(adata)\n\n# scale\nsc.pp.scale(adata)\n\nnormalizing by total count per cell\n finished (0:00:00): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)\n... as `zero_center=True`, sparse input is densified and may lead to large memory consumption\n\n\nWe here perform cell cycle scoring. The function is actually a wrapper to sc.tl.score_gene_list, which is launched twice, to score separately S and G2M phases. Both sc.tl.score_gene_list and sc.tl.score_cell_cycle_genes are a port from Seurat and are supposed to work in a very similar way. To score a gene list, the algorithm calculates the difference of mean expression of the given list and the mean expression of reference genes. To build the reference, the function randomly chooses a bunch of genes matching the distribution of the expression of the given list. 
Cell cycle scoring adds three slots in data, a score for S phase, a score for G2M phase and the predicted cell cycle phase.\n\nsc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)\n\ncalculating cell cycle phase\ncomputing score 'S_score'\nWARNING: genes are not in var_names and ignored: ['MLF1IP']\n finished: added\n 'S_score', score of gene set (adata.obs).\n 727 total control genes are used. (0:00:00)\ncomputing score 'G2M_score'\nWARNING: genes are not in var_names and ignored: ['FAM64A', 'HN1']\n finished: added\n 'G2M_score', score of gene set (adata.obs).\n 771 total control genes are used. (0:00:00)\n--> 'phase', cell cycle phase (adata.obs)\n\n\nWe can now plot a violin plot for the cell cycle scores as well.\n\nsc.pl.violin(adata, ['S_score', 'G2M_score'], jitter=0.4, groupby = 'sample', rotation=45)\n\n\n\n\n\n\n\n\nIn this case it looks like we only have a few cycling cells in the datasets." + "text": "7 Cell cycle state\nWe here perform cell cycle scoring. To score a gene list, the algorithm calculates the difference of mean expression of the given list and the mean expression of reference genes. To build the reference, the function randomly chooses a bunch of genes matching the distribution of the expression of the given list. Cell cycle scoring adds three slots in data, a score for S phase, a score for G2M phase and the predicted cell cycle phase.\nFirst read the file with cell cycle genes, from Regev lab and split into S and G2M phase genes. 
We first download the file.\n\npath_file = os.path.join(path_results, 'regev_lab_cell_cycle_genes.txt')\nif not os.path.exists(path_file):\n urllib.request.urlretrieve(os.path.join(path_data, 'regev_lab_cell_cycle_genes.txt'), path_file)\n\n\ncell_cycle_genes = [x.strip() for x in open('./data/covid/results/regev_lab_cell_cycle_genes.txt')]\nprint(len(cell_cycle_genes))\n\n# Split into 2 lists\ns_genes = cell_cycle_genes[:43]\ng2m_genes = cell_cycle_genes[43:]\n\ncell_cycle_genes = [x for x in cell_cycle_genes if x in adata.var_names]\nprint(len(cell_cycle_genes))\n\n97\n94\n\n\nBefore running cell cycle we have to normalize the data. In the scanpy object, the data slot will be overwritten with the normalized data. So first, save the raw data into the slot raw. Then run normalization, log transformation and scale the data.\n\n# save normalized counts in raw slot.\nadata.raw = adata\n\n# normalize to depth 10 000\nsc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)\n\n# logaritmize\nsc.pp.log1p(adata)\n\n# scale\nsc.pp.scale(adata)\n\nnormalizing by total count per cell\n finished (0:00:00): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)\n... as `zero_center=True`, sparse input is densified and may lead to large memory consumption\n\n\nWe here perform cell cycle scoring. The function is actually a wrapper to sc.tl.score_gene_list, which is launched twice, to score separately S and G2M phases. Both sc.tl.score_gene_list and sc.tl.score_cell_cycle_genes are a port from Seurat and are supposed to work in a very similar way. To score a gene list, the algorithm calculates the difference of mean expression of the given list and the mean expression of reference genes. To build the reference, the function randomly chooses a bunch of genes matching the distribution of the expression of the given list. 
Cell cycle scoring adds three slots in data, a score for S phase, a score for G2M phase and the predicted cell cycle phase.\n\nsc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)\n\ncalculating cell cycle phase\ncomputing score 'S_score'\nWARNING: genes are not in var_names and ignored: ['MLF1IP']\n finished: added\n 'S_score', score of gene set (adata.obs).\n 774 total control genes are used. (0:00:00)\ncomputing score 'G2M_score'\nWARNING: genes are not in var_names and ignored: ['FAM64A', 'HN1']\n finished: added\n 'G2M_score', score of gene set (adata.obs).\n 772 total control genes are used. (0:00:00)\n--> 'phase', cell cycle phase (adata.obs)\n\n\nWe can now plot a violin plot for the cell cycle scores as well.\n\nsc.pl.violin(adata, ['S_score', 'G2M_score'], jitter=0.4, groupby = 'sample', rotation=45)\n\n\n\n\n\n\n\n\nIn this case it looks like we only have a few cycling cells in the datasets.\nScanpy does an automatic prediction of cell cycle phase with a default cutoff of the scores at zero. As you can see this does not fit this data very well, so be cautios with using these predictions. Instead we suggest that you look at the scores.\n\nsc.pl.scatter(adata, x='S_score', y='G2M_score', color=\"phase\")" }, { "objectID": "labs/scanpy/scanpy_01_qc.html#meta-qc_doublet", "href": "labs/scanpy/scanpy_01_qc.html#meta-qc_doublet", "title": " Quality Control", "section": "8 Predict doublets", - "text": "8 Predict doublets\nDoublets/Multiples of cells in the same well/droplet is a common issue in scRNAseq protocols. Especially in droplet-based methods with overloading of cells. In a typical 10x experiment the proportion of doublets is linearly dependent on the amount of loaded cells. As indicated from the Chromium user guide, doublet rates are about as follows:\n\nMost doublet detectors simulates doublets by merging cell counts and predicts doublets as cells that have similar embeddings as the simulated doublets. 
Most such packages need an assumption about the number/proportion of expected doublets in the dataset. The data you are using is subsampled, but the original datasets contained about 5 000 cells per sample, hence we can assume that they loaded about 9 000 cells and should have a doublet rate at about 4%.\n\n\n\n\n\n\nCaution\n\n\n\nIdeally doublet prediction should be run on each sample separately, especially if your different samples have different proportions of cell types. In this case, the data is subsampled so we have very few cells per sample and all samples are sorted PBMCs so it is okay to run them together.\n\n\nFor doublet detection, we will use the package Scrublet, so first we need to get the raw counts from adata.raw.X and run scrublet with that matrix. Then we add in the doublet prediction info into our anndata object.\nDoublet prediction should be run for each dataset separately, so first we need to split the adata object into 6 separate objects, one per sample and then run scrublet on each of them.\n\nimport scrublet as scr\n\n# split per batch into new objects.\nbatches = adata.obs['sample'].cat.categories.tolist()\nalldata = {}\nfor batch in batches:\n tmp = adata[adata.obs['sample'] == batch,]\n print(batch, \":\", tmp.shape[0], \" cells\")\n scrub = scr.Scrublet(tmp.raw.X)\n out = scrub.scrub_doublets(verbose=False, n_prin_comps = 20)\n alldata[batch] = pd.DataFrame({'doublet_score':out[0],'predicted_doublets':out[1]},index = tmp.obs.index)\n print(alldata[batch].predicted_doublets.sum(), \" predicted_doublets\")\n\ncovid_1 : 900 cells\n25 predicted_doublets\ncovid_15 : 599 cells\n8 predicted_doublets\ncovid_17 : 1101 cells\n18 predicted_doublets\nctrl_5 : 1052 cells\n24 predicted_doublets\nctrl_13 : 1173 cells\n56 predicted_doublets\nctrl_14 : 1063 cells\n32 predicted_doublets\n\n\n\n# add predictions to the adata object.\nscrub_pred = pd.concat(alldata.values())\nadata.obs['doublet_scores'] = scrub_pred['doublet_score'] 
\nadata.obs['predicted_doublets'] = scrub_pred['predicted_doublets'] \n\nsum(adata.obs['predicted_doublets'])\n\n163\n\n\nWe should expect that two cells have more detected genes than a single cell, lets check if our predicted doublets also have more detected genes in general.\n\n# add in column with singlet/doublet instead of True/Fals\n%matplotlib inline\n\nadata.obs['doublet_info'] = adata.obs[\"predicted_doublets\"].astype(str)\nsc.pl.violin(adata, 'n_genes_by_counts', jitter=0.4, groupby = 'doublet_info', rotation=45)\n\n\n\n\n\n\n\n\nNow, lets run PCA and UMAP and plot doublet scores onto UMAP to check the doublet predictions.\n\nsc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)\nadata = adata[:, adata.var.highly_variable]\nsc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])\nsc.pp.scale(adata, max_value=10)\nsc.tl.pca(adata, svd_solver='arpack')\nsc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)\nsc.tl.umap(adata)\nsc.pl.umap(adata, color=['doublet_scores','doublet_info','sample'])\n\nextracting highly variable genes\n finished (0:00:01)\n--> added\n 'highly_variable', boolean vector (adata.var)\n 'means', float vector (adata.var)\n 'dispersions', float vector (adata.var)\n 'dispersions_norm', float vector (adata.var)\nregressing out ['total_counts', 'pct_counts_mt']\n finished (0:00:30)\ncomputing PCA\n on highly variable genes\n with n_comps=50\n finished (0:00:01)\ncomputing neighbors\n using 'X_pca' with n_pcs = 40\n finished: added to `.uns['neighbors']`\n `.obsp['distances']`, distances for each pair of neighbors\n `.obsp['connectivities']`, weighted adjacency matrix (0:00:08)\ncomputing UMAP\n finished: added\n 'X_umap', UMAP coordinates (adata.obsm) (0:00:07)\n\n\n\n\n\n\n\n\n\nNow, lets remove all predicted doublets from our data.\n\n# also revert back to the raw counts as the main matrix in adata\nadata = adata.raw.to_adata() \n\nadata = adata[adata.obs['doublet_info'] == 'False',:]\nprint(adata.shape)\n\n(5725, 
18830)" + "text": "8 Predict doublets\nDoublets/Multiples of cells in the same well/droplet is a common issue in scRNAseq protocols. Especially in droplet-based methods with overloading of cells. In a typical 10x experiment the proportion of doublets is linearly dependent on the amount of loaded cells. As indicated from the Chromium user guide, doublet rates are about as follows:\n\nMost doublet detectors simulates doublets by merging cell counts and predicts doublets as cells that have similar embeddings as the simulated doublets. Most such packages need an assumption about the number/proportion of expected doublets in the dataset. The data you are using is subsampled, but the original datasets contained about 5 000 cells per sample, hence we can assume that they loaded about 9 000 cells and should have a doublet rate at about 4%.\nFor doublet detection, we will use the package Scrublet, so first we need to get the raw counts from adata.raw.X and run scrublet with that matrix. Then we add in the doublet prediction info into our anndata object.\nDoublet prediction should be run for each dataset separately, so first we need to split the adata object into 6 separate objects, one per sample and then run scrublet on each of them.\n\nimport scrublet as scr\n\n# split per batch into new objects.\nbatches = adata.obs['sample'].cat.categories.tolist()\nalldata = {}\nfor batch in batches:\n tmp = adata[adata.obs['sample'] == batch,]\n print(batch, \":\", tmp.shape[0], \" cells\")\n scrub = scr.Scrublet(tmp.raw.X)\n out = scrub.scrub_doublets(verbose=False, n_prin_comps = 20)\n alldata[batch] = pd.DataFrame({'doublet_score':out[0],'predicted_doublets':out[1]},index = tmp.obs.index)\n print(alldata[batch].predicted_doublets.sum(), \" predicted_doublets\")\n\ncovid_1 : 900 cells\n24 predicted_doublets\ncovid_15 : 599 cells\n8 predicted_doublets\ncovid_16 : 373 cells\n3 predicted_doublets\ncovid_17 : 1101 cells\n17 predicted_doublets\nctrl_5 : 1052 cells\n35 
predicted_doublets\nctrl_13 : 1173 cells\n52 predicted_doublets\nctrl_14 : 1063 cells\n33 predicted_doublets\nctrl_19 : 1170 cells\n37 predicted_doublets\n\n\n\n# add predictions to the adata object.\nscrub_pred = pd.concat(alldata.values())\nadata.obs['doublet_scores'] = scrub_pred['doublet_score'] \nadata.obs['predicted_doublets'] = scrub_pred['predicted_doublets'] \n\nsum(adata.obs['predicted_doublets'])\n\n209\n\n\nWe should expect that two cells have more detected genes than a single cell, lets check if our predicted doublets also have more detected genes in general.\n\n# add in column with singlet/doublet instead of True/Fals\n%matplotlib inline\n\nadata.obs['doublet_info'] = adata.obs[\"predicted_doublets\"].astype(str)\nsc.pl.violin(adata, 'n_genes_by_counts', jitter=0.4, groupby = 'doublet_info', rotation=45)\n\n\n\n\n\n\n\n\nNow, lets run PCA and UMAP and plot doublet scores onto UMAP to check the doublet predictions.\n\nsc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)\nadata = adata[:, adata.var.highly_variable]\nsc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])\nsc.pp.scale(adata, max_value=10)\nsc.tl.pca(adata, svd_solver='arpack')\nsc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)\nsc.tl.umap(adata)\nsc.pl.umap(adata, color=['doublet_scores','doublet_info','sample'])\n\nextracting highly variable genes\n finished (0:00:01)\n--> added\n 'highly_variable', boolean vector (adata.var)\n 'means', float vector (adata.var)\n 'dispersions', float vector (adata.var)\n 'dispersions_norm', float vector (adata.var)\nregressing out ['total_counts', 'pct_counts_mt']\n finished (0:00:36)\ncomputing PCA\n on highly variable genes\n with n_comps=50\n finished (0:00:02)\ncomputing neighbors\n using 'X_pca' with n_pcs = 40\n finished: added to `.uns['neighbors']`\n `.obsp['distances']`, distances for each pair of neighbors\n `.obsp['connectivities']`, weighted adjacency matrix (0:00:09)\ncomputing UMAP\n finished: added\n 'X_umap', 
UMAP coordinates (adata.obsm) (0:00:10)\n\n\n\n\n\n\n\n\n\nNow, lets remove all predicted doublets from our data.\n\n# also revert back to the raw counts as the main matrix in adata\nadata = adata.raw.to_adata() \n\nadata = adata[adata.obs['doublet_info'] == 'False',:]\nprint(adata.shape)\n\n(7222, 19468)" }, { "objectID": "labs/scanpy/scanpy_01_qc.html#meta-qc_save", @@ -956,7 +956,7 @@ "href": "labs/scanpy/scanpy_01_qc.html#meta-session", "title": " Quality Control", "section": "10 Session info", - "text": "10 Session info\n\n\nClick here\n\n\nsc.logging.print_versions()\n\n-----\nanndata 0.10.3\nscanpy 1.9.6\n-----\nPIL 10.0.0\nannoy NA\nanyio NA\nasttokens NA\nattr 23.1.0\nbabel 2.12.1\nbackcall 0.2.0\ncertifi 2023.11.17\ncffi 1.15.1\ncharset_normalizer 3.1.0\ncolorama 0.4.6\ncomm 0.1.3\ncycler 0.12.1\ncython_runtime NA\ndateutil 2.8.2\ndebugpy 1.6.7\ndecorator 5.1.1\ndefusedxml 0.7.1\nexceptiongroup 1.2.0\nexecuting 1.2.0\nfastjsonschema NA\nfuture 0.18.3\ngmpy2 2.1.2\nh5py 3.9.0\nidna 3.4\nigraph 0.10.8\nipykernel 6.23.1\nipython_genutils 0.2.0\njedi 0.18.2\njinja2 3.1.2\njoblib 1.3.2\njson5 NA\njsonpointer 2.0\njsonschema 4.17.3\njupyter_events 0.6.3\njupyter_server 2.6.0\njupyterlab_server 2.22.1\nkiwisolver 1.4.5\nlazy_loader NA\nleidenalg 0.10.1\nllvmlite 0.41.1\nlouvain 0.8.1\nmarkupsafe 2.1.2\nmatplotlib 3.8.0\nmatplotlib_inline 0.1.6\nmpl_toolkits NA\nmpmath 1.3.0\nnatsort 8.4.0\nnbformat 5.8.0\nnumba 0.58.1\nnumpy 1.26.2\nopt_einsum v3.3.0\noverrides NA\npackaging 23.1\npandas 2.1.4\nparso 0.8.3\npatsy 0.5.5\npexpect 4.8.0\npickleshare 0.7.5\npkg_resources NA\nplatformdirs 3.5.1\nprometheus_client NA\nprompt_toolkit 3.0.38\npsutil 5.9.5\nptyprocess 0.7.0\npure_eval 0.2.2\npvectorc NA\npybiomart 0.2.0\npycparser 2.21\npydev_ipython NA\npydevconsole NA\npydevd 2.9.5\npydevd_file_utils NA\npydevd_plugins NA\npydevd_tracing NA\npygments 2.15.1\npynndescent 0.5.11\npyparsing 3.1.1\npyrsistent NA\npythonjsonlogger NA\npytz 2023.3\nrequests 
2.31.0\nrequests_cache 0.4.13\nrfc3339_validator 0.1.4\nrfc3986_validator 0.1.1\nscipy 1.11.4\nscrublet NA\nseaborn 0.12.2\nsend2trash NA\nsession_info 1.0.0\nsix 1.16.0\nskimage 0.22.0\nsklearn 1.3.2\nsniffio 1.3.0\nsocks 1.7.1\nsparse 0.14.0\nstack_data 0.6.2\nstatsmodels 0.14.1\nsympy 1.12\ntexttable 1.7.0\nthreadpoolctl 3.2.0\ntorch 2.0.0\ntornado 6.3.2\ntqdm 4.65.0\ntraitlets 5.9.0\ntyping_extensions NA\numap 0.5.5\nurllib3 2.0.2\nwcwidth 0.2.6\nwebsocket 1.5.2\nyaml 6.0\nzmq 25.0.2\nzoneinfo NA\nzstandard 0.19.0\n-----\nIPython 8.13.2\njupyter_client 8.2.0\njupyter_core 5.3.0\njupyterlab 4.0.1\nnotebook 6.5.4\n-----\nPython 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]\nLinux-6.5.11-linuxkit-x86_64-with-glibc2.35\n-----\nSession information updated at 2024-01-16 23:17" + "text": "10 Session info\n\n\nClick here\n\n\nsc.logging.print_versions()\n\n-----\nanndata 0.10.3\nscanpy 1.9.6\n-----\nPIL 10.0.0\nannoy NA\nanyio NA\nasttokens NA\nattr 23.1.0\nbabel 2.12.1\nbackcall 0.2.0\ncertifi 2023.11.17\ncffi 1.15.1\ncharset_normalizer 3.1.0\ncolorama 0.4.6\ncomm 0.1.3\ncycler 0.12.1\ncython_runtime NA\ndateutil 2.8.2\ndebugpy 1.6.7\ndecorator 5.1.1\ndefusedxml 0.7.1\nexceptiongroup 1.2.0\nexecuting 1.2.0\nfastjsonschema NA\nfuture 0.18.3\ngmpy2 2.1.2\nh5py 3.9.0\nidna 3.4\nigraph 0.10.8\nipykernel 6.23.1\nipython_genutils 0.2.0\njedi 0.18.2\njinja2 3.1.2\njoblib 1.3.2\njson5 NA\njsonpointer 2.0\njsonschema 4.17.3\njupyter_events 0.6.3\njupyter_server 2.6.0\njupyterlab_server 2.22.1\nkiwisolver 1.4.5\nlazy_loader NA\nleidenalg 0.10.1\nllvmlite 0.41.1\nlouvain 0.8.1\nmarkupsafe 2.1.2\nmatplotlib 3.8.0\nmatplotlib_inline 0.1.6\nmpl_toolkits NA\nmpmath 1.3.0\nnatsort 8.4.0\nnbformat 5.8.0\nnumba 0.58.1\nnumpy 1.26.2\nopt_einsum v3.3.0\noverrides NA\npackaging 23.1\npandas 2.1.4\nparso 0.8.3\npatsy 0.5.5\npexpect 4.8.0\npickleshare 0.7.5\npkg_resources NA\nplatformdirs 3.5.1\nprometheus_client NA\nprompt_toolkit 3.0.38\npsutil 
5.9.5\nptyprocess 0.7.0\npure_eval 0.2.2\npvectorc NA\npybiomart 0.2.0\npycparser 2.21\npydev_ipython NA\npydevconsole NA\npydevd 2.9.5\npydevd_file_utils NA\npydevd_plugins NA\npydevd_tracing NA\npygments 2.15.1\npynndescent 0.5.11\npyparsing 3.1.1\npyrsistent NA\npythonjsonlogger NA\npytz 2023.3\nrequests 2.31.0\nrequests_cache 0.4.13\nrfc3339_validator 0.1.4\nrfc3986_validator 0.1.1\nscipy 1.11.4\nscrublet NA\nseaborn 0.12.2\nsend2trash NA\nsession_info 1.0.0\nsix 1.16.0\nskimage 0.22.0\nsklearn 1.3.2\nsniffio 1.3.0\nsocks 1.7.1\nsparse 0.14.0\nstack_data 0.6.2\nstatsmodels 0.14.1\nsympy 1.12\ntexttable 1.7.0\nthreadpoolctl 3.2.0\ntorch 2.0.0\ntornado 6.3.2\ntqdm 4.65.0\ntraitlets 5.9.0\ntyping_extensions NA\numap 0.5.5\nurllib3 2.0.2\nwcwidth 0.2.6\nwebsocket 1.5.2\nyaml 6.0\nzmq 25.0.2\nzoneinfo NA\nzstandard 0.19.0\n-----\nIPython 8.13.2\njupyter_client 8.2.0\njupyter_core 5.3.0\njupyterlab 4.0.1\nnotebook 6.5.4\n-----\nPython 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]\nLinux-6.5.11-linuxkit-x86_64-with-glibc2.35\n-----\nSession information updated at 2024-01-23 11:22" }, { "objectID": "labs/scanpy/scanpy_02_dimred.html", @@ -970,21 +970,21 @@ "href": "labs/scanpy/scanpy_02_dimred.html#meta-dimred_prep", "title": " Dimensionality Reduction", "section": "1 Data preparation", - "text": "1 Data preparation\nFirst, let’s load all necessary libraries and the QC-filtered dataset from the previous step.\n\nimport numpy as np\nimport pandas as pd\nimport scanpy as sc\nimport matplotlib.pyplot as plt\nimport warnings\nimport os\nimport urllib.request\n\nwarnings.simplefilter(action=\"ignore\", category=Warning)\n\n# verbosity: errors (0), warnings (1), info (2), hints (3)\nsc.settings.verbosity = 3\n# sc.logging.print_versions()\n\nsc.settings.set_figure_params(dpi=80)\n\n\npath_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\npath_results = \"data/covid/results\"\nif not 
os.path.exists(path_results):\n os.makedirs(path_results, exist_ok=True)\n\npath_file = \"data/covid/results/scanpy_covid_qc.h5ad\"\nif not os.path.exists(path_file):\n urllib.request.urlretrieve(os.path.join(\n path_data, 'covid/results/scanpy_covid_qc.h5ad'), path_file)\n\nadata = sc.read_h5ad(path_file)\nadata\n\nAnnData object with n_obs × n_vars = 5725 × 18830\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'sample_colors', 'umap'\n obsm: 'X_pca', 'X_umap'\n obsp: 'connectivities', 'distances'\n\n\nBefore variable gene selection we need to normalize and log transform the data. 
Then store the full matrix in the raw slot before doing variable gene selection.\n\n# normalize to depth 10 000\nsc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)\n\n# log transform\nsc.pp.log1p(adata)\n\n# store normalized counts in the raw slot, \n# we will subset adata.X for variable genes, but want to keep all genes matrix as well.\nadata.raw = adata\n\nadata\n\nnormalizing by total count per cell\n finished (0:00:00): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)\nWARNING: adata.X seems to be already log-transformed.\n\n\nAnnData object with n_obs × n_vars = 5725 × 18830\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'sample_colors', 'umap'\n obsm: 'X_pca', 'X_umap'\n obsp: 'connectivities', 'distances'" + "text": "1 Data preparation\nFirst, let’s load all necessary libraries and the QC-filtered dataset from the previous step.\n\nimport numpy as np\nimport pandas as pd\nimport scanpy as sc\nimport matplotlib.pyplot as plt\nimport warnings\nimport os\nimport urllib.request\n\nwarnings.simplefilter(action=\"ignore\", category=Warning)\n\n# verbosity: errors (0), warnings (1), info (2), hints (3)\nsc.settings.verbosity = 3\n# sc.logging.print_versions()\n\nsc.settings.set_figure_params(dpi=80)\n\n\npath_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\npath_results = \"data/covid/results\"\nif not os.path.exists(path_results):\n os.makedirs(path_results, 
exist_ok=True)\n\npath_file = \"data/covid/results/scanpy_covid_qc.h5ad\"\nif not os.path.exists(path_file):\n urllib.request.urlretrieve(os.path.join(\n path_data, 'covid/results/scanpy_covid_qc.h5ad'), path_file)\n\nadata = sc.read_h5ad(path_file)\nadata\n\nAnnData object with n_obs × n_vars = 7222 × 19468\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'umap'\n obsm: 'X_pca', 'X_umap'\n obsp: 'connectivities', 'distances'\n\n\nBefore variable gene selection we need to normalize and log transform the data. 
Then store the full matrix in the raw slot before doing variable gene selection.\n\n# normalize to depth 10 000\nsc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)\n\n# log transform\nsc.pp.log1p(adata)\n\n# store normalized counts in the raw slot, \n# we will subset adata.X for variable genes, but want to keep all genes matrix as well.\nadata.raw = adata\n\nadata\n\nnormalizing by total count per cell\n finished (0:00:00): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs)\nWARNING: adata.X seems to be already log-transformed.\n\n\nAnnData object with n_obs × n_vars = 7222 × 19468\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'umap'\n obsm: 'X_pca', 'X_umap'\n obsp: 'connectivities', 'distances'" }, { "objectID": "labs/scanpy/scanpy_02_dimred.html#meta-dimred_fs", "href": "labs/scanpy/scanpy_02_dimred.html#meta-dimred_fs", "title": " Dimensionality Reduction", "section": "2 Feature selection", - "text": "2 Feature selection\nNext, we first need to define which features/genes are important in our dataset to distinguish cell types. 
For this purpose, we need to find genes that are highly variable across cells, which in turn will also provide a good separation of the cell clusters.\n\n# compute variable genes\nsc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)\nprint(\"Highly variable genes: %d\"%sum(adata.var.highly_variable))\n\n#plot variable genes\nsc.pl.highly_variable_genes(adata)\n\n# subset for variable genes in the dataset\nadata = adata[:, adata.var['highly_variable']]\n\nextracting highly variable genes\n finished (0:00:01)\n--> added\n 'highly_variable', boolean vector (adata.var)\n 'means', float vector (adata.var)\n 'dispersions', float vector (adata.var)\n 'dispersions_norm', float vector (adata.var)\nHighly variable genes: 2727" + "text": "2 Feature selection\nNext, we first need to define which features/genes are important in our dataset to distinguish cell types. For this purpose, we need to find genes that are highly variable across cells, which in turn will also provide a good separation of the cell clusters.\n\n# compute variable genes\nsc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)\nprint(\"Highly variable genes: %d\"%sum(adata.var.highly_variable))\n\n#plot variable genes\nsc.pl.highly_variable_genes(adata)\n\n# subset for variable genes in the dataset\nadata = adata[:, adata.var['highly_variable']]\n\nextracting highly variable genes\n finished (0:00:01)\n--> added\n 'highly_variable', boolean vector (adata.var)\n 'means', float vector (adata.var)\n 'dispersions', float vector (adata.var)\n 'dispersions_norm', float vector (adata.var)\nHighly variable genes: 2626" }, { "objectID": "labs/scanpy/scanpy_02_dimred.html#meta-dimred_zs", "href": "labs/scanpy/scanpy_02_dimred.html#meta-dimred_zs", "title": " Dimensionality Reduction", "section": "3 Z-score transformation", - "text": "3 Z-score transformation\nNow that the data is prepared, we now proceed with PCA. 
Since each gene has a different expression level, it means that genes with higher expression values will naturally have higher variation that will be captured by PCA. This means that we need to somehow give each gene a similar weight when performing PCA (see below). The common practice is to center and scale each gene before performing PCA. This exact scaling is called Z-score normalization it is very useful for PCA, clustering and plotting heatmaps. Additionally, we can use regression to remove any unwanted sources of variation from the dataset, such as cell cycle, sequencing depth, percent mitochondria. This is achieved by doing a generalized linear regression using these parameters as co-variates in the model. Then the residuals of the model are taken as the regressed data. Although perhaps not in the best way, batch effect regression can also be done here. By default variables are scaled in the PCA step and is not done separately. But it could be achieved by running the commands below:\n\n#run this line if you get the \"AttributeError: swapaxes not found\" \n# adata = adata.copy()\n\n# regress out unwanted variables\nsc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])\n\n# scale data, clip values exceeding standard deviation 10.\nsc.pp.scale(adata, max_value=10)\n\nregressing out ['total_counts', 'pct_counts_mt']\n sparse input is densified and may lead to high memory use\n finished (0:00:37)" + "text": "3 Z-score transformation\nNow that the data is prepared, we now proceed with PCA. Since each gene has a different expression level, it means that genes with higher expression values will naturally have higher variation that will be captured by PCA. This means that we need to somehow give each gene a similar weight when performing PCA (see below). The common practice is to center and scale each gene before performing PCA. This exact scaling is called Z-score normalization it is very useful for PCA, clustering and plotting heatmaps. 
Additionally, we can use regression to remove any unwanted sources of variation from the dataset, such as cell cycle, sequencing depth, percent mitochondria. This is achieved by doing a generalized linear regression using these parameters as co-variates in the model. Then the residuals of the model are taken as the regressed data. Although perhaps not in the best way, batch effect regression can also be done here. By default variables are scaled in the PCA step and is not done separately. But it could be achieved by running the commands below:\n\n#run this line if you get the \"AttributeError: swapaxes not found\" \n# adata = adata.copy()\n\n# regress out unwanted variables\nsc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])\n\n# scale data, clip values exceeding standard deviation 10.\nsc.pp.scale(adata, max_value=10)\n\nregressing out ['total_counts', 'pct_counts_mt']\n sparse input is densified and may lead to high memory use\n finished (0:00:50)" }, { "objectID": "labs/scanpy/scanpy_02_dimred.html#meta-dimred_pca", @@ -998,14 +998,14 @@ "href": "labs/scanpy/scanpy_02_dimred.html#meta-dimred_tsne", "title": " Dimensionality Reduction", "section": "5 tSNE", - "text": "5 tSNE\nWe can now run BH-tSNE.\n\nsc.tl.tsne(adata, n_pcs = 30)\n\ncomputing tSNE\n using 'X_pca' with n_pcs = 30\n using sklearn.manifold.TSNE\n finished: added\n 'X_tsne', tSNE coordinates (adata.obsm) (0:00:10)\n\n\nWe can now plot the tSNE colored per dataset. We can clearly see the effect of batches present in the dataset.\n\nsc.pl.tsne(adata, color='sample')" + "text": "5 tSNE\nWe can now run BH-tSNE.\n\nsc.tl.tsne(adata, n_pcs = 30)\n\ncomputing tSNE\n using 'X_pca' with n_pcs = 30\n using sklearn.manifold.TSNE\n finished: added\n 'X_tsne', tSNE coordinates (adata.obsm) (0:00:13)\n\n\nWe can now plot the tSNE colored per dataset. 
We can clearly see the effect of batches present in the dataset.\n\nsc.pl.tsne(adata, color='sample')" }, { "objectID": "labs/scanpy/scanpy_02_dimred.html#meta-dimred_umap", "href": "labs/scanpy/scanpy_02_dimred.html#meta-dimred_umap", "title": " Dimensionality Reduction", "section": "6 UMAP", - "text": "6 UMAP\nThe UMAP implementation in SCANPY uses a neighborhood graph as the distance matrix, so we need to first calculate the graph.\n\nsc.pp.neighbors(adata, n_pcs = 30, n_neighbors = 20)\n\ncomputing neighbors\n using 'X_pca' with n_pcs = 30\n finished: added to `.uns['neighbors']`\n `.obsp['distances']`, distances for each pair of neighbors\n `.obsp['connectivities']`, weighted adjacency matrix (0:00:08)\n\n\nWe can now run UMAP for cell embeddings.\n\nsc.tl.umap(adata)\nsc.pl.umap(adata, color='sample')\n\ncomputing UMAP\n finished: added\n 'X_umap', UMAP coordinates (adata.obsm) (0:00:09)\n\n\n\n\n\n\n\n\n\nUMAP is plotted colored per dataset. Although less distinct as in the tSNE, we still see quite an effect of the different batches in the data.\n\n# run with 10 components, save to a new object so that the umap with 2D is not overwritten.\numap10 = sc.tl.umap(adata, n_components=10, copy=True)\nfig, axs = plt.subplots(1, 3, figsize=(10, 4), constrained_layout=True)\n\nsc.pl.umap(adata, color='sample', title=\"UMAP\",\n show=False, ax=axs[0], legend_loc=None)\nsc.pl.umap(umap10, color='sample', title=\"UMAP10\", show=False,\n ax=axs[1], components=['1,2'], legend_loc=None)\nsc.pl.umap(umap10, color='sample', title=\"UMAP10\",\n show=False, ax=axs[2], components=['3,4'], legend_loc=None)\n\n# we can also plot the umap with neighbor edges\nsc.pl.umap(adata, color='sample', title=\"UMAP\", edges=True)\n\ncomputing UMAP\n finished: added\n 'X_umap', UMAP coordinates (adata.obsm) (0:00:10)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWe can now plot PCA, UMAP and tSNE side by side for comparison. 
Have a look at the UMAP and tSNE, what similarities/differences do you see, can you explain the differences based on what you learned during the lecture? Also, we can conclude from the dimensionality reductions that our dataset contains a batch effect that needs to be corrected before proceeding to clustering and differential gene expression analysis.\n\nfig, axs = plt.subplots(2, 2, figsize=(10, 8), constrained_layout=True)\nsc.pl.pca(adata, color='sample', components=['1,2'], ax=axs[0, 0], show=False)\nsc.pl.tsne(adata, color='sample', components=['1,2'], ax=axs[0, 1], show=False)\nsc.pl.umap(adata, color='sample', components=['1,2'], ax=axs[1, 0], show=False)\n\n<Axes: title={'center': 'sample'}, xlabel='UMAP1', ylabel='UMAP2'>\n\n\n\n\n\n\n\n\n\nFinally, we can compare the PCA, tSNE and UMAP.\n\n\n\n\n\n\nDiscuss\n\n\n\nWe have now done Variable gene selection, PCA and UMAP with the settings we selected for you. Test a few different ways of selecting variable genes, number of PCs for UMAP and check how it influences your embedding." + "text": "6 UMAP\nThe UMAP implementation in SCANPY uses a neighborhood graph as the distance matrix, so we need to first calculate the graph.\n\nsc.pp.neighbors(adata, n_pcs = 30, n_neighbors = 20)\n\ncomputing neighbors\n using 'X_pca' with n_pcs = 30\n finished: added to `.uns['neighbors']`\n `.obsp['distances']`, distances for each pair of neighbors\n `.obsp['connectivities']`, weighted adjacency matrix (0:00:08)\n\n\nWe can now run UMAP for cell embeddings.\n\nsc.tl.umap(adata)\nsc.pl.umap(adata, color='sample')\n\ncomputing UMAP\n finished: added\n 'X_umap', UMAP coordinates (adata.obsm) (0:00:12)\n\n\n\n\n\n\n\n\n\nUMAP is plotted colored per dataset. 
Although less distinct as in the tSNE, we still see quite an effect of the different batches in the data.\n\n# run with 10 components, save to a new object so that the umap with 2D is not overwritten.\numap10 = sc.tl.umap(adata, n_components=10, copy=True)\nfig, axs = plt.subplots(1, 3, figsize=(10, 4), constrained_layout=True)\n\nsc.pl.umap(adata, color='sample', title=\"UMAP\",\n show=False, ax=axs[0], legend_loc=None)\nsc.pl.umap(umap10, color='sample', title=\"UMAP10\", show=False,\n ax=axs[1], components=['1,2'], legend_loc=None)\nsc.pl.umap(umap10, color='sample', title=\"UMAP10\",\n show=False, ax=axs[2], components=['3,4'], legend_loc=None)\n\n# we can also plot the umap with neighbor edges\nsc.pl.umap(adata, color='sample', title=\"UMAP\", edges=True)\n\ncomputing UMAP\n finished: added\n 'X_umap', UMAP coordinates (adata.obsm) (0:00:13)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWe can now plot PCA, UMAP and tSNE side by side for comparison. Have a look at the UMAP and tSNE, what similarities/differences do you see, can you explain the differences based on what you learned during the lecture? Also, we can conclude from the dimensionality reductions that our dataset contains a batch effect that needs to be corrected before proceeding to clustering and differential gene expression analysis.\n\nfig, axs = plt.subplots(2, 2, figsize=(10, 8), constrained_layout=True)\nsc.pl.pca(adata, color='sample', components=['1,2'], ax=axs[0, 0], show=False)\nsc.pl.tsne(adata, color='sample', components=['1,2'], ax=axs[0, 1], show=False)\nsc.pl.umap(adata, color='sample', components=['1,2'], ax=axs[1, 0], show=False)\n\n<Axes: title={'center': 'sample'}, xlabel='UMAP1', ylabel='UMAP2'>\n\n\n\n\n\n\n\n\n\nFinally, we can compare the PCA, tSNE and UMAP.\n\n\n\n\n\n\nDiscuss\n\n\n\nWe have now done Variable gene selection, PCA and UMAP with the settings we selected for you. 
Test a few different ways of selecting variable genes, number of PCs for UMAP and check how it influences your embedding." }, { "objectID": "labs/scanpy/scanpy_02_dimred.html#meta-dimred_plotgenes", @@ -1019,14 +1019,14 @@ "href": "labs/scanpy/scanpy_02_dimred.html#meta-dimred_save", "title": " Dimensionality Reduction", "section": "8 Save data", - "text": "8 Save data\nWe can finally save the object for use in future steps.\n\nadata.write_h5ad('data/covid/results/scanpy_covid_qc_dr.h5ad')\n\n\nprint(adata.X.shape)\nprint(adata.raw.X.shape)\n\n(5725, 2727)\n(5725, 18830)" + "text": "8 Save data\nWe can finally save the object for use in future steps.\n\nadata.write_h5ad('data/covid/results/scanpy_covid_qc_dr.h5ad')\n\nJust a reminder, you need to keep in mind what you have in the X matrix. After these operations you have an X matrix with only variable genes, that are normalized, logtransformed and scaled.\nWe stored the expression of all genes in raw.X after doing lognormalization so that matrix is a sparse matrix with logtransformed values.\n\nprint(adata.X.shape)\nprint(adata.raw.X.shape)\n\nprint(adata.X[:3,:3])\nprint(adata.raw.X[:10,:10])\n\n(7222, 2626)\n(7222, 19468)\n[[-0.16998859 -0.06050171 -0.08070081]\n [-0.19315341 -0.09975121 -0.31379319]\n [-0.2051203 -0.11680799 -0.43194618]]\n (1, 4) 0.7825693876867097\n (8, 7) 1.1311041336746985" }, { "objectID": "labs/scanpy/scanpy_02_dimred.html#meta-session", "href": "labs/scanpy/scanpy_02_dimred.html#meta-session", "title": " Dimensionality Reduction", "section": "9 Session info", - "text": "9 Session info\n\n\nClick here\n\n\nsc.logging.print_versions()\n\n-----\nanndata 0.10.3\nscanpy 1.9.6\n-----\nPIL 10.0.0\nanyio NA\nasttokens NA\nattr 23.1.0\nbabel 2.12.1\nbackcall 0.2.0\ncertifi 2023.11.17\ncffi 1.15.1\ncharset_normalizer 3.1.0\ncolorama 0.4.6\ncomm 0.1.3\ncycler 0.12.1\ncython_runtime NA\ndateutil 2.8.2\ndebugpy 1.6.7\ndecorator 5.1.1\ndefusedxml 0.7.1\nexceptiongroup 1.2.0\nexecuting 
1.2.0\nfastjsonschema NA\ngmpy2 2.1.2\nh5py 3.9.0\nidna 3.4\nigraph 0.10.8\nipykernel 6.23.1\nipython_genutils 0.2.0\njedi 0.18.2\njinja2 3.1.2\njoblib 1.3.2\njson5 NA\njsonpointer 2.0\njsonschema 4.17.3\njupyter_events 0.6.3\njupyter_server 2.6.0\njupyterlab_server 2.22.1\nkiwisolver 1.4.5\nleidenalg 0.10.1\nllvmlite 0.41.1\nlouvain 0.8.1\nmarkupsafe 2.1.2\nmatplotlib 3.8.0\nmatplotlib_inline 0.1.6\nmpl_toolkits NA\nmpmath 1.3.0\nnatsort 8.4.0\nnbformat 5.8.0\nnetworkx 3.2.1\nnumba 0.58.1\nnumpy 1.26.2\nopt_einsum v3.3.0\noverrides NA\npackaging 23.1\npandas 2.1.4\nparso 0.8.3\npatsy 0.5.5\npexpect 4.8.0\npickleshare 0.7.5\npkg_resources NA\nplatformdirs 3.5.1\nprometheus_client NA\nprompt_toolkit 3.0.38\npsutil 5.9.5\nptyprocess 0.7.0\npure_eval 0.2.2\npvectorc NA\npycparser 2.21\npydev_ipython NA\npydevconsole NA\npydevd 2.9.5\npydevd_file_utils NA\npydevd_plugins NA\npydevd_tracing NA\npygments 2.15.1\npynndescent 0.5.11\npyparsing 3.1.1\npyrsistent NA\npythonjsonlogger NA\npytz 2023.3\nrequests 2.31.0\nrfc3339_validator 0.1.4\nrfc3986_validator 0.1.1\nscipy 1.11.4\nsend2trash NA\nsession_info 1.0.0\nsix 1.16.0\nsklearn 1.3.2\nsniffio 1.3.0\nsocks 1.7.1\nsparse 0.14.0\nstack_data 0.6.2\nstatsmodels 0.14.1\nsympy 1.12\ntexttable 1.7.0\nthreadpoolctl 3.2.0\ntorch 2.0.0\ntornado 6.3.2\ntqdm 4.65.0\ntraitlets 5.9.0\ntyping_extensions NA\numap 0.5.5\nurllib3 2.0.2\nwcwidth 0.2.6\nwebsocket 1.5.2\nyaml 6.0\nzmq 25.0.2\nzoneinfo NA\nzstandard 0.19.0\n-----\nIPython 8.13.2\njupyter_client 8.2.0\njupyter_core 5.3.0\njupyterlab 4.0.1\nnotebook 6.5.4\n-----\nPython 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]\nLinux-6.5.11-linuxkit-x86_64-with-glibc2.35\n-----\nSession information updated at 2024-01-16 23:19" + "text": "9 Session info\n\n\nClick here\n\n\nsc.logging.print_versions()\n\n-----\nanndata 0.10.3\nscanpy 1.9.6\n-----\nPIL 10.0.0\nanyio NA\nasttokens NA\nattr 23.1.0\nbabel 2.12.1\nbackcall 0.2.0\ncertifi 2023.11.17\ncffi 
1.15.1\ncharset_normalizer 3.1.0\ncolorama 0.4.6\ncomm 0.1.3\ncycler 0.12.1\ncython_runtime NA\ndateutil 2.8.2\ndebugpy 1.6.7\ndecorator 5.1.1\ndefusedxml 0.7.1\nexceptiongroup 1.2.0\nexecuting 1.2.0\nfastjsonschema NA\ngmpy2 2.1.2\nh5py 3.9.0\nidna 3.4\nigraph 0.10.8\nipykernel 6.23.1\nipython_genutils 0.2.0\njedi 0.18.2\njinja2 3.1.2\njoblib 1.3.2\njson5 NA\njsonpointer 2.0\njsonschema 4.17.3\njupyter_events 0.6.3\njupyter_server 2.6.0\njupyterlab_server 2.22.1\nkiwisolver 1.4.5\nleidenalg 0.10.1\nllvmlite 0.41.1\nlouvain 0.8.1\nmarkupsafe 2.1.2\nmatplotlib 3.8.0\nmatplotlib_inline 0.1.6\nmpl_toolkits NA\nmpmath 1.3.0\nnatsort 8.4.0\nnbformat 5.8.0\nnetworkx 3.2.1\nnumba 0.58.1\nnumpy 1.26.2\nopt_einsum v3.3.0\noverrides NA\npackaging 23.1\npandas 2.1.4\nparso 0.8.3\npatsy 0.5.5\npexpect 4.8.0\npickleshare 0.7.5\npkg_resources NA\nplatformdirs 3.5.1\nprometheus_client NA\nprompt_toolkit 3.0.38\npsutil 5.9.5\nptyprocess 0.7.0\npure_eval 0.2.2\npvectorc NA\npycparser 2.21\npydev_ipython NA\npydevconsole NA\npydevd 2.9.5\npydevd_file_utils NA\npydevd_plugins NA\npydevd_tracing NA\npygments 2.15.1\npynndescent 0.5.11\npyparsing 3.1.1\npyrsistent NA\npythonjsonlogger NA\npytz 2023.3\nrequests 2.31.0\nrfc3339_validator 0.1.4\nrfc3986_validator 0.1.1\nscipy 1.11.4\nsend2trash NA\nsession_info 1.0.0\nsix 1.16.0\nsklearn 1.3.2\nsniffio 1.3.0\nsocks 1.7.1\nsparse 0.14.0\nstack_data 0.6.2\nstatsmodels 0.14.1\nsympy 1.12\ntexttable 1.7.0\nthreadpoolctl 3.2.0\ntorch 2.0.0\ntornado 6.3.2\ntqdm 4.65.0\ntraitlets 5.9.0\ntyping_extensions NA\numap 0.5.5\nurllib3 2.0.2\nwcwidth 0.2.6\nwebsocket 1.5.2\nyaml 6.0\nzmq 25.0.2\nzoneinfo NA\nzstandard 0.19.0\n-----\nIPython 8.13.2\njupyter_client 8.2.0\njupyter_core 5.3.0\njupyterlab 4.0.1\nnotebook 6.5.4\n-----\nPython 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]\nLinux-6.5.11-linuxkit-x86_64-with-glibc2.35\n-----\nSession information updated at 2024-01-23 11:25" }, { "objectID": 
"labs/scanpy/scanpy_03_integration.html", @@ -1040,35 +1040,35 @@ "href": "labs/scanpy/scanpy_03_integration.html#meta-int_prep", "title": " Data Integration", "section": "1 Data preparation", - "text": "1 Data preparation\nLet’s first load necessary libraries and the data saved in the previous lab.\n\nimport numpy as np\nimport pandas as pd\nimport scanpy as sc\nimport matplotlib.pyplot as plt\nimport warnings\nimport os\nimport urllib.request\n\nwarnings.simplefilter(action='ignore', category=Warning)\n\n# verbosity: errors (0), warnings (1), info (2), hints (3)\nsc.settings.verbosity = 3 \n\nsc.settings.set_figure_params(dpi=80)\n%matplotlib inline\n\nCreate individual adata objects per batch.\n\npath_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\npath_results = \"data/covid/results\"\nif not os.path.exists(path_results):\n os.makedirs(path_results, exist_ok=True)\n\npath_file = \"data/covid/results/scanpy_covid_qc_dr.h5ad\"\nif not os.path.exists(path_file):\n urllib.request.urlretrieve(os.path.join(\n path_data, 'covid/results/scanpy_covid_qc_dr.h5ad'), path_file)\n\nadata = sc.read_h5ad(path_file)\nadata\n\nAnnData object with n_obs × n_vars = 5725 × 2727\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'sample_colors', 'tsne', 'umap'\n obsm: 'X_pca', 'X_tsne', 'X_umap'\n varm: 'PCs'\n obsp: 'connectivities', 
'distances'\n\n\n\nprint(adata.X.shape)\n\n(5725, 2727)\n\n\nAs the stored AnnData object contains scaled data based on variable genes, we need to make a new object with the logtransformed normalized counts. The new variable gene selection should not be performed on the scaled data matrix.\n\nadata2 = adata.raw.to_adata() \n\nadata2.uns['log1p']['base']=None\n\n# check that the matrix looks like normalized counts\nprint(adata2.X[1:10,1:10])\n\n (0, 2) 0.7825693876867097\n (7, 5) 1.1311041336746985" + "text": "1 Data preparation\nLet’s first load necessary libraries and the data saved in the previous lab.\n\nimport numpy as np\nimport pandas as pd\nimport scanpy as sc\nimport matplotlib.pyplot as plt\nimport warnings\nimport os\nimport urllib.request\n\nwarnings.simplefilter(action='ignore', category=Warning)\n\n# verbosity: errors (0), warnings (1), info (2), hints (3)\nsc.settings.verbosity = 3 \n\nsc.settings.set_figure_params(dpi=80)\n%matplotlib inline\n\nCreate individual adata objects per batch.\n\npath_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\npath_results = \"data/covid/results\"\nif not os.path.exists(path_results):\n os.makedirs(path_results, exist_ok=True)\n\npath_file = \"data/covid/results/scanpy_covid_qc_dr.h5ad\"\nif not os.path.exists(path_file):\n urllib.request.urlretrieve(os.path.join(\n path_data, 'covid/results/scanpy_covid_qc_dr.h5ad'), path_file)\n\nadata = sc.read_h5ad(path_file)\nadata\n\nAnnData object with n_obs × n_vars = 7222 × 2626\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 
'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap'\n obsm: 'X_pca', 'X_tsne', 'X_umap'\n varm: 'PCs'\n obsp: 'connectivities', 'distances'\n\n\n\nprint(adata.X.shape)\n\n(7222, 2626)\n\n\nAs the stored AnnData object contains scaled data based on variable genes, we need to make a new object with the logtransformed normalized counts. The new variable gene selection should not be performed on the scaled data matrix.\n\nadata2 = adata.raw.to_adata() \n\n# in some versions of Anndata there is an issue with information on the logtransformation in the slot log1p.base so we set it to None to not get errors.\nadata2.uns['log1p']['base']=None\n\n# check that the matrix looks like normalized counts\nprint(adata2.X[1:10,1:10])\n\n (0, 3) 0.7825693876867097\n (7, 6) 1.1311041336746985" }, { "objectID": "labs/scanpy/scanpy_03_integration.html#detect-variable-genes", "href": "labs/scanpy/scanpy_03_integration.html#detect-variable-genes", "title": " Data Integration", "section": "2 Detect variable genes", - "text": "2 Detect variable genes\nVariable genes can be detected across the full dataset, but then we run the risk of getting many batch-specific genes that will drive a lot of the variation. Or we can select variable genes from each batch separately to get only celltype variation. 
In the dimensionality reduction exercise, we already selected variable genes, so they are already stored in adata.var.highly_variable.\n\nvar_genes_all = adata.var.highly_variable\n\nprint(\"Highly variable genes: %d\"%sum(var_genes_all))\n\nHighly variable genes: 2727\n\n\nDetect variable genes in each dataset separately using the batch_key parameter.\n\nsc.pp.highly_variable_genes(adata2, min_mean=0.0125, max_mean=3, min_disp=0.5, batch_key = 'sample')\n\nprint(\"Highly variable genes intersection: %d\"%sum(adata2.var.highly_variable_intersection))\n\nprint(\"Number of batches where gene is variable:\")\nprint(adata2.var.highly_variable_nbatches.value_counts())\n\nvar_genes_batch = adata2.var.highly_variable_nbatches > 0\n\nextracting highly variable genes\n finished (0:00:02)\n--> added\n 'highly_variable', boolean vector (adata.var)\n 'means', float vector (adata.var)\n 'dispersions', float vector (adata.var)\n 'dispersions_norm', float vector (adata.var)\nHighly variable genes intersection: 196\nNumber of batches where gene is variable:\nhighly_variable_nbatches\n0 8436\n1 4729\n2 3037\n3 1504\n4 627\n5 301\n6 196\nName: count, dtype: int64\n\n\nCompare overlap of variable genes with batches or with all data.\n\nprint(\"Any batch var genes: %d\"%sum(var_genes_batch))\nprint(\"All data var genes: %d\"%sum(var_genes_all))\nprint(\"Overlap: %d\"%sum(var_genes_batch & var_genes_all))\nprint(\"Variable genes in all batches: %d\"%sum(adata2.var.highly_variable_nbatches == 6))\nprint(\"Overlap batch instersection and all: %d\"%sum(var_genes_all & adata2.var.highly_variable_intersection))\n\nAny batch var genes: 10394\nAll data var genes: 2727\nOverlap: 2724\nVariable genes in all batches: 196\nOverlap batch instersection and all: 193\n\n\nSelect all genes that are variable in at least 2 datasets and use for remaining analysis.\n\nvar_select = adata2.var.highly_variable_nbatches > 2\nvar_genes = var_select.index[var_select]\nlen(var_genes)\n\n2628" + "text": "2 Detect 
variable genes\nVariable genes can be detected across the full dataset, but then we run the risk of getting many batch-specific genes that will drive a lot of the variation. Or we can select variable genes from each batch separately to get only celltype variation. In the dimensionality reduction exercise, we already selected variable genes, so they are already stored in adata.var.highly_variable.\n\nvar_genes_all = adata.var.highly_variable\n\nprint(\"Highly variable genes: %d\"%sum(var_genes_all))\n\nHighly variable genes: 2626\n\n\nDetect variable genes in each dataset separately using the batch_key parameter.\n\nsc.pp.highly_variable_genes(adata2, min_mean=0.0125, max_mean=3, min_disp=0.5, batch_key = 'sample')\n\nprint(\"Highly variable genes intersection: %d\"%sum(adata2.var.highly_variable_intersection))\n\nprint(\"Number of batches where gene is variable:\")\nprint(adata2.var.highly_variable_nbatches.value_counts())\n\nvar_genes_batch = adata2.var.highly_variable_nbatches > 0\n\nextracting highly variable genes\n finished (0:00:02)\n--> added\n 'highly_variable', boolean vector (adata.var)\n 'means', float vector (adata.var)\n 'dispersions', float vector (adata.var)\n 'dispersions_norm', float vector (adata.var)\nHighly variable genes intersection: 122\nNumber of batches where gene is variable:\nhighly_variable_nbatches\n0 7876\n1 4163\n2 3161\n3 2025\n4 1115\n5 559\n6 277\n7 170\n8 122\nName: count, dtype: int64\n\n\nCompare overlap of variable genes with batches or with all data.\n\nprint(\"Any batch var genes: %d\"%sum(var_genes_batch))\nprint(\"All data var genes: %d\"%sum(var_genes_all))\nprint(\"Overlap: %d\"%sum(var_genes_batch & var_genes_all))\nprint(\"Variable genes in all batches: %d\"%sum(adata2.var.highly_variable_nbatches == 6))\nprint(\"Overlap batch instersection and all: %d\"%sum(var_genes_all & adata2.var.highly_variable_intersection))\n\nAny batch var genes: 11592\nAll data var genes: 2626\nOverlap: 2625\nVariable genes in all batches: 
277\nOverlap batch instersection and all: 122\n\n\nSelect all genes that are variable in at least 2 datasets and use for remaining analysis.\n\nvar_select = adata2.var.highly_variable_nbatches > 2\nvar_genes = var_select.index[var_select]\nlen(var_genes)\n\n4268" }, { "objectID": "labs/scanpy/scanpy_03_integration.html#bbknn", "href": "labs/scanpy/scanpy_03_integration.html#bbknn", "title": " Data Integration", "section": "3 BBKNN", - "text": "3 BBKNN\nFirst, we will run BBKNN that is implemented in scanpy.\n\nimport bbknn\nbbknn.bbknn(adata2,batch_key='sample')\n\n# then run umap on the integrated space\nsc.tl.umap(adata2)\nsc.tl.tsne(adata2)\n\ncomputing batch balanced neighbors\n finished: added to `.uns['neighbors']`\n `.obsp['distances']`, distances for each pair of neighbors\n `.obsp['connectivities']`, weighted adjacency matrix (0:00:02)\ncomputing UMAP\n finished: added\n 'X_umap', UMAP coordinates (adata.obsm) (0:00:09)\ncomputing tSNE\n using 'X_pca' with n_pcs = 50\n using sklearn.manifold.TSNE\n finished: added\n 'X_tsne', tSNE coordinates (adata.obsm) (0:00:10)\n\n\nWe can now plot the unintegrated and the integrated space reduced dimensions.\n\nfig, axs = plt.subplots(2, 2, figsize=(10,8),constrained_layout=True)\nsc.pl.tsne(adata2, color=\"sample\", title=\"BBKNN Corrected tsne\", ax=axs[0,0], show=False)\nsc.pl.tsne(adata, color=\"sample\", title=\"Uncorrected tsne\", ax=axs[0,1], show=False)\nsc.pl.umap(adata2, color=\"sample\", title=\"BBKNN Corrected umap\", ax=axs[1,0], show=False)\nsc.pl.umap(adata, color=\"sample\", title=\"Uncorrected umap\", ax=axs[1,1], show=False)\n\n<Axes: title={'center': 'Uncorrected umap'}, xlabel='UMAP1', ylabel='UMAP2'>\n\n\n\n\n\n\n\n\n\nLet’s save the integrated data for further analysis.\n\nsave_file = './data/covid/results/scanpy_covid_qc_dr_bbknn.h5ad'\nadata2.write_h5ad(save_file)" + "text": "3 BBKNN\nFirst, we will run BBKNN that is implemented in scanpy.\n\nimport 
bbknn\nbbknn.bbknn(adata2,batch_key='sample')\n\n# then run umap on the integrated space\nsc.tl.umap(adata2)\nsc.tl.tsne(adata2)\n\ncomputing batch balanced neighbors\n finished: added to `.uns['neighbors']`\n `.obsp['distances']`, distances for each pair of neighbors\n `.obsp['connectivities']`, weighted adjacency matrix (0:00:02)\ncomputing UMAP\n finished: added\n 'X_umap', UMAP coordinates (adata.obsm) (0:00:12)\ncomputing tSNE\n using 'X_pca' with n_pcs = 50\n using sklearn.manifold.TSNE\n finished: added\n 'X_tsne', tSNE coordinates (adata.obsm) (0:00:12)\n\n\nWe can now plot the unintegrated and the integrated space reduced dimensions.\n\nfig, axs = plt.subplots(2, 2, figsize=(10,8),constrained_layout=True)\nsc.pl.tsne(adata2, color=\"sample\", title=\"BBKNN Corrected tsne\", ax=axs[0,0], show=False)\nsc.pl.tsne(adata, color=\"sample\", title=\"Uncorrected tsne\", ax=axs[0,1], show=False)\nsc.pl.umap(adata2, color=\"sample\", title=\"BBKNN Corrected umap\", ax=axs[1,0], show=False)\nsc.pl.umap(adata, color=\"sample\", title=\"Uncorrected umap\", ax=axs[1,1], show=False)\n\n<Axes: title={'center': 'Uncorrected umap'}, xlabel='UMAP1', ylabel='UMAP2'>\n\n\n\n\n\n\n\n\n\nLet’s save the integrated data for further analysis.\n\nsave_file = './data/covid/results/scanpy_covid_qc_dr_bbknn.h5ad'\nadata2.write_h5ad(save_file)" }, { "objectID": "labs/scanpy/scanpy_03_integration.html#combat", "href": "labs/scanpy/scanpy_03_integration.html#combat", "title": " Data Integration", "section": "4 Combat", - "text": "4 Combat\nBatch correction can also be performed with combat. 
Note that ComBat batch correction requires a dense matrix format as input (which is already the case in this example).\n\n# create a new object with lognormalized counts\nadata_combat = sc.AnnData(X=adata.raw.X, var=adata.raw.var, obs = adata.obs)\n\n# first store the raw data \nadata_combat.raw = adata_combat\n\n# run combat\nsc.pp.combat(adata_combat, key='sample')\n\nStandardizing Data across genes.\n\nFound 6 batches\n\nFound 0 numerical variables:\n \n\nFound 37 genes with zero variance.\nFitting L/S model and finding priors\n\nFinding parametric adjustments\n\nAdjusting data\n\n\n\nThen we run the regular steps of dimensionality reduction on the combat corrected data. Variable gene selection, pca and umap with combat data.\n\nsc.pp.highly_variable_genes(adata_combat)\nprint(\"Highly variable genes: %d\"%sum(adata_combat.var.highly_variable))\nsc.pl.highly_variable_genes(adata_combat)\n\nsc.pp.pca(adata_combat, n_comps=30, use_highly_variable=True, svd_solver='arpack')\n\nsc.pp.neighbors(adata_combat)\n\nsc.tl.umap(adata_combat)\nsc.tl.tsne(adata_combat)\n\nextracting highly variable genes\n finished (0:00:01)\n--> added\n 'highly_variable', boolean vector (adata.var)\n 'means', float vector (adata.var)\n 'dispersions', float vector (adata.var)\n 'dispersions_norm', float vector (adata.var)\nHighly variable genes: 3533\ncomputing PCA\n on highly variable genes\n with n_comps=30\n finished (0:00:01)\ncomputing neighbors\n using 'X_pca' with n_pcs = 30\n finished: added to `.uns['neighbors']`\n `.obsp['distances']`, distances for each pair of neighbors\n `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)\ncomputing UMAP\n finished: added\n 'X_umap', UMAP coordinates (adata.obsm) (0:00:08)\ncomputing tSNE\n using 'X_pca' with n_pcs = 30\n using sklearn.manifold.TSNE\n finished: added\n 'X_tsne', tSNE coordinates (adata.obsm) (0:00:09)\n\n\n\n\n\n\n\n\n\n\n# compare var_genes\nvar_genes_combat = adata_combat.var.highly_variable\nprint(\"With all data 
%d\"%sum(var_genes_all))\nprint(\"With combat %d\"%sum(var_genes_combat))\nprint(\"Overlap %d\"%sum(var_genes_all & var_genes_combat))\n\nprint(\"With 2 batches %d\"%sum(var_select))\nprint(\"Overlap %d\"%sum(var_genes_combat & var_select))\n\nWith all data 2727\nWith combat 3533\nOverlap 2003\nWith 2 batches 2628\nOverlap 1896\n\n\nWe can now plot the unintegrated and the integrated space reduced dimensions.\n\nfig, axs = plt.subplots(2, 2, figsize=(10,8),constrained_layout=True)\nsc.pl.tsne(adata2, color=\"sample\", title=\"BBKNN tsne\", ax=axs[0,0], show=False)\nsc.pl.tsne(adata_combat, color=\"sample\", title=\"Combat tsne\", ax=axs[0,1], show=False)\nsc.pl.umap(adata2, color=\"sample\", title=\"BBKNN umap\", ax=axs[1,0], show=False)\nsc.pl.umap(adata_combat, color=\"sample\", title=\"Combat umap\", ax=axs[1,1], show=False)\n\n<Axes: title={'center': 'Combat umap'}, xlabel='UMAP1', ylabel='UMAP2'>\n\n\n\n\n\n\n\n\n\nLet’s save the integrated data for further analysis.\n\n#save to file\nsave_file = './data/covid/results/scanpy_covid_qc_dr_combat.h5ad'\nadata_combat.write_h5ad(save_file)" + "text": "4 Combat\nBatch correction can also be performed with combat. Note that ComBat batch correction requires a dense matrix format as input (which is already the case in this example).\n\n# create a new object with lognormalized counts\nadata_combat = sc.AnnData(X=adata.raw.X, var=adata.raw.var, obs = adata.obs)\n\n# first store the raw data \nadata_combat.raw = adata_combat\n\n# run combat\nsc.pp.combat(adata_combat, key='sample')\n\nStandardizing Data across genes.\n\nFound 8 batches\n\nFound 0 numerical variables:\n \n\nFound 39 genes with zero variance.\nFitting L/S model and finding priors\n\nFinding parametric adjustments\n\nAdjusting data\n\n\n\nThen we run the regular steps of dimensionality reduction on the combat corrected data. 
Variable gene selection, pca and umap with combat data.\n\nsc.pp.highly_variable_genes(adata_combat)\nprint(\"Highly variable genes: %d\"%sum(adata_combat.var.highly_variable))\nsc.pl.highly_variable_genes(adata_combat)\n\nsc.pp.pca(adata_combat, n_comps=30, use_highly_variable=True, svd_solver='arpack')\n\nsc.pp.neighbors(adata_combat)\n\nsc.tl.umap(adata_combat)\nsc.tl.tsne(adata_combat)\n\nextracting highly variable genes\n finished (0:00:01)\n--> added\n 'highly_variable', boolean vector (adata.var)\n 'means', float vector (adata.var)\n 'dispersions', float vector (adata.var)\n 'dispersions_norm', float vector (adata.var)\nHighly variable genes: 3923\ncomputing PCA\n on highly variable genes\n with n_comps=30\n finished (0:00:01)\ncomputing neighbors\n using 'X_pca' with n_pcs = 30\n finished: added to `.uns['neighbors']`\n `.obsp['distances']`, distances for each pair of neighbors\n `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)\ncomputing UMAP\n finished: added\n 'X_umap', UMAP coordinates (adata.obsm) (0:00:10)\ncomputing tSNE\n using 'X_pca' with n_pcs = 30\n using sklearn.manifold.TSNE\n finished: added\n 'X_tsne', tSNE coordinates (adata.obsm) (0:00:13)\n\n\n\n\n\n\n\n\n\n\n# compare var_genes\nvar_genes_combat = adata_combat.var.highly_variable\nprint(\"With all data %d\"%sum(var_genes_all))\nprint(\"With combat %d\"%sum(var_genes_combat))\nprint(\"Overlap %d\"%sum(var_genes_all & var_genes_combat))\n\nprint(\"With 2 batches %d\"%sum(var_select))\nprint(\"Overlap %d\"%sum(var_genes_combat & var_select))\n\nWith all data 2626\nWith combat 3923\nOverlap 1984\nWith 2 batches 4268\nOverlap 2729\n\n\nWe can now plot the unintegrated and the integrated space reduced dimensions.\n\nfig, axs = plt.subplots(2, 2, figsize=(10,8),constrained_layout=True)\nsc.pl.tsne(adata2, color=\"sample\", title=\"BBKNN tsne\", ax=axs[0,0], show=False)\nsc.pl.tsne(adata_combat, color=\"sample\", title=\"Combat tsne\", ax=axs[0,1], show=False)\nsc.pl.umap(adata2, 
color=\"sample\", title=\"BBKNN umap\", ax=axs[1,0], show=False)\nsc.pl.umap(adata_combat, color=\"sample\", title=\"Combat umap\", ax=axs[1,1], show=False)\n\n<Axes: title={'center': 'Combat umap'}, xlabel='UMAP1', ylabel='UMAP2'>\n\n\n\n\n\n\n\n\n\nLet’s save the integrated data for further analysis.\n\n#save to file\nsave_file = './data/covid/results/scanpy_covid_qc_dr_combat.h5ad'\nadata_combat.write_h5ad(save_file)" }, { "objectID": "labs/scanpy/scanpy_03_integration.html#meta-int_scanorama", "href": "labs/scanpy/scanpy_03_integration.html#meta-int_scanorama", "title": " Data Integration", "section": "5 Scanorama", - "text": "5 Scanorama\nTry out Scanorama for data integration as well. First we need to create individual AnnData objects from each of the datasets.\n\n# split per batch into new objects.\nbatches = adata.obs['sample'].cat.categories.tolist()\nalldata = {}\nfor batch in batches:\n alldata[batch] = adata2[adata2.obs['sample'] == batch,]\n\nalldata \n\n{'covid_1': View of AnnData object with n_obs × n_vars = 875 × 18830\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'sample_colors', 'tsne', 'umap'\n obsm: 'X_pca', 'X_tsne', 'X_umap'\n obsp: 'connectivities', 'distances',\n 'covid_15': View of AnnData object with n_obs × n_vars = 591 × 18830\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 
'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'sample_colors', 'tsne', 'umap'\n obsm: 'X_pca', 'X_tsne', 'X_umap'\n obsp: 'connectivities', 'distances',\n 'covid_17': View of AnnData object with n_obs × n_vars = 1083 × 18830\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'sample_colors', 'tsne', 'umap'\n obsm: 'X_pca', 'X_tsne', 'X_umap'\n obsp: 'connectivities', 'distances',\n 'ctrl_5': View of AnnData object with n_obs × n_vars = 1028 × 18830\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 
'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'sample_colors', 'tsne', 'umap'\n obsm: 'X_pca', 'X_tsne', 'X_umap'\n obsp: 'connectivities', 'distances',\n 'ctrl_13': View of AnnData object with n_obs × n_vars = 1117 × 18830\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'sample_colors', 'tsne', 'umap'\n obsm: 'X_pca', 'X_tsne', 'X_umap'\n obsp: 'connectivities', 'distances',\n 'ctrl_14': View of AnnData object with n_obs × n_vars = 1031 × 18830\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'\n uns: 
'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'sample_colors', 'tsne', 'umap'\n obsm: 'X_pca', 'X_tsne', 'X_umap'\n obsp: 'connectivities', 'distances'}\n\n\n\nimport scanorama\n\n#subset the individual dataset to the variable genes we defined at the beginning\nalldata2 = dict()\nfor ds in alldata.keys():\n print(ds)\n alldata2[ds] = alldata[ds][:,var_genes]\n\n#convert to list of AnnData objects\nadatas = list(alldata2.values())\n\n# run scanorama.integrate\nscanorama.integrate_scanpy(adatas, dimred = 50)\n\ncovid_1\ncovid_15\ncovid_17\nctrl_5\nctrl_13\nctrl_14\nFound 2628 genes among all datasets\n[[0. 0.74450085 0.2843952 0.63521401 0.456 0.37485714]\n [0. 0. 0.5177665 0.48346304 0.32656514 0.36886633]\n [0. 0. 0. 0.32976654 0.11080332 0.15327793]\n [0. 0. 0. 0. 0.83754864 0.74319066]\n [0. 0. 0. 0. 0. 0.85675918]\n [0. 0. 0. 0. 0. 0. ]]\nProcessing datasets (4, 5)\nProcessing datasets (3, 4)\nProcessing datasets (0, 1)\nProcessing datasets (3, 5)\nProcessing datasets (0, 3)\nProcessing datasets (1, 2)\nProcessing datasets (1, 3)\nProcessing datasets (0, 4)\nProcessing datasets (0, 5)\nProcessing datasets (1, 5)\nProcessing datasets (2, 3)\nProcessing datasets (1, 4)\nProcessing datasets (0, 2)\nProcessing datasets (2, 5)\nProcessing datasets (2, 4)\n\n\n\n#scanorama adds the corrected matrix to adata.obsm in each of the datasets in adatas.\nadatas[0].obsm['X_scanorama'].shape\n\n(875, 50)\n\n\n\n# Get all the integrated matrices.\nscanorama_int = [ad.obsm['X_scanorama'] for ad in adatas]\n\n# make into one matrix.\nall_s = np.concatenate(scanorama_int)\nprint(all_s.shape)\n\n# add to the AnnData object, create a new object first\nadata_sc = adata.copy()\nadata_sc.obsm[\"Scanorama\"] = all_s\n\n(5725, 50)\n\n\n\n# tsne and umap\nsc.pp.neighbors(adata_sc, n_pcs =30, use_rep = \"Scanorama\")\nsc.tl.umap(adata_sc)\nsc.tl.tsne(adata_sc, n_pcs = 30, use_rep = \"Scanorama\")\n\ncomputing neighbors\n finished: added to `.uns['neighbors']`\n 
`.obsp['distances']`, distances for each pair of neighbors\n `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)\ncomputing UMAP\n finished: added\n 'X_umap', UMAP coordinates (adata.obsm) (0:00:08)\ncomputing tSNE\n using sklearn.manifold.TSNE\n finished: added\n 'X_tsne', tSNE coordinates (adata.obsm) (0:00:09)\n\n\nWe can now plot the unintegrated and the integrated space reduced dimensions.\n\nfig, axs = plt.subplots(2, 2, figsize=(10,8),constrained_layout=True)\nsc.pl.umap(adata2, color=\"sample\", title=\"BBKNN tsne\", ax=axs[0,0], show=False)\nsc.pl.umap(adata, color=\"sample\", title=\"Scanorama tsne\", ax=axs[0,1], show=False)\nsc.pl.umap(adata2, color=\"sample\", title=\"BBKNN umap\", ax=axs[1,0], show=False)\nsc.pl.umap(adata, color=\"sample\", title=\"Scanorama umap\", ax=axs[1,1], show=False)\n\n<Axes: title={'center': 'Scanorama umap'}, xlabel='UMAP1', ylabel='UMAP2'>\n\n\n\n\n\n\n\n\n\nLet’s save the integrated data for further analysis.\n\n#save to file\nsave_file = './data/covid/results/scanpy_covid_qc_dr_scanorama.h5ad'\nadata_sc.write_h5ad(save_file)" + "text": "5 Scanorama\nTry out Scanorama for data integration as well. 
First we need to create individual AnnData objects from each of the datasets.\n\n# split per batch into new objects.\nbatches = adata.obs['sample'].cat.categories.tolist()\nalldata = {}\nfor batch in batches:\n alldata[batch] = adata2[adata2.obs['sample'] == batch,]\n\nalldata \n\n{'covid_1': View of AnnData object with n_obs × n_vars = 876 × 19468\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap'\n obsm: 'X_pca', 'X_tsne', 'X_umap'\n obsp: 'connectivities', 'distances',\n 'covid_15': View of AnnData object with n_obs × n_vars = 591 × 19468\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap'\n obsm: 'X_pca', 'X_tsne', 'X_umap'\n 
obsp: 'connectivities', 'distances',\n 'covid_16': View of AnnData object with n_obs × n_vars = 370 × 19468\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap'\n obsm: 'X_pca', 'X_tsne', 'X_umap'\n obsp: 'connectivities', 'distances',\n 'covid_17': View of AnnData object with n_obs × n_vars = 1084 × 19468\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap'\n obsm: 'X_pca', 'X_tsne', 'X_umap'\n obsp: 'connectivities', 'distances',\n 'ctrl_5': View of AnnData object with n_obs × n_vars = 1017 × 19468\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 
'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap'\n obsm: 'X_pca', 'X_tsne', 'X_umap'\n obsp: 'connectivities', 'distances',\n 'ctrl_13': View of AnnData object with n_obs × n_vars = 1121 × 19468\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap'\n obsm: 'X_pca', 'X_tsne', 'X_umap'\n obsp: 'connectivities', 'distances',\n 'ctrl_14': View of AnnData object with n_obs × n_vars = 1030 × 19468\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 
'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap'\n obsm: 'X_pca', 'X_tsne', 'X_umap'\n obsp: 'connectivities', 'distances',\n 'ctrl_19': View of AnnData object with n_obs × n_vars = 1133 × 19468\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap'\n obsm: 'X_pca', 'X_tsne', 'X_umap'\n obsp: 'connectivities', 'distances'}\n\n\n\nimport scanorama\n\n#subset the individual dataset to the variable genes we defined at the beginning\nalldata2 = dict()\nfor ds in alldata.keys():\n print(ds)\n alldata2[ds] = alldata[ds][:,var_genes]\n\n#convert to list of AnnData objects\nadatas = list(alldata2.values())\n\n# run scanorama.integrate\nscanorama.integrate_scanpy(adatas, dimred = 50)\n\ncovid_1\ncovid_15\ncovid_16\ncovid_17\nctrl_5\nctrl_13\nctrl_14\nctrl_19\nFound 4268 genes among all datasets\n[[0. 0.50761421 0.52972973 0.26845018 0.59488692 0.48401826\n 0.36757991 0.09973522]\n [0. 0. 0.81891892 0.33840948 0.43362832 0.23181049\n 0.29949239 0.17597293]\n [0. 0. 0. 
0.22702703 0.49459459 0.52972973\n 0.42702703 0.3 ]\n [0. 0. 0. 0. 0.27138643 0.09132841\n 0.1300738 0.17387467]\n [0. 0. 0. 0. 0. 0.8446411\n 0.73647984 0.25419241]\n [0. 0. 0. 0. 0. 0.\n 0.82815534 0.44836717]\n [0. 0. 0. 0. 0. 0.\n 0. 0.78022948]\n [0. 0. 0. 0. 0. 0.\n 0. 0. ]]\nProcessing datasets (4, 5)\nProcessing datasets (5, 6)\nProcessing datasets (1, 2)\nProcessing datasets (6, 7)\nProcessing datasets (4, 6)\nProcessing datasets (0, 4)\nProcessing datasets (2, 5)\nProcessing datasets (0, 2)\nProcessing datasets (0, 1)\nProcessing datasets (2, 4)\nProcessing datasets (0, 5)\nProcessing datasets (5, 7)\nProcessing datasets (1, 4)\nProcessing datasets (2, 6)\nProcessing datasets (0, 6)\nProcessing datasets (1, 3)\nProcessing datasets (2, 7)\nProcessing datasets (1, 6)\nProcessing datasets (3, 4)\nProcessing datasets (0, 3)\nProcessing datasets (4, 7)\nProcessing datasets (1, 5)\nProcessing datasets (2, 3)\nProcessing datasets (1, 7)\nProcessing datasets (3, 7)\nProcessing datasets (3, 6)\n\n\n\n#scanorama adds the corrected matrix to adata.obsm in each of the datasets in adatas.\nadatas[0].obsm['X_scanorama'].shape\n\n(876, 50)\n\n\n\n# Get all the integrated matrices.\nscanorama_int = [ad.obsm['X_scanorama'] for ad in adatas]\n\n# make into one matrix.\nall_s = np.concatenate(scanorama_int)\nprint(all_s.shape)\n\n# add to the AnnData object, create a new object first\nadata_sc = adata.copy()\nadata_sc.obsm[\"Scanorama\"] = all_s\n\n(7222, 50)\n\n\n\n# tsne and umap\nsc.pp.neighbors(adata_sc, n_pcs =30, use_rep = \"Scanorama\")\nsc.tl.umap(adata_sc)\nsc.tl.tsne(adata_sc, n_pcs = 30, use_rep = \"Scanorama\")\n\ncomputing neighbors\n finished: added to `.uns['neighbors']`\n `.obsp['distances']`, distances for each pair of neighbors\n `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)\ncomputing UMAP\n finished: added\n 'X_umap', UMAP coordinates (adata.obsm) (0:00:10)\ncomputing tSNE\n using sklearn.manifold.TSNE\n finished: added\n 'X_tsne', tSNE 
coordinates (adata.obsm) (0:00:12)\n\n\nWe can now plot the unintegrated and the integrated space reduced dimensions.\n\nfig, axs = plt.subplots(2, 2, figsize=(10,8),constrained_layout=True)\nsc.pl.tsne(adata2, color=\"sample\", title=\"BBKNN tsne\", ax=axs[0,0], show=False)\nsc.pl.tsne(adata_sc, color=\"sample\", title=\"Scanorama tsne\", ax=axs[0,1], show=False)\nsc.pl.umap(adata2, color=\"sample\", title=\"BBKNN umap\", ax=axs[1,0], show=False)\nsc.pl.umap(adata_sc, color=\"sample\", title=\"Scanorama umap\", ax=axs[1,1], show=False)\n\n<Axes: title={'center': 'Scanorama umap'}, xlabel='UMAP1', ylabel='UMAP2'>\n\n\n\n\n\n\n\n\n\nLet’s save the integrated data for further analysis.\n\n#save to file\nsave_file = './data/covid/results/scanpy_covid_qc_dr_scanorama.h5ad'\nadata_sc.write_h5ad(save_file)" }, { "objectID": "labs/scanpy/scanpy_03_integration.html#compare-all", @@ -1081,22 +1081,22 @@ "objectID": "labs/scanpy/scanpy_03_integration.html#meta-session", "href": "labs/scanpy/scanpy_03_integration.html#meta-session", "title": " Data Integration", - "section": "7 Session info", - "text": "7 Session info\n\n\nClick here\n\n\nsc.logging.print_versions()\n\n-----\nanndata 0.10.3\nscanpy 1.9.6\n-----\nPIL 10.0.0\nannoy NA\nanyio NA\nasttokens NA\nattr 23.1.0\nbabel 2.12.1\nbackcall 0.2.0\nbbknn 1.6.0\ncertifi 2023.11.17\ncffi 1.15.1\ncharset_normalizer 3.1.0\ncolorama 0.4.6\ncomm 0.1.3\ncycler 0.12.1\ncython_runtime NA\ndateutil 2.8.2\ndebugpy 1.6.7\ndecorator 5.1.1\ndefusedxml 0.7.1\nexceptiongroup 1.2.0\nexecuting 1.2.0\nfastjsonschema NA\nfbpca NA\ngmpy2 2.1.2\nh5py 3.9.0\nidna 3.4\nigraph 0.10.8\nintervaltree NA\nipykernel 6.23.1\nipython_genutils 0.2.0\njedi 0.18.2\njinja2 3.1.2\njoblib 1.3.2\njson5 NA\njsonpointer 2.0\njsonschema 4.17.3\njupyter_events 0.6.3\njupyter_server 2.6.0\njupyterlab_server 2.22.1\nkiwisolver 1.4.5\nleidenalg 0.10.1\nllvmlite 0.41.1\nlouvain 0.8.1\nmarkupsafe 2.1.2\nmatplotlib 3.8.0\nmatplotlib_inline 0.1.6\nmpl_toolkits NA\nmpmath 
1.3.0\nnatsort 8.4.0\nnbformat 5.8.0\nnumba 0.58.1\nnumpy 1.26.2\nopt_einsum v3.3.0\noverrides NA\npackaging 23.1\npandas 2.1.4\nparso 0.8.3\npatsy 0.5.5\npexpect 4.8.0\npickleshare 0.7.5\npkg_resources NA\nplatformdirs 3.5.1\nprometheus_client NA\nprompt_toolkit 3.0.38\npsutil 5.9.5\nptyprocess 0.7.0\npure_eval 0.2.2\npvectorc NA\npycparser 2.21\npydev_ipython NA\npydevconsole NA\npydevd 2.9.5\npydevd_file_utils NA\npydevd_plugins NA\npydevd_tracing NA\npygments 2.15.1\npynndescent 0.5.11\npyparsing 3.1.1\npyrsistent NA\npythonjsonlogger NA\npytz 2023.3\nrequests 2.31.0\nrfc3339_validator 0.1.4\nrfc3986_validator 0.1.1\nscanorama 1.7.4\nscipy 1.11.4\nsend2trash NA\nsession_info 1.0.0\nsix 1.16.0\nsklearn 1.3.2\nsniffio 1.3.0\nsocks 1.7.1\nsortedcontainers 2.4.0\nsparse 0.14.0\nstack_data 0.6.2\nsympy 1.12\ntexttable 1.7.0\nthreadpoolctl 3.2.0\ntorch 2.0.0\ntornado 6.3.2\ntqdm 4.65.0\ntraitlets 5.9.0\ntyping_extensions NA\numap 0.5.5\nurllib3 2.0.2\nwcwidth 0.2.6\nwebsocket 1.5.2\nyaml 6.0\nzmq 25.0.2\nzoneinfo NA\nzstandard 0.19.0\n-----\nIPython 8.13.2\njupyter_client 8.2.0\njupyter_core 5.3.0\njupyterlab 4.0.1\nnotebook 6.5.4\n-----\nPython 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]\nLinux-6.5.11-linuxkit-x86_64-with-glibc2.35\n-----\nSession information updated at 2024-01-16 23:21" + "section": "8 Session info", + "text": "8 Session info\n\n\nClick here\n\n\nsc.logging.print_versions()\n\n-----\nanndata 0.10.3\nscanpy 1.9.6\n-----\nPIL 10.0.0\nannoy NA\nanyio NA\nasttokens NA\nattr 23.1.0\nbabel 2.12.1\nbackcall 0.2.0\nbbknn 1.6.0\ncertifi 2023.11.17\ncffi 1.15.1\ncharset_normalizer 3.1.0\ncolorama 0.4.6\ncomm 0.1.3\ncycler 0.12.1\ncython_runtime NA\ndateutil 2.8.2\ndebugpy 1.6.7\ndecorator 5.1.1\ndefusedxml 0.7.1\nexceptiongroup 1.2.0\nexecuting 1.2.0\nfastjsonschema NA\nfbpca NA\ngmpy2 2.1.2\nh5py 3.9.0\nidna 3.4\nigraph 0.10.8\nintervaltree NA\nipykernel 6.23.1\nipython_genutils 0.2.0\njedi 0.18.2\njinja2 3.1.2\njoblib 
1.3.2\njson5 NA\njsonpointer 2.0\njsonschema 4.17.3\njupyter_events 0.6.3\njupyter_server 2.6.0\njupyterlab_server 2.22.1\nkiwisolver 1.4.5\nleidenalg 0.10.1\nllvmlite 0.41.1\nlouvain 0.8.1\nmarkupsafe 2.1.2\nmatplotlib 3.8.0\nmatplotlib_inline 0.1.6\nmpl_toolkits NA\nmpmath 1.3.0\nnatsort 8.4.0\nnbformat 5.8.0\nnumba 0.58.1\nnumpy 1.26.2\nopt_einsum v3.3.0\noverrides NA\npackaging 23.1\npandas 2.1.4\nparso 0.8.3\npatsy 0.5.5\npexpect 4.8.0\npickleshare 0.7.5\npkg_resources NA\nplatformdirs 3.5.1\nprometheus_client NA\nprompt_toolkit 3.0.38\npsutil 5.9.5\nptyprocess 0.7.0\npure_eval 0.2.2\npvectorc NA\npycparser 2.21\npydev_ipython NA\npydevconsole NA\npydevd 2.9.5\npydevd_file_utils NA\npydevd_plugins NA\npydevd_tracing NA\npygments 2.15.1\npynndescent 0.5.11\npyparsing 3.1.1\npyrsistent NA\npythonjsonlogger NA\npytz 2023.3\nrequests 2.31.0\nrfc3339_validator 0.1.4\nrfc3986_validator 0.1.1\nscanorama 1.7.4\nscipy 1.11.4\nsend2trash NA\nsession_info 1.0.0\nsix 1.16.0\nsklearn 1.3.2\nsniffio 1.3.0\nsocks 1.7.1\nsortedcontainers 2.4.0\nsparse 0.14.0\nstack_data 0.6.2\nsympy 1.12\ntexttable 1.7.0\nthreadpoolctl 3.2.0\ntorch 2.0.0\ntornado 6.3.2\ntqdm 4.65.0\ntraitlets 5.9.0\ntyping_extensions NA\numap 0.5.5\nurllib3 2.0.2\nwcwidth 0.2.6\nwebsocket 1.5.2\nyaml 6.0\nzmq 25.0.2\nzoneinfo NA\nzstandard 0.19.0\n-----\nIPython 8.13.2\njupyter_client 8.2.0\njupyter_core 5.3.0\njupyterlab 4.0.1\nnotebook 6.5.4\n-----\nPython 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]\nLinux-6.5.11-linuxkit-x86_64-with-glibc2.35\n-----\nSession information updated at 2024-01-23 11:27" }, { "objectID": "labs/scanpy/scanpy_04_clustering.html", "href": "labs/scanpy/scanpy_04_clustering.html", "title": " Clustering", "section": "", - "text": "Note\n\n\n\nCode chunks run Python commands unless it starts with %%bash, in which case, those chunks run shell commands.\nIn this tutorial we will continue the analysis of the integrated dataset. 
We will use the scanpy enbedding to perform the clustering using graph community detection algorithms.\nLet’s first load all necessary libraries and also the integrated dataset from the previous step.\nimport numpy as np\nimport pandas as pd\nimport scanpy as sc\nimport matplotlib.pyplot as plt\nimport warnings\nimport os\nimport urllib.request\n\nwarnings.simplefilter(action=\"ignore\", category=Warning)\n\n# verbosity: errors (0), warnings (1), info (2), hints (3)\nsc.settings.verbosity = 3\nsc.settings.set_figure_params(dpi=80)\npath_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\npath_results = \"data/covid/results\"\nif not os.path.exists(path_results):\n os.makedirs(path_results, exist_ok=True)\n\npath_file = \"data/covid/results/scanpy_covid_qc_dr_scanorama.h5ad\"\nif not os.path.exists(path_file):\n urllib.request.urlretrieve(os.path.join(\n path_data, 'covid/results/scanpy_covid_qc_dr_scanorama.h5ad'), path_file)\n\nadata = sc.read_h5ad(path_file)\nadata\n\nAnnData object with n_obs × n_vars = 5725 × 2727\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'sample_colors', 'tsne', 'umap'\n obsm: 'Scanorama', 'X_pca', 'X_tsne', 'X_umap'\n varm: 'PCs'\n obsp: 'connectivities', 'distances'" + "text": "Note\n\n\n\nCode chunks run Python commands unless it starts with %%bash, in which case, those chunks run shell commands.\nIn this tutorial we will 
continue the analysis of the integrated dataset. We will use the scanpy embedding to perform the clustering using graph community detection algorithms.\nLet’s first load all necessary libraries and also the integrated dataset from the previous step.\nimport numpy as np\nimport pandas as pd\nimport scanpy as sc\nimport matplotlib.pyplot as plt\nimport warnings\nimport os\nimport urllib.request\n\nwarnings.simplefilter(action=\"ignore\", category=Warning)\n\n# verbosity: errors (0), warnings (1), info (2), hints (3)\nsc.settings.verbosity = 3\nsc.settings.set_figure_params(dpi=80)\npath_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\npath_results = \"data/covid/results\"\nif not os.path.exists(path_results):\n os.makedirs(path_results, exist_ok=True)\n\npath_file = \"data/covid/results/scanpy_covid_qc_dr_scanorama.h5ad\"\nif not os.path.exists(path_file):\n urllib.request.urlretrieve(os.path.join(\n path_data, 'covid/results/scanpy_covid_qc_dr_scanorama.h5ad'), path_file)\n\nadata = sc.read_h5ad(path_file)\nadata\n\nAnnData object with n_obs × n_vars = 7222 × 2626\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'\n uns: 'doublet_info_colors', 'hvg', 'log1p', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap'\n obsm: 'Scanorama', 'X_pca', 'X_tsne', 'X_umap'\n varm: 'PCs'\n obsp: 'connectivities', 'distances'" }, { "objectID": "labs/scanpy/scanpy_04_clustering.html#meta-clust_graphclust", "href": 
"labs/scanpy/scanpy_04_clustering.html#meta-clust_graphclust", "title": " Clustering", "section": "1 Graph clustering", - "text": "1 Graph clustering\nThe procedure of clustering on a Graph can be generalized as 3 main steps: 1) Build a kNN graph from the data. 2) Prune spurious connections from kNN graph (optional step). This is a SNN graph. 3) Find groups of cells that maximizes the connections within the group compared other groups.\nIf you recall from the integration, we already constructed a knn graph before running UMAP. Hence we do not need to do it again, and can run the community detection right away.\nThe modularity optimization algoritm in Scanpy are Leiden and Louvain. Lets test both and see how they compare.\n\n1.1 Leiden\n\nsc.tl.leiden(adata, key_added = \"leiden_1.0\") # default resolution in 1.0\nsc.tl.leiden(adata, resolution = 0.6, key_added = \"leiden_0.6\")\nsc.tl.leiden(adata, resolution = 0.4, key_added = \"leiden_0.4\")\nsc.tl.leiden(adata, resolution = 1.4, key_added = \"leiden_1.4\")\n\nrunning Leiden clustering\n finished: found 16 clusters and added\n 'leiden_1.0', the cluster labels (adata.obs, categorical) (0:00:01)\nrunning Leiden clustering\n finished: found 12 clusters and added\n 'leiden_0.6', the cluster labels (adata.obs, categorical) (0:00:01)\nrunning Leiden clustering\n finished: found 10 clusters and added\n 'leiden_0.4', the cluster labels (adata.obs, categorical) (0:00:01)\nrunning Leiden clustering\n finished: found 18 clusters and added\n 'leiden_1.4', the cluster labels (adata.obs, categorical) (0:00:02)\n\n\nPlot the clusters, as you can see, with increased resolution, we get higher granularity in the clustering.\n\nsc.pl.umap(adata, color=['leiden_0.4', 'leiden_0.6', 'leiden_1.0','leiden_1.4'])\n\n\n\n\n\n\n\n\nOnce we have done clustering, the relationships between clusters can be calculated as correlation in PCA space and we also visualize some of the marker genes that we used in the Dim Reduction lab onto the 
clusters.\n\nsc.tl.dendrogram(adata, groupby = \"leiden_0.6\")\nsc.pl.dendrogram(adata, groupby = \"leiden_0.6\")\n\ngenes = [\"CD3E\", \"CD4\", \"CD8A\", \"GNLY\",\"NKG7\", \"MS4A1\",\"FCGR3A\",\"CD14\",\"LYZ\",\"CST3\",\"MS4A7\",\"FCGR1A\"]\nsc.pl.dotplot(adata, genes, groupby='leiden_0.6', dendrogram=True)\n\n using 'X_pca' with n_pcs = 50\nStoring dendrogram info using `.uns['dendrogram_leiden_0.6']`\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPlot proportion of cells from each condition per cluster.\n\ntmp = pd.crosstab(adata.obs['leiden_0.6'],adata.obs['type'], normalize='index')\ntmp.plot.bar(stacked=True).legend(loc='upper right')\n\n<matplotlib.legend.Legend at 0x7ffff99a2c20>\n\n\n\n\n\n\n\n\n\n\n\n1.2 Louvain\n\nsc.tl.louvain(adata, key_added = \"louvain_1.0\") # default resolution in 1.0\nsc.tl.louvain(adata, resolution = 0.6, key_added = \"louvain_0.6\")\nsc.tl.louvain(adata, resolution = 0.4, key_added = \"louvain_0.4\")\nsc.tl.louvain(adata, resolution = 1.4, key_added = \"louvain_1.4\")\n\nsc.pl.umap(adata, color=['louvain_0.4', 'louvain_0.6', 'louvain_1.0','louvain_1.4'])\n\nsc.tl.dendrogram(adata, groupby = \"louvain_0.6\")\nsc.pl.dendrogram(adata, groupby = \"louvain_0.6\")\n\ngenes = [\"CD3E\", \"CD4\", \"CD8A\", \"GNLY\",\"NKG7\", \"MS4A1\",\"FCGR3A\",\"CD14\",\"LYZ\",\"CST3\",\"MS4A7\",\"FCGR1A\"]\n\nsc.pl.dotplot(adata, genes, groupby='louvain_0.6', dendrogram=True)\n\nrunning Louvain clustering\n using the \"louvain\" package of Traag (2017)\n finished: found 12 clusters and added\n 'louvain_1.0', the cluster labels (adata.obs, categorical) (0:00:00)\nrunning Louvain clustering\n using the \"louvain\" package of Traag (2017)\n finished: found 9 clusters and added\n 'louvain_0.6', the cluster labels (adata.obs, categorical) (0:00:00)\nrunning Louvain clustering\n using the \"louvain\" package of Traag (2017)\n finished: found 7 clusters and added\n 'louvain_0.4', the cluster labels (adata.obs, categorical) (0:00:00)\nrunning Louvain clustering\n using 
the \"louvain\" package of Traag (2017)\n finished: found 17 clusters and added\n 'louvain_1.4', the cluster labels (adata.obs, categorical) (0:00:00)\n using 'X_pca' with n_pcs = 50\nStoring dendrogram info using `.uns['dendrogram_louvain_0.6']`" + "text": "1 Graph clustering\nThe procedure of clustering on a Graph can be generalized as 3 main steps: 1) Build a kNN graph from the data. 2) Prune spurious connections from the kNN graph (optional step). This is an SNN graph. 3) Find groups of cells that maximize the connections within the group compared to other groups.\nIf you recall from the integration, we already constructed a kNN graph before running UMAP. Hence we do not need to do it again, and can run the community detection right away.\nThe modularity optimization algorithms in Scanpy are Leiden and Louvain. Let’s test both and see how they compare.\n\n1.1 Leiden\n\nsc.tl.leiden(adata, key_added = \"leiden_1.0\") # default resolution is 1.0\nsc.tl.leiden(adata, resolution = 0.6, key_added = \"leiden_0.6\")\nsc.tl.leiden(adata, resolution = 0.4, key_added = \"leiden_0.4\")\nsc.tl.leiden(adata, resolution = 1.4, key_added = \"leiden_1.4\")\n\nrunning Leiden clustering\n finished: found 20 clusters and added\n 'leiden_1.0', the cluster labels (adata.obs, categorical) (0:00:02)\nrunning Leiden clustering\n finished: found 16 clusters and added\n 'leiden_0.6', the cluster labels (adata.obs, categorical) (0:00:01)\nrunning Leiden clustering\n finished: found 13 clusters and added\n 'leiden_0.4', the cluster labels (adata.obs, categorical) (0:00:01)\nrunning Leiden clustering\n finished: found 23 clusters and added\n 'leiden_1.4', the cluster labels (adata.obs, categorical) (0:00:02)\n\n\nPlot the clusters, as you can see, with increased resolution, we get higher granularity in the clustering.\n\nsc.pl.umap(adata, color=['leiden_0.4', 'leiden_0.6', 'leiden_1.0','leiden_1.4'])\n\n\n\n\n\n\n\n\nOnce we have done clustering, the relationships between clusters can be calculated 
as correlation in PCA space and we also visualize some of the marker genes that we used in the Dim Reduction lab onto the clusters.\n\nsc.tl.dendrogram(adata, groupby = \"leiden_0.6\")\nsc.pl.dendrogram(adata, groupby = \"leiden_0.6\")\n\ngenes = [\"CD3E\", \"CD4\", \"CD8A\", \"GNLY\",\"NKG7\", \"MS4A1\",\"FCGR3A\",\"CD14\",\"LYZ\",\"CST3\",\"MS4A7\",\"FCGR1A\"]\nsc.pl.dotplot(adata, genes, groupby='leiden_0.6', dendrogram=True)\n\n using 'X_pca' with n_pcs = 50\nStoring dendrogram info using `.uns['dendrogram_leiden_0.6']`\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n1.2 Louvain\n\nsc.tl.louvain(adata, key_added = \"louvain_1.0\") # default resolution in 1.0\nsc.tl.louvain(adata, resolution = 0.6, key_added = \"louvain_0.6\")\nsc.tl.louvain(adata, resolution = 0.4, key_added = \"louvain_0.4\")\nsc.tl.louvain(adata, resolution = 1.4, key_added = \"louvain_1.4\")\n\nsc.pl.umap(adata, color=['louvain_0.4', 'louvain_0.6', 'louvain_1.0','louvain_1.4'])\n\nsc.tl.dendrogram(adata, groupby = \"louvain_0.6\")\nsc.pl.dendrogram(adata, groupby = \"louvain_0.6\")\n\ngenes = [\"CD3E\", \"CD4\", \"CD8A\", \"GNLY\",\"NKG7\", \"MS4A1\",\"FCGR3A\",\"CD14\",\"LYZ\",\"CST3\",\"MS4A7\",\"FCGR1A\"]\n\nsc.pl.dotplot(adata, genes, groupby='louvain_0.6', dendrogram=True)\n\nrunning Louvain clustering\n using the \"louvain\" package of Traag (2017)\n finished: found 15 clusters and added\n 'louvain_1.0', the cluster labels (adata.obs, categorical) (0:00:00)\nrunning Louvain clustering\n using the \"louvain\" package of Traag (2017)\n finished: found 11 clusters and added\n 'louvain_0.6', the cluster labels (adata.obs, categorical) (0:00:00)\nrunning Louvain clustering\n using the \"louvain\" package of Traag (2017)\n finished: found 8 clusters and added\n 'louvain_0.4', the cluster labels (adata.obs, categorical) (0:00:00)\nrunning Louvain clustering\n using the \"louvain\" package of Traag (2017)\n finished: found 20 clusters and added\n 'louvain_1.4', the cluster labels (adata.obs, categorical) 
(0:00:00)\n using 'X_pca' with n_pcs = 50\nStoring dendrogram info using `.uns['dendrogram_louvain_0.6']`" }, { "objectID": "labs/scanpy/scanpy_04_clustering.html#meta-clust_kmean", @@ -1110,28 +1110,28 @@ "href": "labs/scanpy/scanpy_04_clustering.html#meta-clust_hier", "title": " Clustering", "section": "3 Hierarchical clustering", - "text": "3 Hierarchical clustering\nHierarchical clustering is another generic form of clustering that can be applied also to scRNA-seq data. As K-means, it is typically applied to a reduced dimension representation of the data. Hierarchical clustering returns an entire hierarchy of partitionings (a dendrogram) that can be cut at different levels. Hierarchical clustering is done in these steps:\n\nDefine the distances between samples. The most common are Euclidean distance (a.k.a. straight line between two points) or correlation coefficients.\nDefine a measure of distances between clusters, called linkage criteria. It can for example be average distances between clusters. Commonly used methods are single, complete, average, median, centroid and ward.\nDefine the dendrogram among all samples using Bottom-up or Top-down approach. Bottom-up is where samples start with their own cluster which end up merged pair-by-pair until only one cluster is left. Top-down is where samples start all in the same cluster that end up being split by 2 until each sample has its own cluster.\n\nAs you might have realized, correlation is not a method implemented in the dist() function. However, we can create our own distances and transform them to a distance object. We can first compute sample correlations using the cor function.\nAs you already know, correlation range from -1 to 1, where 1 indicates that two samples are closest, -1 indicates that two samples are the furthest and 0 is somewhat in between. 
This, however, creates a problem in defining distances because a distance of 0 indicates that two samples are closest, 1 indicates that two samples are the furthest and a distance of -1 is not meaningful. We thus need to transform the correlations to a positive scale (a.k.a. adjacency):\nadj = (1 - cor) / 2\nOnce we have transformed the correlations to a 0-1 scale, we can simply convert it to a distance object using the as.dist function. The transformation does not need to have a maximum of 1, but it is more intuitive to have it at 1, rather than at any other number.\nThe function AgglomerativeClustering has the option of running with distance metrics “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”. However, with ward linkage only Euclidean distance works. Here we will try out Euclidean distance and ward linkage calculated in PCA space.\n\nfrom sklearn.cluster import AgglomerativeClustering\n\ncluster = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')\nadata.obs['hclust_5'] = cluster.fit_predict(X_pca).astype(str)\n\ncluster = AgglomerativeClustering(n_clusters=10, affinity='euclidean', linkage='ward')\nadata.obs['hclust_10'] = cluster.fit_predict(X_pca).astype(str)\n\ncluster = AgglomerativeClustering(n_clusters=15, affinity='euclidean', linkage='ward')\nadata.obs['hclust_15'] = cluster.fit_predict(X_pca).astype(str)\n\nsc.pl.umap(adata, color=['hclust_5', 'hclust_10', 'hclust_15'])\n\n\n\n\n\n\n\n\nFinally, let's save the clustered data for further analysis.\n\nadata.write_h5ad('./data/covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad')\n\n\n\n\n\n\n\nDiscuss\n\n\n\nBy now you should know how to plot different features onto your data. Take the QC metrics that were calculated in the first exercise, which should be stored in your data object, and plot them as violin plots per cluster using the clustering method of your choice. For example, plot number of UMIs, detected genes, percent mitochondrial reads. 
Then, check carefully if there is any bias in how your data is separated due to quality metrics. Could it be explained biologically, or could you have technical bias there?" + "text": "3 Hierarchical clustering\nHierarchical clustering is another generic form of clustering that can be applied also to scRNA-seq data. As K-means, it is typically applied to a reduced dimension representation of the data. Hierarchical clustering returns an entire hierarchy of partitionings (a dendrogram) that can be cut at different levels. Hierarchical clustering is done in these steps:\n\nDefine the distances between samples. The most common are Euclidean distance (a.k.a. straight line between two points) or correlation coefficients.\nDefine a measure of distances between clusters, called linkage criteria. It can for example be average distances between clusters. Commonly used methods are single, complete, average, median, centroid and ward.\nDefine the dendrogram among all samples using Bottom-up or Top-down approach. Bottom-up is where samples start with their own cluster which end up merged pair-by-pair until only one cluster is left. Top-down is where samples start all in the same cluster that end up being split by 2 until each sample has its own cluster.\n\nAs you might have realized, correlation is not a method implemented in the dist() function. However, we can create our own distances and transform them to a distance object. We can first compute sample correlations using the cor function.\nAs you already know, correlation range from -1 to 1, where 1 indicates that two samples are closest, -1 indicates that two samples are the furthest and 0 is somewhat in between. This, however, creates a problem in defining distances because a distance of 0 indicates that two samples are closest, 1 indicates that two samples are the furthest and distance of -1 is not meaningful. We thus need to transform the correlations to a positive scale (a.k.a. 
adjacency):\nadj = (1 - cor) / 2\nOnce we have transformed the correlations to a 0-1 scale, we can simply convert it to a distance object using the as.dist function. The transformation does not need to have a maximum of 1, but it is more intuitive to have it at 1, rather than at any other number.\nThe function AgglomerativeClustering has the option of running with distance metrics “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”. However, with ward linkage only Euclidean distance works. Here we will try out Euclidean distance and ward linkage calculated in PCA space.\n\nfrom sklearn.cluster import AgglomerativeClustering\n\ncluster = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')\nadata.obs['hclust_5'] = cluster.fit_predict(X_pca).astype(str)\n\ncluster = AgglomerativeClustering(n_clusters=10, affinity='euclidean', linkage='ward')\nadata.obs['hclust_10'] = cluster.fit_predict(X_pca).astype(str)\n\ncluster = AgglomerativeClustering(n_clusters=15, affinity='euclidean', linkage='ward')\nadata.obs['hclust_15'] = cluster.fit_predict(X_pca).astype(str)\n\nsc.pl.umap(adata, color=['hclust_5', 'hclust_10', 'hclust_15'])\n\n\n\n\n\n\n\n\nFinally, let's save the clustered data for further analysis.\n\nadata.write_h5ad('./data/covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad')" }, { "objectID": "labs/scanpy/scanpy_04_clustering.html#meta-session", "href": "labs/scanpy/scanpy_04_clustering.html#meta-session", "title": " Clustering", - "section": "4 Session info", - "text": "4 Session info\n\n\nClick here\n\n\nsc.logging.print_versions()\n\n-----\nanndata 0.10.3\nscanpy 1.9.6\n-----\nPIL 10.0.0\nanyio NA\nasttokens NA\nattr 23.1.0\nbabel 2.12.1\nbackcall 0.2.0\ncertifi 2023.11.17\ncffi 1.15.1\ncharset_normalizer 3.1.0\ncolorama 0.4.6\ncomm 0.1.3\ncycler 0.12.1\ncython_runtime NA\ndateutil 2.8.2\ndebugpy 1.6.7\ndecorator 5.1.1\ndefusedxml 0.7.1\nexceptiongroup 1.2.0\nexecuting 1.2.0\nfastjsonschema NA\ngmpy2 2.1.2\nh5py 3.9.0\nidna 3.4\nigraph 
0.10.8\nipykernel 6.23.1\nipython_genutils 0.2.0\njedi 0.18.2\njinja2 3.1.2\njoblib 1.3.2\njson5 NA\njsonpointer 2.0\njsonschema 4.17.3\njupyter_events 0.6.3\njupyter_server 2.6.0\njupyterlab_server 2.22.1\nkiwisolver 1.4.5\nleidenalg 0.10.1\nllvmlite 0.41.1\nlouvain 0.8.1\nmarkupsafe 2.1.2\nmatplotlib 3.8.0\nmatplotlib_inline 0.1.6\nmpl_toolkits NA\nmpmath 1.3.0\nnatsort 8.4.0\nnbformat 5.8.0\nnumba 0.58.1\nnumpy 1.26.2\nopt_einsum v3.3.0\noverrides NA\npackaging 23.1\npandas 2.1.4\nparso 0.8.3\npexpect 4.8.0\npickleshare 0.7.5\npkg_resources NA\nplatformdirs 3.5.1\nprometheus_client NA\nprompt_toolkit 3.0.38\npsutil 5.9.5\nptyprocess 0.7.0\npure_eval 0.2.2\npvectorc NA\npydev_ipython NA\npydevconsole NA\npydevd 2.9.5\npydevd_file_utils NA\npydevd_plugins NA\npydevd_tracing NA\npygments 2.15.1\npyparsing 3.1.1\npyrsistent NA\npythonjsonlogger NA\npytz 2023.3\nrequests 2.31.0\nrfc3339_validator 0.1.4\nrfc3986_validator 0.1.1\nscipy 1.11.4\nsend2trash NA\nsession_info 1.0.0\nsix 1.16.0\nsklearn 1.3.2\nsniffio 1.3.0\nsocks 1.7.1\nstack_data 0.6.2\nsympy 1.12\ntexttable 1.7.0\nthreadpoolctl 3.2.0\ntorch 2.0.0\ntornado 6.3.2\ntqdm 4.65.0\ntraitlets 5.9.0\ntyping_extensions NA\nurllib3 2.0.2\nwcwidth 0.2.6\nwebsocket 1.5.2\nyaml 6.0\nzmq 25.0.2\nzoneinfo NA\nzstandard 0.19.0\n-----\nIPython 8.13.2\njupyter_client 8.2.0\njupyter_core 5.3.0\njupyterlab 4.0.1\nnotebook 6.5.4\n-----\nPython 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]\nLinux-6.5.11-linuxkit-x86_64-with-glibc2.35\n-----\nSession information updated at 2024-01-16 23:22" + "section": "5 Session info", + "text": "5 Session info\n\n\nClick here\n\n\nsc.logging.print_versions()\n\n-----\nanndata 0.10.3\nscanpy 1.9.6\n-----\nPIL 10.0.0\nanyio NA\nasttokens NA\nattr 23.1.0\nbabel 2.12.1\nbackcall 0.2.0\ncertifi 2023.11.17\ncffi 1.15.1\ncharset_normalizer 3.1.0\ncolorama 0.4.6\ncomm 0.1.3\ncycler 0.12.1\ncython_runtime NA\ndateutil 2.8.2\ndebugpy 1.6.7\ndecorator 5.1.1\ndefusedxml 
0.7.1\nexceptiongroup 1.2.0\nexecuting 1.2.0\nfastjsonschema NA\ngmpy2 2.1.2\nh5py 3.9.0\nidna 3.4\nigraph 0.10.8\nipykernel 6.23.1\nipython_genutils 0.2.0\njedi 0.18.2\njinja2 3.1.2\njoblib 1.3.2\njson5 NA\njsonpointer 2.0\njsonschema 4.17.3\njupyter_events 0.6.3\njupyter_server 2.6.0\njupyterlab_server 2.22.1\nkiwisolver 1.4.5\nleidenalg 0.10.1\nllvmlite 0.41.1\nlouvain 0.8.1\nmarkupsafe 2.1.2\nmatplotlib 3.8.0\nmatplotlib_inline 0.1.6\nmpl_toolkits NA\nmpmath 1.3.0\nnatsort 8.4.0\nnbformat 5.8.0\nnumba 0.58.1\nnumpy 1.26.2\nopt_einsum v3.3.0\noverrides NA\npackaging 23.1\npandas 2.1.4\nparso 0.8.3\npexpect 4.8.0\npickleshare 0.7.5\npkg_resources NA\nplatformdirs 3.5.1\nprometheus_client NA\nprompt_toolkit 3.0.38\npsutil 5.9.5\nptyprocess 0.7.0\npure_eval 0.2.2\npvectorc NA\npydev_ipython NA\npydevconsole NA\npydevd 2.9.5\npydevd_file_utils NA\npydevd_plugins NA\npydevd_tracing NA\npygments 2.15.1\npyparsing 3.1.1\npyrsistent NA\npythonjsonlogger NA\npytz 2023.3\nrequests 2.31.0\nrfc3339_validator 0.1.4\nrfc3986_validator 0.1.1\nscipy 1.11.4\nsend2trash NA\nsession_info 1.0.0\nsix 1.16.0\nsklearn 1.3.2\nsniffio 1.3.0\nsocks 1.7.1\nstack_data 0.6.2\nsympy 1.12\ntexttable 1.7.0\nthreadpoolctl 3.2.0\ntorch 2.0.0\ntornado 6.3.2\ntqdm 4.65.0\ntraitlets 5.9.0\ntyping_extensions NA\nurllib3 2.0.2\nwcwidth 0.2.6\nwebsocket 1.5.2\nyaml 6.0\nzmq 25.0.2\nzoneinfo NA\nzstandard 0.19.0\n-----\nIPython 8.13.2\njupyter_client 8.2.0\njupyter_core 5.3.0\njupyterlab 4.0.1\nnotebook 6.5.4\n-----\nPython 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]\nLinux-6.5.11-linuxkit-x86_64-with-glibc2.35\n-----\nSession information updated at 2024-01-23 11:28" }, { "objectID": "labs/scanpy/scanpy_05_dge.html", "href": "labs/scanpy/scanpy_05_dge.html", "title": " Differential gene expression", "section": "", - "text": "Note\n\n\n\nCode chunks run Python commands unless it starts with %%bash, in which case, those chunks run shell commands.\nIn this tutorial we 
will cover differential gene expression, which comprises an extensive range of topics and methods. In single-cell analysis, differential expression serves several purposes, such as identifying marker genes for cell populations and finding genes that are differentially regulated across conditions (e.g. disease vs control). We will also practice accounting for batch information in the test.\nDifferential expression is performed with the function rank_genes_groups. The default method to compute differential expression is the t-test. Other implemented methods are: t-test_overestim_var, logreg and wilcoxon.\nBy default, the .raw attribute of AnnData is used if it has been initialized; this can be changed by setting use_raw=False.\nThe clustering with resolution 0.6 seems to give a reasonable number of clusters, so we will use that clustering for all DE tests.\nFirst, let’s import libraries and fetch the clustered data from the previous lab.\nimport numpy as np\nimport pandas as pd\nimport scanpy as sc\nimport gseapy\nimport matplotlib.pyplot as plt\nimport warnings\nimport os\nimport urllib.request\n\nwarnings.simplefilter(action=\"ignore\", category=Warning)\n\n# verbosity: errors (0), warnings (1), info (2), hints (3)\nsc.settings.verbosity = 2\n\nsc.settings.set_figure_params(dpi=80)\nRead in the clustered data object.\npath_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\npath_results = \"data/covid/results\"\nif not os.path.exists(path_results):\n os.makedirs(path_results, exist_ok=True)\n \n# path_file = \"data/covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad\"\npath_file = \"data/covid/results/scanpy_clustered_covid.h5ad\"\nif not os.path.exists(path_file):\n urllib.request.urlretrieve(os.path.join(\n path_data, 'covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad'), path_file)\n\nadata = sc.read_h5ad(path_file)\nadata\n\nAnnData object with n_obs × n_vars = 5725 × 2727\n obs: 'type', 'sample', 'batch', 
'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info', 'leiden_1.0', 'leiden_0.6', 'leiden_0.4', 'leiden_1.4', 'louvain_1.0', 'louvain_0.6', 'louvain_0.4', 'louvain_1.4', 'kmeans5', 'kmeans10', 'kmeans15', 'hclust_5', 'hclust_10', 'hclust_15'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'\n uns: 'dendrogram_leiden_0.6', 'dendrogram_louvain_0.6', 'doublet_info_colors', 'hclust_10_colors', 'hclust_15_colors', 'hclust_5_colors', 'hvg', 'kmeans10_colors', 'kmeans15_colors', 'kmeans5_colors', 'leiden', 'leiden_0.4_colors', 'leiden_0.6_colors', 'leiden_1.0_colors', 'leiden_1.4_colors', 'log1p', 'louvain', 'louvain_0.4_colors', 'louvain_0.6_colors', 'louvain_1.0_colors', 'louvain_1.4_colors', 'neighbors', 'pca', 'sample_colors', 'tsne', 'umap'\n obsm: 'Scanorama', 'X_pca', 'X_tsne', 'X_umap'\n varm: 'PCs'\n obsp: 'connectivities', 'distances'\nprint(adata.X.shape)\nprint(adata.raw.X.shape)\nprint(adata.raw.X[:10,:10])\n\n(5725, 2727)\n(5725, 18830)\n (1, 3) 0.7825693876867097\n (8, 6) 1.1311041336746985\nAs you can see, the X matrix only contains the variable genes, while the raw matrix contains all genes.\nPrinting a few of the values in adata.raw.X shows that the raw matrix is not normalized.\nFor DGE analysis we would like to run with all genes, but on normalized values, so we will have to revert back to the raw matrix and renormalize.\nadata = adata.raw.to_adata()\nsc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)\nsc.pp.log1p(adata)\n\nnormalizing by total count per cell\n finished (0:00:00): normalized adata.X and added 
'n_counts', counts per cell before normalization (adata.obs)\nWARNING: adata.X seems to be already log-transformed.\nNow let's look at the clustering of the object we loaded in the UMAP. We will use louvain_0.6 clustering in this exercise.\nsc.pl.umap(adata, color='louvain_0.6')" + "text": "Note\n\n\n\nCode chunks run Python commands unless they start with %%bash, in which case they run shell commands.\nIn this tutorial we will cover differential gene expression, which comprises an extensive range of topics and methods. In single-cell analysis, differential expression serves several purposes, such as identifying marker genes for cell populations and finding genes that are differentially regulated across conditions (e.g. disease vs control). We will also practice accounting for batch information in the test.\nDifferential expression is performed with the function rank_genes_groups. The default method to compute differential expression is the t-test. Other implemented methods are: t-test_overestim_var, logreg and wilcoxon.\nBy default, the .raw attribute of AnnData is used if it has been initialized; this can be changed by setting use_raw=False.\nThe clustering with resolution 0.6 seems to give a reasonable number of clusters, so we will use that clustering for all DE tests.\nFirst, let’s import libraries and fetch the clustered data from the previous lab.\nimport numpy as np\nimport pandas as pd\nimport scanpy as sc\nimport gseapy\nimport matplotlib.pyplot as plt\nimport warnings\nimport os\nimport urllib.request\n\nwarnings.simplefilter(action=\"ignore\", category=Warning)\n\n# verbosity: errors (0), warnings (1), info (2), hints (3)\nsc.settings.verbosity = 2\n\nsc.settings.set_figure_params(dpi=80)\nRead in the clustered data object.\npath_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\npath_results = \"data/covid/results\"\nif not os.path.exists(path_results):\n os.makedirs(path_results, exist_ok=True)\n \n# 
path_file = \"data/covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad\"\npath_file = \"data/covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad\"\nif not os.path.exists(path_file):\n urllib.request.urlretrieve(os.path.join(\n path_data, 'covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad'), path_file)\n\nadata = sc.read_h5ad(path_file)\nadata\n\nAnnData object with n_obs × n_vars = 7222 × 2626\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info', 'leiden_1.0', 'leiden_0.6', 'leiden_0.4', 'leiden_1.4', 'louvain_1.0', 'louvain_0.6', 'louvain_0.4', 'louvain_1.4', 'kmeans5', 'kmeans10', 'kmeans15', 'hclust_5', 'hclust_10', 'hclust_15'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'\n uns: 'dendrogram_leiden_0.6', 'dendrogram_louvain_0.6', 'doublet_info_colors', 'hclust_10_colors', 'hclust_15_colors', 'hclust_5_colors', 'hvg', 'kmeans10_colors', 'kmeans15_colors', 'kmeans5_colors', 'leiden', 'leiden_0.4_colors', 'leiden_0.6_colors', 'leiden_1.0_colors', 'leiden_1.4_colors', 'log1p', 'louvain', 'louvain_0.4_colors', 'louvain_0.6_colors', 'louvain_1.0_colors', 'louvain_1.4_colors', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap'\n obsm: 'Scanorama', 'X_pca', 'X_tsne', 'X_umap'\n varm: 'PCs'\n obsp: 'connectivities', 'distances'\nprint(adata.X.shape)\nprint(adata.raw.X.shape)\nprint(adata.raw.X[:10,:10])\n\n(7222, 2626)\n(7222, 19468)\n (1, 4) 0.7825693876867097\n (8, 7) 1.1311041336746985\nAs you can see, the X matrix only contains the variable genes, while the raw matrix contains all 
genes.\nPrinting a few of the values in adata.raw.X shows that the raw matrix is normalized.\nFor DGE analysis we would like to run with all genes, on normalized values, so we will have to revert back to the raw matrix. In case you have raw counts in the matrix you also have to renormalize and logtransform.\nadata = adata.raw.to_adata()\nNow lets look at the clustering of the object we loaded in the umap. We will use louvain_0.6 clustering in this exercise.\nsc.pl.umap(adata, color='louvain_0.6')" }, { "objectID": "labs/scanpy/scanpy_05_dge.html#t-test", "href": "labs/scanpy/scanpy_05_dge.html#t-test", "title": " Differential gene expression", "section": "1 T-test", - "text": "1 T-test\n\nsc.tl.rank_genes_groups(adata, 'louvain_0.6', method='t-test', key_added = \"t-test\")\nsc.pl.rank_genes_groups(adata, n_genes=25, sharey=False, key = \"t-test\")\n\n# results are stored in the adata.uns[\"t-test\"] slot\nadata\n\nranking genes\n finished (0:00:02)\n\n\n\n\n\n\n\n\n\nAnnData object with n_obs × n_vars = 5725 × 18830\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info', 'leiden_1.0', 'leiden_0.6', 'leiden_0.4', 'leiden_1.4', 'louvain_1.0', 'louvain_0.6', 'louvain_0.4', 'louvain_1.4', 'kmeans5', 'kmeans10', 'kmeans15', 'hclust_5', 'hclust_10', 'hclust_15'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'\n uns: 'dendrogram_leiden_0.6', 'dendrogram_louvain_0.6', 'doublet_info_colors', 'hclust_10_colors', 'hclust_15_colors', 'hclust_5_colors', 'hvg', 'kmeans10_colors', 'kmeans15_colors', 'kmeans5_colors', 'leiden', 'leiden_0.4_colors', 'leiden_0.6_colors', 'leiden_1.0_colors', 
'leiden_1.4_colors', 'log1p', 'louvain', 'louvain_0.4_colors', 'louvain_0.6_colors', 'louvain_1.0_colors', 'louvain_1.4_colors', 'neighbors', 'pca', 'sample_colors', 'tsne', 'umap', 't-test'\n obsm: 'Scanorama', 'X_pca', 'X_tsne', 'X_umap'\n obsp: 'connectivities', 'distances'" + "text": "1 T-test\n\nsc.tl.rank_genes_groups(adata, 'louvain_0.6', method='t-test', key_added = \"t-test\")\nsc.pl.rank_genes_groups(adata, n_genes=25, sharey=False, key = \"t-test\")\n\n# results are stored in the adata.uns[\"t-test\"] slot\nadata\n\nranking genes\n finished (0:00:02)\n\n\n\n\n\n\n\n\n\nAnnData object with n_obs × n_vars = 7222 × 19468\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info', 'leiden_1.0', 'leiden_0.6', 'leiden_0.4', 'leiden_1.4', 'louvain_1.0', 'louvain_0.6', 'louvain_0.4', 'louvain_1.4', 'kmeans5', 'kmeans10', 'kmeans15', 'hclust_5', 'hclust_10', 'hclust_15'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'\n uns: 'dendrogram_leiden_0.6', 'dendrogram_louvain_0.6', 'doublet_info_colors', 'hclust_10_colors', 'hclust_15_colors', 'hclust_5_colors', 'hvg', 'kmeans10_colors', 'kmeans15_colors', 'kmeans5_colors', 'leiden', 'leiden_0.4_colors', 'leiden_0.6_colors', 'leiden_1.0_colors', 'leiden_1.4_colors', 'log1p', 'louvain', 'louvain_0.4_colors', 'louvain_0.6_colors', 'louvain_1.0_colors', 'louvain_1.4_colors', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap', 't-test'\n obsm: 'Scanorama', 'X_pca', 'X_tsne', 'X_umap'\n obsp: 'connectivities', 'distances'" }, { "objectID": "labs/scanpy/scanpy_05_dge.html#t-test-overestimated_variance", @@ -1145,21 +1145,21 @@ 
"href": "labs/scanpy/scanpy_05_dge.html#wilcoxon-rank-sum", "title": " Differential gene expression", "section": "3 Wilcoxon rank-sum", - "text": "3 Wilcoxon rank-sum\nThe result of a Wilcoxon rank-sum (Mann-Whitney-U) test is very similar. We recommend using the latter in publications, see e.g., Sonison & Robinson (2018). You might also consider much more powerful differential testing packages like MAST, limma, DESeq2 and, for python, the recent diffxpy.\n\nsc.tl.rank_genes_groups(adata, 'louvain_0.6', method='wilcoxon', key_added = \"wilcoxon\")\nsc.pl.rank_genes_groups(adata, n_genes=25, sharey=False, key=\"wilcoxon\")\n\nranking genes\n finished (0:00:07)" + "text": "3 Wilcoxon rank-sum\nThe result of a Wilcoxon rank-sum (Mann-Whitney-U) test is very similar. We recommend using the latter in publications, see e.g., Sonison & Robinson (2018). You might also consider much more powerful differential testing packages like MAST, limma, DESeq2 and, for python, the recent diffxpy.\n\nsc.tl.rank_genes_groups(adata, 'louvain_0.6', method='wilcoxon', key_added = \"wilcoxon\")\nsc.pl.rank_genes_groups(adata, n_genes=25, sharey=False, key=\"wilcoxon\")\n\nranking genes\n finished (0:00:09)" }, { "objectID": "labs/scanpy/scanpy_05_dge.html#logistic-regression-test", "href": "labs/scanpy/scanpy_05_dge.html#logistic-regression-test", "title": " Differential gene expression", "section": "4 Logistic regression test", - "text": "4 Logistic regression test\nAs an alternative, let us rank genes using logistic regression. For instance, this has been suggested by Natranos et al. (2018). The essential difference is that here, we use a multi-variate appraoch whereas conventional differential tests are uni-variate. Clark et al. 
(2014) has more details.\n\nsc.tl.rank_genes_groups(adata, 'louvain_0.6', method='logreg',key_added = \"logreg\")\nsc.pl.rank_genes_groups(adata, n_genes=25, sharey=False, key = \"logreg\")\n\nranking genes\n finished (0:00:17)" + "text": "4 Logistic regression test\nAs an alternative, let us rank genes using logistic regression. For instance, this has been suggested by Ntranos et al. (2018). The essential difference is that here, we use a multivariate approach whereas conventional differential tests are univariate. Clark et al. (2014) has more details.\n\nsc.tl.rank_genes_groups(adata, 'louvain_0.6', method='logreg',key_added = \"logreg\")\nsc.pl.rank_genes_groups(adata, n_genes=25, sharey=False, key = \"logreg\")\n\nranking genes\n finished (0:00:20)" }, { "objectID": "labs/scanpy/scanpy_05_dge.html#compare-genes", "href": "labs/scanpy/scanpy_05_dge.html#compare-genes", "title": " Differential gene expression", "section": "5 Compare genes", - "text": "5 Compare genes\nTake all significant DE genes for cluster 0 with each test and compare the overlap.\n\n#compare cluster 0 genes, only stores top 100 by default\n\nwc = sc.get.rank_genes_groups_df(adata, group='0', key='wilcoxon', pval_cutoff=0.01, log2fc_min=0)['names']\ntt = sc.get.rank_genes_groups_df(adata, group='0', key='t-test', pval_cutoff=0.01, log2fc_min=0)['names']\ntt_ov = sc.get.rank_genes_groups_df(adata, group='0', key='t-test_ov', pval_cutoff=0.01, log2fc_min=0)['names']\n\nfrom matplotlib_venn import venn3\n\nvenn3([set(wc),set(tt),set(tt_ov)], ('Wilcox','T-test','T-test_ov') )\nplt.show()\n\n\n\n\n\n\n\n\nAs you can see, the Wilcoxon test and the T-test with overestimated variance give very similar results. The regular T-test also has good overlap, while the logistic regression gives quite different genes." 
+ "text": "5 Compare genes\nTake all significant DE genes for cluster 0 with each test and compare the overlap.\n\n#compare cluster 0 genes, only stores top 100 by default\n\nwc = sc.get.rank_genes_groups_df(adata, group='0', key='wilcoxon', pval_cutoff=0.01, log2fc_min=0)['names']\ntt = sc.get.rank_genes_groups_df(adata, group='0', key='t-test', pval_cutoff=0.01, log2fc_min=0)['names']\ntt_ov = sc.get.rank_genes_groups_df(adata, group='0', key='t-test_ov', pval_cutoff=0.01, log2fc_min=0)['names']\n\nfrom matplotlib_venn import venn3\n\nvenn3([set(wc),set(tt),set(tt_ov)], ('Wilcox','T-test','T-test_ov') )\nplt.show()\n\n\n\n\n\n\n\n\nAs you can see, the Wilcoxon test and the T-test with overestimated variance give very similar results. The regular T-test also has good overlap." }, { "objectID": "labs/scanpy/scanpy_05_dge.html#visualization", @@ -1180,56 +1180,56 @@ "href": "labs/scanpy/scanpy_05_dge.html#meta-dge_cond", "title": " Differential gene expression", "section": "8 DGE across conditions", - "text": "8 DGE across conditions\nThe second way of computing differential expression is to answer which genes are differentially expressed within a cluster. For example, in our case we have libraries coming from patients and controls and we would like to know which genes are influenced the most in a particular cell type. To this end, we will first subset our data for the desired cell cluster, then change the cell identities to the variable of comparison (which now in our case is the “type”, e.g. 
Covid/Ctrl).\n\ncl1 = adata[adata.obs['louvain_0.6'] == '4',:]\ncl1.obs['type'].value_counts()\n\nsc.tl.rank_genes_groups(cl1, 'type', method='wilcoxon', key_added = \"wilcoxon\")\nsc.pl.rank_genes_groups(cl1, n_genes=25, sharey=False, key=\"wilcoxon\")\n\nranking genes\n finished (0:00:00)\n\n\n\n\n\n\n\n\n\n\nsc.pl.rank_genes_groups_violin(cl1, n_genes=10, key=\"wilcoxon\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWe can also plot these genes across all clusters, but split by “type”, to check if the genes are also up/downregulated in other celltypes.\n\nimport seaborn as sns\n\ngenes1 = sc.get.rank_genes_groups_df(cl1, group='Covid', key='wilcoxon')['names'][:5]\ngenes2 = sc.get.rank_genes_groups_df(cl1, group='Ctrl', key='wilcoxon')['names'][:5]\ngenes = genes1.tolist() + genes2.tolist() \ndf = sc.get.obs_df(adata, genes + ['louvain_0.6','type'], use_raw=False)\ndf2 = df.melt(id_vars=[\"louvain_0.6\",'type'], value_vars=genes)\n\nsns.catplot(x = \"louvain_0.6\", y = \"value\", hue = \"type\", kind = 'violin', col = \"variable\", data = df2, col_wrap=4, inner=None)\n\n\n\n\n\n\n\n\nAs you can see, we have many sex chromosome related genes among the top DE genes. 
And if you remember from the QC lab, we have inbalanced sex distribution among our subjects, so this may not be related to covid at all.\n\n8.1 Remove sex chromosome genes\nTo remove some of the bias due to inbalanced sex in the subjects we can remove the sex chromosome related genes.\n\nannot = sc.queries.biomart_annotations(\n \"hsapiens\",\n [\"ensembl_gene_id\", \"external_gene_name\", \"start_position\", \"end_position\", \"chromosome_name\"],\n ).set_index(\"external_gene_name\")\n\nchrY_genes = adata.var_names.intersection(annot.index[annot.chromosome_name == \"Y\"])\nchrX_genes = adata.var_names.intersection(annot.index[annot.chromosome_name == \"X\"])\n\nsex_genes = chrY_genes.union(chrX_genes)\nprint(len(sex_genes))\nall_genes = cl1.var.index.tolist()\nprint(len(all_genes))\n\nkeep_genes = [x for x in all_genes if x not in sex_genes]\nprint(len(keep_genes))\n\ncl1 = cl1[:,keep_genes]\n\n536\n18830\n18294\n\n\nRerun differential expression.\n\nsc.tl.rank_genes_groups(cl1, 'type', method='wilcoxon', key_added = \"wilcoxon\")\nsc.pl.rank_genes_groups(cl1, n_genes=25, sharey=False, key=\"wilcoxon\")\n\nranking genes\n finished (0:00:00)\n\n\n\n\n\n\n\n\n\n\n\n8.2 Patient batch effects\nWhen we are testing for Covid vs Control we are running a DGE test for 3 vs 3 individuals. That will be very sensitive to sample differences unless we find a way to control for it. 
So first, lets check how the top DGEs are expressed across the individuals:\n\ngenes1 = sc.get.rank_genes_groups_df(cl1, group='Covid', key='wilcoxon')['names'][:5]\ngenes2 = sc.get.rank_genes_groups_df(cl1, group='Ctrl', key='wilcoxon')['names'][:5]\ngenes = genes1.tolist() + genes2.tolist() \n\nsc.pl.violin(cl1, genes1, groupby='sample')\nsc.pl.violin(cl1, genes2, groupby='sample')\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAs you can see, many of the genes detected as DGE in Covid are unique to one or 2 patients.\nWe can examine more genes with a DotPlot:\nWe can also plot the top Covid and top Ctrl genes as a dotplot:\n\ngenes1 = sc.get.rank_genes_groups_df(cl1, group='Covid', key='wilcoxon')['names'][:20]\ngenes2 = sc.get.rank_genes_groups_df(cl1, group='Ctrl', key='wilcoxon')['names'][:20]\ngenes = genes1.tolist() + genes2.tolist() \n\nsc.pl.dotplot(cl1,genes, groupby='sample')\n\n\n\n\n\n\n\n\nClearly many of the top Covid genes are only high in the covid_17 sample, and not a general feature of covid patients.\nThis is also the patient with the highest number of cells in this cluster:\n\ncl1.obs['sample'].value_counts()\n\nsample\ncovid_17 129\nctrl_5 115\ncovid_1 110\nctrl_13 64\nctrl_14 63\ncovid_15 35\nName: count, dtype: int64\n\n\n\n\n8.3 Subsample\nSo one obvious thing to consider is an equal amount of cells per individual so that the DGE results are not dominated by a single sample.\nSo we will downsample to an equal number of cells per sample.\n\ncl1.obs['sample'].value_counts()\n\nsample\ncovid_17 129\nctrl_5 115\ncovid_1 110\nctrl_13 64\nctrl_14 63\ncovid_15 35\nName: count, dtype: int64\n\n\n\ntarget_cells = 50\n\ntmp = [cl1[cl1.obs['sample'] == s] for s in cl1.obs['sample'].cat.categories]\n\nfor dat in tmp:\n if dat.n_obs > target_cells:\n sc.pp.subsample(dat, n_obs=target_cells)\n\ncl1_sub = tmp[0].concatenate(*tmp[1:])\n\ncl1_sub.obs['sample'].value_counts()\n\nsample\ncovid_1 50\ncovid_17 50\nctrl_5 50\nctrl_13 50\nctrl_14 50\ncovid_15 35\nName: count, 
dtype: int64\n\n\n\nsc.tl.rank_genes_groups(cl1_sub, 'type', method='wilcoxon', key_added = \"wilcoxon\")\nsc.pl.rank_genes_groups(cl1_sub, n_genes=25, sharey=False, key=\"wilcoxon\")\n\nranking genes\n finished (0:00:00)\n\n\n\n\n\n\n\n\n\n\ngenes1 = sc.get.rank_genes_groups_df(cl1_sub, group='Covid', key='wilcoxon')['names'][:20]\ngenes2 = sc.get.rank_genes_groups_df(cl1_sub, group='Ctrl', key='wilcoxon')['names'][:20]\ngenes = genes1.tolist() + genes2.tolist() \n\nsc.pl.dotplot(cl1,genes, groupby='sample')\n\n\n\n\n\n\n\n\nIt looks much better now. But if we look per patient you can see that we still have some genes that are dominated by a single patient. Still, it is often a good idea to control the number of cells from each sample when doing differential expression.\nWhy do you think this is?\nThere are many different ways to try and resolve the issue of patient batch effects; however, most of them require R packages. These can be run via rpy2, as is demonstrated in this compendium: https://www.sc-best-practices.org/conditions/differential_gene_expression.html\nHowever, we have not included it here as of now. So please have a look at the patient batch effect section in the Seurat DGE tutorial where we run edgeR on pseudobulk and MAST with random effect." + "text": "8 DGE across conditions\nThe second way of computing differential expression is to answer which genes are differentially expressed within a cluster. For example, in our case we have libraries coming from patients and controls and we would like to know which genes are influenced the most in a particular cell type. To this end, we will first subset our data for the desired cell cluster, then change the cell identities to the variable of comparison (which now in our case is the “type”, e.g. 
Covid/Ctrl).\n\ncl1 = adata[adata.obs['louvain_0.6'] == '4',:]\ncl1.obs['type'].value_counts()\n\nsc.tl.rank_genes_groups(cl1, 'type', method='wilcoxon', key_added = \"wilcoxon\")\nsc.pl.rank_genes_groups(cl1, n_genes=25, sharey=False, key=\"wilcoxon\")\n\nranking genes\n finished (0:00:00)\n\n\n\n\n\n\n\n\n\n\nsc.pl.rank_genes_groups_violin(cl1, n_genes=10, key=\"wilcoxon\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWe can also plot these genes across all clusters, but split by “type”, to check if the genes are also up/downregulated in other celltypes.\n\nimport seaborn as sns\n\ngenes1 = sc.get.rank_genes_groups_df(cl1, group='Covid', key='wilcoxon')['names'][:5]\ngenes2 = sc.get.rank_genes_groups_df(cl1, group='Ctrl', key='wilcoxon')['names'][:5]\ngenes = genes1.tolist() + genes2.tolist() \ndf = sc.get.obs_df(adata, genes + ['louvain_0.6','type'], use_raw=False)\ndf2 = df.melt(id_vars=[\"louvain_0.6\",'type'], value_vars=genes)\n\nsns.catplot(x = \"louvain_0.6\", y = \"value\", hue = \"type\", kind = 'violin', col = \"variable\", data = df2, col_wrap=4, inner=None)\n\n\n\n\n\n\n\n\nAs you can see, we have many sex chromosome related genes among the top DE genes. 
And if you remember from the QC lab, we have an imbalanced sex distribution among our subjects, so this may not be related to covid at all.\n\n8.1 Remove sex chromosome genes\nTo remove some of the bias due to imbalanced sex in the subjects we can remove the sex chromosome related genes.\n\nannot = sc.queries.biomart_annotations(\n \"hsapiens\",\n [\"ensembl_gene_id\", \"external_gene_name\", \"start_position\", \"end_position\", \"chromosome_name\"],\n ).set_index(\"external_gene_name\")\n\nchrY_genes = adata.var_names.intersection(annot.index[annot.chromosome_name == \"Y\"])\nchrX_genes = adata.var_names.intersection(annot.index[annot.chromosome_name == \"X\"])\n\nsex_genes = chrY_genes.union(chrX_genes)\nprint(len(sex_genes))\nall_genes = cl1.var.index.tolist()\nprint(len(all_genes))\n\nkeep_genes = [x for x in all_genes if x not in sex_genes]\nprint(len(keep_genes))\n\ncl1 = cl1[:,keep_genes]\n\n551\n19468\n18917\n\n\nRerun differential expression.\n\nsc.tl.rank_genes_groups(cl1, 'type', method='wilcoxon', key_added = \"wilcoxon\")\nsc.pl.rank_genes_groups(cl1, n_genes=25, sharey=False, key=\"wilcoxon\")\n\nranking genes\n finished (0:00:00)\n\n\n\n\n\n\n\n\n\n\n\n8.2 Patient batch effects\nWhen we are testing for Covid vs Control we are running a DGE test for 3 vs 3 individuals. That will be very sensitive to sample differences unless we find a way to control for it. 
So first, let's check how the top DGEs are expressed across the individuals:\n\ngenes1 = sc.get.rank_genes_groups_df(cl1, group='Covid', key='wilcoxon')['names'][:5]\ngenes2 = sc.get.rank_genes_groups_df(cl1, group='Ctrl', key='wilcoxon')['names'][:5]\ngenes = genes1.tolist() + genes2.tolist() \n\nsc.pl.violin(cl1, genes1, groupby='sample')\nsc.pl.violin(cl1, genes2, groupby='sample')\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAs you can see, many of the genes detected as DGE in Covid are unique to one or two patients.\nWe can also plot the top Covid and top Ctrl genes as a dotplot:\n\ngenes1 = sc.get.rank_genes_groups_df(cl1, group='Covid', key='wilcoxon')['names'][:20]\ngenes2 = sc.get.rank_genes_groups_df(cl1, group='Ctrl', key='wilcoxon')['names'][:20]\ngenes = genes1.tolist() + genes2.tolist() \n\nsc.pl.dotplot(cl1,genes, groupby='sample')\n\n\n\n\n\n\n\n\nClearly many of the top Covid genes are only high in the covid_17 sample, and not a general feature of covid patients.\nThis is also the patient with the highest number of cells in this cluster:\n\ncl1.obs['sample'].value_counts()\n\nsample\ncovid_17 130\nctrl_5 114\ncovid_1 109\nctrl_13 65\nctrl_14 62\nctrl_19 57\ncovid_16 38\ncovid_15 37\nName: count, dtype: int64\n\n\n\n\n8.3 Subsample\nSo one obvious thing to consider is an equal number of cells per individual so that the DGE results are not dominated by a single sample.\nSo we will downsample to an equal number of cells per sample, in this case 37 cells per sample as it is the lowest number among all samples.\n\ntarget_cells = 37\n\ntmp = [cl1[cl1.obs['sample'] == s] for s in cl1.obs['sample'].cat.categories]\n\nfor dat in tmp:\n if dat.n_obs > target_cells:\n sc.pp.subsample(dat, n_obs=target_cells)\n\ncl1_sub = tmp[0].concatenate(*tmp[1:])\n\ncl1_sub.obs['sample'].value_counts()\n\nsample\ncovid_1 37\ncovid_15 37\ncovid_16 37\ncovid_17 37\nctrl_5 37\nctrl_13 37\nctrl_14 37\nctrl_19 37\nName: count, dtype: int64\n\n\n\nsc.tl.rank_genes_groups(cl1_sub, 'type', 
method='wilcoxon', key_added = \"wilcoxon\")\nsc.pl.rank_genes_groups(cl1_sub, n_genes=25, sharey=False, key=\"wilcoxon\")\n\nranking genes\n finished (0:00:00)\n\n\n\n\n\n\n\n\n\n\ngenes1 = sc.get.rank_genes_groups_df(cl1_sub, group='Covid', key='wilcoxon')['names'][:20]\ngenes2 = sc.get.rank_genes_groups_df(cl1_sub, group='Ctrl', key='wilcoxon')['names'][:20]\ngenes = genes1.tolist() + genes2.tolist() \n\nsc.pl.dotplot(cl1,genes, groupby='sample')\n\n\n\n\n\n\n\n\nIt looks much better now. But if we look per patient you can see that we still have some genes that are dominated by a single patient. Still, it is often a good idea to control the number of cells from each sample when doing differential expression.\nThere are many different ways to try and resolve the issue of patient batch effects; however, most of them require R packages. These can be run via rpy2, as is demonstrated in this compendium: https://www.sc-best-practices.org/conditions/differential_gene_expression.html\nHowever, we have not included it here as of now. So please have a look at the patient batch effect section in the Seurat DGE tutorial where we run edgeR on pseudobulk and MAST with random effect." 
}, { "objectID": "labs/scanpy/scanpy_05_dge.html#meta-dge_gsa", "href": "labs/scanpy/scanpy_05_dge.html#meta-dge_gsa", "title": " Differential gene expression", "section": "9 Gene Set Analysis (GSA)", - "text": "9 Gene Set Analysis (GSA)\n\n9.1 Hypergeometric enrichment test\nHaving a defined list of differentially expressed genes, you can now look for their combined function using hypergeometric test.\n\n#Available databases : ‘Human’, ‘Mouse’, ‘Yeast’, ‘Fly’, ‘Fish’, ‘Worm’ \ngene_set_names = gseapy.get_library_name(organism='Human')\nprint(gene_set_names)\n\n['ARCHS4_Cell-lines', 'ARCHS4_IDG_Coexp', 'ARCHS4_Kinases_Coexp', 'ARCHS4_TFs_Coexp', 'ARCHS4_Tissues', 'Achilles_fitness_decrease', 'Achilles_fitness_increase', 'Aging_Perturbations_from_GEO_down', 'Aging_Perturbations_from_GEO_up', 'Allen_Brain_Atlas_10x_scRNA_2021', 'Allen_Brain_Atlas_down', 'Allen_Brain_Atlas_up', 'Azimuth_2023', 'Azimuth_Cell_Types_2021', 'BioCarta_2013', 'BioCarta_2015', 'BioCarta_2016', 'BioPlanet_2019', 'BioPlex_2017', 'CCLE_Proteomics_2020', 'CORUM', 'COVID-19_Related_Gene_Sets', 'COVID-19_Related_Gene_Sets_2021', 'Cancer_Cell_Line_Encyclopedia', 'CellMarker_Augmented_2021', 'ChEA_2013', 'ChEA_2015', 'ChEA_2016', 'ChEA_2022', 'Chromosome_Location', 'Chromosome_Location_hg19', 'ClinVar_2019', 'DSigDB', 'Data_Acquisition_Method_Most_Popular_Genes', 'DepMap_WG_CRISPR_Screens_Broad_CellLines_2019', 'DepMap_WG_CRISPR_Screens_Sanger_CellLines_2019', 'Descartes_Cell_Types_and_Tissue_2021', 'Diabetes_Perturbations_GEO_2022', 'DisGeNET', 'Disease_Perturbations_from_GEO_down', 'Disease_Perturbations_from_GEO_up', 'Disease_Signatures_from_GEO_down_2014', 'Disease_Signatures_from_GEO_up_2014', 'DrugMatrix', 'Drug_Perturbations_from_GEO_2014', 'Drug_Perturbations_from_GEO_down', 'Drug_Perturbations_from_GEO_up', 'ENCODE_Histone_Modifications_2013', 'ENCODE_Histone_Modifications_2015', 'ENCODE_TF_ChIP-seq_2014', 'ENCODE_TF_ChIP-seq_2015', 'ENCODE_and_ChEA_Consensus_TFs_from_ChIP-X', 'ESCAPE', 
'Elsevier_Pathway_Collection', 'Enrichr_Libraries_Most_Popular_Genes', 'Enrichr_Submissions_TF-Gene_Coocurrence', 'Enrichr_Users_Contributed_Lists_2020', 'Epigenomics_Roadmap_HM_ChIP-seq', 'FANTOM6_lncRNA_KD_DEGs', 'GO_Biological_Process_2013', 'GO_Biological_Process_2015', 'GO_Biological_Process_2017', 'GO_Biological_Process_2017b', 'GO_Biological_Process_2018', 'GO_Biological_Process_2021', 'GO_Biological_Process_2023', 'GO_Cellular_Component_2013', 'GO_Cellular_Component_2015', 'GO_Cellular_Component_2017', 'GO_Cellular_Component_2017b', 'GO_Cellular_Component_2018', 'GO_Cellular_Component_2021', 'GO_Cellular_Component_2023', 'GO_Molecular_Function_2013', 'GO_Molecular_Function_2015', 'GO_Molecular_Function_2017', 'GO_Molecular_Function_2017b', 'GO_Molecular_Function_2018', 'GO_Molecular_Function_2021', 'GO_Molecular_Function_2023', 'GTEx_Aging_Signatures_2021', 'GTEx_Tissue_Expression_Down', 'GTEx_Tissue_Expression_Up', 'GTEx_Tissues_V8_2023', 'GWAS_Catalog_2019', 'GWAS_Catalog_2023', 'GeDiPNet_2023', 'GeneSigDB', 'Gene_Perturbations_from_GEO_down', 'Gene_Perturbations_from_GEO_up', 'Genes_Associated_with_NIH_Grants', 'Genome_Browser_PWMs', 'GlyGen_Glycosylated_Proteins_2022', 'HDSigDB_Human_2021', 'HDSigDB_Mouse_2021', 'HMDB_Metabolites', 'HMS_LINCS_KinomeScan', 'HomoloGene', 'HuBMAP_ASCT_plus_B_augmented_w_RNAseq_Coexpression', 'HuBMAP_ASCTplusB_augmented_2022', 'HumanCyc_2015', 'HumanCyc_2016', 'Human_Gene_Atlas', 'Human_Phenotype_Ontology', 'IDG_Drug_Targets_2022', 'InterPro_Domains_2019', 'Jensen_COMPARTMENTS', 'Jensen_DISEASES', 'Jensen_TISSUES', 'KEA_2013', 'KEA_2015', 'KEGG_2013', 'KEGG_2015', 'KEGG_2016', 'KEGG_2019_Human', 'KEGG_2019_Mouse', 'KEGG_2021_Human', 'KOMP2_Mouse_Phenotypes_2022', 'Kinase_Perturbations_from_GEO_down', 'Kinase_Perturbations_from_GEO_up', 'L1000_Kinase_and_GPCR_Perturbations_down', 'L1000_Kinase_and_GPCR_Perturbations_up', 'LINCS_L1000_CRISPR_KO_Consensus_Sigs', 'LINCS_L1000_Chem_Pert_Consensus_Sigs', 
'LINCS_L1000_Chem_Pert_down', 'LINCS_L1000_Chem_Pert_up', 'LINCS_L1000_Ligand_Perturbations_down', 'LINCS_L1000_Ligand_Perturbations_up', 'Ligand_Perturbations_from_GEO_down', 'Ligand_Perturbations_from_GEO_up', 'MAGMA_Drugs_and_Diseases', 'MAGNET_2023', 'MCF7_Perturbations_from_GEO_down', 'MCF7_Perturbations_from_GEO_up', 'MGI_Mammalian_Phenotype_2013', 'MGI_Mammalian_Phenotype_2017', 'MGI_Mammalian_Phenotype_Level_3', 'MGI_Mammalian_Phenotype_Level_4', 'MGI_Mammalian_Phenotype_Level_4_2019', 'MGI_Mammalian_Phenotype_Level_4_2021', 'MSigDB_Computational', 'MSigDB_Hallmark_2020', 'MSigDB_Oncogenic_Signatures', 'Metabolomics_Workbench_Metabolites_2022', 'Microbe_Perturbations_from_GEO_down', 'Microbe_Perturbations_from_GEO_up', 'MoTrPAC_2023', 'Mouse_Gene_Atlas', 'NCI-60_Cancer_Cell_Lines', 'NCI-Nature_2015', 'NCI-Nature_2016', 'NIH_Funded_PIs_2017_AutoRIF_ARCHS4_Predictions', 'NIH_Funded_PIs_2017_GeneRIF_ARCHS4_Predictions', 'NIH_Funded_PIs_2017_Human_AutoRIF', 'NIH_Funded_PIs_2017_Human_GeneRIF', 'NURSA_Human_Endogenous_Complexome', 'OMIM_Disease', 'OMIM_Expanded', 'Old_CMAP_down', 'Old_CMAP_up', 'Orphanet_Augmented_2021', 'PFOCR_Pathways', 'PFOCR_Pathways_2023', 'PPI_Hub_Proteins', 'PanglaoDB_Augmented_2021', 'Panther_2015', 'Panther_2016', 'Pfam_Domains_2019', 'Pfam_InterPro_Domains', 'PheWeb_2019', 'PhenGenI_Association_2021', 'Phosphatase_Substrates_from_DEPOD', 'ProteomicsDB_2020', 'Proteomics_Drug_Atlas_2023', 'RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO', 'RNAseq_Automatic_GEO_Signatures_Human_Down', 'RNAseq_Automatic_GEO_Signatures_Human_Up', 'RNAseq_Automatic_GEO_Signatures_Mouse_Down', 'RNAseq_Automatic_GEO_Signatures_Mouse_Up', 'Rare_Diseases_AutoRIF_ARCHS4_Predictions', 'Rare_Diseases_AutoRIF_Gene_Lists', 'Rare_Diseases_GeneRIF_ARCHS4_Predictions', 'Rare_Diseases_GeneRIF_Gene_Lists', 'Reactome_2013', 'Reactome_2015', 'Reactome_2016', 'Reactome_2022', 'Rummagene_kinases', 'Rummagene_signatures', 'Rummagene_transcription_factors', 
'SILAC_Phosphoproteomics', 'SubCell_BarCode', 'SynGO_2022', 'SysMyo_Muscle_Gene_Sets', 'TF-LOF_Expression_from_GEO', 'TF_Perturbations_Followed_by_Expression', 'TG_GATES_2020', 'TRANSFAC_and_JASPAR_PWMs', 'TRRUST_Transcription_Factors_2019', 'Table_Mining_of_CRISPR_Studies', 'Tabula_Muris', 'Tabula_Sapiens', 'TargetScan_microRNA', 'TargetScan_microRNA_2017', 'The_Kinase_Library_2023', 'Tissue_Protein_Expression_from_Human_Proteome_Map', 'Tissue_Protein_Expression_from_ProteomicsDB', 'Transcription_Factor_PPIs', 'UK_Biobank_GWAS_v1', 'Virus-Host_PPI_P-HIPSTer_2020', 'VirusMINT', 'Virus_Perturbations_from_GEO_down', 'Virus_Perturbations_from_GEO_up', 'WikiPathway_2021_Human', 'WikiPathway_2023_Human', 'WikiPathways_2013', 'WikiPathways_2015', 'WikiPathways_2016', 'WikiPathways_2019_Human', 'WikiPathways_2019_Mouse', 'dbGaP', 'huMAP', 'lncHUB_lncRNA_Co-Expression', 'miRTarBase_2017']\n\n\nGet the significant DEGs for the Covid patients.\n\n#?gseapy.enrichr\nglist = sc.get.rank_genes_groups_df(cl1_sub, group='Covid', key='wilcoxon', log2fc_min=0.25, pval_cutoff=0.05)['names'].squeeze().str.strip().tolist()\nprint(len(glist))\n\n18\n\n\n\nenr_res = gseapy.enrichr(gene_list=glist, organism='Human', gene_sets='GO_Biological_Process_2018', cutoff = 0.5)\nenr_res.results.head()\n\n\n\n\n\n\n\n\nGene_set\nTerm\nOverlap\nP-value\nAdjusted P-value\nOld P-value\nOld Adjusted P-value\nOdds Ratio\nCombined Score\nGenes\n\n\n\n\n0\nGO_Biological_Process_2018\ncellular response to type I interferon (GO:007...\n5/65\n2.569729e-09\n3.186464e-07\n0\n0\n127.705128\n2525.939157\nISG20;IFITM1;IFITM2;ISG15;XAF1\n\n\n1\nGO_Biological_Process_2018\ntype I interferon signaling pathway (GO:0060337)\n5/65\n2.569729e-09\n3.186464e-07\n0\n0\n127.705128\n2525.939157\nISG20;IFITM1;IFITM2;ISG15;XAF1\n\n\n2\nGO_Biological_Process_2018\ncytokine-mediated signaling pathway 
(GO:0019221)\n8/633\n3.184846e-08\n2.632806e-06\n0\n0\n24.776960\n427.706745\nISG20;NFKBIA;IFITM1;IFITM2;ISG15;VIM;XAF1;SOD1\n\n\n3\nGO_Biological_Process_2018\nnegative regulation of viral genome replicatio...\n4/50\n1.030352e-07\n6.388183e-06\n0\n0\n123.826087\n1992.138251\nISG20;IFITM1;IFITM2;ISG15\n\n\n4\nGO_Biological_Process_2018\nnegative regulation of viral life cycle (GO:19...\n4/61\n2.320426e-07\n1.093538e-05\n0\n0\n99.874687\n1525.720152\nISG20;IFITM1;IFITM2;ISG15\n\n\n\n\n\n\n\nSome databases of interest:\nGO_Biological_Process_2017bKEGG_2019_HumanKEGG_2019_MouseWikiPathways_2019_HumanWikiPathways_2019_Mouse\nYou visualize your results using a simple barplot, for example:\n\ngseapy.barplot(enr_res.res2d,title='GO_Biological_Process_2018')\n\n<Axes: title={'center': 'GO_Biological_Process_2018'}, xlabel='$- \\\\log_{10}$ (Adjusted P-value)'>" + "text": "9 Gene Set Analysis (GSA)\n\n9.1 Hypergeometric enrichment test\nHaving a defined list of differentially expressed genes, you can now look for their combined function using hypergeometric test.\n\n#Available databases : ‘Human’, ‘Mouse’, ‘Yeast’, ‘Fly’, ‘Fish’, ‘Worm’ \ngene_set_names = gseapy.get_library_name(organism='Human')\nprint(gene_set_names)\n\n['ARCHS4_Cell-lines', 'ARCHS4_IDG_Coexp', 'ARCHS4_Kinases_Coexp', 'ARCHS4_TFs_Coexp', 'ARCHS4_Tissues', 'Achilles_fitness_decrease', 'Achilles_fitness_increase', 'Aging_Perturbations_from_GEO_down', 'Aging_Perturbations_from_GEO_up', 'Allen_Brain_Atlas_10x_scRNA_2021', 'Allen_Brain_Atlas_down', 'Allen_Brain_Atlas_up', 'Azimuth_2023', 'Azimuth_Cell_Types_2021', 'BioCarta_2013', 'BioCarta_2015', 'BioCarta_2016', 'BioPlanet_2019', 'BioPlex_2017', 'CCLE_Proteomics_2020', 'CORUM', 'COVID-19_Related_Gene_Sets', 'COVID-19_Related_Gene_Sets_2021', 'Cancer_Cell_Line_Encyclopedia', 'CellMarker_Augmented_2021', 'ChEA_2013', 'ChEA_2015', 'ChEA_2016', 'ChEA_2022', 'Chromosome_Location', 'Chromosome_Location_hg19', 'ClinVar_2019', 'DSigDB', 
'Data_Acquisition_Method_Most_Popular_Genes', 'DepMap_WG_CRISPR_Screens_Broad_CellLines_2019', 'DepMap_WG_CRISPR_Screens_Sanger_CellLines_2019', 'Descartes_Cell_Types_and_Tissue_2021', 'Diabetes_Perturbations_GEO_2022', 'DisGeNET', 'Disease_Perturbations_from_GEO_down', 'Disease_Perturbations_from_GEO_up', 'Disease_Signatures_from_GEO_down_2014', 'Disease_Signatures_from_GEO_up_2014', 'DrugMatrix', 'Drug_Perturbations_from_GEO_2014', 'Drug_Perturbations_from_GEO_down', 'Drug_Perturbations_from_GEO_up', 'ENCODE_Histone_Modifications_2013', 'ENCODE_Histone_Modifications_2015', 'ENCODE_TF_ChIP-seq_2014', 'ENCODE_TF_ChIP-seq_2015', 'ENCODE_and_ChEA_Consensus_TFs_from_ChIP-X', 'ESCAPE', 'Elsevier_Pathway_Collection', 'Enrichr_Libraries_Most_Popular_Genes', 'Enrichr_Submissions_TF-Gene_Coocurrence', 'Enrichr_Users_Contributed_Lists_2020', 'Epigenomics_Roadmap_HM_ChIP-seq', 'FANTOM6_lncRNA_KD_DEGs', 'GO_Biological_Process_2013', 'GO_Biological_Process_2015', 'GO_Biological_Process_2017', 'GO_Biological_Process_2017b', 'GO_Biological_Process_2018', 'GO_Biological_Process_2021', 'GO_Biological_Process_2023', 'GO_Cellular_Component_2013', 'GO_Cellular_Component_2015', 'GO_Cellular_Component_2017', 'GO_Cellular_Component_2017b', 'GO_Cellular_Component_2018', 'GO_Cellular_Component_2021', 'GO_Cellular_Component_2023', 'GO_Molecular_Function_2013', 'GO_Molecular_Function_2015', 'GO_Molecular_Function_2017', 'GO_Molecular_Function_2017b', 'GO_Molecular_Function_2018', 'GO_Molecular_Function_2021', 'GO_Molecular_Function_2023', 'GTEx_Aging_Signatures_2021', 'GTEx_Tissue_Expression_Down', 'GTEx_Tissue_Expression_Up', 'GTEx_Tissues_V8_2023', 'GWAS_Catalog_2019', 'GWAS_Catalog_2023', 'GeDiPNet_2023', 'GeneSigDB', 'Gene_Perturbations_from_GEO_down', 'Gene_Perturbations_from_GEO_up', 'Genes_Associated_with_NIH_Grants', 'Genome_Browser_PWMs', 'GlyGen_Glycosylated_Proteins_2022', 'HDSigDB_Human_2021', 'HDSigDB_Mouse_2021', 'HMDB_Metabolites', 'HMS_LINCS_KinomeScan', 'HomoloGene', 
'HuBMAP_ASCT_plus_B_augmented_w_RNAseq_Coexpression', 'HuBMAP_ASCTplusB_augmented_2022', 'HumanCyc_2015', 'HumanCyc_2016', 'Human_Gene_Atlas', 'Human_Phenotype_Ontology', 'IDG_Drug_Targets_2022', 'InterPro_Domains_2019', 'Jensen_COMPARTMENTS', 'Jensen_DISEASES', 'Jensen_TISSUES', 'KEA_2013', 'KEA_2015', 'KEGG_2013', 'KEGG_2015', 'KEGG_2016', 'KEGG_2019_Human', 'KEGG_2019_Mouse', 'KEGG_2021_Human', 'KOMP2_Mouse_Phenotypes_2022', 'Kinase_Perturbations_from_GEO_down', 'Kinase_Perturbations_from_GEO_up', 'L1000_Kinase_and_GPCR_Perturbations_down', 'L1000_Kinase_and_GPCR_Perturbations_up', 'LINCS_L1000_CRISPR_KO_Consensus_Sigs', 'LINCS_L1000_Chem_Pert_Consensus_Sigs', 'LINCS_L1000_Chem_Pert_down', 'LINCS_L1000_Chem_Pert_up', 'LINCS_L1000_Ligand_Perturbations_down', 'LINCS_L1000_Ligand_Perturbations_up', 'Ligand_Perturbations_from_GEO_down', 'Ligand_Perturbations_from_GEO_up', 'MAGMA_Drugs_and_Diseases', 'MAGNET_2023', 'MCF7_Perturbations_from_GEO_down', 'MCF7_Perturbations_from_GEO_up', 'MGI_Mammalian_Phenotype_2013', 'MGI_Mammalian_Phenotype_2017', 'MGI_Mammalian_Phenotype_Level_3', 'MGI_Mammalian_Phenotype_Level_4', 'MGI_Mammalian_Phenotype_Level_4_2019', 'MGI_Mammalian_Phenotype_Level_4_2021', 'MSigDB_Computational', 'MSigDB_Hallmark_2020', 'MSigDB_Oncogenic_Signatures', 'Metabolomics_Workbench_Metabolites_2022', 'Microbe_Perturbations_from_GEO_down', 'Microbe_Perturbations_from_GEO_up', 'MoTrPAC_2023', 'Mouse_Gene_Atlas', 'NCI-60_Cancer_Cell_Lines', 'NCI-Nature_2015', 'NCI-Nature_2016', 'NIH_Funded_PIs_2017_AutoRIF_ARCHS4_Predictions', 'NIH_Funded_PIs_2017_GeneRIF_ARCHS4_Predictions', 'NIH_Funded_PIs_2017_Human_AutoRIF', 'NIH_Funded_PIs_2017_Human_GeneRIF', 'NURSA_Human_Endogenous_Complexome', 'OMIM_Disease', 'OMIM_Expanded', 'Old_CMAP_down', 'Old_CMAP_up', 'Orphanet_Augmented_2021', 'PFOCR_Pathways', 'PFOCR_Pathways_2023', 'PPI_Hub_Proteins', 'PanglaoDB_Augmented_2021', 'Panther_2015', 'Panther_2016', 'Pfam_Domains_2019', 'Pfam_InterPro_Domains', 'PheWeb_2019', 
'PhenGenI_Association_2021', 'Phosphatase_Substrates_from_DEPOD', 'ProteomicsDB_2020', 'Proteomics_Drug_Atlas_2023', 'RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO', 'RNAseq_Automatic_GEO_Signatures_Human_Down', 'RNAseq_Automatic_GEO_Signatures_Human_Up', 'RNAseq_Automatic_GEO_Signatures_Mouse_Down', 'RNAseq_Automatic_GEO_Signatures_Mouse_Up', 'Rare_Diseases_AutoRIF_ARCHS4_Predictions', 'Rare_Diseases_AutoRIF_Gene_Lists', 'Rare_Diseases_GeneRIF_ARCHS4_Predictions', 'Rare_Diseases_GeneRIF_Gene_Lists', 'Reactome_2013', 'Reactome_2015', 'Reactome_2016', 'Reactome_2022', 'Rummagene_kinases', 'Rummagene_signatures', 'Rummagene_transcription_factors', 'SILAC_Phosphoproteomics', 'SubCell_BarCode', 'SynGO_2022', 'SysMyo_Muscle_Gene_Sets', 'TF-LOF_Expression_from_GEO', 'TF_Perturbations_Followed_by_Expression', 'TG_GATES_2020', 'TRANSFAC_and_JASPAR_PWMs', 'TRRUST_Transcription_Factors_2019', 'Table_Mining_of_CRISPR_Studies', 'Tabula_Muris', 'Tabula_Sapiens', 'TargetScan_microRNA', 'TargetScan_microRNA_2017', 'The_Kinase_Library_2023', 'Tissue_Protein_Expression_from_Human_Proteome_Map', 'Tissue_Protein_Expression_from_ProteomicsDB', 'Transcription_Factor_PPIs', 'UK_Biobank_GWAS_v1', 'Virus-Host_PPI_P-HIPSTer_2020', 'VirusMINT', 'Virus_Perturbations_from_GEO_down', 'Virus_Perturbations_from_GEO_up', 'WikiPathway_2021_Human', 'WikiPathway_2023_Human', 'WikiPathways_2013', 'WikiPathways_2015', 'WikiPathways_2016', 'WikiPathways_2019_Human', 'WikiPathways_2019_Mouse', 'dbGaP', 'huMAP', 'lncHUB_lncRNA_Co-Expression', 'miRTarBase_2017']\n\n\nGet the significant DEGs for the Covid patients.\n\n#?gseapy.enrichr\nglist = sc.get.rank_genes_groups_df(cl1_sub, group='Covid', key='wilcoxon', log2fc_min=0.25, pval_cutoff=0.05)['names'].squeeze().str.strip().tolist()\nprint(len(glist))\n\n7\n\n\n\nenr_res = gseapy.enrichr(gene_list=glist, organism='Human', gene_sets='GO_Biological_Process_2018', cutoff = 
0.5)\nenr_res.results.head()\n\n\n\n\n\n\n\n\nGene_set\nTerm\nOverlap\nP-value\nAdjusted P-value\nOld P-value\nOld Adjusted P-value\nOdds Ratio\nCombined Score\nGenes\n\n\n\n\n0\nGO_Biological_Process_2018\npositive regulation of inflammatory response (...\n2/73\n0.000273\n0.021549\n0\n0\n112.236620\n921.142995\nNFKBIA;S100A9\n\n\n1\nGO_Biological_Process_2018\npositive regulation of defense response (GO:00...\n2/74\n0.000280\n0.021549\n0\n0\n110.672222\n905.289889\nNFKBIA;S100A9\n\n\n2\nGO_Biological_Process_2018\npositive regulation of response to external st...\n2/90\n0.000414\n0.021549\n0\n0\n90.477273\n704.697251\nNFKBIA;S100A9\n\n\n3\nGO_Biological_Process_2018\npositive regulation of NF-kappaB transcription...\n2/128\n0.000836\n0.032592\n0\n0\n63.069841\n446.990813\nNFKBIA;S100A9\n\n\n4\nGO_Biological_Process_2018\ncellular protein complex assembly (GO:0043623)\n2/144\n0.001056\n0.032941\n0\n0\n55.918310\n383.234256\nHSP90AB1;POMP\n\n\n\n\n\n\n\nSome databases of interest:\nGO_Biological_Process_2017b, KEGG_2019_Human, KEGG_2019_Mouse, WikiPathways_2019_Human, WikiPathways_2019_Mouse\nYou can visualize your results using a simple barplot, for example:\n\ngseapy.barplot(enr_res.res2d,title='GO_Biological_Process_2018')\n\n<Axes: title={'center': 'GO_Biological_Process_2018'}, xlabel='$- \\\\log_{10}$ (Adjusted P-value)'>" }, { "objectID": "labs/scanpy/scanpy_05_dge.html#meta-dge_gsea", "href": "labs/scanpy/scanpy_05_dge.html#meta-dge_gsea", "title": " Differential gene expression", "section": "10 Gene Set Enrichment Analysis (GSEA)", - "text": "10 Gene Set Enrichment Analysis (GSEA)\nBesides the enrichment using a hypergeometric test, we can also perform gene set enrichment analysis (GSEA), which scores a ranked gene list (usually based on fold changes) and runs a permutation test to check whether a particular gene set is enriched among the upregulated genes, among the downregulated genes, or is not differentially regulated.\nWe need a table with all DEGs and their log 
foldchanges. However, many lowly expressed genes will have high foldchanges and just contribute noise, so we also filter for expression in enough cells.\n\ngene_rank = sc.get.rank_genes_groups_df(cl1_sub, group='Covid', key='wilcoxon')[['names','logfoldchanges']]\ngene_rank.sort_values(by=['logfoldchanges'], inplace=True, ascending=False)\n\n# calculate_qc_metrics will calculate number of cells per gene\nsc.pp.calculate_qc_metrics(cl1, percent_top=None, log1p=False, inplace=True)\n\n# filter for genes expressed in more than 30 cells.\ngene_rank = gene_rank[gene_rank['names'].isin(cl1.var_names[cl1.var.n_cells_by_counts>30])]\n\ngene_rank\n\n\n\n\n\n\n\n\nnames\nlogfoldchanges\n\n\n\n\n169\nTTTY15\n27.813257\n\n\n234\nCXCL8\n27.684155\n\n\n385\nG0S2\n27.324526\n\n\n61\nIFIT3\n4.778945\n\n\n228\nSLFN5\n4.398190\n\n\n...\n...\n...\n\n\n17616\nPSMD5\n-2.900448\n\n\n17498\nFARSA\n-2.907254\n\n\n17784\nDHDDS\n-3.096821\n\n\n18109\nCD200\n-3.213758\n\n\n18101\nFAM111B\n-3.797801\n\n\n\n\n6567 rows × 2 columns\n\n\n\nOnce our list of genes is sorted, we can proceed with the enrichment itself. 
We can use the package to get gene set from the Molecular Signature Database (MSigDB) and select KEGG pathways as an example.\n\n#Available databases : ‘Human’, ‘Mouse’, ‘Yeast’, ‘Fly’, ‘Fish’, ‘Worm’ \ngene_set_names = gseapy.get_library_name(organism='Human')\nprint(gene_set_names)\n\n['ARCHS4_Cell-lines', 'ARCHS4_IDG_Coexp', 'ARCHS4_Kinases_Coexp', 'ARCHS4_TFs_Coexp', 'ARCHS4_Tissues', 'Achilles_fitness_decrease', 'Achilles_fitness_increase', 'Aging_Perturbations_from_GEO_down', 'Aging_Perturbations_from_GEO_up', 'Allen_Brain_Atlas_10x_scRNA_2021', 'Allen_Brain_Atlas_down', 'Allen_Brain_Atlas_up', 'Azimuth_2023', 'Azimuth_Cell_Types_2021', 'BioCarta_2013', 'BioCarta_2015', 'BioCarta_2016', 'BioPlanet_2019', 'BioPlex_2017', 'CCLE_Proteomics_2020', 'CORUM', 'COVID-19_Related_Gene_Sets', 'COVID-19_Related_Gene_Sets_2021', 'Cancer_Cell_Line_Encyclopedia', 'CellMarker_Augmented_2021', 'ChEA_2013', 'ChEA_2015', 'ChEA_2016', 'ChEA_2022', 'Chromosome_Location', 'Chromosome_Location_hg19', 'ClinVar_2019', 'DSigDB', 'Data_Acquisition_Method_Most_Popular_Genes', 'DepMap_WG_CRISPR_Screens_Broad_CellLines_2019', 'DepMap_WG_CRISPR_Screens_Sanger_CellLines_2019', 'Descartes_Cell_Types_and_Tissue_2021', 'Diabetes_Perturbations_GEO_2022', 'DisGeNET', 'Disease_Perturbations_from_GEO_down', 'Disease_Perturbations_from_GEO_up', 'Disease_Signatures_from_GEO_down_2014', 'Disease_Signatures_from_GEO_up_2014', 'DrugMatrix', 'Drug_Perturbations_from_GEO_2014', 'Drug_Perturbations_from_GEO_down', 'Drug_Perturbations_from_GEO_up', 'ENCODE_Histone_Modifications_2013', 'ENCODE_Histone_Modifications_2015', 'ENCODE_TF_ChIP-seq_2014', 'ENCODE_TF_ChIP-seq_2015', 'ENCODE_and_ChEA_Consensus_TFs_from_ChIP-X', 'ESCAPE', 'Elsevier_Pathway_Collection', 'Enrichr_Libraries_Most_Popular_Genes', 'Enrichr_Submissions_TF-Gene_Coocurrence', 'Enrichr_Users_Contributed_Lists_2020', 'Epigenomics_Roadmap_HM_ChIP-seq', 'FANTOM6_lncRNA_KD_DEGs', 'GO_Biological_Process_2013', 'GO_Biological_Process_2015', 
'GO_Biological_Process_2017', 'GO_Biological_Process_2017b', 'GO_Biological_Process_2018', 'GO_Biological_Process_2021', 'GO_Biological_Process_2023', 'GO_Cellular_Component_2013', 'GO_Cellular_Component_2015', 'GO_Cellular_Component_2017', 'GO_Cellular_Component_2017b', 'GO_Cellular_Component_2018', 'GO_Cellular_Component_2021', 'GO_Cellular_Component_2023', 'GO_Molecular_Function_2013', 'GO_Molecular_Function_2015', 'GO_Molecular_Function_2017', 'GO_Molecular_Function_2017b', 'GO_Molecular_Function_2018', 'GO_Molecular_Function_2021', 'GO_Molecular_Function_2023', 'GTEx_Aging_Signatures_2021', 'GTEx_Tissue_Expression_Down', 'GTEx_Tissue_Expression_Up', 'GTEx_Tissues_V8_2023', 'GWAS_Catalog_2019', 'GWAS_Catalog_2023', 'GeDiPNet_2023', 'GeneSigDB', 'Gene_Perturbations_from_GEO_down', 'Gene_Perturbations_from_GEO_up', 'Genes_Associated_with_NIH_Grants', 'Genome_Browser_PWMs', 'GlyGen_Glycosylated_Proteins_2022', 'HDSigDB_Human_2021', 'HDSigDB_Mouse_2021', 'HMDB_Metabolites', 'HMS_LINCS_KinomeScan', 'HomoloGene', 'HuBMAP_ASCT_plus_B_augmented_w_RNAseq_Coexpression', 'HuBMAP_ASCTplusB_augmented_2022', 'HumanCyc_2015', 'HumanCyc_2016', 'Human_Gene_Atlas', 'Human_Phenotype_Ontology', 'IDG_Drug_Targets_2022', 'InterPro_Domains_2019', 'Jensen_COMPARTMENTS', 'Jensen_DISEASES', 'Jensen_TISSUES', 'KEA_2013', 'KEA_2015', 'KEGG_2013', 'KEGG_2015', 'KEGG_2016', 'KEGG_2019_Human', 'KEGG_2019_Mouse', 'KEGG_2021_Human', 'KOMP2_Mouse_Phenotypes_2022', 'Kinase_Perturbations_from_GEO_down', 'Kinase_Perturbations_from_GEO_up', 'L1000_Kinase_and_GPCR_Perturbations_down', 'L1000_Kinase_and_GPCR_Perturbations_up', 'LINCS_L1000_CRISPR_KO_Consensus_Sigs', 'LINCS_L1000_Chem_Pert_Consensus_Sigs', 'LINCS_L1000_Chem_Pert_down', 'LINCS_L1000_Chem_Pert_up', 'LINCS_L1000_Ligand_Perturbations_down', 'LINCS_L1000_Ligand_Perturbations_up', 'Ligand_Perturbations_from_GEO_down', 'Ligand_Perturbations_from_GEO_up', 'MAGMA_Drugs_and_Diseases', 'MAGNET_2023', 'MCF7_Perturbations_from_GEO_down', 
'MCF7_Perturbations_from_GEO_up', 'MGI_Mammalian_Phenotype_2013', 'MGI_Mammalian_Phenotype_2017', 'MGI_Mammalian_Phenotype_Level_3', 'MGI_Mammalian_Phenotype_Level_4', 'MGI_Mammalian_Phenotype_Level_4_2019', 'MGI_Mammalian_Phenotype_Level_4_2021', 'MSigDB_Computational', 'MSigDB_Hallmark_2020', 'MSigDB_Oncogenic_Signatures', 'Metabolomics_Workbench_Metabolites_2022', 'Microbe_Perturbations_from_GEO_down', 'Microbe_Perturbations_from_GEO_up', 'MoTrPAC_2023', 'Mouse_Gene_Atlas', 'NCI-60_Cancer_Cell_Lines', 'NCI-Nature_2015', 'NCI-Nature_2016', 'NIH_Funded_PIs_2017_AutoRIF_ARCHS4_Predictions', 'NIH_Funded_PIs_2017_GeneRIF_ARCHS4_Predictions', 'NIH_Funded_PIs_2017_Human_AutoRIF', 'NIH_Funded_PIs_2017_Human_GeneRIF', 'NURSA_Human_Endogenous_Complexome', 'OMIM_Disease', 'OMIM_Expanded', 'Old_CMAP_down', 'Old_CMAP_up', 'Orphanet_Augmented_2021', 'PFOCR_Pathways', 'PFOCR_Pathways_2023', 'PPI_Hub_Proteins', 'PanglaoDB_Augmented_2021', 'Panther_2015', 'Panther_2016', 'Pfam_Domains_2019', 'Pfam_InterPro_Domains', 'PheWeb_2019', 'PhenGenI_Association_2021', 'Phosphatase_Substrates_from_DEPOD', 'ProteomicsDB_2020', 'Proteomics_Drug_Atlas_2023', 'RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO', 'RNAseq_Automatic_GEO_Signatures_Human_Down', 'RNAseq_Automatic_GEO_Signatures_Human_Up', 'RNAseq_Automatic_GEO_Signatures_Mouse_Down', 'RNAseq_Automatic_GEO_Signatures_Mouse_Up', 'Rare_Diseases_AutoRIF_ARCHS4_Predictions', 'Rare_Diseases_AutoRIF_Gene_Lists', 'Rare_Diseases_GeneRIF_ARCHS4_Predictions', 'Rare_Diseases_GeneRIF_Gene_Lists', 'Reactome_2013', 'Reactome_2015', 'Reactome_2016', 'Reactome_2022', 'Rummagene_kinases', 'Rummagene_signatures', 'Rummagene_transcription_factors', 'SILAC_Phosphoproteomics', 'SubCell_BarCode', 'SynGO_2022', 'SysMyo_Muscle_Gene_Sets', 'TF-LOF_Expression_from_GEO', 'TF_Perturbations_Followed_by_Expression', 'TG_GATES_2020', 'TRANSFAC_and_JASPAR_PWMs', 'TRRUST_Transcription_Factors_2019', 'Table_Mining_of_CRISPR_Studies', 'Tabula_Muris', 
'Tabula_Sapiens', 'TargetScan_microRNA', 'TargetScan_microRNA_2017', 'The_Kinase_Library_2023', 'Tissue_Protein_Expression_from_Human_Proteome_Map', 'Tissue_Protein_Expression_from_ProteomicsDB', 'Transcription_Factor_PPIs', 'UK_Biobank_GWAS_v1', 'Virus-Host_PPI_P-HIPSTer_2020', 'VirusMINT', 'Virus_Perturbations_from_GEO_down', 'Virus_Perturbations_from_GEO_up', 'WikiPathway_2021_Human', 'WikiPathway_2023_Human', 'WikiPathways_2013', 'WikiPathways_2015', 'WikiPathways_2016', 'WikiPathways_2019_Human', 'WikiPathways_2019_Mouse', 'dbGaP', 'huMAP', 'lncHUB_lncRNA_Co-Expression', 'miRTarBase_2017']\n\n\nNext, we will be using the GSEA. This will result in a table containing information for several pathways. We can then sort and filter those pathways to visualize only the top ones. You can select/filter them by either p-value or normalized enrichment score (NES).\n\nres = gseapy.prerank(rnk=gene_rank, gene_sets='KEGG_2021_Human')\n\nterms = res.res2d.Term\nterms[:10]\n\n0 Coronavirus disease\n1 Cytokine-cytokine receptor interaction\n2 Viral protein interaction with cytokine and cy...\n3 RIG-I-like receptor signaling pathway\n4 NF-kappa B signaling pathway\n5 IL-17 signaling pathway\n6 Legionellosis\n7 Pertussis\n8 Toll-like receptor signaling pathway\n9 Rheumatoid arthritis\nName: Term, dtype: object\n\n\n\ngseapy.gseaplot(rank_metric=res.ranking, term=terms[0], **res.results[terms[0]])\n\n[<Axes: xlabel='Gene Rank', ylabel='Ranked metric'>,\n <Axes: >,\n <Axes: >,\n <Axes: ylabel='Enrichment Score'>]\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nWhich KEGG pathways are upregulated in this cluster?Which KEGG pathways are dowregulated in this cluster?\nChange the pathway source to another gene set (e.g. 
“CP:WIKIPATHWAYS” or “CP:REACTOME” or “CP:BIOCARTA” or “GO:BP”) and check whether you get similar results.\n\n\nFinally, let's save the integrated data for further analysis.\n\nadata.write_h5ad('./data/covid/results/scanpy_covid_qc_dr_scanorama_cl_dge.h5ad')" + "text": "10 Gene Set Enrichment Analysis (GSEA)\nBesides the enrichment using the hypergeometric test, we can also perform gene set enrichment analysis (GSEA), which scores a ranked gene list (usually based on fold changes) and uses a permutation test to check whether a particular gene set is enriched among the up-regulated genes, among the down-regulated genes, or not differentially regulated.\nWe need a table with all DEGs and their log fold changes. However, many lowly expressed genes will have high fold changes and just contribute noise, so we also filter for expression in enough cells.\n\ngene_rank = sc.get.rank_genes_groups_df(cl1_sub, group='Covid', key='wilcoxon')[['names','logfoldchanges']]\ngene_rank.sort_values(by=['logfoldchanges'], inplace=True, ascending=False)\n\n# calculate_qc_metrics will calculate number of cells per gene\nsc.pp.calculate_qc_metrics(cl1, percent_top=None, log1p=False, inplace=True)\n\n# filter for genes expressed in more than 30 cells.\ngene_rank = gene_rank[gene_rank['names'].isin(cl1.var_names[cl1.var.n_cells_by_counts>30])]\n\ngene_rank\n\n\n\n\n\n\n\n\nnames\nlogfoldchanges\n\n\n\n\n526\nSLFN5\n26.829697\n\n\n368\nCXCL8\n26.812254\n\n\n209\nEGR1\n5.063837\n\n\n38\nPPBP\n4.969337\n\n\n211\nPF4\n4.870691\n\n\n...\n...\n...\n\n\n18062\nNXPH4\n-2.804427\n\n\n18449\nMME\n-3.049736\n\n\n18380\nDHDDS\n-3.202402\n\n\n18282\nKDM1B\n-3.256811\n\n\n18607\nZNF296\n-4.392631\n\n\n\n\n7105 rows × 2 columns\n\n\n\nOnce our list of genes is sorted, we can proceed with the enrichment itself. 
We can use the package to get gene set from the Molecular Signature Database (MSigDB) and select KEGG pathways as an example.\n\n#Available databases : ‘Human’, ‘Mouse’, ‘Yeast’, ‘Fly’, ‘Fish’, ‘Worm’ \ngene_set_names = gseapy.get_library_name(organism='Human')\nprint(gene_set_names)\n\n['ARCHS4_Cell-lines', 'ARCHS4_IDG_Coexp', 'ARCHS4_Kinases_Coexp', 'ARCHS4_TFs_Coexp', 'ARCHS4_Tissues', 'Achilles_fitness_decrease', 'Achilles_fitness_increase', 'Aging_Perturbations_from_GEO_down', 'Aging_Perturbations_from_GEO_up', 'Allen_Brain_Atlas_10x_scRNA_2021', 'Allen_Brain_Atlas_down', 'Allen_Brain_Atlas_up', 'Azimuth_2023', 'Azimuth_Cell_Types_2021', 'BioCarta_2013', 'BioCarta_2015', 'BioCarta_2016', 'BioPlanet_2019', 'BioPlex_2017', 'CCLE_Proteomics_2020', 'CORUM', 'COVID-19_Related_Gene_Sets', 'COVID-19_Related_Gene_Sets_2021', 'Cancer_Cell_Line_Encyclopedia', 'CellMarker_Augmented_2021', 'ChEA_2013', 'ChEA_2015', 'ChEA_2016', 'ChEA_2022', 'Chromosome_Location', 'Chromosome_Location_hg19', 'ClinVar_2019', 'DSigDB', 'Data_Acquisition_Method_Most_Popular_Genes', 'DepMap_WG_CRISPR_Screens_Broad_CellLines_2019', 'DepMap_WG_CRISPR_Screens_Sanger_CellLines_2019', 'Descartes_Cell_Types_and_Tissue_2021', 'Diabetes_Perturbations_GEO_2022', 'DisGeNET', 'Disease_Perturbations_from_GEO_down', 'Disease_Perturbations_from_GEO_up', 'Disease_Signatures_from_GEO_down_2014', 'Disease_Signatures_from_GEO_up_2014', 'DrugMatrix', 'Drug_Perturbations_from_GEO_2014', 'Drug_Perturbations_from_GEO_down', 'Drug_Perturbations_from_GEO_up', 'ENCODE_Histone_Modifications_2013', 'ENCODE_Histone_Modifications_2015', 'ENCODE_TF_ChIP-seq_2014', 'ENCODE_TF_ChIP-seq_2015', 'ENCODE_and_ChEA_Consensus_TFs_from_ChIP-X', 'ESCAPE', 'Elsevier_Pathway_Collection', 'Enrichr_Libraries_Most_Popular_Genes', 'Enrichr_Submissions_TF-Gene_Coocurrence', 'Enrichr_Users_Contributed_Lists_2020', 'Epigenomics_Roadmap_HM_ChIP-seq', 'FANTOM6_lncRNA_KD_DEGs', 'GO_Biological_Process_2013', 'GO_Biological_Process_2015', 
'GO_Biological_Process_2017', 'GO_Biological_Process_2017b', 'GO_Biological_Process_2018', 'GO_Biological_Process_2021', 'GO_Biological_Process_2023', 'GO_Cellular_Component_2013', 'GO_Cellular_Component_2015', 'GO_Cellular_Component_2017', 'GO_Cellular_Component_2017b', 'GO_Cellular_Component_2018', 'GO_Cellular_Component_2021', 'GO_Cellular_Component_2023', 'GO_Molecular_Function_2013', 'GO_Molecular_Function_2015', 'GO_Molecular_Function_2017', 'GO_Molecular_Function_2017b', 'GO_Molecular_Function_2018', 'GO_Molecular_Function_2021', 'GO_Molecular_Function_2023', 'GTEx_Aging_Signatures_2021', 'GTEx_Tissue_Expression_Down', 'GTEx_Tissue_Expression_Up', 'GTEx_Tissues_V8_2023', 'GWAS_Catalog_2019', 'GWAS_Catalog_2023', 'GeDiPNet_2023', 'GeneSigDB', 'Gene_Perturbations_from_GEO_down', 'Gene_Perturbations_from_GEO_up', 'Genes_Associated_with_NIH_Grants', 'Genome_Browser_PWMs', 'GlyGen_Glycosylated_Proteins_2022', 'HDSigDB_Human_2021', 'HDSigDB_Mouse_2021', 'HMDB_Metabolites', 'HMS_LINCS_KinomeScan', 'HomoloGene', 'HuBMAP_ASCT_plus_B_augmented_w_RNAseq_Coexpression', 'HuBMAP_ASCTplusB_augmented_2022', 'HumanCyc_2015', 'HumanCyc_2016', 'Human_Gene_Atlas', 'Human_Phenotype_Ontology', 'IDG_Drug_Targets_2022', 'InterPro_Domains_2019', 'Jensen_COMPARTMENTS', 'Jensen_DISEASES', 'Jensen_TISSUES', 'KEA_2013', 'KEA_2015', 'KEGG_2013', 'KEGG_2015', 'KEGG_2016', 'KEGG_2019_Human', 'KEGG_2019_Mouse', 'KEGG_2021_Human', 'KOMP2_Mouse_Phenotypes_2022', 'Kinase_Perturbations_from_GEO_down', 'Kinase_Perturbations_from_GEO_up', 'L1000_Kinase_and_GPCR_Perturbations_down', 'L1000_Kinase_and_GPCR_Perturbations_up', 'LINCS_L1000_CRISPR_KO_Consensus_Sigs', 'LINCS_L1000_Chem_Pert_Consensus_Sigs', 'LINCS_L1000_Chem_Pert_down', 'LINCS_L1000_Chem_Pert_up', 'LINCS_L1000_Ligand_Perturbations_down', 'LINCS_L1000_Ligand_Perturbations_up', 'Ligand_Perturbations_from_GEO_down', 'Ligand_Perturbations_from_GEO_up', 'MAGMA_Drugs_and_Diseases', 'MAGNET_2023', 'MCF7_Perturbations_from_GEO_down', 
'MCF7_Perturbations_from_GEO_up', 'MGI_Mammalian_Phenotype_2013', 'MGI_Mammalian_Phenotype_2017', 'MGI_Mammalian_Phenotype_Level_3', 'MGI_Mammalian_Phenotype_Level_4', 'MGI_Mammalian_Phenotype_Level_4_2019', 'MGI_Mammalian_Phenotype_Level_4_2021', 'MSigDB_Computational', 'MSigDB_Hallmark_2020', 'MSigDB_Oncogenic_Signatures', 'Metabolomics_Workbench_Metabolites_2022', 'Microbe_Perturbations_from_GEO_down', 'Microbe_Perturbations_from_GEO_up', 'MoTrPAC_2023', 'Mouse_Gene_Atlas', 'NCI-60_Cancer_Cell_Lines', 'NCI-Nature_2015', 'NCI-Nature_2016', 'NIH_Funded_PIs_2017_AutoRIF_ARCHS4_Predictions', 'NIH_Funded_PIs_2017_GeneRIF_ARCHS4_Predictions', 'NIH_Funded_PIs_2017_Human_AutoRIF', 'NIH_Funded_PIs_2017_Human_GeneRIF', 'NURSA_Human_Endogenous_Complexome', 'OMIM_Disease', 'OMIM_Expanded', 'Old_CMAP_down', 'Old_CMAP_up', 'Orphanet_Augmented_2021', 'PFOCR_Pathways', 'PFOCR_Pathways_2023', 'PPI_Hub_Proteins', 'PanglaoDB_Augmented_2021', 'Panther_2015', 'Panther_2016', 'Pfam_Domains_2019', 'Pfam_InterPro_Domains', 'PheWeb_2019', 'PhenGenI_Association_2021', 'Phosphatase_Substrates_from_DEPOD', 'ProteomicsDB_2020', 'Proteomics_Drug_Atlas_2023', 'RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO', 'RNAseq_Automatic_GEO_Signatures_Human_Down', 'RNAseq_Automatic_GEO_Signatures_Human_Up', 'RNAseq_Automatic_GEO_Signatures_Mouse_Down', 'RNAseq_Automatic_GEO_Signatures_Mouse_Up', 'Rare_Diseases_AutoRIF_ARCHS4_Predictions', 'Rare_Diseases_AutoRIF_Gene_Lists', 'Rare_Diseases_GeneRIF_ARCHS4_Predictions', 'Rare_Diseases_GeneRIF_Gene_Lists', 'Reactome_2013', 'Reactome_2015', 'Reactome_2016', 'Reactome_2022', 'Rummagene_kinases', 'Rummagene_signatures', 'Rummagene_transcription_factors', 'SILAC_Phosphoproteomics', 'SubCell_BarCode', 'SynGO_2022', 'SysMyo_Muscle_Gene_Sets', 'TF-LOF_Expression_from_GEO', 'TF_Perturbations_Followed_by_Expression', 'TG_GATES_2020', 'TRANSFAC_and_JASPAR_PWMs', 'TRRUST_Transcription_Factors_2019', 'Table_Mining_of_CRISPR_Studies', 'Tabula_Muris', 
'Tabula_Sapiens', 'TargetScan_microRNA', 'TargetScan_microRNA_2017', 'The_Kinase_Library_2023', 'Tissue_Protein_Expression_from_Human_Proteome_Map', 'Tissue_Protein_Expression_from_ProteomicsDB', 'Transcription_Factor_PPIs', 'UK_Biobank_GWAS_v1', 'Virus-Host_PPI_P-HIPSTer_2020', 'VirusMINT', 'Virus_Perturbations_from_GEO_down', 'Virus_Perturbations_from_GEO_up', 'WikiPathway_2021_Human', 'WikiPathway_2023_Human', 'WikiPathways_2013', 'WikiPathways_2015', 'WikiPathways_2016', 'WikiPathways_2019_Human', 'WikiPathways_2019_Mouse', 'dbGaP', 'huMAP', 'lncHUB_lncRNA_Co-Expression', 'miRTarBase_2017']\n\n\nNext, we will be using the GSEA. This will result in a table containing information for several pathways. We can then sort and filter those pathways to visualize only the top ones. You can select/filter them by either p-value or normalized enrichment score (NES).\n\nres = gseapy.prerank(rnk=gene_rank, gene_sets='KEGG_2021_Human')\n\nterms = res.res2d.Term\nterms[:10]\n\n0 Cytokine-cytokine receptor interaction\n1 AGE-RAGE signaling pathway in diabetic complic...\n2 Viral protein interaction with cytokine and cy...\n3 Rheumatoid arthritis\n4 IL-17 signaling pathway\n5 Bladder cancer\n6 Chemokine signaling pathway\n7 NF-kappa B signaling pathway\n8 Legionellosis\n9 Chagas disease\nName: Term, dtype: object\n\n\n\ngseapy.gseaplot(rank_metric=res.ranking, term=terms[0], **res.results[terms[0]])\n\n[<Axes: xlabel='Gene Rank', ylabel='Ranked metric'>,\n <Axes: >,\n <Axes: >,\n <Axes: ylabel='Enrichment Score'>]\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nWhich KEGG pathways are upregulated in this cluster?Which KEGG pathways are dowregulated in this cluster?\nChange the pathway source to another gene set (e.g. 
“CP:WIKIPATHWAYS” or “CP:REACTOME” or “CP:BIOCARTA” or “GO:BP”) and check whether you get similar results.\n\n\nFinally, let's save the integrated data for further analysis.\n\nadata.write_h5ad('./data/covid/results/scanpy_covid_qc_dr_scanorama_cl_dge.h5ad')" }, { "objectID": "labs/scanpy/scanpy_05_dge.html#meta-session", "href": "labs/scanpy/scanpy_05_dge.html#meta-session", "title": " Differential gene expression", "section": "11 Session info", - "text": "11 Session info\n\n\nClick here\n\n\nsc.logging.print_versions()\n\n-----\nanndata 0.10.3\nscanpy 1.9.6\n-----\nPIL 10.0.0\nanyio NA\nasttokens NA\nattr 23.1.0\nbabel 2.12.1\nbackcall 0.2.0\ncertifi 2023.11.17\ncffi 1.15.1\ncharset_normalizer 3.1.0\ncolorama 0.4.6\ncomm 0.1.3\ncycler 0.12.1\ncython_runtime NA\ndateutil 2.8.2\ndebugpy 1.6.7\ndecorator 5.1.1\ndefusedxml 0.7.1\nexceptiongroup 1.2.0\nexecuting 1.2.0\nfastjsonschema NA\nfuture 0.18.3\ngmpy2 2.1.2\ngseapy 1.0.6\nh5py 3.9.0\nidna 3.4\nigraph 0.10.8\nipykernel 6.23.1\nipython_genutils 0.2.0\njedi 0.18.2\njinja2 3.1.2\njoblib 1.3.2\njson5 NA\njsonpointer 2.0\njsonschema 4.17.3\njupyter_events 0.6.3\njupyter_server 2.6.0\njupyterlab_server 2.22.1\nkiwisolver 1.4.5\nleidenalg 0.10.1\nllvmlite 0.41.1\nlouvain 0.8.1\nmarkupsafe 2.1.2\nmatplotlib 3.8.0\nmatplotlib_inline 0.1.6\nmatplotlib_venn 0.11.9\nmpl_toolkits NA\nmpmath 1.3.0\nnatsort 8.4.0\nnbformat 5.8.0\nnumba 0.58.1\nnumpy 1.26.2\nopt_einsum v3.3.0\noverrides NA\npackaging 23.1\npandas 2.1.4\nparso 0.8.3\npatsy 0.5.5\npexpect 4.8.0\npickleshare 0.7.5\npkg_resources NA\nplatformdirs 3.5.1\nprometheus_client NA\nprompt_toolkit 3.0.38\npsutil 5.9.5\nptyprocess 0.7.0\npure_eval 0.2.2\npvectorc NA\npybiomart 0.2.0\npycparser 2.21\npydev_ipython NA\npydevconsole NA\npydevd 2.9.5\npydevd_file_utils NA\npydevd_plugins NA\npydevd_tracing NA\npygments 2.15.1\npyparsing 3.1.1\npyrsistent NA\npythonjsonlogger NA\npytz 2023.3\nrequests 2.31.0\nrequests_cache 0.4.13\nrfc3339_validator 0.1.4\nrfc3986_validator 
0.1.1\nscipy 1.11.4\nseaborn 0.12.2\nsend2trash NA\nsession_info 1.0.0\nsix 1.16.0\nsklearn 1.3.2\nsniffio 1.3.0\nsocks 1.7.1\nsparse 0.14.0\nstack_data 0.6.2\nstatsmodels 0.14.1\nsympy 1.12\ntexttable 1.7.0\nthreadpoolctl 3.2.0\ntorch 2.0.0\ntornado 6.3.2\ntqdm 4.65.0\ntraitlets 5.9.0\ntyping_extensions NA\nurllib3 2.0.2\nwcwidth 0.2.6\nwebsocket 1.5.2\nyaml 6.0\nzmq 25.0.2\nzoneinfo NA\nzstandard 0.19.0\n-----\nIPython 8.13.2\njupyter_client 8.2.0\njupyter_core 5.3.0\njupyterlab 4.0.1\nnotebook 6.5.4\n-----\nPython 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]\nLinux-6.5.11-linuxkit-x86_64-with-glibc2.35\n-----\nSession information updated at 2024-01-16 23:23" + "text": "11 Session info\n\n\nClick here\n\n\nsc.logging.print_versions()\n\n-----\nanndata 0.10.3\nscanpy 1.9.6\n-----\nPIL 10.0.0\nanyio NA\nasttokens NA\nattr 23.1.0\nbabel 2.12.1\nbackcall 0.2.0\ncertifi 2023.11.17\ncffi 1.15.1\ncharset_normalizer 3.1.0\ncolorama 0.4.6\ncomm 0.1.3\ncycler 0.12.1\ncython_runtime NA\ndateutil 2.8.2\ndebugpy 1.6.7\ndecorator 5.1.1\ndefusedxml 0.7.1\nexceptiongroup 1.2.0\nexecuting 1.2.0\nfastjsonschema NA\nfuture 0.18.3\ngmpy2 2.1.2\ngseapy 1.0.6\nh5py 3.9.0\nidna 3.4\nigraph 0.10.8\nipykernel 6.23.1\nipython_genutils 0.2.0\njedi 0.18.2\njinja2 3.1.2\njoblib 1.3.2\njson5 NA\njsonpointer 2.0\njsonschema 4.17.3\njupyter_events 0.6.3\njupyter_server 2.6.0\njupyterlab_server 2.22.1\nkiwisolver 1.4.5\nleidenalg 0.10.1\nllvmlite 0.41.1\nlouvain 0.8.1\nmarkupsafe 2.1.2\nmatplotlib 3.8.0\nmatplotlib_inline 0.1.6\nmatplotlib_venn 0.11.9\nmpl_toolkits NA\nmpmath 1.3.0\nnatsort 8.4.0\nnbformat 5.8.0\nnumba 0.58.1\nnumpy 1.26.2\nopt_einsum v3.3.0\noverrides NA\npackaging 23.1\npandas 2.1.4\nparso 0.8.3\npatsy 0.5.5\npexpect 4.8.0\npickleshare 0.7.5\npkg_resources NA\nplatformdirs 3.5.1\nprometheus_client NA\nprompt_toolkit 3.0.38\npsutil 5.9.5\nptyprocess 0.7.0\npure_eval 0.2.2\npvectorc NA\npybiomart 0.2.0\npycparser 2.21\npydev_ipython 
NA\npydevconsole NA\npydevd 2.9.5\npydevd_file_utils NA\npydevd_plugins NA\npydevd_tracing NA\npygments 2.15.1\npyparsing 3.1.1\npyrsistent NA\npythonjsonlogger NA\npytz 2023.3\nrequests 2.31.0\nrequests_cache 0.4.13\nrfc3339_validator 0.1.4\nrfc3986_validator 0.1.1\nscipy 1.11.4\nseaborn 0.12.2\nsend2trash NA\nsession_info 1.0.0\nsix 1.16.0\nsklearn 1.3.2\nsniffio 1.3.0\nsocks 1.7.1\nsparse 0.14.0\nstack_data 0.6.2\nstatsmodels 0.14.1\nsympy 1.12\ntexttable 1.7.0\nthreadpoolctl 3.2.0\ntorch 2.0.0\ntornado 6.3.2\ntqdm 4.65.0\ntraitlets 5.9.0\ntyping_extensions NA\nurllib3 2.0.2\nwcwidth 0.2.6\nwebsocket 1.5.2\nyaml 6.0\nzmq 25.0.2\nzoneinfo NA\nzstandard 0.19.0\n-----\nIPython 8.13.2\njupyter_client 8.2.0\njupyter_core 5.3.0\njupyterlab 4.0.1\nnotebook 6.5.4\n-----\nPython 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]\nLinux-6.5.11-linuxkit-x86_64-with-glibc2.35\n-----\nSession information updated at 2024-01-23 11:29" }, { "objectID": "labs/scanpy/scanpy_06_celltyping.html", "href": "labs/scanpy/scanpy_06_celltyping.html", "title": " Celltype prediction", "section": "", - "text": "Note\n\n\n\nCode chunks run Python commands unless it starts with %%bash, in which case, those chunks run shell commands.\nCelltype prediction can either be performed on indiviudal cells where each cell gets a predicted celltype label, or on the level of clusters. All methods are based on similarity to other datasets, single cell or sorted bulk RNAseq, or uses known marker genes for each celltype.\nWe will select one sample from the Covid data, ctrl_13 and predict celltype by cell on that sample.\nSome methods will predict a celltype to each cell based on what it is most similar to even if the celltype of that cell is not included in the reference. 
Other methods include an uncertainty so that cells with low similarity scores will be unclassified.\nThere are multiple different methods to predict celltypes, here we will just cover a few of those.\nHere we will use a reference PBMC dataset that we get from scanpy datasets and classify celltypes based on two methods:\nFirst, lets load required libraries\nimport numpy as np\nimport pandas as pd\nimport scanpy as sc\nimport matplotlib.pyplot as plt\nimport warnings\nimport os\nimport urllib.request\n\nwarnings.simplefilter(action=\"ignore\", category=Warning)\n\n# verbosity: errors (0), warnings (1), info (2), hints (3)\nsc.settings.verbosity = 2\nsc.settings.set_figure_params(dpi=80)\nLet’s read in the saved Covid-19 data object from the clustering step.\npath_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\npath_results = \"data/covid/results\"\nif not os.path.exists(path_results):\n os.makedirs(path_results, exist_ok=True)\n\n# path_file = \"data/covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad\"\npath_file = \"data/covid/results/scanpy_clustered_covid.h5ad\"\nif not os.path.exists(path_file):\n urllib.request.urlretrieve(os.path.join(\n path_data, 'covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad'), path_file)\n\nadata = sc.read_h5ad(path_file)\nadata\n\nAnnData object with n_obs × n_vars = 5725 × 2727\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info', 'leiden_1.0', 'leiden_0.6', 'leiden_0.4', 'leiden_1.4', 'louvain_1.0', 'louvain_0.6', 'louvain_0.4', 'louvain_1.4', 'kmeans5', 'kmeans10', 'kmeans15', 'hclust_5', 'hclust_10', 'hclust_15'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 
'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'\n uns: 'dendrogram_leiden_0.6', 'dendrogram_louvain_0.6', 'doublet_info_colors', 'hclust_10_colors', 'hclust_15_colors', 'hclust_5_colors', 'hvg', 'kmeans10_colors', 'kmeans15_colors', 'kmeans5_colors', 'leiden', 'leiden_0.4_colors', 'leiden_0.6_colors', 'leiden_1.0_colors', 'leiden_1.4_colors', 'log1p', 'louvain', 'louvain_0.4_colors', 'louvain_0.6_colors', 'louvain_1.0_colors', 'louvain_1.4_colors', 'neighbors', 'pca', 'sample_colors', 'tsne', 'umap'\n obsm: 'Scanorama', 'X_pca', 'X_tsne', 'X_umap'\n varm: 'PCs'\n obsp: 'connectivities', 'distances'\nadata.uns['log1p']['base']=None\nprint(adata.shape)\nprint(adata.raw.shape)\n\n(5725, 2727)\n(5725, 18830)\nSubset one patient.\nadata = adata[adata.obs[\"sample\"] == \"ctrl_13\",:]\nprint(adata.shape)\n\n(1117, 2727)\nsc.pl.umap(\n adata, color=[\"louvain_0.6\"], palette=sc.pl.palettes.default_20\n)" + "text": "Note\n\n\n\nCode chunks run Python commands unless it starts with %%bash, in which case, those chunks run shell commands.\nCelltype prediction can either be performed on indiviudal cells where each cell gets a predicted celltype label, or on the level of clusters. All methods are based on similarity to other datasets, single cell or sorted bulk RNAseq, or uses known marker genes for each celltype.\nWe will select one sample from the Covid data, ctrl_13 and predict celltype by cell on that sample.\nSome methods will predict a celltype to each cell based on what it is most similar to even if the celltype of that cell is not included in the reference. 
Other methods include an uncertainty so that cells with low similarity scores will be unclassified.\nThere are multiple different methods to predict celltypes, here we will just cover a few of those.\nHere we will use a reference PBMC dataset that we get from scanpy datasets and classify celltypes based on two methods:\nFirst, lets load required libraries\nimport numpy as np\nimport pandas as pd\nimport scanpy as sc\nimport matplotlib.pyplot as plt\nimport warnings\nimport os\nimport urllib.request\n\nwarnings.simplefilter(action=\"ignore\", category=Warning)\n\n# verbosity: errors (0), warnings (1), info (2), hints (3)\nsc.settings.verbosity = 2\nsc.settings.set_figure_params(dpi=80)\nLet’s read in the saved Covid-19 data object from the clustering step.\npath_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\npath_results = \"data/covid/results\"\nif not os.path.exists(path_results):\n os.makedirs(path_results, exist_ok=True)\n\n# path_file = \"data/covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad\"\npath_file = \"data/covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad\"\nif not os.path.exists(path_file):\n urllib.request.urlretrieve(os.path.join(\n path_data, 'covid/results/scanpy_covid_qc_dr_scanorama_cl.h5ad'), path_file)\n\nadata = sc.read_h5ad(path_file)\nadata\n\nAnnData object with n_obs × n_vars = 7222 × 2626\n obs: 'type', 'sample', 'batch', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'n_genes', 'percent_chrY', 'XIST-counts', 'S_score', 'G2M_score', 'phase', 'doublet_scores', 'predicted_doublets', 'doublet_info', 'leiden_1.0', 'leiden_0.6', 'leiden_0.4', 'leiden_1.4', 'louvain_1.0', 'louvain_0.6', 'louvain_0.4', 'louvain_1.4', 'kmeans5', 'kmeans10', 'kmeans15', 'hclust_5', 'hclust_10', 'hclust_15'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 
'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'\n uns: 'dendrogram_leiden_0.6', 'dendrogram_louvain_0.6', 'doublet_info_colors', 'hclust_10_colors', 'hclust_15_colors', 'hclust_5_colors', 'hvg', 'kmeans10_colors', 'kmeans15_colors', 'kmeans5_colors', 'leiden', 'leiden_0.4_colors', 'leiden_0.6_colors', 'leiden_1.0_colors', 'leiden_1.4_colors', 'log1p', 'louvain', 'louvain_0.4_colors', 'louvain_0.6_colors', 'louvain_1.0_colors', 'louvain_1.4_colors', 'neighbors', 'pca', 'phase_colors', 'sample_colors', 'tsne', 'umap'\n obsm: 'Scanorama', 'X_pca', 'X_tsne', 'X_umap'\n varm: 'PCs'\n obsp: 'connectivities', 'distances'\nadata.uns['log1p']['base']=None\nprint(adata.shape)\nprint(adata.raw.shape)\n\n(7222, 2626)\n(7222, 19468)\nSubset one patient.\nadata = adata[adata.obs[\"sample\"] == \"ctrl_13\",:]\nprint(adata.shape)\n\n(1121, 2626)\nadata.obs[\"louvain_0.6\"].value_counts()\n\nlouvain_0.6\n0 244\n2 187\n3 184\n5 139\n8 129\n1 100\n4 65\n7 33\n9 33\n6 4\n10 3\nName: count, dtype: int64\nAs you can see, we have only one cell from cluster 10 in this sample, so lets remove that cell for now.\nadata = adata[adata.obs[\"louvain_0.6\"] != \"10\",:]\nsc.pl.umap(\n adata, color=[\"louvain_0.6\"], palette=sc.pl.palettes.default_20\n)" }, { "objectID": "labs/scanpy/scanpy_06_celltyping.html#meta-ct_ref", "href": "labs/scanpy/scanpy_06_celltyping.html#meta-ct_ref", "title": " Celltype prediction", "section": "1 Reference data", - "text": "1 Reference data\nLoad the reference data from scanpy.datasets. It is the annotated and processed pbmc3k dataset from 10x.\n\nadata_ref = sc.datasets.pbmc3k_processed() \n\nadata_ref.obs['sample']='pbmc3k'\n\nprint(adata_ref.shape)\nadata_ref.obs\n\ntry downloading from url\nhttps://raw.githubusercontent.com/chanzuckerberg/cellxgene/main/example-dataset/pbmc3k.h5ad\n... 
this may take a while but only happens once\n(2638, 1838)\n\n\n\n\n\n\n\n\n\nn_genes\npercent_mito\nn_counts\nlouvain\nsample\n\n\nindex\n\n\n\n\n\n\n\n\n\nAAACATACAACCAC-1\n781\n0.030178\n2419.0\nCD4 T cells\npbmc3k\n\n\nAAACATTGAGCTAC-1\n1352\n0.037936\n4903.0\nB cells\npbmc3k\n\n\nAAACATTGATCAGC-1\n1131\n0.008897\n3147.0\nCD4 T cells\npbmc3k\n\n\nAAACCGTGCTTCCG-1\n960\n0.017431\n2639.0\nCD14+ Monocytes\npbmc3k\n\n\nAAACCGTGTATGCG-1\n522\n0.012245\n980.0\nNK cells\npbmc3k\n\n\n...\n...\n...\n...\n...\n...\n\n\nTTTCGAACTCTCAT-1\n1155\n0.021104\n3459.0\nCD14+ Monocytes\npbmc3k\n\n\nTTTCTACTGAGGCA-1\n1227\n0.009294\n3443.0\nB cells\npbmc3k\n\n\nTTTCTACTTCCTCG-1\n622\n0.021971\n1684.0\nB cells\npbmc3k\n\n\nTTTGCATGAGAGGC-1\n454\n0.020548\n1022.0\nB cells\npbmc3k\n\n\nTTTGCATGCCTCAC-1\n724\n0.008065\n1984.0\nCD4 T cells\npbmc3k\n\n\n\n\n2638 rows × 5 columns\n\n\n\n\nsc.pl.umap(adata_ref, color='louvain')\n\n\n\n\n\n\n\n\nMake sure we have the same genes in both datset by taking the intersection\n\nprint(adata_ref.shape[1])\nprint(adata.shape[1])\nvar_names = adata_ref.var_names.intersection(adata.var_names)\nprint(len(var_names))\n\nadata_ref = adata_ref[:, var_names]\nadata = adata[:, var_names]\n\n1838\n2727\n427\n\n\nFirst we need to rerun pca and umap with the same gene set for both datasets.\n\nsc.pp.pca(adata_ref)\nsc.pp.neighbors(adata_ref)\nsc.tl.umap(adata_ref)\nsc.pl.umap(adata_ref, color='louvain')\n\ncomputing PCA\n with n_comps=50\n finished (0:00:00)\ncomputing neighbors\n using 'X_pca' with n_pcs = 50\n finished (0:00:08)\ncomputing UMAP\n finished (0:00:04)\n\n\n\n\n\n\n\n\n\n\nsc.pp.pca(adata)\nsc.pp.neighbors(adata)\nsc.tl.umap(adata)\nsc.pl.umap(adata, color='louvain_0.6')\n\ncomputing PCA\n on highly variable genes\n with n_comps=50\n finished (0:00:00)\ncomputing neighbors\n using 'X_pca' with n_pcs = 50\n finished (0:00:00)\ncomputing UMAP\n finished (0:00:02)" + "text": "1 Reference data\nLoad the reference data from scanpy.datasets. 
It is the annotated and processed pbmc3k dataset from 10x.\n\nadata_ref = sc.datasets.pbmc3k_processed() \n\nadata_ref.obs['sample']='pbmc3k'\n\nprint(adata_ref.shape)\nadata_ref.obs\n\n(2638, 1838)\n\n\n\n\n\n\n\n\n\nn_genes\npercent_mito\nn_counts\nlouvain\nsample\n\n\nindex\n\n\n\n\n\n\n\n\n\nAAACATACAACCAC-1\n781\n0.030178\n2419.0\nCD4 T cells\npbmc3k\n\n\nAAACATTGAGCTAC-1\n1352\n0.037936\n4903.0\nB cells\npbmc3k\n\n\nAAACATTGATCAGC-1\n1131\n0.008897\n3147.0\nCD4 T cells\npbmc3k\n\n\nAAACCGTGCTTCCG-1\n960\n0.017431\n2639.0\nCD14+ Monocytes\npbmc3k\n\n\nAAACCGTGTATGCG-1\n522\n0.012245\n980.0\nNK cells\npbmc3k\n\n\n...\n...\n...\n...\n...\n...\n\n\nTTTCGAACTCTCAT-1\n1155\n0.021104\n3459.0\nCD14+ Monocytes\npbmc3k\n\n\nTTTCTACTGAGGCA-1\n1227\n0.009294\n3443.0\nB cells\npbmc3k\n\n\nTTTCTACTTCCTCG-1\n622\n0.021971\n1684.0\nB cells\npbmc3k\n\n\nTTTGCATGAGAGGC-1\n454\n0.020548\n1022.0\nB cells\npbmc3k\n\n\nTTTGCATGCCTCAC-1\n724\n0.008065\n1984.0\nCD4 T cells\npbmc3k\n\n\n\n\n2638 rows × 5 columns\n\n\n\n\nsc.pl.umap(adata_ref, color='louvain')\n\n\n\n\n\n\n\n\nMake sure we have the same genes in both datset by taking the intersection\n\nprint(adata_ref.shape[1])\nprint(adata.shape[1])\nvar_names = adata_ref.var_names.intersection(adata.var_names)\nprint(len(var_names))\n\nadata_ref = adata_ref[:, var_names]\nadata = adata[:, var_names]\n\n1838\n2626\n419\n\n\nFirst we need to rerun pca and umap with the same gene set for both datasets.\n\nsc.pp.pca(adata_ref)\nsc.pp.neighbors(adata_ref)\nsc.tl.umap(adata_ref)\nsc.pl.umap(adata_ref, color='louvain')\n\ncomputing PCA\n with n_comps=50\n finished (0:00:00)\ncomputing neighbors\n using 'X_pca' with n_pcs = 50\n finished (0:00:08)\ncomputing UMAP\n finished (0:00:04)\n\n\n\n\n\n\n\n\n\n\nsc.pp.pca(adata)\nsc.pp.neighbors(adata)\nsc.tl.umap(adata)\nsc.pl.umap(adata, color='louvain_0.6')\n\ncomputing PCA\n on highly variable genes\n with n_comps=50\n finished (0:00:00)\ncomputing neighbors\n using 'X_pca' with n_pcs = 50\n 
finished (0:00:00)\ncomputing UMAP\n finished (0:00:02)" }, { "objectID": "labs/scanpy/scanpy_06_celltyping.html#integrate-with-scanorama", "href": "labs/scanpy/scanpy_06_celltyping.html#integrate-with-scanorama", "title": " Celltype prediction", "section": "2 Integrate with scanorama", - "text": "2 Integrate with scanorama\n\nimport scanorama\n\n#subset the individual dataset to the same variable genes as in MNN-correct.\nalldata = dict()\nalldata['ctrl']=adata\nalldata['ref']=adata_ref\n\n#convert to list of AnnData objects\nadatas = list(alldata.values())\n\n# run scanorama.integrate\nscanorama.integrate_scanpy(adatas, dimred = 50)\n\nFound 427 genes among all datasets\n[[0. 0.96329454]\n [0. 0. ]]\nProcessing datasets (0, 1)\n\n\n\n# add in sample info\nadata_ref.obs['sample']='pbmc3k'\n\n# create a merged scanpy object and add in the scanorama \nadata_merged = alldata['ctrl'].concatenate(alldata['ref'], batch_key='sample', batch_categories=['ctrl','pbmc3k'])\n\nembedding = np.concatenate([ad.obsm['X_scanorama'] for ad in adatas], axis=0)\nadata_merged.obsm['Scanorama'] = embedding\n\n\n#run umap.\nsc.pp.neighbors(adata_merged, n_pcs =50, use_rep = \"Scanorama\")\nsc.tl.umap(adata_merged)\n\ncomputing neighbors\n finished (0:00:00)\ncomputing UMAP\n finished (0:00:05)\n\n\n\nsc.pl.umap(adata_merged, color=[\"sample\",\"louvain\"])\n\n\n\n\n\n\n\n\n\n2.1 Label transfer\nUsing the function in the Spatial tutorial at the scanpy website we will calculate normalized cosine distances between the two datasets and tranfer labels to the celltype with the highest scores.\n\nfrom sklearn.metrics.pairwise import cosine_distances\n\ndistances = 1 - cosine_distances(\n adata_merged[adata_merged.obs['sample'] == \"pbmc3k\"].obsm[\"Scanorama\"],\n adata_merged[adata_merged.obs['sample'] == \"ctrl\"].obsm[\"Scanorama\"],\n)\n\ndef label_transfer(dist, labels, index):\n lab = pd.get_dummies(labels)\n class_prob = lab.to_numpy().T @ dist\n norm = np.linalg.norm(class_prob, 2, 
axis=0)\n class_prob = class_prob / norm\n class_prob = (class_prob.T - class_prob.min(1)) / class_prob.ptp(1)\n # convert to df\n cp_df = pd.DataFrame(\n class_prob, columns=lab.columns\n )\n cp_df.index = index\n # classify as max score\n m = cp_df.idxmax(axis=1)\n \n return m\n\nclass_def = label_transfer(distances, adata_ref.obs.louvain, adata.obs.index)\n\n# add to obs section of the original object\nadata.obs['predicted'] = class_def\n\nsc.pl.umap(adata, color=\"predicted\")\n\n\n\n\n\n\n\n\n\n# add to merged object.\nadata_merged.obs[\"predicted\"] = pd.concat(\n [class_def, adata_ref.obs[\"louvain\"]], axis=0\n).tolist()\n\nsc.pl.umap(adata_merged, color=[\"sample\",\"louvain\",'predicted'])\n#plot only ctrl cells.\nsc.pl.umap(adata_merged[adata_merged.obs['sample']=='ctrl'], color='predicted')" + "text": "2 Integrate with scanorama\n\nimport scanorama\n\n#subset the individual dataset to the same variable genes as in MNN-correct.\nalldata = dict()\nalldata['ctrl']=adata\nalldata['ref']=adata_ref\n\n#convert to list of AnnData objects\nadatas = list(alldata.values())\n\n# run scanorama.integrate\nscanorama.integrate_scanpy(adatas, dimred = 50)\n\nFound 419 genes among all datasets\n[[0. 0.96511628]\n [0. 0. 
]]\nProcessing datasets (0, 1)\n\n\n\n# add in sample info\nadata_ref.obs['sample']='pbmc3k'\n\n# create a merged scanpy object and add in the scanorama \nadata_merged = alldata['ctrl'].concatenate(alldata['ref'], batch_key='sample', batch_categories=['ctrl','pbmc3k'])\n\nembedding = np.concatenate([ad.obsm['X_scanorama'] for ad in adatas], axis=0)\nadata_merged.obsm['Scanorama'] = embedding\n\n\n#run umap.\nsc.pp.neighbors(adata_merged, n_pcs =50, use_rep = \"Scanorama\")\nsc.tl.umap(adata_merged)\n\ncomputing neighbors\n finished (0:00:00)\ncomputing UMAP\n finished (0:00:05)\n\n\n\nsc.pl.umap(adata_merged, color=[\"sample\",\"louvain\"])\n\n\n\n\n\n\n\n\n\n2.1 Label transfer\nUsing the function in the Spatial tutorial at the scanpy website we will calculate normalized cosine distances between the two datasets and tranfer labels to the celltype with the highest scores.\n\nfrom sklearn.metrics.pairwise import cosine_distances\n\ndistances = 1 - cosine_distances(\n adata_merged[adata_merged.obs['sample'] == \"pbmc3k\"].obsm[\"Scanorama\"],\n adata_merged[adata_merged.obs['sample'] == \"ctrl\"].obsm[\"Scanorama\"],\n)\n\ndef label_transfer(dist, labels, index):\n lab = pd.get_dummies(labels)\n class_prob = lab.to_numpy().T @ dist\n norm = np.linalg.norm(class_prob, 2, axis=0)\n class_prob = class_prob / norm\n class_prob = (class_prob.T - class_prob.min(1)) / class_prob.ptp(1)\n # convert to df\n cp_df = pd.DataFrame(\n class_prob, columns=lab.columns\n )\n cp_df.index = index\n # classify as max score\n m = cp_df.idxmax(axis=1)\n \n return m\n\nclass_def = label_transfer(distances, adata_ref.obs.louvain, adata.obs.index)\n\n# add to obs section of the original object\nadata.obs['predicted'] = class_def\n\nsc.pl.umap(adata, color=\"predicted\")\n\n\n\n\n\n\n\n\n\n# add to merged object.\nadata_merged.obs[\"predicted\"] = pd.concat(\n [class_def, adata_ref.obs[\"louvain\"]], axis=0\n).tolist()\n\nsc.pl.umap(adata_merged, 
color=[\"sample\",\"louvain\",'predicted'])\n#plot only ctrl cells.\nsc.pl.umap(adata_merged[adata_merged.obs['sample']=='ctrl'], color='predicted')\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNow plot how many cells of each celltypes can be found in each cluster.\n\ntmp = pd.crosstab(adata.obs['louvain_0.6'],adata.obs['predicted'], normalize='index')\ntmp.plot.bar(stacked=True).legend(bbox_to_anchor=(1.8, 1),loc='upper right')\n\n<matplotlib.legend.Legend at 0x7fff4ec62b60>" }, { "objectID": "labs/scanpy/scanpy_06_celltyping.html#ingest", "href": "labs/scanpy/scanpy_06_celltyping.html#ingest", "title": " Celltype prediction", "section": "3 Ingest", - "text": "3 Ingest\nAnother method for celltype prediction is Ingest, for more information, please look at https://scanpy-tutorials.readthedocs.io/en/latest/integrating-data-using-ingest.html\n\nsc.tl.ingest(adata, adata_ref, obs='louvain')\nsc.pl.umap(adata, color=['louvain','louvain_0.6'], wspace=0.5)\n\nrunning ingest\n finished (0:00:20)" + "text": "3 Ingest\nAnother method for celltype prediction is Ingest, for more information, please look at https://scanpy-tutorials.readthedocs.io/en/latest/integrating-data-using-ingest.html\n\nsc.tl.ingest(adata, adata_ref, obs='louvain')\nsc.pl.umap(adata, color=['louvain','louvain_0.6'], wspace=0.5)\n\nrunning ingest\n finished (0:00:20)\n\n\n\n\n\n\n\n\n\nNow plot how many cells of each celltypes can be found in each cluster.\n\ntmp = pd.crosstab(adata.obs['louvain_0.6'],adata.obs['louvain'], normalize='index')\ntmp.plot.bar(stacked=True).legend(bbox_to_anchor=(1.8, 1),loc='upper right')\n\n<matplotlib.legend.Legend at 0x7fff4e07aa40>" }, { "objectID": "labs/scanpy/scanpy_06_celltyping.html#compare-results", @@ -1243,14 +1243,14 @@ "href": "labs/scanpy/scanpy_06_celltyping.html#gene-set-analysis", "title": " Celltype prediction", "section": "5 Gene set analysis", - "text": "5 Gene set analysis\nAnother way of predicting celltypes is to use the differentially expressed genes per cluster 
and compare to lists of known cell marker genes. This requires a list of genes that you trust and that is relevant for the tissue you are working on.\nYou can either run it with a marker list from the ontology or a list of your choice as in the example below.\n\npath_file = 'data/human_cell_markers.txt'\nif not os.path.exists(path_file):\n urllib.request.urlretrieve(os.path.join(\n path_data, 'human_cell_markers.txt'), path_file)\n\n\ndf = pd.read_table(path_file)\ndf\n\n\n\n\n\n\n\n\nspeciesType\ntissueType\nUberonOntologyID\ncancerType\ncellType\ncellName\nCellOntologyID\ncellMarker\ngeneSymbol\ngeneID\nproteinName\nproteinID\nmarkerResource\nPMID\nCompany\n\n\n\n\n0\nHuman\nKidney\nUBERON_0002113\nNormal\nNormal cell\nProximal tubular cell\nNaN\nIntestinal Alkaline Phosphatase\nALPI\n248\nPPBI\nP09923\nExperiment\n9263997\nNaN\n\n\n1\nHuman\nLiver\nUBERON_0002107\nNormal\nNormal cell\nIto cell (hepatic stellate cell)\nCL_0000632\nSynaptophysin\nSYP\n6855\nSYPH\nP08247\nExperiment\n10595912\nNaN\n\n\n2\nHuman\nEndometrium\nUBERON_0001295\nNormal\nNormal cell\nTrophoblast cell\nCL_0000351\nCEACAM1\nCEACAM1\n634\nCEAM1\nP13688\nExperiment\n10751340\nNaN\n\n\n3\nHuman\nGerm\nUBERON_0000923\nNormal\nNormal cell\nPrimordial germ cell\nCL_0000670\nVASA\nDDX4\n54514\nDDX4\nQ9NQI0\nExperiment\n10920202\nNaN\n\n\n4\nHuman\nCorneal epithelium\nUBERON_0001772\nNormal\nNormal cell\nEpithelial cell\nCL_0000066\nKLF6\nKLF6\n1316\nKLF6\nQ99612\nExperiment\n12407152\nNaN\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n2863\nHuman\nEmbryo\nUBERON_0000922\nNormal\nNormal cell\n1-cell stage cell (Blastomere)\nCL_0000353\nACCSL, ACVR1B, ARHGEF16, ASF1B, BCL2L10, BLCAP...\nACCSL, ACVR1B, ARHGEF16, ASF1B, BCL2L10, BLCAP...\n390110, 91, 27237, 55723, 10017, 10904, 662, 7...\n1A1L2, ACV1B, ARHGG, ASF1B, B2L10, BLCAP, SEC2...\nQ4AC99, P36896, Q5VV41, Q9NVP2, Q9HD36, P62952...\nSingle-cell 
sequencing\n23892778\nNaN\n\n\n2864\nHuman\nEmbryo\nUBERON_0000922\nNormal\nNormal cell\n4-cell stage cell (Blastomere)\nCL_0000353\nADPGK, AIM1, AIMP2, ARG2, ARHGAP17, ARIH1, CDC...\nADPGK, CRYBG1, AIMP2, ARG2, ARHGAP17, ARIH1, C...\n83440, 202, 7965, 384, 55114, 25820, 55536, 24...\nADPGK, CRBG1, AIMP2, ARGI2, RHG17, ARI1, CDA7L...\nQ9BRR6, Q9Y4K1, Q13155, P78540, Q68EM7, Q9Y4X5...\nSingle-cell sequencing\n23892778\nNaN\n\n\n2865\nHuman\nEmbryo\nUBERON_0000922\nNormal\nNormal cell\n8-cell stage cell (Blastomere)\nCL_0000353\nC11orf48, C19orf53, DHX9, DIABLO, EIF1AD, EIF4...\nLBHD1, C19orf53, DHX9, DIABLO, EIF1AD, EIF4G1,...\n79081, 28974, 1660, 56616, 84285, 1981, 26017,...\nLBHD1, L10K, DHX9, DBLOH, EIF1A, IF4G1, FA32A,...\nQ9BQE6, Q9UNZ5, Q08211, Q9NR28, Q8N9N8, Q04637...\nSingle-cell sequencing\n23892778\nNaN\n\n\n2866\nHuman\nEmbryo\nUBERON_0000922\nNormal\nNormal cell\nMorula cell (Blastomere)\nCL_0000360\nADCK1, AGL, AIMP1, AKAP12, ARPC3, ATP1B3, ATP5...\nADCK1, AGL, AIMP1, AKAP12, ARPC3, ATP1B3, NA, ...\n57143, 178, 9255, 9590, 10094, 483, NA, 586, 9...\nADCK1, GDE, AIMP1, AKA12, ARPC3, AT1B3, AT5F1,...\nQ86TW2, P35573, Q12904, Q02952, O15145, P54709...\nSingle-cell sequencing\n23892778\nNaN\n\n\n2867\nHuman\nBrain\nUBERON_0000955\noligodendroglioma\nCancer cell\nCancer stem cell\nNaN\nASCL1, BOC, CCND2, CD24, CHD7, EGFR, NFIB, SOX...\nASCL1, BOC, CCND2, CD24, CHD7, EGFR, NFIB, SOX...\n429, 91653, 894, 100133941, 55636, 1956, 4781,...\nASCL1, BOC, CCND2, CD24, CHD7, EGFR, NFIB, SOX...\nP50553, Q9BWV1, P30279, P25063, Q9P2D1, P00533...\nSingle-cell sequencing\n27806376\nNaN\n\n\n\n\n2868 rows × 15 columns\n\n\n\n\n# Filter for number of genes per celltype\nprint(df.shape)\n\n(2868, 15)\n\n\n\ndf['nG'] = df.geneSymbol.str.split(\",\").str.len()\n\ndf = df[df['nG'] > 5]\ndf = df[df['nG'] < 100]\nd = df[df['cancerType'] == \"Normal\"]\nprint(df.shape)\n\n(445, 16)\n\n\n\n# this chunk has issues and therefore not evaluated\n\ndf.index = df.cellName\ngene_dict = 
df.geneSymbol.str.split(\",\").to_dict()\n\n# run differential expression per cluster\nsc.tl.rank_genes_groups(adata, 'louvain_0.6', method='wilcoxon', key_added = \"wilcoxon\")\n\n\n# this chunk has issues and therefore not evaluated\n\n# do gene set overlap to the groups in the gene list and top 300 DEGs.\nimport gseapy\n\ngsea_res = dict()\npred = dict()\n\nfor cl in adata.obs['louvain_0.6'].cat.categories.tolist():\n print(cl)\n glist = sc.get.rank_genes_groups_df(adata, group=cl, key='wilcoxon')[\n 'names'].squeeze().str.strip().tolist()\n enr_res = gseapy.enrichr(gene_list=glist[:300],\n organism='Human',\n gene_sets=gene_dict,\n background=adata.raw.shape[1],\n cutoff=1)\n if enr_res.results.shape[0] == 0:\n pred[cl] = \"Unass\"\n else:\n enr_res.results.sort_values(\n by=\"P-value\", axis=0, ascending=True, inplace=True)\n print(enr_res.results.head(2))\n gsea_res[cl] = enr_res\n pred[cl] = enr_res.results[\"Term\"][0]\n\n\n# this chunk has issues and therefore not evaluated\n\n# prediction per cluster\npred\n\n\n# this chunk has issues and therefore not evaluated\n\nprediction = [pred[x] for x in adata.obs['louvain_0.6']]\nadata.obs[\"GS_overlap_pred\"] = prediction\n\nsc.pl.umap(adata, color='GS_overlap_pred')\n\n\n\n\n\n\n\nDiscuss\n\n\n\nAs you can see, it agrees to some extent with the predictions from label transfer and ingest, but there are clear differences, which do you think looks better?" + "text": "5 Gene set analysis\nAnother way of predicting celltypes is to use the differentially expressed genes per cluster and compare to lists of known cell marker genes. 
This requires a list of genes that you trust and that is relevant for the tissue you are working on.\nYou can either run it with a marker list from the ontology or a list of your choice as in the example below.\n\npath_file = 'data/human_cell_markers.txt'\nif not os.path.exists(path_file):\n urllib.request.urlretrieve(os.path.join(\n path_data, 'human_cell_markers.txt'), path_file)\n\n\ndf = pd.read_table(path_file)\ndf\n\nprint(df.shape)\n\n(2868, 15)\n\n\n\n# Filter for number of genes per celltype\ndf['nG'] = df.geneSymbol.str.split(\",\").str.len()\n\ndf = df[df['nG'] > 5]\ndf = df[df['nG'] < 100]\nd = df[df['cancerType'] == \"Normal\"]\nprint(df.shape)\n\n(445, 16)\n\n\n\ndf.index = df.cellName\ngene_dict = df.geneSymbol.str.split(\",\").to_dict()\n\n\n# run differential expression per cluster\nsc.tl.rank_genes_groups(adata, 'louvain_0.6', method='wilcoxon', key_added = \"wilcoxon\")\n\nranking genes\n finished (0:00:01)\n\n\n\n# do gene set overlap to the groups in the gene list and top 300 DEGs.\nimport gseapy\n\ngsea_res = dict()\npred = dict()\n\nfor cl in adata.obs['louvain_0.6'].cat.categories.tolist():\n print(cl)\n glist = sc.get.rank_genes_groups_df(adata, group=cl, key='wilcoxon')[\n 'names'].squeeze().str.strip().tolist()\n enr_res = gseapy.enrichr(gene_list=glist[:300],\n organism='Human',\n gene_sets=gene_dict,\n background=adata.raw.shape[1],\n cutoff=1)\n if enr_res.results.shape[0] == 0:\n pred[cl] = \"Unass\"\n else:\n enr_res.results.sort_values(\n by=\"P-value\", axis=0, ascending=True, inplace=True)\n print(enr_res.results.head(2))\n gsea_res[cl] = enr_res\n pred[cl] = enr_res.results[\"Term\"][0]\n\n0\n Gene_set Term Overlap P-value Adjusted P-value \\\n0 gs_ind_0 Cancer stem-like cell 1/6 0.088981 0.226652 \n6 gs_ind_0 Macrophage 1/6 0.088981 0.226652 \n\n Odds Ratio Combined Score Genes \n0 14.996147 36.280703 ANPEP \n6 14.996147 36.280703 AIF1 \n1\n Gene_set Term Overlap P-value Adjusted P-value \\\n2 gs_ind_0 Effector memory T cell 1/7 
0.103024 0.180292 \n4 gs_ind_0 Monocyte 1/7 0.103024 0.180292 \n\n Odds Ratio Combined Score Genes \n2 12.995993 29.537229 IL7R \n4 12.995993 29.537229 CD52 \n2\n Gene_set Term Overlap P-value Adjusted P-value \\\n6 gs_ind_0 Monocyte 1/7 0.103024 0.244332 \n7 gs_ind_0 Parietal progenitor cell 1/7 0.103024 0.244332 \n\n Odds Ratio Combined Score Genes \n6 12.995993 29.537229 CD52 \n7 12.995993 29.537229 ANXA1 \n3\n Gene_set Term Overlap P-value Adjusted P-value \\\n6 gs_ind_0 Effector memory T cell 1/7 0.103024 0.226084 \n8 gs_ind_0 Naive T cell 1/7 0.103024 0.226084 \n\n Odds Ratio Combined Score Genes \n6 12.995993 29.537229 IL7R \n8 12.995993 29.537229 CCR7 \n4\n Gene_set Term Overlap P-value Adjusted P-value Odds Ratio \\\n0 gs_ind_0 B cell 1/6 0.088981 0.116851 14.996147 \n4 gs_ind_0 Monocyte 1/7 0.103024 0.116851 12.995993 \n\n Combined Score Genes \n0 36.280703 CD19 \n4 29.537229 CD52 \n5\n Gene_set Term Overlap P-value \\\n11 gs_ind_0 Myeloid-derived suppressor cell 1/6 0.088981 \n2 gs_ind_0 Dendritic cell 1/7 0.103024 \n\n Adjusted P-value Odds Ratio Combined Score Genes \n11 0.183109 14.996147 36.280703 ITGAM \n2 0.183109 12.995993 29.537229 ITGAM \n6\n Gene_set Term Overlap P-value \\\n0 gs_ind_0 Cancer stem-like cell 1/6 0.088981 \n4 gs_ind_0 Induced pluripotent stem cell 1/6 0.088981 \n\n Adjusted P-value Odds Ratio Combined Score Genes \n0 0.164838 14.996147 36.280703 ANPEP \n4 0.164838 14.996147 36.280703 ITGA6 \n7\n Gene_set Term Overlap P-value Adjusted P-value Odds Ratio \\\n0 gs_ind_0 B cell 1/6 0.088981 0.140221 14.996147 \n5 gs_ind_0 Monocyte 1/7 0.103024 0.140221 12.995993 \n\n Combined Score Genes \n0 36.280703 CD19 \n5 29.537229 CD52 \n8\n Gene_set Term Overlap P-value \\\n2 gs_ind_0 Macrophage 1/6 0.088981 \n3 gs_ind_0 Monocyte derived dendritic cell 1/8 0.116851 \n\n Adjusted P-value Odds Ratio Combined Score Genes \n2 0.233702 14.996147 36.280703 AIF1 \n3 0.233702 11.466464 24.616832 ITGAX \n9\n Gene_set Term Overlap P-value Adjusted 
P-value \\\n3 gs_ind_0 PROM1Low progenitor cell 1/7 0.103024 0.309489 \n1 gs_ind_0 M2 macrophage 1/12 0.170068 0.309489 \n\n Odds Ratio Combined Score Genes \n3 12.995993 29.537229 ALCAM \n1 7.795593 13.810336 CD163 \n\n\n\n# prediction per cluster\npred\n\n{'0': 'Cancer stem-like cell',\n '1': 'CD4+ T cell',\n '2': 'CD8+ T cell',\n '3': 'Activated T cell',\n '4': 'B cell',\n '5': 'CD16+ dendritic cell',\n '6': 'Cancer stem-like cell',\n '7': 'B cell',\n '8': 'Circulating fetal cell',\n '9': 'Circulating fetal cell'}\n\n\n\nprediction = [pred[x] for x in adata.obs['louvain_0.6']]\nadata.obs[\"GS_overlap_pred\"] = prediction\n\nsc.pl.umap(adata, color='GS_overlap_pred')\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nAs you can see, it agrees to some extent with the predictions from label transfer and ingest, but there are clear differences, which do you think looks better?" }, { "objectID": "labs/scanpy/scanpy_06_celltyping.html#meta-session", "href": "labs/scanpy/scanpy_06_celltyping.html#meta-session", "title": " Celltype prediction", "section": "6 Session info", - "text": "6 Session info\n\n\nClick here\n\n\nsc.logging.print_versions()\n\n-----\nanndata 0.10.3\nscanpy 1.9.6\n-----\nPIL 10.0.0\nannoy NA\nanyio NA\narray_api_compat 1.4\nasttokens NA\nattr 23.1.0\nbabel 2.12.1\nbackcall 0.2.0\ncertifi 2023.11.17\ncffi 1.15.1\ncharset_normalizer 3.1.0\ncolorama 0.4.6\ncomm 0.1.3\ncycler 0.12.1\ncython_runtime NA\ndateutil 2.8.2\ndebugpy 1.6.7\ndecorator 5.1.1\ndefusedxml 0.7.1\nexceptiongroup 1.2.0\nexecuting 1.2.0\nfastjsonschema NA\nfbpca NA\ngmpy2 2.1.2\nh5py 3.9.0\nidna 3.4\nigraph 0.10.8\nintervaltree NA\nipykernel 6.23.1\nipython_genutils 0.2.0\njedi 0.18.2\njinja2 3.1.2\njoblib 1.3.2\njson5 NA\njsonpointer 2.0\njsonschema 4.17.3\njupyter_events 0.6.3\njupyter_server 2.6.0\njupyterlab_server 2.22.1\nkiwisolver 1.4.5\nleidenalg 0.10.1\nllvmlite 0.41.1\nlouvain 0.8.1\nmarkupsafe 2.1.2\nmatplotlib 3.8.0\nmatplotlib_inline 0.1.6\nmpl_toolkits NA\nmpmath 1.3.0\nnatsort 
8.4.0\nnbformat 5.8.0\nnumba 0.58.1\nnumpy 1.26.2\nopt_einsum v3.3.0\noverrides NA\npackaging 23.1\npandas 2.1.4\nparso 0.8.3\npexpect 4.8.0\npickleshare 0.7.5\npkg_resources NA\nplatformdirs 3.5.1\nprometheus_client NA\nprompt_toolkit 3.0.38\npsutil 5.9.5\nptyprocess 0.7.0\npure_eval 0.2.2\npvectorc NA\npycparser 2.21\npydev_ipython NA\npydevconsole NA\npydevd 2.9.5\npydevd_file_utils NA\npydevd_plugins NA\npydevd_tracing NA\npygments 2.15.1\npynndescent 0.5.11\npyparsing 3.1.1\npyrsistent NA\npythonjsonlogger NA\npytz 2023.3\nrequests 2.31.0\nrfc3339_validator 0.1.4\nrfc3986_validator 0.1.1\nscanorama 1.7.4\nscipy 1.11.4\nsend2trash NA\nsession_info 1.0.0\nsix 1.16.0\nsklearn 1.3.2\nsniffio 1.3.0\nsocks 1.7.1\nsortedcontainers 2.4.0\nsparse 0.14.0\nstack_data 0.6.2\nsympy 1.12\ntexttable 1.7.0\nthreadpoolctl 3.2.0\ntorch 2.0.0\ntornado 6.3.2\ntqdm 4.65.0\ntraitlets 5.9.0\ntyping_extensions NA\numap 0.5.5\nurllib3 2.0.2\nwcwidth 0.2.6\nwebsocket 1.5.2\nyaml 6.0\nzmq 25.0.2\nzoneinfo NA\nzstandard 0.19.0\n-----\nIPython 8.13.2\njupyter_client 8.2.0\njupyter_core 5.3.0\njupyterlab 4.0.1\nnotebook 6.5.4\n-----\nPython 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]\nLinux-6.5.11-linuxkit-x86_64-with-glibc2.35\n-----\nSession information updated at 2024-01-16 23:24" + "text": "6 Session info\n\n\nClick here\n\n\nsc.logging.print_versions()\n\n-----\nanndata 0.10.3\nscanpy 1.9.6\n-----\nPIL 10.0.0\nannoy NA\nanyio NA\narray_api_compat 1.4\nasttokens NA\nattr 23.1.0\nbabel 2.12.1\nbackcall 0.2.0\ncertifi 2023.11.17\ncffi 1.15.1\ncharset_normalizer 3.1.0\ncolorama 0.4.6\ncomm 0.1.3\ncycler 0.12.1\ncython_runtime NA\ndateutil 2.8.2\ndebugpy 1.6.7\ndecorator 5.1.1\ndefusedxml 0.7.1\nexceptiongroup 1.2.0\nexecuting 1.2.0\nfastjsonschema NA\nfbpca NA\ngmpy2 2.1.2\ngseapy 1.0.6\nh5py 3.9.0\nidna 3.4\nigraph 0.10.8\nintervaltree NA\nipykernel 6.23.1\nipython_genutils 0.2.0\njedi 0.18.2\njinja2 3.1.2\njoblib 1.3.2\njson5 NA\njsonpointer 
2.0\njsonschema 4.17.3\njupyter_events 0.6.3\njupyter_server 2.6.0\njupyterlab_server 2.22.1\nkiwisolver 1.4.5\nleidenalg 0.10.1\nllvmlite 0.41.1\nlouvain 0.8.1\nmarkupsafe 2.1.2\nmatplotlib 3.8.0\nmatplotlib_inline 0.1.6\nmpl_toolkits NA\nmpmath 1.3.0\nnatsort 8.4.0\nnbformat 5.8.0\nnumba 0.58.1\nnumpy 1.26.2\nopt_einsum v3.3.0\noverrides NA\npackaging 23.1\npandas 2.1.4\nparso 0.8.3\npatsy 0.5.5\npexpect 4.8.0\npickleshare 0.7.5\npkg_resources NA\nplatformdirs 3.5.1\nprometheus_client NA\nprompt_toolkit 3.0.38\npsutil 5.9.5\nptyprocess 0.7.0\npure_eval 0.2.2\npvectorc NA\npycparser 2.21\npydev_ipython NA\npydevconsole NA\npydevd 2.9.5\npydevd_file_utils NA\npydevd_plugins NA\npydevd_tracing NA\npygments 2.15.1\npynndescent 0.5.11\npyparsing 3.1.1\npyrsistent NA\npythonjsonlogger NA\npytz 2023.3\nrequests 2.31.0\nrfc3339_validator 0.1.4\nrfc3986_validator 0.1.1\nscanorama 1.7.4\nscipy 1.11.4\nsend2trash NA\nsession_info 1.0.0\nsix 1.16.0\nsklearn 1.3.2\nsniffio 1.3.0\nsocks 1.7.1\nsortedcontainers 2.4.0\nsparse 0.14.0\nstack_data 0.6.2\nstatsmodels 0.14.1\nsympy 1.12\ntexttable 1.7.0\nthreadpoolctl 3.2.0\ntorch 2.0.0\ntornado 6.3.2\ntqdm 4.65.0\ntraitlets 5.9.0\ntyping_extensions NA\numap 0.5.5\nurllib3 2.0.2\nwcwidth 0.2.6\nwebsocket 1.5.2\nyaml 6.0\nzmq 25.0.2\nzoneinfo NA\nzstandard 0.19.0\n-----\nIPython 8.13.2\njupyter_client 8.2.0\njupyter_core 5.3.0\njupyterlab 4.0.1\nnotebook 6.5.4\n-----\nPython 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]\nLinux-6.5.11-linuxkit-x86_64-with-glibc2.35\n-----\nSession information updated at 2024-01-23 11:30" }, { "objectID": "labs/scanpy/scanpy_07_trajectory.html", @@ -1327,7 +1327,7 @@ "href": "labs/scanpy/scanpy_07_trajectory.html#session-info", "title": " Trajectory inference using PAGA", "section": "10 Session info", - "text": "10 Session info\n\n\nClick here\n\n\nsc.logging.print_versions()\n\n-----\nanndata 0.10.3\nscanpy 1.9.6\n-----\nPIL 10.0.0\nanyio NA\nasttokens NA\nattr 
23.1.0\nbabel 2.12.1\nbackcall 0.2.0\ncertifi 2023.11.17\ncffi 1.15.1\ncharset_normalizer 3.1.0\ncolorama 0.4.6\ncomm 0.1.3\ncycler 0.12.1\ncython_runtime NA\ndateutil 2.8.2\ndebugpy 1.6.7\ndecorator 5.1.1\ndefusedxml 0.7.1\nexceptiongroup 1.2.0\nexecuting 1.2.0\nfastjsonschema NA\nfontTools 4.47.0\ngmpy2 2.1.2\nh5py 3.9.0\nidna 3.4\nigraph 0.10.8\nipykernel 6.23.1\nipython_genutils 0.2.0\njedi 0.18.2\njinja2 3.1.2\njoblib 1.3.2\njson5 NA\njsonpointer 2.0\njsonschema 4.17.3\njupyter_events 0.6.3\njupyter_server 2.6.0\njupyterlab_server 2.22.1\nkiwisolver 1.4.5\nleidenalg 0.10.1\nllvmlite 0.41.1\nlouvain 0.8.1\nmarkupsafe 2.1.2\nmatplotlib 3.8.0\nmatplotlib_inline 0.1.6\nmpl_toolkits NA\nmpmath 1.3.0\nnatsort 8.4.0\nnbformat 5.8.0\nnetworkx 3.2.1\nnumba 0.58.1\nnumpy 1.26.2\nopt_einsum v3.3.0\noverrides NA\npackaging 23.1\npandas 2.1.4\nparso 0.8.3\npexpect 4.8.0\npickleshare 0.7.5\npkg_resources NA\nplatformdirs 3.5.1\nprometheus_client NA\nprompt_toolkit 3.0.38\npsutil 5.9.5\nptyprocess 0.7.0\npure_eval 0.2.2\npvectorc NA\npycparser 2.21\npydev_ipython NA\npydevconsole NA\npydevd 2.9.5\npydevd_file_utils NA\npydevd_plugins NA\npydevd_tracing NA\npygments 2.15.1\npynndescent 0.5.11\npyparsing 3.1.1\npyrsistent NA\npythonjsonlogger NA\npytz 2023.3\nrequests 2.31.0\nrfc3339_validator 0.1.4\nrfc3986_validator 0.1.1\nscipy 1.11.4\nsend2trash NA\nsession_info 1.0.0\nsix 1.16.0\nsklearn 1.3.2\nsniffio 1.3.0\nsocks 1.7.1\nsparse 0.14.0\nstack_data 0.6.2\nsympy 1.12\ntexttable 1.7.0\nthreadpoolctl 3.2.0\ntorch 2.0.0\ntornado 6.3.2\ntqdm 4.65.0\ntraitlets 5.9.0\ntyping_extensions NA\numap 0.5.5\nurllib3 2.0.2\nwcwidth 0.2.6\nwebsocket 1.5.2\nyaml 6.0\nzmq 25.0.2\nzoneinfo NA\nzstandard 0.19.0\n-----\nIPython 8.13.2\njupyter_client 8.2.0\njupyter_core 5.3.0\njupyterlab 4.0.1\nnotebook 6.5.4\n-----\nPython 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]\nLinux-6.5.11-linuxkit-x86_64-with-glibc2.35\n-----\nSession information updated at 
2024-01-16 23:26" + "text": "10 Session info\n\n\nClick here\n\n\nsc.logging.print_versions()\n\n-----\nanndata 0.10.3\nscanpy 1.9.6\n-----\nPIL 10.0.0\nanyio NA\nasttokens NA\nattr 23.1.0\nbabel 2.12.1\nbackcall 0.2.0\ncertifi 2023.11.17\ncffi 1.15.1\ncharset_normalizer 3.1.0\ncolorama 0.4.6\ncomm 0.1.3\ncycler 0.12.1\ncython_runtime NA\ndateutil 2.8.2\ndebugpy 1.6.7\ndecorator 5.1.1\ndefusedxml 0.7.1\nexceptiongroup 1.2.0\nexecuting 1.2.0\nfastjsonschema NA\nfontTools 4.47.0\ngmpy2 2.1.2\nh5py 3.9.0\nidna 3.4\nigraph 0.10.8\nipykernel 6.23.1\nipython_genutils 0.2.0\njedi 0.18.2\njinja2 3.1.2\njoblib 1.3.2\njson5 NA\njsonpointer 2.0\njsonschema 4.17.3\njupyter_events 0.6.3\njupyter_server 2.6.0\njupyterlab_server 2.22.1\nkiwisolver 1.4.5\nleidenalg 0.10.1\nllvmlite 0.41.1\nlouvain 0.8.1\nmarkupsafe 2.1.2\nmatplotlib 3.8.0\nmatplotlib_inline 0.1.6\nmpl_toolkits NA\nmpmath 1.3.0\nnatsort 8.4.0\nnbformat 5.8.0\nnetworkx 3.2.1\nnumba 0.58.1\nnumpy 1.26.2\nopt_einsum v3.3.0\noverrides NA\npackaging 23.1\npandas 2.1.4\nparso 0.8.3\npexpect 4.8.0\npickleshare 0.7.5\npkg_resources NA\nplatformdirs 3.5.1\nprometheus_client NA\nprompt_toolkit 3.0.38\npsutil 5.9.5\nptyprocess 0.7.0\npure_eval 0.2.2\npvectorc NA\npycparser 2.21\npydev_ipython NA\npydevconsole NA\npydevd 2.9.5\npydevd_file_utils NA\npydevd_plugins NA\npydevd_tracing NA\npygments 2.15.1\npynndescent 0.5.11\npyparsing 3.1.1\npyrsistent NA\npythonjsonlogger NA\npytz 2023.3\nrequests 2.31.0\nrfc3339_validator 0.1.4\nrfc3986_validator 0.1.1\nscipy 1.11.4\nsend2trash NA\nsession_info 1.0.0\nsix 1.16.0\nsklearn 1.3.2\nsniffio 1.3.0\nsocks 1.7.1\nsparse 0.14.0\nstack_data 0.6.2\nsympy 1.12\ntexttable 1.7.0\nthreadpoolctl 3.2.0\ntorch 2.0.0\ntornado 6.3.2\ntqdm 4.65.0\ntraitlets 5.9.0\ntyping_extensions NA\numap 0.5.5\nurllib3 2.0.2\nwcwidth 0.2.6\nwebsocket 1.5.2\nyaml 6.0\nzmq 25.0.2\nzoneinfo NA\nzstandard 0.19.0\n-----\nIPython 8.13.2\njupyter_client 8.2.0\njupyter_core 5.3.0\njupyterlab 4.0.1\nnotebook 
6.5.4\n-----\nPython 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]\nLinux-6.5.11-linuxkit-x86_64-with-glibc2.35\n-----\nSession information updated at 2024-01-23 11:32" }, { "objectID": "labs/scanpy/scanpy_08_spatial.html", @@ -1355,14 +1355,14 @@ "href": "labs/scanpy/scanpy_08_spatial.html#meta-st_analysis", "title": " Spatial Transcriptomics", "section": "3 Analysis", - "text": "3 Analysis\nWe will proceed with the data in a very similar manner to scRNA-seq data.\nAs we have two sections, we will select variable genes with batch_key=“library_id” and then take the union of variable genes for further analysis. The idea is to avoid including batch specific genes in the analysis.\n\n# save the counts to a separate object for later, we need the normalized counts in raw for DEG dete\ncounts_adata = adata.copy()\n\nsc.pp.normalize_total(adata, inplace=True)\nsc.pp.log1p(adata)\n# take 1500 variable genes per batch and then use the union of them.\nsc.pp.highly_variable_genes(adata, flavor=\"seurat\", n_top_genes=1500, inplace=True, batch_key=\"library_id\")\n\n# subset for variable genes\nadata.raw = adata\nadata = adata[:,adata.var.highly_variable_nbatches > 0]\n\n# scale data\nsc.pp.scale(adata)\n\nnormalizing counts per cell\n finished (0:00:00)\nIf you pass `n_top_genes`, all cutoffs are ignored.\nextracting highly variable genes\n finished (0:00:02)\n--> added\n 'highly_variable', boolean vector (adata.var)\n 'means', float vector (adata.var)\n 'dispersions', float vector (adata.var)\n 'dispersions_norm', float vector (adata.var)\n... 
as `zero_center=True`, sparse input is densified and may lead to large memory consumption\n\n\nNow we can plot gene expression of individual genes, the gene Hpca is a strong hippocampal marker and Ttr is a marker of the choroid plexus.\n\nfor library in library_names:\n sc.pl.spatial(adata[adata.obs.library_id == library,:], library_id=library, color = [\"Ttr\", \"Hpca\"])\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n3.1 Dimensionality reduction and clustering\nWe can then now run dimensionality reduction and clustering using the same workflow as we use for scRNA-seq analysis.\n\nsc.pp.neighbors(adata)\nsc.tl.umap(adata)\nsc.tl.leiden(adata, key_added=\"clusters\")\n\ncomputing neighbors\nWARNING: You’re trying to run this on 2405 dimensions of `.X`, if you really want this, set `use_rep='X'`.\n Falling back to preprocessing with `sc.pp.pca` and default params.\ncomputing PCA\n with n_comps=50\n finished (0:00:00)\n finished: added to `.uns['neighbors']`\n `.obsp['distances']`, distances for each pair of neighbors\n `.obsp['connectivities']`, weighted adjacency matrix (0:00:08)\ncomputing UMAP\n finished: added\n 'X_umap', UMAP coordinates (adata.obsm) (0:00:09)\nrunning Leiden clustering\n finished: found 23 clusters and added\n 'clusters', the cluster labels (adata.obs, categorical) (0:00:01)\n\n\nWe can then plot clusters onto umap or onto the tissue section.\n\nsc.pl.umap(\n adata, color=[\"clusters\", \"library_id\"], palette=sc.pl.palettes.default_20\n)\n\nWARNING: Length of palette colors is smaller than the number of categories (palette length: 20, categories length: 23. 
Some categories will have the same color.\n\n\n\n\n\n\n\n\n\nAs we are plotting the two sections separately, we need to make sure that they get the same colors by fetching cluster colors from a dict.\n\nclusters_colors = dict(\n zip([str(i) for i in range(len(adata.obs.clusters.cat.categories))], adata.uns[\"clusters_colors\"])\n)\n\nfig, axs = plt.subplots(1, 2, figsize=(15, 10))\n\nfor i, library in enumerate(\n [\"V1_Mouse_Brain_Sagittal_Anterior\", \"V1_Mouse_Brain_Sagittal_Posterior\"]\n):\n ad = adata[adata.obs.library_id == library, :].copy()\n sc.pl.spatial(\n ad,\n img_key=\"hires\",\n library_id=library,\n color=\"clusters\",\n size=1.5,\n palette=[\n v\n for k, v in clusters_colors.items()\n if k in ad.obs.clusters.unique().tolist()\n ],\n legend_loc=None,\n show=False,\n ax=axs[i],\n )\n\nplt.tight_layout()\n\n\n\n\n\n\n\n\n\n\n3.2 Integration\nQuite often there are strong batch effects between different ST sections, so it may be a good idea to integrate the data across sections.\nWe will do a similar integration as in the Data Integration lab, here we will use Scanorama for integration.\n\nadatas = {}\nfor batch in library_names:\n adatas[batch] = adata[adata.obs['library_id'] == batch,]\n\nadatas \n\n{'V1_Mouse_Brain_Sagittal_Anterior': View of AnnData object with n_obs × n_vars = 2597 × 2405\n obs: 'in_tissue', 'array_row', 'array_col', 'library_id', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_hb', 'pct_counts_hb', 'clusters'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection', 'mean', 'std'\n uns: 'spatial', 'library_id_colors', 'log1p', 'hvg', 'neighbors', 'umap', 'leiden', 'clusters_colors'\n obsm: 'spatial', 'X_pca', 'X_umap'\n obsp: 'distances', 'connectivities',\n 'V1_Mouse_Brain_Sagittal_Posterior': 
View of AnnData object with n_obs × n_vars = 3152 × 2405\n obs: 'in_tissue', 'array_row', 'array_col', 'library_id', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_hb', 'pct_counts_hb', 'clusters'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection', 'mean', 'std'\n uns: 'spatial', 'library_id_colors', 'log1p', 'hvg', 'neighbors', 'umap', 'leiden', 'clusters_colors'\n obsm: 'spatial', 'X_pca', 'X_umap'\n obsp: 'distances', 'connectivities'}\n\n\n\nimport scanorama\n\n#convert to list of AnnData objects\nadatas = list(adatas.values())\n\n# run scanorama.integrate\nscanorama.integrate_scanpy(adatas, dimred = 50)\n\n# Get all the integrated matrices.\nscanorama_int = [ad.obsm['X_scanorama'] for ad in adatas]\n\n# make into one matrix.\nall_s = np.concatenate(scanorama_int)\nprint(all_s.shape)\n\n# add to the AnnData object\nadata.obsm[\"Scanorama\"] = all_s\n\nadata\n\nFound 2405 genes among all datasets\n[[0. 0.47824413]\n [0. 0. 
]]\nProcessing datasets (0, 1)\n(5749, 50)\n\n\nAnnData object with n_obs × n_vars = 5749 × 2405\n obs: 'in_tissue', 'array_row', 'array_col', 'library_id', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_hb', 'pct_counts_hb', 'clusters'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection', 'mean', 'std'\n uns: 'spatial', 'library_id_colors', 'log1p', 'hvg', 'neighbors', 'umap', 'leiden', 'clusters_colors'\n obsm: 'spatial', 'X_pca', 'X_umap', 'Scanorama'\n obsp: 'distances', 'connectivities'\n\n\nThen we run dimensionality reduction and clustering as before.\n\nsc.pp.neighbors(adata, use_rep=\"Scanorama\")\nsc.tl.umap(adata)\nsc.tl.leiden(adata, key_added=\"clusters\")\n\nsc.pl.umap(\n adata, color=[\"clusters\", \"library_id\"], palette=sc.pl.palettes.default_20\n)\n\ncomputing neighbors\n finished: added to `.uns['neighbors']`\n `.obsp['distances']`, distances for each pair of neighbors\n `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)\ncomputing UMAP\n finished: added\n 'X_umap', UMAP coordinates (adata.obsm) (0:00:09)\nrunning Leiden clustering\n finished: found 19 clusters and added\n 'clusters', the cluster labels (adata.obs, categorical) (0:00:01)\n\n\n\n\n\n\n\n\n\nAs we have new clusters, we again need to make a new dict for cluster colors\n\nclusters_colors = dict(\n zip([str(i) for i in range(len(adata.obs.clusters.cat.categories))], adata.uns[\"clusters_colors\"])\n)\n\nfig, axs = plt.subplots(1, 2, figsize=(15, 10))\n\nfor i, library in enumerate(\n [\"V1_Mouse_Brain_Sagittal_Anterior\", \"V1_Mouse_Brain_Sagittal_Posterior\"]\n):\n ad = adata[adata.obs.library_id == library, :].copy()\n sc.pl.spatial(\n ad,\n img_key=\"hires\",\n library_id=library,\n color=\"clusters\",\n size=1.5,\n palette=[\n 
v\n for k, v in clusters_colors.items()\n if k in ad.obs.clusters.unique().tolist()\n ],\n legend_loc=\"on data\",\n show=False,\n ax=axs[i],\n )\n\nplt.tight_layout()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nDo you see any differences between the integrated and non-integrated clustering? Judge for yourself, which of the clusterings do you think looks best? As a reference, you can compare to brain regions in the Allen brain atlas.\n\n\n\n\n3.3 Spatially Variable Features\nThere are two main workflows to identify molecular features that correlate with spatial location within a tissue. The first is to perform differential expression based on spatially distinct clusters, the other is to find features that have spatial patterning without taking clusters or spatial annotation into account. First, we will do differential expression between clusters just as we did for the scRNAseq data before.\n\n# run t-test \nsc.tl.rank_genes_groups(adata, \"clusters\", method=\"wilcoxon\")\n# plot as heatmap for cluster5 genes\nsc.pl.rank_genes_groups_heatmap(adata, groups=\"5\", n_genes=10, groupby=\"clusters\")\n\nranking genes\n finished: added to `.uns['rank_genes_groups']`\n 'names', sorted np.recarray to be indexed by group ids\n 'scores', sorted np.recarray to be indexed by group ids\n 'logfoldchanges', sorted np.recarray to be indexed by group ids\n 'pvals', sorted np.recarray to be indexed by group ids\n 'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:15)\nWARNING: dendrogram data not found (using key=dendrogram_clusters). Running `sc.tl.dendrogram` with default parameters. 
For fine tuning it is recommended to run `sc.tl.dendrogram` independently.\n using 'X_pca' with n_pcs = 50\nStoring dendrogram info using `.uns['dendrogram_clusters']`\nWARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.\ncategories: 0, 1, 2, etc.\nvar_group_labels: 5\n\n\n\n\n\n\n\n\n\n\n# plot onto spatial location\ntop_genes = sc.get.rank_genes_groups_df(adata, group='5',log2fc_min=0)['names'][:3]\n\nfor library in [\"V1_Mouse_Brain_Sagittal_Anterior\", \"V1_Mouse_Brain_Sagittal_Posterior\"]:\n sc.pl.spatial(adata[adata.obs.library_id == library,:], library_id=library, color = top_genes)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSpatial transcriptomics allows researchers to investigate how gene expression trends varies in space, thus identifying spatial patterns of gene expression. For this purpose there are multiple methods, such as SpatailDE, SPARK, Trendsceek, HMRF and Splotch.\nWe use SpatialDE Svensson et al., a Gaussian process-based statistical framework that aims to identify spatially variable genes.\n\nTakes a long time to run, so skip this step for now and download the precomputed file.\n\n\n# slow step\n\nimport NaiveDE\nimport SpatialDE\n\ncounts = sc.get.obs_df(adata, keys=list(adata.var_names), use_raw=True)\ntotal_counts = sc.get.obs_df(adata, keys=[\"total_counts\"])\nnorm_expr = NaiveDE.stabilize(counts.T).T\nresid_expr = NaiveDE.regress_out(\n total_counts, norm_expr.T, \"np.log(total_counts)\").T\nresults = SpatialDE.run(adata.obsm[\"spatial\"], resid_expr)\n\nimport pickle\nwith open('data/spatial/visium/scanpy_spatialde.pkl', 'wb') as file:\n pickle.dump(results, file)\n\n\nimport urllib.request\nimport os\n\npath_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\npath_file = \"data/spatial/visium/scanpy_spatialde.pkl\"\nif not os.path.exists(path_file):\n file_url = os.path.join(\n path_data, \"spatial/visium/results/scanpy_spatialde.pkl\")\n 
urllib.request.urlretrieve(file_url, path_file)\n\n\nimport pickle\nwith open('data/spatial/visium/scanpy_spatialde.pkl', 'rb') as file:\n results = pickle.load(file)\n\n\n# We concatenate the results with the DataFrame of annotations of variables: `adata.var`.\nresults.index = results[\"g\"]\nadata.var = pd.concat(\n [adata.var, results.loc[adata.var.index.values, :]], axis=1)\nadata.write_h5ad('./data/spatial/visium/adata_processed_sc.h5ad')\n\n# Then we can inspect significant genes that varies in space and visualize them with `sc.pl.spatial` function.\nresults.sort_values(\"qval\").head(10)\n\n\n\n\n\n\n\n\nFSV\nM\ng\nl\nmax_delta\nmax_ll\nmax_mu_hat\nmax_s2_t_hat\nmodel\nn\ns2_FSV\ns2_logdelta\ntime\nBIC\nmax_ll_null\nLLR\npval\nqval\n\n\ng\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nEfnb3\n1.775241e-01\n4\nEfnb3\n544.733658\n4.490546e+00\n-4701.690768\n0.920436\n6.950994e-02\nSE\n5749\n8.838199e-06\n4.029374e-04\n0.002817\n9438.008662\n-5374.163817\n672.473049\n0.0\n0.0\n\n\nS100a16\n5.205988e-02\n4\nS100a16\n544.733658\n1.764863e+01\n-4013.533873\n1.703474\n3.034361e-02\nSE\n5749\n7.639943e-06\n2.521451e-03\n0.002874\n8061.694871\n-4099.970498\n86.436625\n0.0\n0.0\n\n\nS100a5\n3.082312e-01\n4\nS100a5\n544.733658\n2.175292e+00\n-989.712703\n-0.692740\n3.893413e-02\nSE\n5749\n1.244686e-05\n3.006748e-04\n0.006836\n2014.052530\n-2128.916543\n1139.203841\n0.0\n0.0\n\n\nS100a6\n1.198049e-01\n4\nS100a6\n544.733658\n7.120948e+00\n-4911.277757\n0.024394\n4.364682e-02\nSE\n5749\n1.525869e-05\n1.234171e-03\n0.005581\n9857.182640\n-5087.893957\n176.616199\n0.0\n0.0\n\n\nCers2\n6.170160e-02\n4\nCers2\n544.733658\n1.473933e+01\n-4909.254062\n1.688963\n3.873848e-02\nSE\n5749\n6.963833e-06\n1.699495e-03\n0.002615\n9853.135249\n-5025.230150\n115.976088\n0.0\n0.0\n\n\nCar14\n6.721518e-02\n4\nCar14\n544.733658\n1.345078e+01\n-4078.211327\n1.646634\n3.422252e-02\nSE\n5749\n5.827188e-06\n1.224504e-03\n0.002738\n8191.049780\n-4232.446324\n154.234997\n0.0\n0.0\n\n\nHmgcs2\n1.04370
6e-01\n4\nHmgcs2\n544.733658\n8.317319e+00\n64.273783\n-0.697246\n9.799863e-03\nSE\n5749\n1.273224e-05\n1.280151e-03\n0.002757\n-93.920441\n-106.666922\n170.940704\n0.0\n0.0\n\n\nAtp1a1\n1.951715e-01\n4\nAtp1a1\n544.733658\n3.996873e+00\n-2655.884269\n-1.945325\n6.109633e-02\nSE\n5749\n1.387274e-05\n5.578607e-04\n0.002461\n5346.395662\n-3228.572142\n572.687873\n0.0\n0.0\n\n\nVangl1\n1.997762e-09\n4\nVangl1\n544.733658\n4.851652e+08\n523.655680\n-0.633082\n9.297993e-10\nSE\n5749\n5.712059e-09\n1.036289e+09\n0.011918\n-1012.684235\n435.773794\n87.881886\n0.0\n0.0\n\n\nTspan2\n2.008700e-01\n4\nTspan2\n544.733658\n3.855987e+00\n-4904.385928\n2.862779\n1.358988e-01\nSE\n5749\n1.765625e-05\n6.842354e-04\n0.002182\n9843.398981\n-5358.727526\n454.341598\n0.0\n0.0" + "text": "3 Analysis\nWe will proceed with the data in a very similar manner to scRNA-seq data.\nAs we have two sections, we will select variable genes with batch_key=“library_id” and then take the union of variable genes for further analysis. The idea is to avoid including batch specific genes in the analysis.\n\n# save the counts to a separate object for later, we need the normalized counts in raw for DEG dete\ncounts_adata = adata.copy()\n\nsc.pp.normalize_total(adata, inplace=True)\nsc.pp.log1p(adata)\n# take 1500 variable genes per batch and then use the union of them.\nsc.pp.highly_variable_genes(adata, flavor=\"seurat\", n_top_genes=1500, inplace=True, batch_key=\"library_id\")\n\n# subset for variable genes\nadata.raw = adata\nadata = adata[:,adata.var.highly_variable_nbatches > 0]\n\n# scale data\nsc.pp.scale(adata)\n\nnormalizing counts per cell\n finished (0:00:00)\nIf you pass `n_top_genes`, all cutoffs are ignored.\nextracting highly variable genes\n finished (0:00:02)\n--> added\n 'highly_variable', boolean vector (adata.var)\n 'means', float vector (adata.var)\n 'dispersions', float vector (adata.var)\n 'dispersions_norm', float vector (adata.var)\n... 
as `zero_center=True`, sparse input is densified and may lead to large memory consumption\n\n\nNow we can plot gene expression of individual genes, the gene Hpca is a strong hippocampal marker and Ttr is a marker of the choroid plexus.\n\nfor library in library_names:\n sc.pl.spatial(adata[adata.obs.library_id == library,:], library_id=library, color = [\"Ttr\", \"Hpca\"])\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n3.1 Dimensionality reduction and clustering\nWe can then now run dimensionality reduction and clustering using the same workflow as we use for scRNA-seq analysis.\n\nsc.pp.neighbors(adata)\nsc.tl.umap(adata)\nsc.tl.leiden(adata, key_added=\"clusters\")\n\ncomputing neighbors\nWARNING: You’re trying to run this on 2405 dimensions of `.X`, if you really want this, set `use_rep='X'`.\n Falling back to preprocessing with `sc.pp.pca` and default params.\ncomputing PCA\n with n_comps=50\n finished (0:00:00)\n finished: added to `.uns['neighbors']`\n `.obsp['distances']`, distances for each pair of neighbors\n `.obsp['connectivities']`, weighted adjacency matrix (0:00:08)\ncomputing UMAP\n finished: added\n 'X_umap', UMAP coordinates (adata.obsm) (0:00:09)\nrunning Leiden clustering\n finished: found 23 clusters and added\n 'clusters', the cluster labels (adata.obs, categorical) (0:00:01)\n\n\nWe can then plot clusters onto umap or onto the tissue section.\n\nsc.pl.umap(\n adata, color=[\"clusters\", \"library_id\"], palette=sc.pl.palettes.default_20\n)\n\nWARNING: Length of palette colors is smaller than the number of categories (palette length: 20, categories length: 23. 
Some categories will have the same color.\n\n\n\n\n\n\n\n\n\nAs we are plotting the two sections separately, we need to make sure that they get the same colors by fetching cluster colors from a dict.\n\nclusters_colors = dict(\n zip([str(i) for i in range(len(adata.obs.clusters.cat.categories))], adata.uns[\"clusters_colors\"])\n)\n\nfig, axs = plt.subplots(1, 2, figsize=(15, 10))\n\nfor i, library in enumerate(\n [\"V1_Mouse_Brain_Sagittal_Anterior\", \"V1_Mouse_Brain_Sagittal_Posterior\"]\n):\n ad = adata[adata.obs.library_id == library, :].copy()\n sc.pl.spatial(\n ad,\n img_key=\"hires\",\n library_id=library,\n color=\"clusters\",\n size=1.5,\n palette=[\n v\n for k, v in clusters_colors.items()\n if k in ad.obs.clusters.unique().tolist()\n ],\n legend_loc=None,\n show=False,\n ax=axs[i],\n )\n\nplt.tight_layout()\n\n\n\n\n\n\n\n\n\n\n3.2 Integration\nQuite often there are strong batch effects between different ST sections, so it may be a good idea to integrate the data across sections.\nWe will do a similar integration as in the Data Integration lab, here we will use Scanorama for integration.\n\nadatas = {}\nfor batch in library_names:\n adatas[batch] = adata[adata.obs['library_id'] == batch,]\n\nadatas \n\n{'V1_Mouse_Brain_Sagittal_Anterior': View of AnnData object with n_obs × n_vars = 2597 × 2405\n obs: 'in_tissue', 'array_row', 'array_col', 'library_id', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_hb', 'pct_counts_hb', 'clusters'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection', 'mean', 'std'\n uns: 'spatial', 'library_id_colors', 'log1p', 'hvg', 'neighbors', 'umap', 'leiden', 'clusters_colors'\n obsm: 'spatial', 'X_pca', 'X_umap'\n obsp: 'distances', 'connectivities',\n 'V1_Mouse_Brain_Sagittal_Posterior': 
View of AnnData object with n_obs × n_vars = 3152 × 2405\n obs: 'in_tissue', 'array_row', 'array_col', 'library_id', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_hb', 'pct_counts_hb', 'clusters'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection', 'mean', 'std'\n uns: 'spatial', 'library_id_colors', 'log1p', 'hvg', 'neighbors', 'umap', 'leiden', 'clusters_colors'\n obsm: 'spatial', 'X_pca', 'X_umap'\n obsp: 'distances', 'connectivities'}\n\n\n\nimport scanorama\n\n#convert to list of AnnData objects\nadatas = list(adatas.values())\n\n# run scanorama.integrate\nscanorama.integrate_scanpy(adatas, dimred = 50)\n\n# Get all the integrated matrices.\nscanorama_int = [ad.obsm['X_scanorama'] for ad in adatas]\n\n# make into one matrix.\nall_s = np.concatenate(scanorama_int)\nprint(all_s.shape)\n\n# add to the AnnData object\nadata.obsm[\"Scanorama\"] = all_s\n\nadata\n\nFound 2405 genes among all datasets\n[[0. 0.47824413]\n [0. 0. 
]]\nProcessing datasets (0, 1)\n(5749, 50)\n\n\nAnnData object with n_obs × n_vars = 5749 × 2405\n obs: 'in_tissue', 'array_row', 'array_col', 'library_id', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_hb', 'pct_counts_hb', 'clusters'\n var: 'gene_ids', 'feature_types', 'genome', 'mt', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection', 'mean', 'std'\n uns: 'spatial', 'library_id_colors', 'log1p', 'hvg', 'neighbors', 'umap', 'leiden', 'clusters_colors'\n obsm: 'spatial', 'X_pca', 'X_umap', 'Scanorama'\n obsp: 'distances', 'connectivities'\n\n\nThen we run dimensionality reduction and clustering as before.\n\nsc.pp.neighbors(adata, use_rep=\"Scanorama\")\nsc.tl.umap(adata)\nsc.tl.leiden(adata, key_added=\"clusters\")\n\nsc.pl.umap(\n adata, color=[\"clusters\", \"library_id\"], palette=sc.pl.palettes.default_20\n)\n\ncomputing neighbors\n finished: added to `.uns['neighbors']`\n `.obsp['distances']`, distances for each pair of neighbors\n `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)\ncomputing UMAP\n finished: added\n 'X_umap', UMAP coordinates (adata.obsm) (0:00:08)\nrunning Leiden clustering\n finished: found 19 clusters and added\n 'clusters', the cluster labels (adata.obs, categorical) (0:00:01)\n\n\n\n\n\n\n\n\n\nAs we have new clusters, we again need to make a new dict for cluster colors\n\nclusters_colors = dict(\n zip([str(i) for i in range(len(adata.obs.clusters.cat.categories))], adata.uns[\"clusters_colors\"])\n)\n\nfig, axs = plt.subplots(1, 2, figsize=(15, 10))\n\nfor i, library in enumerate(\n [\"V1_Mouse_Brain_Sagittal_Anterior\", \"V1_Mouse_Brain_Sagittal_Posterior\"]\n):\n ad = adata[adata.obs.library_id == library, :].copy()\n sc.pl.spatial(\n ad,\n img_key=\"hires\",\n library_id=library,\n color=\"clusters\",\n size=1.5,\n palette=[\n 
v\n for k, v in clusters_colors.items()\n if k in ad.obs.clusters.unique().tolist()\n ],\n legend_loc=\"on data\",\n show=False,\n ax=axs[i],\n )\n\nplt.tight_layout()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nDo you see any differences between the integrated and non-integrated clustering? Judge for yourself, which of the clusterings do you think looks best? As a reference, you can compare to brain regions in the Allen brain atlas.\n\n\n\n\n3.3 Spatially Variable Features\nThere are two main workflows to identify molecular features that correlate with spatial location within a tissue. The first is to perform differential expression based on spatially distinct clusters, the other is to find features that have spatial patterning without taking clusters or spatial annotation into account. First, we will do differential expression between clusters just as we did for the scRNAseq data before.\n\n# run Wilcoxon rank-sum test\nsc.tl.rank_genes_groups(adata, \"clusters\", method=\"wilcoxon\")\n# plot as heatmap for cluster 5 genes\nsc.pl.rank_genes_groups_heatmap(adata, groups=\"5\", n_genes=10, groupby=\"clusters\")\n\nranking genes\n finished: added to `.uns['rank_genes_groups']`\n 'names', sorted np.recarray to be indexed by group ids\n 'scores', sorted np.recarray to be indexed by group ids\n 'logfoldchanges', sorted np.recarray to be indexed by group ids\n 'pvals', sorted np.recarray to be indexed by group ids\n 'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:14)\nWARNING: dendrogram data not found (using key=dendrogram_clusters). Running `sc.tl.dendrogram` with default parameters. 
For fine tuning it is recommended to run `sc.tl.dendrogram` independently.\n using 'X_pca' with n_pcs = 50\nStoring dendrogram info using `.uns['dendrogram_clusters']`\nWARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.\ncategories: 0, 1, 2, etc.\nvar_group_labels: 5\n\n\n\n\n\n\n\n\n\n\n# plot onto spatial location\ntop_genes = sc.get.rank_genes_groups_df(adata, group='5',log2fc_min=0)['names'][:3]\n\nfor library in [\"V1_Mouse_Brain_Sagittal_Anterior\", \"V1_Mouse_Brain_Sagittal_Posterior\"]:\n sc.pl.spatial(adata[adata.obs.library_id == library,:], library_id=library, color = top_genes)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSpatial transcriptomics allows researchers to investigate how gene expression trends vary in space, thus identifying spatial patterns of gene expression. For this purpose there are multiple methods, such as SpatialDE, SPARK, Trendsceek, HMRF and Splotch.\nWe use SpatialDE (Svensson et al.), a Gaussian process-based statistical framework that aims to identify spatially variable genes.\n\nTakes a long time to run, so skip this step for now and download the precomputed file.\n\n\n# slow step\n\nimport NaiveDE\nimport SpatialDE\n\ncounts = sc.get.obs_df(adata, keys=list(adata.var_names), use_raw=True)\ntotal_counts = sc.get.obs_df(adata, keys=[\"total_counts\"])\nnorm_expr = NaiveDE.stabilize(counts.T).T\nresid_expr = NaiveDE.regress_out(\n total_counts, norm_expr.T, \"np.log(total_counts)\").T\nresults = SpatialDE.run(adata.obsm[\"spatial\"], resid_expr)\n\nimport pickle\nwith open('data/spatial/visium/scanpy_spatialde.pkl', 'wb') as file:\n pickle.dump(results, file)\n\n\nimport urllib.request\nimport os\n\npath_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\npath_file = \"data/spatial/visium/scanpy_spatialde.pkl\"\nif not os.path.exists(path_file):\n file_url = os.path.join(\n path_data, \"spatial/visium/results/scanpy_spatialde.pkl\")\n 
urllib.request.urlretrieve(file_url, path_file)\n\n\nimport pickle\nwith open('data/spatial/visium/scanpy_spatialde.pkl', 'rb') as file:\n results = pickle.load(file)\n\n\n# We concatenate the results with the DataFrame of annotations of variables: `adata.var`.\nresults.index = results[\"g\"]\nadata.var = pd.concat(\n [adata.var, results.loc[adata.var.index.values, :]], axis=1)\nadata.write_h5ad('./data/spatial/visium/adata_processed_sc.h5ad')\n\n# Then we can inspect significant genes that varies in space and visualize them with `sc.pl.spatial` function.\nresults.sort_values(\"qval\").head(10)\n\n\n\n\n\n\n\n\nFSV\nM\ng\nl\nmax_delta\nmax_ll\nmax_mu_hat\nmax_s2_t_hat\nmodel\nn\ns2_FSV\ns2_logdelta\ntime\nBIC\nmax_ll_null\nLLR\npval\nqval\n\n\ng\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nEfnb3\n1.775241e-01\n4\nEfnb3\n544.733658\n4.490546e+00\n-4701.690768\n0.920436\n6.950994e-02\nSE\n5749\n8.838199e-06\n4.029374e-04\n0.002817\n9438.008662\n-5374.163817\n672.473049\n0.0\n0.0\n\n\nS100a16\n5.205988e-02\n4\nS100a16\n544.733658\n1.764863e+01\n-4013.533873\n1.703474\n3.034361e-02\nSE\n5749\n7.639943e-06\n2.521451e-03\n0.002874\n8061.694871\n-4099.970498\n86.436625\n0.0\n0.0\n\n\nS100a5\n3.082312e-01\n4\nS100a5\n544.733658\n2.175292e+00\n-989.712703\n-0.692740\n3.893413e-02\nSE\n5749\n1.244686e-05\n3.006748e-04\n0.006836\n2014.052530\n-2128.916543\n1139.203841\n0.0\n0.0\n\n\nS100a6\n1.198049e-01\n4\nS100a6\n544.733658\n7.120948e+00\n-4911.277757\n0.024394\n4.364682e-02\nSE\n5749\n1.525869e-05\n1.234171e-03\n0.005581\n9857.182640\n-5087.893957\n176.616199\n0.0\n0.0\n\n\nCers2\n6.170160e-02\n4\nCers2\n544.733658\n1.473933e+01\n-4909.254062\n1.688963\n3.873848e-02\nSE\n5749\n6.963833e-06\n1.699495e-03\n0.002615\n9853.135249\n-5025.230150\n115.976088\n0.0\n0.0\n\n\nCar14\n6.721518e-02\n4\nCar14\n544.733658\n1.345078e+01\n-4078.211327\n1.646634\n3.422252e-02\nSE\n5749\n5.827188e-06\n1.224504e-03\n0.002738\n8191.049780\n-4232.446324\n154.234997\n0.0\n0.0\n\n\nHmgcs2\n1.04370
6e-01\n4\nHmgcs2\n544.733658\n8.317319e+00\n64.273783\n-0.697246\n9.799863e-03\nSE\n5749\n1.273224e-05\n1.280151e-03\n0.002757\n-93.920441\n-106.666922\n170.940704\n0.0\n0.0\n\n\nAtp1a1\n1.951715e-01\n4\nAtp1a1\n544.733658\n3.996873e+00\n-2655.884269\n-1.945325\n6.109633e-02\nSE\n5749\n1.387274e-05\n5.578607e-04\n0.002461\n5346.395662\n-3228.572142\n572.687873\n0.0\n0.0\n\n\nVangl1\n1.997762e-09\n4\nVangl1\n544.733658\n4.851652e+08\n523.655680\n-0.633082\n9.297993e-10\nSE\n5749\n5.712059e-09\n1.036289e+09\n0.011918\n-1012.684235\n435.773794\n87.881886\n0.0\n0.0\n\n\nTspan2\n2.008700e-01\n4\nTspan2\n544.733658\n3.855987e+00\n-4904.385928\n2.862779\n1.358988e-01\nSE\n5749\n1.765625e-05\n6.842354e-04\n0.002182\n9843.398981\n-5358.727526\n454.341598\n0.0\n0.0" }, { "objectID": "labs/scanpy/scanpy_08_spatial.html#meta-st_ss", "href": "labs/scanpy/scanpy_08_spatial.html#meta-st_ss", "title": " Spatial Transcriptomics", "section": "4 Single cell data", - "text": "4 Single cell data\nWe can use a scRNA-seq dataset as a reference to predict the proportion of different celltypes in the Visium spots. Keep in mind that it is important to have a reference that contains all the celltypes you expect to find in your spots. Ideally it should be a scRNA-seq reference from the exact same tissue. We will use a reference scRNA-seq dataset of ~14,000 adult mouse cortical cell taxonomy from the Allen Institute, generated with the SMART-Seq2 protocol.\nConveniently, you can also download the pre-processed dataset in h5ad format from here. 
Here with Python code:\n\nimport urllib.request\nimport os\n\npath_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\npath_file = \"data/spatial/visium/allen_cortex.h5ad\"\nif not os.path.exists(path_file):\n file_url = os.path.join(\n path_data, \"spatial/visium/allen_cortex.h5ad\")\n urllib.request.urlretrieve(file_url, path_file)\n\n\nadata_cortex = sc.read_h5ad(\"data/spatial/visium/allen_cortex.h5ad\")\nadata_cortex\n\nAnnData object with n_obs × n_vars = 14249 × 34617\n obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'sample_id', 'sample_type', 'organism', 'donor', 'sex', 'age_days', 'eye_condition', 'genotype', 'driver_lines', 'reporter_lines', 'brain_hemisphere', 'brain_region', 'brain_subregion', 'injection_label_direction', 'injection_primary', 'injection_secondary', 'injection_tract', 'injection_material', 'injection_exclusion_criterion', 'facs_date', 'facs_container', 'facs_sort_criteria', 'rna_amplification_set', 'library_prep_set', 'library_prep_avg_size_bp', 'seq_name', 'seq_tube', 'seq_batch', 'total_reads', 'percent_exon_reads', 'percent_intron_reads', 'percent_intergenic_reads', 'percent_rrna_reads', 'percent_mt_exon_reads', 'percent_reads_unique', 'percent_synth_reads', 'percent_ecoli_reads', 'percent_aligned_reads_total', 'complexity_cg', 'genes_detected_cpm_criterion', 'genes_detected_fpkm_criterion', 'tdt_cpm', 'gfp_cpm', 'class', 'subclass', 'cluster', 'confusion_score', 'cluster_correlation', 'core_intermediate_call'\n var: 'features'\n\n\n\nadata_cortex.obs\n\n\n\n\n\n\n\n\norig.ident\nnCount_RNA\nnFeature_RNA\nsample_id\nsample_type\norganism\ndonor\nsex\nage_days\neye_condition\n...\ngenes_detected_cpm_criterion\ngenes_detected_fpkm_criterion\ntdt_cpm\ngfp_cpm\nclass\nsubclass\ncluster\nconfusion_score\ncluster_correlation\ncore_intermediate_call\n\n\n\n\nF1S4_160108_001_A01\n0\n1730700.0\n9029\n527128530\nCells\nMus musculus\n225675\nM\n53\nNormal\n...\n10445\n9222\n248.86\n248.86\nGABAergic\nVip\nVip 
Arhgap36 Hmcn1\n0.4385\n0.837229\nIntermediate\n\n\nF1S4_160108_001_B01\n0\n1909605.0\n10207\n527128536\nCells\nMus musculus\n225675\nM\n53\nNormal\n...\n11600\n10370\n289.61\n289.61\nGABAergic\nLamp5\nLamp5 Lsp1\n0.1025\n0.878743\nCore\n\n\nF1S4_160108_001_C01\n0\n1984948.0\n10578\n527128542\nCells\nMus musculus\n225675\nM\n53\nNormal\n...\n11848\n10734\n281.06\n281.06\nGABAergic\nLamp5\nLamp5 Lsp1\n0.0195\n0.887084\nCore\n\n\nF1S4_160108_001_D01\n0\n2291552.0\n8482\n527128548\nCells\nMus musculus\n225675\nM\n53\nNormal\n...\n9494\n8561\n390.02\n390.02\nGABAergic\nVip\nVip Crispld2 Htr2c\n0.2734\n0.843552\nCore\n\n\nF1S4_160108_001_E01\n0\n1757463.0\n8697\n527128554\nCells\nMus musculus\n225675\nM\n53\nNormal\n...\n10012\n8791\n253.92\n253.92\nGABAergic\nLamp5\nLamp5 Plch2 Dock5\n0.7532\n0.854994\nCore\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n\n\nFYS4_171004_104_C01\n2\n949356.0\n9141\n645142562\nCells\nMus musculus\n350650\nM\n51\nNormal\n...\n9629\n9229\n432.15\n432.15\nGlutamatergic\nL5 PT\nL5 PT VISp C1ql2 Cdh13\n0.0477\n0.885255\nCore\n\n\nFYS4_171004_104_D01\n2\n998736.0\n6927\n645142573\nCells\nMus musculus\n350650\nM\n51\nNormal\n...\n7701\n7023\n217.83\n217.83\nGABAergic\nSst\nSst Hpse Sema3c\n0.1064\n0.854499\nCore\n\n\nFYS4_171004_104_F01\n2\n1002766.0\n6936\n645142613\nCells\nMus musculus\n350650\nM\n51\nNormal\n...\n7888\n7054\n91.88\n91.88\nGlutamatergic\nL5 PT\nL5 PT VISp Chrna6\n0.0095\n0.822625\nCore\n\n\nFYS4_171004_104_G01\n2\n1025804.0\n8027\n645142648\nCells\nMus musculus\n350650\nM\n51\nNormal\n...\n8933\n8146\n127.77\n127.77\nGABAergic\nSst\nSst Calb2 Pdlim5\n0.2852\n0.856322\nCore\n\n\nFYS4_171004_104_H01\n2\n882435.0\n6574\n645142673\nCells\nMus musculus\n350650\nM\n51\nNormal\n...\n7393\n6687\n310.17\n310.17\nGABAergic\nPvalb\nPvalb Reln Tac1\n0.6089\n0.799198\nCore\n\n\n\n\n14249 rows × 52 columns\n\n\n\n\nsc.pp.normalize_total(adata_cortex, 
target_sum=1e5)\nsc.pp.log1p(adata_cortex)\nsc.pp.highly_variable_genes(adata_cortex, min_mean=0.0125, max_mean=3, min_disp=0.5)\nsc.pp.scale(adata_cortex, max_value=10)\nsc.tl.pca(adata_cortex, svd_solver='arpack')\nsc.pp.neighbors(adata_cortex, n_pcs=30)\nsc.tl.umap(adata_cortex)\nsc.pl.umap(adata_cortex, color=\"subclass\", legend_loc='on data')\n\nnormalizing counts per cell\n finished (0:00:01)\nextracting highly variable genes\n finished (0:00:06)\n--> added\n 'highly_variable', boolean vector (adata.var)\n 'means', float vector (adata.var)\n 'dispersions', float vector (adata.var)\n 'dispersions_norm', float vector (adata.var)\n... as `zero_center=True`, sparse input is densified and may lead to large memory consumption\ncomputing PCA\n on highly variable genes\n with n_comps=50\n finished (0:00:10)\ncomputing neighbors\n using 'X_pca' with n_pcs = 30\n finished: added to `.uns['neighbors']`\n `.obsp['distances']`, distances for each pair of neighbors\n `.obsp['connectivities']`, weighted adjacency matrix (0:00:11)\ncomputing UMAP\n finished: added\n 'X_umap', UMAP coordinates (adata.obsm) (0:00:12)\n\n\n\n\n\n\n\n\n\n\nadata_cortex.obs.subclass.value_counts()\n\nsubclass\nL6 IT 1872\nSst 1741\nVip 1728\nL4 1401\nPvalb 1337\nLamp5 1122\nL2/3 IT 982\nL6 CT 960\nL5 IT 880\nL5 PT 544\nAstro 368\nNP 362\nL6b 358\nSncg 125\nEndo 94\nOligo 91\nVLMC 67\nSMC 55\nMacrophage 51\nMeis2 45\nPeri 32\nSerpinf1 27\nCR 7\nName: count, dtype: int64\n\n\nFor speed, and for a more fair comparison of the celltypes, we will subsample all celltypes to a maximum of 200 cells per class (subclass).\n\ntarget_cells = 200\n\nadatas2 = [adata_cortex[adata_cortex.obs.subclass == clust] for clust in adata_cortex.obs.subclass.cat.categories]\n\nfor dat in adatas2:\n if dat.n_obs > target_cells:\n sc.pp.subsample(dat, n_obs=target_cells)\n\nadata_cortex = adatas2[0].concatenate(*adatas2[1:])\n\nadata_cortex.obs.subclass.value_counts()\n\nsubclass\nAstro 200\nL6 IT 200\nSst 200\nPvalb 
200\nNP 200\nLamp5 200\nL6b 200\nVip 200\nL6 CT 200\nL5 PT 200\nL5 IT 200\nL4 200\nL2/3 IT 200\nSncg 125\nEndo 94\nOligo 91\nVLMC 67\nSMC 55\nMacrophage 51\nMeis2 45\nPeri 32\nSerpinf1 27\nCR 7\nName: count, dtype: int64\n\n\n\nsc.pl.umap(\n adata_cortex, color=[\"class\", \"subclass\", \"genotype\", \"brain_region\"], palette=sc.pl.palettes.default_28\n)\n\nWARNING: Length of palette colors is smaller than the number of categories (palette length: 28, categories length: 61. Some categories will have the same color.\n\n\n\n\n\n\n\n\n\n\nsc.pl.umap(adata_cortex, color=\"subclass\", legend_loc = 'on data')" + "text": "4 Single cell data\nWe can use a scRNA-seq dataset as a reference to predict the proportion of different celltypes in the Visium spots. Keep in mind that it is important to have a reference that contains all the celltypes you expect to find in your spots. Ideally it should be a scRNA-seq reference from the exact same tissue. We will use a reference scRNA-seq dataset of ~14,000 adult mouse cortical cell taxonomy from the Allen Institute, generated with the SMART-Seq2 protocol.\nConveniently, you can also download the pre-processed dataset in h5ad format from here. 
Here with Python code:\n\nimport urllib.request\nimport os\n\npath_data = \"https://export.uppmax.uu.se/naiss2023-23-3/workshops/workshop-scrnaseq\"\n\npath_file = \"data/spatial/visium/allen_cortex.h5ad\"\nif not os.path.exists(path_file):\n file_url = os.path.join(\n path_data, \"spatial/visium/allen_cortex.h5ad\")\n urllib.request.urlretrieve(file_url, path_file)\n\n\nadata_cortex = sc.read_h5ad(\"data/spatial/visium/allen_cortex.h5ad\")\nadata_cortex\n\nAnnData object with n_obs × n_vars = 14249 × 34617\n obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'sample_id', 'sample_type', 'organism', 'donor', 'sex', 'age_days', 'eye_condition', 'genotype', 'driver_lines', 'reporter_lines', 'brain_hemisphere', 'brain_region', 'brain_subregion', 'injection_label_direction', 'injection_primary', 'injection_secondary', 'injection_tract', 'injection_material', 'injection_exclusion_criterion', 'facs_date', 'facs_container', 'facs_sort_criteria', 'rna_amplification_set', 'library_prep_set', 'library_prep_avg_size_bp', 'seq_name', 'seq_tube', 'seq_batch', 'total_reads', 'percent_exon_reads', 'percent_intron_reads', 'percent_intergenic_reads', 'percent_rrna_reads', 'percent_mt_exon_reads', 'percent_reads_unique', 'percent_synth_reads', 'percent_ecoli_reads', 'percent_aligned_reads_total', 'complexity_cg', 'genes_detected_cpm_criterion', 'genes_detected_fpkm_criterion', 'tdt_cpm', 'gfp_cpm', 'class', 'subclass', 'cluster', 'confusion_score', 'cluster_correlation', 'core_intermediate_call'\n var: 'features'\n\n\n\nadata_cortex.obs\n\n\n\n\n\n\n\n\norig.ident\nnCount_RNA\nnFeature_RNA\nsample_id\nsample_type\norganism\ndonor\nsex\nage_days\neye_condition\n...\ngenes_detected_cpm_criterion\ngenes_detected_fpkm_criterion\ntdt_cpm\ngfp_cpm\nclass\nsubclass\ncluster\nconfusion_score\ncluster_correlation\ncore_intermediate_call\n\n\n\n\nF1S4_160108_001_A01\n0\n1730700.0\n9029\n527128530\nCells\nMus musculus\n225675\nM\n53\nNormal\n...\n10445\n9222\n248.86\n248.86\nGABAergic\nVip\nVip 
Arhgap36 Hmcn1\n0.4385\n0.837229\nIntermediate\n\n\nF1S4_160108_001_B01\n0\n1909605.0\n10207\n527128536\nCells\nMus musculus\n225675\nM\n53\nNormal\n...\n11600\n10370\n289.61\n289.61\nGABAergic\nLamp5\nLamp5 Lsp1\n0.1025\n0.878743\nCore\n\n\nF1S4_160108_001_C01\n0\n1984948.0\n10578\n527128542\nCells\nMus musculus\n225675\nM\n53\nNormal\n...\n11848\n10734\n281.06\n281.06\nGABAergic\nLamp5\nLamp5 Lsp1\n0.0195\n0.887084\nCore\n\n\nF1S4_160108_001_D01\n0\n2291552.0\n8482\n527128548\nCells\nMus musculus\n225675\nM\n53\nNormal\n...\n9494\n8561\n390.02\n390.02\nGABAergic\nVip\nVip Crispld2 Htr2c\n0.2734\n0.843552\nCore\n\n\nF1S4_160108_001_E01\n0\n1757463.0\n8697\n527128554\nCells\nMus musculus\n225675\nM\n53\nNormal\n...\n10012\n8791\n253.92\n253.92\nGABAergic\nLamp5\nLamp5 Plch2 Dock5\n0.7532\n0.854994\nCore\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n\n\nFYS4_171004_104_C01\n2\n949356.0\n9141\n645142562\nCells\nMus musculus\n350650\nM\n51\nNormal\n...\n9629\n9229\n432.15\n432.15\nGlutamatergic\nL5 PT\nL5 PT VISp C1ql2 Cdh13\n0.0477\n0.885255\nCore\n\n\nFYS4_171004_104_D01\n2\n998736.0\n6927\n645142573\nCells\nMus musculus\n350650\nM\n51\nNormal\n...\n7701\n7023\n217.83\n217.83\nGABAergic\nSst\nSst Hpse Sema3c\n0.1064\n0.854499\nCore\n\n\nFYS4_171004_104_F01\n2\n1002766.0\n6936\n645142613\nCells\nMus musculus\n350650\nM\n51\nNormal\n...\n7888\n7054\n91.88\n91.88\nGlutamatergic\nL5 PT\nL5 PT VISp Chrna6\n0.0095\n0.822625\nCore\n\n\nFYS4_171004_104_G01\n2\n1025804.0\n8027\n645142648\nCells\nMus musculus\n350650\nM\n51\nNormal\n...\n8933\n8146\n127.77\n127.77\nGABAergic\nSst\nSst Calb2 Pdlim5\n0.2852\n0.856322\nCore\n\n\nFYS4_171004_104_H01\n2\n882435.0\n6574\n645142673\nCells\nMus musculus\n350650\nM\n51\nNormal\n...\n7393\n6687\n310.17\n310.17\nGABAergic\nPvalb\nPvalb Reln Tac1\n0.6089\n0.799198\nCore\n\n\n\n\n14249 rows × 52 columns\n\n\n\n\nsc.pp.normalize_total(adata_cortex, 
target_sum=1e5)\nsc.pp.log1p(adata_cortex)\nsc.pp.highly_variable_genes(adata_cortex, min_mean=0.0125, max_mean=3, min_disp=0.5)\nsc.pp.scale(adata_cortex, max_value=10)\nsc.tl.pca(adata_cortex, svd_solver='arpack')\nsc.pp.neighbors(adata_cortex, n_pcs=30)\nsc.tl.umap(adata_cortex)\nsc.pl.umap(adata_cortex, color=\"subclass\", legend_loc='on data')\n\nnormalizing counts per cell\n finished (0:00:00)\nextracting highly variable genes\n finished (0:00:03)\n--> added\n 'highly_variable', boolean vector (adata.var)\n 'means', float vector (adata.var)\n 'dispersions', float vector (adata.var)\n 'dispersions_norm', float vector (adata.var)\n... as `zero_center=True`, sparse input is densified and may lead to large memory consumption\ncomputing PCA\n on highly variable genes\n with n_comps=50\n finished (0:00:06)\ncomputing neighbors\n using 'X_pca' with n_pcs = 30\n finished: added to `.uns['neighbors']`\n `.obsp['distances']`, distances for each pair of neighbors\n `.obsp['connectivities']`, weighted adjacency matrix (0:00:11)\ncomputing UMAP\n finished: added\n 'X_umap', UMAP coordinates (adata.obsm) (0:00:14)\n\n\n\n\n\n\n\n\n\n\nadata_cortex.obs.subclass.value_counts()\n\nsubclass\nL6 IT 1872\nSst 1741\nVip 1728\nL4 1401\nPvalb 1337\nLamp5 1122\nL2/3 IT 982\nL6 CT 960\nL5 IT 880\nL5 PT 544\nAstro 368\nNP 362\nL6b 358\nSncg 125\nEndo 94\nOligo 91\nVLMC 67\nSMC 55\nMacrophage 51\nMeis2 45\nPeri 32\nSerpinf1 27\nCR 7\nName: count, dtype: int64\n\n\nFor speed, and for a more fair comparison of the celltypes, we will subsample all celltypes to a maximum of 200 cells per class (subclass).\n\ntarget_cells = 200\n\nadatas2 = [adata_cortex[adata_cortex.obs.subclass == clust] for clust in adata_cortex.obs.subclass.cat.categories]\n\nfor dat in adatas2:\n if dat.n_obs > target_cells:\n sc.pp.subsample(dat, n_obs=target_cells)\n\nadata_cortex = adatas2[0].concatenate(*adatas2[1:])\n\nadata_cortex.obs.subclass.value_counts()\n\nsubclass\nAstro 200\nL6 IT 200\nSst 200\nPvalb 
200\nNP 200\nLamp5 200\nL6b 200\nVip 200\nL6 CT 200\nL5 PT 200\nL5 IT 200\nL4 200\nL2/3 IT 200\nSncg 125\nEndo 94\nOligo 91\nVLMC 67\nSMC 55\nMacrophage 51\nMeis2 45\nPeri 32\nSerpinf1 27\nCR 7\nName: count, dtype: int64\n\n\n\nsc.pl.umap(\n adata_cortex, color=[\"class\", \"subclass\", \"genotype\", \"brain_region\"], palette=sc.pl.palettes.default_28\n)\n\nWARNING: Length of palette colors is smaller than the number of categories (palette length: 28, categories length: 61. Some categories will have the same color.\n\n\n\n\n\n\n\n\n\n\nsc.pl.umap(adata_cortex, color=\"subclass\", legend_loc = 'on data')" }, { "objectID": "labs/scanpy/scanpy_08_spatial.html#meta-st_sub", @@ -1383,7 +1383,7 @@ "href": "labs/scanpy/scanpy_08_spatial.html#meta-session", "title": " Spatial Transcriptomics", "section": "7 Session info", - "text": "7 Session info\n\n\nClick here\n\n\nsc.logging.print_versions()\n\n-----\nanndata 0.10.3\nscanpy 1.9.6\n-----\nPIL 10.0.0\nannoy NA\nanyio NA\nasttokens NA\nattr 23.1.0\nbabel 2.12.1\nbackcall 0.2.0\ncertifi 2023.11.17\ncffi 1.15.1\ncharset_normalizer 3.1.0\ncolorama 0.4.6\ncomm 0.1.3\ncycler 0.12.1\ncython_runtime NA\ndateutil 2.8.2\ndebugpy 1.6.7\ndecorator 5.1.1\ndefusedxml 0.7.1\nexceptiongroup 1.2.0\nexecuting 1.2.0\nfastjsonschema NA\nfbpca NA\ngmpy2 2.1.2\nh5py 3.9.0\nidna 3.4\nigraph 0.10.8\nintervaltree NA\nipykernel 6.23.1\nipython_genutils 0.2.0\njedi 0.18.2\njinja2 3.1.2\njoblib 1.3.2\njson5 NA\njsonpointer 2.0\njsonschema 4.17.3\njupyter_events 0.6.3\njupyter_server 2.6.0\njupyterlab_server 2.22.1\nkiwisolver 1.4.5\nleidenalg 0.10.1\nllvmlite 0.41.1\nlouvain 0.8.1\nmarkupsafe 2.1.2\nmatplotlib 3.8.0\nmatplotlib_inline 0.1.6\nmpl_toolkits NA\nmpmath 1.3.0\nnatsort 8.4.0\nnbformat 5.8.0\nnumba 0.58.1\nnumpy 1.26.2\nopt_einsum v3.3.0\noverrides NA\npackaging 23.1\npandas 2.1.4\nparso 0.8.3\npatsy 0.5.5\npexpect 4.8.0\npickleshare 0.7.5\npkg_resources NA\nplatformdirs 3.5.1\nprometheus_client NA\nprompt_toolkit 3.0.38\npsutil 
5.9.5\nptyprocess 0.7.0\npure_eval 0.2.2\npvectorc NA\npycparser 2.21\npydev_ipython NA\npydevconsole NA\npydevd 2.9.5\npydevd_file_utils NA\npydevd_plugins NA\npydevd_tracing NA\npygments 2.15.1\npynndescent 0.5.11\npyparsing 3.1.1\npyrsistent NA\npythonjsonlogger NA\npytz 2023.3\nrequests 2.31.0\nrfc3339_validator 0.1.4\nrfc3986_validator 0.1.1\nscanorama 1.7.4\nscipy 1.11.4\nseaborn 0.12.2\nsend2trash NA\nsession_info 1.0.0\nsix 1.16.0\nsklearn 1.3.2\nsniffio 1.3.0\nsocks 1.7.1\nsortedcontainers 2.4.0\nsparse 0.14.0\nstack_data 0.6.2\nstatsmodels 0.14.1\nsympy 1.12\ntexttable 1.7.0\nthreadpoolctl 3.2.0\ntorch 2.0.0\ntornado 6.3.2\ntqdm 4.65.0\ntraitlets 5.9.0\ntyping_extensions NA\numap 0.5.5\nurllib3 2.0.2\nwcwidth 0.2.6\nwebsocket 1.5.2\nyaml 6.0\nzmq 25.0.2\nzoneinfo NA\nzstandard 0.19.0\n-----\nIPython 8.13.2\njupyter_client 8.2.0\njupyter_core 5.3.0\njupyterlab 4.0.1\nnotebook 6.5.4\n-----\nPython 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]\nLinux-6.5.11-linuxkit-x86_64-with-glibc2.35\n-----\nSession information updated at 2024-01-16 23:30" + "text": "7 Session info\n\n\nClick here\n\n\nsc.logging.print_versions()\n\n-----\nanndata 0.10.3\nscanpy 1.9.6\n-----\nPIL 10.0.0\nannoy NA\nanyio NA\nasttokens NA\nattr 23.1.0\nbabel 2.12.1\nbackcall 0.2.0\ncertifi 2023.11.17\ncffi 1.15.1\ncharset_normalizer 3.1.0\ncolorama 0.4.6\ncomm 0.1.3\ncycler 0.12.1\ncython_runtime NA\ndateutil 2.8.2\ndebugpy 1.6.7\ndecorator 5.1.1\ndefusedxml 0.7.1\nexceptiongroup 1.2.0\nexecuting 1.2.0\nfastjsonschema NA\nfbpca NA\ngmpy2 2.1.2\nh5py 3.9.0\nidna 3.4\nigraph 0.10.8\nintervaltree NA\nipykernel 6.23.1\nipython_genutils 0.2.0\njedi 0.18.2\njinja2 3.1.2\njoblib 1.3.2\njson5 NA\njsonpointer 2.0\njsonschema 4.17.3\njupyter_events 0.6.3\njupyter_server 2.6.0\njupyterlab_server 2.22.1\nkiwisolver 1.4.5\nleidenalg 0.10.1\nllvmlite 0.41.1\nlouvain 0.8.1\nmarkupsafe 2.1.2\nmatplotlib 3.8.0\nmatplotlib_inline 0.1.6\nmpl_toolkits NA\nmpmath 1.3.0\nnatsort 
8.4.0\nnbformat 5.8.0\nnumba 0.58.1\nnumpy 1.26.2\nopt_einsum v3.3.0\noverrides NA\npackaging 23.1\npandas 2.1.4\nparso 0.8.3\npatsy 0.5.5\npexpect 4.8.0\npickleshare 0.7.5\npkg_resources NA\nplatformdirs 3.5.1\nprometheus_client NA\nprompt_toolkit 3.0.38\npsutil 5.9.5\nptyprocess 0.7.0\npure_eval 0.2.2\npvectorc NA\npycparser 2.21\npydev_ipython NA\npydevconsole NA\npydevd 2.9.5\npydevd_file_utils NA\npydevd_plugins NA\npydevd_tracing NA\npygments 2.15.1\npynndescent 0.5.11\npyparsing 3.1.1\npyrsistent NA\npythonjsonlogger NA\npytz 2023.3\nrequests 2.31.0\nrfc3339_validator 0.1.4\nrfc3986_validator 0.1.1\nscanorama 1.7.4\nscipy 1.11.4\nseaborn 0.12.2\nsend2trash NA\nsession_info 1.0.0\nsix 1.16.0\nsklearn 1.3.2\nsniffio 1.3.0\nsocks 1.7.1\nsortedcontainers 2.4.0\nsparse 0.14.0\nstack_data 0.6.2\nstatsmodels 0.14.1\nsympy 1.12\ntexttable 1.7.0\nthreadpoolctl 3.2.0\ntorch 2.0.0\ntornado 6.3.2\ntqdm 4.65.0\ntraitlets 5.9.0\ntyping_extensions NA\numap 0.5.5\nurllib3 2.0.2\nwcwidth 0.2.6\nwebsocket 1.5.2\nyaml 6.0\nzmq 25.0.2\nzoneinfo NA\nzstandard 0.19.0\n-----\nIPython 8.13.2\njupyter_client 8.2.0\njupyter_core 5.3.0\njupyterlab 4.0.1\nnotebook 6.5.4\n-----\nPython 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]\nLinux-6.5.11-linuxkit-x86_64-with-glibc2.35\n-----\nSession information updated at 2024-01-23 11:35" }, { "objectID": "index.html", @@ -1397,7 +1397,7 @@ "href": "index.html#single-cell-rna-seq-analysis", "title": "", "section": "Single Cell RNA-seq Analysis", - "text": "Single Cell RNA-seq Analysis\n\n\nOverview of scRNAseq technologies\nQC, normalization and transformation\nDimensionality reduction and clustering\nDifferential gene expression\nCelltype prediction\nTrajectory analysis\nSeurat, Bioconductor and Scanpy toolkits\n\n\n\nUpdated: 18-01-2024 at 17:22:21." 
+ "text": "Single Cell RNA-seq Analysis\n\n\nOverview of scRNAseq technologies\nQC, normalization and transformation\nDimensionality reduction and clustering\nDifferential gene expression\nCelltype prediction\nTrajectory analysis\nSeurat, Bioconductor and Scanpy toolkits\n\n\n\nUpdated: 23-01-2024 at 11:35:27." }, { "objectID": "home_contents.html", @@ -1664,5 +1664,33 @@ "title": "UPPMAX Account Guide", "section": "7 ​Create a folder", "text": "7 ​Create a folder\n::: {.warning} ​After having received information that your membership is approved, ​wait 24 h before continuing, as it takes up to 24 h for SUPR to sync with UPPMAX. Else, you might get the message Permission denied when writing files or folders. ::\nCreate a directory for you to work in. Replace <username> with your actual user name.\n\n\n\n\nbash\n\nmkdir /proj/​naiss2023-23-648/nobackup/<username>\n\n\n\n​Unless you got some kind of error message. you should now be finished. To make sure the folder was created you can type\n\n\n\n\nbash\n\nls /proj/​naiss2023-23-648/nobackup/\n\n\n\n​It should list all directories along with the one you created. ​If you get an error message, contac us in Slack." + }, + { + "objectID": "labs/seurat/seurat_04_clustering.html#meta-clust_distribution", + "href": "labs/seurat/seurat_04_clustering.html#meta-clust_distribution", + "title": " Clustering", + "section": "4 Distribution of clusters", + "text": "4 Distribution of clusters\nNow, we can select one of our clustering methods and compare the proportion of samples across the clusters.\nSelect the “CCA_snn_res.0.5” and plot proportion of samples per cluster and also proportion covid vs ctrl.\n\np1 <- ggplot(alldata@meta.data, aes(x = CCA_snn_res.0.5, fill = orig.ident)) +\n geom_bar(position = \"fill\")\np2 <- ggplot(alldata@meta.data, aes(x = CCA_snn_res.0.5, fill = type)) +\n geom_bar(position = \"fill\")\n\np1 + p2\n\n\n\n\n\n\n\n\nIn this case we have quite good representation of each sample in each cluster. 
But there are clearly some biases with more cells from one sample in some clusters and also more covid cells in some of the clusters.\nWe can also plot it in the other direction, the proportion of each cluster per sample.\n\nggplot(alldata@meta.data, aes(x = orig.ident, fill = CCA_snn_res.0.5)) +\n geom_bar(position = \"fill\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nBy now you should know how to plot different features onto your data. Take the QC metrics that were calculated in the first exercise, that should be stored in your data object, and plot it as violin plots per cluster using the clustering method of your choice. For example, plot number of UMIS, detected genes, percent mitochondrial reads. Then, check carefully if there is any bias in how your data is separated due to quality metrics. Could it be explained biologically, or could you have technical bias there?" + }, + { + "objectID": "labs/scanpy/scanpy_03_integration.html#overview-all-methods", + "href": "labs/scanpy/scanpy_03_integration.html#overview-all-methods", + "title": " Data Integration", + "section": "6 Overview all methods", + "text": "6 Overview all methods\nNow we will plot UMAPS with all three integration methods side by side.\n\nfig, axs = plt.subplots(2, 2, figsize=(10,8),constrained_layout=True)\nsc.pl.umap(adata, color=\"sample\", title=\"Uncorrected\", ax=axs[0,0], show=False)\nsc.pl.umap(adata2, color=\"sample\", title=\"BBKNN\", ax=axs[0,1], show=False)\nsc.pl.umap(adata_combat, color=\"sample\", title=\"Combat\", ax=axs[1,0], show=False)\nsc.pl.umap(adata_sc, color=\"sample\", title=\"Scanorama\", ax=axs[1,1], show=False)\n\n<Axes: title={'center': 'Scanorama'}, xlabel='UMAP1', ylabel='UMAP2'>\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nLook at the different integration results, which one do you think looks the best? How would you motivate selecting one method over the other? How do you think you could best evaluate if the integration worked well?" 
+ }, + { + "objectID": "labs/scanpy/scanpy_03_integration.html#extra-task", + "href": "labs/scanpy/scanpy_03_integration.html#extra-task", + "title": " Data Integration", + "section": "7 Extra task", + "text": "7 Extra task\nHave a look at the documentation for BBKNN\nTry changing some of the parameteres in BBKNN, such as distance metric, number of PCs and number of neighbors. How does the results change with different parameters? Can you explain why?" + }, + { + "objectID": "labs/scanpy/scanpy_04_clustering.html#meta-clust_distribution", + "href": "labs/scanpy/scanpy_04_clustering.html#meta-clust_distribution", + "title": " Clustering", + "section": "4 Distribution of clusters", + "text": "4 Distribution of clusters\nNow, we can select one of our clustering methods and compare the proportion of samples across the clusters.\nSelect the “leiden_0.6” and plot proportion of samples per cluster and also proportion covid vs ctrl.\nPlot proportion of cells from each condition per cluster.\n\ntmp = pd.crosstab(adata.obs['leiden_0.6'],adata.obs['type'], normalize='index')\ntmp.plot.bar(stacked=True).legend(bbox_to_anchor=(1.4, 1), loc='upper right')\n\ntmp = pd.crosstab(adata.obs['leiden_0.6'],adata.obs['sample'], normalize='index')\ntmp.plot.bar(stacked=True).legend(bbox_to_anchor=(1.4, 1),loc='upper right')\n\n<matplotlib.legend.Legend at 0x7fff4680fbe0>\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIn this case we have quite good representation of each sample in each cluster. 
But there are clearly some biases with more cells from one sample in some clusters and also more covid cells in some of the clusters.\nWe can also plot it in the other direction, the proportion of each cluster per sample.\n\ntmp = pd.crosstab(adata.obs['sample'],adata.obs['leiden_0.6'], normalize='index')\ntmp.plot.bar(stacked=True).legend(bbox_to_anchor=(1.4, 1), loc='upper right')\n\n<matplotlib.legend.Legend at 0x7fff46bb3be0>\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDiscuss\n\n\n\nBy now you should know how to plot different features onto your data. Take the QC metrics that were calculated in the first exercise, that should be stored in your data object, and plot it as violin plots per cluster using the clustering method of your choice. For example, plot number of UMIS, detected genes, percent mitochondrial reads. Then, check carefully if there is any bias in how your data is separated due to quality metrics. Could it be explained biologically, or could you have technical bias there?" } ] \ No newline at end of file diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 7a8a5339..9632e0b1 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -2,138 +2,138 @@ https://nbisweden.github.io/workshop-scrnaseq/labs/index.html - 2024-01-18T17:37:08.172Z + 2024-01-23T11:36:39.706Z https://nbisweden.github.io/workshop-scrnaseq/labs/seurat/seurat_01_qc.html - 2024-01-16T22:25:10.248Z + 2024-01-23T10:36:04.063Z https://nbisweden.github.io/workshop-scrnaseq/labs/seurat/seurat_02_dimred.html - 2024-01-16T22:28:14.448Z + 2024-01-23T10:39:16.844Z https://nbisweden.github.io/workshop-scrnaseq/labs/seurat/seurat_03_integration.html - 2024-01-16T22:32:39.666Z + 2024-01-23T10:43:41.411Z https://nbisweden.github.io/workshop-scrnaseq/labs/seurat/seurat_04_clustering.html - 2024-01-16T22:36:10.489Z + 2024-01-23T10:47:15.968Z https://nbisweden.github.io/workshop-scrnaseq/labs/seurat/seurat_05_dge.html - 2024-01-16T22:41:51.265Z + 2024-01-23T10:48:38.377Z 
https://nbisweden.github.io/workshop-scrnaseq/labs/seurat/seurat_06_celltyping.html - 2024-01-16T22:45:20.706Z + 2024-01-23T10:52:12.431Z https://nbisweden.github.io/workshop-scrnaseq/labs/seurat/seurat_07_trajectory.html - 2024-01-18T17:25:11.929Z + 2024-01-23T10:53:02.176Z https://nbisweden.github.io/workshop-scrnaseq/labs/seurat/seurat_08_spatial.html - 2024-01-18T17:37:01.515Z + 2024-01-23T11:03:47.462Z https://nbisweden.github.io/workshop-scrnaseq/labs/bioc/bioc_01_qc.html - 2024-01-16T23:00:07.097Z + 2024-01-23T11:06:06.861Z https://nbisweden.github.io/workshop-scrnaseq/labs/bioc/bioc_02_dimred.html - 2024-01-16T23:02:03.186Z + 2024-01-23T11:08:00.839Z https://nbisweden.github.io/workshop-scrnaseq/labs/bioc/bioc_03_integration.html - 2024-01-16T23:03:37.326Z + 2024-01-23T11:09:34.580Z https://nbisweden.github.io/workshop-scrnaseq/labs/bioc/bioc_04_clustering.html - 2024-01-16T23:06:57.728Z + 2024-01-23T11:12:54.882Z https://nbisweden.github.io/workshop-scrnaseq/labs/bioc/bioc_05_dge.html - 2024-01-16T23:08:27.279Z + 2024-01-23T11:14:23.699Z https://nbisweden.github.io/workshop-scrnaseq/labs/bioc/bioc_06_celltyping.html - 2024-01-16T23:12:28.234Z + 2024-01-23T11:18:22.052Z https://nbisweden.github.io/workshop-scrnaseq/labs/bioc/bioc_08_spatial.html - 2024-01-16T23:16:16.673Z + 2024-01-23T11:21:19.598Z https://nbisweden.github.io/workshop-scrnaseq/labs/scanpy/scanpy_01_qc.html - 2024-01-16T23:17:50.240Z + 2024-01-23T11:23:00.368Z https://nbisweden.github.io/workshop-scrnaseq/labs/scanpy/scanpy_02_dimred.html - 2024-01-16T23:19:40.186Z + 2024-01-23T11:25:14.223Z https://nbisweden.github.io/workshop-scrnaseq/labs/scanpy/scanpy_03_integration.html - 2024-01-16T23:21:34.606Z + 2024-01-23T11:27:29.728Z https://nbisweden.github.io/workshop-scrnaseq/labs/scanpy/scanpy_04_clustering.html - 2024-01-16T23:22:13.515Z + 2024-01-23T11:28:12.425Z https://nbisweden.github.io/workshop-scrnaseq/labs/scanpy/scanpy_05_dge.html - 2024-01-16T23:23:48.212Z + 2024-01-23T11:29:47.308Z 
https://nbisweden.github.io/workshop-scrnaseq/labs/scanpy/scanpy_06_celltyping.html - 2024-01-16T23:24:57.165Z + 2024-01-23T11:30:56.667Z https://nbisweden.github.io/workshop-scrnaseq/labs/scanpy/scanpy_07_trajectory.html - 2024-01-16T23:26:21.015Z + 2024-01-23T11:32:20.280Z https://nbisweden.github.io/workshop-scrnaseq/labs/scanpy/scanpy_08_spatial.html - 2024-01-16T23:30:15.128Z + 2024-01-23T11:35:17.003Z https://nbisweden.github.io/workshop-scrnaseq/index.html - 2024-01-18T17:22:21.340Z + 2024-01-23T11:35:28.037Z https://nbisweden.github.io/workshop-scrnaseq/home_contents.html - 2024-01-18T17:22:27.497Z + 2024-01-23T11:35:33.625Z https://nbisweden.github.io/workshop-scrnaseq/home_info.html - 2024-01-18T17:22:34.791Z + 2024-01-23T11:35:41.069Z https://nbisweden.github.io/workshop-scrnaseq/home_precourse.html - 2024-01-18T17:22:41.764Z + 2024-01-23T11:35:48.817Z https://nbisweden.github.io/workshop-scrnaseq/home_schedule.html - 2024-01-18T17:22:49.487Z + 2024-01-23T11:35:56.581Z https://nbisweden.github.io/workshop-scrnaseq/home_syllabus.html - 2024-01-18T17:22:55.612Z + 2024-01-23T11:36:03.275Z https://nbisweden.github.io/workshop-scrnaseq/other/uppmax.html - 2024-01-18T17:23:02.260Z + 2024-01-23T11:36:09.923Z https://nbisweden.github.io/workshop-scrnaseq/other/docker.html - 2024-01-18T17:23:08.909Z + 2024-01-23T11:36:16.428Z https://nbisweden.github.io/workshop-scrnaseq/other/containers.html - 2024-01-19T17:02:00.786Z + 2024-01-23T11:36:23.687Z https://nbisweden.github.io/workshop-scrnaseq/other/faq.html - 2024-01-18T17:23:23.151Z + 2024-01-23T11:36:30.404Z