-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathfigure_generation_v2_MK_subclustering.Rmd
180 lines (126 loc) · 6.65 KB
/
figure_generation_v2_MK_subclustering.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
---
title: "MK Subclustering"
author: "D. Ford Hannum"
date: "6/23/2020"
output:
html_document:
toc: true
toc_depth: 3
number_sections: false
theme: united
highlight: tango
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
library(Seurat)
library(ggplot2)
library(data.table)
```
# Overview
While going through generating figures I was subclustering the MK and MEP cells. I found that some of the genes that were deferentially expressed where actually due to differences in the population of the subclusters (similar to some DE genes in bulk RNA, that are due to differences in populations).
Because of this I decided to change how I was doing the differential expression analysis. I focused on looking at marker genes that distinguished subclusters and did a GO term analysis on those genes. I have an excel file that goes with this file that has tabs for the differentially epressed genes in subclusters 0, 1, and 5 (which are unique to Mpl mice) along with the GO results.
This is all included in my full file figure_generation_v2 but I wanted to share separately to get input on potential consequences of these aberrant MK subclusters.
# UMAP Projection of MK Clustering
For this subclustering I only used the subset of MK and MEP cells.
```{r reading in the data}
wbm <- readRDS('./data/wbm_clustered_filtered_named.rds')
wbm@meta.data$condition <- ifelse(wbm@meta.data$condition == 'control', 'Control', 'Mpl')
mks <- subset(wbm, cluster_IDs %in% c('MK',"MEP"))
mks <- NormalizeData(mks, normalization.method = "LogNormalize", scale.factor = 10000)
all.genes <- rownames(mks)
mks <- ScaleData(mks, features = all.genes)
mks <- RunPCA(mks, features = VariableFeatures(mks), verbose = F)
# ElbowPlot(mks) # decided to go with the first 8 PCs
mks <- FindNeighbors(mks, dims = 1:8)
mks <- FindClusters(mks , resolution = .45, verbose = F)
mks <- RunUMAP(mks, dims = 1:8, verbose = F)
DimPlot(mks, reduction = 'umap', split.by = 'condition') +
theme(axis.text = element_blank())
```
We see that clusters 0, 1, and 5 are exclusive to Mpl.
```{r plot2}
DimPlot(mks, reduction = 'umap', split.by = 'cluster_IDs') +
theme(axis.text = element_blank())
```
Cluster 2 is the MEP cell population, which is expressed in both control and Mpl.
# Bar Graph of Cluster Counts
Similar to the bar graph which will be Figure 3b, below is a bar chart showing the percentage of clusters that comes from each condition (Control, Mpl). I normalized the cell counts to overall cell count between Control (n = 2620) and Mpl (n = 3501).
```{r 4a sup. table}
mk.tbl <- as.data.frame(table(mks$seurat_clusters, mks$condition))
#mk.tbl
colnames(mk.tbl) <- c('Cluster','Condition', 'Freq')
#mk.tbl
# normalization_factor <- sum(mk.tbl[mk.tbl$Condition == 'control',]$Freq)/
# sum(mk.tbl[mk.tbl$Condition == 'mut',]$Freq)
# The normalization factor should be calculated by the total cell depth
#summary(as.factor(wbm$condition))
normalization_factor <- 2620/3501
mk.tbl$Norm_freq <- ifelse(mk.tbl$Condition == 'Control', mk.tbl$Freq,
round(mk.tbl$Freq * normalization_factor,0))
cluster_counts <- c()
cnt <- 1
for (i in levels(mk.tbl$Cluster)){
cluster_counts[cnt] <- sum(mk.tbl[mk.tbl$Cluster == i,]$Norm_freq)
cnt <- cnt + 1
}
mk.tbl$cluster_counts <- rep(cluster_counts,2)
mk.tbl$Percentage <- round(mk.tbl$Norm_freq / mk.tbl$cluster_counts,2)*100
#mk.tbl$Condition <- ifelse(mk.tbl$Condition == 'control', 'Control', 'Mpl')
mk.tbl
ggplot(data = mk.tbl, aes(x = Cluster, y = Percentage, fill = Condition)) +
scale_fill_manual(values = c('blue','red')) +
geom_bar(stat = 'identity') +
theme_bw() +
ylab('Percentage of Cells') + xlab('Cell Type') +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
# DE Analysis
Below is the heatmap I earlier created for DE analysis within the MK cluster.
```{r original MK heatmap}
mk.markers <- FindMarkers(wbm, ident.1 = 'Control', ident.2 = 'Mpl',
group.by = 'condition',
verbose = T, subset.ident = 'MK',
logfc.threshold = log(2), test.use = 'MAST')
sig.mk.markers <- mk.markers[mk.markers$p_val_adj < 0.05,]
sig.mk.markers <- sig.mk.markers[order(sig.mk.markers$avg_logFC, decreasing = T),]
sig.mk.genes <- rownames(sig.mk.markers)
mk.cells <- WhichCells(wbm, ident = 'MK')
DoHeatmap(wbm, features = sig.mk.genes, group.by = 'condition',
cells = mk.cells, label = F) +
theme_minimal() +
xlab('MK Cells') + ylab('DE genes') +
scale_x_discrete(position = 'top') +
theme(text = element_text(size = 10, family = 'sans'),
axis.text = element_blank(),
axis.ticks = element_blank())
```
One of the MK DE genes is Mki67 (a proliferation gene) and it is said to be upregulated in the control compared to Mpl:
```{r mki67}
sig.mk.markers[rownames(sig.mk.markers) == 'Mki67',]
```
If we look at a violin chart this seems to also be the case.
```{r mki67 violin}
VlnPlot(wbm, 'Mki67', pt.size = 0, split.by = 'condition')
```
But if we look closer at the MK subclustering there's a different story.
```{r mki67 mk violin}
VlnPlot(mks, 'Mki67', pt.size = 0, split.by = 'condition')
```
We see that it is just one subcluster of MKs that has expression of Mki67 and we see a similar expression between control and Mpl cells within the cluster. This subcluster contains both control and Mpl cells, and the reason it comes up as differentially expressed is this subcluster makes up a much higher percentage of control cells than Mpl cells as can be seen below.
```{r another bar chart}
mk.tbl2 <- mk.tbl[,1:4]
mk.tbl2$condition_counts <- rep(c(sum(mk.tbl2[mk.tbl2$Condition == 'Control','Norm_freq']),
sum(mk.tbl2[mk.tbl2$Condition == 'Mpl', 'Norm_freq'])),
each = 6)
mk.tbl2$Percentage <- round(mk.tbl2$Norm_freq / mk.tbl2$condition_counts,2)*100
#mk.tbl2
mk.tbl2
ggplot(data = mk.tbl2, aes(x = Condition, y = Percentage, fill = Cluster)) +
geom_bar(stat = 'identity') +
theme_bw() +
ylab('Percentage of Cells') + xlab('Cell Type') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip()
```
# New DE Analysis
Because of this I decided to look at what genes distinguished cluster 0, 1, and 5 from the other MK subclusters; since these are aberrant clusters that only appear in Mpl cells. The supplemental spreadsheet has the results of all the DE marker genes and the GO term analysis performed on those DE marker genes.