-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathfigure3_generation.Rmd
253 lines (184 loc) · 10.9 KB
/
figure3_generation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
---
title: "Sang WBM Figure Generation"
author: "D. Ford Hannum"
date: "5/18/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(Seurat)
library(ggplot2)
library(data.table)
library(plyr)
library(MAST)
```
```{r reading in the data, echo = F}
wbm <- readRDS('./data/wbm_clustered_filtered_named.rds')
```
> **FIGURES WILL BE SAVED IN FIGURES/VERSION2/**
# Things to ask
A section of questions that arise when putting together the analysis:
* What are we calling the wildtype and mutant groups?
# Figure 3
Putting together images for Figure 3, which we want to sumarize the single cell RNA-seq (scRNA) between the Mipl (mut) and control samples.
## Figure 3a
UMAP projection of the single cell data. Clusters were labeled using SingleR and marker gene expression, and consolidated into single clusters. The first image is of all the clusters and their labels. The second image is when the clusters are consolidated.
```{r 3a, echo = F}
DimPlot(wbm, reduction = 'umap')
new_levels <- c('Granulocyte','Granulocyte','Granulocyte','B-cell','MK',
'Granulocyte','Granulocyte',
'Monocyte','Monocyte','Erythroid','B-cell','T-cell/NK','MEP')
names(new_levels) <- levels(wbm)
wbm <- RenameIdents(wbm, new_levels)
DimPlot(wbm, reduction = 'umap', pt.size = .001, label.size = 2) + xlab ('UMAP1') + ylab ('UMAP2') +
theme_classic() +
theme(text = element_text(size = 10, family = 'sans'),
axis.text = element_blank(),
axis.ticks = element_blank())
# ggsave('./figures/version2/fig3a_umap_consolidated.pdf', units = 'in',
# height = 2.75, width = 5, device = 'pdf')
```
> UMAP projection of the scRNA-seq analysis. Clusters were labeled by cell type and then consolidated into larger clusters based on the cell type
## Figure 3b
Looking at the UMAP projection, but splitting it up between the different experiments.
```{r fig3b, echo = F}
wbm@meta.data$condition <- ifelse(wbm@meta.data$condition == 'control', 'Control','Mipl')
DimPlot(wbm, reduction = 'umap', pt.size = .001, label = F, split.by = 'condition') + xlab ('UMAP1') +
ylab ('UMAP2') +
theme_classic() +
theme(text = element_text(size = 10, family = 'sans'),
axis.text = element_blank(),
axis.ticks = element_blank()) +
NoLegend()
# ggsave('./figures/version2/fig3b_umap_consolidated_split.pdf', units = 'in', height = 2.75, width = 5, device = 'pdf')
```
> UMAP projection of the scRNA-seq analysis, split between the two experimental states (Control, Mipl)
## Figure 3c
Heatmap of the distribution of the experiments among the clusters
```{r generating heatmap, echo = F}
# Generating the df for the heatmap
wbm@meta.data$clusterIDs <- wbm@meta.data$cluster_IDs
levels(wbm@meta.data$clusterIDs) <- new_levels
tbl <- as.matrix(table(wbm@meta.data$clusterIDs, wbm@meta.data$condition))
mtx <- matrix(ncol = 3, nrow = 14)
mtx[,1] <- rep(rownames(tbl),2)
mtx[,2] <- rep(colnames(tbl), each = 7)
mtx[1:7,3] <- tbl[,1]
mtx[8:14,3] <- tbl[,2]
mtx <- as.data.frame(mtx)
colnames(mtx) <- c('Cell Type', 'Condition', 'Count')
mtx$Count <- as.numeric(as.character(mtx$Count))
cnt_count <- sum(mtx[1:7,3])
mpl_count <- sum(mtx[8:14,3])
mtx$Percentage <- NA
mtx$Percentage[1:7] <- round(mtx$Count[1:7]/cnt_count,2)*100
mtx$Percentage[8:14] <- round(mtx$Count[8:14]/mpl_count,2)*100
cell_levels <- c('Granulocyte','B-cell','MK','Monocyte','Erythroid','T-cell/NK','MEP')
mtx$`Cell Type` <- factor(mtx$`Cell Type`, levels = rev(cell_levels))
ggplot(data = mtx, aes(x = Condition, y = `Cell Type`, fill = Percentage)) + geom_tile() +
scale_fill_gradient2(high = 'darkred', low = 'white', mid = 'red', midpoint = 50,
limit = c(0,100), space ="Lab",
#guide = guide_legend(direction = 'horizontal', title.position = 'top'),
guide = guide_colorbar(direction = 'horizontal', title.position = 'top'),
name = 'Percentage of Cells\nin Condition') +
theme_minimal() +
scale_x_discrete(position = 'top') +
geom_text(aes(Condition, `Cell Type`, label = Count), color = 'black', size = 4) +
theme(text = element_text(size = 10, family = 'sans'),
legend.title.align = 0.5,
legend.position = 'bottom')
# ggsave('./figures/version2/figure3c_heatmap.pdf', device = 'pdf', units = 'in',
# width = 2.5, height = 5.5)
```
> Cell counts for each cell type among the conditions. Percentage of the cells in each condition accounted for by cell type is represented by red shading
## Figure 3d
Want to show the differential expression of genes within the MK cluster.
```{r getting marker genes for MKs, echo = F, warning= F}
mk.markers.w <- FindMarkers(wbm, ident.1 = 'Control', ident.2 = 'Mipl',
group.by = 'condition',
verbose = T, subset.ident = 'MK',
logfc.threshold = log(2), test.use = 'wilcox')
mk.markers.b <- FindMarkers(wbm, ident.1 = 'Control', ident.2 = 'Mipl',
group.by = 'condition',
verbose = T, subset.ident = 'MK',
logfc.threshold = log(2), test.use = 'bimod')
mk.markers.r <- FindMarkers(wbm, ident.1 = 'Control', ident.2 = 'Mipl',
group.by = 'condition',
verbose = T, subset.ident = 'MK',
logfc.threshold = log(2), test.use = 'roc')
mk.markers.m <- FindMarkers(wbm, ident.1 = 'Control', ident.2 = 'Mipl',
group.by = 'condition',
verbose = T, subset.ident = 'MK',
logfc.threshold = log(2), test.use = 'MAST')
# print(dim(mk.markers.w[mk.markers.w$p_val_adj < 0.05,]))
# print(dim(mk.markers.b[mk.markers.b$p_val_adj < 0.05,]))
# print(dim(mk.markers.r[mk.markers.r$power > .8,]))
print(dim(mk.markers.m[mk.markers.m$p_val_adj < 0.05,]))
sig.mk.markers.m <- mk.markers.m[mk.markers.m$p_val_adj < 0.05,]
sig.mk.markers.m <- sig.mk.markers.m[order(sig.mk.markers.m$avg_logFC, decreasing = T),]
sig.mk.genes.m <- rownames(sig.mk.markers.m)
```
For the follow-up analysis I'm going to use MAST to determine differentially expressed genes. MAST identifies differentially expressed genes between two groups of cells using a hurdle model tailored to scRNA-seq data. While also stipulating the fold-change in expression greater than 2, we get 123 differentially expressed genes. Previously the wilcoxon test was used, and 93 of the 96 genes from that list were also present in the DE genes from MAST.
```{r heatmap with DE genes, echo = F}
mk.cells <- WhichCells(wbm, ident = 'MK')
DoHeatmap(wbm, features = sig.mk.genes.m, group.by = 'condition',
cells = mk.cells, label = F) +
theme_minimal() +
xlab('MK Cells') + ylab('DE genes') +
scale_x_discrete(position = 'top') +
theme(text = element_text(size = 10, family = 'sans'),
axis.text = element_blank(),
axis.ticks = element_blank())
# ggsave('./figures/version2/figure3d_mk_markers.pdf', device = 'pdf', units = 'in',
# width = 7.5, height = 3)
```
```{r getting the mk.marker gene list, echo = F}
# sig.mk.markers.m
# sig.mk.genes.m
# DimPlot(wbm[], reduction = 'umap', split.by = 'condition',
# cells.highlight = colnames(subset(wbm, subset = Mpo > 2)),
# cols.highlight = 'red')
VlnPlot(wbm, c('Mpo','Slpi'), split.by = 'condition', pt.size = 0)
```
These were three DE genes in MKs and had the largest fold changes. Mpo is . SLPI, aka secretory leukocyte protease inhibitor, inhibits human leukocyte elastase, human cathepsin G, human trypsin, neutrophil elastase, and mast cell chymase. Increased levels of SLPI in saliva and plasma may also be an indicator of HIV infection. SLPI has been shown to interact with PLSCR1 and PLSCR4 on the plasma membrane of T-cells, specifically in the proximity of CD4. This interaction is hypothesized to be on of the ways SLPI inhibits HIV infection.
# Profibrotic Factors
I'm starting to generate some potential figures for Fig. 4, which is going to focus on profibrotic factors.
```{r profibrotic factor figures, echo = F}
gene.list <- rownames(wbm)
pff.list <- c('Pf4','Tgfb','Ppbp','Gata1')
pff.gene.list <- c()
cnt <- 1
for (i in pff.list){
genes <- gene.list[grep(i, gene.list)]
for (j in genes){
pff.gene.list[cnt] <- j
cnt <- cnt + 1
}
}
pff.gene.list.short <- c('Pf4','Gata1','Ppbp','Tgfb1')
VlnPlot(wbm, pff.gene.list[c(1,3:7)], split.by = 'condition', pt.size = 0)
VlnPlot(wbm, pff.gene.list[8:13], split.by = 'condition', pt.size = 0)
```
## Follow-Up from Sang Lab
1. Cell clustering: concerning the hematopoietic stem cells (HSCs) and progenitors.
i) Originally had a progenitor/SC cluster associated with both B-cells and granulocytes but I thought the decision was to merge them into one cluster.
ii) A thought Jun had would be to pull out the granulocyte cluster and do a trajectory analysis just on that cluster. With that method we would be getting at those previously labeled "Stem Cells"
2. Heatmap: wanted to include both the cell number and percentages, also including a bar graph
i) This can be done. Want to discuss in exactly what kind of format
3. Differential gene expression: want to see what those genes are, in an excel file.
i) sent a quick unformatted list to Priya yesterday so they had it for your journal club.
ii) when we find a few genes were are excited about/interested in we can label those in particular. I'm thinking adding their name to the y-axis and maybe highlighting their row with a red box.
iii) I added x and y labels
## Other notes from Jun
* this is a standard/common first figure in most single-cell papers
* looking at others was to format the DE heatplot: try changing color schemes, perhaps changes expression levels to (0 to 1)
* adding a GSEA/go term analysis
* adding SingleR heatmap either in the figure or in the supplementary, to show confidence of the labels
* the heatplot is too big and could be 60% of its size, also text size needs to be consistent
## Continued Work
* Visualizing a trajectory analysis in monocle. Want to do this for the entire dataset, and separately for granulocytes, and MK/MEPs/Erythroids
* Read the Oxford paper. Figure out how we differ and what we can compare
* continue working on potential images to use in the profibrotic factor focused Figure 4
## Things that would be helpful for me
* If I could look at a draft of the paper so far, that would be really helpful. I'd have a better understanding of the whole project, and it would help me to write the results section for the sc-analysis.
* Interesting genes from MK DEG list, I could add additional figures for those genes of interest (like what is being done for the profibrotic factors)