-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathcase-study.Rmd
530 lines (429 loc) · 19.4 KB
/
case-study.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
---
title: "ADAGE signature analysis - case study"
author: "Jie Tan"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
fig_width: 7
fig_height: 7
vignette: >
%\VignetteIndexEntry{Case study}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
This script shows how we analyze the ∆*anr* mutation dataset used as a case study
in the ADAGE signature analysis paper.
## Data preparation
Load in required libraries.
```{r}
library("ADAGEpath")
library("DT")
library("knitr")
library("VennDiagram")
library("readr")
```
Before any analysis, we need to specify the ADAGE model and the data compendium
we want to use.
```{r}
model <- eADAGEmodel
compendium <- PAcompendium
probe_dist <- probedistribution
```
Let's load in the sample dataset that come with the package. It's expression data
from *Pseudomonas aeruginosa* wild type and ∆*anr* grown as biofilms on ∆F508 cystic
fibrosis bronchial epithelial cells (CFBEs). Detailed information about
the dataset can be found here
[GSE67006](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67006).
All its CEL files are stored in the folder "./inst/extdata/anr".
To load your own dataset, simply modify this input path.
```{r}
input_path <- system.file("extdata", "anr/", package = "ADAGEpath")
data_raw <- load_dataset(input = input_path, isProcessed = FALSE,
isRNAseq = FALSE, model = model,
compendium = compendium, quantile_ref = probe_dist,
norm01 = FALSE)
```
ADAGE only accepts expression values in the (0,1) range. We linearly transform
expression values to be between 0 and 1 using the *P.a.* compendium as the
reference.
```{r}
data_normed <- zeroone_norm(input_data = data_raw, use_ref = TRUE,
ref_data = compendium)
```
Now let's specify the phenotypes for each sample. It needs to be a character
vector and has the same sample order as the expression data loaded above.
```{r}
data_pheno <- c("mt", "mt", "mt", "wt", "wt", "wt")
```
## ADAGE signature analysis
### Activity calculation
We calculate the activity of each signature for each sample in the dataset.
```{r}
data_activity <- calculate_activity(input_data = data_normed, model = model)
```
The returned `data_activity` is a `data.frame` with signature names in the first
column and activity values per sample starting from the second column.
### Active signature detection
We want to find signatures that are differentially active between
*anr* mutant and wildtype samples.
We use [limma](https://bioconductor.org/packages/release/bioc/html/limma.html)
to perform a differential activation test. limma is more robust
than t-test when sample size is small. A two-group limma analysis is
provided in the function `build_limma()`. You can also build other limma models
to test signatures' activities when the experimental design is more complex.
In the limma test, we will use "wt" as the control phenotype.
Because there are a lot of signatures passing the significance cutoff, here we
use the more stringent Bonferroni procedure instead of the Benjamini–Hochberg
procedure for multiple hypothesis correction.
```{r}
limma_result <- build_limma(data_activity, phenotypes = data_pheno,
control_pheno = "wt",
use.bonferroni = TRUE)
```
To take both absolute activity difference and significance into account, we
use pareto fronts to pick the most differentially active signatures. We extract
differentially active signatures in the first 10 layers of pareto
fronts. Modify `N_fronts` to get more or fewer signatures.
```{r}
active_sigs <- get_active_signatures(limma_result = limma_result,
pheno_group = "both",
method = "pareto", N_fronts = 10)
```
Signatures that are differentially active between *anr* mutant and wildtype are:
```{r}
active_sigs
```
Plot each signature's activity change and significance in the `limma` test.
```{r}
plot_volcano(limma_result, highlight_signatures = active_sigs,
interactive = TRUE)
```
Look at how the activities of active signature vary across samples.
```{r}
plot_activity_heatmap(activity = data_activity, signatures = active_sigs)
```
Combining the volcano plot and the activity heatmap, Node35pos is the most
active signature in the *anr* mutant, followed by Node233pos, Node140pos.
On the other side, Node38neg, Node31pos, Node269pos, Node205neg are most active
in wildtype.
### Overlapping signature removal
To reduce the number of signatures to look at, we can check whether these active
signatures overlap with each other.
`plot_signature_overlap` creates a heatmap of odds ratios. The odds ratio
represents the odds that two signatures share a specific number of genes.
```{r}
signature_similarity <- plot_signature_overlap(selected_signatures = active_sigs,
model = model)
```
Next we calculate the marginal activities of similar signatures. Marginal
activity is defined as the activity of signature A after removing genes that it
shares with signature B.
```{r}
marginal_activity <- calculate_marginal_activity(input_data = data_normed,
selected_signatures = active_sigs,
model = model)
```
Again, we build a limma model to test whether these marginal activities
are still strongly different between two conditions.
```{r}
marginal_limma <- build_limma(input_data = marginal_activity,
phenotypes = data_pheno, control_pheno = "wt")
```
Let's visualize the marginal activities in a matrix heatmap. The value in this
matrix represents the -log10 transformed adjusted p-value in
the activation test when the effect of the column signature is removed from
the row signature. Values in the diagonal of the heatmap are the activation
significance of signatures themselves. Activation significance below the
significance cutoff is marked by a cross sign.
```{r}
plot_marginal_activation(marginal_limma_result = marginal_limma,
signature_order = colnames(signature_similarity),
sig_cutoff = 0.05)
```
Based on this plot, we can see that Node119pos, Node214neg, and Node299pos are
completely masked by Node191pos, because after removing Node191pos from them,
they become non-significant. Node130pos and Node250neg are masked by
Node31pos. Node285neg is masked by Node205neg.
Node154pos is masked by Node57neg. Node63neg, Node39neg, Node228pos, Node158neg,
Node140pos, Node269neg and Node31neg are masked by Node35pos. Node185neg and
Node275pos are masked by Node67pos. Node278pos is masked by Node9pos.
We can see that Node67pos, Node35pos, and Node233pos each has unique
genes that make them still significant even after removing the effect
of another signature. We can safely remove a signature being masked by another
signature as long as we keep the second signature.
```{r}
unique_active_sigs <- remove_redundant_signatures(marginal_limma,
sig_cutoff = 0.05)
unique_active_sigs
```
Check out each signature's activity change and significance after removing
overlapping signatures.
```{r}
plot_volcano(limma_result, highlight_signatures = unique_active_sigs,
interactive = TRUE)
```
Look at how the activities of active signature vary across samples after removing
overlapping signatures.
```{r}
plot_activity_heatmap(activity = data_activity, signatures = unique_active_sigs)
```
### Active signature interpretation
#### Genes in signatures
First of all, we can directly inspect gene composition of a signature or a group
of signatures.
```{r}
# interested_sig could also be a character vector storing several signature
# names, e.g. interested_sig <- c("Node35pos", "Node233pos")
interested_sig <- "Node35pos"
sig_annotation <- annotate_genes_in_signatures(selected_signatures = interested_sig,
model = model)
DT::datatable(sig_annotation)
```
#### Associate pathways
Next we can associate existing pathways with signatures. Here we use KEGG pathways.
We retrive *Pseudomonas aeruginosa* KEGG terms from the
[TRIBE](http://tribe.greenelab.com/#/home) webserver. TRIBE periodically updates
KEGG terms. To reproduce results presented in the paper, we use the KEGG
pathways that were retrieved from TRIBE on June 03 2016.
You can also repeat the analysis with up-to-date KEGG terms from TRIBE.
TRIBE also supports Gene Ontology terms. You can simply replace "KEGG" with "GO"
in the following steps to repeat the analysis with GO terms.
```{r}
KEGG <- fetch_geneset(type = "KEGG", access_date = "06-03-16")
# we only consider KEGG pathways with more than 5 genes and less than 100 genes
# as meaningful pathways
KEGG_subset <- KEGG[lengths(KEGG) >= 5 & lengths(KEGG) <= 100]
```
We associate active signatures with known KEGG pathways.
```{r}
pathway_association <- annotate_signatures_with_genesets(
selected_signatures = unique_active_sigs, model = model,
genesets = KEGG_subset)
```
Calculate the activity of associated pathways inside active signatures. This helps
differentiate active vs. non-active pathways assoicated with active signatures.
```{r, fig.height= 10, fig.width=12}
pathway_activity <- signature_geneset_activity(
signature_geneset_df = pathway_association[, c("signature", "geneset")],
gene_set_list = KEGG_subset, model = model, input_data = data_normed)
plot_activity_heatmap(pathway_activity, is_pathway = TRUE)
```
We run a limma test on pathway activities and find pathways that are truly
active in this dataset.
```{r}
pathway_limma <- build_limma(pathway_activity, phenotypes = data_pheno,
control_pheno = "wt", use.bonferroni = TRUE)
# combine pathway association and pathway activation test results
combined_result <- combine_geneset_outputs(
signature_geneset_association = pathway_association,
geneset_limma_result = pathway_limma)
knitr::kable(combined_result, digits = 4, row.names = FALSE, align = "c")
```
#### Gene-gene networks
We next visualize how genes in the active signatures cluster in the
ADAGE-derived gene-gene network.
We calculate an expression fold change for each gene and pass it to the
gene-gene network to show as node color. Again, we use limma to test
differential expression and get the logFC.
```{r}
data_raw_limma <- build_limma(input_data = data_raw, phenotypes = data_pheno,
control_pheno = "wt")
# build a gene:fold change table from limma result
gene_logFC <- data.frame(geneID = rownames(data_raw_limma),
logFC = data_raw_limma$logFC)
```
Visualize the ADAGE gene-gene network of the active signatures.
```{r}
visualize_gene_network(selected_signatures = unique_active_sigs,
gene_color_value = gene_logFC,
model = model, cor_cutoff = 0.5,
curated_pathways = KEGG)
```
We can also visualize one signature at a time or a group of signatures.
```{r}
visualize_gene_network(selected_signatures = interested_sig,
gene_color_value = gene_logFC,
model = model, cor_cutoff = 0.5,
curated_pathways = KEGG)
```
## A comparison between ADAGEpath and GSEA
We also perform a GSEA analysis on this dataset.
```{r}
run_GSEA <- function(GSEA_file, GSEA_folder, GSEA_output_name, data, phenotype,
geneset) {
# create the class lable file for gsea
# for more details about cls file format, please refer to
# http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#Phenotype_Data_Formats
label_file <- file.path(GSEA_folder, "label.cls")
write(paste(length(phenotype), length(unique(phenotype)), 1, sep = " "),
file = label_file)
write("# con1 con2", file = label_file, append = TRUE)
write(paste(phenotype, collapse = " "), label_file, append = TRUE)
# create the gene expression file for gsea
expression_file <- file.path(GSEA_folder, "expression.txt")
readr::write_tsv(data, expression_file, col_names = TRUE)
# create the gene set file for gsea
# gmt file is a tab delimited file
# it's format is "geneset-name[tab]description[tab]genes separated by tab"
# here we fill description with na.
geneset_file <- file.path(GSEA_folder, "geneset.gmt")
# write line 1
write(paste(names(geneset)[1], "na", paste(unlist(geneset[[1]]),
collapse = "\t"), sep = "\t"),
file = geneset_file)
# write the rest lines
invisible(sapply(2:length(geneset),
function(x) write(paste(names(geneset)[x], "na",
paste(unlist(geneset[[x]]),
collapse = "\t"), sep = "\t"),
file = geneset_file, append = TRUE)))
# run GSEA
gsea <- paste0("java -Xmx512m -cp ", GSEA_file, " xtools.gsea.Gsea -res ",
expression_file," -cls ",label_file," -gmx ",geneset_file,
" -collapse false -mode Max_probe -norm meandiv -nperm 1000 ",
"-permute gene_set -rnd_type no_balance -scoring_scheme weighted ",
"-rpt_label ", GSEA_output_name ," -metric Signal2Noise -sort real ",
"-order descending -include_only_symbols true -make_sets true ",
"-median false -num 100 -plot_top_x 20 -rnd_seed timestamp ",
"-save_rnd_lists false -set_max 100 -set_min 5 -zip_report false ",
"-out ", GSEA_folder, " -gui false")
system(gsea)
}
# attempt to get the file path to the GSEA java program
GSEA_file <- Sys.glob("gsea2-*.jar")
if (length(GSEA_file) != 1) {
stop("Failed to detect the GSEA java file or detect more than one file in the
current folder. Please specify the correct file path to GSEA_file.")
}
# prefix of the folder that stores GSEA results
GSEA_output_name <- "Anr_case"
# folder to store GSEA data and results
GSEA_folder <- "GSEA_data"
if (dir.exists(GSEA_folder)) {
GSEA_output_fd <- Sys.glob(file.path(GSEA_folder,
paste0(GSEA_output_name, ".Gsea*")))
if (length(GSEA_output_fd) == 1) {
print("The GSEA result folder already exists. Let's directly read in its results.")
} else if (length(GSEA_output_fd) == 0) {
run_GSEA(GSEA_file, GSEA_folder, GSEA_output_name, data_raw, data_pheno,
KEGG_subset)
} else {
stop("Detect more than one GSEA result folder. Please remove the unwanted
result folder and only keep one.")
}
} else {
dir.create(GSEA_folder)
run_GSEA(GSEA_file, GSEA_folder, GSEA_output_name, data_raw, data_pheno,
KEGG_subset)
}
# read in GSEA results
gsea_con1 <- readr::read_tsv(Sys.glob(
file.path(GSEA_folder, paste0(GSEA_output_name, ".Gsea*"),
"gsea_report_for_con1_*.xls")), col_names = TRUE)
gsea_con2 <- readr::read_tsv(Sys.glob(
file.path(GSEA_folder, paste0(GSEA_output_name, ".Gsea*"),
"gsea_report_for_con2_*.xls")), col_names = TRUE)
# get significant results
gsea_con1_sig <- gsea_con1[gsea_con1$`FDR q-val` <= 0.05, ]
gsea_con2_sig <- gsea_con2[gsea_con2$`FDR q-val` <= 0.05, ]
KEGG_GSEA <- unique(c(gsea_con1_sig$NAME, gsea_con2_sig$NAME))
```
Now let's compare the significant KEGG pathways in GSEA with the KEGG pathways
associated with the active signatures.
```{r}
# GSEA automatically converts pathway names to upper case, so we first
# built a match table between normal case and upper case
match_table <- data.frame(normal_case = names(KEGG_subset),
upper_case = toupper(names(KEGG_subset)),
stringsAsFactors = FALSE)
# and then convert uppper case to normal case
KEGG_GSEA <- match_table$normal_case[match(KEGG_GSEA, match_table$upper_case)]
KEGG_ADAGE <- as.character(unique(pathway_association$geneset))
# make a venn diagram for pathway overlap
grid.newpage()
invisible(draw.pairwise.venn(length(KEGG_GSEA), length(KEGG_ADAGE),
length(intersect(KEGG_GSEA, KEGG_ADAGE)),
c("GSEA", "ADAGEpath"), fill = c("red", "blue")))
```
As we can see, GSEA and ADAGEpath agree on many pathways. Pathways shared by
ADAGEpath and GSEA are:
```{r}
intersect(KEGG_GSEA, KEGG_ADAGE)
```
Next we look into
pathways significantly enriched in GSEA but not associated with chosen active
signatures. They are:
```{r}
# pathways enriched in GSEA but not associated with ADAGE signatures
# in the top 10 layers of pareto fronts
GSEA_only <- setdiff(KEGG_GSEA, KEGG_ADAGE)
GSEA_only
```
We next check whether they are assoicated with any ADAGE signatures
```{r}
# check whether GSEA-only KEGG pathways are associated with any signatures
# this step will take a while because it checks association between all signatures
# and all pathways
associated_sigs <- find_associated_signatures(input_genesets = GSEA_only,
model = model,
gene_set_list = KEGG_subset,
significance_cutoff = 0.05,
signature_limma_result = limma_result)
associated_sigs <- unlist(associated_sigs)
associated_sigs_df <- data.frame(pathway = GSEA_only, signature = associated_sigs)
knitr::kable(associated_sigs_df)
```
and check how differentially active the associated signatures are.
```{r}
# see how these signatures perform in the activation test together with
# differentially active signatures on the top 10 pareto fronts
plot_volcano(limma_result, highlight_signatures = c(associated_sigs, active_sigs),
interactive = TRUE)
```
As you can see, Node137pos, Node252neg, and Node113neg are very close to meet
our differential activation cutoff.
Here are pathways enriched in GSEA but not associated with any signatures:
```{r}
GSEA_only[is.na(associated_sigs)]
```
Here are pathways that are associated with differentially active ADAGE
signatures but do not pass the significance cutoff in GSEA.
```{r}
ADAGE_only <- setdiff(KEGG_ADAGE, KEGG_GSEA)
ADAGE_only
```
Some of these pathways are detected to be differentially active, they are:
```{r}
ADAGE_active <- unique(combined_result$geneset[combined_result$adj.P.Val <= 0.05])
ADAGE_only_active <- setdiff(ADAGE_active, KEGG_GSEA)
ADAGE_only_active
```
Let's see how they perform in GSEA.
```{r}
ADAGE_only_active <- toupper(ADAGE_only_active)
ADAGE_only_GSEA <- rbind(gsea_con1[na.omit(match(ADAGE_only_active, gsea_con1$NAME)), ],
gsea_con2[na.omit(match(ADAGE_only_active, gsea_con2$NAME)), ])
knitr::kable(ADAGE_only_GSEA)
```
As you can see, they achieved quite low p-values, though did not pass the cutoff
after multiple hypothesis correction.
There are also signatures uncharacterized by KEGG (not associated with any
exisiting KEGG pathway).
```{r}
uncharacterized_sigs <- setdiff(unique_active_sigs, pathway_association$signature)
uncharacterized_sigs
```
Check the signature similarity of the uncharacterized signatures.
```{r}
plot_signature_overlap(selected_signatures = uncharacterized_sigs, model = model)
```
Visualize uncharacterized signatures in the gene-gene network.
```{r}
visualize_gene_network(selected_signatures = uncharacterized_sigs,
gene_color_value = gene_logFC,
model = model, cor_cutoff = 0.5,
curated_pathways = KEGG)
```