-
Notifications
You must be signed in to change notification settings - Fork 7
/
Copy path30-quant.Rmd
1171 lines (867 loc) · 37 KB
/
30-quant.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Quantitative data {#sec-quant}
## Quantitation methodologies
There are a wide range of proteomics quantitation techniques that can
broadly be classified as labelled vs. label-free, depending on whether
the features are labelled prior the MS acquisition and the MS level at
which quantitation is inferred, namely MS1 or MS2.
```{r quanttab, echo=FALSE, results='asis'}
qtb <- matrix(c("XIC", "Counting", "SILAC, 15N", "iTRAQ, TMT"),
nrow = 2, ncol = 2)
dimnames(qtb) <- list(
'MS level' = c("MS1", "MS2"),
'Quantitation' = c("Label-free", "Labelled"))
knitr::kable(qtb)
```
### Label-free MS2: Spectral counting
In spectral counting, one simply counts the number of quantified
peptides that are assigned to a protein.
```{r sc, echo=FALSE, out.width = "75%", fig.cap = "Spectral counting. Figure from the `Pbase` package."}
knitr::include_graphics("./img/pbase.png")
```
### Labelled MS2: Isobaric tagging
Isobaric tagging refers to the labelling using isobaric tags,
i.e. chemical tags that have the same mass and hence can't be
distinguished by the spectrometer. The peptides of different samples (4,
6, 10, 11 or 16) are labelled with different tags and combined prior
to mass spectrometry acquisition. Given that they are isobaric, all
identical peptides, irrespective of the tag and this the sample of
origin, are co-analysed, up to fragmentation prior to MS2
analysis. During fragmentation, the isobaric tags fall of, fragment
themselves, and result in a set of sample specific peaks. These
specific peaks can be used to infer sample-specific quantitation,
while the rest of the MS2 spectrum is used for identification.
```{r itraq, echo=FALSE, out.width = "75%", fig.cap = "iTRAQ 4-plex isobaric tagging. Tandem Mass Tags (TMT) offer up to 16 tags."}
knitr::include_graphics("./img/itraq.png")
```
### Label-free MS1: extracted ion chromatograms
In label-free quantitation, the precursor peaks that match an
identified peptide are integrated over retention time and the area under
that *extracted ion chromatogram* is used to quantify that peptide in
that sample.
```{r lf, echo=FALSE, out.width = "75%", fig.cap = "Label-free quantitation. Figure credit [Johannes Rainer](https://github.com/jorainer/)."}
knitr::include_graphics("./img/chrompeaks.png")
```
### Labelled MS1: SILAC
In SILAC quantitation, sample are grown in a medium that contains
heavy amino acids (typically arginine and lysine). All proteins grown
in this *heavy* growth medium contain the heavy form of these amino
acids. Two samples, one grown in heavy medium, and one grown in normal
(light) medium are then combined and analysed together. The heavy
peptides precursor peaks are systematically shifted compared to the
light ones, and the ratio between the height of a heavy and light
peaks can be used to calculate peptide and protein fold-changes.
```{r silab, echo=FALSE, out.width = "75%", fig.cap = "Silac quantitation. Figure credit Wikimedia Commons."}
knitr::include_graphics("./img/Silac.png")
```
These different quantitation techniques come with their respective
benefits and distinct challenges, such as large quantities of raw data
processing, data transformation and normalisation, missing values, and
different underlying statistical models for the quantitative data
(count data for spectral counting, continuous data for the others).
In terms of raw data quantitation in R/Bioconductor, most efforts have
been devoted to MS2-level quantitation. Label-free XIC quantitation
has been addressed in the frame of metabolomics data processing by the
`r Biocpkg("xcms")` infrastructure.
<!-- Below is a list of suggested packages for some common proteomics -->
<!-- quantitation technologies: -->
<!-- * Isobaric tagging (iTRAQ and TMT): `r Biocpkg("MSnbase")` and `r Biocpkg("isobar")`. -->
<!-- * Label-free: `r Biocpkg("xcms")` (metabolomics). -->
<!-- * Counting: `r Biocpkg("MSnbase")` and `r Biocpkg("MSnID")` for -->
<!-- peptide-spectrum matching confidence assessment. -->
<!-- * `r Githubpkg("vladpetyuk/N14N15")` for heavy Nitrogen-labelled data. -->
## QFeatures {#sec-qf}
Mass spectrometry-based quantitative proteomics data can be
represented as a matrix of quantitative values for features (PSMs,
peptides, proteins) arranged along the rows, measured for a set of
samples, arranged along the columns. There is a common representation
for such quantitative data set, namely the `SummarizedExperiment`
[@SE] class:
```{r sefig, echo = FALSE, fig.cap = "Schematic representation of the anatomy of a `SummarizedExperiment` object. (Figure taken from the `SummarizedExperiment` package vignette.)", out.width='100%', fig.align='center'}
knitr::include_graphics("./img/SE.png")
```
- The sample (columns) metadata can be accessed with the `colData()`
function.
- The features (rows) metadata can be accessed with the `rowData()`
column.
- If the features represent ranges along genomic coordinates, these
can be accessed with `rowRanges()`
- Additional metadata describing the overall experiment can be
accessed with `metadata()`.
- The quantitative data can be accessed with `assay()`.
- `assays()` returns a list of matrix-like assays.
### The QFeatures class
While mass spectrometers acquire data for spectra/peptides, the
biological entity of interest are the protein. As part of the data
processing, we are thus required to **aggregate** low-level
quantitative features into higher level data.
```{r featuresplot, fig.cap = "Conceptual representation of a `QFeatures` object and the aggregative relation between different assays.", echo = FALSE}
par(mar = c(0, 0, 0, 0))
plot(NA, xlim = c(0, 12), ylim = c(0, 20),
xaxt = "n", yaxt = "n",
xlab = "", ylab = "", bty = "n")
for (i in 0:7)
rect(0, i, 3, i+1, col = "lightgrey", border = "white")
for (i in 8:12)
rect(0, i, 3, i+1, col = "steelblue", border = "white")
for (i in 13:18)
rect(0, i, 3, i+1, col = "orange", border = "white")
for (i in 19)
rect(0, i, 3, i+1, col = "darkgrey", border = "white")
for (i in 5:7)
rect(5, i, 8, i+1, col = "lightgrey", border = "white")
for (i in 8:10)
rect(5, i, 8, i+1, col = "steelblue", border = "white")
for (i in 11:13)
rect(5, i, 8, i+1, col = "orange", border = "white")
for (i in 14)
rect(5, i, 8, i+1, col = "darkgrey", border = "white")
rect(9, 8, 12, 8+1, col = "lightgrey", border = "white")
rect(9, 9, 12, 9+1, col = "steelblue", border = "white")
rect(9, 10, 12, 10+1, col = "orange", border = "white")
rect(9, 11, 12, 11+1, col = "darkgrey", border = "white")
segments(3, 8, 5, 8, lty = "dashed")
segments(3, 6, 5, 7, lty = "dashed")
segments(3, 4, 5, 6, lty = "dashed")
segments(3, 0, 5, 5, lty = "dashed")
segments(3, 10, 5, 9, lty = "dashed")
segments(3, 11, 5, 10, lty = "dashed")
segments(3, 13, 5, 11, lty = "dashed")
segments(3, 14, 5, 12, lty = "dashed")
segments(3, 16, 5, 13, lty = "dashed")
segments(3, 19, 5, 14, lty = "dashed")
segments(3, 20, 5, 15, lty = "dashed")
segments(8, 5, 9, 8, lty = "dashed")
segments(8, 8, 9, 9, lty = "dashed")
segments(8, 11, 9, 10, lty = "dashed")
segments(8, 14, 9, 11, lty = "dashed")
segments(8, 15, 9, 12, lty = "dashed")
```
We are going to start to familiarise ourselves with the `QFeatures`
class implemented in the
[`QFeatures`](https://rformassspectrometry.github.io/QFeatures/articles/QFeatures.html)
package. The class is derived from the Bioconductor
`MultiAssayExperiment` [@MAE] (MAE) class. Let's start by loading the
`QFeatures` package.
```{r, message = FALSE}
library("QFeatures")
```
Next, we load the `feat1` test data, which is composed of single
*assay* of class `SummarizedExperiment` composed of 10 rows and 2
columns.
```{r load_feat1}
data(feat1)
feat1
```
Let's perform some simple operations to familiarise ourselves with the
`QFeatures` class:
- Extract the sample metadata using the `colData()` accessor (like you
have previously done with `SummarizedExperiment` objects).
```{r cd}
colData(feat1)
```
We can also further annotate the experiment by adding columns to the `colData` slot:
```{r cd2}
colData(feat1)$X <- c("X1", "X2")
feat1$Y <- c("Y1", "Y2")
colData(feat1)
```
- Extract the first (and only) assay composing this `QFeatures` data
using the `[[` operator (as you have done to extract elements of a
list) by using the assay's index or name.
```{r assay1}
feat1[[1]]
feat1[["psms"]]
```
- Extract the `psms` assay's row data and quantitative values.
```{r rd}
assay(feat1[[1]])
rowData(feat1[[1]])
```
### Feature aggregation
The central functionality of the `QFeatures` infrastructure is the
aggregation of features into higher-level features while retaining the
link between the different levels. This can be done with the
[`aggregateFeatures()` function](https://rformassspectrometry.github.io/QFeatures/reference/QFeatures-aggregate.html).
The call below will
- operate on the `psms` assay of the `feat1` objects;
- aggregate the rows of the assay following the grouping defined in the
`peptides` row data variables;
- perform aggregation using the `colMeans()` function;
- create a new assay named `peptides` and add it to the `feat1`
object.
```{r agg1}
feat1 <- aggregateFeatures(feat1, i = "psms",
fcol = "Sequence",
name = "peptides",
fun = colMeans)
feat1
```
- Let's convince ourselves that we understand the effect of feature
aggregation and repeat the calculations manually and check the
content of the new assay's row data.
```{r cm}
## SYGFNAAR
colMeans(assay(feat1[[1]])[1:3, ])
assay(feat1[[2]])["SYGFNAAR", ]
## ELGNDAYK
colMeans(assay(feat1[[1]])[4:6, ])
assay(feat1[[2]])["ELGNDAYK", ]
## IAEESNFPFIK
colMeans(assay(feat1[[1]])[7:10, ])
assay(feat1[[2]])["IAEESNFPFIK", ]
```
```{r rd2}
rowData(feat1[[2]])
```
We can now aggregate the peptide-level data into a new protein-level
assay using the `colMedians()` aggregation function.
```{r agg2}
feat1 <- aggregateFeatures(feat1, i = "peptides",
fcol = "Protein",
name = "proteins",
fun = colMedians)
feat1
assay(feat1[["proteins"]])
```
### Subsetting and filtering
The link between the assays becomes apparent when we now subset the
assays for protein A as shown below or using the `subsetByFeature()`
function. This creates a new instance of class `QFeatures` containing
assays with the expression data for protein, its peptides and their
PSMs.
```{r prota}
feat1["ProtA", , ]
```
The `filterFeatures()` function can be used to filter rows the assays
composing a `QFeatures` object using the row data variables. We can
for example retain rows that have a `pval` < 0.05, which would only
keep rows in the `psms` assay because the `pval` is only relevant for
that assay.
```{r ff1}
filterFeatures(feat1, ~ pval < 0.05)
```
`r msmbstyle::question_begin()`
As the message above implies, it is also possible to apply a filter to
only the assays that have a filtering variables by setting the `keep`
variables.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r ff1b}
filterFeatures(feat1, ~ pval < 0.05, keep = TRUE)
```
`r msmbstyle::solution_end()`
On the other hand, if we filter assay rows for those that localise to
the mitochondrion, we retain the relevant protein, peptides and PSMs.
```{r ff2}
filterFeatures(feat1, ~ location == "Mitochondrion")
```
`r msmbstyle::question_begin()`
As an exercise, let's filter rows that do not localise to the
mitochondrion.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r ff3}
filterFeatures(feat1, ~ location != "Mitochondrion")
```
`r msmbstyle::solution_end()`
You can refer to the [*Quantitative features for mass spectrometry
data*](https://rformassspectrometry.github.io/QFeatures/articles/QFeatures.html)
vignette and the `QFeatures` [manual
page](https://rformassspectrometry.github.io/QFeatures/reference/QFeatures-class.html)
for more details about the class.
## Creating `QFeatures` object
```{r loaddfr, echo = FALSE}
data(hlpsms)
```
While `QFeatures` objects can be created manually (see `?QFeatures`
for details), most users have a quantitative data in a spreadsheet or
a data.frame. In such cases, the easiest is to use the `readQFeatures`
function to extract the quantitative data and metadata columns. Below,
we load the `hlpsms` dataframe that contains data for `r ncol(hlpsms)`
PSMs from the TMT-10plex *hyper*LOPIT spatial proteomics experiment
from [@Christoforou:2016]. The `ecol` argument specifies that columns
1 to 10 contain quantitation data, and that the assay should be named
`psms` in the returned `QFeatures` object, to reflect the nature of
the data.
```{r readQFeatures}
data(hlpsms)
hl <- readQFeatures(hlpsms, ecol = 1:10, name = "psms")
hl
```
Below, we see that we can extract an assay using its index or its
name. The individual assays are stored as *SummarizedExperiment*
object and further access its quantitative data and metadata using
the `assay` and `rowData` functions.
```{r subsetassay}
hl[[1]]
hl[["psms"]]
head(assay(hl[["psms"]]))
head(rowData(hl[["psms"]]))
```
For further details on how to manipulate such objects, refer to the
`r BiocStyle::Biocpkg("MultiAssayExperiment")` [@MAE] and
`r BiocStyle::Biocpkg("SummarizedExperiment")` [@SE] packages.
It is also possible to first create a `SummarizedExperiment`, and then
only include it into a `QFeatures` object.
```{r}
se <- readSummarizedExperiment(hlpsms, ecol = 1:10)
se
```
```{r}
QFeatures(list(psm = se))
```
At this stage, i.e. at the beginning of the analysis, whether you have
a `SummarizedExperiment` or a `QFeatures` object, it is a good time to
define the experimental design in the `colData` slot.
### Exercise {-}
The CPTAC spike-in study 6 [@Paulovich:2010] combines the Sigma UPS1
standard containing 48 different human proteins that are spiked in at
5 different concentrations (conditions A to E) into a constant yeast
protein background. The sample were acquired in triplicate on
different instruments in different labs. We are going to start with a
subset of the CPTAC study 6 containing conditions A and B for a single
lab.
```{r cptac, echo = FALSE, fig.cap = "The CPTAC spike-in study design (credit Lieven Clement, statOmics, Ghent University).", out.width='70%', fig.align='center'}
knitr::include_graphics("./img/cptac.png")
```
The peptide-level data, as processed by MaxQuant [@Cox:2008] is
available in the `msdata` package:
```{r msdata}
basename(f <- msdata::quant(pattern = "cptac", full.names = TRUE))
```
`r msmbstyle::question_begin()`
Read these data in as either a `SummarizedExperiment` or a `QFeatures`
object.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
From the names of the columns, we see that the quantitative columns,
starting with `"Intensity."` (note the dot!) are at positions 56 to
61.
```{r cptac_cols}
names(read.delim(f))
(i <- grep("Intensity\\.", names(read.delim(f))))
```
We now read these data using the `readSummarizedExperiment`
function. This peptide-level expression data will be imported into R
as an instance of class `SummarizedExperiment`. We also use the
`fnames` argument to set the row-names of the `peptides` assay to the
peptide sequences and specify that the file is a tab-separated table.
```{r readse}
cptac_se <- readSummarizedExperiment(f, ecol = i,
fnames = "Sequence",
sep = "\t")
cptac_se
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Before proceeding, we are going to clean up the sample names by
removing the unnecessary *Intensity* prefix and annotate the
experiment in the object's `colData`.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r cptac_names}
colnames(cptac_se) <- sub("I.+\\.", "", colnames(cptac_se))
cptac_se$condition <- sub("_[7-9]", "", colnames(cptac_se))
cptac_se$id <- sub("^.+_", "", colnames(cptac_se))
colData(cptac_se)
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
There are many row variables that aren't useful here. Get rid or all
of them but `Sequence`, `Proteins`, `Leading.razor.protein`, `PEP`,
`Score`, `Reverse`, and `Potential.contaminant`.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r vars}
keep_var <- c("Sequence", "Proteins", "Leading.razor.protein", "PEP",
"Score", "Reverse", "Potential.contaminant")
rowData(cptac_se) <- rowData(cptac_se)[, keep_var]
```
`r msmbstyle::solution_end()`
## Analysis pipeline
A typical quantitative proteomics data processing is composed of the
following steps, which we are going to apply to the cptac data created
above.
- Data import
- Exploratory data analysis (PCA)
- Missing data management (filtering and/or imputation)
- Data cleaning
- Transformation and normalisation
- Aggregation
- Downstream analysis
```{r pkgs, message = FALSE}
library("tidyverse")
library("ggplot2")
library("QFeatures")
library("limma")
```
### Missing values
Missing values can be highly frequent in proteomics. There are two
reasons supporting the existence of missing values, namely biological
or technical.
1. Values that are missing due to the absence (or extremely low
concentration) of a protein are observed for biological reasons,
and their pattern **aren't random** (MNAR). A protein missing
due to the suppression of its expression will not be missing at
random: it will be missing in the condition in which it was
suppressed, and be present in the condition where it is expressed.
2. Due to its data-dependent acquisition, mass spectrometry isn't
capable of assaying all peptides in a sample. Peptides that are
less abundant than some of their co-eluting ions, peptides that do
not ionise well or peptides that do not get identified might be
sporadically missing in the final quantitation table, despite their
presence in the biological samples. Their absence patterns are
(completely) **random** (MAR or MCAR) in such cases.
Often, third party software that produce quantitative data use zeros
instead of properly reporting missing values. We can use the
`zeroIsNA()` function to replace the `0` by `NA` values in our
`cptac_se` object and then explore the missing data patterns across
columns and rows.
```{r na}
cptac_se <- zeroIsNA(cptac_se)
nNA(cptac_se)
```
```{r imagena, echo = FALSE, fig.cap = "Distribution of missing value (white). Peptides row with more missing values are moved towards the top of the figure."}
.image2 <- function (x, yticks = 10,
x.cex.axis = 0.75,
y.cex.axis = 0.75,
xlab = "Samples",
ylab = "Features", ...) {
nc <- ncol(x)
nr <- nrow(x)
lab <- colnames(x)
if (is.null(lab))
lab <- 1:nc
graphics::image(t(x), xlab = xlab, ylab = ylab,
xaxt = "n",
yaxt = "n", ...)
axis(1, seq(0, 1, 1/(nc - 1)), labels = lab, cex.axis = x.cex.axis)
yticks <- seq(0, 1, 1/(yticks - 1)) * nr
axis(2, seq(0, 1, 1/(length(yticks) - 1)),
labels = round(yticks, 0),
cex.axis = y.cex.axis)
invisible(NULL)
}
.x <- is.na(assay(cptac_se))
o <- order(rowSums(.x))
.image2(.x[o, ],
col = c("black", "white"),
ylab = "Peptides")
```
Let's now explore these missing values:
- Explore the number or proportion of missing values across peptides
and samples of the `cptac_se` data.
```{r na2}
barplot(nNA(cptac_se)$nNAcols$pNA)
table(nNA(cptac_se)$nNArows$nNA)
```
- Remove rows that have *too many* missing values. You can do this by
hand or using the `filterNA()` function.
```{r filterna}
## remove rows that have 4 or more NAs out of 6
cptac_se <- filterNA(cptac_se, pNA = 4/6)
```
### Imputation
Imputation is the technique of replacing missing data with probable
values. This can be done with `impute()` method. As we have discussed
above, there are however two types of missing values in mass
spectrometry-based proteomics, namely data missing at random (MAR),
and data missing not at random (MNAR). These two types of missing
data, those missing at random, and those missing not at random, need
to be imputed with [different types of imputation
methods](https://rformassspectrometry.github.io/QFeatures/articles/Processing.html#imputation-1)
[@Lazar:2016].
```{r miximp, echo = FALSE, fig.cap = "Mixed imputation method. Black cells represent presence of quantitation values and light grey corresponds to missing data. The two groups of interest are depicted in green and blue along the heatmap columns. Two classes of proteins are annotated on the left: yellow are proteins with randomly occurring missing values (if any) while proteins in brown are candidates for non-random missing value imputation."}
data(se_na2)
x <- assay(impute(se_na2, "zero"))
x[x != 0] <- 1
suppressPackageStartupMessages(library("gplots"))
heatmap.2(x, col = c("lightgray", "black"),
scale = "none", dendrogram = "none",
trace = "none", keysize = 0.5, key = FALSE,
RowSideColors = ifelse(rowData(se_na2)$randna, "orange", "brown"),
ColSideColors = rep(c("steelblue", "darkolivegreen"), each = 8))
```
```{r lazar, fig.cap = "Effect of the nature of missing values on their imputation. Root-mean-square error (RMSE) observations standard deviation ratio (RSR), KNN and MinDet imputation. Lower (blue) is better.", echo = FALSE, out.width='100%'}
knitr::include_graphics("./img/imp-sim.png")
```
Generally, it is recommended to use **hot deck** methods (nearest
neighbour (**left**), maximum likelihood, ...) when data are missing
at random.Conversely, MNAR features should ideally be imputed with a
**left-censor** (minimum value (**right**), but not zero, ...) method.
There are various methods to perform data imputation, as described in
`?impute`. The `r CRANpkg("imp4p")` package contains additional
functionality, including some to estimate the randomness of missing
data.
The general syntax for imputation is shown below, using the `se_na2`
object as an example:
```{r impute}
data(se_na2)
## impute missing values using knn imputation
impute(se_na2, method = "knn")
```
`r msmbstyle::question_begin()`
Following the example above, apply a mixed imputation, using knn for
data missing at random and the zero imputation for data missing not at
random. Hint: the `randna` variable defines which features are assumed
to be missing at random.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r naex1, message = FALSE}
impute(se_na2, "mixed",
randna = rowData(se_na2)$randna,
mar = "knn", mnar = "zero")
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
When assessing missing data imputation methods, such as in [Lazar et
al. (2016)](https://pubs.acs.org/doi/abs/10.1021/acs.jproteome.5b00981),
one often replaces values with missing data, imputes these with a
method of choice, then quantifies the difference between original
(expected) and observed (imputed) values. Here, using the `se_na2`
data, use this strategy to assess the difference between knn and
Bayesian PCA imputation.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r naex2, cache = TRUE}
imp1 <- impute(se_na2, method = "knn")
imp2 <- impute(se_na2, method = "bpca")
summary(abs(assay(imp1)[is.na(assay(se_na2))] - assay(imp2)[is.na(assay(se_na2))]))
summary(as.numeric(na.omit(assay(se_na2))))
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
When assessing the impact of missing value imputation on real data,
one can't use the strategy above. Another useful approach is to assess
the impact of the imputation method on the distribution of the
quantitative data. For instance, here is the intensity distribution of
the `se_na2` data. Verify the effect of applying `knn`, `zero`,
`MinDet` and `bpca` on this distribution.
```{r nasetdist, fig.cap = "Intensity disctribution of the `naset` data."}
plot(density(na.omit(assay(se_na2))))
```
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r naex3, cache = TRUE}
cls <- c("black", "red", "blue", "steelblue", "orange")
plot(density(na.omit(assay(se_na2))), col = cls[1])
lines(density(assay(impute(se_na2, method = "knn"))), col = cls[2])
lines(density(assay(impute(se_na2, method = "zero"))), col = cls[3])
lines(density(assay(impute(se_na2, method = "MinDet"))), col = cls[4])
lines(density(assay(impute(se_na2, method = "bpca"))), col = cls[5])
legend("topright", legend = c("orig", "knn", "zero", "MinDet", "bpca"),
col = cls, lwd = 2, bty = "n")
```
`r msmbstyle::solution_end()`
**Tip**: When downstream analyses permit, it might be safer not to
impute data and deal explicitly with missing values. Indeed missing
data imputation is not straightforward, and is likely to dramatically
fail when a high proportion of data is missing (10s of %). It is
possible to keep NAs when performing hypothesis tests^[Still, it is
recommended to explore missingness as part of the exploratory data
analysis.], but (generally) not to perform a principal component
analysis.
### Identification quality control
As discussed in the previous chapter, PSMs are deemed relevant after
comparison against hits from a decoy database. The origin of these
hits is recorded with `+` in the `Reverse` variable:
```{r revervetab}
table(rowData(cptac_se)$Reverse)
```
Similarly, a proteomics experiment is also searched against a database
of contaminants:
```{r cont}
table(rowData(cptac_se)$Potential.contaminant)
```
Let's visualise some of the cptac's metadata using standard `ggplot2`
code:
`r msmbstyle::question_begin()`
Visualise the identification score and the posterior probability
probability (PEP) distributions from forward and reverse hits and
interpret the figure.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r idqc1}
rowData(cptac_se) %>%
as_tibble() %>%
ggplot(aes(x = Score, colour = Reverse)) +
geom_density()
```
```{r idqc2}
rowData(cptac_se) %>%
as_tibble() %>%
ggplot(aes(x = PEP, colour = Reverse)) +
geom_density()
```
`r msmbstyle::solution_end()`
**Note**: it is also possible to compute and visualise protein groups
as connected components starting from a quantitative dataset such as a
`SummarizedExperiment`. See the [*Using quantitative
data*](https://rformassspectrometry.github.io/PSMatch/articles/AdjacencyMatrix.html#using-quantitative-data)
section in the *Understanding protein groups with adjacency matrices*
vignette.
### Creating the QFeatures data
We can now create our `QFeatures` object using the
`SummarizedExperiment` as shown below.
```{r qf}
cptac <- QFeatures(list(peptides = cptac_se))
cptac
```
We should also assign the `QFeatures` column data with the
`SummarizedExperiment` slot.
```{r qfcd}
colData(cptac) <- colData(cptac_se)
```
Note that it is also possible to directly create a `QFeatures` object
with the `readQFeatures()` function and the same arguments as the
`readSummarizedExperiment()` used above. In addition, most functions
used above and below work on single `SummarizedExperiment` objects or
assays within a `QFeatures` object.
### Filtering out contaminants and reverse hits
`r msmbstyle::question_begin()`
Using the `filterFeatures()` function, filter out the reverse and
contaminant hits, and also retain those that have a posterior error
probability smaller than 0.05.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r qcfilter}
cptac <-
cptac %>%
filterFeatures(~ Reverse != "+") %>%
filterFeatures(~ Potential.contaminant != "+") %>%
filterFeatures(~ PEP < 0.05)
```
`r msmbstyle::solution_end()`
### Log-transformation and normalisation
The two code chunks below log-transform and normalise using the assay
`i` as input and adding a new one names as defined by `name`.
```{r log}
cptac <- logTransform(cptac, i = "peptides",
name = "log_peptides")
```
`r msmbstyle::question_begin()`
Use the `normalize()` method to normalise the data. The syntax is the
same as `logTransform()`. Use the `center.median` method.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r nrom}
cptac <- normalize(cptac, i = "log_peptides",
name = "lognorm_peptides",
method = "center.median")
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Visualise the result of the transformations above. The
`plotDensities()` function from the `limma` package is very
convenient, but feel free to use boxplots, violin plots, or any other
visualisation that you deem useful to assess the tranformations.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r plotdens, fig.cap = "Three peptide level assays: raw data, log transformed and normalised.", fig.width = 15, fig.height = 5}
par(mfrow = c(1, 3))
limma::plotDensities(assay(cptac[["peptides"]]))
limma::plotDensities(assay(cptac[["log_peptides"]]))
limma::plotDensities(assay(cptac[["lognorm_peptides"]]))
```
`r msmbstyle::solution_end()`
### Aggregation
`r msmbstyle::question_begin()`
Use median aggregation to aggregation peptides into protein
values. This is not necessarily the best choice, as we will see
later, but a good start.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r qfagg1, message = FALSE}
cptac <-
aggregateFeatures(cptac,
"lognorm_peptides",
name = "proteins_med",
fcol = "Leading.razor.protein",
fun = colMedians,
na.rm = TRUE)
```
`r msmbstyle::solution_end()`
Looking at the `.n` row variable computed during the aggregation, we
see that most proteins result from the aggregation of 5 peptides or
less, while very few proteins are accounted for by tens of peptides.
```{r}
table(rowData(cptac[["proteins_med"]])$.n)
```
### Principal component analysis
```{r pca, message = FALSE}
library("factoextra")
pca_pep <-
cptac[["lognorm_peptides"]] %>%
filterNA() %>%
assay() %>%
t() %>%
prcomp(scale = TRUE, center = TRUE) %>%
fviz_pca_ind(habillage = cptac$condition, title = "Peptides")
pca_prot <-
cptac[["proteins_med"]] %>%
filterNA() %>%
assay() %>%
t() %>%
prcomp() %>%
fviz_pca_ind(habillage = cptac$condition,
title = "Proteins (median aggregation)")
```
```{r plotpca, fig.width = 12, fig.height = 6, fig.cap = "Peptide and protein level PCA analyses."}
library("patchwork")
pca_pep + pca_prot
```
### Visualisation
Below, we use the `longFormat()` function to extract the quantitative
and row data in a long format, that can be directly reused by the
tidyverse tools.
```{r vis, message = FALSE, warning = FALSE, fig.width = 12, fig.height = 6, fig.cap = "Peptide and protein expression profile."}
longFormat(cptac["P02787ups|TRFE_HUMAN_UPS", ,
c("lognorm_peptides", "proteins_med")]) %>%
as_tibble() %>%
mutate(condition = ifelse(grepl("A", colname), "A", "B")) %>%
ggplot(aes(x = colname, y = value, colour = rowname, shape = condition)) +
geom_point(size = 3) +
geom_line(aes(group = rowname)) +
facet_grid(~ assay) +
ggtitle("P02787ups|TRFE_HUMAN_UPS")
```
We can also visualise the assays withing a `QFeatures` object and
their relation.
```{r plotqf}
plot(cptac)
```
`r msmbstyle::question_begin()`
The example above shows a simple linear relationship between
assays. Create a more interesting one by applying a different