---
title: "WGBS ChIP-seq comparison with ENCODE and ChIP-Atlas data"
date: "Compiled at `r format(Sys.time(), '%Y-%m-%d %H:%M:%S', tz = 'UTC')` UTC"
output: github_document
params:
  name: "script__WGBS_ChIP-seq_comparison" # change if you rename file
---

```{r here, message=FALSE, echo = F}
here::i_am(paste0(params$name, ".Rmd"), uuid = "c5353e6a-3848-43ac-aaac-45660207548f")
knitr::opts_chunk$set(dpi = 200, warning = F, message = F)
```


The purpose of this document is to validate some of our results with data from ENCODE and ChIP-Atlas. 

In this script we analyze data from A549, K562, H1, and mES cells.


```{r packages}
library(conflicted)
library(pheatmap)
library(RColorBrewer)
library(ggplot2)
library(ppcor)
```

```{r directories, echo = F}
# create or *empty* the target directory, used to write this file's data: 
projthis::proj_create_dir_target(params$name, clean = TRUE)

# function to get path to target directory: path_target("sample.csv")
path_target <- projthis::proj_path_target(params$name)

# function to get path to previous data: path_source("00-import", "sample.csv")
path_source <- projthis::proj_path_source(params$name)
```


# cell type: A549

We gather the data from the following sources:


H3K27me3: https://www.encodeproject.org/files/ENCFF702IOJ/

CBX8: https://www.encodeproject.org/files/ENCFF081CPV/ (peak identifier: ENCFF552VXR)

WGBS (meDNA): https://www.encodeproject.org/files/ENCFF723WVM/


Because the files differ in length, the average values for the datasets are computed over consecutive genome bins with the `bins` mode of deepTools `multiBigwigSummary` (bin size: 1000 bp instead of the 10 kb default; distance between bins: 0). Blacklisted regions are removed, and the result is stored as a multiBigwigSummary object.
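The binning step itself is not part of this script. As a rough sketch, the call could be assembled from R as shown here. The bigWig and blacklist file names are placeholders (assumptions, not the files actually used), and the command is only printed, not run.

```r
# Hypothetical input files; replace with the downloaded bigWig files.
bw_files  <- c("CBX8.bigWig", "H3K27me3.bigWig", "WGBS.bigWig")
blacklist <- "blacklist.bed"  # placeholder name for the blacklisted regions

# Arguments for deepTools' multiBigwigSummary in `bins` mode.
args <- c(
  "bins",
  "--bwfiles", bw_files,
  "--binSize", "1000",              # 1000 bp bins instead of the 10 kb default
  "--distanceBetweenBins", "0",
  "--blackListFileName", blacklist,
  "--outFileName", "results.npz",   # placeholder output name
  "--outRawCounts", "multiBigwigSummary_A549_CBX8_H3K27me3_WGBS.tabular"
)

# Only print the command line here.
cmd <- paste("multiBigwigSummary", paste(args, collapse = " "))
cat(cmd, "\n")
```

Once deepTools is installed and the files exist, the same argument vector could be passed to `system2("multiBigwigSummary", args)`.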

## Bin size: 1000bp

```{r}
## read processed multiBigwigSummary object
pat <- "data/deeptools/"

multiBigwigSummary_A549 <- read.table(paste0(pat, "multiBigwigSummary_A549_CBX8_H3K27me3_WGBS.tabular"))

## change column names
colnames(multiBigwigSummary_A549) <- c('chr',	'start',	'end',	'CBX8',	'H3K27me3',	'WGBS')
```

### Only keep chromosome 1 - 22 and X


```{r}
multiBigwigSummary_A549 <- multiBigwigSummary_A549[which(multiBigwigSummary_A549$chr %in% paste0('chr', c(1:22, "X"))), ]
dim(multiBigwigSummary_A549)
```

Here's the remaining set of chromosomes considered in our analysis

```{r}
table(multiBigwigSummary_A549$chr)
```

```{r}
head(multiBigwigSummary_A549)
```

### Missing values 

```{r}
(colSums(is.na(multiBigwigSummary_A549))/nrow(multiBigwigSummary_A549))[4:6]
```
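To make the missing-value summary concrete, here is a minimal sketch on a made-up table with the same column layout. The values are invented for illustration only.

```r
# Toy table mimicking the multiBigwigSummary layout (made-up values).
toy <- data.frame(
  chr      = c("chr1", "chr1", "chr2", "chr2"),
  start    = c(0, 1000, 0, 1000),
  end      = c(1000, 2000, 1000, 2000),
  CBX8     = c(0.5, NA, 1.2, 0.8),
  H3K27me3 = c(1.1, 0.9, NA, NA),
  WGBS     = c(80, 75, 10, 60)
)

# Fraction of missing bins per signal column. colMeans() of a logical
# matrix is equivalent to colSums(...) / nrow(...).
na_fraction <- colMeans(is.na(toy[, c("CBX8", "H3K27me3", "WGBS")]))
print(na_fraction)
```

Selecting the signal columns by name rather than by position (`[4:6]`) keeps the summary valid if the column order changes between cell types.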

### Dimension of the dataset

```{r}
dim(multiBigwigSummary_A549)
```

### Plot distribution of fold changes


```{r warning=F, message = F}
# WGBS
distr_WGBS <- ggplot(multiBigwigSummary_A549, aes(x=WGBS)) + geom_histogram() + theme_minimal()

# H3K27me3
distr_H3K27me3 <- ggplot(multiBigwigSummary_A549, aes(x=H3K27me3)) + geom_histogram() + theme_minimal()

# CBX8
distr_CBX8 <- ggplot(multiBigwigSummary_A549, aes(x=log(CBX8+0.1) )) + geom_histogram() + theme_minimal()

gridExtra::grid.arrange(distr_WGBS, distr_H3K27me3, distr_CBX8)
```


The WGBS data can take values between 0 and 100. For all other datasets, fold changes relative to a reference genome are given.


## Kendall's correlation

We use Kendall's partial correlation, a rank-based measure, to account for the non-normal distributions and for the fact that the WGBS data live on a different scale than the ChIP-seq data.
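As an illustration of the idea, the following base-R sketch computes partial correlations from the inverse of a Kendall correlation matrix on simulated data. It does not call `ppcor`; we assume, without verifying it here, that `ppcor::pcor` follows the same matrix-inversion approach. All variable names are made up for the example.

```r
# Simulated data (not the ENCODE tracks): x and y both depend on z.
set.seed(1)
n <- 500
z <- rnorm(n)
x <- z + rnorm(n)
y <- z + rnorm(n)
dat <- cbind(x = x, y = y, z = z)

# Pairwise Kendall's tau.
tau <- cor(dat, method = "kendall")

# Partial correlation of i and j given all remaining variables:
# -P_ij / sqrt(P_ii * P_jj), with P the inverse of the correlation matrix.
P <- solve(tau)
partial_tau <- -cov2cor(P)
diag(partial_tau) <- 1

print(round(tau, 3))
print(round(partial_tau, 3))
```

Because `x` and `y` are related only through `z`, their partial correlation is expected to be clearly smaller than their pairwise tau.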


### Entire genome

```{r eval = T}
mm <- multiBigwigSummary_A549[, c('CBX8',	'H3K27me3',	'WGBS')]
mm_int <- mm['H3K27me3'] * mm['WGBS']
names(mm_int) <- 'H3K27me3:WGBS'
mm <- cbind(mm, mm_int)
## keep only rows without NA values (logical indexing is safe even if no row contains NA)
mm <- mm[rowSums(is.na(mm)) == 0, ]
```
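A small sketch on made-up values shows the interaction column and a subtlety of NA filtering: a negative `which()` index returns zero rows when no row contains an NA, whereas logical indexing keeps all complete rows.

```r
# Made-up values, no NA in any row.
toy <- data.frame(H3K27me3 = c(1.0, 2.0, 0.5), WGBS = c(80, 10, 50))

# Interaction column: element-wise product of the two signals.
toy[["H3K27me3:WGBS"]] <- toy$H3K27me3 * toy$WGBS

# Negative indexing with an empty index selects nothing ...
na_rows <- which(rowSums(is.na(toy)) > 0)  # integer(0) here
dropped_all <- toy[-na_rows, ]

# ... while logical indexing keeps every complete row.
kept <- toy[rowSums(is.na(toy)) == 0, ]

print(nrow(dropped_all))
print(nrow(kept))
```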

```{r eval = T}
set.seed(1234)
subs <- sample(1:nrow(mm), 50000) ## only a subsample to reduce compute time
pkend_A549 <- pcor(mm[subs, ], method = "kendall")
```
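The subsampling is reproducible because the seed is fixed immediately before `sample()`. A minimal sketch, with a hypothetical row count and a guard for tables smaller than the requested subsample:

```r
n_rows <- 120000               # hypothetical number of complete bins
n_sub  <- min(50000, n_rows)   # never request more rows than available

set.seed(1234)
subs_a <- sample(seq_len(n_rows), n_sub)
set.seed(1234)
subs_b <- sample(seq_len(n_rows), n_sub)

# Same seed, same subsample; sample() draws without replacement by default.
print(identical(subs_a, subs_b))
print(anyDuplicated(subs_a) == 0)
```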


```{r}
pheatmap::pheatmap(pkend_A549$estimate, 
         breaks = seq(-1, 1, length.out = 41), # one more break than colours
         color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdBu")))(40),
         cluster_rows = F, 
         cluster_cols = F, 
         display_numbers = T, 
         number_format = "%.3f",
         fontsize_number = 13,
         cellwidth = 45,
         cellheight = 45, 
         border_color = "white")
```
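The pheatmap documentation describes `breaks` as one element longer than the colour vector. The relation can be sketched with base R alone; the hex colours are chosen for illustration and are not the `RdBu` palette from RColorBrewer.

```r
# Diverging palette built with base grDevices (illustrative colours).
n_col   <- 40
palette <- grDevices::colorRampPalette(c("#2166AC", "#F7F7F7", "#B2182B"))(n_col)

# n_col + 1 equally spaced breaks on [-1, 1] define n_col colour intervals.
breaks <- seq(-1, 1, length.out = n_col + 1)

# Map a few example partial correlations to their colour interval.
vals <- c(-0.8, 0, 0.3, 1)
idx  <- cut(vals, breaks = breaks, include.lowest = TRUE, labels = FALSE)
print(data.frame(value = vals, colour = palette[idx]))
```

With this layout the scale is symmetric around zero, so positive and negative partial correlations of equal magnitude receive mirrored colours.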


# cell type: K562


We gather the data from the following sources:


H3K27me3: https://www.encodeproject.org/files/ENCFF405HIO/

CBX8: https://www.encodeproject.org/files/ENCFF687ZGN/

CBX8 peaks: https://www.encodeproject.org/files/ENCFF522HZT/

WGBS (meDNA): https://www.encodeproject.org/files/ENCFF459XNY/


```{r}
## read processed multiBigwigSummary object
multiBigwigSummary_K562 <- read.table(paste0(pat, "multiBigwigSummary_K562_CBX8_H3K27me3_WGBS.tabular"))

## change column names
colnames(multiBigwigSummary_K562) <- c('chr',	'start',	'end',	'WGBS',	'H3K27me3',	'CBX8')
```

### Only keep chromosome 1 - 22 and X


```{r}
multiBigwigSummary_K562 <- multiBigwigSummary_K562[which(multiBigwigSummary_K562$chr %in% paste0('chr', c(1:22, "X"))), ]
dim(multiBigwigSummary_K562)
```

Here's the remaining set of chromosomes considered in our analysis

```{r}
table(multiBigwigSummary_K562$chr)
```

```{r}
head(multiBigwigSummary_K562)
```

### Missing values 

```{r}
(colSums(is.na(multiBigwigSummary_K562))/nrow(multiBigwigSummary_K562))[4:6]
```

### Dimension of the dataset

```{r}
dim(multiBigwigSummary_K562)
```

### Plot distribution of fold changes


```{r warning=F, message = F}
# WGBS
distr_WGBS <- ggplot(multiBigwigSummary_K562, aes(x = WGBS)) + geom_histogram() + theme_minimal()

# H3K27me3
distr_H3K27me3 <- ggplot(multiBigwigSummary_K562, aes(x = H3K27me3)) + geom_histogram() + theme_minimal()

# CBX8
distr_CBX8 <- ggplot(multiBigwigSummary_K562, aes(x = CBX8)) + geom_histogram() + theme_minimal()


gridExtra::grid.arrange(distr_WGBS, distr_H3K27me3, distr_CBX8)
```


The WGBS data can take values between 0 and 100. For all other datasets, fold changes relative to a reference genome are given.


## Kendall's correlation


```{r eval = T}
mm <- multiBigwigSummary_K562[, c('CBX8',	'H3K27me3',	'WGBS')]
mm_int <- mm['H3K27me3'] * mm['WGBS']
names(mm_int) <- 'H3K27me3:WGBS'
mm <- cbind(mm, mm_int)
## keep only rows without NA values (logical indexing is safe even if no row contains NA)
mm <- mm[rowSums(is.na(mm)) == 0, ]
```

```{r}
set.seed(1234)
subs <- sample(1:nrow(mm), 50000)
pkend_K562 <- pcor(mm[subs, ], method = "kendall")
```


```{r}
pheatmap::pheatmap(pkend_K562$estimate, 
         breaks = seq(-1, 1, length.out = 41), # one more break than colours
         color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdBu")))(40),
         cluster_rows = F, 
         cluster_cols = F, 
         display_numbers = T, 
         number_format = "%.3f",
         fontsize_number = 13,
         cellwidth = 45,
         cellheight = 45, 
         border_color = "white")
```


# cell type: H1


We gather the data from the following sources:


H3K27me3: https://www.encodeproject.org/files/ENCFF345VHG/ (rep 1 + 2 combined)

CBX8: https://www.encodeproject.org/files/ENCFF284JDC/ (rep 1 + 2 combined) (peaks: ENCFF483UZG)

WGBS (meDNA): https://www.encodeproject.org/files/ENCFF975NYJ/

```{r}
## read processed multiBigwigSummary object

multiBigwigSummary_H1_1000bp <- read.table(paste0(pat, "multiBigwigSummary_H1_CBX8_H3K27me3_WGBS.tabular"))
```

```{r echo = F}
## only keep one WGBS experiment
multiBigwigSummary_H1_1000bp <- multiBigwigSummary_H1_1000bp[, c(1, 2, 3, 4, 6, 7)]
```


```{r}
## change column names
colnames(multiBigwigSummary_H1_1000bp) <- c('chr',	'start',	'end',	'H3K27me3', 'WGBS',	'CBX8')
```

### Only keep chromosome 1-22 and X

```{r}
multiBigwigSummary_H1_1000bp <- multiBigwigSummary_H1_1000bp[which(multiBigwigSummary_H1_1000bp$chr %in% paste0('chr', c(1:22, "X"))), ]
dim(multiBigwigSummary_H1_1000bp)
```

```{r}
table(multiBigwigSummary_H1_1000bp$chr)
```

```{r}
(colSums(is.na(multiBigwigSummary_H1_1000bp))/nrow(multiBigwigSummary_H1_1000bp))[4:6]
```


### Plot distribution of fold changes


```{r warning=F, message = F}
# WGBS
distr_WGBS <- ggplot(multiBigwigSummary_H1_1000bp, aes(x=WGBS)) + geom_histogram() + theme_minimal()

# H3K27me3
distr_H3K27me3 <- ggplot(multiBigwigSummary_H1_1000bp, aes(x=H3K27me3)) + geom_histogram() + theme_minimal()

# CBX8
distr_CBX8 <- ggplot(multiBigwigSummary_H1_1000bp, aes(x=CBX8)) + geom_histogram() + theme_minimal()

gridExtra::grid.arrange(distr_WGBS, distr_H3K27me3, distr_CBX8)
```


## Kendall's correlation


```{r eval = T}
mm <- multiBigwigSummary_H1_1000bp[, c('CBX8',	'H3K27me3',	'WGBS')]
mm_int <- mm['H3K27me3'] * mm['WGBS']
names(mm_int) <- 'H3K27me3:WGBS'
mm <- cbind(mm, mm_int)
## keep only rows without NA values (logical indexing is safe even if no row contains NA)
mm <- mm[rowSums(is.na(mm)) == 0, ]
```

```{r}
set.seed(1234)
subs <- sample(1:nrow(mm), 50000)
pkend_H1 <- pcor(mm[subs, ], method = "kendall")
```


```{r}
pheatmap::pheatmap(pkend_H1$estimate, 
         breaks = seq(-1, 1, length.out = 41), # one more break than colours
         color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdBu")))(40),
         cluster_rows = F, 
         cluster_cols = F, 
         display_numbers = T, 
         number_format = "%.3f",
         fontsize_number = 13,
         cellwidth = 45,
         cellheight = 45, 
         border_color = "white")
```


# cell type: mES

CBX8 mm9 BigWig: https://chip-atlas.org/view?id=SRX426373 (peak: SRX5090173.05)

H3K27me3 mm9 BigWig: https://chip-atlas.org/view?id=SRX006968

WGBS mm9 BigWig (methylation rate): https://chip-atlas.org/view?id=DRX001152

Blacklisted regions mm9: https://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm9-mouse/


```{r}
## read processed multiBigwigSummary object

multiBigwigSummary_mES <- read.table(paste0(pat, "multiBigwigSummary_mES_CBX8_H3K27me3_WGBS.tabular"))
```

```{r}
colnames(multiBigwigSummary_mES) <- c('chr',	'start',	'end','H3K27me3', 'CBX8', 'WGBS')
```

### Only keep chromosome 1-19 and X

```{r}
## mouse (mm9) has 19 autosomes, so only chr1-chr19 and chrX can match
multiBigwigSummary_mES <- multiBigwigSummary_mES[which(multiBigwigSummary_mES$chr %in% paste0('chr', c(1:19, "X"))), ]
dim(multiBigwigSummary_mES)
```
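Mouse has 19 autosomes, so a filter written with human chromosome names selects the same rows as a mouse-specific one: names that never occur are simply unmatched. A toy check with made-up chromosome labels:

```r
# Made-up chromosome labels as they might appear in an mm9 table.
chr <- c("chr1", "chr19", "chrX", "chrY", "chrM", "chr13_random")

keep_human_style <- chr %in% paste0("chr", c(1:22, "X"))
keep_mouse_style <- chr %in% paste0("chr", c(1:19, "X"))

# chr20-chr22 never occur in mouse, so both filters agree.
print(chr[keep_mouse_style])
print(identical(keep_human_style, keep_mouse_style))
```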


```{r}
table(multiBigwigSummary_mES$chr)
```

### How many NaNs are in the data?

```{r}
(colSums(is.na(multiBigwigSummary_mES))/nrow(multiBigwigSummary_mES))[4:6]
```


### Plot distribution of fold changes


```{r warning=F, message = F}
# WGBS
distr_WGBS <- ggplot(multiBigwigSummary_mES, aes(x=WGBS)) + geom_histogram() + theme_minimal()

# H3K27me3
distr_H3K27me3 <- ggplot(multiBigwigSummary_mES, aes(x=H3K27me3)) + geom_histogram() + theme_minimal()

# CBX8
distr_CBX8 <- ggplot(multiBigwigSummary_mES, aes(x=CBX8)) + geom_histogram() + theme_minimal()

gridExtra::grid.arrange(distr_WGBS, distr_H3K27me3, distr_CBX8)
```

## Kendall's correlation


```{r eval = T}
mm <- multiBigwigSummary_mES[, c('CBX8',	'H3K27me3',	'WGBS')]
mm_int <- mm['H3K27me3'] * mm['WGBS'] 
names(mm_int) <- 'H3K27me3:WGBS' 
mm <- cbind(mm, mm_int)
## keep only rows without NA values (logical indexing is safe even if no row contains NA)
mm <- mm[rowSums(is.na(mm)) == 0, ]
```

```{r}
set.seed(1234)
subs <- sample(1:nrow(mm), 50000)
pkend_mES <- pcor(mm[subs, ], method = "kendall")
```


```{r}
pheatmap::pheatmap(pkend_mES$estimate, 
         breaks = seq(-1, 1, length.out = 101), # one more break than colours
         color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdBu")))(100),
         cluster_rows = F, 
         cluster_cols = F, 
         display_numbers = T, 
         number_format = "%.3f",
         fontsize_number = 13,
         cellwidth = 45,
         cellheight = 45, 
         border_color = "white")
```


## Files written

These files have been written to the target directory, ```r paste0("data/", params$name)```:

```{r list-files-target}
projthis::proj_dir_info(path_target())
```