Skip to content

Commit

Permalink
Complete vignettes and add troubleshooting
Browse files Browse the repository at this point in the history
  • Loading branch information
HDash committed Jul 11, 2024
1 parent 0acce4b commit e86a0a2
Show file tree
Hide file tree
Showing 4 changed files with 847 additions and 40 deletions.
245 changes: 229 additions & 16 deletions README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ repo <- strsplit(URL,"/")[[1]][5]

`r rworkflows::use_badges(add_hex = FALSE, add_authors = FALSE)`

**Authors: Hiranyamaya (Hiru) Dash, Thomas Roberts, Nathan Skene**
**Authors:** ***Hiranyamaya (Hiru) Dash, Thomas Roberts, Nathan Skene***
**Updated:** ***`r format(Sys.Date(), '%b-%d-%Y')`***


Expand All @@ -27,13 +27,17 @@ repo <- strsplit(URL,"/")[[1]][5]
profiling methods with motif enrichment as the key benchmark. The package
outputs an HTML report consisting of three sections:

1. **General Metrics**: Overview of peaks-related general metrics for the
datasets (FRiP scores, peak widths and motif-summit distances).
2. **Known Motif Enrichment Analysis**: Statistics for the frequency of
user-provided motifs enriched in the datasets.
3. **De-Novo Motif Enrichment Analysis**: Statistics for the frequency of
de-novo discovered motifs enriched in the datasets and compared with known
motifs.
1. **General Metrics**: Provides an overview of metrics related to dataset
peaks, including FRiP scores, peak widths, and motif-to-summit distances.

2. **Known Motif Enrichment Analysis**: Presents statistics on the frequency of
enriched user-supplied motifs in the datasets and compares them between the
common and unique peaks from comparison and reference datasets.

3. **De-Novo Motif Enrichment Analysis**: Details the statistics of de-novo
discovered motifs in common and unique peaks from comparison and reference
datasets. Examines motif similarities and identifies the closest known motifs in
the JASPAR or the provided database.

<!-- If you use ``r pkg``, please cite: -->

Expand All @@ -44,35 +48,244 @@ motifs.
`MotifPeeker` uses
[`memes`](https://www.bioconductor.org/packages/release/bioc/html/memes.html)
which relies on a local install of the
[MEME suite](https://meme-suite.org/meme/). Installation instructions can be
found [here](https://www.bioconductor.org/packages/release/bioc/vignettes/
memes/inst/doc/install_guide.html).
[MEME suite](https://meme-suite.org/meme/), which can be installed as follows:
```
MEME_VERSION=5.5.5 # or the latest version
wget https://meme-suite.org/meme/meme-software/$MEME_VERSION/meme-$MEME_VERSION.tar.gz
tar zxf meme-$MEME_VERSION.tar.gz
cd meme-$MEME_VERSION
./configure --prefix=$HOME/meme --with-url=http://meme-suite.org/ \
--enable-build-libxml2 --enable-build-libxslt
make
make install
# Add to PATH
echo 'export PATH=$HOME/meme/bin:/meme/libexec/meme-$MEME_VERSION:$PATH' >> ~/.bashrc
source ~/.bashrc
```

**NOTE:** It is important that Perl dependencies associated with MEME suite are
also installed, particularly `XML::Parser`, which can be installed using the
following command in the terminal:
```
cpan install XML::Parser
```
For more information, refer to the Perl dependency section of the MEME suite
[here](https://meme-suite.org/meme/doc/install.html#prereq_perl).
For more information, refer to the [Perl dependency section of the MEME suite](https://meme-suite.org/meme/doc/install.html#prereq_perl).

Once the MEME suite and its associated Perl dependencies are installed,
`MotifPeeker` can be installed using the following code:
Once the MEME suite and its associated Perl dependencies are installed, the
development version of `MotifPeeker` can be installed using the following code:
```R
if(!require("remotes")) install.packages("remotes")

remotes::install_github("`r paste(owner,repo,sep='/')`")
library(`r pkg`)
```

Alternatively, you can use the [Docker/Singularity container](https://neurogenomics.github.io/MotifPeeker/articles/docker.html) to
run the package out-of-the-box.

### Documentation
## Documentation

#### [EpiCompare Website](https://`r owner`.github.io/`r repo`)
#### [Get Started](https://neurogenomics.github.io/MotifPeeker/articles/MotifPeeker.html)
#### [Docker/Singularity Container](https://neurogenomics.github.io/MotifPeeker/articles/docker.html)
#### [Troubleshooting](https://neurogenomics.github.io/MotifPeeker/articles/troubleshooting.html)

## Usage

Load the package and example datasets.
```R
library(MotifPeeker)
data("CTCF_ChIP_peaks", package = "MotifPeeker")
data("CTCF_TIP_peaks", package = "MotifPeeker")
data("motif_MA1102.2", package = "MotifPeeker")
data("motif_MA1930.2", package = "MotifPeeker")
```

Prepare input files.
```R
peak_files <- list(CTCF_ChIP_peaks, CTCF_TIP_peaks)
alignment_files <- list(
system.file("extdata", "CTCF_ChIP_alignment.bam", package = "MotifPeeker"),
system.file("extdata", "CTCF_TIP_alignment.bam", package = "MotifPeeker")
)
motif_files <- list(motif_MA1102.2, motif_MA1930.2)
```

Run `MotifPeeker()`:
```R
MotifPeeker(
peak_files = peak_files,
reference_index = 2, # Set TIP-seq experiment as reference
alignment_files = alignment_files,
exp_labels = c("ChIP", "TIP"),
exp_type = c("chipseq", "tipseq"),
genome_build = "hg38",
motif_files = motif_files,
cell_counts = NULL, # No cell-count information
denovo_motif_discovery = TRUE,
denovo_motifs = 3,
motif_db = NULL,
download_buttons = TRUE,
out_dir = tempdir(),
workers = 2,
debug = FALSE,
quiet = FALSE,
verbose = TRUE
)
```

### Required Inputs

These input parameters must be provided:

<details>
<summary><strong>Details</strong></summary>

- `peak_files`: A list of path to peak files or `GRanges` objects with the
peaks to analyse. Currently, only peak files from `MACS2/3` (`.narrowPeak`)
and `SEACR` (`.bed`) are supported. ENCODE file IDs can also be provided to
automatically fetch peak file(s) from the ENCODE database.
- `reference_index`: An integer specifying the index of the reference dataset
in the `peak_files` list to use as reference for various comparisons.
(default = 1)
- `genome_build`: A character string or a `BSgenome` object specifying the
genome build of the datasets. At the moment, only hg38 and hg19 are supported
as abbreviated input.
- `out_dir`: A character string specifying the output directory to save the
HTML report and other files.

</details>

### Optional Inputs

These input parameters optional, but *recommended* to add more analyses, or
enhance them:

<details>
<summary><strong>Details</strong></summary>

- `alignment_files`: A list of path to alignment files or `Rsamtools::BamFile`
objects with the alignment sequences to analyse. Alignment files are used to
calculate read-related metrics like FRiP score. ENCODE file IDs can also be
provided to automatically fetch alignment file(s) from the ENCODE database.
- `exp_labels`: A character vector of labels for each peak file. If not
provided, capital letters will be used as labels in the report.
- `exp_type`: A character vector of experimental types for each peak file.
Useful for comparison of different methods. If not provided, all datasets will
be classified as "unknown" experiment types in the report. `exp_type` is used
only for labelling. It does not affect the analyses. You can also input custom
strings. Datasets will be grouped as long as they match their respective
`exp_type`. Supported experimental types are:
- `chipseq`: ChIP-seq data
- `tipseq`: TIP-seq data
- `cuttag`: CUT&Tag data
- `cutrun`: CUT&Run data
- `motif_files`: A character vector of path to motif files, or a vector of
`universalmotif-class` objects. Required to run Known Motif Enrichment
Analysis. JASPAR matrix IDs can also be provided to automatically fetch motifs
from the JASPAR.
- `motif_labels`: A character vector of labels for each motif file. Only used if
path to file names are passed in motif_files. If not provided, the motif file
names will be used as labels.
- `cell_counts`: An integer vector of experiment cell counts for each peak file
(if available). Creates additional comparisons based on cell counts.
- `motif_db`: Path to `.meme` format file to use as reference database, or a
list of `universalmotif-class` objects. Results from de-novo motif discovery
are searched against this database to find similar motifs. If not provided,
JASPAR CORE database will be used, making this parameter **truly optional**.
**NOTE**: p-value estimates are inaccurate when the database has fewer than
50 entries.

</details>

### Other Options

For more information on additional parameters, please refer to the documentation
for [`MotifPeeker()`](https://neurogenomics.github.io/MotifPeeker/reference/MotifPeeker.html).

### Runtime Guidance

For 4 datasets, the runtime is approximately 3 minutes with
denovo_motif_discovery disabled. However, de-novo motif discovery can take
hours to complete.

To make computation faster, we highly recommend tuning the following arguments:

<details>
<summary><strong>Details</strong></summary>

- `workers`: Running motif discovery in parallel can significantly reduce
runtime, but it is very memory-intensive, consuming upwards of 10GB of RAM per
thread. Memory starvation can greatly slow the process, so set `workers` with
caution.
- `denovo_motifs`: The number of motifs to discover per sequence group
exponentially increases runtime. We recommend no more than 5 motifs to make a
meaningful inference.
- `trim_seq_width`: Trimming sequences before running de-novo motif discovery
can significantly reduce the search space. Sequence length can exponentially
increase runtime. We recommend running the script with
`denovo_motif_discovery = FALSE` and studying the motif-summit distance
distribution under general metrics to find the sequence length that captures
most motifs. A good starting point is 150 but it can be reduced further if
appropriate.

</details>

### Outputs

`MotifPeeker` generates its output in a new folder within he `out_dir`
directory. The folder is named `MotifPeeker_YYYYMMDD_HHMMSS` and contains the
following files:

- `MotifPeeker.html`: The main HTML report, including all analyses and plots.
- Output from various MEME suite tools in their respecive sub-directories, if
`save_runfiles` is set to `TRUE`.

## Datasets

`MotifPeeker` comes with several datasets bundled:

<details>
<summary><strong>Details</strong></summary>

- `CTCF_TIP_peaks`: Human CTCF peak file generated with TIP-seq using
HCT116 cell-line. No control files were used to generate the peak file. The
peaks were called using `MACS3` with `CTCF_TIP_alignment.bam` as input.
- `CTCF_ChIP_peaks`: Human CTCF peak file generated with ChIP-seq using
HCT116 cell-line. No control files were used to generate the peak file. The
peaks were called using `MACS3` with `CTCF_ChIP_alignment.bam` as input.
- `motif_MA1102.3`: The JASPAR motif for CTCFL (MA1102.3) for *Homo Sapiens*.
Sourced from [JASPAR](https://jaspar.elixir.no/matrix/MA1102.3/)
- `motif_MA1930.2`: The JASPAR motif for CTCFL (MA1930.2) for *Homo Sapiens*.
Sourced from [JASPAR](https://jaspar.elixir.no/matrix/MA1930.2/)
- `CTCF_TIP_alignment.bam`: Human CTCF alignment file generated with TIP-seq
using HCT116 cell-line. The alignment file was generated using the
[`nf-core/cutandrun`](https://nf-co.re/cutandrun/3.2.2) pipeline.
Raw read files were sourced from *NIH Sequence Read Archives*
[*ID: SRR16963166*](https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR16963166).
Only available as *extdata*.
- `CTCF_ChIP_alignment.bam`: Human CTCF alignment file generated with ChIP-seq
using HCT116 cell-line. Sourced from
[ENCODE (Accession: ENCFF091ODJ)](https://www.encodeproject.org/files/ENCFF091ODJ/).
Only available as *extdata*.

</details>

Please note that the peaks and alignments included are a very small subset
(*chr10:65,654,529-74,841,155*) of the actual data. It only serves as an example
to demonstrate the package and run tests to maintain the integrity of the
package.

## Contact

### [Neurogenomics Lab](https://www.neurogenomics.co.uk/inst/report/EpiCompare.html)
UK Dementia Research Institute
Department of Brain Sciences
Faculty of Medicine
Imperial College London
[GitHub](https://github.com/neurogenomics)

<hr>

Expand Down
Loading

0 comments on commit e86a0a2

Please sign in to comment.