Flags sequences that are entirely composed of tandem repeats
This workflow comes with a conda configuration file that will install most of its prerequisites. To use it, you must first install Anaconda or miniconda. I recommend miniconda. Download the appropriate installer from here. If you are working on a macOS device with an M1 or M2 chip, see the note below.
You will also need git, which is standard on most systems. Alternatively, it can be installed using conda:
conda install git
Clone this repository to your local system:
git clone https://github.com/jmeppley/concatemer_finding
Navigate to the repository's root dir:
cd concatemer_finding
Create a conda environment using the included configuration:
conda env create -p ./conda.env -f conda.yaml
Activate the environment:
conda activate ./conda.env
Just run snakemake to test (-j 5: use up to 5 cores in parallel):
snakemake -j 5
Apple macOS devices with an M1 or M2 processor can run either the "macOS Apple M1" (aka "arm") version of conda or the "macOS Intel x86" (aka "intel") version. The binaries compiled for intel will run a little more slowly than the M1 binaries, but not everything is available for M1/M2 yet. In general, the bioconda packages are not yet available for M1/M2, but for this project, the one we need (minimap) is available in my conda channel.
TL;DR: Either of the macOS 64-bit versions of miniconda should work here.
To change the default configuration, pass new values to snakemake with --config. This assumes you run snakemake from the repo dir. See the snakemake docs for running elsewhere.
snakemake --config output_dir='/path/to/output_dir' \
reads_template='/path/to/{sample}.reads.fasta'
The above assumes that the reads for each sample are in a single fasta file, with the sample name in place of {sample} in reads_template.
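To see which sample names a given reads_template would pick up, the matching can be sketched in a few lines of Python. This is only an illustration of how a {sample} placeholder maps file paths to sample names, not the workflow's own code. The helper names (template_to_regex, sample_names) and the example paths are hypothetical, and the sketch assumes the sample name never contains a "/".

```python
import re


def template_to_regex(template):
    """Turn a path template containing '{sample}' into a compiled regex.

    Hypothetical helper for illustration; not part of the workflow.
    """
    prefix, placeholder, suffix = template.partition("{sample}")
    if not placeholder:
        raise ValueError("template must contain the '{sample}' placeholder")
    # Escape the literal parts of the path. Assumption of this sketch:
    # a sample name is one or more characters with no path separator.
    return re.compile(
        re.escape(prefix) + r"(?P<sample>[^/]+)" + re.escape(suffix) + r"$"
    )


def sample_names(template, paths):
    """Return the sorted sample names of all paths that fit the template."""
    pattern = template_to_regex(template)
    names = []
    for path in paths:
        match = pattern.match(path)
        if match:
            names.append(match.group("sample"))
    return sorted(names)


# Made-up example paths
paths = [
    "/data/reads/siteA.reads.fasta",
    "/data/reads/siteB.reads.fasta",
    "/data/reads/notes.txt",
]
print(sample_names("/data/reads/{sample}.reads.fasta", paths))
# prints ['siteA', 'siteB']
```

Files that do not fit the template (notes.txt above) are simply ignored, so it is worth checking that the list of detected samples is the one you expect before starting a long run.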
For each sample, the workflow will generate a table with results for each method used. For each method, there will be 4 columns:
- repeat_size: the size of the repeated element
- copies: the number of repeats
- repeat_score: a measure of how well the repeated sequences match
- state: a final determination of whether the method thinks this is a repeat. This looks at repeat_score and copies.
There are up to four methods: combinations of the two search engines (minimap and lastal) and the two heuristics (cluster and fft). So the columns will look like:
repeat_size_lastal_fft ... state_lastal_fft ... copies_minimap_fft ... state_minimap_cluster
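The naming scheme above can be spelled out programmatically. The following Python sketch builds the full set of column names from the engines, heuristics, and per-method columns listed here, and splits a column name back into its parts. The function names are hypothetical, and the order in which columns actually appear in the output table is an assumption; only the naming pattern is taken from the description above.

```python
from itertools import product

ENGINES = ["minimap", "lastal"]
HEURISTICS = ["cluster", "fft"]
METRICS = ["repeat_size", "copies", "repeat_score", "state"]


def expected_columns(engines=ENGINES, heuristics=HEURISTICS, metrics=METRICS):
    """List every <metric>_<engine>_<heuristic> column name.

    The ordering here is arbitrary; the real table may order them differently.
    """
    return [
        f"{metric}_{engine}_{heuristic}"
        for engine, heuristic in product(engines, heuristics)
        for metric in metrics
    ]


def split_column(name):
    """Split a column name into (metric, engine, heuristic).

    Splitting from the right keeps metrics that contain an underscore,
    such as repeat_size, in one piece.
    """
    metric, engine, heuristic = name.rsplit("_", 2)
    return metric, engine, heuristic


print(len(expected_columns()))
# prints 16
print(split_column("repeat_size_lastal_fft"))
# prints ('repeat_size', 'lastal', 'fft')
```

Since there are "up to" four methods, a given table may contain only a subset of these 16 columns.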