Commit 94ad39a

Merge pull request #51 from harvardinformatics/cactus-replace-tutorial
Adding cactus replace tutorial
2 parents b3b7bab + 34bb997 commit 94ad39a

9 files changed, +478 -48 lines changed

docs/resources/Tutorials/add-to-whole-genome-alignment-cactus.md

Lines changed: 26 additions & 22 deletions
@@ -159,7 +159,7 @@ In order to run the last step of the workflow that converts the HAL format to a
 
 ??? info "The Cactus update input file"
 
-The various Cactus commands depend on a single input file with information about the genomes to align. This file is automatically generated by the pipeline at `[output_dir]/cactus-update-input.txt`. This file is a simple tab delimited file and should contains line:
+The various Cactus commands depend on a single input file with information about the genomes to align. This file is automatically generated by the pipeline at `[output_dir]/cactus-update-input.txt`. This file is a simple tab delimited file and should contain one line:
 
 ```
 [tip label to add to tree] [path/to/genome/fasta.file] [new branch length to add to the tree]
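For instance, for the test dataset that adds the Gorilla genome, the generated tab-delimited line might look something like the sketch below (the FASTA path and branch length are hypothetical placeholders):

```
Gorilla    genomes/gorilla.fa    0.008
```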
@@ -176,18 +176,20 @@ In order to run the last step of the workflow that converts the HAL format to a
 Once you have all the information listed above, you can enter it into the Snakemake configuration file along with some other information to know where to look for files and write output. The config file contains 2 sections, one for specifying the input and output options, and one for specifying resources for the various rules (see [below](#specifying-resources-for-each-rule)). The first part should look something like this:
 
 ```
-cactus_path: <path/to/cactus-singularity-image OR download>
+cactus_path: <path/to/cactus-singularity-image OR download OR a version string (e.g. 2.9.5)>
 
-cactus_gpu_path: <path/to/cactus-GPU-singularity-image OR download>
+cactus_gpu_path: <path/to/cactus-GPU-singularity-image OR download OR a version string (e.g. 2.9.5)>
 
 input_hal: <path/to/hal-file>
 
 new_genome_name: <tip label of new genome>
 
-new_branch_length: <new branch length to connect the new genome to the tree with>
+new_genome_fasta: <path/to/new/genome.fasta>
 
 new_anc_node: <label for new ancestral node connected to new genome>
 
+new_branch_length: <new branch length to connect the new genome to the tree with>
+
 parent_node: <parent node of existing branch>
 
 child_node: <child node of existing branch>
@@ -215,8 +217,9 @@ Simply replace the string surrounded by <> with the path or option desired. Belo
 | `cactus_gpu_path` | Path to the Cactus GPU Singularity image. If blank or 'download', the image of the latest Cactus version will be downloaded and used. If a version string is provided (e.g. 2.9.5), then that version will be downloaded and used. This will only be used if `use_gpu` is True. |
 | `input_hal` | Path to the previously generated HAL file to which you want to add a new genome. |
 | `new_genome_name` | The label to give your new genome in the tree and HAL file. |
-| `new_branch_length` | The length of the branch to create that will connect your new genome to the tree. |
+| `new_genome_fasta` | The path to the FASTA file containing the new genome to add to the alignment. |
 | `new_anc_node` | The name to give the new ancestral node that connects the new genome to an existing branch in the tree. |
+| `new_branch_length` | The length of the branch to create that will connect your new genome to the tree. |
 | `parent_node` | The name of the ancestral node of the existing branch to which the new branch is connected. |
 | `child_node` | The name of the descendant node of the existing branch to which the new branch is connected. |
 | `top_branch_length` | The existing branch defined by `parent_node` and `child_node` will be split by `new_anc_node`. This is the length of the new, top-most branch created by the split (defined by `parent_node` and `new_anc_node`). |
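As a rough orientation, a filled-in version of the input section above might look like the sketch below. Every value here is a hypothetical placeholder, and other keys in the config (such as `use_gpu`, `tmp_dir`, or the output settings) are omitted:

```
cactus_path: download                 # or a path to a local Cactus Singularity image, or a version string like 2.9.5
cactus_gpu_path: download
input_hal: primates.hal               # hypothetical existing alignment
new_genome_name: Gorilla              # tip label for the genome being added
new_genome_fasta: genomes/gorilla.fa  # hypothetical path to the new genome FASTA
new_anc_node: AncGorilla              # hypothetical label for the new ancestral node
new_branch_length: 0.008              # hypothetical branch length for the new tip
parent_node: Anc0                     # hypothetical parent node of the branch being split
child_node: Anc1                      # hypothetical child node of the branch being split
top_branch_length: 0.004              # hypothetical length of the parent_node-to-new_anc_node branch
```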
@@ -231,18 +234,19 @@ Simply replace the string surrounded by <> with the path or option desired. Belo
 Below these options in the config file are further options for specifying resource usage for each rule that the pipeline will run. For example:
 
 ```
-preprocess_partition: "shared"
-preprocess_cpu: 8
-preprocess_mem: 25000 # in MB
-preprocess_time: 30 # in minutes
-
-##########################
-
-blast_partition: "gpu_test" # If use_gpu is True, this must be a partition with GPUs
-blast_gpu: 1 # If use_gpu is False, this will be ignored
-blast_cpu: 48
-blast_mem: 50000 # in MB
-blast_time: 120 # in minutes
+rule_resources:
+  preprocess:
+    partition: shared
+    mem_mb: 25000
+    cpus: 8
+    time: 30
+
+  blast:
+    partition: gpu # If use_gpu is True, this must be a partition with GPUs
+    mem_mb: 50000
+    cpus: 48
+    gpus: 1 # If use_gpu is False, this will be ignored
+    time: 30
 ```
 
 **The rule _blast_ is the only one that uses GPUs if `use_gpu` is True.**
@@ -251,7 +255,7 @@ blast_time: 120 # in minutes
 
 * Be sure to use partition names appropriate your cluster. Several examples in this tutorial have partition names that are specific to the Harvard cluster, so be sure to change them.
 * **Allocate the proper partitions based on `use_gpu`.** If you want to use the GPU version of cactus (*i.e.* you have set `use_gpu: True` in the config file), the partition for the rule **blast** must be GPU enabled. If not, the pipeline will fail to run.
-* The `blast_gpu:` option will be ignored if `use_gpu: False` is set.
+* The `blast: gpus:` option will be ignored if `use_gpu: False` is set.
 * **mem is in MB** and **time is in minutes**.
 
 You will have to determine the proper resource usage for your dataset. Generally, the larger the genomes, the more time and memory each job will need, and the more you will benefit from providing more CPUs and GPUs.
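To illustrate the GPU notes above: when `use_gpu: False`, the **blast** rule runs on a regular CPU partition and its `gpus` setting is simply ignored. A hedged sketch of that configuration (the partition name and values are hypothetical and cluster-specific):

```
rule_resources:
  blast:
    partition: shared   # any CPU partition works when use_gpu is False
    mem_mb: 50000
    cpus: 48
    gpus: 1             # ignored because use_gpu is False
    time: 120
```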
@@ -269,7 +273,7 @@ First, we want to make sure everything is setup properly by using the `--dryrun`
 This is done with the following command, changing the snakefile `-s` and `--configfile` paths to the one you have created for your project:
 
 ```bash
-snakemake -j <# of jobs to submit simultaneously> -e slurm -s </path/to/cactus_uodate.smk> --configfile <path/to/your/snakmake-config.yml> --dryrun
+snakemake -j <# of jobs to submit simultaneously> -e slurm -s </path/to/cactus_update.smk> --configfile <path/to/your/snakmake-config.yml> --dryrun
 ```
 
 ??? info "Command breakdown"
@@ -319,7 +323,7 @@ If you see any red text, that likely means an error has occurred that must be ad
 If you're satisfied that the `--dryrun` has completed successfully and you are ready to start submitting Cactus jobs to the cluster, you can do so by simply removing the `--dryrun` option from the command above:
 
 ```bash
-snakemake -j <# of jobs to submit simultaneously> -e slurm -s </path/to/cactus.smk> --configfile <path/to/your/snakmake-config.yml>
+snakemake -j <# of jobs to submit simultaneously> -e slurm -s </path/to/cactus_update.smk> --configfile <path/to/your/snakmake-config.yml>
 ```
 
 This will start submitting jobs to SLURM. On your screen, you will see continuous updates regarding job status in blue text. In another terminal, you can also check on the status of your jobs by running `squeue -u <your user id>`.
@@ -351,7 +355,7 @@ Here is a breakdown of the files so you can investigate them and prepare similar
 
 You will first need to [run the test to generate the HAL file](whole-genome-alignment-cactus.md#test-dataset). Then, you can add the Gorilla sequence to it using this pipeline. We recommend running this test dataset before setting up your own project.
 
-First, open the config file, `tests/evolverMammals/evolverMammals-update-cfg.yaml` and make sure the partitions are set appropriately for your cluster. For this small test dataset, it is appropriate to use any "test" partitions you may have. Then, update the path to `tmp_dir` to point to a location where you have a lot of temporary space. Even this small dataset will fail if this directory does not have enough space.
+After you've generated the HAL file, to add a genome, open the config file, `tests/evolverMammals/evolverMammals-update-cfg.yaml` and make sure the partitions are set appropriately for your cluster. For this small test dataset, it is appropriate to use any "test" partitions you may have. Then, update the path to `tmp_dir` to point to a location where you have a lot of temporary space. Even this small dataset will fail if this directory does not have enough space.
 
 After that, run a dryrun of the test dataset by changing into the `tests/evolverMammals` directory and running:
 
@@ -372,7 +376,7 @@ The pipeline will output a [.paf](https://github.com/lh3/miniasm/blob/master/PAF
 
 The final alignment will also be presented in MAF format as `<final_prefix>.<maf_reference>.maf`, again where `<maf_reference>` is whatever you set in the Snakemake config. This file will include all sequences. Another MAF file, `<final_prefix>.<maf_reference>.nodupes.maf` will also be generated, which is the alignment in MAF format with no duplicate sequences. The de-duplicated MAF file is generated with `--dupeMode single`. See the [Cactus documentation regarding MAF export](https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/progressive.md#maf-export) for more info.
 
-A suit of tools called [HAL tools](https://github.com/ComparativeGenomicsToolkit/Hal) is included with the Cactus singularity image if you need to manipulate or analyze .hal files. There are many tools for manipulating MAF files, though they are not always easy to use. The makers of Cactus also develop [taffy](https://github.com/ComparativeGenomicsToolkit/taffy), which can manipulate MAF files by converting them to TAF files.
+A suite of tools called [HAL tools](https://github.com/ComparativeGenomicsToolkit/Hal) is included with the Cactus singularity image if you need to manipulate or analyze .hal files. There are many tools for manipulating MAF files, though they are not always easy to use. The makers of Cactus also develop [taffy](https://github.com/ComparativeGenomicsToolkit/taffy), which can manipulate MAF files by converting them to TAF files.
 
 ## Questions/troubleshooting
 
docs/resources/Tutorials/how-to-annotate-a-genome.md

Lines changed: 1 addition & 1 deletion
@@ -74,7 +74,7 @@ While the first of these points is dependent upon your research program, the sec
 Below is a decision tree for picking an annotation method, based upon our evaluation of the performance of 12 different methods in our forthcoming paper in *Genome Research.
 
 <center>
-![Genome annotation method decision tree](../img/genome_annotation_decision_chart.png)
+<img src="../../img/genome_annotation_decision_chart.png" alt="Genome annotation method decision tree" />
 </center>
 
 The dashed lines that indicate "optional integration" refer to the combining of more than one genome annotation method, which we elaborate upon below.

docs/resources/Tutorials/installing-command-line-software-conda-mamba.md

Lines changed: 3 additions & 3 deletions
@@ -31,7 +31,7 @@ To install mamba, first navigate to the [Miniforge3 repository page](https://git
 ### Mac/Linux
 
 <center>
-![A screenshot of the miniforge repository's installation instructions](../img/mamba-install1.png)
+<img src="../../img/mamba-install1.png" alt="A screenshot of the miniforge repository's installation instructions" />
 </center>
 
 On Mac and Linux machines (the [Harvard cluster runs a version of Linux](https://www.rc.fas.harvard.edu/about/cluster-architecture/)), you'll want to open your Terminal or login to the server to type the download and install commands.
@@ -59,7 +59,7 @@ If necessary, Miniforge does provide an explicit Windows installer for conda/mam
 Once you have followed the above instructions and **restarted your terminal or reconnected to the server**, you should now see that mamba is activated because the `(base)` environment prefix appears before your prompt:
 
 <center>
-![A screenshot of a command prompt with (base) prepended to it](../img/prompt1.png)
+<img src="../../img/prompt1.png" alt="A screenshot of a command prompt with (base) prepended to it" />
 </center>
 
 mamba can be used to manage environments. **Environments** modify aspects of a user's file system that make it easier to install and run software, essentially giving the user full control over their own software and negating the need to access critical parts of the file system.
@@ -101,7 +101,7 @@ mamba env list
 Once you are in an environment, your prompt should be updated to be pre-fixed with that environment's name:
 
 <center>
-![A screenshot of a command prompt with (project-env) prepended to it](../img/prompt2.png)
+<img src="../../img/prompt2.png" alt="A screenshot of a command prompt with (project-env) prepended to it" />
 </center>
 
 !!! tip "Environments must be activated every time you log on"

docs/resources/Tutorials/pangenome-cactus-minigraph.md

Lines changed: 8 additions & 6 deletions
@@ -133,7 +133,7 @@ Cactus-minigraph requires that you select one sample as a reference sample [for
 Besides the sequence input, the pipeline needs some extra configuration to know where to look for files and write output. That is done in the Snakemake configuration file for a given run. It contains 2 sections, one for specifying the input and output options, and one for specifying resources for the various rules (see [below](#specifying-resources-for-each-rule)). The first part should look something like this:
 
 ```
-cactus_path: <path/to/cactus-singularity-image OR download>
+cactus_path: <path/to/cactus-singularity-image OR download OR version string>
 
 input_file: <path/to/cactus-input-file>
 
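For orientation, a filled-in version of this first section might look like the brief sketch below; both values are hypothetical placeholders, and the remaining keys of the section are not shown in this diff:

```
cactus_path: 2.9.5               # hypothetical: a version string; a local image path or "download" also work
input_file: pangenome-input.txt  # hypothetical path to the Cactus input file for this run
```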
@@ -162,17 +162,19 @@ Simply replace the string surrounded by <> with the path or option desired. Belo
 Below these options in the config file are further options for specifying resource usage for each rule that the pipeline will run. For example:
 
 ```
-minigraph_partition: "shared"
-minigraph_cpu: 8
-minigraph_mem: 25000
-minigraph_time: 30
+rule_resources:
+  minigraph:
+    partition: shared
+    mem_mb: 25000
+    cpus: 8
+    time: 30
 ```
 
 !!! warning "Notes on resource allocation"
 
 * Be sure to use partition names appropriate your cluster. Several examples in this tutorial have partition names that are specific to the Harvard cluster, so be sure to change them.
 * The steps in the cactus-minigraph pipeline are not GPU compatible, so there are no GPU options in this pipeline.
-* **mem is in MB** and **time is in minutes**.
+* **mem_mb is in MB** and **time is in minutes**.
 
 You will have to determine the proper resource usage for your dataset. Generally, the larger the genomes, the more time and memory each job will need, and the more you will benefit from providing more CPUs.
 
0 commit comments
