docs/resources/Tutorials/add-to-whole-genome-alignment-cactus.md (+26 -22)
@@ -159,7 +159,7 @@ In order to run the last step of the workflow that converts the HAL format to a
??? info "The Cactus update input file"

-    The various Cactus commands depend on a single input file with information about the genomes to align. This file is automatically generated by the pipeline at `[output_dir]/cactus-update-input.txt`. This file is a simple tab delimited file and should contains line:
+    The various Cactus commands depend on a single input file with information about the genomes to align. This file is automatically generated by the pipeline at `[output_dir]/cactus-update-input.txt`. This file is a simple tab-delimited file and should contain one line:

    ```
    [tip label to add to tree]    [path/to/genome/fasta.file]    [new branch length to add to the tree]
    ```
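A minimal sketch of what such a file might look like, with a quick field-count check. The tip label, FASTA path, and branch length below are made up for illustration, not taken from the tutorial:

```shell
# Write a hypothetical one-line, tab-delimited update file.
# "myGenome", the path, and 0.05 are illustrative placeholders.
printf 'myGenome\t/path/to/myGenome.fa\t0.05\n' > cactus-update-input.txt

# Sanity check: the line should have exactly three tab-separated fields.
awk -F'\t' 'NF != 3 { bad = 1 } END { print (bad ? "BAD" : "OK") }' cactus-update-input.txt
```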
@@ -176,18 +176,20 @@ In order to run the last step of the workflow that converts the HAL format to a
Once you have all the information listed above, you can enter it into the Snakemake configuration file, along with some other information that tells the pipeline where to look for files and write output. The config file contains 2 sections: one for specifying the input and output options, and one for specifying resources for the various rules (see [below](#specifying-resources-for-each-rule)). The first part should look something like this:

```
-cactus_path: <path/to/cactus-singularity-image OR download>
+cactus_path: <path/to/cactus-singularity-image OR download OR a version string (e.g. 2.9.5)>

-cactus_gpu_path: <path/to/cactus-GPU-singularity-image OR download>
+cactus_gpu_path: <path/to/cactus-GPU-singularity-image OR download OR a version string (e.g. 2.9.5)>
```
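For instance, to have the pipeline download and use a specific Cactus release rather than pointing at a local Singularity image, the first two lines might look like this (the version number is only an illustration of the version-string form):

```
cactus_path: 2.9.5
cactus_gpu_path: 2.9.5
```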
@@ -215,8 +217,9 @@ Simply replace the string surrounded by <> with the path or option desired. Belo
| `cactus_gpu_path` | Path to the Cactus GPU Singularity image. If blank or 'download', the image of the latest Cactus version will be downloaded and used. If a version string is provided (e.g. 2.9.5), then that version will be downloaded and used. This will only be used if `use_gpu` is True. |
| `input_hal` | Path to the previously generated HAL file to which you want to add a new genome. |
| `new_genome_name` | The label to give your new genome in the tree and HAL file. |
-| `new_branch_length` | The length of the branch to create that will connect your new genome to the tree. |
+| `new_genome_fasta` | The path to the FASTA file containing the new genome to add to the alignment. |
| `new_anc_node` | The name to give the new ancestral node that connects the new genome to an existing branch in the tree. |
+| `new_branch_length` | The length of the branch to create that will connect your new genome to the tree. |
| `parent_node` | The name of the ancestral node of the existing branch to which the new branch is connected. |
| `child_node` | The name of the descendant node of the existing branch to which the new branch is connected. |
| `top_branch_length` | The existing branch defined by `parent_node` and `child_node` will be split by `new_anc_node`. This is the length of the new, top-most branch created by the split (defined by `parent_node` and `new_anc_node`). |
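The branch-splitting arithmetic can be made concrete with a toy example. Assuming, as the description implies, that the top and bottom pieces sum to the original branch length, setting `top_branch_length: 0.04` on an existing branch of length 0.10 leaves 0.06 for the bottom piece (`new_anc_node` to `child_node`). The numbers here are made up:

```shell
# Hypothetical split of an existing parent->child branch of length 0.10,
# with top_branch_length set to 0.04; the bottom piece is the remainder.
awk 'BEGIN { orig = 0.10; top = 0.04; printf "bottom branch length: %.2f\n", orig - top }'
```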
@@ -231,18 +234,19 @@ Simply replace the string surrounded by <> with the path or option desired. Belo
Below these options in the config file are further options for specifying resource usage for each rule that the pipeline will run. For example:

```
-preprocess_partition: "shared"
-preprocess_cpu: 8
-preprocess_mem: 25000 # in MB
-preprocess_time: 30 # in minutes
-
-##########################
-
-blast_partition: "gpu_test" # If use_gpu is True, this must be a partition with GPUs
-blast_gpu: 1 # If use_gpu is False, this will be ignored
-blast_cpu: 48
-blast_mem: 50000 # in MB
-blast_time: 120 # in minutes
+rule_resources:
+  preprocess:
+    partition: shared
+    mem_mb: 25000
+    cpus: 8
+    time: 30
+
+  blast:
+    partition: gpu # If use_gpu is True, this must be a partition with GPUs
+    mem_mb: 50000
+    cpus: 48
+    gpus: 1 # If use_gpu is False, this will be ignored
+    time: 30
```

**The rule _blast_ is the only one that uses GPUs if `use_gpu` is True.**
@@ -251,7 +255,7 @@ blast_time: 120 # in minutes
* Be sure to use partition names appropriate for your cluster. Several examples in this tutorial use partition names that are specific to the Harvard cluster, so be sure to change them.
* **Allocate the proper partitions based on `use_gpu`.** If you want to use the GPU version of Cactus (*i.e.* you have set `use_gpu: True` in the config file), the partition for the rule **blast** must be GPU enabled. If not, the pipeline will fail to run.
-* The `blast_gpu:` option will be ignored if `use_gpu: False` is set.
+* The `blast: gpus:` option will be ignored if `use_gpu: False` is set.
* **mem_mb is in MB** and **time is in minutes**.

You will have to determine the proper resource usage for your dataset. Generally, the larger the genomes, the more time and memory each job will need, and the more you will benefit from providing more CPUs and GPUs.
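Since memory is given in MB and time in minutes, a quick conversion helps when translating from your cluster's documentation, which often quotes GiB and hours. This sketch just does the arithmetic for the example blast values above:

```shell
# Convert the example resource values: 50000 MB to GiB, 120 minutes to hours.
awk 'BEGIN {
  printf "50000 MB = %.1f GiB\n", 50000 / 1024
  printf "120 min = %.1f h\n", 120 / 60
}'
```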
@@ -269,7 +273,7 @@ First, we want to make sure everything is setup properly by using the `--dryrun`
This is done with the following command, changing the snakefile (`-s`) and `--configfile` paths to the ones you have created for your project:

```bash
-snakemake -j <# of jobs to submit simultaneously> -e slurm -s </path/to/cactus_uodate.smk> --configfile <path/to/your/snakmake-config.yml> --dryrun
+snakemake -j <# of jobs to submit simultaneously> -e slurm -s </path/to/cactus_update.smk> --configfile <path/to/your/snakemake-config.yml> --dryrun
```

??? info "Command breakdown"
@@ -319,7 +323,7 @@ If you see any red text, that likely means an error has occurred that must be ad
If you're satisfied that the `--dryrun` has completed successfully and you are ready to start submitting Cactus jobs to the cluster, you can do so by simply removing the `--dryrun` option from the command above:

```bash
-snakemake -j <# of jobs to submit simultaneously> -e slurm -s </path/to/cactus.smk> --configfile <path/to/your/snakmake-config.yml>
+snakemake -j <# of jobs to submit simultaneously> -e slurm -s </path/to/cactus_update.smk> --configfile <path/to/your/snakemake-config.yml>
```

This will start submitting jobs to SLURM. On your screen, you will see continuous updates regarding job status in blue text. In another terminal, you can also check on the status of your jobs by running `squeue -u <your user id>`.
@@ -351,7 +355,7 @@ Here is a breakdown of the files so you can investigate them and prepare similar
You will first need to [run the test to generate the HAL file](whole-genome-alignment-cactus.md#test-dataset). Then, you can add the Gorilla sequence to it using this pipeline. We recommend running this test dataset before setting up your own project.

-First, open the config file, `tests/evolverMammals/evolverMammals-update-cfg.yaml` and make sure the partitions are set appropriately for your cluster. For this small test dataset, it is appropriate to use any "test" partitions you may have. Then, update the path to `tmp_dir` to point to a location where you have a lot of temporary space. Even this small dataset will fail if this directory does not have enough space.
+After you've generated the HAL file, open the config file, `tests/evolverMammals/evolverMammals-update-cfg.yaml`, and make sure the partitions are set appropriately for your cluster. For this small test dataset, it is appropriate to use any "test" partitions you may have. Then, update `tmp_dir` to point to a location where you have plenty of temporary space. Even this small dataset will fail if this directory does not have enough space.

After that, run a dryrun of the test dataset by changing into the `tests/evolverMammals` directory and running:
@@ -372,7 +376,7 @@ The pipeline will output a [.paf](https://github.com/lh3/miniasm/blob/master/PAF
The final alignment will also be presented in MAF format as `<final_prefix>.<maf_reference>.maf`, again where `<maf_reference>` is whatever you set in the Snakemake config. This file will include all sequences. Another MAF file, `<final_prefix>.<maf_reference>.nodupes.maf`, will also be generated, which is the alignment in MAF format with duplicate sequences removed. The de-duplicated MAF file is generated with `--dupeMode single`. See the [Cactus documentation regarding MAF export](https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/progressive.md#maf-export) for more info.

-A suit of tools called [HAL tools](https://github.com/ComparativeGenomicsToolkit/Hal) is included with the Cactus singularity image if you need to manipulate or analyze .hal files. There are many tools for manipulating MAF files, though they are not always easy to use. The makers of Cactus also develop [taffy](https://github.com/ComparativeGenomicsToolkit/taffy), which can manipulate MAF files by converting them to TAF files.
+A suite of tools called [HAL tools](https://github.com/ComparativeGenomicsToolkit/Hal) is included with the Cactus singularity image if you need to manipulate or analyze .hal files. There are many tools for manipulating MAF files, though they are not always easy to use. The makers of Cactus also develop [taffy](https://github.com/ComparativeGenomicsToolkit/taffy), which can manipulate MAF files by converting them to TAF files.
docs/resources/Tutorials/how-to-annotate-a-genome.md (+1 -1)
@@ -74,7 +74,7 @@ While the first of these points is dependent upon your research program, the sec
Below is a decision tree for picking an annotation method, based upon our evaluation of the performance of 12 different methods in our forthcoming paper in *Genome Research*.
docs/resources/Tutorials/installing-command-line-software-conda-mamba.md (+3 -3)
@@ -31,7 +31,7 @@ To install mamba, first navigate to the [Miniforge3 repository page](https://git
### Mac/Linux

<center>
-
+<img src="../../img/mamba-install1.png" alt="A screenshot of the miniforge repository's installation instructions" />
</center>

On Mac and Linux machines (the [Harvard cluster runs a version of Linux](https://www.rc.fas.harvard.edu/about/cluster-architecture/)), you'll want to open your Terminal or log in to the server to type the download and install commands.
@@ -59,7 +59,7 @@ If necessary, Miniforge does provide an explicit Windows installer for conda/mam
Once you have followed the above instructions and **restarted your terminal or reconnected to the server**, you should now see that mamba is activated because the `(base)` environment prefix appears before your prompt:

<center>
-
+<img src="../../img/prompt1.png" alt="A screenshot of a command prompt with (base) prepended to it" />
</center>

mamba can be used to manage environments. **Environments** modify aspects of a user's file system that make it easier to install and run software, essentially giving the user full control over their own software and negating the need to access critical parts of the file system.
@@ -101,7 +101,7 @@ mamba env list
Once you are in an environment, your prompt should be updated to be prefixed with that environment's name:

<center>
-
+<img src="../../img/prompt2.png" alt="A screenshot of a command prompt with (project-env) prepended to it" />
</center>

!!! tip "Environments must be activated every time you log on"
docs/resources/Tutorials/pangenome-cactus-minigraph.md (+8 -6)
@@ -133,7 +133,7 @@ Cactus-minigraph requires that you select one sample as a reference sample [for
Besides the sequence input, the pipeline needs some extra configuration to know where to look for files and write output. That is done in the Snakemake configuration file for a given run. It contains 2 sections: one for specifying the input and output options, and one for specifying resources for the various rules (see [below](#specifying-resources-for-each-rule)). The first part should look something like this:

```
-cactus_path: <path/to/cactus-singularity-image OR download>
+cactus_path: <path/to/cactus-singularity-image OR download OR version string>

input_file: <path/to/cactus-input-file>
```
@@ -162,17 +162,19 @@ Simply replace the string surrounded by <> with the path or option desired. Belo
Below these options in the config file are further options for specifying resource usage for each rule that the pipeline will run. For example:

```
-minigraph_partition: "shared"
-minigraph_cpu: 8
-minigraph_mem: 25000
-minigraph_time: 30
+rule_resources:
+  minigraph:
+    partition: shared
+    mem_mb: 25000
+    cpus: 8
+    time: 30
```

!!! warning "Notes on resource allocation"

    * Be sure to use partition names appropriate for your cluster. Several examples in this tutorial use partition names that are specific to the Harvard cluster, so be sure to change them.
    * The steps in the cactus-minigraph pipeline are not GPU compatible, so there are no GPU options in this pipeline.
-    * **mem is in MB** and **time is in minutes**.
+    * **mem_mb is in MB** and **time is in minutes**.

You will have to determine the proper resource usage for your dataset. Generally, the larger the genomes, the more time and memory each job will need, and the more you will benefit from providing more CPUs.