docs/resources/Tutorials/add-to-whole-genome-alignment-cactus.md
14 additions & 14 deletions
@@ -96,12 +96,12 @@ With that, you should be ready to set-up your data for the pipeline!
96
96
97
97
## Inputs you need to prepare
98
98
99
-
To run this pipeline, you will need:
99
+
To run this pipeline, you will need (corresponding Snakemake config option given in parentheses):
100
100
101
-
1. A [**HAL file**](https://github.com/ComparativeGenomicsToolkit/Hal) with a whole genome alignment generated by Cactus.
102
-
2. The location in the tree to add your alignment.
103
-
3. The [**softmasked**](#4-how-can-i-tell-if-my-genome-fasta-files-are-softmasked) genome [FASTA](https://en.wikipedia.org/wiki/FASTA_format) file for the genome you want to add to the alignment.
104
-
4. A reference genome to project the alignment to MAF format.
101
+
1. A [**HAL file**](https://github.com/ComparativeGenomicsToolkit/Hal) with a whole genome alignment generated by Cactus (`input_hal`).
102
+
2. The location in the tree to add your alignment (see below).
103
+
3. The [**softmasked**](#4-how-can-i-tell-if-my-genome-fasta-files-are-softmasked) genome [FASTA](https://en.wikipedia.org/wiki/FASTA_format) file for the genome you want to add to the alignment (`new_genome_fasta`).
104
+
4. A reference genome to project the alignment to MAF format (`maf_reference`).
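Item 3 in the list above asks for a softmasked FASTA, meaning repeats are written in lowercase letters. A rough sanity check could look like the sketch below. It assumes that any lowercase bases indicate soft-masking, and the function name `lowercase_fraction` is hypothetical (it is not part of the pipeline).

```python
from io import StringIO


def lowercase_fraction(fasta_handle):
    """Return the fraction of sequence characters that are lowercase."""
    lower = 0
    total = 0
    for line in fasta_handle:
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        total += len(line)
        lower += sum(1 for base in line if base.islower())
    return lower / total if total else 0.0


# Toy record: 4 of the 12 bases are lowercase (soft-masked).
example = StringIO(">chr1\nACGTacgtACGT\n")
fraction = lowercase_fraction(example)
print(f"{fraction:.2%} of bases are lowercase")
```

A fraction of zero suggests the file is not softmasked. For a real genome, pass an open file handle instead of the `StringIO` toy record.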
105
105
106
106
!!! warning "[The FASTA file must be softmasked!](https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/progressive.md#interface)"
107
107
@@ -129,19 +129,19 @@ which would result in:
129
129
130
130
Now that we have the tree, we need to figure out where to put our new genome. We will need to come up with the following information:
131
131
132
-
1. A **tip label** or name for our new genome.
133
-
2. The **branch length** of the new branch connecting the new genome to an existing branch.
134
-
3. A **label or name** for the new node in our tree, connecting the new branch to an existing branch.
135
-
4. The **branch** on which to add the new node, defined by a parent and a child node.
136
-
5. The branch on which we add that node will have its length split into two separate branches. We must provide the **top-most** branch length of these two new branches (*i.e.* the one defined by our new node as the child).
132
+
1. A **tip label** or name for our new genome (`new_genome_name`).
133
+
2. The **branch length** of the new branch connecting the new genome to an existing branch (`new_branch_length`).
134
+
3. A **label or name** for the new node in our tree, connecting the new branch to an existing branch (`new_anc_node`).
135
+
4. The **branch** on which to add the new node, defined by a parent (`parent_node`) and a child (`child_node`) node.
136
+
5. The branch on which we add that node will have its length split into two separate branches. We must provide the **top-most** branch length of these two new branches (*i.e.* the one defined by our new node as the child) (`top_branch_length`).
137
137
138
138
We borrow and slightly modify an [image from the cactus documentation](https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/updating-alignments.md#adding-a-new-genome) to visualize these pieces of information on an example tree:
139
139
140
140
<center>
141
141
<img src="../../img/cactus-adding-to-branch2-ai.png" alt="Two panels, the first showing a phylogenetic tree with 3 tips and internal nodes labeled, the second showing a 4th tip being added to the tree.">
142
142
</center>
143
143
144
-
In this context, we are adding the genome with the name "6" to our HAL. We are adding it such that it branches off from the branch defined by node 4 as the child and node 5 as the parent. To do so, we create a new node, which we come up with a name for (let's say RC for red circle), and a new branch 6-RC. This new RC node splits the 4-5 branch into two new branches: 4-RC and RC-5. For the pipeline you will need to provide the branch length of the **new** 6-RC branch and the **new** of RC-5 (**top-most**) branch.
144
+
In this context, we are adding the genome with the name **"6"** to our HAL. We are adding it such that it branches off from the branch defined by node 4 as the child and node 5 as the parent. To do so, we create a new node, which we need to name (let's say **RC** for red circle), and a new branch 6-RC. This new RC node splits the 4-5 branch into two new branches: 4-RC and RC-5. For the pipeline, you will need to provide the branch lengths of the **new** 6-RC branch and of the **new** RC-5 (**top-most**) branch.
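As a minimal sketch, the five pieces of information for the example tree can be collected under the config option names listed above. All branch lengths here are made up for illustration, the value types are an assumption, and the helper `bottom_branch_length` is hypothetical. It assumes the two new branches sum to the length of the original 4-5 branch.

```python
# Assumed length of the existing 4-5 branch (made-up value).
original_branch_length = 0.10

# Config option names come from the list above; the values are illustrative.
placement = {
    "new_genome_name": "6",
    "new_branch_length": 0.05,   # length of the new 6-RC branch (made up)
    "new_anc_node": "RC",
    "parent_node": "5",
    "child_node": "4",
    "top_branch_length": 0.04,   # length of the new RC-5 branch (made up)
}


def bottom_branch_length(original, top):
    """Length left for the lower (4-RC) branch after splitting the original."""
    if not 0 < top < original:
        raise ValueError("top branch length must be between 0 and the original length")
    return original - top


remaining = bottom_branch_length(
    original_branch_length, placement["top_branch_length"]
)
print(f"RC-5: {placement['top_branch_length']:.2f}, 4-RC: {remaining:.2f}")
```

The check in `bottom_branch_length` catches the easy mistake of giving a top-most branch length longer than the branch being split.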
145
145
146
146
If you're very good at parsing Newick tree strings by eye, you may be able to get this information just by looking at the output of `halStats --tree`. However, in most cases, you'll want to look at an image of the tree. Consider using tree viewing software like [SeaView](https://doua.prabi.fr/software/seaview) or [the ape library in R](https://cran.r-project.org/web/packages/ape/index.html). EMBL also has an [online, interactive tree viewer](https://itol.embl.de/) where you can just paste the tree string to see an image of it.
147
147
@@ -153,7 +153,7 @@ Once you have the 5 pieces of information from the tree listed above, you're rea
153
153
154
154
### Reference sample
155
155
156
-
In order to run the last step of the workflow that converts the HAL format to a readable MAF format (See [pipeline outputs](#pipeline-outputs) for more info), you will need to select one assembly as a reference assembly. The reference assembly's coordinate system will be used for projection to MAF format. You should indicate the reference assembly in the Snakemake config file (outlined below). For instance, if I wanted my reference sample in the above tree to be the genome labeled **1** in the tree, I would put the string `1` in the `maf_reference:` line of the Snakemake config file.
156
+
In order to run the last step of the workflow, which converts the HAL format to a readable MAF format (see [pipeline outputs](#pipeline-outputs) for more info), you will need to select one assembly as a reference assembly. The reference assembly's coordinate system will be used for projection to MAF format. You should indicate the reference assembly in the Snakemake config file (outlined below). For instance, if I wanted my reference sample to be the genome labeled **1** in the above tree, I would put the string `1` in the `maf_reference` line of the Snakemake config file.
157
157
158
158
### Preparing the Snakemake config file
159
159
@@ -171,7 +171,7 @@ In order to run the last step of the workflow that converts the HAL format to a
171
171
172
172
The config for the Cactus test data can be found [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/tests/evolverMammals/evolverMammals-update-cfg.yaml) or at `tests/evolverMammals/evolverMammals-update-cfg.yaml` in your downloaded cactus-snakemake repo. Be sure to use this as the template for your project since it has all the options needed! **Note: the partitions set in this config file are specific to the Harvard cluster. Be sure to update them if you are running this pipeline elsewhere.**
173
173
174
-
Additionally, a blank template file is located [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/update-config-template.yaml) or at `update-config-template.yaml` in your downloaded cactus-snakemake repo.
174
+
Additionally, a blank template file is located [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/config-templates/update-config-template.yaml) or at `config-templates/update-config-template.yaml` in your downloaded cactus-snakemake repo.
175
175
176
176
Once you have all the information listed above, you can enter it into the Snakemake configuration file along with some other information to know where to look for files and write output. The config file contains 2 sections, one for specifying the input and output options, and one for specifying resources for the various rules (see [below](#specifying-resources-for-each-rule)). The first part should look something like this:
The pipeline will output a [.paf](https://github.com/lh3/miniasm/blob/master/PAF.md), a [.hal](https://github.com/ComparativeGenomicsToolkit/hal/blob/master/README.md), and a [.fa](https://en.wikipedia.org/wiki/FASTA_format) file for the new ancestral node as well as the parent node. If you specified `overwrite_original_hal: False`The final alignment file will be `<final_prefix>.hal`, where `<final_prefix>` is whatever you specified in the Snakemake config file. Otherwise, the original HAL will be modified in place.
375
+
The pipeline will output a [.paf](https://github.com/lh3/miniasm/blob/master/PAF.md), a [.hal](https://github.com/ComparativeGenomicsToolkit/hal/blob/master/README.md), and a [.fa](https://en.wikipedia.org/wiki/FASTA_format) file for the new ancestral node as well as the parent node. If you specified `overwrite_original_hal: False`, the final alignment file will be `<final_prefix>.hal`, where `<final_prefix>` is whatever you specified in the Snakemake config file. Otherwise, the original HAL will be modified in place.
376
376
377
377
The final alignment will also be presented in MAF format as `<final_prefix>.<maf_reference>.maf`, again where `<maf_reference>` is whatever you set in the Snakemake config. This file will include all sequences. Another MAF file, `<final_prefix>.<maf_reference>.nodupes.maf` will also be generated, which is the alignment in MAF format with no duplicate sequences. The de-duplicated MAF file is generated with `--dupeMode single`. See the [Cactus documentation regarding MAF export](https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/progressive.md#maf-export) for more info.
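The naming scheme described above can be sketched as a small helper. The function name `expected_outputs` and the example values are hypothetical, and only base file names are built because the output directory depends on your config.

```python
def expected_outputs(final_prefix, maf_reference):
    """Build the final output file names from the two config values."""
    return {
        # Only written when overwrite_original_hal is False.
        "hal": f"{final_prefix}.hal",
        # MAF containing all sequences.
        "maf": f"{final_prefix}.{maf_reference}.maf",
        # MAF with duplicate sequences removed.
        "maf_nodupes": f"{final_prefix}.{maf_reference}.nodupes.maf",
    }


# Hypothetical values for final_prefix and maf_reference.
outputs = expected_outputs("my-updated-alignment", "1")
for label, name in outputs.items():
    print(f"{label}: {name}")
```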
docs/resources/Tutorials/pangenome-cactus-minigraph.md
4 additions & 4 deletions
@@ -89,10 +89,10 @@ With that, you should be ready to set-up your data for the pipeline!
89
89
90
90
## Inputs you need to prepare
91
91
92
-
To run this pipeline, you will need:
92
+
To run this pipeline, you will need (corresponding Snakemake config option given in parentheses):
93
93
94
-
1. The assembled genome [FASTA](https://en.wikipedia.org/wiki/FASTA_format) files for each sample.
95
-
2. A reference sample.
94
+
1. The assembled genome [FASTA](https://en.wikipedia.org/wiki/FASTA_format) files for each sample (specified in `input_file`).
95
+
2. A reference sample (`reference`).
96
96
97
97
You will use these to create the input file for Cactus-minigraph.
98
98
@@ -128,7 +128,7 @@ Cactus-minigraph requires that you select one sample as a reference sample [for
128
128
129
129
The config for the Cactus-minigraph test data can be found [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/tests/yeast-minigraph/yeast-minigraph-cfg.yaml) or at `tests/yeast-minigraph/yeast-minigraph-cfg.yaml` in your downloaded cactus-snakemake repo. Be sure to use this as the template for your project since it has all the options needed! **Note: the partitions set in this config file are specific to the Harvard cluster. Be sure to update them if you are running this pipeline elsewhere.**
130
130
131
-
Additionally, a blank template file is located [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/minigraph-config-template.yaml) or at `minigraph-config-template.yaml` in your downloaded cactus-snakemake repo.
131
+
Additionally, a blank template file is located [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/config-templates/minigraph-config-template.yaml) or at `config-templates/minigraph-config-template.yaml` in your downloaded cactus-snakemake repo.
132
132
133
133
Besides the sequence input, the pipeline needs some extra configuration to know where to look for files and write output. That is done in the Snakemake configuration file for a given run. It contains 2 sections, one for specifying the input and output options, and one for specifying resources for the various rules (see [below](#specifying-resources-for-each-rule)). The first part should look something like this:
docs/resources/Tutorials/replace-genome-whole-genome-alignment-cactus.md
7 additions & 7 deletions
@@ -96,12 +96,12 @@ With that, you should be ready to set-up your data for the pipeline!
96
96
97
97
## Inputs you need to prepare
98
98
99
-
To run this pipeline, you will need:
99
+
To run this pipeline, you will need (corresponding Snakemake config option given in parentheses):
100
100
101
-
1. A [**HAL file**](https://github.com/ComparativeGenomicsToolkit/Hal) with a whole genome alignment generated by Cactus.
102
-
2. The **name** of the genome you want to replace in the hAL file
103
-
3. The [**softmasked**](#4-how-can-i-tell-if-my-genome-fasta-files-are-softmasked) genome [FASTA](https://en.wikipedia.org/wiki/FASTA_format) file for the genome you want to add to the alignment.
104
-
4. A reference genome to project the alignment to MAF format.
101
+
1. A [**HAL file**](https://github.com/ComparativeGenomicsToolkit/Hal) with a whole genome alignment generated by Cactus (`input_hal`).
102
+
2. The **name** of the genome you want to replace in the HAL file (`replace`).
103
+
3. The [**softmasked**](#4-how-can-i-tell-if-my-genome-fasta-files-are-softmasked) genome [FASTA](https://en.wikipedia.org/wiki/FASTA_format) file for the genome you want to add to the alignment (`new_genome_fasta`).
104
+
4. A reference genome to project the alignment to MAF format (`maf_reference`).
105
105
106
106
!!! warning "[The FASTA file must be softmasked!](https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/progressive.md#interface)"
If we want to replace the *simHuman_chr6* genome in our HAL with a new version of the sequence, we would set this label as the value for `replace:` in our Snakemake config file below.
130
+
If we want to replace the *simHuman_chr6* genome in our HAL with a new version of the sequence, we would set this label as the value for `replace` in our Snakemake config file below.
131
131
132
132
You can also run `halStats --genomes example.hal` to print out the labels without the Newick tree formatting.
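Before setting `replace`, it may be worth confirming that the label really is in the HAL. The sketch below works on the text printed by `halStats --genomes` and assumes the labels are separated by whitespace. The function name `genome_in_hal` is hypothetical, and every label in the example output other than `simHuman_chr6` is a placeholder.

```python
def genome_in_hal(genomes_output, label):
    """Return True if label is one of the whitespace-separated genome names."""
    return label in genomes_output.split()


# Placeholder text standing in for the captured output of
# `halStats --genomes example.hal`.
example_output = "otherGenome1 simHuman_chr6 otherGenome2"

found = genome_in_hal(example_output, "simHuman_chr6")
print("simHuman_chr6 present:", found)
```

Matching on whole labels, rather than substrings, avoids a false hit when one genome name is a prefix of another.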
133
133
@@ -155,7 +155,7 @@ In order to run the last step of the workflow that converts the HAL format to a
155
155
156
156
The config for the Cactus test data can be found [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/tests/evolverMammals/evolverMammals-replace-cfg.yaml) or at `tests/evolverMammals/evolverMammals-replace-cfg.yaml` in your downloaded cactus-snakemake repo. Be sure to use this as the template for your project since it has all the options needed! **Note: the partitions set in this config file are specific to the Harvard cluster. Be sure to update them if you are running this pipeline elsewhere.**
157
157
158
-
Additionally, a blank template file is located [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/replace-config-template.yaml) or at `replace-config-template.yaml` in your downloaded cactus-snakemake repo.
158
+
Additionally, a blank template file is located [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/config-templates/replace-config-template.yaml) or at `config-templates/replace-config-template.yaml` in your downloaded cactus-snakemake repo.
159
159
160
160
Once you have all the information listed above, you can enter it into the Snakemake configuration file along with some other information to know where to look for files and write output. The config file contains 2 sections, one for specifying the input and output options, and one for specifying resources for the various rules (see [below](#specifying-resources-for-each-rule)). The first part should look something like this: