Commit cf771e2

Merge pull request #52 from harvardinformatics/cactus-add-outgroup-tutorial
Cactus add outgroup tutorial
2 parents 94ad39a + 6291fa8 commit cf771e2

14 files changed: 1190 additions & 34 deletions

.githooks/pre-commit

Lines changed: 39 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,39 @@
1+
#!/bin/sh
2+
# .githooks/pre-commit
3+
4+
echo "Running pre-commit hook via sh..."
5+
6+
FILE="mkdocs.yml"
7+
TEMP_FILE="$(mktemp)"
8+
9+
# Check if the file exists
10+
if [ ! -f "$FILE" ]; then
11+
echo "Warning: mkdocs.yml not found"
12+
exit 0
13+
fi
14+
15+
# Recomment the development-only ignore line
16+
# Matches exactly any uncommented line saying "ignore: ['*.ipynb']"
17+
MODIFIED=0
18+
while IFS= read -r LINE; do
19+
if printf '%s\n' "$LINE" | grep -q "^[[:space:]]*ignore: \['\*\.ipynb'\]"; then
20+
# Extract indent and recomment cleanly
21+
INDENT=$(printf "%s" "$LINE" | sed -E 's/^([[:space:]]*).*$/\1/')
22+
BODY=$(printf "%s" "$LINE" | sed -E 's/^[[:space:]]*(.*)$/\1/')
23+
printf '%s\n' "${INDENT}# ${BODY}" >> "$TEMP_FILE"
24+
MODIFIED=1
25+
else
26+
printf '%s\n' "$LINE" >> "$TEMP_FILE"
27+
fi
28+
done < "$FILE"
29+
30+
if [ "$MODIFIED" -eq 1 ]; then
31+
echo "Detected uncommented development-only ignore line in mkdocs.yml"
32+
echo "Re-commenting it: ignore: ['*.ipynb'] → # ignore: ['*.ipynb']"
33+
mv "$TEMP_FILE" "$FILE"
34+
git add "$FILE"
35+
else
36+
rm "$TEMP_FILE"
37+
fi
38+
39+
exit 0
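
The hook's read loop can also be expressed as a single `sed` substitution. The following is a minimal sketch under stated assumptions, not the committed implementation: `recomment_ignore` is a hypothetical helper name, and the three-line demo file (including the plugin name `some-plugin`) is made up to stand in for `mkdocs.yml`.

```shell
#!/bin/sh
# Sketch only: a sed-based equivalent of the hook's read loop.
# recomment_ignore is a hypothetical helper name, not part of the commit.
recomment_ignore() {
    # Capture the leading whitespace (\1) and the ignore entry (\2),
    # then re-emit them with "# " in between.
    sed -E "s/^([[:space:]]*)(ignore: \['\*\.ipynb'\])/\1# \2/" "$1"
}

# Demo on a throwaway file that stands in for mkdocs.yml (contents are made up).
DEMO_FILE="$(mktemp)"
printf '%s\n' "plugins:" "  - some-plugin:" "      ignore: ['*.ipynb']" > "$DEMO_FILE"
RESULT="$(recomment_ignore "$DEMO_FILE")"
printf '%s\n' "$RESULT"
rm -f "$DEMO_FILE"
```

A line that is already commented is left alone, because only whitespace may precede `ignore:` for the pattern to match.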

.github/workflows/gh-pages.yml

Lines changed: 10 additions & 2 deletions
@@ -31,11 +31,17 @@ jobs:
3131
actions: read
3232
steps:
3333
# Cache lychee external URL results for 30 days
34+
- name: Checkout repository
35+
uses: actions/checkout@v4
36+
3437
- name: Download site
3538
uses: actions/download-artifact@v4
3639
with:
3740
name: github-pages
38-
- run: tar -xf artifact.tar && rm artifact.tar
41+
42+
- name: Extract artifact
43+
run: tar -xf artifact.tar && rm artifact.tar
44+
3945
# https://github.com/lycheeverse/lychee-action#utilising-the-cache-feature
4046
- name: Restore lychee cache
4147
id: restore-cache
@@ -44,11 +50,13 @@ jobs:
4450
path: .lycheecache
4551
key: cache-lychee-${{ github.sha }}
4652
restore-keys: cache-lychee-
53+
4754
- name: Run lychee
4855
uses: lycheeverse/lychee-action@v1.8.0
4956
with:
50-
args: "--base . --cache --max-cache-age 30d --max-concurrency 1 --require-https --timeout 5 --exclude-path 'assets/home.html' --exclude 'academic.oup.com/bioinformatics/' --exclude 'useast.ensembl.org' --exclude 'doi.org' --exclude 'academic.oup.com/nar' --exclude 'gnu.org' --exclude 'anaconda.org' --exclude 'fonts.gstatic.com' --exclude 'www.microsoft.com/en-us/microsoft-365/onedrive/online-cloud-storage' --exclude-path 404.html -- './**/*.html' './**/*.css'"
57+
args: "--base . --cache --config .github/workflows/lychee.toml -- '.site/**/*.html' '.site/**/*.css'"
5158
fail: true
59+
5260
- name: Save lychee cache
5361
uses: actions/cache/save@v3
5462
if: always()

.github/workflows/lychee.toml

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
# lychee.toml
2+
3+
# Optional: cache link-check results between runs
4+
cache = true
5+
max_cache_age = "30d"
6+
max_concurrency = 1
7+
require_https = true
8+
timeout = 5
9+
10+
# Exclude URLs matching these patterns (lychee treats each entry as a regular expression)
11+
exclude = [
12+
"https://scholar.google.com",
13+
"https://academic.oup.com/bioinformatics/",
14+
"https://useast.ensembl.org",
15+
"https://doi.org",
16+
"https://academic.oup.com/nar",
17+
"https://www.gnu.org",
18+
"https://anaconda.org",
19+
"https://fonts.gstatic.com",
20+
"https://www.microsoft.com/en-us/microsoft-365/onedrive/online-cloud-storage",
21+
]
22+
23+
# Exclude files or paths from checking
24+
exclude_path = [
25+
"assets/home.html",
26+
"404.html"
27+
]
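
Because lychee treats each `exclude` entry as a regular expression, an entry such as `https://doi.org` excludes every URL that contains it, not only that exact URL. The sketch below illustrates the effect with `grep -E` standing in for lychee's regex engine; that substitution is an assumption (the two dialects differ in details but agree on patterns this simple), and both URLs are made up for illustration.

```shell
#!/bin/sh
# Sketch: grep -E stands in for lychee's regex engine (an assumption).
PATTERN="https://doi.org"

# Made-up URLs for illustration only.
for URL in "https://doi.org/10.1000/example" "https://example.org/page"; do
    if printf '%s\n' "$URL" | grep -qE "$PATTERN"; then
        echo "excluded: $URL"
    else
        echo "checked:  $URL"
    fi
done
```

Note that an unescaped `.` in these entries matches any character, so the patterns are slightly broader than the literal host names suggest.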

docs/resources/Tutorials/add-outgroup-to-whole-genome-alignment-cactus.md

Lines changed: 436 additions & 0 deletions
Large diffs are not rendered by default.

docs/resources/Tutorials/add-to-whole-genome-alignment-cactus.md

Lines changed: 14 additions & 14 deletions
@@ -96,12 +96,12 @@ With that, you should be ready to set-up your data for the pipeline!
9696

9797
## Inputs you need to prepare
9898

99-
To run this pipeline, you will need:
99+
To run this pipeline, you will need (corresponding Snakemake config option given in parentheses):
100100

101-
1. A [**HAL file**](https://github.com/ComparativeGenomicsToolkit/Hal) with a whole genome alignment generated by Cactus.
102-
2. The location in the tree to add your alignment.
103-
3. The [**softmasked**](#4-how-can-i-tell-if-my-genome-fasta-files-are-softmasked) genome [FASTA](https://en.wikipedia.org/wiki/FASTA_format) file for the genome you want to add to the alignment.
104-
4. A reference genome to project the alignment to MAF format.
101+
1. A [**HAL file**](https://github.com/ComparativeGenomicsToolkit/Hal) with a whole genome alignment generated by Cactus (`input_hal`).
102+
2. The location in the tree to add your alignment (see below).
103+
3. The [**softmasked**](#4-how-can-i-tell-if-my-genome-fasta-files-are-softmasked) genome [FASTA](https://en.wikipedia.org/wiki/FASTA_format) file for the genome you want to add to the alignment (`new_genome_fasta`).
104+
4. A reference genome to project the alignment to MAF format (`maf_reference`).
105105

106106
!!! warning "[The FASTA file must be softmasked!](https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/progressive.md#interface)"
107107

@@ -129,19 +129,19 @@ which would result in:
129129
130130
Now that we have the tree, we need to figure out where to put our new genome. We will need to come up with the following information:
131131
132-
1. A **tip label** or name for our new genome.
133-
2. The **branch length** of the new branch connecting the new genome to an existing branch.
134-
3. A **label or name** for the new node in our tree, connecting the new branch to an existing branch.
135-
4. The **branch** on which to add the new node, defined by a parent and a child node.
136-
5. The branch on which we add that node will have its length split into two separate branches. We must provide the **top-most** branch length of these two new branches (*i.e.* the one defined by our new node as the child).
132+
1. A **tip label** or name for our new genome (`new_genome_name`).
133+
2. The **branch length** of the new branch connecting the new genome to an existing branch (`new_branch_length`).
134+
3. A **label or name** for the new node in our tree, connecting the new branch to an existing branch (`new_anc_node`).
135+
4. The **branch** on which to add the new node, defined by a parent (`parent_node`) and a child (`child_node`) node.
136+
5. The branch on which we add that node will have its length split into two separate branches. We must provide the **top-most** branch length of these two new branches (*i.e.* the one defined by our new node as the child) (`top_branch_length`).
137137
138138
We borrow and slightly modify an [image from the cactus documentation](https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/updating-alignments.md#adding-a-new-genome) to visualize these pieces of information on an example tree:
139139
140140
<center>
141141
<img src="../../img/cactus-adding-to-branch2-ai.png" alt="Two panels, the first showing a phylogenetic tree with 3 tips and internal nodes labeled, the second showing a 4th tip being added to the tree.">
142142
</center>
143143
144-
In this context, we are adding the genome with the name "6" to our HAL. We are adding it such that it branches off from the branch defined by node 4 as the child and node 5 as the parent. To do so, we create a new node, which we come up with a name for (let's say RC for red circle), and a new branch 6-RC. This new RC node splits the 4-5 branch into two new branches: 4-RC and RC-5. For the pipeline you will need to provide the branch length of the **new** 6-RC branch and the **new** of RC-5 (**top-most**) branch.
144+
In this context, we are adding the genome with the name **"6"** to our HAL. We are adding it such that it branches off from the branch defined by node 4 as the child and node 5 as the parent. To do so, we create a new node, which we come up with a name for (let's say **RC** for red circle), and a new branch 6-RC. This new RC node splits the 4-5 branch into two new branches: 4-RC and RC-5. For the pipeline, you will need to provide the branch lengths of the **new** 6-RC branch and the **new** RC-5 (**top-most**) branch.
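
To make the split concrete, here is a small worked example with made-up numbers. The tutorial gives no branch lengths for this figure, so both values below are hypothetical. It also assumes that the lower 4-RC branch receives whatever remains of the original 4-5 branch once the top-most RC-5 length (`top_branch_length`) is taken out.

```shell
#!/bin/sh
# Hypothetical numbers: the tutorial gives no branch lengths for this figure.
ORIG_BRANCH_LENGTH=0.10   # assumed length of the existing 4-5 branch
TOP_BRANCH_LENGTH=0.04    # assumed value for top_branch_length (RC-5)

# Assumption: the remainder of the original branch becomes the lower (4-RC) branch.
BOTTOM_BRANCH_LENGTH="$(awk -v total="$ORIG_BRANCH_LENGTH" -v top="$TOP_BRANCH_LENGTH" \
    'BEGIN { printf "%.2f", total - top }')"

echo "RC-5 (top-most): $TOP_BRANCH_LENGTH"
echo "4-RC (bottom):   $BOTTOM_BRANCH_LENGTH"
```

Whatever value you choose for `top_branch_length` should be smaller than the length of the branch being split, otherwise nothing is left for the lower branch.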
145145
146146
If you're very good at parsing Newick tree strings by eye, you may be able to get this information just by looking at the output of `halStats --tree`. However, in most cases, you'll want to look at an image of the tree. Consider using some sort of tree viewing software like [SeaView](https://doua.prabi.fr/software/seaview) or [the ape library in R](https://cran.r-project.org/web/packages/ape/index.html). EMBL also has an [online, interactive tree viewer](https://itol.embl.de/) where you can just paste the tree string to see an image of it.
147147
@@ -153,7 +153,7 @@ Once you have the 5 pieces of information from the tree listed above, you're rea
153153
154154
### Reference sample
155155
156-
In order to run the last step of the workflow that converts the HAL format to a readable MAF format (See [pipeline outputs](#pipeline-outputs) for more info), you will need to select one assembly as a reference assembly. The reference assembly's coordinate system will be used for projection to MAF format. You should indicate the reference assembly in the Snakemake config file (outlined below). For instance, if I wanted my reference sample in the above tree to be the genome labeled **1** in the tree, I would put the string `1` in the `maf_reference:` line of the Snakemake config file.
156+
In order to run the last step of the workflow that converts the HAL format to a readable MAF format (See [pipeline outputs](#pipeline-outputs) for more info), you will need to select one assembly as a reference assembly. The reference assembly's coordinate system will be used for projection to MAF format. You should indicate the reference assembly in the Snakemake config file (outlined below). For instance, if I wanted my reference sample in the above tree to be the genome labeled **1** in the tree, I would put the string `1` in the `maf_reference` line of the Snakemake config file.
157157
158158
### Preparing the Snakemake config file
159159
@@ -171,7 +171,7 @@ In order to run the last step of the workflow that converts the HAL format to a
171171
172172
The config for the Cactus test data can be found [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/tests/evolverMammals/evolverMammals-update-cfg.yaml) or at `tests/evolverMammals/evolverMammals-update-cfg.yaml` in your downloaded cactus-snakemake repo. Be sure to use this as the template for your project since it has all the options needed! **Note: the partitions set in this config file are specific to the Harvard cluster. Be sure to update them if you are running this pipeline elsewhere.**
173173
174-
Additionally, a blank template file is located [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/update-config-template.yaml) or at `update-config-template.yaml` in your downloaded cactus-snakemake repo.
174+
Additionally, a blank template file is located [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/config-templates/update-config-template.yaml) or at `config-templates/update-config-template.yaml` in your downloaded cactus-snakemake repo.
175175
176176
Once you have all the information listed above, you can enter it into the Snakemake configuration file along with some other information to know where to look for files and write output. The config file contains 2 sections, one for specifying the input and output options, and one for specifying resources for the various rules (see [below](#specifying-resources-for-each-rule)). The first part should look something like this:
177177

@@ -372,7 +372,7 @@ snakemake -j 10 -e slurm -s ../../cactus_update.smk --configfile evolverMammals-
372372

373373
## Pipeline outputs
374374

375-
The pipeline will output a [.paf](https://github.com/lh3/miniasm/blob/master/PAF.md), a [.hal](https://github.com/ComparativeGenomicsToolkit/hal/blob/master/README.md), and a [.fa](https://en.wikipedia.org/wiki/FASTA_format) file for the new ancestral node as well as the parent node. If you specified `overwrite_original_hal: False` The final alignment file will be `<final_prefix>.hal`, where `<final_prefix>` is whatever you specified in the Snakemake config file. Otherwise, the original HAL will be modified in place.
375+
The pipeline will output a [.paf](https://github.com/lh3/miniasm/blob/master/PAF.md), a [.hal](https://github.com/ComparativeGenomicsToolkit/hal/blob/master/README.md), and a [.fa](https://en.wikipedia.org/wiki/FASTA_format) file for the new ancestral node as well as the parent node. If you specified `overwrite_original_hal: False`, the final alignment file will be `<final_prefix>.hal`, where `<final_prefix>` is whatever you specified in the Snakemake config file. Otherwise, the original HAL will be modified in place.
376376

377377
The final alignment will also be presented in MAF format as `<final_prefix>.<maf_reference>.maf`, again where `<maf_reference>` is whatever you set in the Snakemake config. This file will include all sequences. Another MAF file, `<final_prefix>.<maf_reference>.nodupes.maf` will also be generated, which is the alignment in MAF format with no duplicate sequences. The de-duplicated MAF file is generated with `--dupeMode single`. See the [Cactus documentation regarding MAF export](https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/progressive.md#maf-export) for more info.
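
The naming scheme above can be sketched as a quick post-run check. This is only an illustration: the two values are hypothetical placeholders for your own config settings, and the script assumes it is run from whichever directory the pipeline wrote its output to. The `.hal` file is only expected when `overwrite_original_hal: False` was set.

```shell
#!/bin/sh
# Hypothetical config values; substitute the ones from your own config file.
FINAL_PREFIX="my-updated-alignment"
MAF_REFERENCE="my_reference_genome"

# File names built from the patterns described in the text.
HAL_OUT="${FINAL_PREFIX}.hal"
MAF_OUT="${FINAL_PREFIX}.${MAF_REFERENCE}.maf"
MAF_NODUPES_OUT="${FINAL_PREFIX}.${MAF_REFERENCE}.nodupes.maf"

# Report which of the expected outputs exist in the current directory (an assumption).
for OUT in "$HAL_OUT" "$MAF_OUT" "$MAF_NODUPES_OUT"; do
    if [ -f "$OUT" ]; then
        echo "found:   $OUT"
    else
        echo "missing: $OUT"
    fi
done
```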
378378

docs/resources/Tutorials/pangenome-cactus-minigraph.md

Lines changed: 4 additions & 4 deletions
@@ -89,10 +89,10 @@ With that, you should be ready to set-up your data for the pipeline!
8989

9090
## Inputs you need to prepare
9191

92-
To run this pipeline, you will need:
92+
To run this pipeline, you will need (corresponding Snakemake config option given in parentheses):
9393

94-
1. The assembled genome [FASTA](https://en.wikipedia.org/wiki/FASTA_format) files for each sample.
95-
2. A reference sample.
94+
1. The assembled genome [FASTA](https://en.wikipedia.org/wiki/FASTA_format) files for each sample (specified in `input_file`).
95+
2. A reference sample (`reference`).
9696

9797
You will use these to create the input file for Cactus-minigraph.
9898

@@ -128,7 +128,7 @@ Cactus-minigraph requires that you select one sample as a reference sample [for
128128

129129
The config for the Cactus-minigraph test data can be found [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/tests/yeast-minigraph/yeast-minigraph-cfg.yaml) or at `tests/yeast-minigraph/yeast-minigraph-cfg.yaml` in your downloaded cactus-snakemake repo. Be sure to use this as the template for your project since it has all the options needed! **Note: the partitions set in this config file are specific to the Harvard cluster. Be sure to update them if you are running this pipeline elsewhere.**
130130

131-
Additionally, a blank template file is located [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/minigraph-config-template.yaml) or at `minigraph-config-template.yaml` in your downloaded cactus-snakemake repo.
131+
Additionally, a blank template file is located [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/config-templates/minigraph-config-template.yaml) or at `config-templates/minigraph-config-template.yaml` in your downloaded cactus-snakemake repo.
132132

133133
Besides the sequence input, the pipeline needs some extra configuration to know where to look for files and write output. That is done in the Snakemake configuration file for a given run. It contains 2 sections, one for specifying the input and output options, and one for specifying resources for the various rules (see [below](#specifying-resources-for-each-rule)). The first part should look something like this:
134134

docs/resources/Tutorials/replace-genome-whole-genome-alignment-cactus.md

Lines changed: 7 additions & 7 deletions
@@ -96,12 +96,12 @@ With that, you should be ready to set-up your data for the pipeline!
9696

9797
## Inputs you need to prepare
9898

99-
To run this pipeline, you will need:
99+
To run this pipeline, you will need (corresponding Snakemake config option given in parentheses):
100100

101-
1. A [**HAL file**](https://github.com/ComparativeGenomicsToolkit/Hal) with a whole genome alignment generated by Cactus.
102-
2. The **name** of the genome you want to replace in the hAL file
103-
3. The [**softmasked**](#4-how-can-i-tell-if-my-genome-fasta-files-are-softmasked) genome [FASTA](https://en.wikipedia.org/wiki/FASTA_format) file for the genome you want to add to the alignment.
104-
4. A reference genome to project the alignment to MAF format.
101+
1. A [**HAL file**](https://github.com/ComparativeGenomicsToolkit/Hal) with a whole genome alignment generated by Cactus (`input_hal`).
102+
2. The **name** of the genome you want to replace in the HAL file (`replace`).
103+
3. The [**softmasked**](#4-how-can-i-tell-if-my-genome-fasta-files-are-softmasked) genome [FASTA](https://en.wikipedia.org/wiki/FASTA_format) file for the genome you want to add to the alignment (`new_genome_fasta`).
104+
4. A reference genome to project the alignment to MAF format (`maf_reference`).
105105

106106
!!! warning "[The FASTA file must be softmasked!](https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/progressive.md#interface)"
107107

@@ -127,7 +127,7 @@ which would result in:
127127
((simHuman_chr6:0.144018,(simMouse_chr6:0.084509,simRat_chr6:0.091589)mr:0.271974)Anc1:0.020593,(simCow_chr6:0.18908,simDog_chr6:0.16303)Anc2:0.032898)Anc0;
128128
```
129129
130-
If we want to replace the *simHuman_chr6* genome in our HAL with a new version of the sequence, we would set this label as the value for `replace:` in our Snakemake config file below.
130+
If we want to replace the *simHuman_chr6* genome in our HAL with a new version of the sequence, we would set this label as the value for `replace` in our Snakemake config file below.
131131
132132
You can also run `halStats --genomes example.hal` to print out the labels without the Newick tree formatting.
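
If you only have the Newick string at hand, the tip labels can also be pulled out with standard text tools. This is a rough sketch rather than a Newick parser: it assumes labels contain only letters, digits, and underscores, and that every tip carries a branch length, which holds for the example tree above but not for Newick in general.

```shell
#!/bin/sh
# The example tree printed by `halStats --tree` in this tutorial.
TREE='((simHuman_chr6:0.144018,(simMouse_chr6:0.084509,simRat_chr6:0.091589)mr:0.271974)Anc1:0.020593,(simCow_chr6:0.18908,simDog_chr6:0.16303)Anc2:0.032898)Anc0;'

# Tips are the labels that directly follow "(" or ",". Internal node labels
# follow ")" and are therefore skipped. Break the string at "(" and ",",
# then keep the label at the start of each resulting line.
TIP_LABELS="$(printf '%s\n' "$TREE" \
    | tr '(,' '\n\n' \
    | sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\):.*/\1/p')"

printf '%s\n' "$TIP_LABELS"
```

Any of the printed labels is a candidate value for `replace`; the ancestral labels (`mr`, `Anc0`, `Anc1`, `Anc2`) are deliberately left out.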
133133
@@ -155,7 +155,7 @@ In order to run the last step of the workflow that converts the HAL format to a
155155
156156
The config for the Cactus test data can be found [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/tests/evolverMammals/evolverMammals-replace-cfg.yaml) or at `tests/evolverMammals/evolverMammals-replace-cfg.yaml` in your downloaded cactus-snakemake repo. Be sure to use this as the template for your project since it has all the options needed! **Note: the partitions set in this config file are specific to the Harvard cluster. Be sure to update them if you are running this pipeline elsewhere.**
157157
158-
Additionally, a blank template file is located [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/replace-config-template.yaml) or at `replace-config-template.yaml` in your downloaded cactus-snakemake repo.
158+
Additionally, a blank template file is located [here](https://github.com/harvardinformatics/cactus-snakemake/blob/main/config-templates/replace-config-template.yaml) or at `config-templates/replace-config-template.yaml` in your downloaded cactus-snakemake repo.
159159
160160
Once you have all the information listed above, you can enter it into the Snakemake configuration file along with some other information to know where to look for files and write output. The config file contains 2 sections, one for specifying the input and output options, and one for specifying resources for the various rules (see [below](#specifying-resources-for-each-rule)). The first part should look something like this:
161161
