Commit

Quartz sync: Feb 22, 2024, 4:34 PM
dubssieg committed Feb 22, 2024
1 parent b3f04b4 commit 946242e
Showing 4 changed files with 19 additions and 1 deletion.
10 changes: 10 additions & 0 deletions content/Building a graph/minigraph-cactus.md
@@ -41,6 +41,10 @@ Lines with a `#` are interpreted as notes, and will be skipped. Empty lines will
> [!WARNING] Warning
> Never use `_` or spaces in sequence names, and never let one name be a prefix of another (ex: if you have a sequence named `zea252` and a sequence named `zea2`, the pipeline will crash. The same goes for `seq1` and `seq10`, but `seq01` and `seq00` are fine)
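
The naming rules in the warning can be checked before launching the pipeline. The sketch below is only an illustration: `check_seq_names` is a hypothetical helper (not part of cactus) that reads one name per line and reports names containing `_` or whitespace, as well as names that are a prefix of another name.

```shell
# Hypothetical helper: read one sequence name per line on stdin and
# report names that break the rules stated in the warning.
check_seq_names() {
    LC_ALL=C sort -u | awk '
    $0 == "" { next }                                   # ignore empty lines
    /[_ \t]/ { print "forbidden character: " $0 }       # "_" or whitespace
    { names[++n] = $0 }
    END {
        # Compare every pair of names and flag prefix relationships.
        for (i = 1; i <= n; i++)
            for (j = 1; j <= n; j++)
                if (i != j && index(names[j], names[i]) == 1)
                    print "prefix clash: " names[i] " is a prefix of " names[j]
    }'
}
```

For instance, `printf '%s\n' zea252 zea2 | check_seq_names` reports that `zea2` is a prefix of `zea252`. The pairwise comparison is quadratic, which should be acceptable for the number of names typically listed in a pipeline file.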
### Known graph inconsistencies
As I dug deeper into pangenome graphs, I noticed some odd behaviors in graphs produced by minigraph-cactus, such as the following:
- A chain of nodes used by a single genome can exist in the graph. It amounts to a single node fragmented into many small ones, which have no reason to be there (no variation, no alternative paths, no filtering or clipping applied to the graph).
- Some edges that exist according to the paths in the graph are missing from the edge list when another edge with the opposite direction exists. However, this is not 100% consistent: in some cases both edges are listed.
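
One way to look for the second behavior is to compare each consecutive pair of steps in the W-lines with the L-lines. The sketch below is a rough illustration, not a validated tool: `gfa_missing_edges` is a hypothetical helper, and it assumes a tab-separated GFA whose walks use the `>node<node` syntax. Since a link `a+ b+` can also be read as `b- a-`, each step pair with no literal L-line is tagged `reverse-complement-only` when only the flipped form is listed, and `absent` otherwise.

```shell
# Hypothetical helper: report walk steps that have no matching L-line.
# The file is read twice: first to collect links, then to scan walks.
gfa_missing_edges() {
    awk -F'\t' '
    function flip(o) { return o == "+" ? "-" : "+" }
    NR == FNR { if ($1 == "L") links[$2 $3, $4 $5] = 1; next }
    $1 == "W" {
        walk = $7
        gsub(/[<>]/, " &", walk)            # ">1<2" becomes " >1 <2"
        n = split(walk, step, " ")
        for (i = 1; i < n; i++) {
            a = substr(step[i], 2)
            ao = (substr(step[i], 1, 1) == ">") ? "+" : "-"
            b = substr(step[i + 1], 2)
            bo = (substr(step[i + 1], 1, 1) == ">") ? "+" : "-"
            if ((a ao, b bo) in links) continue
            status = ((b flip(bo), a flip(ao)) in links) ? "reverse-complement-only" : "absent"
            print $2, a ao, b bo, status    # sample, from, to, status
        }
    }' "$1" "$1"
}
```

Calling `gfa_missing_edges graph.gfa` (with `graph.gfa` standing in for your own file) prints one line per step pair that lacks a literal L-line.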
### Choosing a reference

The reference will satisfy the following properties:
@@ -50,6 +54,7 @@ The reference will satisfy the following properties:
+ Be a "reference-sense" path in vg/gbz and will therefore be indexable for fast coordinate lookup
+ Be the basis for the output VCF and therefore won't appear as a sample in the VCF
+ Be used to divide the graph into chromosomes
+ You may select multiple genomes as references
One can define multiple references, but this won't help with clipping (though it might with filtering?), cyclicity, or forward node orientation.

> [!WARNING] Warning
@@ -59,6 +64,11 @@ One can define multiple references, but it won't help for clipping (but for filt
> + Cut down sequences to match the threshold
> + Try PGGB
### Build graph from multifasta files
When you build from multiple individuals (files), each holding many entries (fasta records), you will find references to the files in the W-lines:
+ the name given to the sample (cactus pipeline file) will be **in the first field**.
+ the name of each fasta header will be **in the third field**.
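
As a quick illustration, the two fields can be pulled out with awk. This sketch assumes that "first" and "third" are counted after the leading `W` record type, which matches the GFA 1.1 W-line layout (`W`, sample, haplotype index, sequence id, start, end, walk). `gfa_walk_names` is a hypothetical helper name.

```shell
# Hypothetical helper: list each distinct (sample, fasta header) pair
# found in the W-lines of a tab-separated GFA file.
gfa_walk_names() {
    # Column 2 is the sample name, column 4 the sequence (fasta header) name.
    awk -F'\t' '$1 == "W" { print $2 "\t" $4 }' "$1" | LC_ALL=C sort -u
}
```

Running `gfa_walk_names graph.gfa` (with your own file in place of `graph.gfa`) gives one tab-separated line per sample and fasta entry, which makes it easy to confirm that every input record ended up in the graph.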

### Control input sequence order

To build the graph with sequences in an order that you control, set `minigraphSortInput="none"` in the cactus config file. This disables the default sorting by mash distance.
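
A possible way to apply this setting without editing the file by hand is sketched below. It assumes that your copy of the cactus config already contains an attribute written as `minigraphSortInput="..."`. `disable_minigraph_sort` is a hypothetical helper name, not a cactus command.

```shell
# Hypothetical helper: print a copy of a cactus config file with input
# sorting disabled. Assumes the file already holds an attribute of the
# form minigraphSortInput="..."; files without it are printed unchanged.
disable_minigraph_sort() {
    sed 's/minigraphSortInput="[^"]*"/minigraphSortInput="none"/' "$1"
}
```

For example, `disable_minigraph_sort config.xml > config_nosort.xml` (both file names are placeholders) writes an edited copy and leaves the original untouched, so the edited copy can then be given to cactus as its config file.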
4 changes: 4 additions & 0 deletions content/Building a graph/pggb.md
@@ -8,6 +8,10 @@ title: "PGGB : PanGenome Graph Builder"
> [!NOTE] Note
> Fasta files must be merged into a single file, for instance using `cat *.fasta > out.fa`. The merged file then needs to be indexed with `samtools faidx out.fa`. Just specify the main `out.fa` file when using PGGB; it will find the index file by itself.
### Build graph from multifasta files
When you build from multiple individuals (files), each holding many entries (fasta records), you will find references to the files in the P-lines:
+ the name given to the sample (cactus pipeline file) will be **in the first field**.
+ the name of each fasta header will be **in the third field**.
### PGGB output

The pipeline outputs six different graphs, which correspond to different steps of the pipeline.
5 changes: 4 additions & 1 deletion content/Useful commands/sequences.md
@@ -4,10 +4,13 @@ title: Interact with sequences
Get statistics on sequences:
+ awk command to get the size of all reads in a file: `awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' unique.fasta | paste - -`
+ samtools command to index a file (useful for pggb): `samtools faidx myfile.fasta`
+ Replace string in file: `"s/thing_to_replace/thing_replacing/g" file > out`
+ Replace string in file: `sed "s/thing_to_replace/thing_replacing/g" file > out` (`sed -i` edits the file in place, so it should not be combined with `> out`) and replace spaces: `sed 's/[[:space:]]/_/g' file.fa > out.fa`
Split a multifasta file:
+ `awk -F '>' '/^>/ {if (F) close(F); F=sprintf("%s.fasta", $2); print > F; next} {print > F}' < file.fasta`

Get fast stats on a GFA file:
+ Count the lines of each type: `<graph.gfa sed 's/^\(.\).*/\1/' | sort | uniq -c`
+ Get all path names from the P-lines of a file: `grep '^P' input_file.gfa | awk '{print $2}'`

Schedule jobs on a SLURM cluster:
+ See [here](https://stackoverflow.com/questions/60583279/how-to-make-sbatch-job-run-after-a-previous-one-has-completed) to chain jobs
1 change: 1 addition & 0 deletions content/Working with graphs/catalog.md
@@ -8,6 +8,7 @@ This section will try to cover as much tools as it can, pointing to existing cat

Known catalogs or blogs:
+ [Catalog](https://pangenome.github.io/) from the PGGB team
+ [awesome-pangenomes](https://github.com/colindaven/awesome-pangenomes) by Colin Davenport

Tools:
+ [[bubblegun]], a bubble and superbubble caller