Quartz sync: Feb 22, 2024, 4:34 PM

dubssieg · Feb 22, 2024 · 946242e
1 parent b3f04b4

Showing 4 changed files with 19 additions and 1 deletion.
diff --git a/content/Building a graph/minigraph-cactus.md b/content/Building a graph/minigraph-cactus.md
@@ -41,6 +41,10 @@ Lines with a `#` are interpreted as notes, and will be skipped. Empty lines will
 > [!WARNING] Warning
 > Never use `_` or spaces in sequence names, and make sure no sequence name is a prefix of another (e.g. with sequences named `zea252` and `zea2`, or `seq1` and `seq10`, the pipeline will crash; `seq01` and `seq00` are fine).
 
+### Known graph inconsistencies
+As I dug deeper into pangenome graphs, I noticed some odd behaviors in minigraph-cactus output, such as the following:
+- A chain of nodes used by a single genome can exist in the graph. It consists of a single node fragmented into many small ones that have no reason to be there (no variation, no alternative paths, no filtering or clipping applied to the graph).
+- Some edges that exist according to the paths in the graph are not listed in the edge list if another edge with the opposite direction exists; however, this is not 100% consistent: in some cases both edges are listed.
 ### Choosing a reference
 
 The reference will satisfy the following properties:
@@ -50,6 +54,7 @@ The reference will satisfy the following properties:
 + Be a "reference-sense" path in vg/gbz and will therefore be indexable for fast coordinate lookup
 + Be the basis for the output VCF and therefore won't appear as a sample in the VCF
 + Be used to divide the graph into chromosomes
++ You may select multiple genomes as references
 One can define multiple references, but it won't help for clipping (but for filter?), cyclicity, nor nodes in forward orientation purposes.
 
 > [!WARNING] Warning
@@ -59,6 +64,11 @@ One can define multiple references, but it won't help for clipping (but for filt
 > + Cut down sequences to match the threshold
 > + Try PGGB
 
+### Build graph from multifasta files
+When building from multiple individuals (files), each with many entries (FASTA records), the sample and record names can be found in the W-lines:
++ the name given to the sample (in the cactus pipeline file) will be **in the first field**.
++ the name of each FASTA header will be **in the third field**.
+
 ### Control input sequence order
 
 To build a graph with sequences in an order you control, set the argument `minigraphSortInput="none"`, which disables the default sorting by mash distance. It must be specified in the cactus config file.

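As a hedged illustration of the W-line layout described under "Build graph from multifasta files", the sketch below lists each sample name together with the FASTA header names attached to it. It assumes a tab-separated GFA 1.1 file whose W-lines read `W`, sample, haplotype index, sequence name, and so on, so that, not counting the `W` tag itself, the sample is the first field and the sequence name the third. The function name `gfa_walk_names` and the file name `graph.gfa` are hypothetical.

```shell
# Sketch: print unique "sample<TAB>sequence" pairs found in the W-lines of a GFA file.
# ASSUMPTION: tab-separated W-lines laid out as: W, sample, haplotype, sequence, ...
gfa_walk_names() {
    awk -F '\t' '$1 == "W" { print $2 "\t" $4 }' "$1" | sort -u
}

# Usage (hypothetical file name):
# gfa_walk_names graph.gfa
```

Each output line pairs one input file (sample) with one of its FASTA entries, which makes it easy to check that every entry of every file ended up in the graph.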
diff --git a/content/Building a graph/pggb.md b/content/Building a graph/pggb.md
@@ -8,6 +8,10 @@ title: "PGGB : PanGenome Graph Builder"
 > [!NOTE] Note
 > FASTA files must be merged into a single file, for instance using `cat *.fasta > out.fa`. The merged file then needs to be indexed with `samtools faidx out.fa`. Only specify the main `out.fa` file when running PGGB; it will find the index by itself.
 
+### Build graph from multifasta files
+When building from multiple individuals (files), each with many entries (FASTA records), the sample and record names can be found in the P-lines:
++ the name given to the sample (cactus pipeline file) will be **in the first field**.
++ the name of each FASTA header will be **in the third field**.
 ### PGGB output
 
 The pipeline outputs six different graphs, which correspond to different steps of the pipeline.
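
The note on P-lines above does not say how the fields inside a path name are delimited. As a sketch only, and under the loud assumption that path names follow a `sample#haplotype#sequence` pattern (PanSN-style naming), the first and third fields could be pulled out as follows. The function name `gfa_path_fields` and the file name `graph.gfa` are hypothetical.

```shell
# Sketch: split P-line path names on '#' and print "first<TAB>third" fields.
# ASSUMPTION: path names look like sample#haplotype#sequence; this is not
# stated in the notes, so adapt the delimiter to your own naming scheme.
gfa_path_fields() {
    awk -F '\t' '$1 == "P" {
        n = split($2, part, "#")
        if (n >= 3) print part[1] "\t" part[3]   # skip names with fewer than 3 fields
    }' "$1"
}

# Usage (hypothetical file name):
# gfa_path_fields graph.gfa
```

Path names that do not contain at least three `#`-separated fields are silently skipped, so an empty output suggests the assumption does not hold for your graph.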
diff --git a/content/Useful commands/sequences.md b/content/Useful commands/sequences.md
@@ -4,10 +4,13 @@ title: Interact with sequences
 Get statistics on sequences:
 + awk command to get the length of every sequence (read) in a file: `awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' unique.fasta |paste - -`
 + samtools command to index a file (useful for pggb): `samtools faidx myfile.fasta`
-+ Replace string in file: `"s/thing_to_replace/thing_replacing/g" file > out`
++ Replace string in file: `sed "s/thing_to_replace/thing_replacing/g" file > out` (without `-i`, which would edit `file` in place and leave `out` empty); replace whitespace with underscores: `sed 's/[[:space:]]/_/g' file.fa > out.fa`
+Split a multifasta file:
++ `awk -F '>' '/^>/ {if (F) close(F); F=sprintf("%s.fasta", $2); print > F; next} F {print > F}' file.fasta`
 
 Get fast stats on a GFA file:
 + Count the lines of each type: `<graph.gfa sed 's/^\(.\).*/\1/' | sort | uniq -c`
++ Get the names of all paths (P-lines) in a file: `grep '^P' input_file.gfa | awk '{print $2}'`
 
 Schedule jobs on a SLURM cluster:
 + See [here](https://stackoverflow.com/questions/60583279/how-to-make-sbatch-job-run-after-a-previous-one-has-completed) to chain jobs
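
For the "Split a multifasta file" command above, a slightly more defensive variant can be sketched as a shell function. It is an illustration rather than the method used in these notes: it names each output file after the first whitespace-delimited word of the header (an assumption, to avoid spaces in file names) and writes into a separate output directory. The function name `split_multifasta` and its two arguments are hypothetical.

```shell
# Sketch: write each record of a multifasta file to <outdir>/<first header word>.fasta
# ASSUMPTION: the first word of every header is unique and usable as a file name.
split_multifasta() {
    infile=$1
    outdir=$2
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        /^>/ {
            if (out != "") close(out)      # finish the previous record
            name = substr($1, 2)           # header up to the first whitespace, without ">"
            out = dir "/" name ".fasta"
            print > out                    # keep the full header line in the output
            next
        }
        out != "" { print > out }          # sequence lines go to the current record
    ' "$infile"
}

# Usage (hypothetical file names):
# split_multifasta file.fasta split_out
```

Closing the previous file only when a new header starts keeps the number of open files low without truncating a record that spans several lines.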
diff --git a/content/Working with graphs/catalog.md b/content/Working with graphs/catalog.md
@@ -8,6 +8,7 @@ This section will try to cover as much tools as it can, pointing to existing cat
 
 Known catalogs or blogs:
 + [Catalog](https://pangenome.github.io/) from the PGGB team
++ [awesome-pangenomes](https://github.com/colindaven/awesome-pangenomes) by Colin Davenport
 
 Tools:
 + [[bubblegun]], a bubble and superbubble caller