From 946242e765bdd2f1f5271c93af8f96ed591ad710 Mon Sep 17 00:00:00 2001
From: Tharos
Date: Thu, 22 Feb 2024 16:34:03 +0100
Subject: [PATCH] Quartz sync: Feb 22, 2024, 4:34 PM

---
 content/Building a graph/minigraph-cactus.md | 10 ++++++++++
 content/Building a graph/pggb.md             |  4 ++++
 content/Useful commands/sequences.md         |  5 ++++-
 content/Working with graphs/catalog.md       |  1 +
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/content/Building a graph/minigraph-cactus.md b/content/Building a graph/minigraph-cactus.md
index 6d39a148e89a5..96394f14dea65 100644
--- a/content/Building a graph/minigraph-cactus.md
+++ b/content/Building a graph/minigraph-cactus.md
@@ -41,6 +41,10 @@ Lines with a `#` are interpreted as notes, and will be skipped. Empty lines will
 
 > [!WARNING] Warning
 > Strictly don't use `_` in sequence names, nor spaces, nor prefixes (ex: if you have a sequence named `zea252` and a sequence named `zea2`, pipeline will crash. `seq1` and `seq10` also, but not `seq01` and `seq00`)
+### Known graph inconsistencies
+As I dug deeper into pangenome graphs, I noticed some odd behaviors in minigraph-cactus output, such as the following:
+- A chain of nodes used by a single genome can exist in the graph. It is effectively a single node fragmented into many small ones that have no reason to be there (no variation, no alternative paths, no filtering or clipping applied to the graph).
+- Some edges that exist according to the paths in the graph are missing from the edge list when another edge with the opposite direction exists; however, this is not fully consistent: in some cases both edges are listed.
 
 ### Choosing a reference
 The reference will satisfy the following properties:
@@ -50,6 +54,7 @@ The reference will satisfy the following properties:
 + Be a "reference-sense" path in vg/gbz and will therefore be indexably for fast coordinate lookup
 + Be the basis for the output VCF and therefore won't appear as a sample in the VCF
 + Be used to divide the graph into chromosomes
++ You may select multiple genomes as references
 One can define multiple references, but it won't help for clipping (but for filter?), cyclicity, nor nodes in forward orientation purposes.
 
 > [!WARNING] Warning
@@ -59,6 +64,11 @@ One can define multiple references, but it won't help for clipping (but for filt
 > + Cut down sequences to match the threshold
 > + Try PGGB
 
+### Build graph from multifasta files
+When you build from multiple individuals (files) that each contain many entries (fasta records), you will find references to the files in the W-lines:
++ the name given to the sample (cactus pipeline file) will be **in the first field**.
++ the name of each fasta header will be **in the third field**.
+
 ### Control input sequence order
 To create graph with sequence in a specific order that you can control, using the argument `minigraphSortInput="none"` disables default sorting by mash distance. It is to be specified in the cactus config file.
 
diff --git a/content/Building a graph/pggb.md b/content/Building a graph/pggb.md
index 5fd312f899652..8799a84ee180f 100644
--- a/content/Building a graph/pggb.md
+++ b/content/Building a graph/pggb.md
@@ -8,6 +8,10 @@ title: "PGGB : PanGenome Graph Builder"
 
 > [!NOTE] Note
 > Fasta must be mereged in a single file, for instance using `cat *.fasta > out.fa`. Then, it needs to be indexed with `samtools faidx out.fa`. Just specify the main `out.fa` file when using PGGB, it will find the indexed file by itself.
+### Build graph from multifasta files
+When you build from multiple individuals (files) that each contain many entries (fasta records), you will find references to the files in the P-lines:
++ the name given to the sample (cactus pipeline file) will be **in the first field**.
++ the name of each fasta header will be **in the third field**.
 
 ### PGGB output
 The pipeline outputs six different graphs, that corresponds to different steps of the pipeline.
\ No newline at end of file
diff --git a/content/Useful commands/sequences.md b/content/Useful commands/sequences.md
index abfb2887f1ed4..fa49a6291d197 100644
--- a/content/Useful commands/sequences.md
+++ b/content/Useful commands/sequences.md
@@ -4,10 +4,13 @@ title: Interact with sequences
 Get statistics on sequences:
 + awk command to get the size of all lectures in a file : `awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' unique.fasta |paste - -`
 + samtools command to index a file (useful for pggb): `samtools faidx myfile.fasta`
-+ Replace string in file: `"s/thing_to_replace/thing_replacing/g" file > out`
++ Replace string in file: `sed "s/thing_to_replace/thing_replacing/g" file > out`, and replace spaces: `sed 's/[[:space:]]/_/g' file.fa > out.fa`
+Split a multifasta file:
++ `awk -F '>' '/^>/ {if (F) close(F); F=sprintf("%s.fasta", $2)} {print > F}' < file.fasta`
 Get fast stats on a GFA file:
 + Print all number of lines types: `