forked from jackyzha0/quartz
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 24 changed files with 227 additions and 67 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,7 @@ | ||
--- | ||
title: "GfaGraphs: Python abstraction layer for GFA graph format" | ||
--- | ||
![[library_flowchart.png]] | ||
Known limitations: | ||
+ Memory usage does not yet scale well to huge graphs (such as the full HPRC graphs): 256 GB of RAM is not enough to hold the PGGB and MGC graphs in memory at the same time | ||
+ Loading huge graphs takes a long time (many hours for HPRC as well) |
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,26 @@ | ||
--- | ||
title: odgi | ||
--- | ||
The full list of odgi [commands](https://odgi.readthedocs.io/en/latest/rst/commands/odgi.html) is available in the odgi documentation. | ||
# General commands | ||
## Convert graph formats | ||
|
||
To convert a graph from the odgi (`.og`) format to another format, such as GFA, use `odgi view`. | ||
|
||
```bash | ||
# Convert to GFA | ||
odgi view -g -i $INPUT > $OUTPUT | ||
# INPUT is a .og file | ||
# OUTPUT is a new .gfa file | ||
# -g stands for "convert to gfa" | ||
``` | ||
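The same conversion can be driven from Python. This is a minimal sketch, assuming the `odgi` executable is on the `PATH`; the helper names `odgi_view_command` and `og_to_gfa` are made up for illustration, and only the `view -g -i` invocation comes from the command above.

```python
import subprocess
from pathlib import Path
from typing import List


def odgi_view_command(input_og: str) -> List[str]:
    """Build the `odgi view` call: -g asks for GFA output, -i names the .og input."""
    return ["odgi", "view", "-g", "-i", str(input_og)]


def og_to_gfa(input_og: str, output_gfa: str) -> None:
    """Run `odgi view` and redirect its standard output to `output_gfa`."""
    with Path(output_gfa).open("w") as handle:
        # check=True raises CalledProcessError if odgi exits with a non-zero status.
        subprocess.run(odgi_view_command(input_og), stdout=handle, check=True)
```

Passing the command as a list avoids shell quoting problems with unusual file names.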
# Python bindings | ||
|
||
> [!WARNING] Warning | ||
> An older implementation of the bindings exists, and it is the one referenced on readthedocs.io. However, it is not [the one that should be used](https://github.com/pangenome/odgi/blob/master/test/python/odgi_ffi.md), for [performance reasons](https://github.com/pangenome/odgi/blob/master/test/python/odgi_performance.md) as well as stability issues. | ||
According to the documentation, `odgi_ffi` is meant as a building block for writing a Python library rather than as the Python library itself. | ||
|
||
> Note that odgi also has an older high-level Python API `import odgi` that is somewhat obsolete. Instead you should probably use below `import odgi_ffi` lower level API to construct your own library. | ||
To avoid segfaults, set `LD_PRELOAD=libjemalloc.so.2` before running Python scripts. However, I could not get the bindings to work: at the time of writing I can import them, but loading a graph fails with `RuntimeError: Error rewinding to load non-magic-prefixed SerializableHandleGraph`. Since the maintainers themselves suggest that the [library is not fully stable](https://github.com/pangenome/odgi/issues/425#issuecomment-1305566300), I won't rely on it going forward. |
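`LD_PRELOAD` is read by the dynamic linker when a process starts, so changing `os.environ` inside an already running interpreter has no effect on that interpreter. One way to apply the setting from Python is to launch the script in a child process. This is a rough sketch; `jemalloc_env`, `run_with_jemalloc` and the script name are hypothetical, and it assumes `libjemalloc.so.2` is installed where the linker can find it.

```python
import os
import subprocess
import sys
from typing import Dict


def jemalloc_env(library: str = "libjemalloc.so.2") -> Dict[str, str]:
    """Return a copy of the current environment with LD_PRELOAD set."""
    env = dict(os.environ)  # copy, so the parent environment is left untouched
    env["LD_PRELOAD"] = library
    return env


def run_with_jemalloc(script: str) -> int:
    """Run a Python script in a child interpreter that preloads jemalloc."""
    result = subprocess.run([sys.executable, script], env=jemalloc_env())
    return result.returncode


# Hypothetical usage: run_with_jemalloc("load_graph.py")
```

This only sets up the environment; it does not address the `RuntimeError` described above.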
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
--- | ||
title: VG toolkit | ||
--- | ||
> [!WARNING] Warning | ||
> vg commands **do not work** on compressed graphs. They raise an 'invalid graph type' error. | ||
## Convert from GFA1.1 to GFA1 | ||
|
||
`vg convert in.gfa -W -f > out.gfa` | ||
+ `-W` suppresses W-lines (paths are written as P-lines instead) | ||
+ `-f` selects GFA as the output format | ||
## Convert from vg, json to GFA | ||
`vg view [-J|-V|-F] input_graph -g > out.gfa` | ||
|
||
## Call bubbles on graph to get variants | ||
`vg deconstruct -p ref graph.gfa > variants.vcf` | ||
+ `-p [STR]` stands for the path to use as reference to call variants |
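The warning about compressed graphs can be turned into a guard before calling `vg`. This is a sketch under the assumption that "compressed" means gzip, which is detected here by its two magic bytes; `is_gzipped`, `vg_deconstruct_command` and `call_variants` are illustrative names, and only the `vg deconstruct -p` invocation is taken from the notes above.

```python
import subprocess
from typing import List

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def is_gzipped(path: str) -> bool:
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def vg_deconstruct_command(graph: str, reference_path: str) -> List[str]:
    """Build `vg deconstruct -p <reference path> <graph>`."""
    return ["vg", "deconstruct", "-p", reference_path, graph]


def call_variants(graph: str, reference_path: str, output_vcf: str) -> None:
    """Refuse gzip-compressed input, then write the VCF produced by vg."""
    if is_gzipped(graph):
        raise ValueError(f"{graph} looks gzip-compressed; decompress it first")
    with open(output_vcf, "w") as handle:
        subprocess.run(
            vg_deconstruct_command(graph, reference_path),
            stdout=handle,
            check=True,  # raise if vg exits with an error
        )
```

Failing early with a clear message is easier to debug than the 'invalid graph type' error.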
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
--- | ||
title: Feedback on pangenome graph construction | ||
--- | ||
Our object: a 'variation graph' (not a de Bruijn graph) whose nodes carry labels. | ||
|
||
To construct: | ||
+ pairwise alignment | ||
+ with software designed for full-genome alignment | ||
+ MSA (multiple sequence alignment) | ||
+ graph construction | ||
+ create nodes and edges | ||
+ save paths | ||
+ post-process (optional) | ||
+ pruning | ||
+ `gfaffix` at some point | ||
+ topological simplification | ||
+ compression | ||
+ ... | ||
|
||
Tools in use today: | ||
+ Variation Graph (VG) | ||
+ Minigraph (MG) | ||
+ From an alignment `minigraph --ggen -L <min_size_of_variants> -c <genomes>` | ||
+ The graph is built relative to a reference: a sequence that cannot be aligned to it is not added to the graph | ||
+ Lowering the `-L` parameter makes minigraph much slower and can cause issues | ||
+ Higher `-L` values can help align more divergent sequences | ||
+ Minigraph-Cactus (MGC) | ||
+ It is possible to give a guide tree | ||
+ High level SV graph from MG | ||
+ This graph is used as backbone | ||
+ Put something as 'reference': this sequence won't be clipped nor cycled | ||
+ PanGenome Graph Builder (PGGB) | ||
+ Curate the data beforehand to separate chromosomes (tutorials available, where?) | ||
+ Huge possibilities: how to cluster chromosomes that are close together? | ||
+ Use of `wfmash` for pairwise all-vs-all alignment | ||
+ For graph induction: `seqwish` | ||
+ Smoothing with `smoothxg` | ||
+ May add paths that are not even describing a genome? | ||
+ Notion of consensus path elaborated [here](https://github.com/pangenome/smoothxg/issues/37) | ||
+ Keeps a consensus and destroys some paths that do not follow it | ||
+ From the author: | ||
+ Many things should be removed | ||
+ As of now, they don't even use it internally | ||
+ Output of seqwish: should be default output but very large file | ||
+ Problem: multi-threaded algorithms such as stochastic gradient descent mean that 'seeds' are not fixed, so the same data can yield different graphs | ||
![[Pasted image 20240115144532.png]] | ||
+ Post process with `gfaffix` and `odgi` | ||
|
||
Cycles are a problem for future usage of graphs. Implement a tool to 'linearize' a graph? |
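As a first step before any 'linearize' tool, it helps to know whether a graph contains a cycle at all. The sketch below is one possible check and not an existing tool. It assumes GFA1 `L` lines with valid `+`/`-` orientations, treats each (segment, orientation) pair as a vertex, adds the reverse-strand reading of every link, and looks for a directed cycle with an iterative depth-first search. The names `oriented_edges` and `has_cycle` are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Side = Tuple[str, str]  # (segment name, orientation '+' or '-')
FLIP = {"+": "-", "-": "+"}


def oriented_edges(gfa_lines: Iterable[str]) -> Dict[Side, List[Side]]:
    """Collect links between oriented segments from GFA1 L-lines."""
    graph: Dict[Side, List[Side]] = defaultdict(list)
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "L" or len(fields) < 5:
            continue  # skip headers, segments, paths and malformed lines
        src, src_o, dst, dst_o = fields[1:5]
        graph[(src, src_o)].append((dst, dst_o))
        # The same link, read on the opposite strand.
        graph[(dst, FLIP[dst_o])].append((src, FLIP[src_o]))
    return graph


def has_cycle(graph: Dict[Side, List[Side]]) -> bool:
    """Iterative DFS with three colours; a grey successor means a back edge."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour: Dict[Side, int] = defaultdict(int)
    for start in list(graph):
        if colour[start] != WHITE:
            continue
        colour[start] = GREY
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            advanced = False
            for nxt in successors:
                if colour[nxt] == GREY:
                    return True
                if colour[nxt] == WHITE:
                    colour[nxt] = GREY
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    advanced = True
                    break
            if not advanced:
                colour[node] = BLACK
                stack.pop()
    return False
```

The search is iterative so that long chains of segments do not hit Python's recursion limit. For a file on disk, `has_cycle(oriented_edges(open(path)))` would be the entry point, where `path` is whatever GFA file is being inspected.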
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
URL: https://www.nature.com/articles/s41586-023-05896-x | ||
|
||
Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
URL: https://www.biorxiv.org/content/10.1101/2023.04.05.535718v1 | ||
|
||
Pangenome graphs can represent all variation between multiple genomes, but existing methods for constructing them are biased due to reference-guided approaches. In response, we have developed PanGenome Graph Builder (PGGB), a reference-free pipeline for constructing unbiased pangenome graphs. PGGB uses all-to-all whole-genome alignments and learned graph embeddings to build and iteratively refine a model in which we can identify variation, measure conservation, detect recombination events, and infer phylogenetic relationships. |
3 changes: 3 additions & 0 deletions
...ent/_publications/Cactus - Algorithms for genome multiple sequence alignment.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
DOI: https://doi.org/10.1101%2Fgr.123356.111 | ||
|
||
Much attention has been given to the problem of creating reliable multiple sequence alignments in a model incorporating substitutions, insertions, and deletions. Far less attention has been paid to the problem of optimizing alignments in the presence of more general rearrangement and copy number variation. Using Cactus graphs, recently introduced for representing sequence alignments, we describe two complementary algorithms for creating genomic alignments. We have implemented these algorithms in the new “Cactus” alignment program. We test Cactus using the Evolver genome evolution simulator, a comprehensive new tool for simulation, and show using these and existing simulations that Cactus significantly outperforms all of its peers. Finally, we make an empirical assessment of Cactus's ability to properly align genes and find interesting cases of intra-gene duplication within the primates. |
8 changes: 8 additions & 0 deletions
content/_publications/Cactus Graphs for Genome Comparisons.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
DOI: 10.1089/cmb.2010.0252 | ||
|
||
We introduce a data structure, analysis, and visualization scheme called a cactus graph for | ||
comparing sets of related genomes. In common with multi-break point graphs and A-Bruijn | ||
graphs, cactus graphs can represent duplications and general genomic rearrangements, but | ||
additionally, they naturally decompose the common substructures in a set of related genomes | ||
into a hierarchy of chains that can be visualized as two-dimensional multiple alignments and | ||
nets that can be visualized in circular genome plots. |
3 changes: 3 additions & 0 deletions
content/_publications/Construction and representation of human pangenome graphs.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
URL: https://pasteur.hal.science/pasteur-04126278/ | ||
|
||
As a single reference genome cannot possibly represent all the variation present across human individuals, pangenome graphs have been introduced to incorporate population diversity within a wide range of genomic analyses. Several data structures have been proposed for representing collections of genomes as pangenomes, in particular graphs. In this work we collect all publicly available high-quality human haplotypes and constructed the largest human pangenome graphs to date, incorporating 52 individuals in addition to two synthetic references (CHM13 and GRCh38). We build variation graphs and de Bruijn graphs of this collection using five of the state-of-the-art tools: Bifrost, mdbg, Minigraph, Minigraph-Cactus and pggb. We examine differences in the way each of these tools represents variations between input sequences, both in terms of overall graph structure and representation of specific genetic loci. This work sheds light on key differences between pangenome graph representations, informing end-users on how to select the most appropriate graph type for their application. |
12 changes: 12 additions & 0 deletions
content/_publications/Distance indexing and seed clustering in sequence graphs.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7355256/pdf/btaa446.pdf | ||
|
||
Motivation: Graph representations of genomes are capable of expressing more genetic variation and can therefore better represent a population than standard linear genomes. However, due to the greater complexity of genome graphs relative to linear genomes, some functions that are trivial on linear genomes become much more difficult in genome graphs. Calculating distance is one such function that is simple in a linear genome but complicated in a graph context. In read mapping algorithms such distance calculations are fundamental to determining if seed alignments could belong to the same mapping. | ||
Results: We have developed an algorithm for quickly calculating the minimum distance between positions on a sequence graph using a minimum distance index. We have also developed an algorithm that uses the distance index to cluster seeds on a graph. We demonstrate that our implementations of these algorithms are efficient and practical to use for a new generation of mapping algorithms based upon genome graphs. |
4 changes: 4 additions & 0 deletions
content/_publications/GBZ file format for pangenome graphs.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
URL: https://pubmed.ncbi.nlm.nih.gov/36179091/ | ||
|
||
**Motivation:** Pangenome graphs representing aligned genome assemblies are being shared in the text-based Graphical Fragment Assembly format. As the number of assemblies grows, there is a need for a file format that can store the highly repetitive data space efficiently. | ||
**Results:** We propose the GBZ file format based on data structures used in the Giraffe short-read aligner. The format provides good compression, and the files can be efficiently loaded into in-memory data structures. We provide compression and decompression tools and libraries for using GBZ graphs, and we show that they can be efficiently used on a variety of systems. |
3 changes: 3 additions & 0 deletions
...ions/Gap-Sensitive Colinear Chaining Algorithms for Acyclic Pangenome Graphs.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
URL: https://www.liebertpub.com/doi/10.1089/cmb.2023.0186 | ||
|
||
A pangenome graph can serve as a better reference for genomic studies because it allows a compact representation of multiple genomes within a species. Aligning sequences to a graph is critical for pangenome-based resequencing. The seed-chain-extend heuristic works by finding short exact matches between a sequence and a graph. In this heuristic, colinear chaining helps identify a good cluster of exact matches that can be combined to form an alignment. Colinear chaining algorithms have been extensively studied for aligning two sequences with various gap costs, including linear, concave, and convex cost functions. However, extending these algorithms for sequence-to-graph alignment presents significant challenges. Recently, Makinen et al. introduced a sparse dynamic programming framework that exploits the small path cover property of acyclic pangenome graphs, enabling efficient chaining. However, this framework does not consider gap costs, limiting its practical effectiveness. We address this limitation by developing novel problem formulations and provably good chaining algorithms that support a variety of gap cost functions. These functions are carefully designed to enable fast chaining algorithms whose time requirements are parameterized in terms of the size of the minimum path cover. Through an empirical evaluation, we demonstrate the superior performance of our algorithm compared with existing aligners. When mapping simulated long reads to a pangenome graph comprising 95 human haplotypes, we achieved 98.7% precision while leaving <2% of reads unmapped. |
3 changes: 3 additions & 0 deletions
...nt/_publications/Movi - a fast and cache-efficient full-text pangenome index.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
URL: https://www.biorxiv.org/content/10.1101/2023.11.04.565615v1.full | ||
|
||
Efficient pangenome indexes are promising tools for many applications, including rapid classification of nanopore sequencing reads. Recently, a compressed-index data structure called the “move structure” was proposed as an alternative to other BWT-based indexes like the FM index and r-index. The move structure uniquely achieves both O(r) space and O(1)-time queries, where r is the number of runs in the pangenome BWT. We implemented Movi, an efficient tool for building and querying move-structure pangenome indexes. While the size of the Movi’s index is larger than the r-index, it scales at a smaller rate for pangenome references, as its size is exactly proportional to r, the number of runs in the BWT of the reference. Movi can compute sophisticated matching queries needed for classification – such as pseudo-matching lengths – at least ten times faster than the fastest available methods. Movi achieves this speed by leveraging the move structure’s strong locality of reference, incurring close to the minimum possible number of cache misses for queries against large pangenomes. Movi’s fast constant-time query loop makes it well suited to real-time applications like adaptive sampling for nanopore sequencing, where decisions must be made in a small and predictable time interval. |
3 changes: 3 additions & 0 deletions
...ns/Pangenome graph construction from genome alignments with Minigraph-Cactus.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
URL: https://www.nature.com/articles/s41587-023-01793-w | ||
|
||
Pangenome references address biases of reference genomes by storing a representative set of diverse haplotypes and their alignment, usually as a graph. Alternate alleles determined by variant callers can be used to construct pangenome graphs, but advances in long-read sequencing are leading to widely available, high-quality phased assemblies. Constructing a pangenome graph directly from assemblies, as opposed to variant calls, leverages the graph’s ability to represent variation at different scales. Here we present the Minigraph-Cactus pangenome pipeline, which creates pangenomes directly from whole-genome alignments, and demonstrate its ability to scale to 90 human haplotypes from the Human Pangenome Reference Consortium. The method builds graphs containing all forms of genetic variation while allowing use of current mapping and genotyping tools. We measure the effect of the quality and completeness of reference genomes used for analysis within the pangenomes and show that using the CHM13 reference from the Telomere-to-Telomere Consortium improves the accuracy of our methods. We also demonstrate construction of a _Drosophila melanogaster_ pangenome. |