diff --git a/content/Building a graph/minigraph-cactus.md b/content/Building a graph/minigraph-cactus.md index 567a08750bd40..9e922b8f11277 100644 --- a/content/Building a graph/minigraph-cactus.md +++ b/content/Building a graph/minigraph-cactus.md @@ -52,6 +52,13 @@ The reference will satisfy the following properties: + Be used to divide the graph into chromosomes One can define multiple references, but it won't help for clipping (but for filter?), cyclicity, nor nodes in forward orientation purposes. +> [!WARNING] Warning +> minigraph-cactus is **NOT RECOMMENDED** (see [this discussion](https://github.com/orgs/ComparativeGenomicsToolkit/discussions/1252)) for genomes whose mash distance from the reference is higher than 0.02; it may [yield a warning](https://github.com/ComparativeGenomicsToolkit/cactus/blob/v2.7.0/src/cactus/refmap/cactus_minigraph.py#L288-L291) but may not do it properly. +> Possible solutions: +> + Align with an aligner like **Progressive Cactus** from a tree (`mashtree` can be useful) +> + Cut down sequences to match the threshold +> + Try PGGB + ### Control input sequence order To create graph with sequence in a specific order that you can control, using the argument `minigraphSortInput="none"` disables default sorting by mash distance. It is to be specified in the cactus config file. 
diff --git a/content/Building a graph/minigraph.md b/content/Building a graph/minigraph.md index a12a9ed7e75f9..ca018f6ff2a2e 100644 --- a/content/Building a graph/minigraph.md +++ b/content/Building a graph/minigraph.md @@ -14,8 +14,13 @@ You can specify any number of `.fasta`/`.fa` files, as well as `.gfa` graph file + `c` enables base-level alignment + `x` is to specify a preset, here `ggs`, which is a simple algorithm for incremental graph generation +> [!IMPORTANT] Publication and availability +> The publication is [available](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02168-z), and the source code is available [here](https://github.com/lh3/minigraph) + The output will be in **rGFA** format, a sub-type of GFA1 that adds information about positions in the graph but removes information of genomes' origins. In rGFA, you don't have W-lines or P-lines that do serves to get the information of which fragment goes to which genome. -It's a development choice +It's a [development choice](https://github.com/lh3/minigraph/issues/27) that was made [in the formalism of rGFA](https://github.com/lh3/minigraph/issues/26), because H. Li sees his tool as a way to [embed multiple genomes on a reference](https://github.com/nf-core/pangenome/issues/20), not as something reference-free. + +A pull request adding [P-lines support to minigraph](https://github.com/lh3/minigraph/pull/77) was opened in 2022 but was never accepted. However, one can still get this version by using the associated commit ID. > [!WARNING] Warning -> minigraph outputs nodes prefixed with `s` ; with some tools (such as odgi) it may cause crashes. \ No newline at end of file +> minigraph outputs nodes prefixed with `s`; with some tools (such as odgi), this may cause crashes. 
To convert these rGFA files to standard GFA files, [you can use gfautil](https://github.com/vgteam/vg/issues/3129) \ No newline at end of file diff --git a/content/Publications/Publications.canvas b/content/Publications/Publications.canvas new file mode 100644 index 0000000000000..aa1e4d39828c0 --- /dev/null +++ b/content/Publications/Publications.canvas @@ -0,0 +1,248 @@ +{ + "nodes":[ + {"id":"9ee4e8e1010c6b03","type":"text","text":"Allows the problem to be subdivided into independent subproblems","x":1400,"y":-140,"width":400,"height":70}, + {"id":"167906313c410a7c","type":"text","text":"From any node of the graph, this property makes it possible to define a recursive hierarchical tree structure.","x":1400,"y":-380,"width":400,"height":100}, + {"id":"58bad2b9c05a7f76","type":"text","text":"# Cactus Graphs for Genome Comparisons\nDOI: 10.1089/cmb.2010.0252\n\nWe introduce a data structure, analysis, and visualization scheme called a cactus graph for\ncomparing sets of related genomes. In common with multi-break point graphs and A-Bruijn\ngraphs, cactus graphs can represent duplications and general genomic rearrangements, but\nadditionally, they naturally decompose the common substructures in a set of related genomes\ninto a hierarchy of chains that can be visualized as two-dimensional multiple alignments and\nnets that can be visualized in circular genome plots.","x":-870,"y":-1380,"width":780,"height":300,"color":"4"}, + {"id":"919383e2a232b1fe","type":"text","text":"Breakpoint graphs, multi-breakpoint graphs, A-Bruijn graphs","x":680,"y":-1500,"width":555,"height":60}, + {"id":"a6826d2c452750ad","type":"text","text":"Often NP-hard for 3 or more genomes","x":680,"y":-1410,"width":555,"height":50}, + {"id":"884872dc5076dca6","type":"text","text":"Previously, the notion of conserved intervals within sets of signed permutations had been established, and it was shown that these intervals could be both nested and organized into sequences. 
This gave the structure a tree-like shape, efficient for both computation and storage.","x":80,"y":-1290,"width":440,"height":210}, + {"id":"9f6fc359db1439ff","type":"text","text":"Multi-sequence","x":1287,"y":-1335,"width":250,"height":60}, + {"id":"3fce6a85e9ce9de2","type":"text","text":"Arbitrary duplication (two or more elements of the genome can be considered homologous)","x":1275,"y":-1239,"width":405,"height":119}, + {"id":"498a6976afcf8582","type":"text","text":"The goal of this paper is to introduce a new data structure, the cactus graph, which generalizes this notion of conserved intervals and their hierarchies","x":680,"y":-1305,"width":515,"height":100,"color":"1"}, + {"id":"c73aa01ef3b7754c","type":"text","text":"A *block* is a two-dimensional alignment of a set of homologous segments that differ only by substitutions","x":80,"y":-1040,"width":435,"height":100}, + {"id":"4402546a20b4c968","type":"text","text":"An *adjacency* is a set of remaining unaligned pieces of DNA; these lie between the aligned segments and at the ends of chromosomes","x":80,"y":-920,"width":435,"height":120}, + {"id":"c5bed1188d635294","type":"text","text":"A *cap* is an artificial junction that models a region not present in the sequence (e.g. a telomere) and makes it possible to turn chromosome ends into proper adjacencies","x":80,"y":-780,"width":435,"height":120}, + {"id":"c2362093c3e8151b","type":"text","text":"A family of homologous *caps* is called an *end*","x":604,"y":-760,"width":349,"height":80}, + {"id":"66a9ce421a4c5ec9","type":"text","text":"A multiple sequence alignment can be represented as a matrix or as DAGs, but large-scale rearrangements undermine these approaches.","x":80,"y":-1440,"width":435,"height":120}, + {"id":"acbc449c2d39d522","type":"text","text":"A *thread* is a path of alternating *adjacencies* and segments connected by *caps* that is 
bounded by *adjacencies* connected to *caps*","x":604,"y":-1050,"width":460,"height":120}, + {"id":"1e1e30ad4bcc8914","type":"text","text":"A *net* is a graph in which every node is an *end* and each edge represents a set of adjacencies between the two *caps* it connects","x":1064,"y":-900,"width":405,"height":120}, + {"id":"53c74e047c34032d","type":"text","text":"Very clear paper on the definitions; needed to understand the next one","x":-900,"y":-1440,"width":370,"height":80,"color":"3"}, + {"id":"3aba27866e336e46","type":"text","text":"Dense and hard to interpret; the edges are geodesics that pass through the nodes","x":1592,"y":-610,"width":360,"height":110}, + {"id":"1041d25cf9b9250b","type":"text","text":"Can be decomposed into smaller components","x":1592,"y":-490,"width":360,"height":60}, + {"id":"c8d5f84f9751bd4c","type":"file","file":"_imgs/net_chains_threads.jpg","x":1500,"y":-1026,"width":400,"height":391}, + {"id":"742daf28a78c57f6","type":"text","text":"A hierarchical set of *chains* is defined as follows: for each simple cycle containing *origin*, the *blocks* are concatenated into a *child chain*. 
Each node encountered is a *link* in the chain; this makes it possible to build a *cactus tree*","x":600,"y":-1727,"width":520,"height":130}, + {"id":"d559ecfdf58843b1","type":"file","file":"_imgs/cactus_tree.jpg","x":600,"y":-2016,"width":400,"height":263}, + {"id":"713c8be3fb77242f","type":"file","file":"_imgs/cactus_graph.jpg","x":140,"y":-2106,"width":400,"height":222}, + {"id":"529fdb5950adb38d","type":"text","text":"The *cactus graph* is built as follows:\n+ Each node is a *subnet* of a *complete net*\n+ Each edge is a *block*\n+ Each *block* in the compared genomes appears as an edge\n+ Each *adjacency* between two segments, or between a segment and a *telomere cap*, is represented as a *subnet*\n+ It consists of a single connected component, made up of simple cycles\n+ All simple cycles have an orientation\n+ All *telomere ends* are in a single *subnet* represented by a node called *origin*.","x":-20,"y":-1849,"width":560,"height":375,"color":"1"}, + {"id":"2c7fb3c6110987c0","type":"text","text":"The homology relation is represented by ~ on S' x S', where S' is the set of positions and of reverse-complement positions over all sequences of S","x":-192,"y":229,"width":440,"height":122}, + {"id":"5378f136cde75253","type":"text","text":"The relation ~ partitions S into a set of columns","x":38,"y":643,"width":330,"height":80}, + {"id":"0d4c74e6c3256e9c","type":"text","text":"The relationship between these relations is characterized by assuming an evolutionary process","x":58,"y":463,"width":380,"height":60}, + {"id":"0ee0b762dc16bcf4","type":"text","text":"The relation ~~ denotes the alignment relation on S' x S', which has the same properties as ~.","x":288,"y":250,"width":430,"height":80}, + {"id":"a37fa3a946925b0a","type":"text","text":"Block: a column containing two or more positions","x":553,"y":783,"width":250,"height":60}, + {"id":"daacdab1cb57974c","type":"text","text":"With these three elements, we describe a 
restricted and simplified form of an adjacency graph","x":623,"y":543,"width":350,"height":100,"color":"1"}, + {"id":"e761aeaf9b4c190e","type":"text","text":"A sequence is a set of oriented pairs A|T, T|A, C|G and G|C. A DNA sequence is a word over the associated alphabet","x":798,"y":263,"width":420,"height":88}, + {"id":"b22ba889e7168219","type":"text","text":"Each block and its mirror are represented by a pair of \"block end nodes\" connected by a \"block edge\"","x":908,"y":743,"width":340,"height":140}, + {"id":"a3f8d96e7aceab99","type":"text","text":"### Adjacency graph","x":1288,"y":1163,"width":290,"height":60}, + {"id":"6938afa6bfc6e16d","type":"file","file":"_imgs/caf_algorithm.jpg","x":1531,"y":1257,"width":400,"height":146}, + {"id":"26b0fd371f64a696","type":"text","text":"### Cactus graph","x":1288,"y":1443,"width":290,"height":60}, + {"id":"2abe6bfd138a9db9","type":"file","file":"_imgs/bar_algorithm.jpg","x":1531,"y":1543,"width":400,"height":336}, + {"id":"48faed08f7460406","type":"text","text":"### Multi-level cactus graph","x":1233,"y":1903,"width":400,"height":60}, + {"id":"b33691b10ecd2b74","type":"text","text":"*Stubs* represent, in an oriented way, the information about connections at the ends of sequences. 
Defined through a non-injective function, leftstub(x) returns the *stub* associated with the start of x and rightstub(x) returns the *stub* associated with the end of x.","x":1248,"y":237,"width":430,"height":186}, + {"id":"572e1342e4d6b1c1","type":"text","text":"Adjacency edges: there is an adjacency between a pair of positions if no position in the subsequence is contained in a *block*, i.e. none of the bases are aligned","x":1318,"y":663,"width":360,"height":180}, + {"id":"9b12143cbb4118a9","type":"text","text":"Each stub is represented by a \"stub end node\" and a \"dead end node\"","x":1731,"y":723,"width":250,"height":120}, + {"id":"a25245bd984a162e","type":"text","text":"All \"dead end nodes\" are connected into a clique by \"backdoor adjacency edges\"","x":1856,"y":883,"width":295,"height":120}, + {"id":"e74aa0cb235b818b","type":"file","file":"_imgs/injection.png","x":1478,"y":0,"width":400,"height":226}, + {"id":"739fec564ad1be89","type":"text","text":"### Homology and phylogeny","x":288,"y":53,"width":360,"height":60}, + {"id":"1daf2c11f1ccebf0","type":"text","text":"### Sequences and stubs","x":1038,"y":53,"width":340,"height":60}, + {"id":"8995e775a8dbba70","type":"text","text":"\"Cactus graph\" for alignment across multiple genomes","x":118,"y":-430,"width":340,"height":80}, + {"id":"5a563da47acfed7a","type":"text","text":"Definition of a \"cactus graph\": a connected graph in which each edge belongs to at most one simple cycle, a simple cycle being a cycle in which no node or edge is repeated (apart from the start and end node)","x":553,"y":-395,"width":480,"height":140,"color":"1"}, + {"id":"f48a4c7882b18109","type":"text","text":"First formalization of \"cactus graphs\": Harary F, Uhlenbeck GE 1953. On the number of Husimi trees: I. 
Proc Natl Acad Sci 39: 315–322","x":553,"y":-525,"width":485,"height":100}, + {"id":"bc6ec2440d65a561","type":"text","text":"A *complete net* has a node for each *end* of each *block* and a node for each telomere end. There is an edge between two nodes whenever there is an adjacency between them in any of the genomes","x":1073,"y":-685,"width":405,"height":160}, + {"id":"a329a5cea24dcbcb","type":"text","text":"Can overlap","x":1056,"y":2050,"width":250,"height":60}, + {"id":"136c200527f45d5f","type":"text","text":"To build a hierarchy (*tree*) of snarls, some of them must be excluded","x":1056,"y":2150,"width":250,"height":100}, + {"id":"0381618fe1300d0b","type":"text","text":"This yields a forest, with a structure similar to superbubble forests","x":1045,"y":2275,"width":273,"height":120}, + {"id":"20e8fe9a92a9fc20","type":"text","text":"A family of snarls is called *compatible* when every pair of snarls has subgraphs that are either strictly disjoint or strictly nested","x":1356,"y":2093,"width":250,"height":215}, + {"id":"b09229bf8670d741","type":"text","text":"The *digraph* (A) is a special case, in which each edge connects both a left side and a right side","x":773,"y":1354,"width":325,"height":120}, + {"id":"c6d832679d9cf6c2","type":"text","text":"sequencing error","x":562,"y":974,"width":250,"height":60}, + {"id":"a0820e26c5132c08","type":"text","text":"*bubble*: a structure with a common source and sink, but whose remaining sections are disjoint","x":-66,"y":1074,"width":500,"height":80}, + {"id":"99944f7644f8775f","type":"text","text":"genetic variation","x":562,"y":1154,"width":250,"height":60}, + {"id":"6a40b6d09c5dcd37","type":"text","text":"*superbubble*: a more complex subgraph, with a set of paths, not necessarily disjoint, that start and end at a common source and sink","x":283,"y":1314,"width":408,"height":120}, + 
{"id":"0d67d9ec7a8efcab","type":"text","text":"*Bidirected graph* (B): a graph in which each end of each edge has an orientation","x":349,"y":1494,"width":349,"height":120}, + {"id":"df62fb5c6731c321","type":"file","file":"_imgs/types_of_graphs.jpg","x":782,"y":1503,"width":400,"height":303}, + {"id":"3a1f12b39c04853b","type":"text","text":"A *directed walk* on a graph is a walk that leaves each vertex it visits through the side opposite to the one it entered by","x":349,"y":1654,"width":342,"height":120}, + {"id":"0ffdd366b904f707","type":"text","text":"*Biedged graph* (C): contains two types of edges, black and grey. Every node must be connected by a black edge.","x":361,"y":1830,"width":349,"height":130}, + {"id":"9707d09e99a83455","type":"text","text":"A *directed cycle* is a closed directed walk that starts and ends on the same side or on different sides of a node","x":773,"y":1840,"width":375,"height":120}, + {"id":"9187b5a93d2e77d8","type":"text","text":"A pair of distinct nodes is a *chain pair* iff, in the resulting cactus graph, they project to the same node and their two incident black edges project to the same simple cycle.","x":-1337,"y":2350,"width":425,"height":160}, + {"id":"1dbb4f038ac21611","type":"text","text":"As shown by Paten et al. 2011, merging all equivalence classes of triply connected nodes yields a *cactus graph*","x":-2001,"y":2320,"width":440,"height":110}, + {"id":"d1fb5f4183f8fbc0","type":"text","text":"### Cactus graph\nA cactus graph is a graph in which any pair of nodes is at most two-edge-connected. 
In a cactus graph, each edge is part of at most one cycle, and two cycles can intersect in at most one node.","x":-1872,"y":1920,"width":621,"height":190,"color":"1"}, + {"id":"70cb0146dc3f1fbe","type":"text","text":"For a given *cactus graph*, the graph obtained by contracting all the edges that lie in simple cycles is called a *bridge forest*","x":-2132,"y":2560,"width":360,"height":120}, + {"id":"c7e28ab1b6d1dfff","type":"text","text":"A (cyclic) chain is a cyclic sequence of *chain pairs* that lie in the same cycle of the cactus graph, ordered along that cycle","x":-1337,"y":2560,"width":425,"height":140}, + {"id":"c54f8b5a1531dfe0","type":"text","text":"A maximal sequence of *bridge pairs* connected by incident nodes of degree 2 is an *(acyclic) chain*","x":-1712,"y":2668,"width":335,"height":120}, + {"id":"cdfed778307857a7","type":"text","text":"A pair of distinct nodes in the bidirected graph is a *bridge pair* if they project to the same node and their two incident black edges are *bridges*","x":-1712,"y":2478,"width":335,"height":164}, + {"id":"bd2ac0b855e308f1","type":"text","text":"# Progressive Cactus is a multiple-genome aligner for the thousand-genome era\nDOI: https://www.nature.com/articles/s41586-020-2871-y\n\nNew genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies[1](https://www.nature.com/articles/s41586-020-2871-y#ref-CR1 \"Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science 323, 133–138 (2009).\"),[2](https://www.nature.com/articles/s41586-020-2871-y#ref-CR2 \"Weisenfeld, N. I., Kumar, V., Shah, P., Church, D. M. & Jaffe, D. B. Direct determination of diploid genome sequences. Genome Res. 27, 757–767 (2017).\"),[3](https://www.nature.com/articles/s41586-020-2871-y#ref-CR3 \"Jain, M., Olsen, H. E., Paten, B. 
& Akeson, M. The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol. 17, 239 (2016).\"). For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database[4](https://www.nature.com/articles/s41586-020-2871-y#ref-CR4 \"Kitts, P. A. et al. Assembly: a resource for assembled genomes at NCBI. Nucleic Acids Res. 44 (D1), D73–D80 (2016).\") increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies[5](https://www.nature.com/articles/s41586-020-2871-y#ref-CR5 \"Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).\") are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus[6](https://www.nature.com/articles/s41586-020-2871-y#ref-CR6 \"Paten, B. et al. Cactus: algorithms for genome multiple sequence alignment. Genome Res. 21, 1512–1528 (2011).\"), a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. 
We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.","x":-3180,"y":2920,"width":1188,"height":417,"color":"4"}, + {"id":"800d68620586fe43","type":"text","text":"Adds an *input guide tree* that makes it possible to divide the overall problem into a series of subproblems","x":-2450,"y":3680,"width":425,"height":87,"color":"1"}, + {"id":"464d3a859a82198b","type":"text","text":"# Cactus: Algorithms for genome multiple sequence alignment\nDOI: https://doi.org/10.1101%2Fgr.123356.111\n\nMuch attention has been given to the problem of creating reliable multiple sequence alignments in a model incorporating substitutions, insertions, and deletions. Far less attention has been paid to the problem of optimizing alignments in the presence of more general rearrangement and copy number variation. Using Cactus graphs, recently introduced for representing sequence alignments, we describe two complementary algorithms for creating genomic alignments. We have implemented these algorithms in the new “Cactus” alignment program. We test Cactus using the Evolver genome evolution simulator, a comprehensive new tool for simulation, and show using these and existing simulations that Cactus significantly outperforms all of its peers. Finally, we make an empirical assessment of Cactus's ability to properly align genes and find interesting cases of intra-gene duplication within the primates.","x":-1151,"y":-277,"width":920,"height":360,"color":"4"}, + {"id":"c17ec7ef81f35b4b","type":"text","text":"# Superbubbles, ultrabubbles and cacti\nURL: https://pubmed.ncbi.nlm.nih.gov/29461862/\n\nA superbubble is a type of directed acyclic subgraph with single distinct source and sink vertices. In genome assembly and genetics, the possible paths through a superbubble can be considered to represent the set of possible sequences at a location in a genome. 
Bidirected and biedged graphs are a generalization of digraphs that are increasingly being used to more fully represent genome assembly and variation problems. In this study, we define snarls and ultrabubbles, generalizations of superbubbles for bidirected and biedged graphs, and give an efficient algorithm for the detection of these more general structures. Key to this algorithm is the cactus graph, which, we show, encodes the nested decomposition of a graph into snarls and ultrabubbles within its structure. We propose and demonstrate empirically that this decomposition on bidirected and biedged graphs solves a fundamental problem by defining genetic sites for any collection of genomic variations, including complex structural variations, without need for any single reference genome coordinate system. Further, the nesting of the decomposition gives a natural way to describe and model variations contained within large variations, a case not currently dealt with by existing formats (e.g., variant cell format (VCF)).","x":-1420,"y":1314,"width":1220,"height":360,"color":"4"}, + {"id":"8392d7487ecced04","type":"text","text":"cacti never defined?","x":-441,"y":1284,"width":250,"height":60,"color":"3"}, + {"id":"8ff05f3c7e0b7038","type":"text","text":"# Pangenome graph construction from genome alignments with Minigraph-Cactus\nURL: https://www.nature.com/articles/s41587-023-01793-w\n\nPangenome references address biases of reference genomes by storing a representative set of diverse haplotypes and their alignment, usually as a graph. Alternate alleles determined by variant callers can be used to construct pangenome graphs, but advances in long-read sequencing are leading to widely available, high-quality phased assemblies. Constructing a pangenome graph directly from assemblies, as opposed to variant calls, leverages the graph’s ability to represent variation at different scales. 
Here we present the Minigraph-Cactus pangenome pipeline, which creates pangenomes directly from whole-genome alignments, and demonstrate its ability to scale to 90 human haplotypes from the Human Pangenome Reference Consortium. The method builds graphs containing all forms of genetic variation while allowing use of current mapping and genotyping tools. We measure the effect of the quality and completeness of reference genomes used for analysis within the pangenomes and show that using the CHM13 reference from the Telomere-to-Telomere Consortium improves the accuracy of our methods. We also demonstrate construction of a _Drosophila melanogaster_ pangenome.","x":-5520,"y":4600,"width":1200,"height":360,"color":"4"}, + {"id":"c6fb2f397f3cf773","type":"text","text":"# The design and construction of reference pangenome graphs with minigraph\nURL: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02168-z\n\nThe recent advances in sequencing technologies enable the assembly of individual genomes to the quality of the reference genome. How to integrate multiple genomes from the same species and make the integrated representation accessible to biologists remains an open challenge. Here, we propose a graph-based data model and associated formats to represent multiple genomes while preserving the coordinate of the linear reference genome. 
We implement our ideas in the minigraph toolkit and demonstrate that we can efficiently construct a pangenome graph and compactly encode tens of thousands of structural variants missing from the current reference genome.","x":-6160,"y":2728,"width":1140,"height":260,"color":"4"}, + {"id":"799f68e892314b6b","type":"text","text":"Uses reconstructed ancestral assemblies to combine the sub-alignments","x":-1932,"y":3707,"width":290,"height":120}, + {"id":"a4cd608c024b8594","type":"text","text":"2 to 5 genomes per sub-alignment","x":-1747,"y":3919,"width":250,"height":60}, + {"id":"6dc45fab5aead6a1","type":"text","text":"recursive splitting along the guide tree","x":-2066,"y":3919,"width":250,"height":60}, + {"id":"958344047a413d04","type":"text","text":"reconstruction of an ancestral assembly for each subgroup","x":-1497,"y":3775,"width":250,"height":104}, + {"id":"363a63c9986152c9","type":"text","text":"Alignment within the node","x":-1216,"y":4069,"width":289,"height":60}, + {"id":"93fe12ee89f385ed","type":"text","text":"Quadratic scaling in the number of bases in the original version of Cactus: a progressive alignment strategy is added","x":-1857,"y":2860,"width":395,"height":120}, + {"id":"a3e944cb171404a8","type":"file","file":"_imgs/progressive_cactus.png","x":-1952,"y":3160,"width":513,"height":515}, + {"id":"6d2888fd017db5c0","type":"text","text":"Used as sequences to align for higher-level nodes","x":-1298,"y":3593,"width":347,"height":87}, + {"id":"0607cb8ec901ea4e","type":"text","text":"**Impact of the input guide tree**","x":-2620,"y":4069,"width":275,"height":60}, + {"id":"f4d55f5538722fc0","type":"text","text":"Non-negligible if the tree is incorrect or unknown","x":-2780,"y":4270,"width":263,"height":100}, + {"id":"b67aca38385cc0e2","type":"text","text":"Effect reduced by adding the notion of an outgroup, or even several outgroups","x":-2482,"y":4270,"width":337,"height":100}, + 
{"id":"15e35fa2d6d51a3d","type":"text","text":"# Gap-Sensitive Colinear Chaining Algorithms for Acyclic Pangenome Graphs\nURL: https://www.liebertpub.com/doi/10.1089/cmb.2023.0186\n\nA pangenome graph can serve as a better reference for genomic studies because it allows a compact representation of multiple genomes within a species. Aligning sequences to a graph is critical for pangenome-based resequencing. The seed-chain-extend heuristic works by finding short exact matches between a sequence and a graph. In this heuristic, colinear chaining helps identify a good cluster of exact matches that can be combined to form an alignment. Colinear chaining algorithms have been extensively studied for aligning two sequences with various gap costs, including linear, concave, and convex cost functions. However, extending these algorithms for sequence-to-graph alignment presents significant challenges. Recently, Makinen et al. introduced a sparse dynamic programming framework that exploits the small path cover property of acyclic pangenome graphs, enabling efficient chaining. However, this framework does not consider gap costs, limiting its practical effectiveness. We address this limitation by developing novel problem formulations and provably good chaining algorithms that support a variety of gap cost functions. These functions are carefully designed to enable fast chaining algorithms whose time requirements are parameterized in terms of the size of the minimum path cover. Through an empirical evaluation, we demonstrate the superior performance of our algorithm compared with existing aligners. 
When mapping simulated long reads to a pangenome graph comprising 95 human haplotypes, we achieved 98.7% precision while leaving <2% of reads unmapped.","x":-4080,"y":4230,"width":1240,"height":420,"color":"4"}, + {"id":"ccad44d8e99b297f","type":"text","text":"Compared against:\n+ minigraph\n+ GraphAligner\n+ GraphChainer\n+ (minimap2)","x":-3818,"y":4820,"width":240,"height":180}, + {"id":"473c33069dc9b5e1","type":"text","text":"Takes gap cost functions into account when aligning sequences to a pangenome represented as a DAG","x":-3460,"y":4820,"width":422,"height":100,"color":"1"}, + {"id":"b857d70c30b2a368","type":"text","text":"So, what are the applications? MGC and PGGB graphs are not DAGs","x":-3464,"y":5000,"width":430,"height":80,"color":"2"}, + {"id":"4b073f56cef38de7","type":"text","text":"Perhaps a tool that turns pangenome graphs into DAGs is needed?","x":-2988,"y":4980,"width":297,"height":120,"color":"5"}, + {"id":"2cd0774814970b9a","type":"text","text":"binary tree that does not need to be fully resolved","x":-2362,"y":3949,"width":250,"height":102}, + {"id":"8eb4486aa40f0ee0","type":"text","text":"New paper in preparation: \"Haplotype-aware Sequence-to-Graph Alignment\"","x":-3265,"y":4160,"width":454,"height":90,"color":"3"}, + {"id":"b6da29614bf625f9","type":"text","text":"# Construction and representation of human pangenome graphs\nURL: https://pasteur.hal.science/pasteur-04126278/\n\nAs a single reference genome cannot possibly represent all the variation present across human individuals, pangenome graphs have been introduced to incorporate population diversity within a wide range of genomic analyses. Several data structures have been proposed for representing collections of genomes as pangenomes, in particular graphs. 
In this work we collect all publicly available high-quality human haplotypes and constructed the largest human pangenome graphs to date, incorporating 52 individuals in addition to two synthetic references (CHM13 and GRCh38). We build variation graphs and de Bruijn graphs of this collection using five of the state-of-the-art tools: Bifrost, mdbg, Minigraph, Minigraph-Cactus and pggb. We examine differences in the way each of these tools represents variations between input sequences, both in terms of overall graph structure and representation of specific genetic loci. This work sheds light on key differences between pangenome graph representations, informing end-users on how to select the most appropriate graph type for their application.","x":-4570,"y":5600,"width":1170,"height":335,"color":"4"}, + {"id":"959c6871b98e5f24","type":"text","text":"Comparative study of:\n+ Bifrost\n+ mdbg\n+ Minigraph\n+ Minigraph-cactus\n+ PGGB","x":-3284,"y":5570,"width":250,"height":230}, + {"id":"2ad06502c64301be","type":"text","text":"Colored compacted de Bruijn graphs (ccdbg)","x":-2941,"y":5625,"width":250,"height":60}, + {"id":"c1747efd88b1391c","type":"text","text":"Variation graphs","x":-2941,"y":5708,"width":250,"height":60}, + {"id":"f51cc8b5120d57bd","type":"text","text":"Contiguous *chain pairs* in a *chain* share the two opposite sides of the same black edge","x":-1294,"y":2738,"width":340,"height":100}, + {"id":"654e9a95fdb293f5","type":"file","file":"_imgs/superbubbles.jpg","x":-540,"y":1724,"width":400,"height":179}, + {"id":"837c14414c238af7","type":"text","text":"Strictly disjoint","x":-644,"y":2180,"width":250,"height":60}, + {"id":"35c597f2a9236a4f","type":"text","text":"Definition from Onodera et al. 
2011","x":-575,"y":2292,"width":235,"height":60}, + {"id":"129f89296035d1b3","type":"text","text":"Strictly nested","x":-641,"y":2000,"width":245,"height":65}, + {"id":"e4eeaa710c4ce695","type":"text","text":"Strict nesting properties","x":-679,"y":2090,"width":320,"height":60}, + {"id":"9e629313e68e4600","type":"text","text":"HAL format and HAL toolkit:\n+ alignment format that keeps this notion of a tree\n+ editable format: genomes can be added or removed without recomputing everything from scratch","x":-1377,"y":2876,"width":356,"height":209}, + {"id":"af918cdf44aa146d","type":"text","text":"# Distance indexing and seed clustering in sequence graphs\nURL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7355256/pdf/btaa446.pdf\n\nMotivation: Graph representations of genomes are capable of expressing more genetic variation and can therefore better represent a population than standard linear genomes. However, due to the greater complexity of genome graphs relative to linear genomes, some functions that are trivial on linear genomes become much more difficult in genome graphs. Calculating distance is one such function that is simple in a linear genome but complicated in a graph context. In read mapping algorithms such distance calculations are fundamental to determining if seed alignments could belong to the same mapping.\nResults: We have developed an algorithm for quickly calculating the minimum distance between positions on a sequence graph using a minimum distance index. We have also developed an algorithm that uses the distance index to cluster seeds on a graph. 
We demonstrate that our implementations of these algorithms are efficient and practical to use for a new generation of mapping algorithms based upon genome graphs.","x":-1171,"y":3140,"width":1060,"height":400,"color":"4"}, + {"id":"c4e7f42fad8cc6dd","type":"text","text":"### Ultrabubble\n\n+ A superbubble in a digraph is an ultrabubble in the equivalent *biedged graph*. \n+ A snarl is an ultrabubble if its separated component is acyclic and contains no *tips*.\n","x":-418,"y":2784,"width":615,"height":185,"color":"1"}, + {"id":"a76a161d695d3749","type":"text","text":"### Superbubble\nAny pair of distinct nodes (x,y) in the graph forms a superbubble iff:\n+ y is reachable from x\n+ the nodes reachable from x without passing through y are exactly the nodes from which y is reachable without passing through x\n+ the subgraph induced by this set is acyclic\n+ no other node of the set forms a pair with x or with y that meets all the criteria above","x":-260,"y":2033,"width":482,"height":370,"color":"1"}, + {"id":"3bac4b15a3b76c6d","type":"text","text":"### Snarl\nGeneralisation to *biedged graphs*: *snarls* are minimal subgraphs whose nodes are all connected to the rest of the graph by at most two edges (2-BEC) \nA *snarl* in a biedged graph has these properties:\n+ removing the black edges incident to x and y disconnects the graph, creating a separated component\n+ there is no node z in the node set such that {x,z} or {z,y} satisfies the previous criterion","x":349,"y":2292,"width":595,"height":320,"color":"1"}, + {"id":"cafa15b1c72270f0","type":"text","text":"Builds on the decomposition mechanics (notably into snarls) presented in the previous paper","x":44,"y":3140,"width":356,"height":100}, + {"id":"69cb9f0bc27775ce","type":"text","text":"### Snarl\nDefined by a pair of nodes (x,y):\n+ x and y are called *boundaries*\n+ they must be separable: cutting the 
edges leaving the snarl isolates the structure\n+ the resulting snarl must be minimal: nothing inside it may satisfy the rule above **while keeping one of our boundaries**\n+ the snarl refers both to the *boundary* nodes and to the nodes between them","x":440,"y":3140,"width":460,"height":340}, + {"id":"8fbf3acc9cd470f1","type":"file","file":"_imgs/snarl_decomposition.jpg","x":918,"y":3140,"width":485,"height":340}, + {"id":"9927540a8f87d695","type":"text","text":"A trivial chain of snarls contains only a single snarl (no simple bubbles chained one after another)","x":821,"y":3540,"width":448,"height":70}, + {"id":"8ec43f4cf38a035f","type":"text","text":"The nesting relationships between snarls can be described by the snarl tree","x":821,"y":3633,"width":448,"height":74}, + {"id":"9b29d6bc4bf05e07","type":"text","text":"Each level we go down in the graph is a new level of nesting","x":1356,"y":3603,"width":269,"height":134}, + {"id":"c00023f9483a734f","type":"text","text":"Snarl index: for each snarl, we store the minimum distance between every pair of node sides (+ or -) of the structures (snarls, chains, nodes), including the *boundary* nodes","x":-809,"y":3939,"width":337,"height":180}, + {"id":"41575dca10c5c02d","type":"text","text":"Chain index: for each chain, we store 3 arrays:\n+ prefix sum: holds the minimum distance from the start of the chain to the left side of each snarl in the chain\n+ *loop distance* of each *boundary* node: minimum distance to leave a *boundary* node, change direction in the chain and return to the same node side while traversing in the opposite direction. There is one array for each direction.","x":-425,"y":3939,"width":503,"height":301}, + {"id":"012805a4b06cc5b4","type":"text","text":"Goal: to be able to perform clustering on the graph. 
Useful for *seed-and-extend* approaches","x":-573,"y":4350,"width":380,"height":92,"color":"1"}, + {"id":"b3ab8ceb14f25665","type":"text","text":"Goal: using the snarl tree, find the minimum distance between two points in the graph","x":-573,"y":3670,"width":321,"height":115,"color":"1"}, + {"id":"79d85b0031c6ce3f","type":"text","text":"Initialisation with each position in its own cluster, followed by progressive aggregation along the *snarl tree*","x":222,"y":4202,"width":360,"height":120}, + {"id":"24a49c27acfb97ef","type":"text","text":"At each step, the cluster is annotated with two *boundary distances*: the shortest distances from any of its positions to the *boundaries* of the structure","x":174,"y":4396,"width":456,"height":120} + ], + "edges":[ + {"id":"ebe3edacf1266866","fromNode":"8995e775a8dbba70","fromSide":"right","toNode":"f48a4c7882b18109","toSide":"left"}, + {"id":"e116c9f5f8df0bd3","fromNode":"464d3a859a82198b","fromSide":"right","toNode":"8995e775a8dbba70","toSide":"left"}, + {"id":"e324d44edbab4147","fromNode":"8995e775a8dbba70","fromSide":"right","toNode":"5a563da47acfed7a","toSide":"left"}, + {"id":"308a282ebfe97dfd","fromNode":"5a563da47acfed7a","fromSide":"right","toNode":"167906313c410a7c","toSide":"left"}, + {"id":"8446850d8fb10db8","fromNode":"5a563da47acfed7a","fromSide":"right","toNode":"9ee4e8e1010c6b03","toSide":"left","label":"used for genome MSA"}, + {"id":"968d49dfa20baa79","fromNode":"1daf2c11f1ccebf0","fromSide":"bottom","toNode":"e761aeaf9b4c190e","toSide":"top"}, + {"id":"f2a90d7cad64d42e","fromNode":"1daf2c11f1ccebf0","fromSide":"bottom","toNode":"b33691b10ecd2b74","toSide":"top"}, + {"id":"af6f58767a201a6b","fromNode":"739fec564ad1be89","fromSide":"bottom","toNode":"2c7fb3c6110987c0","toSide":"top"}, + {"id":"7ab8c5c7f1384efc","fromNode":"739fec564ad1be89","fromSide":"bottom","toNode":"0ee0b762dc16bcf4","toSide":"top"}, + 
{"id":"81bdf6ae38364162","fromNode":"2c7fb3c6110987c0","fromSide":"bottom","toNode":"0d4c74e6c3256e9c","toSide":"top"}, + {"id":"902c7d82346783d0","fromNode":"0ee0b762dc16bcf4","fromSide":"bottom","toNode":"0d4c74e6c3256e9c","toSide":"top"}, + {"id":"224b782b14a781ef","fromNode":"e761aeaf9b4c190e","fromSide":"bottom","toNode":"daacdab1cb57974c","toSide":"top"}, + {"id":"1d77852914c77688","fromNode":"b33691b10ecd2b74","fromSide":"bottom","toNode":"daacdab1cb57974c","toSide":"top"}, + {"id":"5dcd173ff72edebb","fromNode":"0ee0b762dc16bcf4","fromSide":"bottom","toNode":"daacdab1cb57974c","toSide":"top"}, + {"id":"a15296b939ba5240","fromNode":"2c7fb3c6110987c0","fromSide":"bottom","toNode":"5378f136cde75253","toSide":"left"}, + {"id":"43a8d36a681d56da","fromNode":"5378f136cde75253","fromSide":"right","toNode":"a37fa3a946925b0a","toSide":"left"}, + {"id":"8c76f4c5a7f32025","fromNode":"daacdab1cb57974c","fromSide":"bottom","toNode":"a37fa3a946925b0a","toSide":"top"}, + {"id":"cec0a068d3ebf1b3","fromNode":"a37fa3a946925b0a","fromSide":"right","toNode":"b22ba889e7168219","toSide":"left"}, + {"id":"977d9391d47df269","fromNode":"b22ba889e7168219","fromSide":"bottom","toNode":"a3f8d96e7aceab99","toSide":"top"}, + {"id":"244bf5cc5bd2f4bb","fromNode":"b33691b10ecd2b74","fromSide":"bottom","toNode":"9b12143cbb4118a9","toSide":"top"}, + {"id":"ab7a0beb585a38f0","fromNode":"9b12143cbb4118a9","fromSide":"right","toNode":"a25245bd984a162e","toSide":"top"}, + {"id":"bdb6b187915a84b5","fromNode":"a25245bd984a162e","fromSide":"bottom","toNode":"a3f8d96e7aceab99","toSide":"top"}, + {"id":"be9cfe85a6403cae","fromNode":"9b12143cbb4118a9","fromSide":"bottom","toNode":"a3f8d96e7aceab99","toSide":"top"}, + {"id":"04310ed63502617e","fromNode":"572e1342e4d6b1c1","fromSide":"bottom","toNode":"a3f8d96e7aceab99","toSide":"top"}, + {"id":"1df5f821a8c82727","fromNode":"58bad2b9c05a7f76","fromSide":"bottom","toNode":"464d3a859a82198b","toSide":"top","color":"4"}, + 
{"id":"ba8b3cf16e17d306","fromNode":"a3f8d96e7aceab99","fromSide":"bottom","toNode":"26b0fd371f64a696","toSide":"top","label":"CAF algorithm"}, + {"id":"54bcf6f2d1a8269f","fromNode":"26b0fd371f64a696","fromSide":"bottom","toNode":"48faed08f7460406","toSide":"top","label":"BAR algorithm"}, + {"id":"7018b930bdc7a00c","fromNode":"58bad2b9c05a7f76","fromSide":"right","toNode":"66a9ce421a4c5ec9","toSide":"left"}, + {"id":"23fa7e35e87f2d46","fromNode":"66a9ce421a4c5ec9","fromSide":"right","toNode":"919383e2a232b1fe","toSide":"left"}, + {"id":"517da59aadc7171a","fromNode":"66a9ce421a4c5ec9","fromSide":"right","toNode":"a6826d2c452750ad","toSide":"left"}, + {"id":"61839395d8e382ef","fromNode":"498a6976afcf8582","fromSide":"right","toNode":"9f6fc359db1439ff","toSide":"left"}, + {"id":"9ac70617c3094f73","fromNode":"498a6976afcf8582","fromSide":"right","toNode":"3fce6a85e9ce9de2","toSide":"left"}, + {"id":"f06a3d85df960a81","fromNode":"884872dc5076dca6","fromSide":"right","toNode":"498a6976afcf8582","toSide":"left"}, + {"id":"14395d7c84d6ce19","fromNode":"58bad2b9c05a7f76","fromSide":"right","toNode":"884872dc5076dca6","toSide":"left"}, + {"id":"893320a8783badc3","fromNode":"58bad2b9c05a7f76","fromSide":"right","toNode":"c73aa01ef3b7754c","toSide":"left"}, + {"id":"18446ccc67e981c7","fromNode":"58bad2b9c05a7f76","fromSide":"right","toNode":"4402546a20b4c968","toSide":"left"}, + {"id":"51047abffbe67d81","fromNode":"58bad2b9c05a7f76","fromSide":"right","toNode":"c5bed1188d635294","toSide":"left"}, + {"id":"1d4e0409dceb0b87","fromNode":"4402546a20b4c968","fromSide":"right","toNode":"acbc449c2d39d522","toSide":"left"}, + {"id":"365e46f84db50434","fromNode":"c73aa01ef3b7754c","fromSide":"right","toNode":"acbc449c2d39d522","toSide":"left"}, + {"id":"fbaab3197d5013bb","fromNode":"c5bed1188d635294","fromSide":"right","toNode":"acbc449c2d39d522","toSide":"left"}, + {"id":"049be9aeb41db5c7","fromNode":"c5bed1188d635294","fromSide":"right","toNode":"c2362093c3e8151b","toSide":"left"}, 
+ {"id":"f9b6fc489cf8faf3","fromNode":"c2362093c3e8151b","fromSide":"right","toNode":"1e1e30ad4bcc8914","toSide":"left"}, + {"id":"a6f4481c331518c6","fromNode":"4402546a20b4c968","fromSide":"right","toNode":"1e1e30ad4bcc8914","toSide":"left"}, + {"id":"634c386f2f41e0d4","fromNode":"1e1e30ad4bcc8914","fromSide":"bottom","toNode":"bc6ec2440d65a561","toSide":"top"}, + {"id":"cbca79064ccf7eaf","fromNode":"bc6ec2440d65a561","fromSide":"right","toNode":"3aba27866e336e46","toSide":"left"}, + {"id":"02d175fedb4107bf","fromNode":"bc6ec2440d65a561","fromSide":"right","toNode":"1041d25cf9b9250b","toSide":"left"}, + {"id":"4de01744f4501c27","fromNode":"529fdb5950adb38d","fromSide":"right","toNode":"742daf28a78c57f6","toSide":"left"}, + {"id":"6a2c50bf5897173a","fromNode":"58bad2b9c05a7f76","fromSide":"right","toNode":"529fdb5950adb38d","toSide":"left"}, + {"id":"9a525a749a8f5be9","fromNode":"464d3a859a82198b","fromSide":"right","toNode":"739fec564ad1be89","toSide":"top"}, + {"id":"7abe58ad486ef183","fromNode":"464d3a859a82198b","fromSide":"right","toNode":"1daf2c11f1ccebf0","toSide":"top"}, + {"id":"86a80c283fdcec66","fromNode":"c6fb2f397f3cf773","fromSide":"bottom","toNode":"8ff05f3c7e0b7038","toSide":"top","color":"4"}, + {"id":"ccbf97af47d03a2c","fromNode":"bd2ac0b855e308f1","fromSide":"bottom","toNode":"8ff05f3c7e0b7038","toSide":"top","color":"4"}, + {"id":"54e0ce13d6b7d718","fromNode":"c17ec7ef81f35b4b","fromSide":"right","toNode":"a0820e26c5132c08","toSide":"left"}, + {"id":"2889826312059da4","fromNode":"a0820e26c5132c08","fromSide":"right","toNode":"c6d832679d9cf6c2","toSide":"left"}, + {"id":"e3c31dadb0a910dd","fromNode":"a0820e26c5132c08","fromSide":"right","toNode":"99944f7644f8775f","toSide":"left"}, + {"id":"2b92a6afb54af1ed","fromNode":"a0820e26c5132c08","fromSide":"bottom","toNode":"6a40b6d09c5dcd37","toSide":"left"}, + {"id":"2688c91e67850790","fromNode":"c17ec7ef81f35b4b","fromSide":"right","toNode":"0d67d9ec7a8efcab","toSide":"left"}, + 
{"id":"9e8a908012a6bdd3","fromNode":"0d67d9ec7a8efcab","fromSide":"right","toNode":"b09229bf8670d741","toSide":"left"}, + {"id":"9fc422907c5f9eb0","fromNode":"c17ec7ef81f35b4b","fromSide":"right","toNode":"0ffdd366b904f707","toSide":"left"}, + {"id":"0b947d400cd98ec2","fromNode":"c17ec7ef81f35b4b","fromSide":"right","toNode":"3a1f12b39c04853b","toSide":"left"}, + {"id":"c346218255b123fb","fromNode":"3a1f12b39c04853b","fromSide":"right","toNode":"9707d09e99a83455","toSide":"left"}, + {"id":"ef8658dfc14ed38f","fromNode":"c17ec7ef81f35b4b","fromSide":"right","toNode":"a76a161d695d3749","toSide":"top"}, + {"id":"a0d22ab405871e8d","fromNode":"0ffdd366b904f707","fromSide":"bottom","toNode":"3bac4b15a3b76c6d","toSide":"top"}, + {"id":"ffb00171e2e70810","fromNode":"a76a161d695d3749","fromSide":"right","toNode":"3bac4b15a3b76c6d","toSide":"top","color":"1","label":"adaptation to biedged graphs"}, + {"id":"d0d999112b20f6ac","fromNode":"9187b5a93d2e77d8","fromSide":"bottom","toNode":"c7e28ab1b6d1dfff","toSide":"top"}, + {"id":"2629314dd672807c","fromNode":"d1fb5f4183f8fbc0","fromSide":"bottom","toNode":"9187b5a93d2e77d8","toSide":"top"}, + {"id":"de7ff07024229bd1","fromNode":"d1fb5f4183f8fbc0","fromSide":"bottom","toNode":"1dbb4f038ac21611","toSide":"top"}, + {"id":"60e9ede61ce5519a","fromNode":"1dbb4f038ac21611","fromSide":"bottom","toNode":"70cb0146dc3f1fbe","toSide":"top"}, + {"id":"29450b8323a2452b","fromNode":"70cb0146dc3f1fbe","fromSide":"right","toNode":"cdfed778307857a7","toSide":"left"}, + {"id":"70a7ba8dafde6e71","fromNode":"70cb0146dc3f1fbe","fromSide":"right","toNode":"c54f8b5a1531dfe0","toSide":"left"}, + {"id":"a47314fa9d5b65d8","fromNode":"a76a161d695d3749","fromSide":"left","toNode":"e4eeaa710c4ce695","toSide":"right"}, + {"id":"71c0457cbed67c68","fromNode":"e4eeaa710c4ce695","fromSide":"top","toNode":"129f89296035d1b3","toSide":"bottom"}, + {"id":"110fc9b614143ecc","fromNode":"e4eeaa710c4ce695","fromSide":"bottom","toNode":"837c14414c238af7","toSide":"top"}, 
+ {"id":"f311f9d0f44c02f3","fromNode":"3bac4b15a3b76c6d","fromSide":"right","toNode":"a329a5cea24dcbcb","toSide":"left"}, + {"id":"2872436e59ec4fb8","fromNode":"3bac4b15a3b76c6d","fromSide":"right","toNode":"136c200527f45d5f","toSide":"left"}, + {"id":"f70a7d03963d68d4","fromNode":"136c200527f45d5f","fromSide":"right","toNode":"20e8fe9a92a9fc20","toSide":"left"}, + {"id":"386376220863fc6e","fromNode":"3bac4b15a3b76c6d","fromSide":"right","toNode":"0381618fe1300d0b","toSide":"left"}, + {"id":"35e1fb5b82214647","fromNode":"136c200527f45d5f","fromSide":"bottom","toNode":"0381618fe1300d0b","toSide":"top"}, + {"id":"c6d923407d8fda25","fromNode":"c4e7f42fad8cc6dd","fromSide":"top","toNode":"3bac4b15a3b76c6d","toSide":"bottom","color":"1","label":"generalisation"}, + {"id":"cb9f40234840aec2","fromNode":"a76a161d695d3749","fromSide":"bottom","toNode":"c4e7f42fad8cc6dd","toSide":"top","color":"1","label":"extension to bidirected graphs"}, + {"id":"9b74b92bba69ce36","fromNode":"a76a161d695d3749","fromSide":"left","toNode":"35c597f2a9236a4f","toSide":"right"}, + {"id":"fde3f265977a65ff","fromNode":"c17ec7ef81f35b4b","fromSide":"bottom","toNode":"d1fb5f4183f8fbc0","toSide":"top"}, + {"id":"06c24211732c06b6","fromNode":"bd2ac0b855e308f1","fromSide":"right","toNode":"93fe12ee89f385ed","toSide":"left"}, + {"id":"bd2247f4d68e28c0","fromNode":"bd2ac0b855e308f1","fromSide":"bottom","toNode":"800d68620586fe43","toSide":"left"}, + {"id":"97a7d3b740d57216","fromNode":"464d3a859a82198b","fromSide":"bottom","toNode":"c17ec7ef81f35b4b","toSide":"top","color":"4"}, + {"id":"89a30d2f3147142b","fromNode":"464d3a859a82198b","fromSide":"bottom","toNode":"bd2ac0b855e308f1","toSide":"top","color":"4"}, + {"id":"9d9d59fdf4eb4e8f","fromNode":"800d68620586fe43","fromSide":"right","toNode":"799f68e892314b6b","toSide":"left"}, + {"id":"c2830ac1178daa97","fromNode":"799f68e892314b6b","fromSide":"bottom","toNode":"a4cd608c024b8594","toSide":"top"}, + 
{"id":"2c5c354e0741048a","fromNode":"799f68e892314b6b","fromSide":"bottom","toNode":"6dc45fab5aead6a1","toSide":"top"}, + {"id":"39c967e77abf27a8","fromNode":"800d68620586fe43","fromSide":"bottom","toNode":"2cd0774814970b9a","toSide":"top"}, + {"id":"cc88d2536c556762","fromNode":"a4cd608c024b8594","fromSide":"right","toNode":"958344047a413d04","toSide":"bottom"}, + {"id":"60f95a39c85b64eb","fromNode":"958344047a413d04","fromSide":"top","toNode":"799f68e892314b6b","toSide":"right"}, + {"id":"0b057cb2697bdeb8","fromNode":"958344047a413d04","fromSide":"right","toNode":"363a63c9986152c9","toSide":"top","label":"ancestral assembly"}, + {"id":"b9c3076fa93276be","fromNode":"a4cd608c024b8594","fromSide":"bottom","toNode":"363a63c9986152c9","toSide":"top","label":"original assemblies"}, + {"id":"078adaa5637ef6b9","fromNode":"958344047a413d04","fromSide":"right","toNode":"6d2888fd017db5c0","toSide":"bottom"}, + {"id":"e0b543b48f85d0f2","fromNode":"bd2ac0b855e308f1","fromSide":"right","toNode":"9e629313e68e4600","toSide":"left"}, + {"id":"a28c36841d31a3ba","fromNode":"800d68620586fe43","fromSide":"bottom","toNode":"0607cb8ec901ea4e","toSide":"top"}, + {"id":"7aaf6003ad328bcc","fromNode":"0607cb8ec901ea4e","fromSide":"bottom","toNode":"b67aca38385cc0e2","toSide":"top"}, + {"id":"f89a0b3771dd7f12","fromNode":"0607cb8ec901ea4e","fromSide":"bottom","toNode":"f4d55f5538722fc0","toSide":"top"}, + {"id":"6153bd15466017cb","fromNode":"15e35fa2d6d51a3d","fromSide":"bottom","toNode":"ccad44d8e99b297f","toSide":"top"}, + {"id":"3fc4574ac8d88568","fromNode":"15e35fa2d6d51a3d","fromSide":"bottom","toNode":"473c33069dc9b5e1","toSide":"top"}, + {"id":"72cce186facc20fc","fromNode":"473c33069dc9b5e1","fromSide":"bottom","toNode":"b857d70c30b2a368","toSide":"top"}, + {"id":"0d2b8af12a718c47","fromNode":"473c33069dc9b5e1","fromSide":"right","toNode":"4b073f56cef38de7","toSide":"top"}, + 
{"id":"091618dd7a3964e1","fromNode":"b6da29614bf625f9","fromSide":"right","toNode":"959c6871b98e5f24","toSide":"left"}, + {"id":"b74d4005748a7017","fromNode":"959c6871b98e5f24","fromSide":"right","toNode":"2ad06502c64301be","toSide":"left"}, + {"id":"6d1715eb9a5a9c5a","fromNode":"959c6871b98e5f24","fromSide":"right","toNode":"c1747efd88b1391c","toSide":"left"}, + {"id":"9b1ee7d7288087fa","fromNode":"c17ec7ef81f35b4b","fromSide":"bottom","toNode":"af918cdf44aa146d","toSide":"top","color":"4"}, + {"id":"dc5b669fe605a811","fromNode":"c7e28ab1b6d1dfff","fromSide":"bottom","toNode":"f51cc8b5120d57bd","toSide":"top"}, + {"id":"601b64253090e1e8","fromNode":"af918cdf44aa146d","fromSide":"right","toNode":"cafa15b1c72270f0","toSide":"left"}, + {"id":"c5e98cf9cfc6f4c6","fromNode":"af918cdf44aa146d","fromSide":"right","toNode":"69cb9f0bc27775ce","toSide":"left"}, + {"id":"7606d6cdc7a0385f","fromNode":"3bac4b15a3b76c6d","fromSide":"bottom","toNode":"69cb9f0bc27775ce","toSide":"top"}, + {"id":"6608b9ba90ddbe0b","fromNode":"69cb9f0bc27775ce","fromSide":"bottom","toNode":"9927540a8f87d695","toSide":"left"}, + {"id":"a183b11ae8fa1345","fromNode":"69cb9f0bc27775ce","fromSide":"bottom","toNode":"8ec43f4cf38a035f","toSide":"left"}, + {"id":"199c78ed0baf991f","fromNode":"8ec43f4cf38a035f","fromSide":"right","toNode":"9b29d6bc4bf05e07","toSide":"left"}, + {"id":"16fd9d6161412c6f","fromNode":"af918cdf44aa146d","fromSide":"bottom","toNode":"b3ab8ceb14f25665","toSide":"top"}, + {"id":"bd115360c5acb561","fromNode":"b3ab8ceb14f25665","fromSide":"bottom","toNode":"c00023f9483a734f","toSide":"top"}, + {"id":"1dc61bbce548a31c","fromNode":"b3ab8ceb14f25665","fromSide":"bottom","toNode":"41575dca10c5c02d","toSide":"top"}, + {"id":"bf01cb52323b6ff1","fromNode":"c00023f9483a734f","fromSide":"bottom","toNode":"012805a4b06cc5b4","toSide":"top"}, + {"id":"4e0e2554d28232b3","fromNode":"41575dca10c5c02d","fromSide":"bottom","toNode":"012805a4b06cc5b4","toSide":"top"}, + 
{"id":"69656e75ed4cba68","fromNode":"79d85b0031c6ce3f","fromSide":"bottom","toNode":"24a49c27acfb97ef","toSide":"top"}, + {"id":"3748c68fbbc6330c","fromNode":"012805a4b06cc5b4","fromSide":"right","toNode":"79d85b0031c6ce3f","toSide":"left"} + ] +} \ No newline at end of file diff --git a/content/Working with graphs/Tools/bubblegun.md b/content/Working with graphs/Tools/bubblegun.md index 5be744c90bf31..ac9b68a6e2482 100644 --- a/content/Working with graphs/Tools/bubblegun.md +++ b/content/Working with graphs/Tools/bubblegun.md @@ -9,7 +9,10 @@ Tool for detecting bubbles and superbubbles in De-bruijn/variation graphs. Sever - Extracting two random paths from each bubble chain for haplotyping - Extracting information from long reads aligned to bubble chains -Publication available [here](https://academic.oup.com/bioinformatics/article/38/17/4217/6633304), source code available [here](https://github.com/fawaz-dabbaghieh/bubble_gun). +> [!IMPORTANT] Publication and availability +> Publication available [here](https://academic.oup.com/bioinformatics/article/38/17/4217/6633304), source code available [here](https://github.com/fawaz-dabbaghieh/bubble_gun). > [!WARNING] Warning -> The function `bfs` in the package starts an infinite loop if target node is on a end of the graph. \ No newline at end of file +> The function `bfs` in the package enters an infinite loop if the target node is at an end of the graph. + +The tool, written in **Python**, can be used both from the command line and as an import in other Python scripts/programs. 
\ No newline at end of file diff --git a/content/Working with graphs/Tools/gfaffix.md b/content/Working with graphs/Tools/gfaffix.md new file mode 100644 index 0000000000000..6475ac1f97414 --- /dev/null +++ b/content/Working with graphs/Tools/gfaffix.md @@ -0,0 +1,34 @@ +--- +title: GFAffix +--- +GFAffix aims to compress shared sequences that are distributed along multiple paths where a path should not fork (for instance, two nodes that could be merged without any consequence on the graph information). + +![[gfaffix-illustration.png]] + +> [!IMPORTANT] Publication and availability +> GFAffix appears to be **not published as of December 2023**. A preprint is being written but has been delayed (see [this issue](https://github.com/marschall-lab/GFAffix/issues/9) of GFAffix). Source code is available [here](https://github.com/marschall-lab/GFAffix). +# Installation +Requires **Rust**, and is available through conda. + +```bash +conda create -n .env-gfaffix +conda activate .env-gfaffix + +conda install -c conda-forge rust +conda install -c bioconda gfaffix + +conda deactivate +``` + +To run GFAffix, the command is: `gfaffix -o `. + +> [!NOTE] Note +> The last step of [[pggb]] applies GFAffix (taken from the docs: "Finally, we apply gfaffix to remove forks where both alternatives have the same sequence.") and [[minigraph-cactus]] applies it in its last step (`cactus-graphmap-join`); however, while applying GFAffix to a PGGB graph returns the same graph, this is not the case for minigraph-cactus. We can expect that GFAffix is either not the last step of `cactus-graphmap-join`, or is run with exclusion patterns. + +# GFAffix and [[editions]] + +From the definition of [[editions]] I came up with, I wanted to see how GFAffix impacted the resulting graph and the distance to other graphs. Unsurprisingly, as the tool is present in both pipelines, the impact of running GFAffix is marginal. 
+ +![[gfaffix_clustering.png]] + +However, on graphs constructed solely using seqwish, the impact of GFAffix is not marginal: 55 editions for a graph with 820 nodes and two haplotypes \ No newline at end of file diff --git a/content/Working with graphs/Tools/gfafix.md b/content/Working with graphs/Tools/gfafix.md deleted file mode 100644 index 96af0854083e6..0000000000000 --- a/content/Working with graphs/Tools/gfafix.md +++ /dev/null @@ -1,29 +0,0 @@ ---- -title: GFAffix ---- -Tool is referenced [here](https://github.com/marschall-lab/GFAffix). It aims to compress shared sequences that are distributed along multiple paths where one path should not change - -# Installation -Requires rust, and is available through conda. - -```bash -conda create .env-gfaffix -conda activate .env-gfaffx - -conda install -c conda-forge rust -conda install -c bioconda gfaffix - -conda deactivate -``` - -To run GFAffix, the command is: `gfaffix -o `. - -> [!NOTE] Note -> The last step of [[pggb]] applies GFAffix (taken from the docs: "Finally, we apply gfaffix to remove forks where both alternatives have the same sequence.") and [[minigraph-cactus]] applies it in it's last step (`cactus-graphmap-join`); however, if applying GFAffix on a PGGB graph returns the same graph, it is not the case for minigraph-cactus. We can expect that GFAffix is not the last step of `cactus-graphmap-join`, or is ran with exclusion patterns. - -# GFAffix and [[editions]] - -From the definition of [[editions]], I wanted to see how GFAffix impacted the resulting graph and the distance to other graphs. Without any surprise as the tool is present in both pipelines, the impact of running GFAffix is marginal. 
- -![[gfaffix_clustering.png]] - diff --git a/content/Working with graphs/catalog.md b/content/Working with graphs/catalog.md new file mode 100644 index 0000000000000..42f1c07f1b9c1 --- /dev/null +++ b/content/Working with graphs/catalog.md @@ -0,0 +1,18 @@ +--- +title: Cataloging pangenomic tools +--- +> [!NOTE] Information +> With hundreds of tools labelled as 'pangenome graph' or 'variation graph' on GitHub, it is practically impossible to keep a complete and comprehensive catalog of tools. + +This section will try to cover as many tools as it can, pointing to existing catalogs and to more in-depth descriptions of tools when I have used them. + +Known catalogs or blogs: ++ [Catalog](https://pangenome.github.io/) from the PGGB team + +Tools: ++ [[bubblegun]], a bubble and superbubble caller ++ [[gfaffix]], a tool to simplify graphs ++ [[gfapy]], a Python library to handle GFA format ++ [[odgi]], a toolkit for pangenomes ++ [[gfagraphs]] (own work) a library to handle GFA format ++ [[pancat]] (own work) a small toolkit for pangenomes \ No newline at end of file diff --git a/content/Working with graphs/editions.md b/content/Working with graphs/editions.md index 69a96ae20f143..708e3e8db24c5 100644 --- a/content/Working with graphs/editions.md +++ b/content/Working with graphs/editions.md @@ -1,3 +1,7 @@ --- title: Compare pangenome graphs --- +> [!NOTE] Information +> I am the author of `pancat`. Thus I only describe its principle; keep this in mind while reading. The method was first presented [here](https://hal.science/hal-04320771v1) and is currently **not** published. + +In order to assess how a graph differs from another, the idea was to compare the segmentation of the two graphs. 
\ No newline at end of file diff --git a/content/_imgs/bar_algorithm.jpg b/content/_imgs/bar_algorithm.jpg new file mode 100644 index 0000000000000..017ec11a41977 Binary files /dev/null and b/content/_imgs/bar_algorithm.jpg differ diff --git a/content/_imgs/cactus_graph.jpg b/content/_imgs/cactus_graph.jpg new file mode 100644 index 0000000000000..e3b7411f8fd7e Binary files /dev/null and b/content/_imgs/cactus_graph.jpg differ diff --git a/content/_imgs/cactus_tree.jpg b/content/_imgs/cactus_tree.jpg new file mode 100644 index 0000000000000..be39a60626f70 Binary files /dev/null and b/content/_imgs/cactus_tree.jpg differ diff --git a/content/_imgs/caf_algorithm.jpg b/content/_imgs/caf_algorithm.jpg new file mode 100644 index 0000000000000..0022f1ba31bd6 Binary files /dev/null and b/content/_imgs/caf_algorithm.jpg differ diff --git a/content/_imgs/gfaffix-illustration.png b/content/_imgs/gfaffix-illustration.png new file mode 100644 index 0000000000000..38be1e5c62a19 Binary files /dev/null and b/content/_imgs/gfaffix-illustration.png differ diff --git a/content/_imgs/injection.png b/content/_imgs/injection.png new file mode 100644 index 0000000000000..a5c1d580c46f1 Binary files /dev/null and b/content/_imgs/injection.png differ diff --git a/content/_imgs/net_chains_threads.jpg b/content/_imgs/net_chains_threads.jpg new file mode 100644 index 0000000000000..b12f4ce9a448c Binary files /dev/null and b/content/_imgs/net_chains_threads.jpg differ diff --git a/content/_imgs/progressive_cactus.png b/content/_imgs/progressive_cactus.png new file mode 100644 index 0000000000000..0ea38a9e8dc79 Binary files /dev/null and b/content/_imgs/progressive_cactus.png differ diff --git a/content/_imgs/snarl_decomposition.jpg b/content/_imgs/snarl_decomposition.jpg new file mode 100644 index 0000000000000..8b7fd54ce7019 Binary files /dev/null and b/content/_imgs/snarl_decomposition.jpg differ diff --git a/content/_imgs/superbubbles.jpg b/content/_imgs/superbubbles.jpg new file mode 
100644 index 0000000000000..57f50ac0db506 Binary files /dev/null and b/content/_imgs/superbubbles.jpg differ diff --git a/content/_imgs/types_of_graphs.jpg b/content/_imgs/types_of_graphs.jpg new file mode 100644 index 0000000000000..1fe855981238e Binary files /dev/null and b/content/_imgs/types_of_graphs.jpg differ