diff --git a/content/Building a graph/minigraph-cactus.md b/content/Building a graph/minigraph-cactus.md
index 9e922b8f11277..7caba28a5e70d 100644
--- a/content/Building a graph/minigraph-cactus.md
+++ b/content/Building a graph/minigraph-cactus.md
@@ -15,7 +15,7 @@ Here are some timings and disk consumption:
 | Bovine | Chromosome 28 (47Mb) | 136MB | 3 | 1.375 min | 43Go | 94MB | 210MB |
 | Bovine | Chromosome 3 (210Mb) | 620MB | 3 | 2.625 min | 71Go | 280MB | 550MB |
-## Step-by-step walkthrough
+## Points of attention
 > [!WARNING] Warning
 > minigraph-cactus uses **toil**, a [pipeline managment system](https://toil.ucsc-cgl.org/) that stores its temporary files in a folder. As of now, if you run step-by-step the pipeline, you won't have any issues, but if you try to recreate on your own one of those steps, or re-run a previously run step, you will encounter errors. Tje safest practice for a standard usecase is to destroy the jobstore manually once the job is finished.
@@ -63,6 +63,10 @@ One can define multiple references, but it won't help for clipping (but for filt
 To create graph with sequence in a specific order that you can control, using the argument `minigraphSortInput="none"` disables default sorting by mash distance. It is to be specified in the cactus config file.
+### Handling repetitive sequences
+
+Taken from the minigraph-cactus paper :
+> Minigraph-Cactus (in common with all MSA tools we know of[24](https://www.biorxiv.org/content/10.1101/2022.10.06.511217v3.full#ref-24)) cannot presently satisfactorily align highly repetitive sequences like satellite arrays, centromeres and telomeres because they lack sufficiently unique subsequences for minigraph to use as alignment seeds. As such, these regions will remain largely unaligned throughout the pipeline and will make the graph difficult to index and map to by introducing vast amounts of redundant sequence.
We recommend clipping them out for most applications and provide the option to do so by removing paths with >N bases that do not align to the underlying SV graph constructed with minigraph (**[Figure 1F](https://www.biorxiv.org/content/10.1101/2022.10.06.511217v3.full#F1)**). In preliminary studies of mapping short reads and calling small variants (see below), we found that even more aggressively filtering the graph helps improve accuracy. For this reason, an optional allele-frequency filter is included to remove nodes of the graph present in fewer than N haplotypes and can be used when making indexes for vg giraffe.
 ### Control graph output filtering
@@ -98,4 +102,42 @@ Details about steps:
 Pipeline can also be executed in a single command:
 ```bash
 cactus-pangenome --outDir --outName --reference
-```
\ No newline at end of file
+```
+
+## Step-by-step walkthrough
+
+> [!IMPORTANT] Beware
+> Reference documentation [can be found here](https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md). The statements below come from my own analysis of the pipeline.
+
+At each step, **toil** seems to be called and parameters of the current command are added to it. This may explain why **toil** is missing some files when each step is executed separately.
+### STEP 1 : minigraph
+Script [can be found here](https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/src/cactus/refmap/cactus_minigraph.py). It simply builds a minigraph graph within a **toil** pipeline, with a cactus seqfile as input.
+
+Inputs :
++ a set of input fasta
+
+In this first step, sequences are ordered (by default) by their mash distance to the reference. GFA is computed within toil. At the end of this step, the [GFA from minigraph is exported](https://github.com/ComparativeGenomicsToolkit/cactus/blob/55a5a3f4cc928b646367610ca76bf9b4f42e4769/src/cactus/refmap/cactus_minigraph.py#L125C31-L125C31).
+
+The following are performed :
++ loading of the cactus seqfile
++ verification of sample names, given a few rules :
+    + the "." character is overloaded to specify haplotype
+    + a file is invalid if it starts with "."
+    + if the file ends with a ".", there must be one numeric character after it to specify haplotype
+    + all sequence names that are prefixed by the reference name are invalid (naming convention for *graphmap-join*)
++ import of the sequences from the tree of the seqfile into the **toil** jobstore, stored as .fa
++ (if requested) a [sanitization of the fasta headers](https://github.com/ComparativeGenomicsToolkit/cactus/blob/55a5a3f4cc928b646367610ca76bf9b4f42e4769/src/cactus/preprocessor/checkUniqueHeaders.py#L37)
+> It will strip everything up to and including last # so HG002#0#chr2 would get changed to just chr2 (and then id=HG002| would be prefixed as above) this keeps redundant information out of the sequence names, otherwise it can be duplicated in the final output. it also keeps # symbols out of sequence names
++ (if requested) a mash-distance ordering of the input sequences
++ execution of minigraph with parameters from the XML file
+
+In a [cactus discussion](https://github.com/orgs/ComparativeGenomicsToolkit/discussions/1254) it was noted that minigraph's parallelism comes from processing each chromosome on a separate thread (as minigraph does not take inter-chromosomal events into account).
+### STEP 2 : cactus-graphmap
+Script [can be found here](https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/src/cactus/refmap/cactus_graphmap.py). It aligns the fasta sequences to the minigraph graph created at step 1.
+
+Inputs :
++ a minigraph graph
++ a set of input fasta
+Outputs :
++ PAF file which aligns each fasta to the contig sequences of the graph
++ multifasta containing graph contigs, treated as an assembly by cactus later on
\ No newline at end of file
diff --git a/content/Publications/Publications.canvas b/content/Publications/Publications.canvas
index aa1e4d39828c0..28bb55fecb33d 100644
--- a/content/Publications/Publications.canvas
+++ b/content/Publications/Publications.canvas
@@ -75,7 +75,6 @@
 {"id":"c17ec7ef81f35b4b","type":"text","text":"# Superbubbles, ultrabubbles and cacti\nURL : https://pubmed.ncbi.nlm.nih.gov/29461862/\n\nA superbubble is a type of directed acyclic subgraph with single distinct source and sink vertices. In genome assembly and genetics, the possible paths through a superbubble can be considered to represent the set of possible sequences at a location in a genome. Bidirected and biedged graphs are a generalization of digraphs that are increasingly being used to more fully represent genome assembly and variation problems. In this study, we define snarls and ultrabubbles, generalizations of superbubbles for bidirected and biedged graphs, and give an efficient algorithm for the detection of these more general structures. Key to this algorithm is the cactus graph, which, we show, encodes the nested decomposition of a graph into snarls and ultrabubbles within its structure. We propose and demonstrate empirically that this decomposition on bidirected and biedged graphs solves a fundamental problem by defining genetic sites for any collection of genomic variations, including complex structural variations, without need for any single reference genome coordinate system. 
Further, the nesting of the decomposition gives a natural way to describe and model variations contained within large variations, a case not currently dealt with by existing formats (e.g., variant cell format (VCF)).","x":-1420,"y":1314,"width":1220,"height":360,"color":"4"}, {"id":"8392d7487ecced04","type":"text","text":"cacti jamais défini ?","x":-441,"y":1284,"width":250,"height":60,"color":"3"}, {"id":"8ff05f3c7e0b7038","type":"text","text":"# Pangenome graph construction from genome alignments with Minigraph-Cactus\nURL: https://www.nature.com/articles/s41587-023-01793-w\n\nPangenome references address biases of reference genomes by storing a representative set of diverse haplotypes and their alignment, usually as a graph. Alternate alleles determined by variant callers can be used to construct pangenome graphs, but advances in long-read sequencing are leading to widely available, high-quality phased assemblies. Constructing a pangenome graph directly from assemblies, as opposed to variant calls, leverages the graph’s ability to represent variation at different scales. Here we present the Minigraph-Cactus pangenome pipeline, which creates pangenomes directly from whole-genome alignments, and demonstrate its ability to scale to 90 human haplotypes from the Human Pangenome Reference Consortium. The method builds graphs containing all forms of genetic variation while allowing use of current mapping and genotyping tools. We measure the effect of the quality and completeness of reference genomes used for analysis within the pangenomes and show that using the CHM13 reference from the Telomere-to-Telomere Consortium improves the accuracy of our methods. 
We also demonstrate construction of a _Drosophila melanogaster_ pangenome.","x":-5520,"y":4600,"width":1200,"height":360,"color":"4"}, - {"id":"c6fb2f397f3cf773","type":"text","text":"# The design and construction of reference pangenome graphs with minigraph\nURL: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02168-z\n\nThe recent advances in sequencing technologies enable the assembly of individual genomes to the quality of the reference genome. How to integrate multiple genomes from the same species and make the integrated representation accessible to biologists remains an open challenge. Here, we propose a graph-based data model and associated formats to represent multiple genomes while preserving the coordinate of the linear reference genome. We implement our ideas in the minigraph toolkit and demonstrate that we can efficiently construct a pangenome graph and compactly encode tens of thousands of structural variants missing from the current reference genome.","x":-6160,"y":2728,"width":1140,"height":260,"color":"4"}, {"id":"799f68e892314b6b","type":"text","text":"Utilise des assemblies ancestrales reconstruites afin de combiner les sous-alignements","x":-1932,"y":3707,"width":290,"height":120}, {"id":"a4cd608c024b8594","type":"text","text":"2 à 5 génomes par sous-alignement","x":-1747,"y":3919,"width":250,"height":60}, {"id":"6dc45fab5aead6a1","type":"text","text":"découpage récursif selon le guide tree","x":-2066,"y":3919,"width":250,"height":60}, @@ -94,10 +93,6 @@ {"id":"4b073f56cef38de7","type":"text","text":"Peut-être nécessité d'un outil qui transforme des pangenome graphs en DAGs ?","x":-2988,"y":4980,"width":297,"height":120,"color":"5"}, {"id":"2cd0774814970b9a","type":"text","text":"binary tree qui n'a pas besoin d'être complètement résolu","x":-2362,"y":3949,"width":250,"height":102}, {"id":"8eb4486aa40f0ee0","type":"text","text":"Nouveau papier en préparation \"Haplotype-aware Sequence-to-Graph 
Alignment\"","x":-3265,"y":4160,"width":454,"height":90,"color":"3"}, - {"id":"b6da29614bf625f9","type":"text","text":"# Construction and representation of human pangenome graphs\nURL: https://pasteur.hal.science/pasteur-04126278/\n\nAs a single reference genome cannot possibly represent all the variation present across human individuals, pangenome graphs have been introduced to incorporate population diversity within a wide range of genomic analyses. Several data structures have been proposed for representing collections of genomes as pangenomes, in particular graphs. In this work we collect all publicly available high-quality human haplotypes and constructed the largest human pangenome graphs to date, incorporating 52 individuals in addition to two synthetic references (CHM13 and GRCh38). We build variation graphs and de Bruijn graphs of this collection using five of the state-of-the-art tools: Bifrost , mdbg , Minigraph , Minigraph-Cactus and pggb . We examine differences in the way each of these tools represents variations between input sequences, both in terms of overall graph structure and representation of specific genetic loci. 
This work sheds light on key differences between pangenome graph representations, informing end-users on how to select the most appropriate graph type for their application.","x":-4570,"y":5600,"width":1170,"height":335,"color":"4"}, - {"id":"959c6871b98e5f24","type":"text","text":"Travail comparatif sur :\n+ Bifrost\n+ mdbg\n+ Minigraph\n+ Minigraph-cactus\n+ PGGB","x":-3284,"y":5570,"width":250,"height":230}, - {"id":"2ad06502c64301be","type":"text","text":"Colored compacted de Brujin graphs (ccdbg)","x":-2941,"y":5625,"width":250,"height":60}, - {"id":"c1747efd88b1391c","type":"text","text":"Graphes de variation","x":-2941,"y":5708,"width":250,"height":60}, {"id":"f51cc8b5120d57bd","type":"text","text":"Des *chain pairs* contigues dans une *chain* partagent deux côtés opposés d'une même arête noire","x":-1294,"y":2738,"width":340,"height":100}, {"id":"654e9a95fdb293f5","type":"file","file":"_imgs/superbubbles.jpg","x":-540,"y":1724,"width":400,"height":179}, {"id":"837c14414c238af7","type":"text","text":"Strictement disjointes","x":-644,"y":2180,"width":250,"height":60}, @@ -120,7 +115,19 @@ {"id":"012805a4b06cc5b4","type":"text","text":"Objectif : pouvoir faire du clustering sur le graphe. 
Utile pour les *seed-and-extend*","x":-573,"y":4350,"width":380,"height":92,"color":"1"}, {"id":"b3ab8ceb14f25665","type":"text","text":"Objectif : à partir du snarl tree, trouver une distance minimale entre deux points dans le graphe","x":-573,"y":3670,"width":321,"height":115,"color":"1"}, {"id":"79d85b0031c6ce3f","type":"text","text":"Initialisation avec chaque position dans un cluster séparé, puis aggrégation progressive en suivant le *snarl tree*","x":222,"y":4202,"width":360,"height":120}, - {"id":"24a49c27acfb97ef","type":"text","text":"A chaque étape, annotation du cluster avec deux *bondary distance* : les plus courtes distances depuis n'importe laquelle des positions jusqu'aux *boundaries* de la structure","x":174,"y":4396,"width":456,"height":120} + {"id":"24a49c27acfb97ef","type":"text","text":"A chaque étape, annotation du cluster avec deux *bondary distance* : les plus courtes distances depuis n'importe laquelle des positions jusqu'aux *boundaries* de la structure","x":174,"y":4396,"width":456,"height":120}, + {"id":"c6fb2f397f3cf773","type":"text","text":"# The design and construction of reference pangenome graphs with minigraph\nURL: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02168-z\n\nThe recent advances in sequencing technologies enable the assembly of individual genomes to the quality of the reference genome. How to integrate multiple genomes from the same species and make the integrated representation accessible to biologists remains an open challenge. Here, we propose a graph-based data model and associated formats to represent multiple genomes while preserving the coordinate of the linear reference genome. 
We implement our ideas in the minigraph toolkit and demonstrate that we can efficiently construct a pangenome graph and compactly encode tens of thousands of structural variants missing from the current reference genome.","x":-6520,"y":3120,"width":1140,"height":260,"color":"4"}, + {"id":"b6da29614bf625f9","type":"text","text":"# Construction and representation of human pangenome graphs\nURL: https://pasteur.hal.science/pasteur-04126278/\n\nAs a single reference genome cannot possibly represent all the variation present across human individuals, pangenome graphs have been introduced to incorporate population diversity within a wide range of genomic analyses. Several data structures have been proposed for representing collections of genomes as pangenomes, in particular graphs. In this work we collect all publicly available high-quality human haplotypes and constructed the largest human pangenome graphs to date, incorporating 52 individuals in addition to two synthetic references (CHM13 and GRCh38). We build variation graphs and de Bruijn graphs of this collection using five of the state-of-the-art tools: Bifrost , mdbg , Minigraph , Minigraph-Cactus and pggb . We examine differences in the way each of these tools represents variations between input sequences, both in terms of overall graph structure and representation of specific genetic loci. 
This work sheds light on key differences between pangenome graph representations, informing end-users on how to select the most appropriate graph type for their application.","x":-3631,"y":5510,"width":1170,"height":335,"color":"4"}, + {"id":"959c6871b98e5f24","type":"text","text":"Travail comparatif sur :\n+ Bifrost\n+ mdbg\n+ Minigraph\n+ Minigraph-cactus\n+ PGGB","x":-2345,"y":5480,"width":250,"height":230}, + {"id":"2ad06502c64301be","type":"text","text":"Colored compacted de Brujin graphs (ccdbg)","x":-2002,"y":5535,"width":250,"height":60}, + {"id":"c1747efd88b1391c","type":"text","text":"Graphes de variation","x":-2002,"y":5618,"width":250,"height":60}, + {"id":"a07c9f18c29efd3c","type":"text","text":"Continuation de cette définition de **snarls** pour définir la variation","x":-4520,"y":5200,"width":320,"height":80}, + {"id":"395da1694bf23728","type":"text","text":"Construction d'un SV graph avec **minigraph**","x":-5280,"y":5240,"width":250,"height":80,"color":"3"}, + {"id":"565c2f071796974b","x":-5030,"y":5715,"width":400,"height":130,"type":"text","text":"Enlever les alignements incomplets et fallacieux correspondant à de coutes chaînes visitées par un grand nombre de séquences"}, + {"id":"de1f18f12d0c57b5","x":-5280,"y":5935,"width":250,"height":60,"color":"3","type":"text","text":"Sortie en HAL, converti en vg par *hal2vg*"}, + {"id":"a126e0f80a9ae345","type":"text","text":"Construction d'un **cactus graph**","x":-5280,"y":5555,"width":250,"height":80,"color":"3"}, + {"id":"56c7e177b0c92f4b","x":-4955,"y":5380,"width":250,"height":130,"type":"text","text":""}, + {"id":"502d8f0c671f3d9f","x":-5520,"y":5685,"width":250,"height":95,"type":"text","text":"POA pour l'induction du graphe cactus sur l'alignement"} ], "edges":[ {"id":"ebe3edacf1266866","fromNode":"8995e775a8dbba70","fromSide":"right","toNode":"f48a4c7882b18109","toSide":"left"}, @@ -243,6 +250,15 @@ 
{"id":"bf01cb52323b6ff1","fromNode":"c00023f9483a734f","fromSide":"bottom","toNode":"012805a4b06cc5b4","toSide":"top"}, {"id":"4e0e2554d28232b3","fromNode":"41575dca10c5c02d","fromSide":"bottom","toNode":"012805a4b06cc5b4","toSide":"top"}, {"id":"69656e75ed4cba68","fromNode":"79d85b0031c6ce3f","fromSide":"bottom","toNode":"24a49c27acfb97ef","toSide":"top"}, - {"id":"3748c68fbbc6330c","fromNode":"012805a4b06cc5b4","fromSide":"right","toNode":"79d85b0031c6ce3f","toSide":"left"} + {"id":"3748c68fbbc6330c","fromNode":"012805a4b06cc5b4","fromSide":"right","toNode":"79d85b0031c6ce3f","toSide":"left"}, + {"id":"c10eb4c71860abd8","fromNode":"8ff05f3c7e0b7038","fromSide":"bottom","toNode":"a07c9f18c29efd3c","toSide":"left"}, + {"id":"aa4f4021e880fb10","fromNode":"8ff05f3c7e0b7038","fromSide":"bottom","toNode":"395da1694bf23728","toSide":"top"}, + {"id":"b6e51a72ccc8f98c","fromNode":"395da1694bf23728","fromSide":"bottom","toNode":"a126e0f80a9ae345","toSide":"top","color":"3"}, + {"id":"38102332928a2203","fromNode":"a126e0f80a9ae345","fromSide":"right","toNode":"565c2f071796974b","toSide":"top"}, + {"id":"a341a564be10134c","fromNode":"565c2f071796974b","fromSide":"bottom","toNode":"de1f18f12d0c57b5","toSide":"right"}, + {"id":"7f1e4d0002764707","fromNode":"a126e0f80a9ae345","fromSide":"bottom","toNode":"de1f18f12d0c57b5","toSide":"top","color":"3"}, + {"id":"55175decf88aca1b","fromNode":"56c7e177b0c92f4b","fromSide":"bottom","toNode":"a126e0f80a9ae345","toSide":"right"}, + {"id":"e1f5e3d22abc686a","fromNode":"395da1694bf23728","fromSide":"right","toNode":"56c7e177b0c92f4b","toSide":"top"}, + {"id":"76408abc1e2abaa4","fromNode":"a126e0f80a9ae345","fromSide":"bottom","toNode":"502d8f0c671f3d9f","toSide":"right"} ] } \ No newline at end of file
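
Review note on the STEP 1 header sanitization quoted in the minigraph-cactus.md hunk: the renaming rule can be illustrated with a small Python sketch. This is not the code from `checkUniqueHeaders.py`; the function name `sanitize_name` and its `event` parameter are hypothetical, and the sketch only follows the quoted description (strip everything up to and including the last `#`, then prefix `id=<sample>|`).

```python
def sanitize_name(name: str, event: str) -> str:
    """Sketch of the quoted renaming rule (hypothetical helper, not cactus code).

    name  -- a sequence name without the leading '>' (e.g. "HG002#0#chr2")
    event -- the sample name taken from the seqfile (assumed, e.g. "HG002")
    """
    # Keep only what follows the last '#'; names without '#' are left unchanged.
    contig = name.rsplit("#", 1)[-1]
    # Prefix the sample identifier, as described in the quoted comment.
    return f"id={event}|{contig}"


if __name__ == "__main__":
    print(sanitize_name("HG002#0#chr2", "HG002"))  # id=HG002|chr2
    print(sanitize_name("chr2", "HG002"))          # id=HG002|chr2
```

Under this reading, PanSAM-style names and plain contig names end up with the same form, which would explain why redundant sample information does not get duplicated in the final output.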