One huge cluster #39
Comments
Hi Orctyr, although I have seen large clusters myself, they've been on the order of a few times the expected size of a bacterial genome, rather than 100s of times.
Can you post the bin3C log from mkmap and cluster here?
Hi cerebis,
bin3c_mkmap.log bin3c_clust.log
DEBUG | 2021-11-22 10:20:14,344 | mzd.contact_map | Making raster image
Thanks @orctyr. None of that log stands out to me as problematic.
Switching to a development branch
I have overlooked that the bin3C master branch has fallen quite far behind my development work. To begin with, can you please check out the Python 3 branch and see what you get as a result, without any of the work below?
The use of bin3C should be the same as before. Using this codebase will ensure that my suggestions below will be possible.
QA testing with qc3C
Would you happen to have any quality metrics for your Hi-C library? Ideally, it would be great if you could run qc3C over your data. Since you have a BAM file already, I would suggest using:
You can get qc3C from DockerHub:
qc3C is also installable via conda on Linux. qc3C produces a log, but the output to the console is just as informative. If you could post that here too, it will help on two fronts: 1) it will establish the quality of your Hi-C library, and 2) it will give us an estimate of library fragment size. We can then use this information to restrict contact map generation to include only pairs with sufficiently large separation.
More stringent contact map generation
Taking the estimate for mean insert/fragment size from qc3C, we can restrict contact map generation to include only pairs with a larger separation distance. This helps to exclude noise (spurious pairs) from incorporation. I would normally take 3x the estimated fragment size from qc3C, or in a pinch (if qc3C isn't working for you) try 1000 bp.
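The rule of thumb above is simple arithmetic, and a minimal sketch may make it concrete. The function name `min_separation` and its parameters are hypothetical and are not part of bin3C or qc3C; the 280 bp figure is the mean insert size reported later in this thread.

```python
def min_separation(fragment_size=None, multiplier=3.0, fallback=1000):
    """Suggest a minimum pair separation (bp) for contact map generation.

    Hypothetical helper: applies the "3x the estimated fragment size"
    rule of thumb, or falls back to 1000 bp when no estimate exists.
    """
    if fragment_size is None:
        # No qc3C estimate available: use the fallback value.
        return fallback
    return int(round(fragment_size * multiplier))


# Mean insert size of 280 bp, as reported later in this thread.
print(min_separation(280))        # 3x the estimate
print(min_separation(280, 2.5))   # a slightly less stringent 2.5x
print(min_separation())           # no estimate available
```

How the resulting value is passed to bin3C depends on the branch in use, so check the mkmap help output rather than assuming an option name.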
Hi Cerebis, Your cmd: qc3C bam --enzyme Sau3AI --fasta H54_mNGS_megahit.fa --bam hic2ctg_unsorted.bam --output-path qc3c_out.
Yes, sorry. In reality, both bin3C and qc3C expect a name-sorted BAM file. I think the older release (master branch) of bin3C may not make an explicit check for ordering, whereas the newer py3 branch will object if your BAM does not declare a sorting order in the header. The reason this worked without explicit sorting is that the default output of bwa is effectively name-sorted. However, the header is not set to declare it. I am not sure if this was a conscious developer decision or not -- possibly because bwa does not want to guarantee the order.
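To see what a BAM header actually declares, one option is to dump it with `samtools view -H hic2ctg_unsorted.bam` and look at the `SO:` field on the `@HD` line. The sketch below parses header text in that form; the function name `declared_sort_order` is hypothetical, and this is only an illustration of the check, not the code bin3C or qc3C use.

```python
def declared_sort_order(header_text):
    """Return the SO value from the @HD line of a SAM header, or None.

    Hypothetical helper. The SAM specification allows SO to be one of
    unknown, unsorted, queryname or coordinate.
    """
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs.
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
            return None
    # No @HD line at all, so no order is declared.
    return None


# Illustrative header text, not taken from the files in this thread.
with_order = "@HD\tVN:1.6\tSO:queryname\n@SQ\tSN:contig_1\tLN:5000\n"
without_order = "@SQ\tSN:contig_1\tLN:5000\n"

print(declared_sort_order(with_order))
print(declared_sort_order(without_order))
```

If no order is declared, `samtools sort -n` will name-sort the file and record `SO:queryname` in the header.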
Hi Cerebis, qc3c bam module log file:
Metagenomic Hi-C libraries tend to have much less signal than clonal eukaryotic experiments, so low values (<10%) are not uncommon. The signal strength (proportion of informative Hi-C pairs) of your library, however, is a bit lower than we'd like to see. Because of this, I would not raise the minimum signal threshold, as there is not enough signal to spare.
A little breakdown of the qc3C log.
This is good, since you need pairs spanning contigs to bin, however this value is also biased by the degree of assembly fragmentation.
These basic statistics indicate that the majority of pairs are either not a product of Hi-C proximity ligation or are from very closely occurring loci. There is still utility in Hi-C pairs with small separation, but they lack the same power as long-range pairs in drawing contigs across a chromosome together. We are instead relying more on contacts between tips of adjacent contigs.
Normally read-thru is a strong indicator of Hi-C products in a library, where the join between loci is traversed by one of the reads (or both) in a pair. The join contains an artefact which, although it can occur naturally, is also a hint for Hi-C. The size of the insert relative to read length obviously affects how often this occurs. Since you have 150 bp reads and a 280 bp mean insert size, we'd expect to see it sometimes, but usually more pronounced overlap is required for frequent observation. Further, seeing read-thru and a secondary split mapping of the remaining part of the read is less common. Together, I would suspect you do have a good proportion of Hi-C pairs, only the majority of them short-range. I see you are using Sau3AI, so I was curious whether this library was produced using the Phase kit?
Going forward
I would suggest you use the py3 branch and try creating two contact maps with minimum insert sizes of 700 and then 840 (2.5x and 3x the insert mean). Hopefully you will eliminate the giant cluster with one of these runs.
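On the read-thru point in this comment, a small worked calculation shows why the overlap is modest for this library. This assumes "overlap" means the number of bases covered by both reads of a pair, and that every fragment is exactly the mean insert size, which real libraries are not; `pair_overlap` is a hypothetical helper.

```python
def pair_overlap(read_len, insert_size):
    """Bases covered by both reads of a pair (0 if the reads do not meet).

    Hypothetical helper: two reads of length read_len sequenced inward
    from each end of a fragment of length insert_size.
    """
    return max(0, 2 * read_len - insert_size)


# 150 bp reads and a 280 bp mean insert, as discussed in this thread.
print(pair_overlap(150, 280))
```

With these numbers the two reads share only 20 bp, which fits the remark that a more pronounced overlap is usually needed before read-thru is seen often.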
Hi Cerebis, the bin3C-python3 version could not be downloaded successfully here. Install log:
Hi cerebis,
I used bin3C to build MAGs from Hi-C data (50G WGS and 30G Hi-C).
Both bin3C steps ran successfully.
But the MAG results were a little strange. There are more than 200 MAGs (230 Mb in total), and the first MAG contains 220 Mb.
I annotated each contig with CAT, and there were clearly many different species within it.
I tried changing --min-signal from 5 to 10, but it did not help.
Do you have any suggestions for this situation?
Best,
Orctyr