One huge cluster #39

orctyr · 2021-11-22T13:51:50Z

Hi cerebis,

I used bin3C to build MAGs from Hi-C data (50G WGS and 30G Hi-C).
Both bin3C steps ran successfully.
But the MAG results look a little strange. There are more than 200 MAGs (230 Mb in total), and the first MAG alone contains 220 Mb.
I annotated each contig with CAT, and there are clearly many different species within it.
I tried changing --min-signal from 5 to 10, but it did not help.
Do you have any suggestions for this situation?

Best,
Orctyr

cerebis · 2021-11-23T02:21:32Z

Hi Orctyr, although I have seen large clusters myself, they've been on the order of a few times the expected size of a bacterial genome, rather than 100s of times.

Firstly, what software was used to create the BAM, and what was the precise command line?
Next, can you tell me how the reads were cleaned? It is important to do this for both shotgun (prior to assembly) and Hi-C reads (prior to mapping). I used to use BB-suite for this task, but more recently I have chosen fastp.

Can you post the bin3C log from mkmap and cluster here?

orctyr · 2021-11-23T02:46:18Z

Hi cerebis,

BAM file
bwa mem -5SP contigs.fasta hic_paired.fastq.gz | samtools view -F 0x904 -bS - > hic2ctg_unsorted.bam
samtools sort -@ 8 -n hic2ctg_unsorted.bam hic2ctg #final got hic2ctg.bam file for bin3c
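For readers unfamiliar with the `-F 0x904` filter in the command above, here is a minimal sketch that decodes the mask using the bit definitions from the SAM specification. The helper name `decode_flag` is only illustrative and is not part of samtools or bin3C.

```python
# Bit definitions for the SAM FLAG field, as given in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(mask):
    """Return the names of all SAM flag bits set in mask (hypothetical helper)."""
    return [name for bit, name in sorted(SAM_FLAGS.items()) if mask & bit]


# samtools view -F <mask> drops any alignment with at least one of these bits set.
print(decode_flag(0x904))
```

So the filter discards unmapped reads plus secondary and supplementary alignments, leaving only primary alignments of mapped reads.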
QC step
Trimmomatic was used to remove low-quality reads and adaptors

bin3c_mkmap.log
DEBUG | 2021-11-22 09:00:21,456 | main | bin3C v0.1.1
DEBUG | 2021-11-22 09:00:21,456 | main | 2.7.15 | packaged by conda-forge | (default, Mar 5 2020, 14:56:06) [GCC 7.3.0]
DEBUG | 2021-11-22 09:00:21,456 | main | Command line: bin3C.py mkmap -e Sau3AI --eta -v H54_mNGS_megahit.fa hic2ctg-3.bam bin3c_mkmap
INFO | 2021-11-22 09:00:55,002 | mzd.contact_map | Reading sequences...
INFO | 2021-11-22 09:00:55,197 | mzd.contact_map | Accepted 70244 sequences covering 426495174 bp
INFO | 2021-11-22 09:00:55,197 | mzd.contact_map | References excluded: {'seq_missing': 0, 'too_short': 0}
INFO | 2021-11-22 09:00:55,197 | mzd.contact_map | Counting reads in bam file...
INFO | 2021-11-22 09:04:11,240 | mzd.contact_map | BAM file contains 175038961 alignments
INFO | 2021-11-22 09:25:55,360 | mzd.contact_map | Pair accounting: OrderedDict([('poor_match', 23821762), ('not_tip', 0), ('short_insert', 0), ('ref_excluded', 0), ('accepted', 62833944), ('median_excluded', 0), ('end_buffered', 0)])
INFO | 2021-11-22 09:25:55,481 | mzd.contact_map | Total extent map weight 96513729
DEBUG | 2021-11-22 09:25:55,483 | mzd.contact_map | Setting primary acceptance mask with filtering criterion min_len: 1000 min_sig: 5
DEBUG | 2021-11-22 09:25:55,484 | mzd.contact_map | Minimum length threshold removing: 0
DEBUG | 2021-11-22 09:26:01,399 | mzd.contact_map | Minimum signal threshold removing: 35404
DEBUG | 2021-11-22 09:26:01,409 | mzd.contact_map | Accepted sequences: 34840
INFO | 2021-11-22 09:26:01,613 | main | Saving contact map instance

DEBUG | 2021-11-22 10:20:14,344 | mzd.contact_map | Making raster image
DEBUG | 2021-11-22 10:21:06,982 | matplotlib.font_manager | findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=10.0 to DejaVu Sans (u'*/miniconda3/envs/py2.7/lib/python2.7/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
DEBUG | 2021-11-22 10:25:57,023 | mzd.contact_map | Saving plot

cerebis · 2021-11-23T03:32:45Z

Thanks @orctyr.

None of that log stands out to me as problematic.
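As a quick sanity check, the figures in the mkmap log can be summarised with a few lines of arithmetic. This is only a sketch over numbers copied from the log above; the variable names are mine and nothing here calls bin3C itself.

```python
# Non-zero categories from the "Pair accounting" line of the mkmap log.
pair_accounting = {"poor_match": 23821762, "accepted": 62833944}

total_pairs = sum(pair_accounting.values())
accepted_fraction = pair_accounting["accepted"] / total_pairs

# Sequence counts from the same log.
sequences_total = 70244          # "Accepted 70244 sequences"
removed_by_min_signal = 35404    # "Minimum signal threshold removing"
sequences_kept = 34840           # "Accepted sequences"
kept_fraction = sequences_kept / sequences_total

print(f"pairs accepted: {accepted_fraction:.1%}")
print(f"contigs kept after min_sig filter: {kept_fraction:.1%}")
```

Roughly 73% of pairs were accepted, and just under half of the contigs survived the minimum signal filter at min_sig 5.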

Switching to a development branch

I had overlooked that the bin3C master branch has fallen quite far behind my development work. To begin with, can you please check out the Python 3 branch and see what result you get, before trying any of the steps below.

pip install git+https://github.com/cerebis/bin3C@py3

The use of bin3C should be the same as before.

Using this codebase will ensure that my suggestions below will be possible.

QA testing with qc3C.

Would you happen to have any quality metrics for your Hi-C library? Ideally, it would be great if you could run qc3C over your data. Since you have a BAM file already, I would suggest using:

qc3C bam --enzyme Sau3AI --fasta H54_mNGS_megahit.fa --bam hic2ctg_unsorted.bam --output-path qc3c_out

You can get qc3C from DockerHub: docker pull cerebis/qc3c and the command would then require mapping your data folder, something like:

docker run -v $PWD/mydata:/app cerebis/qc3c bam --enzyme Sau3AI --fasta H54_mNGS_megahit.fa --bam hic2ctg_unsorted.bam --output-path qc3c_out

qc3C is also installable via conda on Linux.

qc3C produces a log, but the output to the console is just as informative. Could you post that here too?

It will help on two fronts. 1) It will establish the quality of your Hi-C library and 2) it will give us an estimate of library fragment size. We can then use this information to restrict contact map generation to include only pairs with sufficiently large separation.

More stringent contact map generation

Taking the estimate for mean insert/fragment size from qc3C, we can restrict contact map generation to include only pairs with a larger separation distance. This helps to exclude noise (spurious pairs) from incorporation. I would normally take 3x the estimated fragment size from qc3C or, in a pinch (if qc3C isn't working for you), try 1000 bp.

bin3C mkmap -e Sau3AI --min-insert 1000 --eta -v H54_mNGS_megahit.fa hic2ctg-3.bam bin3c_mkmap

orctyr · 2021-11-23T12:05:58Z

Hi Cerebis,

Your command: qc3C bam --enzyme Sau3AI --fasta H54_mNGS_megahit.fa --bam hic2ctg_unsorted.bam --output-path qc3c_out
Should that be "hic2ctg_unsorted.bam" or "hic2ctg.bam"?
The qc3C GitHub page says it needs a sorted BAM file:
bwa index ref.fa
bwa mem -5SP ref.fna.gz hic_reads.fq.gz | samtools view -bS - | samtools sort -n -o hic_to_ref.bam -

cerebis · 2021-11-23T23:08:31Z

Yes, sorry.

In reality, both bin3C and qc3C expect a name sorted BAM file. I think the older release (master branch) of bin3C may not make an explicit check for ordering, whereas the newer py3 branch will object if your BAM does not declare a sorting order in the header.

The reason this worked without explicit sorting is that the default output of bwa is effectively name-sorted. However, the header is not set to declare it. I am not sure whether this was a conscious developer decision, possibly because bwa does not want to guarantee the order.
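If you want to see what your BAM header declares, the sort order lives in the SO tag of the @HD line (per the SAM specification, with values unknown, unsorted, queryname or coordinate). Below is a minimal sketch that reads it from header text such as the output of `samtools view -H`. The function name is illustrative, and I am assuming, not asserting, that this is the field the py3 branch inspects.

```python
def declared_sort_order(header_text):
    """Return the SO value from the @HD line of a SAM header.

    Falls back to 'unknown', the SAM specification default, when there is
    no @HD line or the line carries no SO tag.
    """
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return "unknown"


# Example header text; the contig name and length are made up for illustration.
example = "@HD\tVN:1.6\tSO:queryname\n@SQ\tSN:contig_1\tLN:1000"
print(declared_sort_order(example))
```

A BAM written by `samtools sort -n` should report queryname here, whereas raw bwa output typically has no SO declaration at all.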

orctyr · 2021-11-24T04:39:37Z

Hi Cerebis,

cerebis · 2021-11-24T10:58:57Z

Metagenomic Hi-C libraries tend to have much less signal than clonal eukaryotic experiments, so low values (<10%) are not uncommon. The signal strength (proportion of informative Hi-C pairs) of your library, however, is a bit lower than we'd like to see.

Because of this, I would not raise the minimum signal threshold, as there is not enough signal to spare.

A little breakdown of the qc3C log.

Of 95M accepted pairs, 65M mapped across contigs.

This is good, since you need pairs spanning contigs to bin; however, this value is also biased by the degree of assembly fragmentation.

Of the 29M cis-mapping (same contig) pairs, 98% were short-range (that is, pair separation < 1000 bp).

These basic statistics indicate that the majority of pairs are either not a product of Hi-C proximity ligation or are from very closely occurring loci. There is still utility in Hi-C pairs with small separation, but they lack the same power as long-range pairs in drawing contigs across a chromosome together. We are instead relying more on contacts between tips of adjacent contigs.

The fraction of pairs with observable read-thru is high.

Normally read-thru is a strong indicator of Hi-C products in a library, where the join between loci is traversed by one of the reads (or both) in a pair. The join contains an artefact which, although it can occur naturally, is also a hint of Hi-C.

The size of the insert relative to read length obviously affects how often this occurs. Since you have 150 bp reads and a 280 bp mean insert size, we'd expect to see it sometimes. But usually more pronounced overlap is required for frequent observation. Further, seeing read-thru and a secondary split mapping of the remaining part of the read is less common. Taken together, I suspect you do have a good proportion of Hi-C pairs, but the majority of them are short-range.

I see you are using Sau3AI, so I was curious if this library was produced using the Phase kit?

Going forward

I would suggest you use the py3 branch and try creating two contact maps with minimum insert sizes of 700 and then 840 (2.5x and 3x insert mean).

Hopefully you will eliminate the giant cluster with one of these runs.
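To make the two runs concrete, here is a small sketch that derives the thresholds from the 280 bp mean insert and prints the corresponding mkmap command lines. The flags are the ones used earlier in this thread; the output directory names are placeholders I made up, and the helper `mkmap_command` is not part of bin3C.

```python
import shlex

MEAN_INSERT = 280  # bp, mean insert size quoted above


def mkmap_command(min_insert, fasta="H54_mNGS_megahit.fa", bam="hic2ctg-3.bam"):
    """Build a bin3C mkmap argument list for a given minimum insert size."""
    out_dir = f"bin3c_mkmap_min{min_insert}"  # hypothetical output directory name
    return [
        "bin3C", "mkmap", "-e", "Sau3AI",
        "--min-insert", str(min_insert),
        "--eta", "-v", fasta, bam, out_dir,
    ]


for multiplier in (2.5, 3.0):
    min_insert = round(MEAN_INSERT * multiplier)  # 700 and 840
    print(shlex.join(mkmap_command(min_insert)))
```

Writing each map to its own directory keeps the two results side by side, so the clusterings can be compared afterwards.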

orctyr · 2021-11-25T13:56:50Z

Hi Cerebis,

The Python 3 version of bin3C could not be downloaded successfully here.
Could I use the Python 2 version of bin3C going forward?

install log:
Collecting git+https://github.com/cerebis/bin3C@py3
Cloning https://github.com/cerebis/bin3C (to revision py3) to /tmp/pip-req-build-1vyhcl8u
Running command git clone --filter=blob:none -q https://github.com/cerebis/bin3C /tmp/pip-req-build-1vyhcl8u
Running command git checkout -b py3 --track origin/py3
fatal: unable to access 'https://github.com/cerebis/bin3C/': Empty reply from server
WARNING: Discarding git+https://github.com/cerebis/bin3C@py3. Command errored out with exit status 128: git checkout -b py3 --track origin/py3 Check the logs for full command output.
ERROR: Command errored out with exit status 128: git checkout -b py3 --track origin/py3 Check the logs for full command output.
