Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

One huge cluster #39

Open
orctyr opened this issue Nov 22, 2021 · 8 comments
Open

One huge cluster #39

orctyr opened this issue Nov 22, 2021 · 8 comments

Comments

@orctyr
Copy link

orctyr commented Nov 22, 2021

Hi cerebis,

I used bin3c to build mags from hi-c (50G WGS and 30G Hi-C).
bin3c two steps ran successfully.
But the mags status was a little strange. There are more than 200 mags (total 230Mb) and the first MAG contained 220Mb.
I had annotated each contig by CAT and obviously there were many different species within it.
I had try to change --min-signal from 5 to 10. But it did not work.
Is there any suggestion about this situation?

Best,
Orctyr

@cerebis
Copy link
Owner

cerebis commented Nov 23, 2021

Hi Orctyr, although I have seen large clusters myself, they've been on the order of a few times the expected size of a bacterial genome, rather than 100s of times.

  • Firstly, what software was used to create the BAM, and what was the precise command line.
  • Next, can you tell me how the reads were cleaned? It is important to do this for both shotgun (prior to assembly) and Hi-C reads (prior to mapping). I used to use BB-suite for this task, but more recently I have chosen fastp.

Can you post the bin3C log from mkmap and cluster here?

@orctyr
Copy link
Author

orctyr commented Nov 23, 2021

Hi cerebis,

  1. BAM file
    bwa mem -5SP contigs.fasta hic_paired.fastq.gz | samtools view -F 0x904 -bS - > hic2ctg_unsorted.bam
    samtools sort -@ 8 -n hic2ctg_unsorted.bam hic2ctg #final got hic2ctg.bam file for bin3c
  2. QC step
    Trimmomatic was used to remove low-quality reads and adaptors

bin3c_mkmap.log
DEBUG | 2021-11-22 09:00:21,456 | main | bin3C v0.1.1
DEBUG | 2021-11-22 09:00:21,456 | main | 2.7.15 | packaged by conda-forge | (default, Mar 5 2020, 14:56:06) [GCC 7.3.0]
DEBUG | 2021-11-22 09:00:21,456 | main | Command line: bin3C.py mkmap -e Sau3AI --eta -v H54_mNGS_megahit.fa hic2ctg-3.bam bin3c_mkmap
INFO | 2021-11-22 09:00:55,002 | mzd.contact_map | Reading sequences...
INFO | 2021-11-22 09:00:55,197 | mzd.contact_map | Accepted 70244 sequences covering 426495174 bp
INFO | 2021-11-22 09:00:55,197 | mzd.contact_map | References excluded: {'seq_missing': 0, 'too_short': 0}
INFO | 2021-11-22 09:00:55,197 | mzd.contact_map | Counting reads in bam file...
INFO | 2021-11-22 09:04:11,240 | mzd.contact_map | BAM file contains 175038961 alignments
INFO | 2021-11-22 09:25:55,360 | mzd.contact_map | Pair accounting: OrderedDict([('poor_match', 23821762), ('not_tip', 0), ('short_insert', 0), ('ref_excluded', 0), ('accepted', 62833944), ('median_excluded', 0), ('end_buffered', 0)])
INFO | 2021-11-22 09:25:55,481 | mzd.contact_map | Total extent map weight 96513729
DEBUG | 2021-11-22 09:25:55,483 | mzd.contact_map | Setting primary acceptance mask with filtering criterion min_len: 1000 min_sig: 5
DEBUG | 2021-11-22 09:25:55,484 | mzd.contact_map | Minimum length threshold removing: 0
DEBUG | 2021-11-22 09:26:01,399 | mzd.contact_map | Minimum signal threshold removing: 35404
DEBUG | 2021-11-22 09:26:01,409 | mzd.contact_map | Accepted sequences: 34840
INFO | 2021-11-22 09:26:01,613 | main | Saving contact map instance

bin3c_clust.log
DEBUG | 2021-11-22 09:36:41,887 | main | bin3C v0.1.1
DEBUG | 2021-11-22 09:36:41,887 | main | 2.7.15 | packaged by conda-forge | (default, Mar 5 2020, 14:56:06) [GCC 7.3.0]
DEBUG | 2021-11-22 09:36:41,887 | main | Command line: bin3C.py cluster --no-spades -v bin3c_mkmap/contact_map.p.gz bin3c_clust2
INFO | 2021-11-22 09:36:41,887 | main | Generated random seed: 6675652
INFO | 2021-11-22 09:36:41,887 | main | Loading existing contact map from: bin3c_mkmap/contact_map.p.gz
DEBUG | 2021-11-22 09:36:51,569 | mzd.contact_map | Setting primary acceptance mask with filtering criterion min_len: 1000 min_sig: 5
DEBUG | 2021-11-22 09:36:51,569 | mzd.contact_map | Using existing mask
INFO | 2021-11-22 09:36:51,569 | mzd.contact_map | Preparing sequence map with full dimensions: (70244, 70244)
DEBUG | 2021-11-22 09:36:54,391 | mzd.contact_map | Doing site based normalisation
DEBUG | 2021-11-22 09:36:54,776 | mzd.contact_map | Map normalized
DEBUG | 2021-11-22 09:36:54,777 | mzd.contact_map | Balancing contact map
WARNING | 2021-11-22 09:36:59,343 | mzd.sparse_utils | treating 18534 zeros on diagonal as ones
DEBUG | 2021-11-22 09:40:47,934 | mzd.sparse_utils | It took 431 iterations to achieve bistochasticity
DEBUG | 2021-11-22 09:40:52,393 | mzd.contact_map | Map balanced
INFO | 2021-11-22 09:43:57,837 | mzd.contact_map | After removing filtered sequences map dimensions: (34840, 34840)
INFO | 2021-11-22 09:43:57,841 | mzd.cluster | Graph will have 34840 nodes
DEBUG | 2021-11-22 09:44:03,767 | mzd.cluster | Building graph from edges
INFO | 2021-11-22 09:45:24,617 | mzd.cluster | Finished: Name: contact_graph Type: Graph Number of nodes: 34840 Number of edges: 6085103 Average degree: 349.3170
INFO | 2021-11-22 09:45:24,677 | mzd.cluster | Clustering contact graph using method: infomap
INFO | 2021-11-22 10:17:01,627 | mzd.cluster | Clustering using infomap resulted in 3810 clusters
INFO | 2021-11-22 10:17:12,404 | mzd.cluster | Analyzing the contents of each cluster
INFO | 2021-11-22 10:17:12,404 | mzd.cluster | Building random access index for input FASTA sequences
INFO | 2021-11-22 10:17:38,473 | mzd.cluster | Writing output to the path: bin3c_clust2
DEBUG | 2021-11-22 10:17:40,724 | mzd.cluster | Writing full unordered FASTA for cluster 0 to bin3c_clust2/fasta/CL0001.fna
......
DEBUG | 2021-11-22 10:18:07,656 | mzd.cluster | Writing full unordered FASTA for cluster 3809 to bin3c_clust2/fasta/CL3810.fna
INFO | 2021-11-22 10:18:07,659 | mzd.cluster | Plotting heatmap of complete solution
INFO | 2021-11-22 10:18:07,663 | mzd.cluster | Clusters passing minimum extent criterion: 857
INFO | 2021-11-22 10:18:07,698 | mzd.cluster | Total number of sequences in the clustering: 18851
INFO | 2021-11-22 10:18:07,699 | mzd.cluster | After joining with active sequence mask map: 18851
INFO | 2021-11-22 10:19:51,509 | mzd.contact_map | After removing filtered sequences map dimensions: (18851, 18851)
DEBUG | 2021-11-22 10:19:53,439 | mzd.contact_map | Map reordered
WARNING | 2021-11-22 10:19:53,520 | py.warnings | compressed.py:708: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
self[i, j] = values

DEBUG | 2021-11-22 10:20:14,344 | mzd.contact_map | Making raster image
DEBUG | 2021-11-22 10:21:06,982 | matplotlib.font_manager | findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=10.0 to DejaVu Sans (u'*/miniconda3/envs/py2.7/lib/python2.7/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
DEBUG | 2021-11-22 10:25:57,023 | mzd.contact_map | Saving plot

@cerebis
Copy link
Owner

cerebis commented Nov 23, 2021

Thanks @orctyr.

None of that log stands out to me as problematic.

Switching to a development branch

I have overlooked that the bin3C master branch has fallen quite far behind my development work. To begin with, can you please checkout the Python 3 branch and see what you get as a result without any of the work below.

pip install git+https://github.com/cerebis/bin3C@py3

The use of bin3C should be the same as before.

Using this codebase will ensure that my suggestions below will be possible.

QA testing with qc3C.

Would you happen to have any quality metrics for your Hi-C library? Ideally, it would be great if you could run qc3C over your data. Since you have a BAM file already, I would suggest using:

qc3C bam --enzyme Sau3AI --fasta H54_mNGS_megahit.fa --bam hic2ctg_unsorted.bam --output-path qc3c_out.

You can get qc3C from DockerHub: docker pull cerebis/qc3C and the command would then require mapping your data folder, something like:

docker run -v $PWD/mydata:/app cerebis/qc3c bam --enzyme Sau3AI --fasta H54_mNGS_megahit.fa --bam hic2ctg_unsorted.bam --output-path qc3c_out

qc3C is also installable via conda on Linux.

qc3C produces a log but the output to the console is just as informative. If you could post that here too.

It will help on two fronts. 1) It will establish the quality of your Hi-C library and 2) it will give us an estimate of library fragment size. We can then use this information to restrict contact map generation to include only pairs with sufficiently large separation.

More stringent contact map generation

Taking the estimate for mean insert/fragment size from qc3C, we can restrict contact map generation to only including pairs with larger separation distance. This helps to exclude noise (spurious pairs) from incorperation. I would normally take 3x the estimated fragment size from qc3C -- or in a pinch (qc3C isn't working for you) try 1000 bp.

bin3C mkmap -e Sau3AI --min-insert 1000 --eta -v H54_mNGS_megahit.fa hic2ctg-3.bam bin3c_mkmap

@orctyr
Copy link
Author

orctyr commented Nov 23, 2021

Hi Cerebis,

Your cmd: qc3C bam --enzyme Sau3AI --fasta H54_mNGS_megahit.fa --bam hic2ctg_unsorted.bam --output-path qc3c_out.
"hic2ctg_unsorted.bam" or "hic2ctg.bam" ?
In your qc3C github website, it need sorted bam file
"bwa index ref.fa
bwa mem -5SP ref.fna.gz hic_reads.fq.gz | samtools view -bS - | samtools sort -n -o hic_to_ref.bam - "

@cerebis
Copy link
Owner

cerebis commented Nov 23, 2021

Yes, sorry.

In reality, both bin3C and qc3C expect a name sorted BAM file. I think the older release (master branch) of bin3C may not make an explicit check for ordering, whereas the newer py3 branch will object if your BAM does not declare a sorting order in the header.

Why this worked without explicitly sorting is because the default out of bwa is effectively name sorted. However, the header is not set to declare it. I am not sure if this ws a conscious developer decision or not -- possibly because bwa does not want to guarantee the order.

@orctyr
Copy link
Author

orctyr commented Nov 24, 2021

Hi Cerebis,

qc3c bam module log file:
DEBUG | 2021-11-23 21:23:39,461 | main | qc3C 0.5
DEBUG | 2021-11-23 21:23:39,463 | main | 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51) [GCC 9.4.0]
DEBUG | 2021-11-23 21:23:39,463 | main | Command line: qc3C bam --enzyme Sau3AI --fasta H54_mNGS_megahit.fa --bam hic2ctg-3.bam --output-path qc3c_out
INFO | 2021-11-23 21:23:39,469 | qc3C.ligation | No cut-site database cache found for digest involving Sau3AI
INFO | 2021-11-23 21:23:39,808 | qc3C.ligation | Building cut-site database against reference sequence and Sau3AI
INFO | 2021-11-23 21:24:32,369 | qc3C.ligation | Saving cut-site database for reuse
INFO | 2021-11-23 21:24:37,782 | qc3C.bam_based | Accepting all usable reads
INFO | 2021-11-23 21:24:37,783 | qc3C.utils | Random seed was not set, using 20098262
INFO | 2021-11-23 21:24:37,783 | qc3C.bam_based | Beginning analysis...
INFO | 2021-11-23 21:24:37,783 | qc3C.bam_based | Counting alignments in hic2ctg-3.bam
INFO | 2021-11-23 21:27:30,532 | qc3C.bam_based | Found 175,038,961 alignments to analyse
INFO | 2021-11-23 22:06:08,170 | qc3C.bam_based | Number of parsed reads: 175,038,961
INFO | 2021-11-23 22:06:08,171 | qc3C.bam_based | Number of analysed reads: 175,038,961 (100.0% of all)
INFO | 2021-11-23 22:06:08,171 | qc3C.bam_based | Number of reads filtered [unmapped]: 0 (0.000% of analysed)
INFO | 2021-11-23 22:06:08,171 | qc3C.bam_based | Number of reads filtered [low mapq]: 16,349,134 (9.340% of analysed)
INFO | 2021-11-23 22:06:08,171 | qc3C.bam_based | Number of reads filtered [ref length]: 0 (0.000% of analysed)
INFO | 2021-11-23 22:06:08,171 | qc3C.bam_based | Number of reads filtered [secondary]: 0 (0.000% of analysed)
INFO | 2021-11-23 22:06:08,171 | qc3C.bam_based | Number of reads filtered [supplementary]: 0 (0.000% of analysed)
INFO | 2021-11-23 22:06:08,171 | qc3C.bam_based | Number of reads filtered [weak mapping]: 11,017,368 (6.294% of analysed)
INFO | 2021-11-23 22:06:08,171 | qc3C.bam_based | Number of reads filtered [ref terminated]: 198,634 (0.1135% of analysed)
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of reads filtered [duplicate]: 0 (0.000% of analysed)
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of accepted reads: 147,473,825 (84.25% of analysed)
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs analysed from accepted read pool: 95,256,963
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs accepted: 95,256,963
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs trans-mapping: 65,765,001 (69.04% of pairs)
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs cis-mapping: 29,491,962 (30.96% of pairs)
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of paired reads that fully align: 130,544,122 (68.52% of paired)
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of paired reads whose alignment terminates early: 59,969,804 (31.48% of paired)
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of paired reads not ending in cut-site remnant: 154,932,329 (81.32% of paired)
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of short-range cis-mapping pairs: 28,798,625 (97.65% of cis)
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs [dangling end]: 24,972,678 (84.68% of cis)
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs [self-circle]: 2,940,463 (9.970% of cis)
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs [invalid FF/RR]: 84,408 (0.2862% of cis)
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs [religation]: 340,469 (1.154% of cis)
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs [valid FF/RR]: 214,956 (0.7289% of cis)
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs [valid RF]: 804,755 (2.729% of cis)
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs [valid FR]: 474,702 (1.610% of cis)
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of uninformative pairs: 28,338,018 (96.09% of cis)
INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of informative pairs: 1,494,413 (5.067% of cis)
INFO | 2021-11-23 22:06:39,712 | qc3C.bam_based | Estimated insert size mean and median: 282nt 281nt
INFO | 2021-11-23 22:06:39,712 | qc3C.bam_based | Observed mean read length for paired reads: 147nt
INFO | 2021-11-23 22:06:40,098 | qc3C.bam_based | For observed insert size of 281nt, estimated unobserved fraction: 0.01068
INFO | 2021-11-23 22:06:40,099 | qc3C.bam_based | The enzyme Sau3AI has cut-site GATC
INFO | 2021-11-23 22:06:40,099 | qc3C.bam_based | The enzymatic combination Sau3AI/Sau3AI produces the 8nt ligation junction GATCGATC
INFO | 2021-11-23 22:06:40,099 | qc3C.bam_based | The number of paired reads which began with complete cut-site: 60,145,525 (31.57%)
INFO | 2021-11-23 22:06:40,099 | qc3C.bam_based | The digest contains the following possible remnants ['GATC']
INFO | 2021-11-23 22:06:40,099 | qc3C.bam_based | The number of paired reads whose alignment ends with cut-site remnant: 35,581,597 (18.68%)
INFO | 2021-11-23 22:06:40,099 | qc3C.bam_based | The number of paired reads with observable read-thru: 30,970,540 (16.26%)
INFO | 2021-11-23 22:06:40,099 | qc3C.bam_based | The number of paired reads with read-thru and split alignment: 23,696,167 (12.44%)
INFO | 2021-11-23 22:06:40,239 | qc3C.bam_based | Long-range intervals: 1000nt, 5000nt, 10000nt
INFO | 2021-11-23 22:06:40,239 | qc3C.bam_based | Number of pairs: 693,337, 369,568, 341,458
INFO | 2021-11-23 22:06:40,239 | qc3C.bam_based | Relative to all cis: 2.351%, 1.253%, 1.158%
INFO | 2021-11-23 22:06:40,239 | qc3C.bam_based | Relative to accepted: 0.7279%, 0.388%, 0.3585%
INFO | 2021-11-23 22:06:40,239 | qc3C.bam_based | For Sau3AI/Sau3AI junction sequence GATCGATC found: 30970540
INFO | 2021-11-23 22:06:40,239 | qc3C.bam_based | For Sau3AI cutsite remnant GATC found: 35581597

@cerebis
Copy link
Owner

cerebis commented Nov 24, 2021

Metagenomic Hi-C libraries tend to have much less signal that clonal eukaryotic experiments, so low values (<10%) are not uncommon. The signal strength (proportion of informative Hi-C pairs) of your library, however, is a but lower than we'd like to see.

Because of this, I would not raise the minimum signal threshold, as there is not enough signal to spare.

A little breakdown of the qc3C log.

  1. Of 95M accepted pairs, 65M mapped across contigs.

This is good, since you need pairs spanning contigs to bin, however this value is also biased by the degree of assembly fragmentation.

  1. Of the 29M cis-mapping (same contig) pairs, 98% were short-range (that is, pair separation < 1000 bp ).

These basic statistics indicate that the majority of pairs are either not a product of Hi-C proximity ligation or are from very closely occuring loci. There is still utility in Hi-C pairs with small separation, but they lack the same power as long-range pairs in drawing contigs across a chromosome together. We are instead relying more on contacts between tips of adjacent contigs.

  1. The fraction of pairs with observable read-thru is high.

Normally read-thru is a strong indicator of Hi-C products in a library, where the join between loci is traversed by one of the reads (or both) in a pair. The join contains an artefact -- which although can occur naturally -- is also a hint for Hi-C.

The size of the insert relative to read length obviously affects how often this occurs. Since you have 150bp reads and 280 bp mean insert size, we'd expect to see it sometimes. But usually more pronounced overlap is required for frequent observation. Further, seeing read-thru and a secondary split mapping of the remain part of the read is less common. Together, I would suspect you do have a good proportion of Hi-C pairs, only the majority of them short-range.

I see you are using Sau3AI, so I was curious if this library was produced using the Phase kit?

Going forward

I would suggest you use the py3 branch and try creating two contact maps with minimum insert sizes of 700 and then 840 (2.5x and 3x insert mean).

Hopefully you will eliminate the giant cluster with one of these runs.

@orctyr
Copy link
Author

orctyr commented Nov 25, 2021

Hi Cerebis,

bin3C-python3 version could not be download successfully here.
Could I use bin3C-py2 to go forward?

install log:
Collecting git+https://github.com/cerebis/bin3C@py3
Cloning https://github.com/cerebis/bin3C (to revision py3) to /tmp/pip-req-build-1vyhcl8u
Running command git clone --filter=blob:none -q https://github.com/cerebis/bin3C /tmp/pip-req-build-1vyhcl8u
Running command git checkout -b py3 --track origin/py3
fatal: unable to access 'https://github.com/cerebis/bin3C/': Empty reply from server
WARNING: Discarding git+https://github.com/cerebis/bin3C@py3. Command errored out with exit status 128: git checkout -b py3 --track origin/py3 Check the logs for full command output.
ERROR: Command errored out with exit status 128: git checkout -b py3 --track origin/py3 Check the logs for full command output.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants