Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bigwigmerge will create regions beyond chrom map #78

Closed
donaldcampbelljr opened this issue Jan 29, 2025 · 9 comments
Closed

bigwigmerge will create regions beyond chrom map #78

donaldcampbelljr opened this issue Jan 29, 2025 · 9 comments

Comments

@donaldcampbelljr
Copy link
Contributor

Hey @jackh726 ,

I've run into an interesting issue while troubleshooting: databio/gtars#74

We are currently processing a bam file into a bigwig per chromosome and then merging these outputs into a final chromosome.

There are a couple of levers that we can pull that affects the regions and values passed to the bigwig creation, one is a smoothing parameter which will change the length of the regions and one is a scaling factor which will change how the intermediate bedGraph values are scaled. I'm finding that, depending on a combination of the two, this can lead to MergingValues returned from get_merged_values to contain regions that extend beyond the chromosome size and thus cause a failure when writing the final bigWig.

I believe that it is happening here:

let iters: Vec<_> = bws
.into_iter()
.map(|b| {
let f = ReopenableFile { file: File::open(&b.1)?, path: b.1 };
let b = BigWigRead::with_info(b.0, f);
b.get_interval_move(&chrom, 1, size).map(|i| i.map(|r| r.map_err(|e| MergingValuesError::BBIReadError(e))))
})
.collect::<Result<Vec<_>, _>>()?;
let mergingvalues = MergingValues::new(iters, threshold, adjust, clip);
Ok((chrom, size, mergingvalues))

Here is a link to a "good" and "bad" bw:
https://myuva-my.sharepoint.com/:u:/g/personal/zzz3fh_virginia_edu/EW51mMiOj1pFuaVIr2abeq0Bbpjkg8aMaJGrleAzwaM_uA?e=4bq5Rp

For reference the chromsizes file used for the map:

chr1	123556469

The good bw will have values that end:

max_val: 80
Here are starts ends values 123556200  123556341  0
Here are starts ends values 123556341  123556392  1
Here are starts ends values 123556392  123556428  0
Here are starts ends values 123556428  123556429  2

and whose last merged values are within range:

Value { start: 123556428, end: 123556429, value: 2.0 }

The bad bw

max_val: 0.000000798187159034569
Here are starts ends values 123556200  123556341  0
Here are starts ends values 123556341  123556392  0.00000000997734
Here are starts ends values 123556392  123556428  0
Here are starts ends values 123556428  123556429  0.00000001995468

and its merged values are outside of the range:

Value { start: 123550000, end: 123600000, value: 9.97734e-9 }

I'm wondering if this issue is happening within the get_intervals_move function? Is it not respecting the size parameter based on the chrom_map ?

I manually tested the bigwigs using the CLI bigwigmerge tool in addition to what we have implemented in gtars to ensure I could reproduce this failure outside of our own implementation.

Where we implement this in gtars:
https://github.com/databio/gtars/blob/5ce8434b5c03a74262df1c7ccdf70f335bae7844/gtars/src/uniwig/mod.rs#L896-L944

@jackh726
Copy link
Owner

Hmm, it's weird. Merging should be done at the value level, so its weird that it looks like the value here looks so cleanly split. I'd try to get to this to take a look, but I have some personal issues going on, so it might be delayed some.

@jackh726
Copy link
Owner

get_intervals_move only returns the raw data overlapping the provided interval (and supposed to be clipped for the start and end).

I'm really not sure without digging into it.

@jackh726
Copy link
Owner

jackh726 commented Feb 6, 2025

@donaldcampbelljr can you share the exact cli command used to reproduce? I just tested (with v0.5.4 and master) and couldn't reproduce:

bigtools bigwigmerge -b good.bw -b bad.bw target/merged.bw

@donaldcampbelljr
Copy link
Contributor Author

donaldcampbelljr commented Feb 7, 2025

@jackh726 Yes! It looks as though inputting only one of the bigwig files during merge (and if that bigwig file is the "bad" one), do I see the error!

Command:

drc@databio:~/GITHUB/bigtools$ ./target/release/bigwigmerge -b /home/drc/Downloads/gtars_74_issue/for_sharing/bad.bw /home/drc/bw_output/test.bw

I confirmed that inputting both the good and the bad bigwig files together causes the merge to succeed. Inputting only the good bigwig file is also successful.

The reason I was testing one bigwig file for merging is because we currently process in parallel via chromosome and then merge our bigwigs at the end. There could be a case where there is one chromosome processed and thus, downstream, we would try to merge a single bw file.

As a potential (quick) guardrail in gtars, I could check and see if we've only produced one bigwig and simply skip merging downstream.

@donaldcampbelljr
Copy link
Contributor Author

Actually, I just realized that this was occurring for multiple chromosomes in the original issue, so my suggested temporary fix would not work.

@jackh726
Copy link
Owner

jackh726 commented Feb 8, 2025

Ah...so there is for sure a bug in the code (causing data past the end of the chromosome), but the input data is difficult because anything less than 1.1920929E-7 (the machine epsilon) is essentially unreliable to work with.

Essentially, the issue is arising because to merge bigwigs, bigtools reads in the data for each in 50kb chunks, then goes through and converts this back into the values needed for the output. So, if you have values like [0.0, 0.0, 1.0, 2,0, 2.0, 0.0], the first two zeroes and the two 2.0s get merged. Well, when comparing 0.0 with 0.00000000997734 in your data, the difference is less the the machine epsilon, so they are as good as equal. Of course, they are discrete values, but because the way floats work, we don't know if those are have just rounded differently.

Besides that, the actual issue here is that when two values compare equal (difference less than epsilon), bigtools just adds +1 to the length of the value to be added. "Normally" (but not always!), you won't get a difference of zero at the end of chromosome because usually bigwigs don't store zero values and the data off the end of the chrom is zero. Of course, with either the current setup of checking diff. < epsilon, we'll extend the final value past the end of the chrom if that happens or if the final value is zero.

So, I'm doing three things:

  1. I'm going to store the float values as f64s instead of a f32s, this will reduce the epsilon down to ~1e-16. They will need to be converted back to f32s when writing to the file, but this should just increase precision.
  2. I'm removing the diff. <epsilon check entirely and instead checking for actual equality. This is less "correct", but it's also not necessarily wrong to store close floats in a bigwig. I may end up changing this back at some point given point 1.
  3. I'm fixing the bug with values extending pass the chromosome when the last value is 0.0.

@jackh726
Copy link
Owner

jackh726 commented Feb 8, 2025

Fixed in 529d9ed. I'll make a release this weekend.

@donaldcampbelljr
Copy link
Contributor Author

Excellent! Thank you for the detailed explanation of your findings and the quick fix.

@jackh726
Copy link
Owner

Published v0.5.5 which fixes this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants