Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gtars uniwig trying to create coordinates larger than chr #74

Open
ljmills opened this issue Jan 27, 2025 · 17 comments · Fixed by #85
Open

gtars uniwig trying to create coordinates larger than chr #74

ljmills opened this issue Jan 27, 2025 · 17 comments · Fixed by #85
Labels
bug Something isn't working uniwig

Comments

@ljmills
Copy link

ljmills commented Jan 27, 2025

I am using gtars uniwig as part of the PEPATAC pipeline line and I am consistently getting this error across all of my ATAC-seq samples. It seems to happen on chromosomes that have alignment near the end of the chromosomes. I have checked the alignments and I don't have reads aligned off the ends of these chromosomes so I think it is an issue with gtars.

Produce signal tracks (01-26 17:22:21) elapsed: 1118.0 TIME

Target to produce: /home/ljmills/ljmills/Modiano/ATAC/pepatac_canFam4Y/project84/results_pipeline/UMGC-Mod-084-1s-MN24-d1-D_S44/aligned_canFam4Y_exact/UMGC-Mod-084-1s-MN24-d1-D_S44_exact_shift.bw,/home/ljmills/ljmills/Modiano/ATAC/pepatac_canFam4Y/project84/results_pipeline/UMGC-Mod-084-1s-MN24-d1-D_S44/aligned_canFam4Y/UMGC-Mod-084-1s-MN24-d1-D_S44_smooth_shift.bw,/home/ljmills/ljmills/Modiano/ATAC/pepatac_canFam4Y/project84/results_pipeline/UMGC-Mod-084-1s-MN24-d1-D_S44/aligned_canFam4Y_exact/UMGC-Mod-084-1s-MN24-d1-D_S44_shift.bed

gtars uniwig -f /home/ljmills/ljmills/Modiano/ATAC/pepatac_canFam4Y/project84/results_pipeline/UMGC-Mod-084-1s-MN24-d1-D_S44/aligned_canFam4Y/UMGC-Mod-084-1s-MN24-d1-D_S44_sort_dedup.bam -c /home/ljmills/shared/bin/pepatac/refgeneFolder/alias/canFam4Y/fasta/default/canFam4Y.chrom.sizes -m 0 -s 1 -t bam -y bw -p 16 -u shift -l /home/ljmills/ljmills/Modiano/ATAC/pepatac_canFam4Y/project84/results_pipeline/UMGC-Mod-084-1s-MN24-d1-D_S44/aligned_canFam4Y_exact/UMGC-Mod-084-1s-MN24-d1-D_S44_exact --bamscale 100227120.0 (3214339)
Begin bam processing workflow...

Merging all bigwig files...
Error opening file: No such file or directory (os error 2)
Error opening file: No such file or directory (os error 2)
Error opening file: No such file or directory (os error 2)
Error writing to BigWig file: Invalid bed graph: `50000` is greater than the chromosome (chrUn_JAAHUQ010000633v1) length (28993)
FINISHED

Command completed. Elapsed time: 0:00:49. Running peak memory: 5.237GB.
PID: 3214339; Command: gtars; Return code: 0; Memory used: 0.137GB

gtars uniwig -f /home/ljmills/ljmills/Modiano/ATAC/pepatac_canFam4Y/project84/results_pipeline/UMGC-Mod-084-1s-MN24-d1-D_S44/aligned_canFam4Y/UMGC-Mod-084-1s-MN24-d1-D_S44_sort_dedup.bam -c /home/ljmills/shared/bin/pepatac/refgeneFolder/alias/canFam4Y/fasta/default/canFam4Y.chrom.sizes -m 25 -s 1 -t bam -y bw -p 16 -u shift -l /home/ljmills/ljmills/Modiano/ATAC/pepatac_canFam4Y/project84/results_pipeline/UMGC-Mod-084-1s-MN24-d1-D_S44/aligned_canFam4Y/UMGC-Mod-084-1s-MN24-d1-D_S44_smooth --bamscale 100227120.0 (3245409)

Begin bam processing workflow...
Merging all bigwig files...
Error opening file: No such file or directory (os error 2)
Error opening file: No such file or directory (os error 2)
Error opening file: No such file or directory (os error 2)
Error writing to BigWig file: Invalid bed graph: `123600000` is greater than the chromosome (chr1) length (123556469)
FINISHED

Command completed. Elapsed time: 0:00:40. Running peak memory: 5.237GB.
PID: 3245409; Command: gtars; Return code: 0; Memory used: 0.012GB

@donaldcampbelljr donaldcampbelljr added uniwig bug Something isn't working labels Jan 27, 2025
@donaldcampbelljr
Copy link
Member

Thank you @ljmills for posting and for sending some example files.

I am unable to get the exact same issue as above. I am seeing, what I believe to be, a separate issue in the problem files you sent over.

For testChr1.bam where chrom.sizes for chr1 are such:
chr1 123556469

Emulating gtars command from PEPATAC:
./target/release/gtars uniwig --file "/home/drc/Downloads/gtars_74_issue/input/testchr1.bam" -c /home/drc/Downloads/gtars_74_issue/input/canFam4Y.chrom.sizes -m 25 -s 1 -t bam -y bw -p 6 -u shift --bamscale 100227120.0 -l /home/drc/Downloads/gtars_74_issue/output/test

I do not get a completion. The program simply hangs during bigwig creation.

I'm seeing the intermediate lines being written as such:

chr1	0	123554979	0
chr1	123554979	123555028	0.00000000997734
chr1	123555028	123555030	0.00000001995468
chr1	123555030	123555051	0.00000000997734
chr1	123555051	123555079	0.00000001995468
chr1	123555079	123555102	0.00000000997734
chr1	123555102	123555120	0
chr1	123555120	123555122	0.00000000997734
chr1	123555122	123555163	0.00000001995468
chr1	123555163	123555171	0.000000029932018
chr1	123555171	123555173	0.00000001995468
chr1	123555173	123555214	0.00000000997734
chr1	123555214	123555291	0
chr1	123555291	123555342	0.00000000997734
chr1	123555342	123555394	0
chr1	123555394	123555428	0.00000001995468
chr1	123555428	123555432	0.000000029932018
chr1	123555432	123555445	0.00000003990936
chr1	123555445	123555479	0.00000001995468
chr1	123555479	123555483	0.00000000997734
chr1	123555483	123555529	0
chr1	123555529	123555549	0.00000000997734
chr1	123555549	123555550	0.00000001995468
chr1	123555550	123555573	0.000000029932018
chr1	123555573	123555580	0.00000003990936
chr1	123555580	123555600	0.000000029932018
chr1	123555600	123555601	0.00000001995468
chr1	123555601	123555675	0.00000000997734
chr1	123555675	123555700	0
chr1	123555700	123555727	0.00000000997734
chr1	123555727	123555751	0.00000001995468
chr1	123555751	123555778	0.00000000997734
chr1	123555778	123555824	0
chr1	123555824	123555853	0.00000000997734
chr1	123555853	123555875	0.00000001995468
chr1	123555875	123555904	0.00000000997734
chr1	123555904	123555996	0
chr1	123555996	123556047	0.00000000997734
chr1	123556047	123556111	0
chr1	123556111	123556135	0.00000000997734
chr1	123556135	123556162	0.00000001995468
chr1	123556162	123556186	0.00000000997734
chr1	123556186	123556341	0
chr1	123556341	123556392	0.00000000997734
chr1	123556392	123556428	0
chr1	123556428	123556429	0.00000001995468

The counts are not extending past the chrom size in this particular example (123556469).

However, I noticed that the 4th column's numbers are very small and that if I simply change the --bamscale number to something smaller, the process will no longer hang and will complete just fine.

Again, I'm having difficulties reproducing the exact same error.

If the files are hanging during bigwig creation, a temp solution could be to manipulate the pepatac.py lines (this occurs in a few places! - 4 total) :
https://github.com/databio/pepatac/blob/0d1aad08910aa3e7be1f410071a5c4e595884272/pipelines/pepatac.py#L1551

If one lowers the scaling factor by a few orders of magnitude, it will allow the bigwig conversion to complete.

@ljmills
Copy link
Author

ljmills commented Jan 28, 2025 via email

@donaldcampbelljr
Copy link
Member

Excellent. Thank you for providing these.

I was able to reproduce the issue. It is related to the smoothing argument -m 25 in the pipeline. It seems as though anything greater than 6 will cause upper bound of the smoothing to be beyond the chromosome size (for Chr1 in the above example).

This may be a regression as I thought we had fixed this in the past. I will need to investigate and issue a new gtars release, etc.

Some suggested solutions for the near future:

  1. Wait for next gtars release. This will probably be within the next week or so.

  2. Regarding the PEPATAC pipeline (which I also maintain), I believe the produced bigwig files are not used downstream, only the BED file. Therefore, a temporary solution would be to change the smoothing factor in your local PEPATAC for the produced bigwig:
    https://github.com/databio/pepatac/blob/0d1aad08910aa3e7be1f410071a5c4e595884272/pipelines/pepatac.py#L1541

  3. You can pull and use the previous version of PEPATAC: https://github.com/databio/pepatac/releases/tag/v0.11.3 It does not use gtars uniwig and instead relies on a python implementation (it is slower).

@ljmills
Copy link
Author

ljmills commented Jan 28, 2025 via email

@donaldcampbelljr
Copy link
Member

Some troubleshooting notes:

I confirmed that the bedGraph lines being reported are not extended beyond the chromosome during uniwig counting, regardless of the smooth size (-m 25 in the below example).

chr1	123556341	123556392	0.00000000997734

chr1	123556392	123556428	0

chr1	123556428	123556429	0.00000001995468

Merging all bigwig files...
Error writing to BigWig file: Invalid bed graph: `123600000` is greater than the chromosome (chr1) length (123556469)
FINISHED

But we still get an error when writing the merged bigwig file.

Checking the intermediate bigwig file (before merging), I used bigTools to convert it to a bedGraph and see that the bedGraph also does not contain regions beyond the chromsize:

chr1	123556341	123556392	9.97734e-9
chr1	123556392	123556428	0.0
chr1	123556428	123556429	1.995468e-8

The final step before it errors is is the merging of any bigwigs:

let threshold = 0.0; // default
let adjust = Some(0.0); // default
let clip = Some(100000000.0); // arbitrary but large because we don't want to clip
let (iter, chrom_map) = get_merged_vals(bigwigs, 10, threshold, adjust, clip)?;
let outb = BigWigWrite::create_file(combined_bw_file_name, chrom_map)?;
let runtime = if num_threads == 1 {
runtime::Builder::new_current_thread().build().unwrap()
} else {
runtime::Builder::new_multi_thread()
.worker_threads(num_threads as usize)
.build()
.unwrap()
};
let all_values = ChromGroupReadImpl {
iter: Box::new(iter),
};
//println!("WRITING COMBINED BW FILE: {}", combined_bw_file_name.clone());
// outb.write(all_values, runtime)?;
match outb.write(all_values, runtime) {
Ok(_) => {
eprintln!("Successfully wrote file: {}", final_file_path);
}
Err(err) => {
eprintln!("Error writing to BigWig file: {}", err);
// Delete the partially written file
std::fs::remove_file(final_file_path).unwrap_or_else(|e| {
eprintln!("Error deleting file: {}", e);
});
}
}

It is not clear why this step would be affected by the smooth_size parameter nor why it could change the merged bigwig in such a way that would cause a region to extend over the chrom_sizes.

@donaldcampbelljr
Copy link
Member

Ok, I believe I've figured out where this is happening. It is during the merging step when all of the individual chromosome bigwigs are merged. Due to a combination of region length and associated values, the merging step can produce a value that extends beyond the chromosome length and then fail. I will submit a bug for discussion in the Bigtools.

Some notes for reference:

Individually produced bw file and then manually attempt to merge using bigtools bigwigmerge

large bamscale, smooth = 25, fails bigwigmerge
drc@databio:~/GITHUB/gtars/gtars$ ./target/release/gtars uniwig -f /home/drc/Downloads/gtars_74_issue/input/UMGC-Mod-084-1s-MN24-d1-D_S44_sort_dedup.bam -c /home/drc/Downloads/gtars_74_issue/input/canFam4Y_2chrs.chrom.sizes -m 25 -s 1 -t bam -y bw -p 6 -u shift --bamscale 100227120.0 -l /home/drc/Downloads/gtars_74_issue/output/test
Begin bam processing workflow...
Merging all bigwig files...
max_val: 0.000000798187159034569
Here are starts ends values 123556200  123556341  0
Here are starts ends values 123556341  123556392  0.00000000997734
Here are starts ends values 123556392  123556428  0
Here are starts ends values 123556428  123556429  0.00000001995468
GET MERGED VALS
FINISHED
small bamscale, smooth = 25, passes bigwigmerge
drc@databio:~/GITHUB/gtars/gtars$ ./target/release/gtars uniwig -f /home/drc/Downloads/gtars_74_issue/input/UMGC-Mod-084-1s-MN24-d1-D_S44_sort_dedup.bam -c /home/drc/Downloads/gtars_74_issue/input/canFam4Y_2chrs.chrom.sizes -m 25 -s 1 -t bam -y bw -p 6 -u shift --bamscale 1.0 -l /home/drc/Downloads/gtars_74_issue/output/test
Begin bam processing workflow...
Merging all bigwig files...
max_val: 80
Here are starts ends values 123556200  123556341  0
Here are starts ends values 123556341  123556392  1
Here are starts ends values 123556392  123556428  0
Here are starts ends values 123556428  123556429  2
GET MERGED VALS
FINISHED
large bamscale, smooth = 5, passes bigwigmerge
drc@databio:~/GITHUB/gtars/gtars$ ./target/release/gtars uniwig -f /home/drc/Downloads/gtars_74_issue/input/UMGC-Mod-084-1s-MN24-d1-D_S44_sort_dedup.bam -c /home/drc/Downloads/gtars_74_issue/input/canFam4Y_2chrs.chrom.sizes -m 5 -s 1 -t bam -y bw -p 6 -u shift --bamscale 100227120.0 -l /home/drc/Downloads/gtars_74_issue/output/test
Begin bam processing workflow...
Merging all bigwig files...
max_val: 0.0000003192748749825114
Here are starts ends values 123556200  123556361  0
Here are starts ends values 123556361  123556372  0.00000000997734
Here are starts ends values 123556372  123556448  0
Here are starts ends values 123556448  123556449  0.00000001995468
Here are starts ends values 123556449  123556459  0.000000029932018
Here are starts ends values 123556459  123556460  0.00000000997734
GET MERGED VALS
FINISHED

Checking output after merge call (fn get_merged_vals):

large bamscale, smooth = 25, fails bigwigmerge
drc@databio:~/GITHUB/gtars/gtars$ ./target/release/gtars uniwig -f /home/drc/Downloads/gtars_74_issue/input/UMGC-Mod-084-1s-MN24-d1-D_S44_sort_dedup.bam -c /home/drc/Downloads/gtars_74_issue/input/canFam4Y_2chrs.chrom.sizes -m 25 -s 1 -t bam -y bw -p 6 -u shift --bamscale 100227120.0 -l /home/drc/Downloads/gtars_74_issue/output/test
Begin bam processing workflow...
Merging all bigwig files...
GET MERGED VALS
here is res: ChromInfo { name: "chr1", length: 123556469, id: 0 }
here is last value: Value { start: 123550000, end: 123600000, value: 9.97734e-9 }
FINISHED
small bamscale, smooth = 25, passes bigwigmerge
drc@databio:~/GITHUB/gtars/gtars$ ./target/release/gtars uniwig -f /home/drc/Downloads/gtars_74_issue/input/UMGC-Mod-084-1s-MN24-d1-D_S44_sort_dedup.bam -c /home/drc/Downloads/gtars_74_issue/input/canFam4Y_2chrs.chrom.sizes -m 25 -s 1 -t bam -y bw -p 6 -u shift --bamscale 1.0 -l /home/drc/Downloads/gtars_74_issue/output/test
Begin bam processing workflow...
Merging all bigwig files...
GET MERGED VALS
here is res: ChromInfo { name: "chr1", length: 123556469, id: 0 }
here is last value: Value { start: 123556428, end: 123556429, value: 2.0 }
FINISHED

So, depending on the region length and the values, the merging step can cause a start and end to extend beyond the chrom sizes in the provided chrom map.

Happening here:

let (iter, chrom_map) = get_merged_vals(bigwigs, 10, threshold, adjust, clip)?;

@donaldcampbelljr
Copy link
Member

This should be fixed with the next release of bigtools and the subsequent point release of gtars.

@ljmills
Copy link
Author

ljmills commented Feb 10, 2025 via email

@donaldcampbelljr
Copy link
Member

Newest version of gtars has been released and with it rust-gtars.

Therefore, I have also released a new version of PEPATAC with the newest rust-gtars: https://github.com/databio/pepatac/releases/tag/v0.12.2

@ljmills
Copy link
Author

ljmills commented Feb 14, 2025

updated gtars and tried to run PEPATAC again and gtars uniwig is now having a new issue when trying to create bed files. You can use the same example files from earlier to test this issue.

% gtars --version

gtars uniwig -f /home/ljmills/ljmills/Modiano/ATAC/pepatac_canFam4Y/project84/results_pipeline/UMGC-Mod-084-1s-MN09-d1-A_S1/aligned_canFam4Y/UMGC-Mod-084-1s-MN09-d1-A_S1_sort_dedup.bam -c /home/ljmills/shared/bin/pepatac/refgeneFolder/alias/canFam4Y/fasta/default/canFam4Y.chrom.sizes -m 25 -s 1 -t bam -y bed -p 16 -u shift -l /home/ljmills/ljmills/Modiano/ATAC/pepatac_canFam4Y/project84/results_pipeline/UMGC-Mod-084-1s-MN09-d1-A_S1/aligned_canFam4Y_exact/UMGC-Mod-084-1s-MN09-d1-A_S1 --bamscale 944184.96

Processing successful for chr5
Processing successful for chr1
thread 'main' panicked at src/uniwig/writing.rs:83:49:
called Result::unwrap() on an Err value: Os { code: 2, kind: NotFound, message: "No such file or directory" }
stack backtrace:
0: 0x56507296bb7a - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::ha4a311b32f6b4ad8
1: 0x565072991c33 - core::fmt::write::h1866771663f62b81
2: 0x5650729686b3 - std::io::Write::write_fmt::hb549e7444823135e
3: 0x56507296b9c2 - std::sys::backtrace::BacktraceLock::print::hddd3a9918ce29aa7
4: 0x56507296cacc - std::panicking::default_hook::{{closure}}::h791f75256b902d7d
5: 0x56507296c912 - std::panicking::default_hook::h82cc572fcb0d8cd7
6: 0x56507296d0a7 - std::panicking::rust_panic_with_hook::he21644cc2707f2c4
7: 0x56507296cf3a - std::panicking::begin_panic_handler::{{closure}}::h42f7c414fed3cad9
8: 0x56507296c059 - std::sys::backtrace::__rust_end_short_backtrace::ha26cf5766b4e8c65
9: 0x56507296cbcc - rust_begin_unwind
10: 0x5650726bde30 - core::panicking::panic_fmt::h74866b78e934b1c0
11: 0x5650726be236 - core::result::unwrap_failed::h899ed7ab2ccb8159
12: 0x5650726e0a97 - gtars::uniwig::writing::write_combined_files::h43e12055fd7c19b2
13: 0x565072748cb2 - gtars::uniwig::uniwig_main::hcac1e915edeef6af
14: 0x565072744c1e - gtars::uniwig::run_uniwig::h46a18fa425375681
15: 0x5650726c1b5b - gtars::main::h7eacb000d82a5ea9
16: 0x5650726bf2f3 - std::sys::backtrace::__rust_begin_short_backtrace::h454dce8820439958
17: 0x5650726bf24d - std::rt::lang_start::{{closure}}::he34506c836b6502b
18: 0x56507295f5f7 - std::rt::lang_start_internal::h78dd36c15a6b42b8
19: 0x5650726c1d95 - main
20: 0x7f7f3456f7e5 - __libc_start_main
21: 0x5650726be550 -
22: 0x0 -

There are chromosome specific bed files in the output directory. The last line of the Chr 1 bed file
chr1 123556428 123556478 N 0 -
extends past the size of the chromosome in the provided chromosomes sizes file
chr1 123556469

@nleroy917
Copy link
Member

Looking just upstream of that error, I see this:

for chrom in chromosomes.iter() {
        let file_name = format!(
            "{}{}_{}.{}",
            bwfileheader, chrom.chrom, location, output_type
        );
        inputs.push(file_name);
    }

What would cause one of these inputs to not exist, @donaldcampbelljr? Is it possible for a file to not exist? If so, could we just skip it?

@donaldcampbelljr
Copy link
Member

donaldcampbelljr commented Feb 14, 2025

I have this file UMGC-Mod-084-1s-MN24-d1-D_S44_sort_dedup.bam which does not produce the issue. It appears that the failing file is UMGC-Mod-084-1s-MN09-d1-A_S1_sort_dedup.bam.

Regarding the bed file's last row extending beyond the chrom size, I think that is unrelated to the above issue, but we should clip it in the bam_to_bed_no_counts based on input chrom.sizes like we do in the other counting functions.

@nleroy917 Good idea. By the time the program gets this far, it has checked that the chromosomes exist in the bam file. However, it could raise an error during writing to the bed file line by line, but we should see this error in the terminal (which I don't in this case). We could skip that file. As long as the program communicates that it was expecting a chromosome file but didn't find one that would be fine. Currently, the messaging is inadequate either way as I'm not sure which chrom failed from the backtrace. :) Also, I would think the consumer handle would produce an empty file either way by this point.

@ljmills
Copy link
Author

ljmills commented Feb 14, 2025 via email

@donaldcampbelljr
Copy link
Member

Ok, I was able to reproduce (was still using a smaller subset of chroms).

Added a print function immediately before opening each input file for the merge.

/home/drc/Downloads/gtars_74_issue/output/testchrY_NC_051844.1_shift.bed
thread 'main' panicked at src/uniwig/writing.rs:84:49:
called Result::unwrap() on an Err value: Os { code: 2, kind: NotFound, message: "No such file or directory" }

Checking the folder, I see a file: testchrY_NC_051844.bed. Checking chrom.sizes:

chrX	124992030
chrY_NC_051844.1	3937623
chrY_unplaced_NW_024010443.1	1569522

So, it appears to me that it is dropping the .1 when writing that particular chromosome to file.

Similar issue for the following chromosome: chrY_unplaced_NW_024010443.bed exists without the .1.

@donaldcampbelljr
Copy link
Member

The produced bed file is intended to be used downstream for macs3 peak calling. I checked and noticed that the original method (bamSitesToWig in PEPATAC python) would also create elements that extend beyond the chromosome size.

Example downstream call to macs3:
macs3 callpeak -t tutorial1_shift.bed -f BED --outdir /peak_calling_hg38 -n tutorial1 -g 2.7e9 --shift -75 --extsize 150 --nomodel --call-summits --nolambda --keep-dup all -p 0.01

Will this extension past the chromosome length cause issues downstream for macs3 peak calling and subsequent processing?

@donaldcampbelljr
Copy link
Member

Chromosome names with . should be solved with #91

@donaldcampbelljr
Copy link
Member

@ljmills the newest version of rust-gtars (0.2.2) was just merged into bioconda. Please try installing this newest version and let us know if you continue to have any issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working uniwig
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants