Missing keys in npy_meta.json #75

Open
ClaudeHu opened this issue Jan 27, 2025 · 5 comments
Labels
bug Something isn't working uniwig

Comments

@ClaudeHu
Member

I ran uniwig on a sorted BED file with 13,853,899 regions, then checked the output with this shell script:

#!/bin/bash


# install gtars from the given branch
gtars_dir="/home/zh4nh/repo/gtars"
branch_name="dev_skip_sorting"

cd "$gtars_dir" || exit 1
git checkout "$branch_name"
git pull

time cargo install --path="./gtars"

echo "Finished installing gtars from branch $branch_name"


# run uniwig

input_bed="$DATA/combined/sample_sorted.bed"
chrom_size="$JOBS/hg38.chrom.sizes"
output_dir="/scratch/zh4nh/trial/gtars/test_$branch_name"

mkdir -p "$output_dir"

time gtars uniwig -f "$input_bed" -c "$chrom_size" -m 50 -s 1 -l "$output_dir/" -y npy

echo "Finished running uniwig, output written to $output_dir"

# check output result

python3 <<EOF
import json
import os
import re
import pprint
import numpy as np


def clean_chromosomes(filenames):
    cleaned_filenames = set()

    for filename in filenames:
        # Remove '.npy', '_end', '_core', '_start'
        cleaned_name = re.sub(r"(?:_end|_core|_start|\.npy)", "", filename)
        cleaned_filenames.add(cleaned_name)

    return list(cleaned_filenames)


def check_uniwig_npy(folder):
    # Path to npy_meta.json in the output folder
    meta_file = os.path.join(folder, "npy_meta.json")

    if not os.path.exists(meta_file):
        print(f"File not found in {folder}: {meta_file}")
        return

    with open(meta_file, "r") as f1:
        data = json.load(f1)

    npys = [fn for fn in os.listdir(folder) if fn.endswith(".npy")]
    if len(npys) != len(data) * 3:
        print(f"Number of npy files ({len(npys)}) should be 3 times the number of chromosomes ({len(data)})")
    for npy in npys:
        track = np.load(os.path.join(folder, npy))
        part = npy.replace(".npy", "").split("_")[-1]
        chrom = re.sub(r"(?:_end|_core|_start|\.npy)", "", npy)
        try:

            if len(track) + data[chrom][part] > data[chrom]["reported_chrom_size"]:
                print(
                    f"Size mismatch {chrom}_{part}: {len(track)} + {data[chrom][part]} > {data[chrom]['reported_chrom_size']}"
                )
        except KeyError:
            print(f"Key {part} not found in {chrom}, but its numpy file exists")


if __name__ == "__main__":
    check_uniwig_npy("$output_dir")

EOF

Here is the printout:

Key core not found in chr14_KI270724v1_random, but its numpy file exists
Key core not found in chrUn_KI270518v1, but its numpy file exists
Key start not found in chrUn_KI270538v1, but its numpy file exists
Key end not found in chrUn_KI270333v1, but its numpy file exists
Key end not found in chrUn_KI270748v1, but its numpy file exists
Key core not found in chrUn_KI270748v1, but its numpy file exists
Key core not found in chrUn_KI270747v1, but its numpy file exists
Key end not found in chrUn_GL000218v1, but its numpy file exists
Key end not found in chrUn_KI270744v1, but its numpy file exists
Key core not found in chrUn_KI270512v1, but its numpy file exists
Key core not found in chrUn_KI270588v1, but its numpy file exists
Key end not found in chr14_GL000225v1_random, but its numpy file exists
Key core not found in chrUn_KI270590v1, but its numpy file exists
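
For anyone trying to reproduce this, the missing keys can also be collected without loading the arrays. The following is a minimal sketch, not part of gtars: the helper name `find_missing_meta_keys` is hypothetical, and it assumes (based only on the script above) that files are named `<chrom>_<start|core|end>.npy` and that `npy_meta.json` maps each chromosome to a dictionary keyed by `start`, `core`, and `end`.

```python
import json
import os

# Assumed track suffixes, inferred from the file names in this issue.
PARTS = ("start", "core", "end")


def find_missing_meta_keys(folder, meta_name="npy_meta.json"):
    """Return sorted (chrom, part) pairs that have a .npy file but no metadata key."""
    with open(os.path.join(folder, meta_name)) as fh:
        meta = json.load(fh)

    missing = []
    for fn in os.listdir(folder):
        if not fn.endswith(".npy"):
            continue
        stem = fn[: -len(".npy")]
        # Split on the last underscore so names like
        # chr14_KI270724v1_random_core keep their full chromosome name.
        chrom, sep, part = stem.rpartition("_")
        if not sep or part not in PARTS:
            continue
        # A chromosome missing from the metadata entirely counts as missing too.
        if part not in meta.get(chrom, {}):
            missing.append((chrom, part))
    return sorted(missing)


# Usage (the folder path is a placeholder):
# for chrom, part in find_missing_meta_keys("/path/to/uniwig_output"):
#     print(f"{chrom}: missing '{part}'")
```

Running it on the output of both a single-threaded and a multi-threaded run would show whether the set of missing keys changes between runs.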
@donaldcampbelljr donaldcampbelljr added uniwig bug Something isn't working labels Jan 27, 2025
@ClaudeHu
Member Author

Setting the threads parameter to 1 (-p 1) avoids the problem:

time gtars uniwig -f $input_bed -c $chrom_size -m 50 -s 1 -l $output_dir/ -y npy -p 1

It reappears after switching the threads parameter back to its default value (6).

@donaldcampbelljr
Member

Related to #65, specifically the parallel processing issues I ran into there: #65 (comment)

However, I was under the impression that the current solution would not have these issues (which is why I implemented it this way).

@donaldcampbelljr
Member

Hmm, so far I am unable to reproduce this with a BED file (4.8 MB, 37,000 regions) with 24 chromosomes.

Using 1 core or 8, I get the same metadata in the .json file.
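
A comparison like this can be scripted. The sketch below is only an illustration: the helper name `diff_meta` and the file paths are hypothetical, and it assumes each `npy_meta.json` is a JSON object mapping chromosome names to dictionaries of values. It reports keys present in only one file and keys whose values differ.

```python
import json


def diff_meta(path_a, path_b):
    """Compare two npy_meta.json files chromosome by chromosome.

    Returns {chrom: {"only_a": [...], "only_b": [...], "changed": [...]}}
    for every chromosome whose entries differ between the two files.
    """
    with open(path_a) as fa, open(path_b) as fb:
        a, b = json.load(fa), json.load(fb)

    diffs = {}
    for chrom in sorted(set(a) | set(b)):
        keys_a, keys_b = a.get(chrom, {}), b.get(chrom, {})
        only_a = sorted(set(keys_a) - set(keys_b))
        only_b = sorted(set(keys_b) - set(keys_a))
        changed = sorted(
            k for k in set(keys_a) & set(keys_b) if keys_a[k] != keys_b[k]
        )
        if only_a or only_b or changed:
            diffs[chrom] = {"only_a": only_a, "only_b": only_b, "changed": changed}
    return diffs


# Usage (paths are placeholders for a 1-thread and a multi-thread run):
# print(diff_meta("run_p1/npy_meta.json", "run_p8/npy_meta.json"))
```

An empty result means the two runs produced identical metadata.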

@donaldcampbelljr
Member

Also unable to reproduce on a BED file with 2.5 million regions and 115 chromosomes.

@ClaudeHu
Member Author

The problem does not occur when running locally on a laptop. It only occurs on Rivanna (HPC).

Also unable to reproduce on a BED file with 2.5 mil region with 115 chromosomes.
