Diversity calculations on windowed datasets generate lots of unmanaged memory #1278

Open
@percyfal

Description

When working on a reasonably large dataset (a 7TiB Zarr store), I noticed that diversity calculations, in particular windowed ones, generate a lot of unmanaged memory. The call_genotype portion of the data is 280GiB stored, 7.2TiB actual size. There are some 3 billion sites and 1000 samples. At the end of a run there is almost 300GiB of unmanaged memory, which rules out using 256GB and possibly even 512GB memory nodes (I've been testing with a dask LocalCluster). Maybe this is more of an issue with dask, but I thought I'd post it in case there is something that can be done to free up memory in the underlying implementation.
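For what it's worth, dask's memory troubleshooting documentation notes that "unmanaged" memory reported on Linux is often glibc's allocator holding on to freed blocks rather than a true leak in the workload, and suggests lowering the malloc trim threshold in the environment of the worker processes. A minimal sketch of that workaround (assuming Linux/glibc; 65536 is the value suggested in the dask docs, not something I've tuned for this dataset):

```shell
# Ask glibc to return freed allocations to the OS more eagerly.
# This must be set in the environment the dask workers inherit,
# i.e. before the LocalCluster is started (assumption: Linux/glibc malloc).
export MALLOC_TRIM_THRESHOLD_=65536
```

Whether this actually brings the ~300GiB of unmanaged memory down for windowed diversity runs I haven't verified; it may only help if the memory is allocator retention rather than references held by the underlying implementation.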
