Diversity calculations on windowed datasets generate lots of unmanaged memory #1278

Open
@percyfal

Description

When working on a reasonably large dataset (a 7TiB Zarr store), I noticed that diversity calculations, in particular windowed ones, generate a lot of unmanaged memory. The call_genotype portion of the data is 280GiB stored, 7.2TiB actual size. There are some 3 billion sites and 1000 samples. At the end of a run there is almost 300GiB of unmanaged memory, which rules out using 256GB and possibly even 512GB memory nodes (I've been testing with a dask LocalCluster). Maybe this is more of an issue with dask, but I thought I'd post it in case there is something that can be done to free up memory in the underlying implementation.
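For what it's worth, dask's memory troubleshooting documentation notes that "unmanaged" memory reported on Linux is often glibc's allocator holding on to freed blocks rather than a true leak in the workload, and suggests lowering the malloc trim threshold in the environment of the worker processes. A minimal sketch of that workaround (assuming Linux/glibc; 65536 is the value suggested in the dask docs, not something I've tuned for this dataset):

```shell
# Ask glibc to return freed allocations to the OS more eagerly.
# This must be set in the environment the dask workers inherit,
# i.e. before the LocalCluster is started (assumption: Linux/glibc malloc).
export MALLOC_TRIM_THRESHOLD_=65536
```

Whether this actually brings the ~300GiB of unmanaged memory down for windowed diversity runs I haven't verified; it may only help if the memory is allocator retention rather than references held by the underlying implementation.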
