Xdas 0.2: dask backend for the virtualization of non-HDF5 files (tdms, miniseed, ...) #18

Merged
merged 57 commits into from
Sep 18, 2024

@atrabattoni atrabattoni commented Sep 17, 2024

While dask "virtualization" has some limitations (it is slow for very big datasets, and dask graphs cannot be stored), it is the best we can do in the medium term to virtualize non-HDF5 formats. This mainly requires finding a way to store dask graphs. We propose to do it with MessagePack, which can serialize a subset of dask graphs to a binary format; that subset is sufficient to store multi-file chunked data arrays. The binary dask graph can then be stored in the usual NetCDF format as an attribute of an empty variable with the correct dimensions and coordinates (no values are assigned to it). This makes everything work almost out of the box.
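
To illustrate the idea of serializing a restricted subset of task graphs, here is a minimal sketch. It is not the code in xdas.dask.serial: the graph layout, the `REGISTRY` whitelist, and the helpers `read_block`, `concat`, `encode`, `decode`, `dumps`, `loads` and `compute` are all hypothetical names made up for this example. It uses `msgpack.packb`/`msgpack.unpackb` when the msgpack package is installed and falls back to JSON otherwise, only so that the sketch runs anywhere.

```python
import json

try:
    import msgpack  # third-party package; the format proposed in this PR
except ImportError:  # fall back to JSON so the sketch still runs without it
    msgpack = None


def read_block(path, start, stop):
    """Stand-in reader: pretends to load samples [start, stop) from a file."""
    return list(range(start, stop))


def concat(*blocks):
    """Stand-in for concatenating chunks."""
    return [value for block in blocks for value in block]


# Hypothetical whitelist: only these callables may appear in a stored graph.
REGISTRY = {"read_block": read_block, "concat": concat}
NAMES = {func: name for name, func in REGISTRY.items()}


def encode(graph):
    """Turn {key: (func, *args)} into plain dicts, lists, strings and ints."""
    return {
        key: {"func": NAMES[task[0]], "args": list(task[1:])}
        for key, task in graph.items()
    }


def decode(plain):
    """Rebuild the task tuples, looking functions up by name."""
    return {
        key: (REGISTRY[task["func"]], *task["args"])
        for key, task in plain.items()
    }


def dumps(graph):
    plain = encode(graph)
    if msgpack is not None:
        return msgpack.packb(plain, use_bin_type=True)
    return json.dumps(plain).encode("utf-8")


def loads(data):
    if msgpack is not None:
        return decode(msgpack.unpackb(data, raw=False))
    return decode(json.loads(data.decode("utf-8")))


def compute(graph, key):
    """Tiny evaluator: arguments that are graph keys are computed first."""
    func, *args = graph[key]
    values = [
        compute(graph, arg) if isinstance(arg, str) and arg in graph else arg
        for arg in args
    ]
    return func(*values)


# A two-file chunked array described as a graph, then round-tripped via bytes.
graph = {
    "chunk-0": (read_block, "a.tdms", 0, 3),
    "chunk-1": (read_block, "b.tdms", 3, 6),
    "data": (concat, "chunk-0", "chunk-1"),
}
restored = loads(dumps(graph))
result = compute(restored, "data")
```

Storing function names rather than the functions themselves is what keeps the payload plain data; only graphs built from whitelisted callables can be stored this way, which matches the "subset of dask graphs" restriction described above.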

The main difference between dask virtualization and HDF5 virtualization is that reopening a dask graph, appending to it, and writing it to a new virtual file stores the new graph in its entirety, so the previous file can then be ignored. This is not true of HDF5 virtualization, which always treats opened virtual files as regular files: the new file keeps a reference to the previous one, which therefore cannot be deleted. We could make both work the same way, but that means more work and it is not clear that this is required.

Add:

  • xdas.dask: a new module which contains the following:
      • xdas.dask.core: some routines for the new dask back-end
      • xdas.dask.serial: a way to serialize dask graphs
  • xdas.io.miniseed: miniSEED support thanks to this new back-end
  • xdas.io.silixa: TDMS support thanks to this new back-end

Change:

  • xdas.concatenate, xdas.combine_by_coords & xdas.open_mfdataarray now allow concatenating along a new dimension (in effect, stacking). If a scalar coordinate is found, it is used to form the new coordinate along that dimension.
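
The stacking rule above can be sketched in plain Python as follows. This is only an illustration of the logic, not the xdas implementation and not its API: arrays are modelled as dicts, `stack` is a hypothetical helper, and two things are assumptions of the sketch: that the scalar coordinate carries the same name as the new dimension, and that an integer index is used when no such coordinate exists.

```python
def stack(pieces, dim):
    """Stack pieces along a new leading dimension `dim` (hypothetical helper)."""
    first = pieces[0]
    if dim in first["dims"]:
        raise ValueError(f"{dim!r} already exists; this is a plain concatenation")
    if all(dim in piece["coords"] for piece in pieces):
        # Each piece carries a scalar coordinate named `dim`: collect them
        # to form the coordinate of the new dimension.
        new_coord = [piece["coords"][dim] for piece in pieces]
    else:
        # Assumption of this sketch: fall back to a plain integer index.
        new_coord = list(range(len(pieces)))
    coords = {name: value for name, value in first["coords"].items() if name != dim}
    coords[dim] = new_coord
    return {
        "dims": (dim, *first["dims"]),
        "coords": coords,
        "values": [piece["values"] for piece in pieces],
    }


# Two 1-D traces, each tagged with a scalar "station" coordinate.
a = {"dims": ("time",), "coords": {"time": [0, 1, 2], "station": "A"}, "values": [1.0, 2.0, 3.0]}
b = {"dims": ("time",), "coords": {"time": [0, 1, 2], "station": "B"}, "values": [4.0, 5.0, 6.0]}
stacked = stack([a, b], "station")
```

Here `stacked` has dimensions `("station", "time")`, and its "station" coordinate is built from the two scalar values.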

Checklist:

  • Code
  • Tests
  • Documentation

@atrabattoni atrabattoni changed the title Xdas 0.2: add dask virtual backend for non-HDF5 files (tdms, miniseed) Xdas 0.2: dask virtual backend for non-HDF5 files (tdms, miniseed, ...) Sep 17, 2024
@atrabattoni atrabattoni changed the title Xdas 0.2: dask virtual backend for non-HDF5 files (tdms, miniseed, ...) Xdas 0.2: dask backend for non-HDF5 files virutalization (tdms, miniseed, ...) Sep 17, 2024
@atrabattoni atrabattoni changed the title Xdas 0.2: dask backend for non-HDF5 files virutalization (tdms, miniseed, ...) Xdas 0.2: dask backend for the virtualization of non-HDF5 files (tdms, miniseed, ...) Sep 17, 2024
@atrabattoni atrabattoni linked an issue Sep 17, 2024 that may be closed by this pull request
@atrabattoni atrabattoni self-assigned this Sep 17, 2024
@atrabattoni atrabattoni added the enhancement New feature or request label Sep 17, 2024

codecov bot commented Sep 17, 2024

Codecov Report

Attention: Patch coverage is 54.91228% with 257 lines in your changes missing coverage. Please review.

Project coverage is 80.89%. Comparing base (8306384) to head (25d51fa).

Files with missing lines    Patch %    Lines
xdas/io/tdms.py             12.61%     194 Missing ⚠️
xdas/core/coordinates.py    75.64%     19 Missing ⚠️
xdas/core/dataarray.py      64.00%     18 Missing ⚠️
xdas/io/silixa.py           30.43%     16 Missing ⚠️
xdas/core/routines.py       94.44%     5 Missing ⚠️
xdas/io/miniseed.py         88.57%     4 Missing ⚠️
xdas/dask/serial.py         97.50%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev      #18      +/-   ##
==========================================
- Coverage   85.31%   80.89%   -4.42%     
==========================================
  Files          28       34       +6     
  Lines        3336     3837     +501     
==========================================
+ Hits         2846     3104     +258     
- Misses        490      733     +243     


@atrabattoni atrabattoni merged commit 0b30b20 into dev Sep 18, 2024
4 checks passed
@atrabattoni atrabattoni deleted the v0.2 branch September 18, 2024 16:15
Development

Successfully merging this pull request may close these issues.

mseed format io