Xdas 0.2: dask backend for the virtualization of non-HDF5 files (tdms, miniseed, ...) #18
Merged
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##              dev      #18      +/-   ##
==========================================
- Coverage   85.31%   80.89%   -4.42%
==========================================
  Files          28       34       +6
  Lines        3336     3837     +501
==========================================
+ Hits         2846     3104     +258
- Misses        490      733     +243
Version 0.1.2
While Dask "virtualization" has some limitations (it is slow for very big datasets, and Dask graphs cannot be stored natively), it is the best we can do in the medium run to virtualize non-HDF5 formats. The main requirement is therefore a way to store Dask graphs. We propose to do this with MessagePack, which can serialize a subset of Dask graphs to a binary format; that subset is sufficient to store multi-file chunked data arrays. The binary Dask graph can then be stored in the usual NetCDF format as an attribute of an empty variable that has the correct dimensions and coordinates (no values are assigned to it). This makes everything work almost out of the box.
The main difference between Dask virtualization and HDF5 virtualization is that reopening a Dask graph, appending to it, and writing it to a new virtual file stores the entire new graph, so the previous file can be ignored. This is not true of HDF5 virtualization, which always treats opened virtual files as regular files: the new file holds a reference to the previous one, which therefore cannot be deleted. We could make both behave the same, but that means more work and it is not clear that it is required.
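To make the serialization idea concrete, here is a minimal sketch of how a restricted Dask-style graph (a dict mapping keys to literals or `(callable, *args)` task tuples) might be turned into bytes and back. It is not the code in `xdas.dask.serial`: the `REGISTRY` whitelist and the `encode`, `decode`, `dumps` and `loads` helpers are hypothetical names chosen for illustration. It assumes the third-party `msgpack` package and falls back to JSON only so that the sketch still runs where `msgpack` is not installed.

```python
import operator

try:
    import msgpack  # third-party: pip install msgpack

    def pack(obj):
        return msgpack.packb(obj, use_bin_type=True)

    def unpack(data):
        return msgpack.unpackb(data, raw=False)

except ImportError:  # stand-in so the sketch still runs without msgpack
    import json

    def pack(obj):
        return json.dumps(obj).encode("utf-8")

    def unpack(data):
        return json.loads(data.decode("utf-8"))

# Hypothetical whitelist: only these callables may appear in a stored graph.
# Restricting the graph to known functions is what makes it serializable.
REGISTRY = {"add": operator.add, "mul": operator.mul}
NAMES = {func: name for name, func in REGISTRY.items()}


def encode(obj):
    """Turn a task, key or literal into plain maps, lists and scalars."""
    if isinstance(obj, tuple) and obj and callable(obj[0]):
        return {"task": NAMES[obj[0]], "args": [encode(a) for a in obj[1:]]}
    if isinstance(obj, tuple):
        return {"tuple": [encode(a) for a in obj]}
    if isinstance(obj, list):
        return {"list": [encode(a) for a in obj]}
    return obj  # str, int, float, bool, None pass through unchanged


def decode(obj):
    """Inverse of encode: rebuild task tuples, key tuples and lists."""
    if isinstance(obj, dict):
        if "task" in obj:
            return (REGISTRY[obj["task"]], *[decode(a) for a in obj["args"]])
        if "tuple" in obj:
            return tuple(decode(a) for a in obj["tuple"])
        return [decode(a) for a in obj["list"]]
    return obj


def dumps(graph):
    # Stored as a list of [key, task] pairs because keys may be tuples
    # such as ("x", 0), which cannot be used directly as map keys.
    return pack([[encode(k), encode(v)] for k, v in graph.items()])


def loads(data):
    return {decode(k): decode(v) for k, v in unpack(data)}


graph = {
    ("x", 0): 1,
    ("x", 1): 2,
    "total": (operator.add, ("x", 0), ("x", 1)),
}
blob = dumps(graph)     # bytes that could be kept as a file attribute
restored = loads(blob)  # equal to the original graph
```

Storing function names rather than pickled callables keeps the payload small and language-neutral, at the cost of supporting only graphs built from whitelisted functions.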
Add:
- xdas.dask module, which contains the following:
  - xdas.dask.core: some routines for the new dask back-end
  - xdas.dask.serial: a way to serialize dask graphs
- xdas.io.miniseed: miniseed support thanks to this new back-end
- xdas.io.silixa: TDMS support thanks to this new back-end

Change:
- xdas.concatenate, xdas.combine_by_coords & xdas.open_mfdataarray now allow concatenating along a new dimension (actually stacking). If a scalar coordinate is found, it is used to form the new coordinate along that dimension.

Checklist:
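The stacking behaviour described under "Change" can be pictured with a small toy model in plain Python. This does not use the xdas API: the `Labeled` class and the `stack` function are hypothetical stand-ins, meant only to show how the scalar coordinate carried by each input becomes the coordinate of the new dimension.

```python
from dataclasses import dataclass


@dataclass
class Labeled:
    """Toy stand-in for a labeled array (not the xdas DataArray class)."""

    dims: tuple   # dimension names, e.g. ("time",)
    coords: dict  # name -> list (dimension coordinate) or scalar
    values: list  # nested lists whose nesting follows `dims`


def stack(arrays, dim):
    """Stack arrays along a new leading dimension named `dim`."""
    first = arrays[0]
    if dim in first.dims:
        raise ValueError(f"{dim!r} already exists; this sketch only stacks")
    if any(a.dims != first.dims for a in arrays):
        raise ValueError("all arrays must share the same dimensions")
    # Keep the existing coordinates, except a scalar one named like `dim`.
    coords = {k: v for k, v in first.coords.items() if k != dim}
    # If every input carries a scalar coordinate named `dim`, collect those
    # scalars to form the coordinate along the new dimension.
    if all(dim in a.coords for a in arrays):
        coords[dim] = [a.coords[dim] for a in arrays]
    return Labeled((dim, *first.dims), coords, [a.values for a in arrays])


a = Labeled(("time",), {"time": [0, 1, 2], "channel": 10}, [1.0, 2.0, 3.0])
b = Labeled(("time",), {"time": [0, 1, 2], "channel": 20}, [4.0, 5.0, 6.0])
stacked = stack([a, b], "channel")
print(stacked.dims)               # ('channel', 'time')
print(stacked.coords["channel"])  # [10, 20]
```

If the inputs carry no scalar coordinate with that name, the toy version simply leaves the new dimension without a coordinate.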