Merge pull request #18 from xdas-dev/v0.2
Xdas 0.2: dask backend for the virtualization of non-HDF5 files (tdms, miniseed, ...)
atrabattoni authored Sep 18, 2024
2 parents 8306384 + 91a6b21 commit 0b30b20
Showing 27 changed files with 1,953 additions and 150 deletions.
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -11,7 +11,7 @@
author = "Alister Trabattoni"

# The full version, including alpha/beta/rc tags
release = "0.1.2"
release = "0.2"


# -- General configuration ---------------------------------------------------
12 changes: 9 additions & 3 deletions docs/release-notes.md
@@ -1,10 +1,16 @@
# Release notes

## 0.2

- Add Dask virtualization backend for non-HDF5 formats (@atrabattoni).
- Add support for miniSEED format (@atrabattoni, @chauvetige).
- Add support for Silixa (TDMS) format (@atrabattoni, @Stutzmann).

## 0.1.2

- Add ZeroMQ streaming capabilities (@atrabattoni)
- Add ZeroMQ streaming capabilities (@atrabattoni).
- Add support of Terra15 format (@chauvetige).
- Fix Febus engine (@ClaudioStrumia)
- Fix Febus engine (@ClaudioStrumia).

## 0.1.1

@@ -14,4 +20,4 @@

## 0.1

Initial stable version
Initial stable version.
28 changes: 19 additions & 9 deletions docs/user-guide/data-formats.md
@@ -20,15 +20,25 @@ os.chdir("../_data")

## Implemented file formats

The formats that are currently implemented are: ASN, FEBUS, OPTASENSE, SINTELA and TERRA15. To read them you have to specifiy which one you want in the `engine` argument in {py:func}`xdas.open_dataarray` for a single file or {py:func}`xdas.open_mfdataarray` for multiple files:

| DAS constructor | `engine` argument |
|:-----------------:|:-----------------:|
| ASN | `"asn"` |
| FEBUS | `"febus"` |
| OPTASENSE | `"optasense"` |
| SINTELA | `"sintela"` |
| TERRA15 | `"terra15"` |
Below is the list of formats that are currently implemented. All HDF5-based formats support native virtualization, while the other formats support Dask virtualization; please refer to the [](virtual-datasets) section. To read a file you have to specify its format with the `engine` argument of {py:func}`xdas.open_dataarray` for a single file or {py:func}`xdas.open_mfdataarray` for multiple files (a short usage sketch follows the tables below).

Xdas supports the following DAS formats:

| Constructor | Instrument | `engine` argument | Virtualization |
|:-----------------:|:-----------------:|:-----------------:|:-----------------:|
| ASN | OptoDAS | `"asn"` | HDF5 |
| FEBUS | A1 | `"febus"` | HDF5 |
| OptaSense | OLA, ODH*, ... | `"optasense"` | HDF5 |
| Silixa | iDAS | `"silixa"` | Dask |
| SINTELA | ONYX | `"sintela"` | HDF5 |
| Terra15 | Treble | `"terra15"` | HDF5 |

It also implements its own format and supports miniSEED:

| Format | `engine` argument | Virtualization |
|:-----------------:|:-----------------:|:-----------------:|
| Xdas | `None` | HDF5 |
| miniSEED | `"miniseed"` | Dask |
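
For instance, here is a minimal usage sketch (the file names are hypothetical; pick the `engine` value from the tables above):

```python
import xdas as xd

# Single file: the engine matches the interrogator (here an ASN OptoDAS file).
da = xd.open_dataarray("acquisition/DAS_20240101_000000.hdf5", engine="asn")

# Multiple files: they are opened and linked into a single virtual data array.
da = xd.open_mfdataarray("acquisition/*.hdf5", engine="asn")
```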

```{warning}
Due to the poor documentation of the various versions of the Febus format, it is recommended to manually provide the required trimming and the position of the timestamps within each block. For example, to trim 100 samples on both sides of each block and to set the timestamp location at the center of the block for a block of 2000 samples:
10 changes: 10 additions & 0 deletions docs/user-guide/faq.md
@@ -0,0 +1,10 @@
# Frequently Asked Questions

## Why not use Xarray and Dask?

Originally, Xdas was meant to be a simple add-on to Xarray, taking advantage of its [Dask integration](https://docs.xarray.dev/en/stable/user-guide/dask.html). But two main limitations forced us to create a parallel project:

- Coordinates have to be loaded into memory as NumPy arrays. This is prohibitive for very long time series, where storing the time coordinate as a dense array with a value for each time sample leads to metadata that in some extreme cases cannot fit in memory.
- Dask arrays become sluggish when dealing with a very large number of files. Dask is a pure Python package, and processing graphs of millions of tasks can take several seconds or more. Also, Dask does not provide a way to serialise a graph for later reuse.

Because of this, and because the Xarray object was not designed to be subclassed, we decided to go our own way. Should Xarray's evolution one day lift these limitations, we could imagine merging the two projects. In the meantime, Xdas tries to follow the Xarray API as much as possible.
2 changes: 2 additions & 0 deletions docs/user-guide/index.md
@@ -7,8 +7,10 @@ data-structure/index
data-formats
virtual-datasets
interpolated-coordinates
miniseed
convert-displacement
atoms
processing
streaming
faq
```
131 changes: 131 additions & 0 deletions docs/user-guide/miniseed.md
@@ -0,0 +1,131 @@
---
file_format: mystnb
kernelspec:
name: python3
---

```{code-cell}
:tags: [remove-cell]
import os
os.chdir("../_data")
import warnings
warnings.filterwarnings("ignore")
import obspy
import numpy as np
np.random.seed(0)
network = "NX"
stations = ["SX001", "SX002", "SX003", "SX004", "SX005", "SX006", "SX007"]
location = "00"
channels = ["HHZ", "HHN", "HHE"]
nchunk = 5
chunk_duration = 60
starttimes = [
    obspy.UTCDateTime("2024-01-01T00:00:00") + idx * chunk_duration
    for idx in range(nchunk)
]
delta = 0.01
failure = 0.1
for station in stations:
    for starttime in starttimes:
        if np.random.rand() < failure:
            continue
        for channel in channels:
            data = np.random.randn(round(chunk_duration / delta))
            header = {
                "delta": delta,
                "starttime": starttime,
                "network": network,
                "station": station,
                "location": location,
                "channel": channel,
            }
            tr = obspy.Trace(data, header)
            endtime = starttime + chunk_duration
            dirpath = f"{network}/{station}"
            if not os.path.exists(dirpath):
                os.makedirs(dirpath)
            fname = f"{network}.{station}.{location}.{channel}__{starttime}_{endtime}.mseed"
            path = os.path.join(dirpath, fname)
            tr.write(path)
```

# Working with Large-N Seismic Arrays

The virtualization capabilities of Xdas make it a good candidate for working with the large datasets produced by large-N seismic arrays.

In this section, we will present several examples of how to handle long and numerous time series.

```{note}
This part encourages experimenting with seismic data. Depending on the most common use cases users find, this could lead to changes in development direction.
```

## Exploring a dataset

We will start by exploring a synthetic dataset composed of 7 stations, each with 3 channels (HHZ, HHN, HHE). Each trace for each station and channel is stored in multiple one-minute-long files. Some files are missing, resulting in data gaps. The sampling rate is 100 Hz. The data is organized in a directory structure that groups files by station.

To open the dataset, we will use the `open_mfdatatree` function. This function takes a pattern that describes the directory structure and file names. The pattern is a string containing placeholders for the network, station, location, channel, start time, and end time. Placeholders for named fields are enclosed in curly braces, while square brackets mark the varying parts of the file name that will be concatenated into different acquisitions (meant for changes in acquisition parameters).

Next, we will plot the availability of the dataset.

```{code-cell}
:tags: [remove-output]
import xdas as xd
pattern = "NX/{station}/NX.{station}.00.{channel}__[acquisition].mseed"
dc = xd.open_mfdatatree(pattern, engine="miniseed")
xd.plot_availability(dc, dim="time")
```
```{code-cell}
:tags: [remove-input]
from IPython.display import HTML
fig = xd.plot_availability(dc, dim="time")
HTML(fig.to_html())
```

We can see that some data is indeed missing. Yet, as is often the case, the different channels are synchronized. We can therefore reorganize the data by concatenating the channels of each station.

```{code-cell}
:tags: [remove-output]
dc = xd.DataCollection(
    {
        station: xd.concatenate(objs, dim="channel")
        for station in dc
        for objs in zip(*[dc[station][channel] for channel in dc[station]])
    },
    name="station",
)
xd.plot_availability(dc, dim="time")
```
```{code-cell}
:tags: [remove-input]
from IPython.display import HTML
fig = xd.plot_availability(dc, dim="time")
HTML(fig.to_html())
```

In our case, all stations are synchronized to GPS time. By selecting a time range where no data is missing, we can concatenate the stations to obtain an N-dimensional array representation of the dataset.

```{code-cell}
dc = dc.sel(time=slice("2024-01-01T00:01:00", "2024-01-01T00:02:59.99"))
da = xd.concatenate((dc[station] for station in dc), dim="station")
da
```

This is useful for performing array analysis. In this example, we simply stack the energy.

```{code-cell}
trace = np.square(da).mean("channel").mean("station")
trace.plot(ylim=(0, 3))
```

All the processing capabilities of Xdas can be applied to the dataset. We encourage readers to explore the various possibilities.
63 changes: 37 additions & 26 deletions docs/user-guide/virtual-datasets.md
@@ -14,33 +14,43 @@ os.chdir("../_data")

# Virtual Datasets

To deal with large multi-file dataset, *xdas* uses the flexibility offered by the
[virtual dataset](https://docs.h5py.org/en/stable/vds.html) capabilities of
netCDF4/HDF5. A virtual dataset is a file that contains pointers towards an arbitrary number of files that
can then be accessed seamlessly as a single, contiguous dataset. Since this is
handled by HDF5 under the hood (which is C compiled) it comes with almost no overhead.
To deal with large multi-file datasets, *Xdas* uses the concept of virtual datasets. A virtual dataset is a file that contains pointers towards an arbitrary number of files that can then be accessed seamlessly as a single, contiguous dataset.

*Xdas* uses two types of virtualization:

- For HDF5-based formats, it leverages the native [virtual dataset](https://docs.h5py.org/en/stable/vds.html) capabilities of netCDF4/HDF5, which come with almost no overhead (handled by the compiled C library).
- For other types of files, it leverages the flexibility of [Dask arrays](https://docs.dask.org/en/stable/array.html).

## HDF5 Virtualization

```{note}
Because netCDF4 are valid HDF5 files, the virtual dataset feature of HDF5 can be used
with netCDF4 files.
Because netCDF4 files are valid HDF5 files, the virtual dataset feature of HDF5 can be used with netCDF4 files.
```

In *xdas*, a {py:class}`VirtualSource` is a pointer towards a file, while a
{py:class}`VirtualLayout` is table linking multiple {py:class}`VirtualSource`s. Below is an
example of a virtual dataset linking three files:
In *xdas*, a {py:class}`VirtualSource` is a pointer towards a file, while a {py:class}`VirtualLayout` is a table linking multiple {py:class}`VirtualSource`s. Below is an example of a virtual dataset linking three files:

![](/_static/virtual-datasets.svg)

In most cases, users do not need to deal with this object directly. To handle individual files, multiple files, and virtual datasets, *xdas* offers the following routines:

- {py:func}`xdas.open_dataarray` is used to open a single (virtual) data file, and create a {py:class}`xdas.DataArray` object out of it.
- {py:func}`xdas.open_mfdataarray` is used to open multiple (virtual) data files at once, creating a single {py:class}`xdas.DataArray` object that can be written to disk as a single virtual data file.
In most cases, users do not need to deal with these objects directly.
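
For the curious, here is a minimal h5py sketch of the kind of mapping a virtual dataset encodes (the file and dataset names are hypothetical): three sources of shape (1000, 100) are linked into a single (3000, 100) virtual dataset.

```python
import h5py

# Hypothetical layout: three source files, each holding a (1000, 100) dataset named "data".
layout = h5py.VirtualLayout(shape=(3000, 100), dtype="f4")
for idx, fname in enumerate(["file_1.h5", "file_2.h5", "file_3.h5"]):
    source = h5py.VirtualSource(fname, "data", shape=(1000, 100))
    layout[idx * 1000 : (idx + 1) * 1000] = source

# The virtual file only stores the mapping; reading it transparently fetches the source files.
with h5py.File("vds.h5", "w") as f:
    f.create_virtual_dataset("data", layout, fillvalue=float("nan"))
```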

```{note}
When opening a virtual dataset, this later will appear as a {py:class}`VirtualSource`.
This is because HDF5 treats virtual dataset as regular files.
When opening a virtual dataset, the latter will appear as a {py:class}`VirtualSource`. This is because HDF5 treats virtual datasets as regular files.
```

## Use cases

To handle individual files, multiple files, and virtual datasets, *xdas* offers the following routines:

| Function | Output | Description |
|--------------------------------------|----------------------------------|-----------------------------------------------------------------------------|
| {py:func}`xdas.open_dataarray` | {py:class}`~xdas.DataArray` | Open a (virtual) file. |
| {py:func}`xdas.open_mfdataarray` | {py:class}`~xdas.DataArray` | Open multiple (virtual) files and concatenate them. |
| {py:func}`xdas.open_mfdatacollection`| {py:class}`~xdas.DataCollection` | Open multiple (virtual) files, grouping and concatenating compatible files. |
| {py:func}`xdas.open_mfdatatree` | {py:class}`~xdas.DataCollection` | Open a directory tree of files, organizing data in a data collection. |
| {py:func}`xdas.open_datacollection` | {py:class}`~xdas.DataCollection` | Open a (virtual) collection. |

Please refer to the [](data-structure/datacollection.md) section for the functions that return a data collection.

## Linking multi-file datasets

Multiple physical data files can be opened simultaneously with {py:func}`xdas.open_mfdataarray`:
@@ -66,16 +76,17 @@ xd.open_dataarray("vds.nc")
```

```{hint}
A virtual dataset can point to another virtual dataset. This can be beneficial for huge
real time dataset where new data can be linked regularly by batches. Those batches can
then be linked in a master virtual dataset. This avoids relinking all the files.
A virtual dataset can point to another virtual dataset. This can be beneficial for huge real-time datasets where new data can be linked regularly in batches. Those batches can then be linked into a master virtual dataset, which avoids relinking all the files. A possible workflow is sketched below.
```
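
Here is a sketch of such a batch-linking workflow. The directory layout and file names are hypothetical, and the `virtual=True` argument of `to_netcdf` is an assumption about how the virtual links are written rather than a documented call.

```python
import xdas as xd

# Hypothetical layout: one directory of raw files per day, linked into a daily virtual file.
for day in ["2024-01-01", "2024-01-02"]:
    da = xd.open_mfdataarray(f"acquisition/{day}/*.hdf5", engine="asn")
    da.to_netcdf(f"vds_{day}.nc", virtual=True)  # assumed API: write the links, not the data

# The per-batch virtual files are then linked into a master virtual dataset.
da = xd.open_mfdataarray("vds_*.nc")
da.to_netcdf("master.nc", virtual=True)
```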

```{warning}
When loading large part of a virtual dataset, you might end up with nan values. This
normally happens when linked files are missing. But due to a
[known limitation](https://forum.hdfgroup.org/t/virtual-datasets-and-open-file-limit/6757)
of the HDF5 C library it can be due to the opening of too many files. Try increasing
the number of possible file to open with the `ulimit` command. Or load smaller chunk of
data.
```
When loading large parts of a virtual dataset, you might end up with NaN values. This normally happens when linked files are missing, but due to a [known limitation](https://forum.hdfgroup.org/t/virtual-datasets-and-open-file-limit/6757) of the HDF5 C library, it can also be caused by opening too many files. Try increasing the maximum number of open files with the `ulimit` command, or load smaller chunks of data.
```

## Dask Virtualization

Other types of formats are loaded as Dask arrays. A Dask array is an N-dimensional stack of chunks, and each chunk is associated with a task that must be computed to obtain its values. The result is a computation graph that Xdas is able to serialize and store within its native NetCDF format. To be serializable, the graph must only contain xdas or Dask functions.
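
To give an idea of what such a graph looks like, here is a minimal pure-Dask sketch; the file names and the reader function are hypothetical stand-ins for a real format reader.

```python
import numpy as np
import dask
import dask.array as dsk

def read_chunk(path):
    # Hypothetical stand-in for a real reader (e.g. a TDMS or miniSEED decoder).
    return np.load(path)

# One delayed task per file; nothing is read until values are actually requested.
chunks = [
    dsk.from_delayed(
        dask.delayed(read_chunk)(f"chunk_{idx}.npy"), shape=(1000, 100), dtype="float64"
    )
    for idx in range(3)
]
lazy = dsk.concatenate(chunks, axis=0)  # a (3000, 100) lazy array backed by a task graph
```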

From a user's point of view, this type of virtualization is used in much the same way as the HDF5 one.

The main difference is that when opening a dataset with Dask virtualization, the entire graph of pointers to the files is loaded and can be modified and saved again. In the HDF5 case, a virtual dataset is opened the same way as a regular file, meaning that the underlying mapping of pointers is hidden and cannot be modified. Note that Dask graphs can become slow when they grow very large (more than one million tasks).
4 changes: 3 additions & 1 deletion pyproject.toml
@@ -4,15 +4,17 @@ build-backend = "setuptools.build_meta"

[project]
name = "xdas"
version = "0.1.2"
version = "0.2"
requires-python = ">= 3.10"
authors = [
{ name = "Alister Trabattoni", email = "alister.trabattoni@gmail.com" },
]
dependencies = [
"dask",
"h5netcdf",
"h5py",
"hdf5plugin",
"msgpack",
"numba",
"numpy",
"obspy",