Merge pull request #18 from xdas-dev/v0.2
Xdas 0.2: dask backend for the virtualization of non-HDF5 files (tdms, miniseed, ...)
atrabattoni authored Sep 18, 2024
2 parents 8306384 + 91a6b21 commit 0b30b20
Showing 27 changed files with 1,953 additions and 150 deletions.
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -11,7 +11,7 @@
author = "Alister Trabattoni"

# The full version, including alpha/beta/rc tags
release = "0.1.2"
release = "0.2"


# -- General configuration ---------------------------------------------------
12 changes: 9 additions & 3 deletions docs/release-notes.md
@@ -1,10 +1,16 @@
# Release notes

## 0.2

- Add Dask virtualization backend for non-HDF5 formats (@atrabattoni).
- Add support for miniSEED format (@atrabattoni, @chauvetige).
- Add support for Silixa (TDMS) format (@atrabattoni, @Stutzmann).

## 0.1.2

- Add ZeroMQ streaming capabilities (@atrabattoni)
- Add ZeroMQ streaming capabilities (@atrabattoni).
- Add support of Terra15 format (@chauvetige).
- Fix Febus engine (@ClaudioStrumia)
- Fix Febus engine (@ClaudioStrumia).

## 0.1.1

@@ -14,4 +20,4 @@

## 0.1

Initial stable version
Initial stable version.
28 changes: 19 additions & 9 deletions docs/user-guide/data-formats.md
@@ -20,15 +20,25 @@ os.chdir("../_data")

## Implemented file formats

The formats that are currently implemented are: ASN, FEBUS, OPTASENSE, SINTELA and TERRA15. To read them you have to specifiy which one you want in the `engine` argument in {py:func}`xdas.open_dataarray` for a single file or {py:func}`xdas.open_mfdataarray` for multiple files:

| DAS constructor | `engine` argument |
|:-----------------:|:-----------------:|
| ASN | `"asn"` |
| FEBUS | `"febus"` |
| OPTASENSE | `"optasense"` |
| SINTELA | `"sintela"` |
| TERRA15 | `"terra15"` |
Below is the list of formats that are currently implemented. All HDF5-based formats support native virtualization, while the other formats support Dask virtualization; please refer to the [](virtual-datasets) section. To read a file you have to specify its format with the `engine` argument of {py:func}`xdas.open_dataarray` for a single file or {py:func}`xdas.open_mfdataarray` for multiple files (a short usage sketch follows the tables below).

Xdas supports the following DAS formats:

| Constructor | Instrument | `engine` argument | Virtualization |
|:-----------------:|:-----------------:|:-----------------:|:-----------------:|
| ASN | OptoDAS | `"asn"` | HDF5 |
| FEBUS | A1 | `"febus"` | HDF5 |
| OptaSense | OLA, ODH*, ... | `"optasense"` | HDF5 |
| Silixa | iDAS | `"silixa"` | Dask |
| SINTELA | ONYX | `"sintela"` | HDF5 |
| Terra15 | Treble | `"terra15"` | HDF5 |

It also implements its own format and supports miniSEED:

| Format | `engine` argument | Virtualization |
|:-----------------:|:-----------------:|:-----------------:|
| Xdas | `None` | HDF5 |
| miniSEED | `"miniseed"` | Dask |
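
For instance, here is a minimal usage sketch (the file names are hypothetical; pick the `engine` value from the tables above):

```python
import xdas as xd

# Single file: the engine matches the interrogator (here an ASN OptoDAS file).
da = xd.open_dataarray("acquisition/DAS_20240101_000000.hdf5", engine="asn")

# Multiple files: they are opened and linked into a single virtual data array.
da = xd.open_mfdataarray("acquisition/*.hdf5", engine="asn")
```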

```{warning}
Due to the poor documentation of the various versions of the Febus format, it is recommended to manually provide the required trimming and the position of the timestamps within each block. For example, to trim 100 samples on both sides of each block and to set the timestamp location at the center of the block for a block of 2000 samples:
10 changes: 10 additions & 0 deletions docs/user-guide/faq.md
@@ -0,0 +1,10 @@
# Frequently Asked Questions

## Why not use Xarray and Dask?

Originally, Xdas was meant to be a simple add-on to Xarray, taking advantage of its [Dask integration](https://docs.xarray.dev/en/stable/user-guide/dask.html). But two main limitations forced us to create a parallel project:

- Coordinates have to be loaded into memory as NumPy arrays. This is prohibitive for very long time series, where storing the time coordinate as a dense array with a value for each time sample leads to metadata that in some extreme cases cannot fit in memory.
- Dask arrays become sluggish when dealing with a very large number of files. Dask is a pure Python package, and processing graphs of millions of tasks can take several seconds or more. Also, Dask does not provide a way to serialise a graph for later reuse.

Because of this, and because the Xarray object was not designed to be subclassed, we decided to go our own way. Should Xarray's evolution one day lift these limitations, we could imagine merging the two projects. In the meantime, Xdas tries to follow the Xarray API as much as possible.
2 changes: 2 additions & 0 deletions docs/user-guide/index.md
@@ -7,8 +7,10 @@ data-structure/index
data-formats
virtual-datasets
interpolated-coordinates
miniseed
convert-displacement
atoms
processing
streaming
faq
```
131 changes: 131 additions & 0 deletions docs/user-guide/miniseed.md
@@ -0,0 +1,131 @@
---
file_format: mystnb
kernelspec:
name: python3
---

```{code-cell}
:tags: [remove-cell]
import os
os.chdir("../_data")
import warnings
warnings.filterwarnings("ignore")
import obspy
import numpy as np
np.random.seed(0)
network = "NX"
stations = ["SX001", "SX002", "SX003", "SX004", "SX005", "SX006", "SX007"]
location = "00"
channels = ["HHZ", "HHN", "HHE"]
nchunk = 5
chunk_duration = 60
starttimes = [
    obspy.UTCDateTime("2024-01-01T00:00:00") + idx * chunk_duration
    for idx in range(nchunk)
]
delta = 0.01
failure = 0.1
for station in stations:
    for starttime in starttimes:
        if np.random.rand() < failure:
            continue
        for channel in channels:
            data = np.random.randn(round(chunk_duration / delta))
            header = {
                "delta": delta,
                "starttime": starttime,
                "network": network,
                "station": station,
                "location": location,
                "channel": channel,
            }
            tr = obspy.Trace(data, header)
            endtime = starttime + chunk_duration
            dirpath = f"{network}/{station}"
            if not os.path.exists(dirpath):
                os.makedirs(dirpath)
            fname = f"{network}.{station}.{location}.{channel}__{starttime}_{endtime}.mseed"
            path = os.path.join(dirpath, fname)
            tr.write(path)
```

# Working with Large-N Seismic Arrays

The virtualization capabilities of Xdas make it a good candidate for working with the large datasets produced by large-N seismic arrays.

In this section, we will present several examples of how to handle long and numerous time series.

```{note}
This part encourages experimenting with seismic data. Depending on the most common use cases users find, this could lead to changes in development direction.
```

## Exploring a dataset

We will start by exploring a synthetic dataset composed of 7 stations, each with 3 channels (HHZ, HHN, HHE). Each trace for each station and channel is stored in multiple one-minute-long files. Some files are missing, resulting in data gaps. The sampling rate is 100 Hz. The data is organized in a directory structure that groups files by station.

To open the dataset, we will use the `open_mfdatatree` function. This function takes a pattern that describes the directory structure and file names. The pattern is a string containing placeholders for the network, station, location, channel, start time, and end time. Placeholders for named fields are enclosed in curly braces, while square brackets mark the varying parts of the file name that will be concatenated into different acquisitions (meant for changes in acquisition parameters).

Next, we will plot the availability of the dataset.

```{code-cell}
:tags: [remove-output]
import xdas as xd
pattern = "NX/{station}/NX.{station}.00.{channel}__[acquisition].mseed"
dc = xd.open_mfdatatree(pattern, engine="miniseed")
xd.plot_availability(dc, dim="time")
```
```{code-cell}
:tags: [remove-input]
from IPython.display import HTML
fig = xd.plot_availability(dc, dim="time")
HTML(fig.to_html())
```

We can see that some data is indeed missing. Yet, as is often the case, the different channels are synchronized. We can therefore reorganize the data by concatenating the channels of each station.

```{code-cell}
:tags: [remove-output]
dc = xd.DataCollection(
    {
        station: xd.concatenate(objs, dim="channel")
        for station in dc
        for objs in zip(*[dc[station][channel] for channel in dc[station]])
    },
    name="station",
)
xd.plot_availability(dc, dim="time")
```
```{code-cell}
:tags: [remove-input]
from IPython.display import HTML
fig = xd.plot_availability(dc, dim="time")
HTML(fig.to_html())
```

In our case, all stations are synchronized to GPS time. By selecting a time range where no data is missing, we can concatenate the stations to obtain an N-dimensional array representation of the dataset.

```{code-cell}
dc = dc.sel(time=slice("2024-01-01T00:01:00", "2024-01-01T00:02:59.99"))
da = xd.concatenate((dc[station] for station in dc), dim="station")
da
```

This is useful for performing array analysis. In this example, we simply stack the energy.

```{code-cell}
trace = np.square(da).mean("channel").mean("station")
trace.plot(ylim=(0, 3))
```

All the processing capabilities of Xdas can be applied to the dataset. We encourage readers to explore the various possibilities.
63 changes: 37 additions & 26 deletions docs/user-guide/virtual-datasets.md
@@ -14,33 +14,43 @@ os.chdir("../_data")

# Virtual Datasets

To deal with large multi-file dataset, *xdas* uses the flexibility offered by the
[virtual dataset](https://docs.h5py.org/en/stable/vds.html) capabilities of
netCDF4/HDF5. A virtual dataset is a file that contains pointers towards an arbitrary number of files that
can then be accessed seamlessly as a single, contiguous dataset. Since this is
handled by HDF5 under the hood (which is C compiled) it comes with almost no overhead.
To deal with large multi-file datasets, *Xdas* uses the concept of virtual datasets. A virtual dataset is a file that contains pointers towards an arbitrary number of files that can then be accessed seamlessly as a single, contiguous dataset.

*Xdas* uses two types of virtualization:

- For HDF5-based formats, it leverages the native [virtual dataset](https://docs.h5py.org/en/stable/vds.html) capabilities of netCDF4/HDF5, which come with almost no overhead (handled by the compiled C library).
- For other types of files, it leverages the flexibility of [Dask arrays](https://docs.dask.org/en/stable/array.html).

## HDF5 Virtualization

```{note}
Because netCDF4 are valid HDF5 files, the virtual dataset feature of HDF5 can be used
with netCDF4 files.
Because netCDF4 files are valid HDF5 files, the virtual dataset feature of HDF5 can be used with netCDF4 files.
```

In *xdas*, a {py:class}`VirtualSource` is a pointer towards a file, while a
{py:class}`VirtualLayout` is table linking multiple {py:class}`VirtualSource`s. Below is an
example of a virtual dataset linking three files:
In *xdas*, a {py:class}`VirtualSource` is a pointer towards a file, while a {py:class}`VirtualLayout` is a table linking multiple {py:class}`VirtualSource`s. Below is an example of a virtual dataset linking three files:

![](/_static/virtual-datasets.svg)

In most cases, users do not need to deal with this object directly. To handle individual files, multiple files, and virtual datasets, *xdas* offers the following routines:

- {py:func}`xdas.open_dataarray` is used to open a single (virtual) data file, and create a {py:class}`xdas.DataArray` object out of it.
- {py:func}`xdas.open_mfdataarray` is used to open multiple (virtual) data files at once, creating a single {py:class}`xdas.DataArray` object that can be written to disk as a single virtual data file.
In most cases, users do not need to deal with these objects directly.
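
For the curious, here is a minimal h5py sketch of the kind of mapping a virtual dataset encodes (the file and dataset names are hypothetical): three sources of shape (1000, 100) are linked into a single (3000, 100) virtual dataset.

```python
import h5py

# Hypothetical layout: three source files, each holding a (1000, 100) dataset named "data".
layout = h5py.VirtualLayout(shape=(3000, 100), dtype="f4")
for idx, fname in enumerate(["file_1.h5", "file_2.h5", "file_3.h5"]):
    source = h5py.VirtualSource(fname, "data", shape=(1000, 100))
    layout[idx * 1000 : (idx + 1) * 1000] = source

# The virtual file only stores the mapping; reading it transparently fetches the source files.
with h5py.File("vds.h5", "w") as f:
    f.create_virtual_dataset("data", layout, fillvalue=float("nan"))
```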

```{note}
When opening a virtual dataset, this later will appear as a {py:class}`VirtualSource`.
This is because HDF5 treats virtual dataset as regular files.
When opening a virtual dataset, the latter will appear as a {py:class}`VirtualSource`. This is because HDF5 treats virtual datasets as regular files.
```

## Use cases

To handle individual files, multiple files, and virtual datasets, *xdas* offers the following routines:

| Function | Output | Description |
|--------------------------------------|----------------------------------|-----------------------------------------------------------------------------|
| {py:func}`xdas.open_dataarray` | {py:class}`~xdas.DataArray` | Open a (virtual) file. |
| {py:func}`xdas.open_mfdataarray` | {py:class}`~xdas.DataArray` | Open multiple (virtual) files and concatenate them. |
| {py:func}`xdas.open_mfdatacollection`| {py:class}`~xdas.DataCollection` | Open multiple (virtual) files, grouping and concatenating compatible files. |
| {py:func}`xdas.open_mfdatatree` | {py:class}`~xdas.DataCollection` | Open a directory tree of files, organizing data in a data collection. |
| {py:func}`xdas.open_datacollection` | {py:class}`~xdas.DataCollection` | Open a (virtual) collection. |

Please refer to the [](data-structure/datacollection.md) section for the functions that return a data collection.

## Linking multi-file datasets

Multiple physical data files can be opened simultaneously with {py:func}`xdas.open_mfdataarray`:
@@ -66,16 +76,17 @@ xd.open_dataarray("vds.nc")
```

```{hint}
A virtual dataset can point to another virtual dataset. This can be beneficial for huge
real time dataset where new data can be linked regularly by batches. Those batches can
then be linked in a master virtual dataset. This avoids relinking all the files.
A virtual dataset can point to another virtual dataset. This can be beneficial for huge real-time datasets where new data can be linked regularly in batches. Those batches can then be linked into a master virtual dataset, which avoids relinking all the files. A possible workflow is sketched below.
```
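
Here is a sketch of such a batch-linking workflow. The directory layout and file names are hypothetical, and the `virtual=True` argument of `to_netcdf` is an assumption about how the virtual links are written rather than a documented call.

```python
import xdas as xd

# Hypothetical layout: one directory of raw files per day, linked into a daily virtual file.
for day in ["2024-01-01", "2024-01-02"]:
    da = xd.open_mfdataarray(f"acquisition/{day}/*.hdf5", engine="asn")
    da.to_netcdf(f"vds_{day}.nc", virtual=True)  # assumed API: write the links, not the data

# The per-batch virtual files are then linked into a master virtual dataset.
da = xd.open_mfdataarray("vds_*.nc")
da.to_netcdf("master.nc", virtual=True)
```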

```{warning}
When loading large part of a virtual dataset, you might end up with nan values. This
normally happens when linked files are missing. But due to a
[known limitation](https://forum.hdfgroup.org/t/virtual-datasets-and-open-file-limit/6757)
of the HDF5 C library it can be due to the opening of too many files. Try increasing
the number of possible file to open with the `ulimit` command. Or load smaller chunk of
data.
```
When loading large parts of a virtual dataset, you might end up with NaN values. This normally happens when linked files are missing, but due to a [known limitation](https://forum.hdfgroup.org/t/virtual-datasets-and-open-file-limit/6757) of the HDF5 C library, it can also be caused by opening too many files. Try increasing the maximum number of open files with the `ulimit` command, or load smaller chunks of data.
```

## Dask Virtualization

Other types of formats are loaded as Dask arrays. A Dask array is an N-dimensional stack of chunks, and each chunk is associated with a task that must be computed to obtain its values. The result is a computation graph that Xdas is able to serialize and store within its native NetCDF format. To be serializable, the graph must only contain xdas or Dask functions.
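
To give an idea of what such a graph looks like, here is a minimal pure-Dask sketch; the file names and the reader function are hypothetical stand-ins for a real format reader.

```python
import numpy as np
import dask
import dask.array as dsk

def read_chunk(path):
    # Hypothetical stand-in for a real reader (e.g. a TDMS or miniSEED decoder).
    return np.load(path)

# One delayed task per file; nothing is read until values are actually requested.
chunks = [
    dsk.from_delayed(
        dask.delayed(read_chunk)(f"chunk_{idx}.npy"), shape=(1000, 100), dtype="float64"
    )
    for idx in range(3)
]
lazy = dsk.concatenate(chunks, axis=0)  # a (3000, 100) lazy array backed by a task graph
```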

From a user's point of view, this type of virtualization is used in much the same way as the HDF5 one.

The main difference is that when opening a dataset with Dask virtualization, the entire graph of pointers to the files is loaded and can be modified and saved again. In the HDF5 case, a virtual dataset is opened the same way as a regular file, meaning that the underlying mapping of pointers is hidden and cannot be modified. Note that Dask graphs can become slow when they grow very large (more than one million tasks).
4 changes: 3 additions & 1 deletion pyproject.toml
@@ -4,15 +4,17 @@ build-backend = "setuptools.build_meta"

[project]
name = "xdas"
version = "0.1.2"
version = "0.2"
requires-python = ">= 3.10"
authors = [
{ name = "Alister Trabattoni", email = "alister.trabattoni@gmail.com" },
]
dependencies = [
"dask",
"h5netcdf",
"h5py",
"hdf5plugin",
"msgpack",
"numba",
"numpy",
"obspy",