Merge pull request #18 from xdas-dev/v0.2
Xdas 0.2: dask backend for the virtualization of non-HDF5 files (tdms, miniseed, ...)
Showing 27 changed files with 1,953 additions and 150 deletions.

# Frequently Asked Questions

## Why not use Xarray and Dask?

Originally, Xdas was meant to be a simple add-on to Xarray, taking advantage of its [Dask integration](https://docs.xarray.dev/en/stable/user-guide/dask.html). But two main limitations forced us to create a parallel project:

- Coordinates have to be loaded into memory as NumPy arrays. This is prohibitive for very long time series: storing the time coordinate as a dense array with one value per time sample leads to metadata that, in extreme cases, cannot fit in memory (see the back-of-the-envelope sketch below).
- Dask arrays become sluggish when dealing with a very large number of files. Dask is a pure Python package, and processing graphs of millions of tasks can take several seconds or more. Dask also does not provide a way to serialize a graph for later reuse.

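To make the first limitation concrete, here is a back-of-the-envelope sketch with illustrative numbers (hypothetical figures, not drawn from any real deployment):

```python
# Hypothetical case: one year of data sampled at 1 kHz, with the time
# coordinate stored densely as datetime64[ns] (8 bytes per sample).
nsamples = 365 * 24 * 3600 * 1000  # ~3.15e10 samples
nbytes = 8 * nsamples
print(f"{nbytes / 1e9:.0f} GB")    # ~252 GB for the time coordinate alone
```

Avoiding dense in-memory coordinates sidesteps this cost entirely.
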
Because of this, and because the Xarray object was not designed to be subclassed, we decided to go our own way. If Xarray's progress one day allows it, we could imagine merging the two projects. In the meantime, Xdas follows the Xarray API as closely as possible.

---
file_format: mystnb
kernelspec:
  name: python3
---

```{code-cell}
:tags: [remove-cell]
import os

os.chdir("../_data")

import warnings

warnings.filterwarnings("ignore")

import obspy
import numpy as np

np.random.seed(0)

# Synthetic dataset: 7 stations with 3 channels each, written as
# one-minute miniSEED chunks sampled at 100 Hz.
network = "NX"
stations = ["SX001", "SX002", "SX003", "SX004", "SX005", "SX006", "SX007"]
location = "00"
channels = ["HHZ", "HHN", "HHE"]
nchunk = 5
chunk_duration = 60
starttimes = [
    obspy.UTCDateTime("2024-01-01T00:00:00") + idx * chunk_duration
    for idx in range(nchunk)
]
delta = 0.01
failure = 0.1  # probability of dropping a chunk to create data gaps

for station in stations:
    for starttime in starttimes:
        if np.random.rand() < failure:
            continue  # simulate a missing file
        for channel in channels:
            data = np.random.randn(round(chunk_duration / delta))
            header = {
                "delta": delta,
                "starttime": starttime,
                "network": network,
                "station": station,
                "location": location,
                "channel": channel,
            }
            tr = obspy.Trace(data, header)
            endtime = starttime + chunk_duration
            dirpath = f"{network}/{station}"
            if not os.path.exists(dirpath):
                os.makedirs(dirpath)
            fname = f"{network}.{station}.{location}.{channel}__{starttime}_{endtime}.mseed"
            path = os.path.join(dirpath, fname)
            tr.write(path)
```

# Working with Large-N Seismic Arrays

The virtualization capabilities of Xdas make it a good candidate for working with the large datasets produced by large-N seismic arrays.

In this section, we will present several examples of how to handle long and numerous time series.

```{note}
This section is an invitation to experiment with seismic data. The most common use cases that users find could steer the direction of future development.
```

## Exploring a dataset

We will start by exploring a synthetic dataset composed of 7 stations, each with 3 channels (HHZ, HHN, HHE). Each trace for each station and channel is stored in multiple one-minute-long files. Some files are missing, resulting in data gaps. The sampling rate is 100 Hz. The data is organized in a directory structure that groups files by station.

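For reference, the generation script above lays the files out as follows (timestamps follow ObsPy's `UTCDateTime` string representation; the exact listing depends on which chunks were randomly dropped):

```text
NX/
├── SX001/
│   ├── NX.SX001.00.HHZ__2024-01-01T00:00:00.000000Z_2024-01-01T00:01:00.000000Z.mseed
│   ├── NX.SX001.00.HHN__2024-01-01T00:00:00.000000Z_2024-01-01T00:01:00.000000Z.mseed
│   ├── NX.SX001.00.HHE__2024-01-01T00:00:00.000000Z_2024-01-01T00:01:00.000000Z.mseed
│   └── ...
├── SX002/
│   └── ...
└── ...
```
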
To open the dataset, we will use the `open_mfdatatree` function. This function takes a pattern that describes the directory structure and the file names. The pattern is a string containing placeholders for the network, station, location, channel, start time, and end time. Named fields are enclosed in curly braces, while square brackets mark the varying parts of the file names that will be concatenated into different acquisitions (meant to capture changes in acquisition parameters).

Next, we will plot the availability of the dataset.

```{code-cell}
:tags: [remove-output]
import xdas as xd

pattern = "NX/{station}/NX.{station}.00.{channel}__[acquisition].mseed"
dc = xd.open_mfdatatree(pattern, engine="miniseed")
xd.plot_availability(dc, dim="time")
```

```{code-cell}
:tags: [remove-input]
from IPython.display import HTML

fig = xd.plot_availability(dc, dim="time")
HTML(fig.to_html())
```

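The result is a nested data collection, keyed here by station and then channel, mirroring the placeholders in the pattern. Displaying it gives a textual overview of what was found on disk:

```{code-cell}
dc
```
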
We can indeed see that some data is missing. Yet, as is often the case, the different channels are synchronized. We can therefore reorganize the data by concatenating the channels of each station.

```{code-cell}
:tags: [remove-output]
dc = xd.DataCollection(
    {
        # Zip the per-channel lists of acquisitions of each station and
        # concatenate matching segments along a new "channel" dimension.
        station: xd.concatenate(objs, dim="channel")
        for station in dc
        for objs in zip(*[dc[station][channel] for channel in dc[station]])
    },
    name="station",
)
xd.plot_availability(dc, dim="time")
```

```{code-cell}
:tags: [remove-input]
from IPython.display import HTML

fig = xd.plot_availability(dc, dim="time")
HTML(fig.to_html())
```

In our case, all stations are synchronized to GPS time. By selecting a time range where no data is missing, we can concatenate the stations to obtain an N-dimensional array representation of the dataset.

```{code-cell}
dc = dc.sel(time=slice("2024-01-01T00:01:00", "2024-01-01T00:02:59.99"))
da = xd.concatenate((dc[station] for station in dc), dim="station")
da
```

This is useful for performing array analysis. In this example, we simply stack the energy across channels and stations.

```{code-cell}
trace = np.square(da).mean("channel").mean("station")
trace.plot(ylim=(0, 3))
```

All the processing capabilities of Xdas can be applied to the dataset. We encourage readers to explore the various possibilities.

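For instance, reusing only operations already shown, and assuming NumPy ufuncs such as `np.sqrt` apply elementwise to Xdas arrays just as `np.square` does above, one could sketch a per-station RMS amplitude over the selected window:

```{code-cell}
# Per-station RMS: square, average over channels and time, take the root.
rms = np.sqrt(np.square(da).mean("channel").mean("time"))
rms
```
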