Version 0.1 (#12)
* Fix xdas.filter.

* Typo in docs.

* Do not generate output dir.

* Speedup .to_netcdf for VirtualStack.

* update docs.

* Add DataArray.size and len(DataArray).

* Add dtype checking when combine_by_coords.

* Make Collections picklable.

* Ensure correct fillvalue for virtual datasets.

* Fix virtual writing with dense non-dimensional coordinates.

* replace netcdf4 with h5netcdf.

* Fix to_xarray for scalar coordinate.

* Allow for Ellipsis in DataArray.transpose.

* Add DataArray.T

* One more concat test.

* Fix concatenate for unsorted coords.

* Add randn_wavefronts and rename wavelet_wavefronts.

* Add MLPicker.

* Allow passing chunk_dim to atoms in process.

* Move Sequential

* Make Sequential inherit from Atom.

* Fix nasty bug.

* auto chunk_dim for process.

* rename kwargs --> flags.

* Rename chunk -> chunk_dim

* Put zeros for missing channels.

* normalize inplace.

* WIP.

* Add coords handling.

* refactor MLPicker.

* Some more refactoring.

* Make circular buffers as states.

* Fix state initialization.

* small refactoring.

* Add find_picks.

* format.

* more formatting.

* Fix _find_picks_numeric axis handling.

* Add chunk processing for find_picks_numeric.

* Fix find_picks_numeric for 1d arrays.

* test trigger on several chunks ago.

* Add tests for find_picks.

* Add TODO.

* Implement offset argument to get absolute index location for chunks.

* remove unused imports

* Add trigger on chunks.

* Small refactor of atoms.Partial.

* Atomize find_picks.

* Fix equals for nan.

* Add DataFrameWriter.

* typos

* Fix parse_dates in DataFrameWriter.

* feat: Add find_picks trigger to signal module

* Fix virtual writing of collections.

* Add Atomic declaration of trigger.

* Add some doc and do some refactoring.

* chunk_dim flag must be provided or the state is reset.

* Allow passing atoms as input to atomized functions.

* Add numba to requirements.

* WIP: docs

* Improve getting started.

* Update Partial docstring.

* update atomized docstring.

* Remove damned pylance auto import.

* Linting corrections.

* restore netcdf4 dependency.

* Fix ufunc by providing better broadcasting.

* Use NDArrayOperatorsMixin for better arithmetics support.

* Improve get_discontinuities.

* update span -> delta

* Add copy to datacollection.py

* improve copy for datacollections.

* Fix some bugs.

* Add --force-reinstall for latest.

* CI: initial commit.

* CI: fix python versions.

* Pytest: import mode = importlib.

* rename action: tests.

* Add tests badge.

* add coord.get_availabilities()

* add plot_availability for DataArray.

* Add availability plot for collections.

* Update version.
atrabattoni authored May 21, 2024
1 parent e4dcd07 commit a884007
Showing 34 changed files with 2,344 additions and 607 deletions.
26 changes: 26 additions & 0 deletions .github/workflows/tests.yaml
@@ -0,0 +1,26 @@
name: tests

on: [push]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]

    steps:
      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install '.[tests]'
      - name: Test with pytest
        run: |
          pytest
1 change: 1 addition & 0 deletions README.md
@@ -9,6 +9,7 @@
-----------------

[![Documentation Status](https://readthedocs.org/projects/xdas/badge/?version=latest)](https://xdas.readthedocs.io/en/latest/?badge=latest)
[![Tests Status](https://github.com/xdas-dev/xdas/actions/workflows/tests.yaml/badge.svg)](https://github.com/xdas-dev/xdas/actions/workflows/tests.yaml)
[![PyPI](https://img.shields.io/pypi/v/xdas)](https://pypi.org/project/xdas/)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![DOI](https://zenodo.org/badge/560867006.svg)](https://zenodo.org/badge/latestdoi/560867006)
17 changes: 8 additions & 9 deletions docs/api/atoms.md
@@ -19,7 +19,6 @@ Attributes

```{eval-rst}
.. autosummary::
   :toctree: ../_autosummary

   Atom.state
   Atom.initialized
@@ -30,7 +29,6 @@ Methods

```{eval-rst}
.. autosummary::
   :toctree: ../_autosummary

   Atom.initialize
   Atom.initialize_from_state
@@ -55,11 +53,12 @@
.. autosummary::
   :toctree: ../_autosummary

   signal.ResamplePoly
   signal.IIRFilter
   signal.FIRFilter
   signal.LFilter
   signal.SOSFilter
   signal.DownSample
   signal.UpSample
   DownSample
   FIRFilter
   IIRFilter
   LFilter
   ResamplePoly
   SOSFilter
   Trigger
   UpSample
```
3 changes: 2 additions & 1 deletion docs/api/synthetics.md
@@ -8,5 +8,6 @@
.. autosummary::
   :toctree: ../_autosummary

   generate
   wavelet_wavefronts
   randn_wavefronts
```
15 changes: 3 additions & 12 deletions docs/conf.py
@@ -4,23 +4,14 @@
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys

# -- Project information -----------------------------------------------------

project = "xdas"
copyright = "2024, Alister Trabattoni"
author = "Alister Trabattoni"

# The full version, including alpha/beta/rc tags
release = "0.1rc0"
release = "0.1"


# -- General configuration ---------------------------------------------------
@@ -101,13 +92,13 @@
import numpy as np

import xdas as xd
from xdas.synthetics import generate
from xdas.synthetics import wavelet_wavefronts

dirpath = os.path.join(os.path.split(__file__)[0], "_data")
if not os.path.exists(dirpath):
    os.makedirs(dirpath)

da = generate()
da = wavelet_wavefronts()
chunks = xd.split(da, 3)
da.to_netcdf(os.path.join(dirpath, "sample.h5"))
da.to_netcdf(os.path.join(dirpath, "sample.nc"))
105 changes: 98 additions & 7 deletions docs/getting-started.md
@@ -29,7 +29,7 @@ pip install xdas
````
````{tab-item} Latest
```bash
pip install "git+https://github.com/xdas-dev/xdas.git@dev"
pip install "git+https://github.com/xdas-dev/xdas.git@dev" --force-reinstall
```
````
@@ -68,7 +68,7 @@ Xdas only loads the metadata from each file and returns a {py:class}`~xdas.DataArray`
Note that if you want to create a single data collection object for multiple acquisitions (i.e. different instruments or several acquisitions with different parameters), you can use the [DataCollection](user-guide/data-structure/datacollection) structure.

```{note}
For Febus users, the current implementation is very slow when directly working with native files. This is due to the particular 3D layout of the Febus format, which is for now virtually reshaped in an inefficient way. The current recommended workflow is to first convert each Febus file to the Xdas NetCDF format: `xdas.open_dataarray("path_to_febus_file.h5", engine="febus").to_netcdf("path_to_xdas_file.nc", virtual=False)`. Those converted files can then be linked as described above.
For Febus users, converting native files into the Xdas NetCDF format generally improves I/O performance and reduces the amount of data by a factor of two. This can be done by looping over Febus files and running: `xdas.open_dataarray("path_to_febus_file.h5", engine="febus").to_netcdf("path_to_xdas_file.nc", virtual=False)`. The converted files can then be linked as described above.
```
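
A hedged sketch of that conversion loop (the `febus/` and `converted/` folder names are placeholders, not paths this project prescribes):

```{code-cell}
import glob
import os

import xdas

os.makedirs("converted", exist_ok=True)
for path in sorted(glob.glob("febus/*.h5")):  # placeholder input folder
    converted = xdas.open_dataarray(path, engine="febus")
    name = os.path.basename(path).replace(".h5", ".nc")
    converted.to_netcdf(os.path.join("converted", name), virtual=False)
```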

### Fixing small gaps and overlaps
@@ -141,7 +141,7 @@ da.plot(yincrease=False, vmin=-0.5, vmax=0.5)
```


## Processing
## Signal processing

DataArray can be processed without having to extract the underlying N-dimensional array. Most numpy functions can be applied while preserving metadata. Xdas also wraps a large subset of [numpy](https://numpy.org/) and [scipy](https://scipy.org/) functions by adding coordinate handling. You mainly need to replace `axis` arguments with `dim` ones and to provide dimensions by name rather than by position.
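
For instance, where the underlying scipy routine would take a positional `axis`, the xdas wrapper takes the dimension by name (a minimal illustration reusing the `xs.taper` call that appears in the FK example below):

```{code-cell}
import xdas.signal as xs

# `dim="time"` replaces scipy's `axis` argument; the output keeps the
# time/distance coordinates of the input DataArray.
tapered = xs.taper(da, dim="time")
```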

@@ -174,10 +174,10 @@ Below is an example of spatial and temporal decimation:
```{code-cell}
import xdas.signal as xs
da = xs.decimate(da, 2, ftype="fir", dim="distance", parallel=None) # all cores by default
da = xs.decimate(da, 2, ftype="iir", dim="time", parallel=8) # height cores
decimated = xs.decimate(da, 2, ftype="fir", dim="distance", parallel=None) # all cores by default
decimated = xs.decimate(decimated, 2, ftype="iir", dim="time", parallel=8) # eight cores
da.plot(yincrease=False, vmin=-0.25, vmax=0.25)
decimated.plot(yincrease=False, vmin=-0.25, vmax=0.25)
```

Here is how to compute an FK diagram. Note that the DataArray object can be used to represent any number and kind of dimensions:
@@ -190,7 +190,7 @@ fk = xs.taper(fk, dim="time")
fk = xfft.rfft(fk, dim={"time": "frequency"}) # rename "time" -> "frequency"
fk = xfft.fft(fk, dim={"distance": "wavenumber"}) # rename "distance" -> "wavenumber"
fk = 20 * np.log10(np.abs(fk))
fk.plot(xlim=(-0.004, 0.004), vmin=-40, vmax=20, interpolation="antialiased")
fk.plot(xlim=(-0.004, 0.004), vmin=-30, vmax=30, interpolation="antialiased")
```

### Saving results
Expand All @@ -200,3 +200,94 @@ Processed data can be saved to NetCDF. This time, because the data was changed,
```{code-cell}
fk.to_netcdf("fk.nc")
```
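
Presumably the file can later be reopened like any other xdas NetCDF file (a one-line sketch, assuming the default engine reads files produced by `to_netcdf`):

```{code-cell}
import xdas as xd

reloaded = xd.open_dataarray("fk.nc")  # round-trips the result saved above
```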


## Massive processing using Atoms

The usual [numpy](https://numpy.org/)/[scipy](https://scipy.org/) way of processing data works great when the data of interest fits in memory. To deal with huge datasets, xdas introduces {py:class}`~xdas.atoms.Atom` objects.

An {py:class}`~xdas.atoms.Atom` is a generic processing unit that takes one input and returns one output. Atoms can store state information to ensure continuity across subsequent calls on contiguous chunks.
There are three ways to make atoms with xdas:

- Functions can be *atomized* using the {py:class}`~xdas.atoms.Partial` class; all parameters except the input are fixed.
- The {py:mod}`xdas.atoms` module contains a set of predefined atoms. In particular most stateful atoms are implemented in that module.
- The user can subclass the {py:class}`~xdas.atoms.Atom` class to define their own atoms (see the sketch below).
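
As a sketch of the third option, assuming only that an atom behaves as a callable mapping one data array to one output; the `**flags` passthrough and the bare subclassing shown here are assumptions, not the documented `Atom` API:

```{code-cell}
import numpy as np

from xdas.atoms import Atom


class Square(Atom):
    """A hypothetical stateless atom; np.square is shown elsewhere on this
    page to work directly on data arrays."""

    def __call__(self, da, **flags):
        return np.square(da)
```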

### Transforming a classic workflow into an atomic pipeline

Imagine you tested the following workflow on a small subset of your data:

```{code-cell}
from scipy.signal import iirfilter

b, a = iirfilter(4, 0.1, btype="high")


def process(da):
    da = xs.decimate(da, 2, ftype="fir", dim="distance")  # not impacted by chunking
    da = xs.lfilter(b, a, da, dim="time")  # requires state passing along time
    da = np.square(da)  # already a unary operator
    return da


monolithic = process(da)
```

To convert your workflow into an atomic pipeline you need:

1. to convert each processing step into an atom
2. to bundle all steps into a {py:class}`~xdas.atoms.Sequential` atom.

Converting each processing step into an atom depends on the nature of the step. In particular, it depends on whether the operation is **stateful** (it relies on the history along the chunked dimension) or **stateless** (it can be applied separately on each chunk along the given dimension without any particular consideration). An example of a stateful operation is a recursive filter, which passes its state from t to t+1. Note that the stateful/stateless characteristic depends on the chunking dimension.

- unary operators that are not stateful (that do not rely on the history along the chunked axis) can be used as is.
- functions that are not stateful must be wrapped with the {py:class}`~xdas.atoms.Partial` class.
- functions that **are stateful** must be replaced by an equivalent stateful object.

In practice, the atomized workflow can be implemented as below. The resulting atom is a callable that can be applied to any data array.

```{code-cell}
from xdas.atoms import Sequential, Partial, LFilter

atom = Sequential(
    [
        Partial(xs.decimate, 2, ftype="fir", dim="distance"),  # use Partial when stateless
        LFilter(b, a, dim="time"),  # use equivalent atom object if stateful
        np.square,  # do nothing if unary and stateless
    ]
)
atomic = atom(da)
assert atomic.equals(monolithic)  # works as `process` but can be applied chunk by chunk
```

### Applying an atom chunk by chunk

While atoms can be used as function equivalents to organize pipelines, their major selling point is their ability to enable chunked processing. Although chunk-by-chunk processing can be done manually, xdas provides the {py:mod}`xdas.processing` module to facilitate this operation. The user must define one data loader and one data writer; the {py:func}`~xdas.processing.process` function then runs the computation.

```{code-cell}
:tags: [remove-cell]
!mkdir output
```

In the example below, the data array is loaded in chunks of 100 samples along the `"time"` dimension. Each chunk is processed by the atom defined above and each resulting processed chunk is saved in the `output` folder. Once the computation completes, {py:func}`~xdas.processing.process` returns a unified view of the output chunks.

```{code-cell}
:tags: [remove-output]
from xdas.processing import process, DataArrayLoader, DataArrayWriter
dl = DataArrayLoader(da, chunks={"time": 100})
dw = DataArrayWriter("output")
chunked = process(atom, dl, dw)
assert chunked.equals(monolithic) # again equal but could be applied to much bigger datasets
```

```{code-cell}
:tags: [remove-cell]
!rm -r output
```

This section was a short summary of atoms and chunked processing. To go deeper into atoms, head to the [](user-guide/atoms) section; to further explore chunked processing, head to the [](user-guide/processing) section.
4 changes: 2 additions & 2 deletions docs/user-guide/atoms.md
@@ -47,9 +47,9 @@ The last operation, `IIRFilter`, instantiates a specific class dedicated to chun
Once the processing sequence has been defined, it can operate on data in memory by simply calling the sequence with the data array as the argument:

```{code-cell}
from xdas.synthetics import generate
from xdas.synthetics import wavelet_wavefronts
da = generate()
da = wavelet_wavefronts()
result = sequence(da)
result.plot(yincrease=False)
```
2 changes: 1 addition & 1 deletion docs/user-guide/data-formats.md
@@ -29,7 +29,7 @@ The formats that are currently implemented are: ASN, FEBUS, OPTASENSE and SINTELA
| OPTASENSE | `"optasense"` |
| SINTELA | `"sintela"` |

## Exdending *xdas* with your file format
## Extending *xdas* with your file format

*xdas* insists on its extensibility: the power is in the hands of the users. Extending *xdas* usually consists of writing functions that are only a few lines long. The process consists of dealing with the two main aspects of a {py:class}`xarray.DataArray`: unpacking the data and coordinates objects, optionally processing them, and packing them back into a DataArray object.
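
A hypothetical sketch of such a function: the HDF5 field names (`data`, `t0`, `dt`, `dx`) are placeholders for whatever your format stores, and the final call assumes the xdas `DataArray` constructor accepts a `coords` mapping like its xarray counterpart.

```python
import h5py
import numpy as np

import xdas


def read_myformat(path):
    """Unpack data and coordinates from a custom HDF5 layout (illustrative)."""
    with h5py.File(path, "r") as file:
        data = file["data"][...]  # raw (time, distance) array
        t0 = file.attrs["t0"]     # acquisition start time
        dt = file.attrs["dt"]     # time step
        dx = file.attrs["dx"]     # channel spacing
    nt, nx = data.shape
    coords = {
        "time": t0 + dt * np.arange(nt),
        "distance": dx * np.arange(nx),
    }
    return xdas.DataArray(data, coords=coords)
```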

11 changes: 7 additions & 4 deletions pyproject.toml
@@ -4,18 +4,21 @@ build-backend = "setuptools.build_meta"

[project]
name = "xdas"
version = "0.1rc0"
requires-python = ">= 3.7"
version = "0.1"
requires-python = ">= 3.10"
authors = [
{ name = "Alister Trabattoni", email = "alister.trabattoni@gmail.com" },
]
dependencies = [
"dask",
"h5netcdf",
"h5py",
"netcdf4",
"numba",
"numpy",
"obspy",
"pandas",
"plotly",
"scipy",
"tqdm",
"xarray",
@@ -34,11 +37,11 @@ docs = [
"sphinx-copybutton",
"sphinx",
]
tests = ["pytest", "pytest-cov"]
tests = ["pytest", "pytest-cov", "seisbench", "torch"]

[tool.isort]
profile = "black"

[tool.pytest.ini_options]
addopts = "--doctest-modules"
addopts = ["--doctest-modules", "--import-mode=importlib"]
doctest_optionflags = "NORMALIZE_WHITESPACE"