This project experiments with using Apache Spark for seismic data processing. The longer-term goal is to run heavy seismic imaging pipelines on large-scale resources and deliver results in "real time".
"Apache Spark is a unified analytics engine for large-scale data processing."
The main idea of the SeisSpark project is to represent the seismic data processing graph as a DAG of RDD operations in Apache Spark. The seismic data is represented as a key-value RDD, where the key is the gather id and the value is the gather data.
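As a rough illustration (not the actual SeisSpark code), the key-value layout can be pictured as a pyspark pair RDD keyed by gather id; the gather contents shown here are placeholders:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical example: three gathers keyed by their gather (ensemble) id;
# the values stand in for the SU-format trace data of each gather.
gathers = sc.parallelize([
    (1001, b"<SU traces of gather 1001>"),
    (1002, b"<SU traces of gather 1002>"),
    (1003, b"<SU traces of gather 1003>"),
])

# A processing step is a transformation that yields a new key-value RDD.
processed = gathers.mapValues(lambda gather_bytes: gather_bytes)  # identity placeholder
print(processed.keys().collect())
```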
The wiki page says:
"Seismic Un*x is an open source seismic processing package."
SeisSpark relies on SU commands to run the actual data transformations and uses the SU data format internally.
Additional information about SU can be found in the SU documentation or in the SU GitHub repo.
The SU programs are invoked as subprocesses, and all data transfer happens via stdin and stdout. In practice this means that the Spark worker nodes need to have the SU package installed.
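A minimal sketch of this subprocess pattern, assuming SU binaries such as `sugain` are installed and on the PATH of the worker executing the call:

```python
import subprocess

def run_su_program(command, su_bytes):
    """Pipe one gather (SU format) through an SU program via stdin/stdout.

    `command` is the SU command line, e.g. ["sugain", "agc=1"]; the gather
    bytes go in on stdin and the transformed gather comes back on stdout.
    """
    result = subprocess.run(command, input=su_bytes, stdout=subprocess.PIPE, check=True)
    return result.stdout

# Inside a Spark transformation this would be applied per gather, e.g.:
#   processed = gathers.mapValues(lambda data: run_su_program(["sugain", "agc=1"], data))
```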
A SeisSpark pipeline is a chain of SU programs wrapped in Python and executed on Spark. The SeisSpark Service translates a SeisSpark pipeline into Spark RDD operations. Currently only sequential chains are supported, but there are plans to extend SeisSpark to full DAG support.
A SeisSpark module is a node in a SeisSpark pipeline. Each SeisSpark module is translated into at least one Spark transformation. Most SeisSpark modules use SU programs for the data transformation, but several modules are implemented directly in pyspark for performance. Each module describes its own parameters schema (JSON Schema), and changing the parameters changes the module's results.
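For illustration only, a parameters schema for a hypothetical band-pass filter module wrapping the SU `sufilter` program might look like the following; the field names and defaults are assumptions, not the actual SeisSpark schemas:

```python
# Hypothetical JSON Schema for a band-pass filter module's parameters.
BANDPASS_PARAMS_SCHEMA = {
    "type": "object",
    "properties": {
        "f": {
            "type": "array",
            "items": {"type": "number"},
            "description": "Corner frequencies in Hz, passed to sufilter as f=...",
            "default": [10, 15, 45, 55],
        },
        "amps": {
            "type": "array",
            "items": {"type": "number"},
            "description": "Amplitudes at the corner frequencies, passed as amps=...",
            "default": [0, 1, 1, 0],
        },
    },
    "additionalProperties": False,
}
```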
SeisSpark uses Spark RDDs to represent the data state at each step of the pipeline. Put simply, each SeisSpark module receives an RDD and produces another one, and each RDD is a collection of key-value pairs.
The RDD value is a list of seismic traces, and the RDD key is a gather id (or ensemble id). In most cases the key is a trace header value shared by all traces in the list.
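To illustrate how such a key can be derived (a sketch with made-up header values, not SeisSpark's actual trace representation), traces sharing a header value such as `cdp` can be grouped into gathers keyed by that value:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Made-up traces represented as dicts; in SeisSpark the traces are SU records.
traces = [
    {"cdp": 100, "samples": [0.0, 0.1, 0.2]},
    {"cdp": 100, "samples": [0.3, 0.4, 0.5]},
    {"cdp": 101, "samples": [0.6, 0.7, 0.8]},
]

# Key each trace by the shared header value, then group into (gather_id, [traces]).
gather_rdd = (
    sc.parallelize(traces)
    .keyBy(lambda trace: trace["cdp"])
    .groupByKey()
    .mapValues(list)
)
print(gather_rdd.collect())
```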
A SeisSpark deployment consists of two major components:
- SeisSpark Service
- Apache Spark cluster
The SeisSpark Service is an HTTP (mostly RESTful) service that allows building and managing SeisSpark pipelines. Gather data can be requested from the SeisSpark Service by gather key; the Spark computation is triggered internally.
There is an auto-generated Python client for the SeisSpark Service; see the API documentation.
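The generated client's interface is not reproduced here, but conceptually a gather request is a plain HTTP call; the endpoint path and port below are hypothetical placeholders, not the documented API:

```python
import requests

BASE_URL = "http://localhost:8080"  # hypothetical address of the SeisSpark Service

# Requesting a gather by its key; the corresponding Spark computation is
# triggered on the service side before the data is returned.
response = requests.get(f"{BASE_URL}/pipelines/demo/modules/agc/gathers/1001")
response.raise_for_status()
gather = response.json()
```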
TODO
- Start Standalone container

```shell
cd docker
docker-compose build
docker-compose up
```
TODO