
SeisSpark

This project experiments with using Apache Spark for seismic data processing. The longer-term goal is to run heavy seismic imaging pipelines on large-scale resources and deliver the results in "real time".

Spark

"Apache Spark is a unified analytics engine for large-scale data processing."

The main idea of the SeisSpark project is to represent the seismic data processing graph as a DAG of RDD operations in Apache Spark. The seismic data is represented as a key-value RDD, where the key is the gather id and the value is the gather data.
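A minimal sketch of this idea (not the actual SeisSpark code; the gather ids and byte payloads are made up for illustration):

```python
# Seismic data as a key-value RDD: key = gather id, value = gather data
# (for example, the gather's traces encoded in the SU format).
from pyspark import SparkContext

sc = SparkContext(appName="gather-rdd-sketch")

# Hypothetical input: (gather_id, traces) pairs.
gathers = sc.parallelize([
    (1, b"<su traces for gather 1>"),
    (2, b"<su traces for gather 2>"),
])

# Processing steps become RDD transformations keyed by gather id.
processed = gathers.mapValues(lambda traces: traces)  # identity placeholder
print(processed.keys().collect())
```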

Seismic Unix (aka SU)

The wiki page says:

"Seismic Un*x is an open source seismic processing package."

SeisSpark relies on SU commands to perform the actual data transformations and uses the SU data format internally.

Additional information regarding the SU can be found in the SU documentation or in the SU GitHub repo.

The SU programs are invoked as subprocesses, and all data transfer happens via stdin and stdout. In practice this means that the Spark worker nodes must have the SU package installed.
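A minimal sketch of that invocation, assuming SU is installed on the worker and `su_bytes` already holds traces in the SU format (`run_su_program` is a hypothetical helper, not the SeisSpark API):

```python
import subprocess

def run_su_program(args, su_bytes):
    # args e.g. ["sugain", "agc=1"]; input and output are raw SU-format bytes
    # passed through the subprocess's stdin and stdout.
    result = subprocess.run(args, input=su_bytes, stdout=subprocess.PIPE, check=True)
    return result.stdout

# Inside a Spark job this would typically be applied per gather, e.g.:
# processed = gathers.mapValues(lambda su_bytes: run_su_program(["sugain", "agc=1"], su_bytes))
```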

SeisSpark pipeline

A SeisSpark pipeline is a chain of SU programs wrapped in Python and executed on Spark. The SeisSpark Service translates the pipeline into Spark RDD operations. Currently only sequential chains are supported; there are plans to extend SeisSpark with full DAG support.
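Conceptually, translating a sequential pipeline amounts to folding the modules over an input RDD. A sketch with hypothetical helper names (the real translation lives in the SeisSpark Service):

```python
def apply_pipeline(input_rdd, modules):
    # Each module maps an input RDD to an output RDD, e.g. by running an
    # SU program on every gather's traces.
    rdd = input_rdd
    for module in modules:
        rdd = module.apply(rdd)
    return rdd
```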

SeisSpark module

A SeisSpark module is a node in a SeisSpark pipeline. Each module is translated into at least one Spark transformation. Most modules use SU programs for the data transformation, but several are implemented directly with pyspark for performance reasons. Each module describes its own parameter schema (JSON Schema is used), and changing a parameter is expected to change the module's results.
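A sketch of what such a module could look like, wrapping the SU program `sugain` (hypothetical class and schema; the actual SeisSpark module interface may differ):

```python
# JSON Schema describing the module's parameters.
SUGAIN_SCHEMA = {
    "type": "object",
    "properties": {
        "agc": {"type": "integer", "enum": [0, 1], "default": 0},
        "wagc": {"type": "number", "default": 0.5},
    },
}

class SUGainModule:
    schema = SUGAIN_SCHEMA

    def __init__(self, parameters):
        self.parameters = parameters  # validated against `schema` elsewhere

    def apply(self, rdd):
        # Build the SU command line from the parameters and run it per gather;
        # `run_su_program` is the subprocess helper sketched above.
        args = ["sugain"] + [f"{k}={v}" for k, v in self.parameters.items()]
        return rdd.mapValues(lambda su_bytes: run_su_program(args, su_bytes))
```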

SeisSpark data model

SeisSpark uses Spark RDDs to represent the data state at each step of the pipeline. Put simply, each SeisSpark module receives an RDD and produces another one; each RDD is a collection of key-value pairs.

SeisSpark's RDD

The RDD value is a list of seismic traces, and the RDD key is a gather id (or ensemble id). In most cases the key is a trace header value shared by all traces in the list.
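A sketch of one RDD element under this data model (the trace representation below is purely illustrative, not the internal SU-format encoding):

```python
# One RDD element: (gather_id, [trace, trace, ...]), where the gather id is
# taken from a trace header common to all traces in the gather, e.g. `cdp`.
gather_element = (
    101,  # gather id, e.g. the `cdp` header value
    [
        {"headers": {"cdp": 101, "offset": -200}, "samples": [0.0, 0.1, 0.2]},
        {"headers": {"cdp": 101, "offset": 200}, "samples": [0.0, 0.3, 0.1]},
    ],
)
```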

Architecture

A SeisSpark deployment consists of two major components:

  • SeisSpark Service
  • Apache Spark cluster

SeisSpark Service

The SeisSpark Service is an HTTP (mostly RESTful) service that allows building and managing SeisSpark pipelines. Gather data can be requested from the service by gather key; the Spark computation is triggered internally.
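A minimal sketch of such a request (the route and pipeline/module names are hypothetical; consult the API docs for the actual endpoints):

```python
import requests

BASE_URL = "http://localhost:9091"  # port of the standalone container from "Getting started"

# Hypothetical route: fetch one gather from a pipeline module by gather key;
# the service runs the necessary Spark computation internally.
response = requests.get(f"{BASE_URL}/api/v1/pipelines/my_pipeline/modules/sugain/gathers/101")
response.raise_for_status()
gather_data = response.content
```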

SeisSpark Python Client

There is an auto-generated Python client for the SeisSpark Service; see the API documentation.

API docs

Implementation details

TODO

Getting started

  1. Start the standalone container:

     cd docker
     docker-compose build
     docker-compose up

  2. Open http://localhost:9091


Web UI

TODO