To try SparkExceptionLogger locally, this section walks through setting up a local playground that combines:
- Spark Standalone Cluster,
- MinIO Object Storage to store the logs, and
- DuckDB to query the logs from MinIO Bucket.
This playground simulates the environment you would have with any cloud provider. For example, on AWS the equivalent stack would be AWS EMR/Glue for running Apache Spark, AWS S3 as the object storage for the logs, and AWS Athena as the query engine to query them.
To create a Spark Standalone Cluster with MinIO running on Docker, clone the spark-minio-project repo and follow the instructions in its README.md.
This will get you running with a Spark Standalone Cluster and a MinIO service.
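If you want the gist without opening that README, the flow is typically just a clone plus a Docker Compose startup. This is a sketch with a placeholder URL, assuming the repo ships a Compose file; defer to that repo's README for the authoritative steps:
# Clone the playground repo (placeholder URL; use the actual spark-minio-project link)
git clone <repo-url>/spark-minio-project.git
cd spark-minio-project
# Start the Spark master/worker and MinIO containers in the background
docker compose up -d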
To set up DuckDB locally, two commands are enough to get it up and running; you can follow the installation instructions in the DuckDB Official Docs.
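For example, on Linux x86-64 the two commands could be fetching and unpacking the CLI build (pick the archive matching your platform from the DuckDB releases page):
# Download the latest DuckDB CLI (Linux x86-64 build shown; other builds exist)
wget https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip
# Unpack; this leaves a single `duckdb` binary in the current directory
unzip duckdb_cli-linux-amd64.zip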
Once the Spark Cluster and MinIO are up and running:
- Take the `test_spark_logger.py` script and `SparkExceptionLogger.py` present in this repo.
- Place them inside the `spark_apps` folder of the `spark-minio-project` repo you cloned.
- Run the command below to run the script on the Spark Standalone Cluster.
# Running the spark-submit command on the spark-master container
docker exec spark-master spark-submit \
  --master spark://spark-master:7077 \
  --deploy-mode client \
  --jars /opt/extra-jars/hadoop-aws-3.3.4.jar,/opt/extra-jars/aws-java-sdk-bundle-1.12.262.jar \
  --py-files ./apps/test_scripts/SparkExceptionLogger.py \
  ./apps/test_spark_logger.py
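For orientation, here is a minimal sketch of the shape `test_spark_logger.py` could take. This is not the repo's actual code: the import path, class name, and context-manager interface below are assumptions for illustration only; the real API lives in `SparkExceptionLogger.py`.
# Hypothetical sketch only -- the real API is defined in SparkExceptionLogger.py
from pyspark.sql import SparkSession
from SparkExceptionLogger import SparkExceptionLogger  # assumed class name

spark = SparkSession.builder.appName("test_spark_logger").getOrCreate()

# Assumed context-manager style: an exception raised inside the block is
# captured and written as Parquet under s3a://warehouse/logging/
with SparkExceptionLogger(spark, path="s3a://warehouse/logging/"):
    spark.sql("SELECT * FROM table_that_does_not_exist").show()

spark.stop()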
Once the Spark job completes, it's time to query the logs using DuckDB.
To query logs written into the MinIO bucket, a DuckDB `SECRET` needs to be set up, the same way one is required when querying from AWS S3.
You can follow the instructions below or run the statements present in the SQL file.
To set up the DuckDB `SECRET`:
- Start DuckDB in a terminal:
duckdb minio_db.db
- Create a `SECRET` by running the SQL below:
CREATE SECRET minio_secrets (
TYPE S3,
KEY_ID 'admin', -- MINIO_USER
SECRET 'password', -- MINIO_PASSWORD
REGION 'us-east-1',
ENDPOINT '127.0.0.1:9000', -- MINIO_ENDPOINT
USE_SSL false,
URL_STYLE 'path'
);
- Once the `SECRET` is created, run the query below to read the logs that were written (DuckDB's `FROM`-first shorthand is equivalent to `SELECT * FROM ...`):
FROM read_parquet('s3a://warehouse/logging/*/*.parquet', hive_partitioning = true);
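To sanity-check the setup, you can first list the registered secrets with DuckDB's built-in `duckdb_secrets()` function, then slice the logs further. The column name in the second query is an assumption; inspect the actual schema first (for example with `DESCRIBE` over the same `read_parquet()` call):
-- Confirm the secret is registered; duckdb_secrets() is a DuckDB built-in
FROM duckdb_secrets();

-- Hypothetical drill-down: exception_type is an assumed column name,
-- check the real schema of your logs before running
SELECT exception_type, count(*) AS occurrences
FROM read_parquet('s3a://warehouse/logging/*/*.parquet', hive_partitioning = true)
GROUP BY exception_type
ORDER BY occurrences DESC;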
You can modify `test_spark_logger.py` as you like to play around with your specific scenario.