I am a self-driven, proactive Principal Data Engineer with a keen focus on optimizing and architecting robust, scalable data solutions. I am a developer by ❤️.
I love turning solutions into generalized, robust, scalable, and cost-optimized frameworks so that they can be used by a wider audience.
I write a weekly tech newsletter called The Pragmatic Data Engineer's Playbook. Its main purpose is to help people become better Data Engineers by taking their skills to the next level.
PDEP covers deep dives on Data Engineering tech, Distributed Data Systems, Optimization Techniques, and Data Architecture.
- Apache Spark
- AWS
- Apache Airflow
- Apache Iceberg
- Apache Hudi
- Apache Arrow
- Apache Flink
- Apache Kafka
- Databricks
- Delta Lake
- Python
- Java
- Rust (learning)
- SparkExceptionLogger - A lightweight, easy-to-integrate custom Spark exception logger written in Python. It logs all exception details from a Spark job to an S3 location or a table that can be queried by any query engine, such as Athena, Trino, or DuckDB.
- Concepts-Library - A collection of practical examples and implementations showcasing key concepts for Apache Spark and Apache Airflow.
- spark-minio-project - Builds a Spark Standalone Cluster locally on Docker with MinIO integration.
- easy-alterator - A utility for altering v1 Parquet external tables that use the AWS Glue Catalog as the Hive metastore. Implemented using AWS Boto3.
- backfeed-generator - Airflow workflow for generating a gzipped CSV feed file from a Hive/Athena table, along with a checksum, DDL, and control file. Implemented with Python, Apache Spark, and a Bash script.
- WAP-Implementation - Write-Audit-Publish (WAP) data quality pattern implementation using Apache Spark with Apache Iceberg tables on AWS with the AWS Glue Catalog, for both Iceberg versions < 1.2.0 and >= 1.2.0. Also includes auditing data using AWS PyDeequ.
- pydeequ-on-aws - Contains code examples for PyDeequ to test your data quality at scale. Covers all the components present in PyDeequ.
- otf-concepts-via-code - Contains code examples for understanding different concepts of Open Table Formats like Apache Iceberg with Apache Spark.
- athena-view-boto3 - Contains utility code to generate Athena views programmatically. Currently, there is no direct Boto3 API for creating Athena views.
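
To give a flavor of the last item: since Boto3 has no dedicated call for creating Athena views, one common workaround is to submit a `CREATE OR REPLACE VIEW` statement through the Athena client's `start_query_execution`. The snippet below is a minimal sketch of that idea, not necessarily how athena-view-boto3 does it. The helper names `build_view_ddl` and `create_view`, and all database, view, and bucket names, are hypothetical.

```python
import re
from typing import Any

# Conservative identifier check (an assumption of this sketch, stricter than Athena itself).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_view_ddl(view_name: str, select_sql: str) -> str:
    """Build a CREATE OR REPLACE VIEW statement around a SELECT query."""
    if not _IDENTIFIER.fullmatch(view_name):
        raise ValueError(f"Unsupported view name: {view_name!r}")
    # Drop surrounding whitespace and any trailing semicolon from the query.
    body = select_sql.strip().rstrip(";").rstrip()
    return f"CREATE OR REPLACE VIEW {view_name} AS\n{body}"


def create_view(
    athena_client: Any,
    database: str,
    view_name: str,
    select_sql: str,
    output_location: str,
) -> str:
    """Submit the view DDL to Athena and return the query execution id.

    `athena_client` is expected to behave like `boto3.client("athena")`.
    """
    response = athena_client.start_query_execution(
        QueryString=build_view_ddl(view_name, select_sql),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

With real credentials, this would be called as `create_view(boto3.client("athena"), "my_db", "v_orders", "SELECT ...", "s3://my-bucket/athena-results/")`. The call only submits the query; Athena runs it asynchronously, so a caller would typically poll `get_query_execution` with the returned id to confirm that the view was created.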