A platform-agnostic, containerized ETL pipeline that supports data transformation interchangeably between CSV, JSON, Avro, and Parquet formats.
Bumblebee uses a client-server architecture. The client sends a request to the server, which performs the data transformations and generates output files in the desired formats.
Communication takes place using a REST API.
Provides endpoints and listens for incoming API requests; a rough sketch of the server side follows the endpoint list below.
- logging
  - POST: initiates a conversion
  - GET: real-time status of the conversion process
- signup
  - POST: generates a JWT authentication token
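The sketch below shows how a server might expose these endpoints. It is purely illustrative and assumes a Flask-style implementation; Bumblebee's actual server code, response shapes, and token handling may differ.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/signup", methods=["POST"])
def signup():
    # Issue a JWT authentication token for the supplied name and email
    # (token generation itself is omitted in this sketch).
    data = request.get_json()
    return jsonify({"token": "<JWT for " + data["email"] + ">"})

@app.route("/logging", methods=["POST", "GET"])
def logging():
    if request.method == "POST":
        # Start the conversion described by the JSON payload.
        return jsonify({"status": "conversion started"})
    # GET: report the real-time status of the conversion process.
    return jsonify({"status": "in progress"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```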
Bumblebee accepts a JSON payload, sent along with the POST request, that describes the desired conversion types, the source and destination, and the transformations to be carried out.
{
    // Required
    // Specify the source of the files
    // Specify the output destination of the files
    "DEFAULT": {
        "source": "<source bucket/URL>",
        "upload": "<destination bucket>"
    },
    // Required
    // Add the required conversion formats
    // Multiple conversion formats can be specified from the below list
    "CONVERT": {
        "formats": [
            "avro_to_parquet",
            "avro_to_csv",
            "avro_to_json",
            "parquet_to_csv",
            "parquet_to_avro",
            "parquet_to_json",
            "json_to_csv",
            "json_to_parquet"
        ]
    },
    // Required
    // Add the list of the files to be processed
    // These can be in .csv, .parquet, .json or .avro format
    "FILES": {
        "files": [
            "<file1>",
            "<file2>"
        ]
    },
    // Optional
    // Add the below column section in the JSON if filtering out columns is necessary
    // Add the required column names in the list
    "COLUMNS": {
        "columns": [
            "<column1>",
            "<column2>"
        ]
    },
    // Optional
    // Add the below section in the JSON if a conversion from CSV to Avro is required
    // This section holds the schema for the Avro file
    "SCHEMA": {
    }
}
Transforms the JSON payload into a configuration file, which is then used to drive the transformation logic.
Parses the configuration file and calls the respective functions to carry out the transformations (see the sketch below).
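The parsing and dispatch step can be pictured roughly as in the sketch below. This is a minimal illustration, not Bumblebee's actual code: the converter functions, the use of pandas, and the config keys (mirroring the payload above) are assumptions, and details such as matching files to formats and column filtering are omitted.

```python
import pandas as pd

# Hypothetical converters for a few of the supported format pairs.
def parquet_to_csv(path):
    pd.read_parquet(path).to_csv(path.rsplit(".", 1)[0] + ".csv", index=False)

def json_to_csv(path):
    pd.read_json(path).to_csv(path.rsplit(".", 1)[0] + ".csv", index=False)

def json_to_parquet(path):
    pd.read_json(path).to_parquet(path.rsplit(".", 1)[0] + ".parquet", index=False)

# Dispatch table keyed by the format strings accepted in the payload.
CONVERTERS = {
    "parquet_to_csv": parquet_to_csv,
    "json_to_csv": json_to_csv,
    "json_to_parquet": json_to_parquet,
}

def run_conversions(config):
    """Walk the parsed configuration and apply each requested conversion to each file."""
    for fmt in config["CONVERT"]["formats"]:
        for path in config["FILES"]["files"]:
            CONVERTERS[fmt](path)
```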
- Users need to provide a source bucket containing the files they want converted, or a URL pointing to the files
- An output destination bucket is required
Pull the Bumblebee image from Docker Hub:
Image name: kopalc/bumblebee, image tag: 3.0
docker pull kopalc/bumblebee:3.0
Start the container:
docker run -d --privileged -p <port>:8080 kopalc/bumblebee:3.0
This starts the Bumblebee server.
- Token-based authentication has been implemented as a security measure: users need to generate a JWT token to establish their identity.
- Send a POST request to the signup endpoint, using name and email as keys; the token is returned in the response.
http://<IP Address>:<Port>/signup
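For example, with Python's requests library (whether the endpoint expects a JSON body and how the token is returned in the response are assumptions in this sketch):

```python
import requests

BASE_URL = "http://<IP Address>:<Port>"  # replace with your server's address

# name and email are the keys used while generating the token
resp = requests.post(f"{BASE_URL}/signup", json={"name": "Jane Doe", "email": "jane@example.com"})
token = resp.text  # JWT token returned by the server; keep it for the conversion request
print(token)
```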
- Send a POST request to the logging endpoint with the JSON payload describing the desired conversion (see the payload format above).
- Copy the token generated by the signup request and pass it as a header along with the POST request, as in the sketch below.
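A hedged sketch of the conversion request, again using Python's requests library. The header name carrying the token (x-access-token here) and the shape of the responses are assumptions; adjust them to match your deployment:

```python
import requests

BASE_URL = "http://<IP Address>:<Port>"   # replace with your server's address
token = "<JWT token from the signup step>"

# Minimal payload: convert one Parquet file to CSV (placeholders as in the payload above).
payload = {
    "DEFAULT": {"source": "<source bucket/URL>", "upload": "<destination bucket>"},
    "CONVERT": {"formats": ["parquet_to_csv"]},
    "FILES": {"files": ["<file1>"]},
}

headers = {"x-access-token": token}  # assumed header name for the JWT token

# POST initiates the conversion.
resp = requests.post(f"{BASE_URL}/logging", json=payload, headers=headers)
print(resp.status_code, resp.text)

# GET reports the real-time status of the conversion process.
status = requests.get(f"{BASE_URL}/logging", headers=headers)
print(status.text)
```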
Successfully converted files are uploaded to the output bucket. You should see a success message when the process completes.
- The present scope of this project supports transformation between CSV, Avro, Parquet, and JSON formats.
- It offers customization by letting users specify a list of columns to keep in the output; all other columns are excluded.
- Supported data sources: GCS bucket and GitHub repo
- Supported destinations: GCS bucket
- Build support for other clouds and local storage: Bumblebee is currently supported on the Google Cloud Platform. The plan is to extend support to other cloud platforms, MinIO, and local storage.
- Enhanced user customizations: users will be able to specify actions to be performed on columns to shape the output data, for instance replacing all the NULL values in a particular column with a chosen value.
- Build support for more file formats: Bumblebee currently supports interconversion between CSV, Avro, Parquet, and JSON. The plan is to include other formats over time.
- Redis queue for continuous processing of large datasets: adding Redis to the architecture would enable handling multiple requests simultaneously and processing large files, since multiple users may be using the tool at once.