Skip to content

Latest commit

 

History

History
208 lines (128 loc) · 7.12 KB

README.md

File metadata and controls

208 lines (128 loc) · 7.12 KB

Bumblebee

A platform agnostic, containerized ETL pipeline which supports data transformation interchangeably between CSV, JSON, AVRO and PARQUET formats.

bumblebee.mp4

Key Topics

Architecture

Bumblebee uses a client-server architecture. The client sends a request to the server which performs data transformations and generate output files in the desired formats. Communication takes place using a REST API. Picture 1

Server and Endpoints

Provides endpoints and listens to incoming API requests.

Endpoints:

  • logging
    • POST: Initiates conversion
    • GET : Realtime status of the conversion process
  • signup
    • POST: Generate JWT authentication token

Payload/Input

Bumblebee accepts a JSON payload sent along with the POST request that has information about the desired conversion types, the source and destination, and the desired transformations to be carried out.

JSON Payload:

{
    // Required 
    // Specify the Source of the files
    // Specify the output destination of the files

    "DEFAULT": {
      "source": "<source bucket/URL>",
      "upload": "<destination bucket>"
    },

    // Required
    // Add the required conversion formats 
    // Multiple conversion formats can be specified from the below list

    "CONVERT": {
      "formats": [
        "avro_to_parquet",
        "avro_to_csv",
        "avro_to_json",
        "parquet_to_csv",
        "parquet_to_avro",
        "parquet_to_json",
        "json_to_csv",
        "json_to_parquet"
      ]
    },

    // Required
    // Add the list of the files to be processed
    //These can be in .csv, .parquet, .json or .avro format

    "FILES": {
      "files": [
        "<file1>",
        "<file2>"
      ]
    },

    // Optional
    // Add the below column section in the JSON if filtering out columns is necessary
    // Add the required column names in the list

    "COLUMNS": {
      "columns":  [
      "<column1>",
      "<column2>"
        ]
    },

    // Optional
    // Add the below section in the JSON if a conversion from CSV to Avro is required
    // This section holds the schema for the Avro file

    "SCHEMA": {
    
  }
}

Parser

Transforms the JSON payload into a configuration file, which is further used to run the transformation logic.

Converter(Main function)

Parses the configuration file and calls resepctive functions to carry out the transformations.

Tools and Technologies

Setup

Source Bucket/Github Repo and Output Bucket

  • The users need to input a source bucket which contains the files they want to be converted, or the URL with the files
  • An output destination is required

Pasted Graphic 5

Bumblebee Server

Pull the Bumblebee image from DockerHub:

Note: Ensure docker login is successful

Image Name: kopalc/bumblebee Image tag: 3.0

docker pull kopalc/bumblebee:3.0

Start the container:

docker run -d –privileged -p<port>:8080 kopalc/bumblebee:3.0

This would start the bumblebee server

JWT Token generation

  • Token based authentication has been implemented as a security measure. The users need to generate a JWT token to establish identity.
  • Hit a POST request to the signup endpoint. The token would be received as output. Use name and email as keys while generating the token.
http://<IP Address>:<Port>/signup

Pasted Graphic 4

Initiate Conversion

  • Send a POST request to the logging endpoint with the JSON payload which contains various desired attributes to the URL.
  • Copy the token generated from the above signup request and pass it as header along with the POST request.

Pasted Graphic 6

Pasted Graphic 7

The files converted successfully are uploaded to the output bucket. You should see a Success message upon process completion.

Pasted Graphic 9

Pasted Graphic 10

Pasted Graphic 11

Current Scope

  • The present scope of this project supports transformation between CSV, Avro, Parquet and JSON formats.

  • It offers customization by enabling users to input a list of the columns desired in the output and excludes the rest.

  • Supported data sources: GCS bucket and GitHub Repo

  • Supported destination sources: GCS bucket

Future Scope

  • Build support for other clouds and local storage: Bumblebee is supported on the Google Cloud Platform as of now. The plan is to extend the support to other cloud platforms, minio and local storage as well.

  • Enhanced/Increased number of user customizations: The user would be able to specify features and the actions to be performed on the columns to get the desired data. For instance, replacing all the NULL values in a particular column with the required value.

  • Build support for more file formats: Bumblebee currently supports interconversion between CSV, Avro, Parquet and JSON. The plan is to include other formats as well over time.

  • Redis Queue for continuous processing of large datasets: Including Redis in the architecture would enable handling multiple requests simultaneously and processing large files, since multiple users may be using the tool at once.