This project was developed as part of a test case created by a German consultancy.
The objective was to implement an API that handles POST requests with JSON data attached. The API is supposed to store the raw data in an appropriate storage solution, apply data transformations, calculate statistics, and write the results to a database.
The API was implemented in Python using Flask.
It has a single endpoint, /upload, which accepts JSON files and uploads them to a data lake and a data warehouse.
Google Cloud Storage serves as the data lake for raw data, and BigQuery serves as the data warehouse.
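For illustration, here is a minimal sketch of what such an endpoint can look like. The bucket and table names are placeholders rather than the project's actual configuration, and the real app.py may differ in its details:

```python
# Minimal sketch of the /upload endpoint; bucket and table names are placeholders.
import json
import uuid

from flask import Flask, jsonify, request
from google.cloud import bigquery, storage

app = Flask(__name__)

BUCKET_NAME = "example-raw-data-bucket"   # hypothetical data lake bucket
BQ_TABLE = "example-project.raw.events"   # hypothetical warehouse table

storage_client = storage.Client()
bq_client = bigquery.Client()


@app.route("/upload", methods=["POST"])
def upload():
    # Assumes the request body is a single JSON object.
    payload = request.get_json(silent=True)
    if payload is None:
        return jsonify({"error": "request body must be valid JSON"}), 400

    # Store the raw payload in the data lake (Google Cloud Storage).
    blob_name = f"raw/{uuid.uuid4()}.json"
    storage_client.bucket(BUCKET_NAME).blob(blob_name).upload_from_string(
        json.dumps(payload), content_type="application/json"
    )

    # Load the same record into the data warehouse (BigQuery).
    errors = bq_client.insert_rows_json(BQ_TABLE, [payload])
    if errors:
        return jsonify({"error": "BigQuery insert failed", "details": errors}), 500

    return jsonify({"status": "ok", "blob": blob_name}), 201
```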
The API is containerized using Docker.
The Docker image is pushed to Google Artifact Registry and then deployed via Google Cloud Run.
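As a rough sketch, a Dockerfile for a Flask API on Cloud Run typically looks like the following; this assumes gunicorn is listed in requirements.txt, and the actual Dockerfile in this repo may differ:

```dockerfile
# Sketch of a typical Cloud Run Dockerfile for a Flask app.
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Cloud Run injects the listening port via the PORT environment variable.
CMD exec gunicorn --bind :${PORT:-8080} --workers 1 --threads 8 app:app
```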
I used Dataform for the data transformations.
It is a fully managed, scalable data transformation service within BigQuery.
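A Dataform transformation is defined in a .sqlx file. The sketch below is hypothetical (the source table name and output schema are assumptions), but it illustrates the pattern used by the models under definitions/:

```sqlx
config {
  type: "table",
  description: "Daily record counts derived from the raw uploads table"
}

-- Hypothetical model: aggregates the raw upload table by day.
SELECT
  DATE(ingested_at) AS upload_date,
  COUNT(*) AS records
FROM ${ref("raw_events")}
GROUP BY upload_date
```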
To schedule the data transformations, either Workflows with Cloud Scheduler or Cloud Composer (Airflow) could be used.
- app.py - Flask application with the API.
- Dockerfile - container image definition for the Flask API.
- API Documentation (Swagger).html - html-file with API documentation.
- requirements.txt - Python dependencies.
- utils/ - helper modules used by the API.
- definitions/ - data transformations implemented in DataForm.
The API documentation can be found in API Documentation (Swagger).html or via this link: API Documentation
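For a quick smoke test, the endpoint can be called as shown below; the Cloud Run URL and the payload are placeholders:

```python
# Hypothetical smoke test for the /upload endpoint; the service URL is a placeholder.
import requests

resp = requests.post(
    "https://example-service-abc123-ew.a.run.app/upload",
    json={"id": 1, "value": 42.0},
    timeout=30,
)
print(resp.status_code, resp.json())
```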