
Retail chain data pipeline for Analytics and Reporting

Project Overview

Prerequisites

  • Docker
  • Python 3.9 or later

Technologies Used

  • Python
  • Airflow
  • HDFS
  • Spark
  • Hive
  • Metabase
  • MySQL, Postgres

Data Modeling

Source database schema:

Star Schema:
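To make the star-schema layout concrete, here is a minimal, self-contained sketch of a fact table joined to its dimensions. The table and column names (fct_sales, dim_product, dim_date) are illustrative assumptions, not the project's actual schema, and SQLite stands in for Hive/Spark SQL so the snippet runs anywhere:

```python
import sqlite3

# Build a tiny in-memory star schema: one fact table, two dimensions.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);
CREATE TABLE fct_sales   (product_key INTEGER, date_key INTEGER, quantity INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Cola', 'Beverages'), (2, 'Chips', 'Snacks');
INSERT INTO dim_date    VALUES (20240101, '2024-01-01', 2024);
INSERT INTO fct_sales   VALUES (1, 20240101, 3, 4.5), (2, 20240101, 2, 3.0);
""")

# A typical analytics query: join the fact to its dimensions, aggregate.
rows = cur.execute("""
SELECT p.category, d.year, SUM(f.amount) AS revenue
FROM fct_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d    ON f.date_key = d.date_key
GROUP BY p.category, d.year
ORDER BY p.category
""").fetchall()
print(rows)  # -> [('Beverages', 2024, 4.5), ('Snacks', 2024, 3.0)]
```

The same join pattern applies unchanged in Spark SQL against the Hive tables built by the pipeline.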

Getting Started

Infrastructure setup:

  1. Clone the project repository
git clone <link.com>
  2. Navigate to the project directory
cd RetailChainDatawarehouse
  3. Build the hadoopbase Docker image
make build-hadoopbase
  4. Start up the infrastructure
make up && make setup

Airflow Setup

Go to http://localhost:8081 to access the Airflow web UI and log in with:

  • username: airflow
  • password: airflow

Go to Admin -> Connections and create a new Spark connection with the following values:
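The exact values depend on the service names in the project's docker-compose setup; a typical configuration for a Spark standalone master (the connection id, host, and port below are assumptions based on Spark's defaults, not values confirmed by this repository) looks like:

```
Connection Id:   spark_default
Connection Type: Spark
Host:            spark://spark-master
Port:            7077
```

If the DAG references a different connection id, use that id instead.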

Go to the DAGs tab and click the trigger button on the daily_pipeline DAG to run the pipeline.

Metabase Dashboard

Start the Spark Thrift Server:

make start-thift

Go to http://localhost:4000 to access the Metabase web UI and register a new account.

Set up the Spark Thrift Server connection in Metabase:
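In Metabase, add a new database using the Spark SQL driver. The host and port below are assumptions based on the Thrift Server's default listening port (10000) running on the same machine; adjust them to match your deployment:

```
Database type: Spark SQL
Host:          localhost
Port:          10000
Database name: default
```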

Access the dashboard:

Other Services
