Basic Pipeline for Exploratory Data Analysis

Business Problem

In this section, we will create a basic pipeline for exploratory data analysis (EDA) that can be applied to various datasets through a single main function.

Kaggle Link of the Notebook

Data sources are listed under the Examples on Datasets heading.

Current Features

Importing dataset and creating a DataFrame instance.
Presenting general information about the dataset (e.g. head, shape, description).
Separating the categorical and the numerical features of the dataset.
Analyzing the categorical and the numerical features of the dataset.
Visualizing the categorical and the numerical features of the dataset.
Analyzing the target for each feature.
Analyzing the highly correlated features to improve the DataFrame.
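
The steps listed above can be illustrated with a minimal sketch. It is not the notebook's actual code: the function names (`split_columns`, `high_correlation_pairs`, `target_summary`), the `cat_threshold` cutoff, and the 0.90 correlation threshold are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def split_columns(df, cat_threshold=10, skip=()):
    """Split column names into categorical and numerical lists.

    cat_threshold is a hypothetical cutoff: a numeric column with fewer
    unique values than this is treated as categorical.
    """
    cat_cols, num_cols = [], []
    for col in df.columns:
        if col in skip:  # e.g. identifier columns such as id, user_id
            continue
        is_numeric = pd.api.types.is_numeric_dtype(df[col])
        is_bool = pd.api.types.is_bool_dtype(df[col])
        if is_numeric and not is_bool and df[col].nunique() >= cat_threshold:
            num_cols.append(col)
        else:
            cat_cols.append(col)
    return cat_cols, num_cols


def high_correlation_pairs(df, num_cols, threshold=0.90):
    """Return (col_a, col_b, |r|) for numeric pairs correlated above threshold."""
    corr = df[num_cols].corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    pairs = []
    for col_b in upper.columns:
        for col_a in upper.index:
            value = upper.loc[col_a, col_b]
            if pd.notna(value) and value > threshold:
                pairs.append((col_a, col_b, float(value)))
    return pairs


def target_summary(df, target, cat_cols):
    """Mean of the target within each category of every categorical feature."""
    return {
        col: df.groupby(col)[target].mean()
        for col in cat_cols
        if col != target
    }
```

A call such as `cat_cols, num_cols = split_columns(df, skip=("id",))` followed by `high_correlation_pairs(df, num_cols)` would cover the separation and correlation steps. `target_summary` assumes a numeric or 0/1 target.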

Possible Future Improvements

Printing better reports to improve legibility.
Improving accessibility for the user.
~~Bypassing irrelevant features from the analysis (e.g. id, user_id).~~ COMPLETED
~~Categorizing numerical targets for better target analysis.~~ COMPLETED
Encoding incompatible data types.
Handling missing values in the dataset.
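
As a rough illustration of two items in this list, the sketch below bins a numerical target into quantile groups and reports missing values per column. Both helpers (`bin_numeric_target`, `missing_report`) and the choice of `pd.qcut` with four bins are assumptions, not the notebook's implementation.

```python
import pandas as pd


def bin_numeric_target(df, target, n_bins=4):
    """Return a copy of df with a quantile-binned version of a numeric target."""
    out = df.copy()
    # duplicates="drop" merges identical bin edges, which occur with many ties.
    out[f"{target}_bin"] = pd.qcut(out[target], q=n_bins, duplicates="drop")
    return out


def missing_report(df):
    """Count and percentage of missing values per column, largest first."""
    counts = df.isna().sum()
    report = pd.DataFrame(
        {"n_missing": counts, "pct_missing": 100 * counts / len(df)}
    )
    return report[report["n_missing"] > 0].sort_values(
        "n_missing", ascending=False
    )
```

The binned column can then be used as a categorical target in the per-feature target analysis, and the report shows which columns would need imputation or removal before further analysis.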

Examples on Datasets

The pipeline is currently demonstrated on the following datasets, in order:

Breast Cancer Dataset

Description:
Breast cancer is the most common cancer amongst women in the world. It accounts for 25% of all cancer cases, and affected over 2.1 million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen via X-ray or felt as lumps in the breast area.

Diabetes Dataset

Description:
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Hitters Dataset

Description:
This dataset was originally taken from the StatLib library which is maintained at Carnegie Mellon University. This is part of the data that was used in the 1988 ASA Graphics Section Poster Session. The salary data were originally from Sports Illustrated, April 20, 1987. The 1986 and career statistics were obtained from The 1987 Baseball Encyclopedia Update published by Collier Books, Macmillan Publishing Company, New York.
