This project demonstrates a hands-on approach to customer segmentation using KMeans Clustering in Python. The goal is to provide actionable insights into customer behavior for an online retail business. By leveraging the RFM (Recency, Frequency, Monetary) model and clustering techniques, we classify customers into meaningful groups to aid business decision-making.
- Data Exploration: Handling missing data and outliers, and inspecting key fields such as Invoice Numbers, Stock Codes, and Customer IDs.
- Data Cleaning: Filtering out invalid or irrelevant data points to ensure accurate analysis.
- Feature Engineering: Calculating RFM features from transactional data.
- Clustering Analysis: Applying KMeans clustering and evaluating results using the Elbow Method and Silhouette Scores.
- Visualization: Presenting insights through 3D scatter plots, violin plots, and labeled clusters.
- Dataset: Online Retail II Dataset (2009-2011 UK-based retail transactions).
- Tools Used:
  - Python
  - Pandas, NumPy
  - Matplotlib, Seaborn
  - Scikit-learn
  - OpenPyXL
- Exploratory Data Analysis (EDA):
  - Understanding data structure (columns like Invoice Number, Stock Code, Quantity, Unit Price, Customer ID, etc.).
  - Dealing with missing values and inconsistencies (e.g., negative prices or quantities).
- Data Cleaning:
  - Removing null Customer IDs and irrelevant invoice types (e.g., cancellations or accounting entries).
  - Addressing outliers in Quantity, Unit Price, and Stock Codes.
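The cleaning rules above might be sketched as follows. This is a minimal illustration on a tiny hand-made DataFrame, not the notebook's actual code. The column names (`Invoice`, `StockCode`, `Quantity`, `Price`, `Customer ID`), the six-digit invoice pattern, and the five-digit stock code pattern are all assumptions made for the example, and `clean_transactions` is a hypothetical helper.

```python
import pandas as pd

# Toy transactions; column names are assumed, not taken from the notebook.
raw = pd.DataFrame({
    "Invoice": ["489434", "489435", "C489449", "A506401", "489436", "489437"],
    "StockCode": ["85048", "22041", "22087", "B", "21232", "POST"],
    "Quantity": [12, 6, -12, 1, 24, 1],
    "Price": [6.95, 2.10, 2.95, -53594.36, 1.25, 18.00],
    "Customer ID": [13085.0, 13085.0, 16321.0, None, None, 12682.0],
})


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only regular sales rows that can be attributed to a customer."""
    out = df.copy()
    out["Invoice"] = out["Invoice"].astype(str)
    out["StockCode"] = out["StockCode"].astype(str)

    # Assumption: regular invoices are exactly six digits, so a letter prefix
    # (such as "C" for cancellations or "A" for accounting entries) is dropped.
    regular_invoice = out["Invoice"].str.match(r"^\d{6}$")
    # Assumption: product codes are five digits with optional trailing letters.
    product_code = out["StockCode"].str.match(r"^\d{5}[A-Za-z]*$")
    # Rows without a customer cannot contribute to per-customer features.
    has_customer = out["Customer ID"].notna()
    # Negative or zero quantities and prices are treated as invalid.
    positive_values = (out["Quantity"] > 0) & (out["Price"] > 0)

    mask = regular_invoice & product_code & has_customer & positive_values
    return out.loc[mask].reset_index(drop=True)


cleaned = clean_transactions(raw)
print(cleaned)
```

Expressing each rule as its own boolean mask keeps the filters easy to inspect one at a time, for example by summing a mask to see how many rows it would keep.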
- Feature Engineering:
  - Creating RFM metrics:
    - Recency: Days since the last purchase.
    - Frequency: Number of purchases.
    - Monetary: Total spending.
  - Scaling data to standardize features for clustering.
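As a rough sketch of the RFM calculation and scaling step, the example below aggregates a toy transaction table per customer and standardizes the result. The column names, the `LineTotal` helper column, and the choice of the latest invoice date as the reference point for Recency are assumptions for illustration; the notebook may define them differently.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy cleaned transactions; column names are assumptions for illustration.
tx = pd.DataFrame({
    "Customer ID": [1, 1, 2, 3, 3, 3],
    "Invoice": ["100001", "100002", "100003", "100004", "100005", "100005"],
    "InvoiceDate": pd.to_datetime([
        "2010-01-05", "2010-03-01", "2010-02-10",
        "2010-01-20", "2010-03-09", "2010-03-09",
    ]),
    "Quantity": [2, 1, 10, 3, 1, 4],
    "Price": [5.0, 20.0, 1.5, 2.0, 10.0, 2.5],
})
tx["LineTotal"] = tx["Quantity"] * tx["Price"]

# Assumption: Recency is measured against the latest invoice date in the data.
snapshot = tx["InvoiceDate"].max()

rfm = tx.groupby("Customer ID").agg(
    LastPurchase=("InvoiceDate", "max"),   # most recent purchase date
    Frequency=("Invoice", "nunique"),      # number of distinct invoices
    Monetary=("LineTotal", "sum"),         # total spending
)
rfm["Recency"] = (snapshot - rfm["LastPurchase"]).dt.days
rfm = rfm[["Recency", "Frequency", "Monetary"]]

# Standardize each feature to zero mean and unit variance so that no single
# feature dominates the Euclidean distances used by KMeans.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(rfm),
    index=rfm.index,
    columns=rfm.columns,
)

print(rfm)
print(scaled.round(2))
```

Counting distinct invoices rather than rows keeps a multi-line order from inflating Frequency.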
- Clustering Analysis:
  - Using the Elbow Method to determine the optimal number of clusters.
  - Validating cluster quality with Silhouette Scores.
  - Assigning meaningful labels to clusters (e.g., Retain, Reward, Nurture, Re-engage).
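The cluster selection step could look roughly like the sketch below, which records the inertia (the quantity plotted in an elbow chart) and the silhouette score for a range of `k`. The data here is synthetic, generated with four deliberately well-separated groups as a stand-in for the scaled RFM matrix, so the range of `k` and the resulting choice are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled RFM matrix: 3 features, 4 separated groups.
centers = [[0, 0, 0], [6, 6, 0], [0, 6, 6], [6, 0, 6]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.6, random_state=42)

inertias = {}
silhouettes = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X)
    inertias[k] = model.inertia_                  # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X, labels)  # mean silhouette over all points

for k in inertias:
    print(f"k={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")

# The elbow is usually read off a plot of inertia against k; here the
# silhouette score is used to pick k programmatically.
best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest silhouette score:", best_k)
```

On real RFM data the two criteria often disagree, so the final choice of `k` still involves judging which segmentation is easiest to interpret and act on.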
- Visualization:
  - 3D scatter plots to visualize clusters.
  - Violin plots for distribution analysis of RFM metrics.
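A minimal sketch of the two plot types is shown below. It uses plain Matplotlib (rather than Seaborn) to keep the example short, and the RFM values and cluster labels are randomly generated placeholders, not results from the dataset. The output filename `rfm_clusters.png` is likewise hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection on older Matplotlib)

# Placeholder RFM values and cluster labels, generated at random for illustration.
rng = np.random.default_rng(0)
n = 150
labels = rng.integers(0, 3, size=n)
recency = rng.gamma(2.0, 30.0, size=n)
frequency = rng.poisson(3, size=n) + 1
monetary = rng.gamma(2.0, 200.0, size=n)

fig = plt.figure(figsize=(12, 5))

# 3D scatter: one point per customer, colored by cluster.
ax3d = fig.add_subplot(1, 2, 1, projection="3d")
ax3d.scatter(recency, frequency, monetary, c=labels, cmap="viridis", s=15)
ax3d.set_xlabel("Recency")
ax3d.set_ylabel("Frequency")
ax3d.set_zlabel("Monetary")
ax3d.set_title("Customers by cluster")

# Violin plot: distribution of Monetary within each cluster.
ax_violin = fig.add_subplot(1, 2, 2)
groups = [monetary[labels == c] for c in range(3)]
ax_violin.violinplot(groups, positions=[0, 1, 2], showmedians=True)
ax_violin.set_xticks([0, 1, 2])
ax_violin.set_xticklabels(["Cluster 0", "Cluster 1", "Cluster 2"])
ax_violin.set_ylabel("Monetary")
ax_violin.set_title("Monetary distribution per cluster")

fig.savefig("rfm_clusters.png", dpi=100)
```

In a Jupyter notebook the `matplotlib.use("Agg")` line can be dropped and `plt.show()` used in place of `fig.savefig`.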
- Python 3.7+
- Required Libraries:
pip install -r requirements.txt
Download the dataset from the UCI Machine Learning Repository and place it in the data/ folder.
- Clone this repository:
git clone https://github.com/yourusername/kmeans-clustering.git
cd kmeans-clustering
- Install dependencies:
pip install -r requirements.txt
- Open the Jupyter notebook:
jupyter notebook notebooks/online_retail_data_clustering.ipynb
- Run all cells to see the analysis.
|-- data/
| |-- OnlineRetail_2009-2010.xlsx
|-- notebooks/
| |-- online_retail_data_clustering.ipynb
|-- requirements.txt
|-- README.md
- Handling large datasets with over 500,000 records.
- Dealing with data inconsistencies, such as cancellations and invalid values.
- Optimizing cluster selection to balance granularity and interpretability.
- Extend analysis to include geolocation data (e.g., customer countries).
- Automate cluster labeling using advanced NLP techniques.
- Deploy clustering results to a dashboard for real-time business insights.
- Inspired by Trent's tutorial on clustering techniques.
- Dataset from UCI Machine Learning Repository.
This project is licensed under the MIT License.
For any questions or suggestions, feel free to contact me via GitHub or open an issue!