This project builds a Machine Learning pipeline to predict used car prices based on vehicle features. Using models like XGBoost, Random Forest, Ridge, and Lasso regression, the pipeline optimizes predictive performance across MAE, RMSE, and R² metrics. The dataset was sourced from Kaggle and preprocessed with robust techniques, including multivariate imputation and scaling.
- Objective: Predict used car prices using machine learning models.
- Dataset: Sourced from Kaggle, containing 4009 rows and 10 features.
- Methods:
- Preprocessing: Categorical encoding (OneHotEncoder), XGB-based imputation, StandardScaler.
- Model Evaluation: Looping over 5 random states, cross-validation, and GridSearch for hyperparameter tuning.
- Feature Importance: Global (permutation, gain, weight) and local (SHAP force plots) interpretability.
- Top Model: XGBoost Regressor
- Best Metrics:
- MAE: 0.0854
- RMSE: 0.1107
- R²: 0.9041
The project was developed using the following tools:
- Python: 3.12.5
- Key Packages:
pandas==2.2.2
numpy==1.26.4
scikit-learn==1.5.1
py-xgboost==2.1.1
shap==0.45.1
matplotlib==3.9.2
plotly==5.23.0
For easy setup, install the dependencies via the provided environment.yaml file.
-
Clone this repository:
git clone git@github.com:lshiyu4210/used_car_price_prediction.git cd used_car_price_prediction
-
Create a virtual environment:
conda env create -f environment_updated.yaml conda activate used_car
The updated environment file is provided as environment.yaml.
- Dataset: Used Car Price Prediction Dataset
- References:
- Valarmathi et al., 2023
- Yılmaz & Selvi, 2023