The Crazy Taxi company has collected historical data on taxi orders at airports. The goal of this project is to predict the number of taxi orders for the upcoming hour so that more drivers can be attracted during peak hours. Model performance is measured by the root mean squared error (RMSE) on the test set, which should not exceed 48.
- Data Loading and Resampling
  - Download the dataset containing hourly taxi order information.
  - Resample the original data so that each data point represents a one-hour interval.
- Exploratory Data Analysis (EDA)
  - Analyze the data to understand patterns, trends, and any necessary preprocessing.
  - Check for missing values, outliers, and the overall structure of the dataset.
- Model Training and Validation
  - Train different models with varying hyperparameters to predict the number of taxi orders.
  - Use a test set comprising 10% of the initial dataset for evaluation.
  - Evaluate model performance based on the RMSE metric.
- Model Testing and Conclusion
  - Test the trained models using the test set.
  - Compare the models' RMSE values to identify the best-performing model.
  - Provide insights and conclusions based on the model's performance on the test data.
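The 10% test set in the steps above can be held out as sketched below. This is a minimal illustration, not the notebook's actual code: the synthetic series and the variable names are assumptions, and `shuffle=False` is used on the assumption that the test set should be the most recent 10% of the timeline, which is the usual choice for forecasting.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic hourly series standing in for the real data (assumption).
index = pd.date_range("2018-03-01", periods=1000, freq="h")
rng = np.random.default_rng(0)
data = pd.DataFrame({"num_orders": rng.poisson(80, size=len(index))}, index=index)

# shuffle=False preserves chronological order, so the last 10% of rows
# becomes the test set and no future information leaks into training.
train, test = train_test_split(data, test_size=0.1, shuffle=False)

print("train ends:", train.index.max())
print("test starts:", test.index.min())
```

Keeping the split chronological means the reported test RMSE reflects how the model would behave on hours it has genuinely never seen.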
- Python Libraries: `pandas`, `scikit-learn`
- Time Series Resampling: Resampling the data to hourly intervals for time-based analysis.
- Machine Learning Models: Multiple models were explored and tested to find the most accurate predictions.
- Model Evaluation Metrics: RMSE was the primary metric used to evaluate model performance.
1. Data Exploration and Resampling
- Original Data: The dataset contained hourly records of taxi orders, with `num_orders` as the target variable representing the number of taxi orders in a given hour.
- Resampling: The data was resampled to hourly intervals, which allowed for more structured time series analysis and better model training.
- Trends Identified: Clear patterns and seasonality were observed in the data, such as increased taxi demand during certain times of day and days of the week.
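The resampling and seasonality check described above might look roughly like the following sketch. The 10-minute granularity of the raw records, the synthetic values, and every variable name other than `num_orders` are assumptions made for illustration; the README does not state them.

```python
import numpy as np
import pandas as pd

# Synthetic raw records at 10-minute granularity (assumption) covering 14 days.
index = pd.date_range("2018-03-01", periods=6 * 24 * 14, freq="10min")
rng = np.random.default_rng(42)
raw = pd.DataFrame({"num_orders": rng.poisson(12, size=len(index))}, index=index)

# Resampling requires a sorted DatetimeIndex; summing gives orders per hour.
raw = raw.sort_index()
hourly = raw.resample("1h").sum()

# Average demand by hour of day and by day of week exposes seasonality.
by_hour = hourly.groupby(hourly.index.hour)["num_orders"].mean()
by_weekday = hourly.groupby(hourly.index.dayofweek)["num_orders"].mean()

print(hourly.head())
print(by_hour.idxmax(), by_weekday.idxmax())
```

Summing (rather than averaging) within each hour keeps `num_orders` interpretable as a count of orders per hour.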
2. Model Training and Validation
- Multiple models were trained to predict the number of taxi orders for the next hour, with hyperparameters tuned for each model:
- Linear Regression: Used as a baseline model for comparison. While it captured some general trends, its RMSE was higher than that of the other models, indicating room for improvement.
- Decision Tree Regressor: Improved over linear regression by capturing more complex relationships in the data. However, it was prone to overfitting.
- Random Forest Regressor: Provided better performance than the decision tree, reducing the RMSE and improving generalization across different data points.
- Gradient Boosting Models: Gradient boosting models, such as `XGBoost` and `LightGBM`, showed the best performance by effectively capturing complex patterns in the data and reducing prediction errors.
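A comparison along the lines of the list above could be sketched as follows. This is an illustration under stated assumptions: the data is synthetic, the lag and calendar features are hypothetical choices, and scikit-learn's `GradientBoostingRegressor` stands in for XGBoost and LightGBM so the example needs only the libraries listed in this README. The printed numbers say nothing about the project's real results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic hourly demand with a daily cycle plus noise (assumption).
index = pd.date_range("2018-03-01", periods=24 * 60, freq="h")
rng = np.random.default_rng(0)
hours = index.hour.to_numpy()
orders = 80 + 30 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 10, len(index))
df = pd.DataFrame({"num_orders": orders}, index=index)

# Hypothetical features: calendar fields and two lags of the target.
df["hour"] = df.index.hour
df["dayofweek"] = df.index.dayofweek
df["lag_1"] = df["num_orders"].shift(1)
df["lag_24"] = df["num_orders"].shift(24)
df = df.dropna()

# Chronological 90/10 split.
split = int(len(df) * 0.9)
train, test = df.iloc[:split], df.iloc[split:]
X_train, y_train = train.drop(columns="num_orders"), train["num_orders"]
X_test, y_test = test.drop(columns="num_orders"), test["num_orders"]

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=6, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

# Fit each model and record its RMSE on the held-out rows.
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))

for name, value in sorted(rmse.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.2f}")
```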
3. Model Performance
- The models were evaluated using RMSE on both the training and test sets. The goal was to keep the test RMSE below 48.
- Best Performing Model: The gradient boosting model (e.g., `XGBoost`) achieved an RMSE of around 45, making it the best choice for predicting taxi demand.
- Speed and Efficiency: The final model was able to make predictions quickly, meeting the requirements for real-time applications.
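For reference, the RMSE used throughout is the standard definition, where $y_i$ is the actual number of orders in hour $i$, $\hat{y}_i$ is the predicted number, and $n$ is the number of hours evaluated:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

As an illustrative example with made-up numbers, a model that misses by 40 orders in one hour and by 50 in another has $\mathrm{RMSE} = \sqrt{(40^2 + 50^2)/2} = \sqrt{2050} \approx 45.28$, which would sit under the threshold of 48. Because errors are squared before averaging, a few large misses during peak hours raise the RMSE more than many small ones.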
4. Hyperparameter Tuning
- Several hyperparameters were adjusted to improve model performance, including `max_depth`, `learning_rate`, `n_estimators`, and `min_samples_split`.
- Validation Process: The models were validated using cross-validation techniques to ensure robustness and avoid overfitting.
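One way to combine the tuning and validation described above is a grid search with time-ordered folds, sketched below. The README does not say which cross-validation scheme was used, so `TimeSeriesSplit` is an assumption, as are the synthetic data, the grid values, and the use of scikit-learn's `GradientBoostingRegressor` in place of XGBoost.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic hourly data with a daily cycle (assumption).
index = pd.date_range("2018-03-01", periods=24 * 30, freq="h")
rng = np.random.default_rng(1)
hours = index.hour.to_numpy()
y = 80 + 30 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 10, len(index))
X = pd.DataFrame(
    {"hour": hours, "dayofweek": index.dayofweek.to_numpy()}, index=index
)

# Illustrative grid over three of the hyperparameters named above.
param_grid = {
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [50, 100],
}

# TimeSeriesSplit always validates on rows that come after the training rows.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

# The scorer is negated RMSE, so flip the sign to report a positive error.
best_rmse = -search.best_score_
print(search.best_params_, round(best_rmse, 2))
```

Ordinary shuffled k-fold would let the model train on hours that come after the ones it is validated on, which tends to make validation scores look better than they are for forecasting.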
5. Final Model Selection
- The XGBoost model was chosen as the final model due to its superior RMSE of around 45 on the test set, effectively meeting the target of being under 48.
- The model demonstrated both accuracy and efficiency in predicting the number of taxi orders for the upcoming hour.
Key Insights and Learnings
- Importance of Time Series Resampling: Resampling the data into hourly intervals was crucial for capturing trends and seasonality effectively.
- Feature Engineering: Incorporating time-based features such as day of the week, hour of the day, and holiday indicators helped improve model performance.
- Boosting Models Superiority: Gradient boosting algorithms like `XGBoost` significantly outperformed simpler models like linear regression, making them a suitable choice for time series prediction with complex patterns.
- Time Series Data Processing: Gained experience in resampling time series data to create appropriate intervals for analysis.
- Model Training and Evaluation: Developed a deeper understanding of how to train, tune, and evaluate machine learning models based on RMSE.
- Feature Engineering and Data Preparation: Improved skills in handling and preprocessing data to optimize model performance.
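The time-based features mentioned under Key Insights (hour of the day, day of the week, holiday indicators) could be built with a helper like the one below. The function name `make_features`, its parameters, the lag and rolling-mean columns, and the example holiday date are all hypothetical; the notebook's actual feature set may differ.

```python
import pandas as pd


def make_features(series, max_lag=3, rolling_window=24, holidays=()):
    """Build calendar, lag, and rolling features from an hourly series."""
    features = pd.DataFrame({"num_orders": series})

    # Calendar features taken directly from the DatetimeIndex.
    features["hour"] = features.index.hour
    features["dayofweek"] = features.index.dayofweek

    # Holiday flag: 1 if the row's date is in the supplied list of dates.
    holiday_dates = pd.to_datetime(list(holidays))
    on_holiday = features.index.normalize().isin(holiday_dates)
    features["is_holiday"] = on_holiday.astype(int)

    # Lag features: the target value 1..max_lag hours earlier.
    for lag in range(1, max_lag + 1):
        features[f"lag_{lag}"] = series.shift(lag)

    # Rolling mean over past hours only; shift(1) excludes the current hour
    # so the feature never contains the value being predicted.
    features["rolling_mean"] = series.shift(1).rolling(rolling_window).mean()

    # Drop the leading rows where lags or the rolling mean are undefined.
    return features.dropna()


index = pd.date_range("2018-03-07", periods=100, freq="h")
series = pd.Series(range(100), index=index, dtype="float64")
feats = make_features(series, max_lag=2, rolling_window=3, holidays=["2018-03-08"])
print(feats.head())
```

Shifting before taking the rolling mean is the detail that matters most here: without it, each row's feature would include its own target value.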
- Clone the repository: `git clone https://github.com/realdanizilla/Crazy-Taxi.git`
- Install dependencies: navigate to the repository directory and install the required Python packages with `cd Crazy-Taxi` followed by `pip install -r requirements.txt`.
- Run the Jupyter notebook: execute the `.ipynb` notebook to see the data preprocessing, model training, and demand forecasting steps.
- Experiment and analyze: adjust models, parameters, and data in the notebook to improve taxi demand predictions.