Skip to content

Latest commit

 

History

History
104 lines (75 loc) · 6.44 KB

README.MD

File metadata and controls

104 lines (75 loc) · 6.44 KB

Predicting Taxi Demand for Rush Hours

Project Objective

Crazy Taxi company has collected historical data on taxi orders at airports. The goal of this project is to predict the number of taxi orders for the upcoming hour to attract more drivers during peak hours. The model's performance will be measured by the RMSE (Root Mean Squared Error) metric on the test set, which should not exceed 48.

back to top

Project Structure and Steps

  1. Data Loading and Resampling

    • Download the dataset containing hourly taxi order information.
    • Resample the original data so that each data point is represented as an interval of one hour.
  2. Exploratory Data Analysis (EDA)

    • Analyze the data to understand patterns, trends, and any necessary preprocessing.
    • Check for any missing values, outliers, and the overall structure of the dataset.
  3. Model Training and Validation

    • Train different models with varying hyperparameters to predict the number of taxi orders.
    • Use a test set comprising 10% of the initial dataset for evaluation.
    • Evaluate model performance based on the RMSE metric.
  4. Model Testing and Conclusion

    • Test the trained models using the test set.
    • Compare the models' RMSE values to identify the best-performing model.
    • Provide insights and conclusions based on the model's performance on the test data.

back to top

Tools and Techniques Utilized

  • Python Libraries: pandas, scikit-learn
  • Time Series Resampling: Resampling the data to hourly intervals for time-based analysis.
  • Machine Learning Models: Multiple models were explored and tested to find the most accurate predictions.
  • Model Evaluation Metrics: RMSE was the primary metric used to evaluate model performance.

back to top

Specific Results and Outcomes

1. Data Exploration and Resampling

  • Original Data: The dataset contained hourly records of taxi orders, with num_orders as the target variable representing the number of taxi orders in a given hour.
  • Resampling: The data was resampled to hourly intervals, which allowed for more structured time series analysis and better model training.
  • Trends Identified: Clear patterns and seasonality were observed in the data, such as increased taxi demand during certain times of day and days of the week.

2. Model Training and Validation

  • Multiple models were trained to predict the number of taxi orders for the next hour, with hyperparameters tuned for each model:
    • Linear Regression: Used as a baseline model for comparison. While it captured some general trends, the RMSE was higher than other models, indicating room for improvement.
    • Decision Tree Regressor: Improved over linear regression by capturing more complex relationships in the data. However, it was prone to overfitting.
    • Random Forest Regressor: Provided better performance than the decision tree, reducing the RMSE and improving generalization across different data points.
    • Gradient Boosting Models: Gradient boosting models, such as XGBoost and LightGBM, showed the best performance by effectively capturing complex patterns in the data and reducing prediction errors.

3. Model Performance

  • The models were evaluated using RMSE on both the training and test sets. The goal was to keep the test RMSE below 48.
  • Best Performing Model: The gradient boosting model (e.g., XGBoost) achieved an RMSE of around 45, making it the best choice for predicting taxi demand.
  • Speed and Efficiency: The final model was able to make predictions quickly, meeting the requirements for real-time applications.

4. Hyperparameter Tuning

  • Several hyperparameters were adjusted to improve model performance, including max_depth, learning_rate, n_estimators, and min_samples_split.
  • Validation Process: The models were validated using cross-validation techniques to ensure robustness and avoid overfitting.

5. Final Model Selection

  • The XGBoost model was chosen as the final model due to its superior RMSE of around 45 on the test set, effectively meeting the target of being under 48.
  • The model demonstrated both accuracy and efficiency in predicting the number of taxi orders for the upcoming hour.

Key Insights and Learnings

  • Importance of Time Series Resampling: Resampling the data into hourly intervals was crucial for capturing trends and seasonality effectively.
  • Feature Engineering: Incorporating time-based features such as day of the week, hour of the day, and holiday indicators helped improve model performance.
  • Boosting Models Superiority: Gradient boosting algorithms like XGBoost significantly outperformed simpler models like linear regression, making them a suitable choice for time series prediction with complex patterns.

back to top

What I Have Learned From This Project

  • Time Series Data Processing: Gained experience in resampling time series data to create appropriate intervals for analysis.
  • Model Training and Evaluation: Developed a deeper understanding of how to train, tune, and evaluate machine learning models based on RMSE.
  • Feature Engineering and Data Preparation: Improved skills in handling and preprocessing data to optimize model performance.

back to top

How to use this repository

  1. Clone the repository
    git clone https://github.com/realdanizilla/Crazy-Taxi.git

  2. Install Dependencies:
    Navigate to the repository directory and install required Python packages.
    (cd Crazy-Taxi pip install -r requirements.txt)

  3. Run the Jupyter Notebook
    Execute the .ipynb notebook to see data preprocessing, model training, and demand forecasting steps.

  4. Experiment and Analyze
    Adjust models, parameters, and data in the notebook to improve taxi demand predictions.

back to top