NYC Taxi Trip Duration Prediction

Project Overview

This project predicts the duration of taxi trips in New York City using several regression techniques: Polynomial Linear Regression, Ridge Regression, and Lasso Regression for feature selection. The dataset is the New York City Taxi Trip Duration Dataset, which records each trip's pickup and dropoff locations, times, and other related features.

Project Structure

model_pipeline.py: Main script that runs feature engineering, data preprocessing, model training, evaluation, and prediction.
test.py: Generates predictions and creates the sample submission file.
README.md: Project documentation.
grid_search.pkl: Saved GridSearchCV object for the best model.
model.pkl: Trained model saved using joblib.
submission.csv: Sample submission file.

Dataset

The dataset includes the following key columns:

id: Unique identifier for each trip
vendor_id: ID of the taxi vendor
pickup_datetime: Date and time when the trip started
dropoff_datetime: Date and time when the trip ended
passenger_count: Number of passengers
pickup_longitude: Longitude where the trip started
pickup_latitude: Latitude where the trip started
dropoff_longitude: Longitude where the trip ended
dropoff_latitude: Latitude where the trip ended
store_and_fwd_flag: Flag indicating whether the trip record was held in vehicle memory before being sent to the vendor (Y = store and forward, N = not a store-and-forward trip)
trip_duration: Duration of the trip in seconds (target variable)

Feature Engineering

The following feature engineering techniques were applied to enrich the dataset:

Geographical Boundaries Filtering: Trips were filtered to stay within the specified geographical boundaries to remove erroneous data.
Clustering: KMeans clustering was applied to the pickup and dropoff coordinates to create cluster labels.
Trip Distance Calculation: The Haversine formula was used to calculate the great circle distance between pickup and dropoff points.
Bearing Calculation: The bearing between pickup and dropoff locations was calculated.
Datetime Features Extraction: Features like month, day of month, weekday, hour of day, and whether the trip occurred on a weekend or during rush hour were extracted.
Average Hourly Speed: An average hourly speed was mapped to the trips based on the hour of the day.
Distance to Points of Interest: Distances to the city center, JFK Airport, and LaGuardia Airport were calculated.
Manhattan Distance: The Manhattan distance between pickup and dropoff locations was calculated.
Interaction Features: Interaction features such as the product of trip distance and average hourly speed were created.
Time Features: Features like minute of the day and whether the trip happened during rush hour were added.
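
The distance and bearing features above can be sketched as follows. This is a minimal illustration, not the project's own code: the function names, the Earth-radius constant, and the choice to build the Manhattan distance from two Haversine legs are assumptions.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres; inputs are in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2, in [0, 360)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlon = lon2 - lon1
    y = np.sin(dlon) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(dlon)
    return (np.degrees(np.arctan2(y, x)) + 360) % 360


def manhattan_km(lat1, lon1, lat2, lon2):
    """Manhattan-style distance: a north-south leg plus an east-west leg."""
    north_south = haversine_km(lat1, lon1, lat2, lon1)
    east_west = haversine_km(lat2, lon1, lat2, lon2)
    return north_south + east_west
```

All three functions accept scalars or NumPy arrays, so they can be applied to whole DataFrame columns at once, for example `haversine_km(df["pickup_latitude"], df["pickup_longitude"], df["dropoff_latitude"], df["dropoff_longitude"])`.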

Model Training

A Ridge Regression model with polynomial features was used for the final prediction. GridSearchCV was employed to tune the hyperparameters of the model.
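
A setup of this kind might look like the sketch below. The pipeline steps, the parameter grid, the scoring metric, and the synthetic data are all assumptions chosen for illustration; the grid actually searched in model_pipeline.py may differ.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def build_search(cv=3):
    """Ridge on polynomial features, tuned with a small (assumed) grid."""
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("poly", PolynomialFeatures(include_bias=False)),
        ("ridge", Ridge()),
    ])
    param_grid = {
        "poly__degree": [1, 2],
        "ridge__alpha": [0.1, 1.0, 10.0],
    }
    return GridSearchCV(
        pipeline,
        param_grid,
        cv=cv,
        scoring="neg_root_mean_squared_error",
    )


# Synthetic stand-in data with one linear and one quadratic effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

search = build_search().fit(X, y)
print(search.best_params_)
```

Putting the scaler and the polynomial expansion inside the pipeline means they are refit on each cross-validation fold, so no information from the held-out fold leaks into preprocessing.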

Training

The model achieved the following score on the training data:

R² Score on Training Data: 0.6929

Validation

The model was evaluated on the validation set using the following metrics:

Validation RMSE: 268.05
Validation MAE: 203.69
Validation R² Score: 0.6623
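
For reference, the three metrics can be computed directly with NumPy as in the sketch below. The function name and the toy values are illustrative assumptions, not taken from the project.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R² for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    mae = float(np.mean(np.abs(residuals)))
    ss_res = float(np.sum(residuals ** 2))                 # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "mae": mae, "r2": r2}


# Toy durations in seconds (made-up values).
metrics = regression_metrics([300, 600, 900], [330, 570, 900])
print(metrics)
```

RMSE is never smaller than MAE because squaring weights large errors more heavily, so a wide gap between the two usually points to a minority of trips with large prediction errors.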

Code Structure

The main functions and their purposes are outlined below:

Data Loading and Preprocessing
- load_data(file_path): Loads data from a CSV file.
- check_missing_data(train, validation): Checks for missing values in the train and validation data.
- preprocess_data(df, xlim, ylim, hour_to_speed, isTest=False): Integrates all preprocessing steps including feature engineering.
Feature Engineering
- add_average_hourly_speed(df, hour_to_speed): Adds average hourly speed to the dataframe.
- filter_geographical_boundaries(df, xlim, ylim): Filters data within specified geographical boundaries.
- apply_clustering(df, n_clusters=6): Applies KMeans clustering to pickup and dropoff coordinates.
- calculate_trip_distance(df): Adds trip distance and bearing to the dataframe.
- extract_datetime_features(df): Extracts features from datetime columns.
- calculate_distance_to_center(df, center_coordinates): Calculates the distance to city center.
- calculate_distance_to_airport(df, airport_coordinates, column_name): Calculates the distance to an airport.
- remove_outliers(df, column): Removes outliers from a specified column using the IQR method.
- add_time_features(df): Adds granular time features.
- add_manhattan_distance(df): Calculates the Manhattan distance between pickup and dropoff coordinates.
- add_interaction_features(df): Adds interaction features.
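
As a rough illustration of two of these helpers, the sketch below shows IQR-based outlier removal and datetime feature extraction with pandas. The function names, the output column names, the 1.5 multiplier, and the rush-hour windows are assumptions; the project's own definitions may differ.

```python
import pandas as pd


def remove_outliers_iqr(df, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]


def extract_datetime_features_sketch(df, column="pickup_datetime"):
    """Add calendar and time-of-day features derived from one datetime column."""
    out = df.copy()
    ts = pd.to_datetime(out[column])
    out["month"] = ts.dt.month
    out["day_of_month"] = ts.dt.day
    out["weekday"] = ts.dt.weekday  # Monday = 0, Sunday = 6
    out["hour"] = ts.dt.hour
    out["minute_of_day"] = ts.dt.hour * 60 + ts.dt.minute
    out["is_weekend"] = (ts.dt.weekday >= 5).astype(int)
    # Assumed rush-hour windows: weekdays 07:00-09:59 and 16:00-19:59.
    rush_hours = [7, 8, 9, 16, 17, 18, 19]
    out["is_rush_hour"] = (ts.dt.hour.isin(rush_hours) & (ts.dt.weekday < 5)).astype(int)
    return out
```

Both functions return a new DataFrame rather than modifying the input, which keeps each preprocessing step easy to test on its own.
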
Modeling and Evaluation
- get_important_features(df, target_column, alpha=0.1): Extracts important features using Lasso regression.
- train_model(X_train, y_train): Trains the Ridge regression model with GridSearchCV.
- evaluate_model(model, X, y): Evaluates the model and prints cross-validated RMSE, MAE, and R² score.
- save_model(model, filename): Saves the trained model to a file.
- load_model(filename): Loads a model from a file.
- predict_and_save_submission(model, test_features, test_ids, filename): Predicts test data and saves the submission file.
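
Lasso-based feature selection of the kind get_important_features performs can be sketched as below. This is one plausible reading, assuming features are standardized first and a feature counts as important when its Lasso coefficient is non-zero; the function name and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def select_features_with_lasso(df, target_column, alpha=0.1):
    """Return the names of features whose Lasso coefficient is non-zero."""
    X = df.drop(columns=[target_column])
    y = df[target_column]
    # Standardize so the L1 penalty treats all features on the same scale.
    X_scaled = StandardScaler().fit_transform(X)
    lasso = Lasso(alpha=alpha).fit(X_scaled, y)
    return [name for name, coef in zip(X.columns, lasso.coef_) if coef != 0]


# Synthetic example: the target depends on "signal" only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
demo["target"] = 5.0 * demo["signal"] + rng.normal(scale=0.1, size=500)

selected = select_features_with_lasso(demo, "target", alpha=0.1)
print(selected)
```

A larger alpha drives more coefficients to exactly zero and therefore keeps fewer features; a smaller alpha keeps more.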

Instructions to Run the Code

Install Dependencies: Make sure the required libraries are installed. You can install them with:
```
pip install pandas numpy scikit-learn joblib
```
Load Data: Use the load_data function to load your dataset.
Preprocess Data: Apply the preprocess_data function to your dataset.
Train the Model: Use the train_model function to train the Ridge regression model.
Evaluate the Model: Evaluate the trained model using the evaluate_model function.
Save the Model: Save the trained model using the save_model function.
Load the Model: Load the saved model using the load_model function.
Predict and Save Submission: Use the predict_and_save_submission function to generate predictions on the test set and save the results.
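
The save, load, and submission steps can be sketched end to end as follows. The helper names here are stand-ins rather than the project's functions, the model is fit on synthetic data, and the submission columns (id, trip_duration) are an assumption about the expected file format.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge


def save_and_reload(model, path):
    """Persist a fitted model with joblib and load it back."""
    joblib.dump(model, path)
    return joblib.load(path)


def write_submission(model, test_features, test_ids, path):
    """Predict on test features and write an id/trip_duration CSV."""
    predictions = model.predict(test_features)
    submission = pd.DataFrame({"id": test_ids, "trip_duration": predictions})
    submission.to_csv(path, index=False)
    return submission


# Synthetic stand-in for the preprocessed training data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))
y_train = 600 + 100 * X_train[:, 0]
model = Ridge(alpha=1.0).fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp:
    reloaded = save_and_reload(model, Path(tmp) / "model.pkl")
    write_submission(reloaded, X_train[:3], ["id_a", "id_b", "id_c"], Path(tmp) / "submission.csv")
    written = pd.read_csv(Path(tmp) / "submission.csv")

print(written)
```

Test data must go through the same preprocessing as the training data before prediction, otherwise the feature columns will not match what the model was fit on.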

Results

The final model achieved the following performance on the validation set:

Validation RMSE: 268.05
Validation MAE: 203.69
Validation R² Score: 0.6423

The project demonstrates a comprehensive approach to feature engineering and model training for predicting taxi trip durations in New York City.

References

NYC Taxi Trip Duration Dataset
