This project predicts the duration of taxi trips in New York City using several regression techniques: Polynomial Linear Regression, Ridge Regression, and Lasso Regression (the last for feature selection). The dataset is the New York City Taxi Trip Duration Dataset, which contains detailed trip records including pickup and dropoff locations, times, and other related features.
The repository contains the following files:
- `model_pipeline.py`: Main script that runs feature engineering, data preprocessing, model training, evaluation, and prediction.
- `test.py`: Makes predictions and creates the sample submission file.
- `README.md`: Project documentation.
- `grid_search.pkl`: Saved GridSearchCV object for the best model.
- `model.pkl`: Trained model saved using joblib.
- `submission.csv`: Sample submission file.
The dataset includes the following key columns:
- `id`: Unique identifier for each trip
- `vendor_id`: ID of the taxi vendor
- `pickup_datetime`: Date and time when the trip started
- `dropoff_datetime`: Date and time when the trip ended
- `passenger_count`: Number of passengers
- `pickup_longitude`: Longitude where the trip started
- `pickup_latitude`: Latitude where the trip started
- `dropoff_longitude`: Longitude where the trip ended
- `dropoff_latitude`: Latitude where the trip ended
- `store_and_fwd_flag`: Flag indicating whether the trip record was held in vehicle memory before being sent to the vendor (Y = store and forward, N = not a store-and-forward trip)
- `trip_duration`: Duration of the trip in seconds (target variable)
The following feature engineering techniques were applied to enrich the dataset:
- Geographical Boundaries Filtering: Trips were filtered to stay within the specified geographical boundaries to remove erroneous data.
- Clustering: KMeans clustering was applied to the pickup and dropoff coordinates to create cluster labels.
- Trip Distance Calculation: The Haversine formula was used to calculate the great circle distance between pickup and dropoff points.
- Bearing Calculation: The bearing between pickup and dropoff locations was calculated.
- Datetime Features Extraction: Features like month, day of month, weekday, hour of day, and whether the trip occurred on a weekend or during rush hour were extracted.
- Average Hourly Speed: An average hourly speed was mapped to the trips based on the hour of the day.
- Distance to Points of Interest: Distances to the city center, JFK Airport, and LaGuardia Airport were calculated.
- Manhattan Distance: The Manhattan distance between pickup and dropoff locations was calculated.
- Interaction Features: Interaction features such as the product of trip distance and average hourly speed were created.
- Time Features: Features like minute of the day and whether the trip happened during rush hour were added.
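As a rough illustration of the distance and bearing features listed above, here is a minimal sketch using only the standard library. The function names, the Earth radius constant, and the way the Manhattan distance is approximated (one north-south leg plus one east-west leg, each measured with the Haversine formula) are assumptions for illustration and may differ from what `model_pipeline.py` does.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2, in degrees within [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    y = math.sin(dlam) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlam)
    return math.degrees(math.atan2(y, x)) % 360.0


def manhattan_km(lat1, lon1, lat2, lon2):
    """Approximate grid distance: a north-south leg plus an east-west leg."""
    north_south = haversine_km(lat1, lon1, lat2, lon1)
    east_west = haversine_km(lat1, lon1, lat1, lon2)
    return north_south + east_west


# Example with two made-up points inside New York City
print(haversine_km(40.75, -73.99, 40.77, -73.87))
print(bearing_deg(40.75, -73.99, 40.77, -73.87))
print(manhattan_km(40.75, -73.99, 40.77, -73.87))
```

The same helpers could be reused for the points-of-interest features by fixing one end of the pair to the coordinates of the city center or an airport.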
A Ridge Regression model with polynomial features was used for the final prediction. GridSearchCV was employed to tune the hyperparameters of the model.
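One way to combine polynomial features, Ridge regression, and GridSearchCV is a scikit-learn pipeline, sketched below on synthetic data. The scaling step, the step names, the parameter grid, the number of folds, and the scoring metric are all assumptions chosen for illustration; the actual grid used in `model_pipeline.py` is not documented here.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the engineered features and target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Scale, expand to polynomial terms, then fit a Ridge model
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])

# Hypothetical search grid: polynomial degree and regularization strength
param_grid = {
    "poly__degree": [1, 2],
    "ridge__alpha": [0.1, 1.0, 10.0],
}

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid_search.fit(X, y)

best_model = grid_search.best_estimator_
print(grid_search.best_params_)
```

Keeping the polynomial expansion inside the pipeline means it is refit on each cross-validation fold, so the degree can be tuned together with `alpha`.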
The model achieved the following score on the training data:
- R² Score on Training Data: 0.6929
The model was evaluated on the validation set using the following metrics:
- Validation RMSE: 268.05
- Validation MAE: 203.69
- Validation R² Score: 0.6623
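For reference, the three metrics can be computed as in the minimal sketch below. The function name and the sample numbers are made up for illustration and are not taken from the project; RMSE and MAE come out in the same units as the target.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R² for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Made-up durations and predictions
durations = [300.0, 600.0, 900.0, 1200.0]
predicted = [330.0, 570.0, 960.0, 1140.0]
metrics = regression_metrics(durations, predicted)
print(metrics)
```

RMSE penalizes large errors more heavily than MAE, which is why it is the larger of the two whenever the errors are not all the same size.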
The main functions and their purposes are outlined below:
- Data Loading and Preprocessing
  - `load_data(file_path)`: Loads data from a CSV file.
  - `check_missing_data(train, validation)`: Checks for missing values in the train and validation data.
  - `preprocess_data(df, xlim, ylim, hour_to_speed, isTest=False)`: Integrates all preprocessing steps including feature engineering.
- Feature Engineering
  - `add_average_hourly_speed(df, hour_to_speed)`: Adds average hourly speed to the dataframe.
  - `filter_geographical_boundaries(df, xlim, ylim)`: Filters data within specified geographical boundaries.
  - `apply_clustering(df, n_clusters=6)`: Applies KMeans clustering to pickup and dropoff coordinates.
  - `calculate_trip_distance(df)`: Adds trip distance and bearing to the dataframe.
  - `extract_datetime_features(df)`: Extracts features from datetime columns.
  - `calculate_distance_to_center(df, center_coordinates)`: Calculates the distance to city center.
  - `calculate_distance_to_airport(df, airport_coordinates, column_name)`: Calculates the distance to an airport.
  - `remove_outliers(df, column)`: Removes outliers from a specified column using the IQR method.
  - `add_time_features(df)`: Adds granular time features.
  - `add_manhattan_distance(df)`: Calculates the Manhattan distance between pickup and dropoff coordinates.
  - `add_interaction_features(df)`: Adds interaction features.
- Modeling and Evaluation
  - `get_important_features(df, target_column, alpha=0.1)`: Extracts important features using Lasso regression.
  - `train_model(X_train, y_train)`: Trains the Ridge regression model with GridSearchCV.
  - `evaluate_model(model, X, y)`: Evaluates the model and prints cross-validated RMSE, MAE, and R² score.
  - `save_model(model, filename)`: Saves the trained model to a file.
  - `load_model(filename)`: Loads a model from a file.
  - `predict_and_save_submission(model, test_features, test_ids, filename)`: Predicts test data and saves the submission file.
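To illustrate the Lasso-based feature selection step, the sketch below keeps the columns whose Lasso coefficient is non-zero. The function name `select_features_with_lasso`, the standardization step, and the synthetic data are assumptions; the project's `get_important_features` may apply a different rule.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def select_features_with_lasso(df, target_column, alpha=0.1):
    """Return the names of features whose Lasso coefficient is non-zero."""
    X = df.drop(columns=[target_column])
    y = df[target_column]
    # Standardize so the L1 penalty treats all features on the same scale
    X_scaled = StandardScaler().fit_transform(X)
    lasso = Lasso(alpha=alpha).fit(X_scaled, y)
    return [name for name, coef in zip(X.columns, lasso.coef_) if coef != 0.0]


# Synthetic data: one informative feature and two pure-noise features
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "trip_distance": rng.uniform(0.5, 20.0, size=500),
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})
demo["trip_duration"] = 180.0 * demo["trip_distance"] + rng.normal(scale=2.0, size=500)

selected = select_features_with_lasso(demo, "trip_duration", alpha=1.0)
print(selected)
```

The L1 penalty drives the coefficients of uninformative features to exactly zero, so a larger `alpha` yields a shorter list of selected features.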
- Install Dependencies: Make sure you have the required libraries installed. You can install them using `pip install pandas numpy scikit-learn joblib`.
- Load Data: Use the `load_data` function to load your dataset.
- Preprocess Data: Apply the `preprocess_data` function to your dataset.
- Train the Model: Use the `train_model` function to train the Ridge regression model.
- Evaluate the Model: Evaluate the trained model using the `evaluate_model` function.
- Save the Model: Save the trained model using the `save_model` function.
- Load the Model: Load the saved model using the `load_model` function.
- Predict and Save Submission: Use the `predict_and_save_submission` function to generate predictions on the test set and save the results.
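The save and load steps rely on joblib. The sketch below shows a round trip with a small Ridge model fitted on synthetic data; the temporary directory, the data, and the `alpha` value are assumptions for illustration, and the project's `save_model` and `load_model` wrappers may do more than this.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Fit a small model on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -1.0])
model = Ridge(alpha=1.0).fit(X, y)

# Persist the model, then load it back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same_predictions = bool(np.allclose(model.predict(X), restored.predict(X)))
print(same_predictions)
```

A model saved this way should be loaded with a scikit-learn version compatible with the one used for training.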
The final model achieved the following performance on the validation set:
- Validation RMSE: 268.05
- Validation MAE: 203.69
- Validation R² Score: 0.6423
The project demonstrates a comprehensive approach to feature engineering and model training for predicting taxi trip durations in New York City.