The goal of this project is to develop a machine learning model that can accurately predict the price of a vehicle based on various features such as make, model, year, fuel type, transmission, mileage, and more. The dataset contains numerical and categorical features, which require preprocessing before training the model.
The dataset used for vehicle price prediction consists of 17 columns with 1002 entries. Below is the detailed description of each column:
| # | Column Name | Non-Null Count | Data Type | Description |
|---|---|---|---|---|
| 0 | name | 1002 non-null | object | Name of the vehicle listing |
| 1 | description | 946 non-null | object | Description of the vehicle |
| 2 | make | 1002 non-null | object | Manufacturer of the vehicle |
| 3 | model | 1002 non-null | object | Model name of the vehicle |
| 4 | year | 1002 non-null | int64 | Manufacturing year of the vehicle |
| 5 | price | 979 non-null | float64 | Price of the vehicle (target variable) |
| 6 | engine | 1000 non-null | object | Engine type of the vehicle |
| 7 | cylinders | 897 non-null | float64 | Number of cylinders in the engine |
| 8 | fuel | 995 non-null | object | Type of fuel used |
| 9 | mileage | 968 non-null | float64 | Mileage of the vehicle (in miles per gallon) |
| 10 | transmission | 1000 non-null | object | Type of transmission (Automatic/Manual) |
| 11 | trim | 1001 non-null | object | Specific trim/version of the vehicle model |
| 12 | body | 999 non-null | object | Body type of the vehicle (SUV, Sedan, etc.) |
| 13 | doors | 995 non-null | float64 | Number of doors in the vehicle |
| 14 | exterior_color | 997 non-null | object | Color of the vehicle's exterior |
| 15 | interior_color | 964 non-null | object | Color of the vehicle's interior |
| 16 | drivetrain | 1002 non-null | object | Type of drivetrain (FWD, AWD, etc.) |
The target variable for our prediction task is `price`, which we aim to predict based on the other vehicle attributes.
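The column summary above can be reproduced directly with pandas. A minimal sketch, assuming the CSV carries the name `dataset.csv` shown in the directory structure below:

```python
import pandas as pd

# Load the vehicle dataset (file name taken from the project's directory structure)
vehicle_data = pd.read_csv('dataset.csv')

# Reproduce the column summary: non-null counts and dtypes for all 17 columns
vehicle_data.info()

# Peek at the first few rows and the target variable
print(vehicle_data.head())
print(vehicle_data['price'].describe())
```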
This project follows a structured approach to ensure efficient data handling and model training. Below is an overview of the key steps:
- Project Setup & Installation
- Exploratory Data Analysis (EDA)
- Feature Engineering & Encoding
- Scaling Numerical Features
- Splitting Data into Train & Test Sets
- Training Various Machine Learning Models
- Hyperparameter Tuning using GridSearchCV
- Evaluating & Comparing Model Performance
The following is the directory structure of the Vehicle Price Prediction project:
```
Vehicle Price Prediction/
|-- .idea/                      # IDE Configuration Files (Optional)
|-- catboost_info/              # CatBoost Model Training Logs
|   |-- learn/                  # Learning Data
|   |-- tmp/                    # Temporary Files
|   |-- catboost_training.json  # CatBoost Training Metadata
|   |-- learn_error.tsv         # CatBoost Learning Error Log
|   |-- time_left.tsv           # Remaining Training Time
|-- dataset.csv                 # Vehicle Dataset
|-- Predict Vehicle Prices.pdf  # Project PDF
|-- price_prediction.ipynb      # Jupyter Notebook with Code & Analysis
|-- README.md                   # Project README File
|-- requirements.txt            # Dependencies Required for the Project
|-- LICENSE                     # License File
```
To run this project locally, follow these steps:

1. Clone the repository:

   ```bash
   git clone https://github.com/your-repo-name.git
   ```

2. Navigate to the project directory:

   ```bash
   cd vehicle-price-prediction
   ```

3. Create and activate a virtual environment (optional but recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate   # On macOS/Linux
   venv\Scripts\activate      # On Windows
   ```

4. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

5. Run the Jupyter Notebook:

   ```bash
   jupyter notebook price_prediction.ipynb
   ```
EDA helps in understanding the dataset by analyzing distributions, relationships between features, and identifying potential data quality issues.
```python
# Count missing values per column and show only the columns that have any
missing_values = vehicle_data.isnull().sum()
print(missing_values[missing_values > 0])
```
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of mileage, with a kernel density estimate overlaid
sns.histplot(vehicle_data['mileage'], kde=True)
plt.title('Mileage Distribution')
plt.show()

# Count of listings per fuel type
sns.countplot(x=vehicle_data['fuel'])
plt.title('Fuel Type Distribution')
plt.show()
```
```python
# Correlation matrix over the numeric columns only
# (numeric_only=True avoids errors on the object-typed columns)
plt.figure(figsize=(10, 6))
sns.heatmap(vehicle_data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()
```
Data preprocessing is a crucial step where we prepare the data for model training by handling missing values, removing outliers, encoding categorical variables, and scaling numerical features.
- Handling Missing Values: Dropped rows with missing values to avoid inconsistencies (a sketch follows this list).
- Outlier Detection & Removal: Used the IQR method to identify and remove extreme values.
- Encoding Categorical Features: Applied Label Encoding to high-cardinality categorical features and One-Hot Encoding to the others (both shown below).
- Scaling Numerical Features: Used StandardScaler to standardize numerical variables to zero mean and unit variance.
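The notebook's exact cleaning cell isn't reproduced here; a minimal sketch of the dropping step, assuming rows with any missing value are removed and the cleaned frame is kept as `refined_data` (the name used in the snippets that follow):

```python
# Drop rows containing any missing value; keep the cleaned frame as refined_data
refined_data = vehicle_data.dropna().reset_index(drop=True)
print(f"Rows before: {len(vehicle_data)}, after dropping missing values: {len(refined_data)}")
```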
```python
def detect_outliers_iqr(df, column):
    """Return the rows of df whose values in `column` fall outside the IQR fences."""
    Q1 = df[column].quantile(0.25)   # 25th percentile
    Q3 = df[column].quantile(0.75)   # 75th percentile
    IQR = Q3 - Q1                    # interquartile range
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] < lower_bound) | (df[column] > upper_bound)]
```
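The function above only flags outliers; a hedged sketch of how the removal step could look, assuming the flagged rows are dropped for the `price` column:

```python
# Identify price outliers with the IQR function and drop them from the working data
price_outliers = detect_outliers_iqr(refined_data, 'price')
refined_data = refined_data.drop(price_outliers.index)
print(f"Removed {len(price_outliers)} price outliers; {len(refined_data)} rows remain")
```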
```python
from sklearn.preprocessing import LabelEncoder

# Use one encoder per column so each mapping can be inverted later if needed
le_make = LabelEncoder()
le_model = LabelEncoder()
refined_data['make'] = le_make.fit_transform(refined_data['make'])
refined_data['model'] = le_model.fit_transform(refined_data['model'])
```
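The preprocessing list above also mentions One-Hot Encoding for the lower-cardinality features; a minimal sketch using `pd.get_dummies`, assuming columns such as `fuel`, `transmission`, `body`, and `drivetrain` are the ones being one-hot encoded (the notebook's actual column set may differ):

```python
# One-hot encode low-cardinality categorical columns (assumed column set)
one_hot_cols = ['fuel', 'transmission', 'body', 'drivetrain']
refined_data = pd.get_dummies(refined_data, columns=one_hot_cols, drop_first=True)
```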
```python
from sklearn.preprocessing import StandardScaler

# Standardize the numerical features to zero mean and unit variance
scaler = StandardScaler()
scaled_features = ['year', 'cylinders', 'mileage', 'doors']
refined_data[scaled_features] = scaler.fit_transform(refined_data[scaled_features])
```
Model training involves splitting the dataset into training and testing sets, training various machine learning models, and evaluating their performance.
```python
from sklearn.model_selection import train_test_split

# Separate features and target, then hold out 20% of the data for testing
X = refined_data.drop('price', axis=1)
y = refined_data['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
```python
from sklearn.linear_model import LinearRegression

# Baseline: ordinary least-squares linear regression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```
```python
from sklearn.ensemble import RandomForestRegressor

# Random forest: an ensemble of 100 decision trees
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
```
```python
from sklearn.ensemble import GradientBoostingRegressor

# Gradient boosting: trees fit sequentially to the previous ensemble's residuals
gb_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)
gb_pred = gb_model.predict(X_test)
```
```python
from catboost import CatBoostRegressor

# CatBoost gradient boosting; verbose=False silences per-iteration training logs
cat_model = CatBoostRegressor(iterations=500, learning_rate=0.1, depth=6, verbose=False, random_state=42)
cat_model.fit(X_train, y_train)
cat_pred = cat_model.predict(X_test)
```
```python
from sklearn.ensemble import StackingRegressor

# Stacking: random forest and gradient boosting as base learners,
# with a linear regression blending their predictions
stack_model = StackingRegressor(
    estimators=[
        ('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
        ('gb', GradientBoostingRegressor(n_estimators=100, random_state=42))
    ],
    final_estimator=LinearRegression()
)
stack_model.fit(X_train, y_train)
stack_pred = stack_model.predict(X_test)
```
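The results table below includes a tuned gradient boosting model and reports R² and RMSE, but the tuning and evaluation cells are not shown in this README. A minimal sketch of both, assuming a small illustrative parameter grid (the grid actually used in the notebook may differ):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score, mean_squared_error

# Hyperparameter tuning for gradient boosting (illustrative grid, not the notebook's)
param_grid = {
    'n_estimators': [100, 300, 500],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5],
}
grid_search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid, cv=5, scoring='r2', n_jobs=-1
)
grid_search.fit(X_train, y_train)
tuned_gb_pred = grid_search.best_estimator_.predict(X_test)
print('Best parameters:', grid_search.best_params_)

# Evaluate every model on the held-out test set with R² and RMSE
predictions = {
    'Tuned Gradient Boosting': tuned_gb_pred,
    'CatBoost': cat_pred,
    'Gradient Boosting': gb_pred,
    'Stacking Regressor': stack_pred,
    'Random Forest': rf_pred,
    'Linear Regression': y_pred,
}
for name, pred in predictions.items():
    r2 = r2_score(y_test, pred)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(f'{name}: R² = {r2:.4f}, RMSE = {rmse:.2f}')
```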
| Model | R² Score | RMSE |
|---|---|---|
| Tuned Gradient Boosting | 0.8664 | 6061.99 |
| CatBoost | 0.8634 | 6130.54 |
| Gradient Boosting | 0.8612 | 6178.28 |
| Stacking Regressor | 0.8616 | 6180.65 |
| Random Forest | 0.8411 | 6611.37 |
| Linear Regression | 0.6999 | 9086.25 |
✅ Tuned Gradient Boosting emerged as the best model, achieving the highest R² Score (0.8664) and the lowest RMSE (6061.99), making it the most accurate predictor of vehicle prices.
✅ CatBoost and Gradient Boosting also delivered strong results, confirming that ensemble learning techniques are highly effective for this task.
✅ Stacking Regressor, which combines multiple models, performed almost as well as the individual boosting models but didn't outperform the tuned gradient boosting.
✅ Random Forest provided good performance but lagged behind the boosting models, indicating that gradient boosting is more suitable for this type of structured data.
✅ Linear Regression, a simpler model, had the lowest accuracy. This suggests that vehicle pricing is a complex problem requiring non-linear models to capture interactions between features.
- We built a vehicle price prediction model by following a structured machine learning pipeline.
- Several regression models were tested, and Gradient Boosting with hyperparameter tuning performed the best, with an R² Score of 0.8664 and an RMSE of 6061.99.
- If computational efficiency is a concern, CatBoost provides a great balance between performance and speed.
- Stacking models can be explored further for possible performance improvements.
Future improvements can include:
- Trying more feature engineering techniques
- Experimenting with deep learning models
- Deploying the model using Flask or FastAPI (see the sketch after this list)
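Deployment is listed only as a future direction; as one possible shape, here is a minimal FastAPI sketch, assuming the tuned model has been saved with `joblib` and that the request carries feature values already encoded and scaled as in the training pipeline (the file name, endpoint, and schema are hypothetical):

```python
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load('vehicle_price_model.joblib')  # hypothetical saved model file

class VehicleFeatures(BaseModel):
    # Hypothetical request schema: values must already be encoded/scaled
    # exactly as in the training pipeline
    features: dict

@app.post('/predict')
def predict_price(payload: VehicleFeatures):
    # Build a one-row frame from the request and return the model's prediction
    X = pd.DataFrame([payload.features])
    price = float(model.predict(X)[0])
    return {'predicted_price': price}
```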
We welcome contributions from the community! If you would like to improve this project, feel free to:
- Fork the repository 🍴
- Make enhancements 🔧
- Fix issues and bugs 🐛
- Optimize model performance 📈
- Suggest new features 💡
If you find any bugs or issues, please raise them in the Issues section of this repository.
If you have any questions or want to collaborate, feel free to reach out:
📧 Email: jaspreetsingh01110@gmail.com
This project is licensed under the MIT License.
🌟 If you found this project helpful, consider giving it a ⭐ on GitHub!