This project applies Decision Tree and Random Forest classifiers to the Wisconsin Breast Cancer Dataset and compares their performance. The goal is to predict whether a tumor is benign or malignant based on various continuous attributes.
The Wisconsin Breast Cancer Dataset consists of 10 continuous attributes and 1 target class attribute:
- Attributes: 10 continuous features related to tumor characteristics.
- Target Class: Diagnosis ('B' for benign and 'M' for malignant).
- Loaded the dataset from a CSV file.
- Removed the irrelevant Id column.
- Scaled the features using Z-scores for normalization.
- Split the data into training and testing sets (70/30 split).
- Applied Decision Tree and Random Forest classifiers.
- Calculated and displayed performance metrics including accuracy, confusion matrix, and classification report for both models.
- Used cross-validation to compare the accuracy of both algorithms.
- Displayed the accuracy, confusion matrix, and classification report for each model.
- Generated a boxplot to compare the cross-validated accuracy scores of both models.
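As a quick sanity check of the Z-score normalization step above, the sketch below (using synthetic numbers, not the project's CSV) confirms that scikit-learn's StandardScaler produces the same values as the manual formula z = (x - mean) / std:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature column: three synthetic measurements (illustrative only)
x = np.array([[1.0], [2.0], [3.0]])

# Manual Z-score: (x - mean) / population std (ddof=0, matching StandardScaler)
manual = (x - x.mean()) / x.std(ddof=0)

scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))  # the two standardizations agree
```

Z-scoring is not strictly required for tree-based models, but it keeps the features on a common scale and does no harm here.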
To run this project, you need the following libraries:
- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
You can install them using the following:

pip install numpy pandas matplotlib seaborn scikit-learn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv('/content/wisc_bc_data.csv')
data.head()

- Drop the Id column.
- Scale features using Z-scores.
# Drop the id column (an identifier, not a predictive feature) and separate features from the target
features = data.drop(['id', 'diagnosis'], axis=1)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
scaled_features_df = pd.DataFrame(scaled_features, columns=features.columns)

X = scaled_features_df
Y = data['diagnosis']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

- Train and evaluate the Decision Tree and Random Forest models.
- Generate accuracy, confusion matrix, and classification report.
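Before the training loop, it may help to see how these metrics read on a tiny example. The labels below are hypothetical, not drawn from the dataset; 'B' and 'M' follow the dataset's benign/malignant encoding:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy predictions: five cases, two misclassified
y_true = ['B', 'B', 'M', 'M', 'M']
y_pred = ['B', 'M', 'M', 'M', 'B']

# Rows are true classes, columns are predicted classes (order follows `labels`):
# cm[0, 0] = true 'B' predicted 'B'; cm[1, 0] = true 'M' predicted 'B'; etc.
cm = confusion_matrix(y_true, y_pred, labels=['B', 'M'])
print(cm)

# Accuracy is the fraction of correct predictions: 3 of 5 here
print(accuracy_score(y_true, y_pred))
```

In a medical setting the off-diagonal cells matter most: cm[1, 0] counts malignant tumors predicted benign, the costliest kind of error.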
algorithms = [('Decision Tree', DecisionTreeClassifier(random_state=42)),
              ('Random Forest', RandomForestClassifier(random_state=42))]

results = {}
for name, model in algorithms:
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    accuracy = accuracy_score(Y_test, Y_pred)
    cm = confusion_matrix(Y_test, Y_pred)
    report = classification_report(Y_test, Y_pred)
    results[name] = {'accuracy': accuracy, 'confusion_matrix': cm, 'classification_report': report}

cv_results = {}
for name, model in algorithms:
    scores = cross_val_score(model, X, Y, cv=10, scoring='accuracy')
    cv_results[name] = scores

flat_accuracy_scores = [score for name in cv_results for score in cv_results[name]]
flat_algorithms = [name for name in cv_results for _ in cv_results[name]]
data_for_plot = pd.DataFrame({'Algorithm': flat_algorithms, 'Accuracy': flat_accuracy_scores})
sns.boxplot(x='Algorithm', y='Accuracy', data=data_for_plot)
plt.show()

The results for the Decision Tree and Random Forest classifiers include:
- Accuracy
- Confusion Matrix
- Classification Report

Additionally, a boxplot is generated to visually compare the accuracy distributions of both models.
The project demonstrates how Decision Tree and Random Forest classifiers perform on the Wisconsin Breast Cancer Dataset and compares their accuracy through cross-validation. Based on the metrics, you can select the best-performing model for further analysis or deployment.
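One simple selection rule is to take the model with the highest mean cross-validated accuracy. The sketch below uses made-up fold scores (not results from this project) to show the idea:

```python
import numpy as np

# Hypothetical 10-fold accuracy scores per model (illustrative numbers only)
cv_scores = {
    'Decision Tree': np.array([0.91, 0.93, 0.90, 0.92, 0.94, 0.91, 0.93, 0.92, 0.90, 0.92]),
    'Random Forest': np.array([0.95, 0.96, 0.94, 0.97, 0.95, 0.96, 0.95, 0.94, 0.96, 0.95]),
}

# Pick the model whose mean cross-validated accuracy is highest
best = max(cv_scores, key=lambda name: cv_scores[name].mean())
print(best)
```

In practice you would also look at the spread of the fold scores (as the boxplot does) and at class-specific metrics from the classification report, not the mean alone.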