Projects and practical assignments for the data mining course at AUT, spring 2022
- Preprocessing iris flower dataset
- Neural network
- Clustering
- Association rules
- Final project
Several preprocessing steps were applied in this project.
- Missing values were identified and dropped
- Categorical features were one-hot encoded
- Numerical features were standardized using StandardScaler
- Principal Component Analysis (PCA) was used for dimensionality reduction (4D to 2D)
- The reduced features were visualized
- The original dataset (without NaN values) was visualized using box plots
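
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the iris data is loaded through scikit-learn's `load_iris`, treats the species label as the categorical column, and uses scikit-learn's `OneHotEncoder` and `PCA` (the source names only `StandardScaler`).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumption: the dataset comes from scikit-learn rather than a local file.
X, y = load_iris(return_X_y=True)

# Drop every row that contains at least one missing value.
keep = ~np.isnan(X).any(axis=1)
X, y = X[keep], y[keep]

# One-hot encode the categorical column (here: the species label).
species_onehot = OneHotEncoder().fit_transform(y.reshape(-1, 1)).toarray()

# Standardize the four numerical features to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Project the 4D standardized features onto the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape, pca.explained_variance_ratio_)
```

The 2D array `X_2d` is what a scatter plot of the reduced features would use, colored by `y`.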
The final goal of this project was to build and tune a suitable ANN for classifying 2D data arranged in circles. Several steps were carried out to gain a better understanding of ANNs. In each step, model accuracy and loss were plotted for both the training and test datasets.
A simple ANN on the Fashion MNIST dataset using TensorFlow.
Train accuracy: 0.8821
Test accuracy: 0.8820
Confusion matrix
Using k-means and DBSCAN to cluster the given datasets
- K-means was used to cluster a given dataset.
- The elbow method was used to find the optimal value of K for the k-means algorithm
- K-means was applied to the digits dataset: dimensionality reduction was first done using Isomap, then k-means was run on the reduced dataset.
- KNN was used to determine a suitable value for epsilon in the DBSCAN algorithm
- Different values of MinPts and epsilon were tested to find the best hyperparameters for DBSCAN on the dataset
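
One common way to use nearest neighbors for choosing epsilon is the k-distance curve: compute each point's distance to its k-th nearest neighbor, sort the values, and look for the point where the curve bends sharply. The sketch below assumes this is the approach meant above; the helper `k_distances` and the toy points are hypothetical, and it uses only the standard library rather than whatever nearest-neighbor routine the project used.

```python
import math

def k_distances(points, k):
    """Return each point's distance to its k-th nearest neighbor, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Hypothetical toy data: two dense groups and one isolated outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0),
          (20.0, 20.0)]

curve = k_distances(points, k=2)
print(curve)
```

On this toy data the curve stays flat for the eight clustered points and jumps for the outlier, so an epsilon slightly above the flat part would keep the dense groups together while leaving the outlier as noise. A common heuristic is to set k to the intended MinPts value.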
Results: k-means performs well when the clusters are linearly separable. In other cases, because k-means draws linear boundaries between clusters, the results are poor. DBSCAN, a density-based clustering algorithm, performs better than k-means when the dataset is not linearly separable.
In this assignment, association rules were extracted from a given dataset using the Apriori algorithm.
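
As an illustration of how Apriori works, here is a small standard-library sketch. It is not the implementation used in the assignment (the source does not say which library was used); the function names `apriori` and `rules`, the thresholds, and the toy baskets are all hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {c: s for c in current if (s := support(c)) >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) triples from frequent itemsets."""
    found = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = supp / frequent[antecedent]
                if confidence >= min_confidence:
                    found.append((antecedent, itemset - antecedent, confidence))
    return found

# Hypothetical toy transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

frequent = apriori(baskets, min_support=0.6)
strong_rules = rules(frequent, min_confidence=0.9)
print(strong_rules)
```

The prune step is what makes Apriori cheaper than brute-force enumeration: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded before their support is counted.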
The goal of this project was to implement a classifier that detects signs of diabetes or pre-diabetes in the given patient information, based on a CDC dataset. XGBoost was used to implement the classification model.
- Preprocessing
  - Null and meaningless values were removed
  - Numerical features were normalized
  - Categorical features were one-hot encoded
  - Train and test datasets were created
- Model creation
  - The classification model was defined using XGBClassifier
- Model evaluation
  - Accuracy, precision and recall were calculated for the train and test datasets
  - The ROC-AUC score was calculated
  - The confusion matrix was plotted
- Hyperparameter tuning
  - The best hyperparameters for the XGBClassifier were found using GridSearchCV
  - The effect of hyperparameter changes was plotted
Best hyperparameters: {'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 3, 'n_estimators': 300}
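
A grid search of this kind can be sketched as below. To keep the example dependent on scikit-learn only, it swaps in `GradientBoostingClassifier` for `XGBClassifier` and runs on synthetic data; the grid values, the data, and the 3-fold setting are hypothetical, and `colsample_bytree` is left out because the stand-in model has no such parameter. Since `XGBClassifier` follows the scikit-learn estimator interface, it can be passed to `GridSearchCV` in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the preprocessed dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hypothetical grid: every combination is evaluated with cross-validation.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy of the refitted best model
```

The per-combination scores in `search.cv_results_` are what a plot of hyperparameter changes would typically be drawn from.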
Test Accuracy: 75.70%
Train Accuracy: 76.13%
Test precision: 0.759
Train precision: 0.763
Test recall: 0.757
Train recall: 0.761
ROC-AUC score: 0.840
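
The recall values above match the accuracy figures to three decimals. That is what support-weighted averaging produces, because weighted recall is always equal to accuracy. The sketch below shows how accuracy and weighted precision and recall can be computed from labels; the weighted averaging mode is an assumption about how the numbers were obtained, and the function `weighted_metrics` and the toy labels are hypothetical.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Return (accuracy, weighted precision, weighted recall)."""
    n = len(y_true)
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n

    precision = recall = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        class_precision = tp / predicted if predicted else 0.0
        class_recall = tp / count
        # Weight each class by its share of the true labels.
        weight = count / n
        precision += weight * class_precision
        recall += weight * class_recall
    return accuracy, precision, recall

# Hypothetical toy labels.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

accuracy, precision, recall = weighted_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```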