Using machine learning to predict the quality of wines, based on the given features in the dataset.
Tools used: Python (Numpy, Pandas, Matplotlib, Seaborn, Scikit-learn)
Summary:
We will explore the data, its distributions and correlations, deal with outliers, select features, form hypotheses and test them, try different classification algorithms, and finally tune the model with the best performance.
Result: A Random Forest Classifier model with 71.33% accuracy in predicting the quality of a wine. After some adjustments with the help of wine professionals, this model may be able to help wine companies or individuals accurately judge the quality of their wines.
Features:
fixed_acidity
- Fixed acidity of the wine.
volatil_acidity
- Volatile acidity of the wine (the column name appears to be misspelled in the dataset).
citric_acid
- Amount of citric acid in the wine.
residual_sugar
- Amount of residual sugar in the wine.
chlorides
- Amount of chlorides in the wine.
free_sulfur_dioxide
- Amount of free sulfur dioxide in the wine.
total_sulfur_dioxide
- Total amount of sulfur dioxide in the wine.
density
- Density of the wine.
pH
- pH level of the wine.
sulphates
- Amount of sulphates in the wine.
alcohol
- Alcohol content in the wine.
quality
- Quality of the wine. This is our target variable.
Opening the dataframe using the Variable Explorer in Spyder:
All of the features are dumped into a single column, separated by semicolons. First I separate them into different columns by applying a lambda function, then I drop the initial column.
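Roughly, that step looks like the sketch below (the file name and the exact lambda are assumptions on my part):

```python
import pandas as pd

raw = pd.read_csv('winequality.csv')   # hypothetical file name; loads as one packed column
packed = raw.columns[0]                # the header string is semicolon-separated too

# Split every row on ';' into separate columns, name them from the header,
# convert everything to numeric, and leave the packed original behind.
df = raw[packed].apply(lambda row: pd.Series(row.split(';')))
df.columns = packed.split(';')
df = df.apply(pd.to_numeric)
```

(Passing sep=';' to read_csv would achieve the same thing in a single call.)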
Much better (and now actually usable).
First let's look at the correlation between the features.
Looking at the correlation between just the target variable and features:
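A minimal sketch of both views, assuming the cleaned dataframe is named df:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Feature-to-feature correlation matrix as a heatmap.
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()

# Correlations with the target alone, strongest first.
print(df.corr()['quality'].drop('quality').sort_values(ascending=False))
```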
We can see that alcohol content and density have the strongest relationships with the quality of a wine: in general, the higher the alcohol content the better the quality, and the lower the density the better the quality. Sulphates seem to have the least impact on quality.
Let's look at the top four features that impact quality (alcohol, density, chlorides, and volatil_acidity):
Since alcohol and density were difficult to read, as their changes are so small, I zoomed in on them by setting a smaller y-limit.
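Something along these lines; the y-limits below are illustrative values, not the exact ones from the original plots:

```python
import matplotlib.pyplot as plt

top_features = ['alcohol', 'density', 'chlorides', 'volatil_acidity']

# Mean of each feature per quality score, one bar chart per feature.
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, feat in zip(axes.flat, top_features):
    df.groupby('quality')[feat].mean().plot.bar(ax=ax, title=feat)

# alcohol and density vary on a narrow scale, so tighten their y-limits.
axes.flat[0].set_ylim(8, 13)          # illustrative limits
axes.flat[1].set_ylim(0.985, 1.005)   # illustrative limits
plt.tight_layout()
plt.show()
```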
alcohol
- Wines with quality between 6 and 8 have the most alcohol. Interestingly, 9s have almost the same alcohol content as 3s.
density
- Normally distributed. Wines with quality 6 seem to be the most dense. Highest quality wines have the lowest density.
chlorides
- Since this is negatively correlated, we can see that the best wines have the lowest chlorides.
volatil_acidity
- Similar to chlorides.
Let's take a look at their distributions:
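Plotted roughly like so, reusing top_features from the previous snippet (histplot assumes a reasonably recent seaborn):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Histogram with a KDE overlay for each of the top four features.
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, feat in zip(axes.flat, top_features):
    sns.histplot(df[feat], kde=True, ax=ax)
plt.tight_layout()
plt.show()
```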
alcohol
- Fairly normal distribution, though skewed to the right.
density, chlorides, volatil_acidity
- Non-normal distributions. There seem to be a lot of outliers. These will probably have to be scaled, depending on the model.
Let's take a look at their box plots as well:
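A sketch in the same vein, continuing with the names from the previous snippets:

```python
# One box plot per feature to make the outliers stand out.
fig, axes = plt.subplots(1, 4, figsize=(14, 4))
for ax, feat in zip(axes, top_features):
    sns.boxplot(y=df[feat], ax=ax)
plt.tight_layout()
plt.show()
```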
We can see that there are a lot of outliers. These could affect model performance negatively.
Since alcohol is the most correlated feature, it might make sense to have a look at its relation with the other features. Let's also have a look at the relations between its top 4 most highly correlated variables: density, residual_sugar, total_sulfur_dioxide, and chlorides (I shall not include quality, as it is not continuous).
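A pairplot is a quick way to eyeball all of those relations at once (a sketch; the original may have used separate scatter plots instead):

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df[['alcohol', 'density', 'residual_sugar',
                 'total_sulfur_dioxide', 'chlorides']])
plt.show()
```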
Wow. This makes me want to try building a model to predict the alcohol content of a wine as well. I wonder, though, if the high correlation between density and alcohol could have a negative impact on the quality prediction.
Since all of these features are negatively correlated, we can see that, in general, the lower their presence, the better the quality tends to be.
I'm going to make models with four different variations of this dataset:
- All of the features in the raw data. I expect the models built on this to perform the worst, but it should give an obvious baseline performance to beat.
- Highly correlated features in the raw data. I expect models on this to perform better than the previous one.
- All of the features without outliers. I would expect this one to perform somewhat similarly to the previous one, but I might be surprised.
- Highly correlated features without outliers. I expect models built on this one to perform the best.
Since I am not exactly the most knowledgeable about wines, it is difficult to say at what point outliers can be safely discarded, so I will try my best to only remove extreme cases. Still, it should be kept in mind that this model is not going to be the best to use in real-world scenarios.
For most of the highly correlated features, I have discarded values above the 0.995 quantile and below the 0.005 quantile.
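In code, the trimming looks roughly like this; the feature list and applying the same cutoffs to every feature are assumptions:

```python
high_corr = ['alcohol', 'density', 'chlorides', 'volatil_acidity']

# Drop rows outside the 0.005 to 0.995 quantile band, feature by feature.
no_outliers = df.copy()
for feat in high_corr:
    lo, hi = no_outliers[feat].quantile([0.005, 0.995])
    no_outliers = no_outliers[no_outliers[feat].between(lo, hi)]
```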
The four highest correlated features in the new dataset:
Bars:
Distributions:
Boxplots:
Histograms:
The data is now slightly closer to a normal distribution. Since I will be using Random Forest to build the model, there's no need to scale the data.
First I import the necessary modules, load the original dataset and the one without outliers, and then I select the features that will be used:
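A sketch of that setup, continuing with df (raw) and no_outliers (trimmed) from the earlier snippets:

```python
# The four dataset variations described above.
features_all = [c for c in df.columns if c != 'quality']
features_top = ['alcohol', 'density', 'chlorides', 'volatil_acidity']

datasets = {
    'raw_all':   (df,          features_all),
    'raw_top':   (df,          features_top),
    'clean_all': (no_outliers, features_all),
    'clean_top': (no_outliers, features_top),
}
```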
Next, I split them into different training and target sets and import Random Forest Classifier:
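Roughly like this; the test size and random seed are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# One train/test split per dataset variation.
splits = {}
for name, (data, feats) in datasets.items():
    splits[name] = train_test_split(data[feats], data['quality'],
                                    test_size=0.2, random_state=42)
```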
Training and scoring:
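One default-parameter forest per variation, along these lines:

```python
# Fit and score a baseline Random Forest on each variation.
for name, (X_train, X_test, y_train, y_test) in splits.items():
    rf = RandomForestClassifier(random_state=42)
    rf.fit(X_train, y_train)
    print(f'{name}: {rf.score(X_test, y_test):.4f}')
```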
Interestingly, the model fitted with the unaltered data and all of the features performs the best (70.61%), while the one with outliers removed and only the highly correlated features performs the worst. So, my hypothesis was completely wrong.
Now we are going to take the best model and increase its performance by tuning the hyperparameters. I'll first use RandomizedSearchCV, and then run GridSearchCV around the resulting RSCV parameters to tune it further.
First I import randomized search and set a range for the hyperparameters I want to test. Then I fit a new random forest through the RSCV and get the best parameters.
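The ranges below are illustrative, not the exact ones from the original search:

```python
from sklearn.model_selection import RandomizedSearchCV

# Keep working with the best variation: raw data, all features.
X_train, X_test, y_train, y_test = splits['raw_all']

param_dist = {
    'n_estimators':      [100, 200, 400, 800, 1600],
    'max_depth':         [None, 10, 20, 40, 80],
    'max_features':      ['sqrt', 'log2'],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf':  [1, 2, 4],
    'bootstrap':         [True, False],
}

rscv = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                          param_dist, n_iter=100, cv=3,
                          n_jobs=-1, random_state=42)
rscv.fit(X_train, y_train)
print(rscv.best_params_)
```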
Then, I instantiate a new model using the best RSCV parameters, fit it on the training data, and score it on the test set:
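Which amounts to something like:

```python
# New forest with the best parameters found by the randomized search.
best_rf = RandomForestClassifier(**rscv.best_params_, random_state=42)
best_rf.fit(X_train, y_train)
print(best_rf.score(X_test, y_test))
```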
Result: the tuned model scores 70.82%, a 0.21% increase in performance.
Next, I am going to run a grid search for the values around the best RSCV parameters:
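A sketch of that search; the grid values are illustrative neighborhoods, not the actual ones used:

```python
from sklearn.model_selection import GridSearchCV

# Narrow grid around the best randomized-search values.
param_grid = {
    'n_estimators':      [700, 800, 900],
    'max_depth':         [30, 40, 50],
    'min_samples_split': [2, 3],
    'min_samples_leaf':  [1, 2],
}

gscv = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=3, n_jobs=-1)
gscv.fit(X_train, y_train)
print(gscv.best_params_, gscv.score(X_test, y_test))
```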
Result: the final model scores 71.33%, an overall increase of 0.72% over the original model.
Confusion Matrix:
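Generated along these lines:

```python
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

labels = sorted(y_test.unique())
cm = confusion_matrix(y_test, gscv.predict(X_test), labels=labels)

sns.heatmap(cm, annot=True, fmt='d', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted quality')
plt.ylabel('Actual quality')
plt.show()
```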
We have a Random Forest Classifier model with 71.33% accuracy in predicting the quality of a wine.
While I would like to find more ways to improve this model's performance, I have spent entirely too much time and effort on this project, and have now realized that I am extremely dispassionate about wines. I wanted to stop midway, but I couldn't bring myself to just abandon something I started.
This project certainly forced me to learn a lot, as it is one of my very first ones. And even though it is not one of my favorite projects, I will remember it gratefully for that.