ANALYZING AND PREDICTING WHITE WINE DATA
It’s the weekend, people!! All I can think of right now is how I can’t wait to unwind with a really good bottle of wine and a good series, preferably Korean. Talking of wines, this post focuses on using machine learning to predict the quality of wine from a sample dataset. Sounds interesting, right? Let’s get right into it!
Here’s a link to my notebook on GitHub analyzing this data: https://github.com/Sofuwa/She-Code-Africa/blob/main/White_Wine_Quality.ipynb
Below are the steps I used in analyzing and predicting my dataset. I’ll be giving a brief explanation of what I did under each step.
- Import Libraries
- Data Preprocessing
- Exploring and Dealing with Outliers
- Exploratory Data Analysis (EDA)
- Splitting and transforming the data
- Dealing with Data Imbalance
- Standardizing the data
- Machine Learning Algorithm
Importing the libraries that I used was the first step. I usually prefer to state all the imported libraries at the very top of my code because it’s more readable than leaving them scattered between other pieces of code.
Data preprocessing was the second and most important step. If it is not done properly, it can lead to misleading insights and affect your predicted values. These errors can have significant negative effects, especially in a business context where the insights could be used to make decisions. I started off by checking whether there were any null values. There are different ways to achieve this, but I typically use the describe function, which gives descriptive statistics of the data, including the count of non-null values, minimum and maximum values, mean, etc. The data had no missing values, but I noticed that it had some extreme values for certain variables, e.g. the min and max for free sulfur dioxide are 2 and 289 respectively.
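Here is a minimal sketch of that check. The tiny DataFrame is a made-up stand-in for the wine data (the real notebook loads the full dataset), and I added one missing value on purpose so there is something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the white wine data, with one deliberate gap.
wine = pd.DataFrame({
    "free sulfur dioxide": [2.0, 30.0, 45.0, 289.0, np.nan],
    "residual sugar": [1.2, 6.4, 20.7, 8.1, 5.0],
})

# isnull().sum() counts the missing values in each column directly.
missing_per_column = wine.isnull().sum()
print(missing_per_column)

# describe() reports count, mean, std, min, quartiles and max per numeric column.
# A count lower than len(wine) is another hint that values are missing.
summary = wine.describe()
print(summary.loc[["count", "min", "max"]])
```

Looking at the min and max rows next to the quartiles is also how extreme values like 289 tend to stand out.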
These extreme values are referred to as outliers, and there are a number of ways to deal with them. I made use of the box plot and IQR methods, which allowed me to visualize the outliers as well as exclude the values below the lower bound and above the upper bound. Caution should be taken when dealing with outliers so as not to exclude a significant percentage of the data.
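The IQR method can be sketched roughly like this. The helper name remove_iqr_outliers, the sample values and the usual 1.5 multiplier are my assumptions for illustration, not necessarily what the notebook uses.

```python
import pandas as pd

def remove_iqr_outliers(df, column, k=1.5):
    """Keep only rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return df[df[column].between(lower, upper)]

# Hypothetical sample with one extreme value (289).
wine = pd.DataFrame({"free sulfur dioxide": [2, 20, 25, 30, 35, 40, 45, 289]})
cleaned = remove_iqr_outliers(wine, "free sulfur dioxide")

# Check how much data was dropped before committing to the filter.
removed_share = 1 - len(cleaned) / len(wine)
print(f"Removed {removed_share:.1%} of the rows")
```

Printing the share of rows removed is a cheap way to follow the caution above: if the filter drops a large percentage of the data, the bounds are probably too strict.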
The machine learning algorithm I utilized was a classification algorithm. I created bins from the quality variable, which I used to create another variable called wine grade. This grade segments wine into bad, good and excellent. I also ran a value count to see how the data is distributed between the three classes, and it showed that there was a significant level of imbalance in the dataset.
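A small sketch of the binning step with pandas is shown here. The bin edges (up to 4 is bad, 5 to 6 is good, 7 and above is excellent) and the sample scores are assumptions for illustration; the notebook may use different cut-offs.

```python
import pandas as pd

# Hypothetical quality scores.
wine = pd.DataFrame({"quality": [3, 4, 5, 6, 6, 6, 7, 8, 5, 6]})

# pd.cut uses right-inclusive intervals by default: (0, 4], (4, 6], (6, 10].
wine["wine grade"] = pd.cut(
    wine["quality"],
    bins=[0, 4, 6, 10],
    labels=["bad", "good", "excellent"],
)

# value_counts shows how the rows are spread across the three grades.
counts = wine["wine grade"].value_counts()
print(counts)
```

Even in this toy sample, the good class dominates, which is the kind of imbalance the value count is meant to reveal.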
Exploratory data analysis is done to give an idea of what the data is like. You can get insights into the distribution of the data, the relationships between variables, etc. I examined the relationship between wine quality (or wine grade) and four other variables I found really interesting: citric acid, volatile acidity, residual sugar and density.
The next step was to encode and split my data into training and test sets. In machine learning, variables fed into the algorithm have to be numbers, that is, any text has to be coded as a number. The label encoder converts text to numbers. The only text in my data was the wine grade. The label encoder works by sorting the text values in ascending order and then assigning numbers to them. This means the wine grades will be sorted as bad, excellent, good and encoded as 0, 1, 2.
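To see the sorting behaviour for yourself, here is a short sketch using scikit-learn’s LabelEncoder on a made-up list of grades.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical wine grades in arbitrary order.
grades = ["good", "bad", "excellent", "good", "bad"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(grades)

# classes_ holds the sorted labels; each label's position is its code.
print(list(encoder.classes_))
print(list(encoded))

# inverse_transform maps the numbers back to the original text.
decoded = encoder.inverse_transform(encoded)
```

Note that the codes follow alphabetical order, not wine quality, so 1 (excellent) is not "in between" bad and good. That is fine for a classification target, but worth remembering when reading the predictions.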
Splitting the data is another necessary step. The data needs to be split so that the model can be trained on one part and evaluated on the other. A properly trained model should be able to make good or great predictions on unseen data. The test data acts as a form of unseen data to measure how accurately the machine learning algorithm can make predictions after it has been trained with the training data.
To deal with the imbalance, I made use of SMOTE (Synthetic Minority Over-sampling Technique). SMOTE helps in dealing with data imbalance such as we have in this data. You can oversample or undersample. SMOTE oversamples the data, in our case the bad and excellent wine grades, which have significantly less data than good wines, by introducing synthetic data to the classes with a lower number of values. This adds more data to those classes and deals with the problem of imbalance.
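To give a feel for what happens under the hood, here is a simplified, SMOTE-like sketch. It is not the imbalanced-learn implementation (in practice you would normally use that library’s SMOTE class); the function name smote_like_oversample, the two-class random data and the parameter values are all assumptions for illustration. The idea is the same though: each synthetic row is placed somewhere on the line between a minority sample and one of its nearest minority neighbours.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point's nearest neighbour is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))
        neighbour = idx[base, rng.integers(1, k + 1)]  # skip column 0 (the point itself)
        gap = rng.random()  # position along the line between the two points
        synthetic[i] = X_min[base] + gap * (X_min[neighbour] - X_min[base])
    return synthetic

# Hypothetical imbalanced data: 100 rows of class 0, 20 rows of class 1.
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
y = np.array([0] * 100 + [1] * 20)

# Split first, keeping class proportions, so the test set stays untouched.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Oversample only the training data until both classes have the same size.
minority = X_train[y_train == 1]
n_needed = int((y_train == 0).sum() - (y_train == 1).sum())
X_new = smote_like_oversample(minority, n_needed)
X_balanced = np.vstack([X_train, X_new])
y_balanced = np.concatenate([y_train, np.ones(n_needed, dtype=int)])
```

The design choice that matters most here is the order: the synthetic rows are added after the split and only to the training set, so the test set still contains nothing but real, unseen samples.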
Finally, I made predictions using machine learning algorithms. To choose the best-suited machine learning algorithm for my data, I tested different models on it and selected the best one based on the prediction results.
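A rough sketch of that comparison loop is shown here. The three models, their settings and the synthetic data from make_classification are my assumptions for illustration, not the exact set used in the notebook. Each model is wrapped in a pipeline with StandardScaler so the scaling is learned from the training data only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical three-class data standing in for the wine grades.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # The scaler is fitted inside the pipeline, on the training data only.
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)
    scores[name] = {
        "train": pipeline.score(X_train, y_train),
        "test": pipeline.score(X_test, y_test),
    }
    print(name, scores[name])
```

Recording both the training and the test accuracy is deliberate: a large gap between the two is one of the simplest warning signs that a model has memorized the training data rather than learned something general.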
After running the different machine learning models against the data, most of the results gave an accuracy, precision, recall and f1 score of 1, meaning that these models predicted the test data values with 100% accuracy. Accuracy should not be the only metric used to determine how good your model is. Other metrics such as precision, recall and f1 score are also important. The closer accuracy, precision, recall and f1 score are to 1, the better your model. The closer to 0, the worse your model.
Scikit-learn explains precision, recall and f1 score as:
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.
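To make these definitions concrete, here is a small worked example in plain Python. The helper precision_recall_f1 and the six made-up labels are only for illustration; in practice scikit-learn’s metric functions do this for you. The f1 score is the F-beta score with beta set to 1, which is the plain harmonic mean of precision and recall.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1

# Hypothetical true and predicted wine grades.
y_true = ["good", "good", "bad", "excellent", "good", "bad"]
y_pred = ["good", "bad", "bad", "good", "good", "bad"]

# For "good": tp = 2, fp = 1, fn = 1, so precision = recall = 2/3.
good_scores = precision_recall_f1(y_true, y_pred, "good")
# For "bad": tp = 2, fp = 1, fn = 0, so precision = 2/3 and recall = 1.
bad_scores = precision_recall_f1(y_true, y_pred, "bad")
print(good_scores, bad_scores)
```

Overall accuracy here is 4 out of 6, yet the excellent wine is never predicted correctly, which is exactly the kind of detail that per-class precision and recall expose and accuracy alone hides.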
This should be something I’m dancing about, but a 100% accuracy most likely tells me that my model is overfitted. This means that bringing in any new data aside from the test data could give us inaccurate results. An accuracy that falls between 80% and 95%, however, is a great prediction result. I decided to employ the KNN model as it wasn’t overfitted and still gave me a great accuracy of about 84%.
That brings me to the end of my mentorship program with She Code Africa. I look forward to learning more and utilizing all I have learnt in the future.
Till next time
Bye!