# Wine Quality Classification

<table align="left">

<td>
<a href="https://github.com/GoogleCloudPlatform/ai-ml-recipes/blob/main/notebooks/classification/logistic_regression/wine_quality_classification_mlr.ipynb">
<img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
View on GitHub
</a>
</td>
<td>
<a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/ai-ml-recipes/main/notebooks/classification/logistic_regression/wine_quality_classification_mlr.ipynb">
<img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
Open in Vertex AI Workbench
</a>
</td>
</table>

## Overview

### Multinomial Logistic Regression

<strong>Type</strong>: Classification<br>
<strong>UCI Open Source Dataset</strong>: [Wine Quality](https://archive.ics.uci.edu/dataset/186/wine+quality)

This dataset contains red and white vinho verde wine samples from the north of Portugal, along with wine quality data based on physicochemical tests ([Cortez et al., 2009](http://www3.dsi.uminho.pt/pcortez/wine/)).

<strong>Problem</strong>: Imagine you are a wine specialist who is looking for an automated way to categorize the wines you find based on wine quality data from physicochemical tests. You could use a machine learning algorithm to train a model that would be able to predict the quality of a wine based on its physicochemical properties. This would allow you to quickly and easily categorize new wines that you find, without having to manually taste them.

Here are some of the benefits of using an automated wine categorization system:

- <strong>Speed</strong>: An automated system can categorize wines much faster than a human can. This is especially beneficial for wine retailers and distributors who need to quickly categorize large numbers of wines.
- <strong>Accuracy</strong>: An automated system can be more accurate than a human when it comes to categorizing wines. This is because the system is not influenced by personal biases or preferences.
- <strong>Consistency</strong>: An automated system will consistently categorize wines in the same way, which can help to ensure that customers are getting the wines they expect.

If you are a wine specialist who is looking for an efficient and accurate way to categorize wines, then an automated system may be the perfect solution for you.

## Setup

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, countDistinct, isnan, when, count, udf
from pyspark.sql.types import StringType

from pyspark.mllib.stat import Statistics
from pyspark.ml.feature import StringIndexer, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

### Load dataset from public metastore

## Exploratory Data Analysis

In [None]:
spark = SparkSession.builder \
    .appName("Multinomial logistic regression Wine Quality") \
    .enableHiveSupport() \
    .getOrCreate()

In [None]:
df = spark.read.option("header", True).csv("gs://dataproc-metastore-public-binaries/winequality_white/")

|fixed_acidity|volatile_acidity|citric_acid|residual_sugar|chlorides|free_sulfur_dioxide|total_sulfur_dioxide|density|  pH|sulphates|alcohol|quality|
|-------------|----------------|-----------|--------------|---------|-------------------|--------------------|-------|----|---------|-------|-------|
|          7.0|            0.27|       0.36|          20.7|    0.045|               45.0|               170.0|  1.001| 3.0|     0.45|    8.8|      6|
|          6.3|             0.3|       0.34|           1.6|    0.049|               14.0|               132.0|  0.994| 3.3|     0.49|    9.5|      6|
|          8.1|            0.28|        0.4|           6.9|     0.05|               30.0|                97.0| 0.9951|3.26|     0.44|   10.1|      6|
|          7.2|            0.23|       0.32|           8.5|    0.058|               47.0|               186.0| 0.9956|3.19|      0.4|    9.9|      6|
|          7.2|            0.23|       0.32|           8.5|    0.058|               47.0|               186.0| 0.9956|3.19|      0.4|    9.9|      6|

### DataFrame Column Data Types

In [None]:
df = df.select(*(col(c).cast("float").alias(c) for c in df.columns))
df = df.withColumn("quality", col("quality").cast("int"))

In [None]:
df.printSchema()

### Summary Statistics 

At this point, all columns contain numerical values. For numerical features, we are often interested in summary statistics such as the count, mean, standard deviation, minimum, and maximum.

In [None]:
df.describe().show(5,8)

Let's investigate a bit more of our target data by using the .groupby() function.

In [None]:
df.groupby(
    col('quality')).\
    count().\
    show(5,50)

We can see here that the data is <b>imbalanced</b> for our target. <b>Imbalanced</b> data is a common problem in machine learning, where the number of samples in one class is much larger than the number of samples in another class. This can make it difficult to train a model that can accurately predict the minority class. There are a number of techniques that can be used to handle imbalanced data, including:

- <b>Resampling</b>: This involves increasing the number of samples in the minority class or decreasing the number of samples in the majority class. This can be done by oversampling the minority class (creating new samples), undersampling the majority class (removing samples), or a combination of both.
- <b>Cost-sensitive learning</b>: This involves assigning different costs to misclassifications of different classes. This can help to focus the model on correctly classifying the minority class.
- <b>Ensemble learning</b>: This involves training multiple models on different subsets of the data and then combining the predictions of the models. This can help to improve the accuracy of the model on the minority class.
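
As a small illustration of the cost-sensitive option, the sketch below derives inverse-frequency class weights in plain Python. The helper name `inverse_frequency_weights` and the toy label list are assumptions made for this example; they are not part of the notebook or of any library.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return {class: weight}, where weight = n_samples / (n_classes * class_count).

    Hypothetical helper: rarer classes receive larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Illustrative toy labels, NOT the real class counts of the wine dataset
toy_labels = ["good"] * 6 + ["normal"] * 3 + ["poor"] * 1
weights = inverse_frequency_weights(toy_labels)
print(weights)
```

In Spark, weights like these could be attached to each row as an extra column and passed to `LogisticRegression` through its `weightCol` parameter, as an alternative to resampling.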

We need to <b>resample</b> the data to balance the dataset. However, before we do that, we need to check if there are any issues with the data that need to be resolved. For example, we need to make sure that there are no missing values in the data. We also need to make sure that the data is not corrupted. Once we have resolved any issues with the data, we can then resample it to balance the dataset.

### Let's summarize our data by row, column, features, unique, and missing values.

In [None]:
print("Rows     : ", df.count())
print("Columns  : ", len(df.columns))
print("\nFeatures : \n", df.columns)

# show() prints the table itself and returns None, so call it on its own line
print("\nUnique values : ")
expression = [countDistinct(c).alias(c) for c in df.columns]
df.select(*expression).show()

# Count entries that are NaN or NULL in each column
print("\nMissing values : ")
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()

There are no missing values or other data issues, so we can resample the data.

### Distribution of Features

In [None]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(20,10))
st = fig.suptitle("Distribution of Features",
                  fontsize=20,
                  verticalalignment='center')

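# Convert to pandas once and loop over all 12 columns (fills the 3x4 grid).
# The loop variable is called `name` so it does not shadow pyspark's `col`.
pdf = df.toPandas()
for num, name in enumerate(pdf.columns, start=1):
    ax = fig.add_subplot(3,4,num)
    ax.hist(pdf[name])
    plt.grid(False)
    plt.xticks(rotation=45,fontsize=10)
    plt.yticks(fontsize=10)
    plt.title(name.upper(),fontsize=20)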

plt.tight_layout()
st.set_y(0.95)
fig.subplots_adjust(top=0.85,hspace = 0.4)
plt.show()

Most of the features appear to be approximately normally distributed. The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that is widely used in statistical modeling and machine learning. It is a bell-shaped curve that is symmetrical around its mean and is characterized by its mean and standard deviation.

In machine learning, features that are roughly normally distributed are convenient for model building. Logistic regression does not strictly require normally distributed features, but roughly symmetric features on comparable scales make standardization more meaningful and tend to help the optimizer converge. Some other methods, such as linear discriminant analysis, do explicitly assume that the features follow a multivariate normal distribution.

### Pearson Correlation

In [None]:
import pandas as pd

col_names = df.columns
features = df.rdd.map(lambda row: row[0:])
corr_mat= Statistics.corr(features, method="pearson")
corr_df = pd.DataFrame(corr_mat)
corr_df.index, corr_df.columns = col_names, col_names

corr_df

A Pearson correlation coefficient with an absolute value of 0.7 or greater is generally considered to indicate a strong correlation. This means that there is a high degree of linear relationship between the two variables. For example, if the correlation coefficient between height and weight is 0.7, this means that taller people tend to be heavier, on average.

However, it is important to note that the strength of a correlation coefficient can vary depending on the specific variables being measured. For example, a correlation coefficient of 0.7 might be considered strong for some variables, but weak for others.

Here we can see that <b>residual_sugar</b> has a <b>strong correlation</b> with <b>density</b>.
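
For reference, the coefficient reported in the matrix can be computed by hand. The sketch below implements the Pearson formula in plain Python and applies it to the `residual_sugar` and `density` values of the first four distinct rows in the preview table. The helper name `pearson` is an assumption for this example, and four rows are far too few for a reliable estimate; the point is only to show the arithmetic.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Values copied from the first four distinct rows of the preview table
residual_sugar = [20.7, 1.6, 6.9, 8.5]
density = [1.001, 0.994, 0.9951, 0.9956]

r = pearson(residual_sugar, density)
print(f"Pearson r on four sample rows: {r:.3f}")
```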

## Feature engineering

<b>Feature engineering</b> is the process of transforming raw data into features that are more informative and useful for machine learning algorithms. This can involve a variety of tasks, such as:

- <b>Data transformation</b>: This involves transforming the data into a format that is more suitable for machine learning algorithms. For example, categorical data can be encoded as numerical data, and continuous data can be discretized.
- <b>Feature selection</b>: This involves selecting the most important features from the data set. This can be done using a variety of techniques, such as statistical significance tests and feature importance scores.
- <b>Feature creation</b>: This involves creating new features from the existing data. This can be done by combining existing features, or by creating derived features that are based on the relationships between different features.

### Bucketing a quality label

- Poor for less than 4
- Normal between 4 and 5 (inclusive)
- Good between 6 and 7 (inclusive)
- Excellent for more than 7

In [None]:
@udf(returnType=StringType())
def quality_score_to_label(score: int):
    if score >=0 and score < 4:
        return 'poor'
    elif score >=4 and score <=5:
        return 'normal'
    elif score >=6 and score <=7:
        return 'good'
    else:
        return 'excellent'

In [None]:
df.show()

In [None]:
df.printSchema()

In [None]:
df_buc = df.withColumn('quality_label', quality_score_to_label('quality'))
df_buc = df_buc.drop('quality')
df_buc.show()
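
The bucket boundaries used by the UDF above can be sanity-checked outside Spark. The sketch below mirrors the same branching in a plain Python function and evaluates it on the scores at either side of each boundary. The name `bucket_quality` is a hypothetical helper introduced only for this check.

```python
def bucket_quality(score: int) -> str:
    """Plain-Python mirror of the bucketing rule (hypothetical helper)."""
    if 0 <= score < 4:
        return 'poor'
    elif 4 <= score <= 5:
        return 'normal'
    elif 6 <= score <= 7:
        return 'good'
    else:
        return 'excellent'

# Check the scores on either side of each bucket boundary
boundary_labels = {score: bucket_quality(score) for score in (3, 4, 5, 6, 7, 8)}
print(boundary_labels)
```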

### LabelIndexer

In [None]:
label_indexer = StringIndexer()\
            .setInputCol ("quality_label")\
            .setOutputCol ("quality")

label_indexer_model = label_indexer.fit(df_buc)
label_indexer_df = label_indexer_model.transform(df_buc)

label_indexer_df = label_indexer_df.drop('quality_label')

label_indexer_df.show(5,50)

In [None]:
label_indexer_df.groupby('quality').\
    count().\
    show(5,50)

### Resampling [Optional step]

In [None]:
from imblearn.over_sampling import SMOTE

# Convert to pandas once; the last column ("quality") is the indexed label
pdf_labeled = label_indexer_df.toPandas()
X, y = pdf_labeled.iloc[:, :-1], pdf_labeled.iloc[:, [-1]]
sm = SMOTE(k_neighbors=6)
X_res, y_res = sm.fit_resample(X, y)

In [None]:
df_res = spark.createDataFrame(pd.concat([X_res, y_res], axis=1))
df_res.show(5,50)

In [None]:
df_res.groupby('quality').\
    count().\
    show(5,50)
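
SMOTE builds each synthetic sample by interpolating between a minority-class sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step in plain Python. The helper `smote_like_sample` and the two toy feature vectors are assumptions for illustration; the neighbour search performed by the real algorithm is left out.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between `point` and `neighbor`."""
    gap = rng.random()  # uniform draw in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Two toy feature vectors standing in for minority-class samples
a = [7.0, 0.27, 0.36]
b = [6.3, 0.30, 0.34]

synthetic = smote_like_sample(a, b, rng)
print(synthetic)
```

Every coordinate of the synthetic point lies between the corresponding coordinates of the two inputs, which is why SMOTE does not produce values outside the range already seen in the minority class.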

### Vectorizing to prepare Data for Machine Learning

In [None]:
df_vec= df_res.rdd.map(lambda x:(Vectors.dense(x[0:-1]), x[-1])).toDF(["vectorized_features", "label"])
df_vec.show(5,50)

### Standardization

In [None]:
scaler = StandardScaler()\
         .setInputCol ("vectorized_features")\
         .setOutputCol ("features")
        
scaler_model = scaler.fit(df_vec)
scaler_df = scaler_model.transform(df_vec)

scaler_df = scaler_df.select('features', 'label')
scaler_df.show(5,50)
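
With its default settings (`withStd=True`, `withMean=False`), Spark's `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean. The sketch below reproduces that arithmetic in plain Python for a single feature, using the `alcohol` values of the first four distinct rows in the preview table. The helper `scale_to_unit_std` is a hypothetical name used only here.

```python
import statistics

def scale_to_unit_std(values):
    """Divide by the sample standard deviation; the mean is not subtracted."""
    std = statistics.stdev(values)  # corrected (n - 1) sample standard deviation
    return [v / std for v in values]

# alcohol values copied from the first four distinct rows of the preview table
alcohol = [8.8, 9.5, 10.1, 9.9]
scaled = scale_to_unit_std(alcohol)

print(scaled)
print("std after scaling:", statistics.stdev(scaled))
```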

## Model Choice

<b>Multinomial logistic regression</b> is a type of logistic regression that can be used for multi-class classification problems. In the case of wine quality classification, there are 4 classes (poor, normal, good, and excellent), so multinomial logistic regression is a good choice for modeling this problem.

The physicochemical tests can be used to measure the various properties of wine, such as acidity, alcohol content, and sugar content. These properties can then be used as features in the multinomial logistic regression model.

Here are some of the advantages of using multinomial logistic regression for wine quality classification:

- It is a relatively simple model that is easy to understand and interpret.
- It is a very flexible model that can be used to model a variety of different types of data.
- It is a very efficient model that can be estimated quickly and easily.

Here are some of the disadvantages of using multinomial logistic regression for wine quality classification:

- It may not be as accurate as some other models, such as support vector machines or decision trees.
- It may not be able to capture the nonlinear relationships between the features and the class labels.

In addition to multinomial logistic regression, there are a number of other models that could be used for wine quality classification. Some of these other models include support vector machines, decision trees, and random forests. However, multinomial logistic regression is a good starting point for wine quality classification because it is a simple, flexible, and efficient model.
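
To make the model concrete: multinomial logistic regression computes one linear score per class and turns the scores into probabilities with the softmax function. The sketch below shows this in plain Python. The coefficient values are made up for illustration (4 classes, 2 features), and `softmax` and `predict_proba` are hypothetical helper names; in the fitted Spark model the corresponding quantities are `coefficientMatrix` and `interceptVector`.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, intercepts, features):
    """One linear score per class (w . x + b), followed by softmax."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, intercepts)
    ]
    return softmax(scores)

# Made-up coefficients: one row per class, one column per feature
weights = [[0.5, -0.2], [-0.3, 0.1], [0.2, 0.4], [-0.4, -0.3]]
intercepts = [0.1, 0.0, -0.1, 0.0]

probs = predict_proba(weights, intercepts, [1.0, 2.0])
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```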


## Model Training

### Train/Test Split data  

In [None]:
# Split training and test data
training, test = scaler_df.randomSplit([0.8, 0.2])
print ("Training instances", training.count(), "Test instances", test.count())

### Model Training phase

In [None]:
lr = LogisticRegression(
    family="multinomial", 
    featuresCol = 'features', 
    labelCol = 'label',
    maxIter=200,
    elasticNetParam=1.0, 
    tol=1e-6, 
    standardization=True,
    fitIntercept=True
)
    
# Fit the model
lrModel = lr.fit(training)

# Print the coefficients and intercept for multinomial logistic regression
print("Coefficients: \n" + str(lrModel.coefficientMatrix))
print("Intercept: " + str(lrModel.interceptVector))

trainingSummary = lrModel.summary

# for multiclass, we can inspect metrics on a per-label basis
print("False positive rate by label:")
for i, rate in enumerate(trainingSummary.falsePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print("True positive rate by label:")
for i, rate in enumerate(trainingSummary.truePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print("Precision by label:")
for i, prec in enumerate(trainingSummary.precisionByLabel):
    print("label %d: %s" % (i, prec))

print("Recall by label:")
for i, rec in enumerate(trainingSummary.recallByLabel):
    print("label %d: %s" % (i, rec))

print("F-measure by label:")
for i, f in enumerate(trainingSummary.fMeasureByLabel()):
    print("label %d: %s" % (i, f))

accuracy = trainingSummary.accuracy
falsePositiveRate = trainingSummary.weightedFalsePositiveRate
truePositiveRate = trainingSummary.weightedTruePositiveRate
fMeasure = trainingSummary.weightedFMeasure()
precision = trainingSummary.weightedPrecision
recall = trainingSummary.weightedRecall
print("Accuracy: %s\nFPR: %s\nTPR: %s\nF-measure: %s\nPrecision: %s\nRecall: %s"
      % (accuracy, falsePositiveRate, truePositiveRate, fMeasure, precision, recall))

## Model Evaluation

In [None]:
predictions = lrModel.transform(test)

### Confusion Matrix

In [None]:
class_names = [0, 1, 2, 3]

import itertools
import numpy as np  # used inside plot_confusion_matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
from sklearn.metrics import confusion_matrix
import numpy as np

y_true = predictions.select("label")
y_true = y_true.toPandas()

y_pred = predictions.select("prediction")
y_pred = y_pred.toPandas()

cnf_matrix = confusion_matrix(
    y_true,
    y_pred,
    labels=class_names
)

plt.figure()
plot_confusion_matrix(
    cnf_matrix,
    classes=class_names,
    title='Confusion matrix'
)

plt.show()
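
Per-class precision and recall can be read directly off a confusion matrix whose rows are true labels and whose columns are predicted labels, which is the layout `confusion_matrix` returns. The sketch below does this in plain Python on a small made-up matrix; `per_class_metrics` is a hypothetical helper and the numbers are not results from this notebook.

```python
def per_class_metrics(cm):
    """Return {class_index: (precision, recall)} for a square confusion matrix.

    Rows are true labels, columns are predicted labels.
    """
    n = len(cm)
    metrics = {}
    for k in range(n):
        true_positives = cm[k][k]
        predicted_as_k = sum(cm[i][k] for i in range(n))  # column sum
        actually_k = sum(cm[k])                           # row sum
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        metrics[k] = (precision, recall)
    return metrics

# Made-up 3-class matrix for illustration only
toy_cm = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
metrics = per_class_metrics(toy_cm)
for label, (precision, recall) in metrics.items():
    print(f"label {label}: precision={precision:.3f} recall={recall:.3f}")
```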

## Prediction

In [None]:
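from pyspark.ml.feature import IndexToString

# 'prediction' holds StringIndexer indices (0-3), not raw quality scores, so
# map it back to label names with the fitted indexer instead of the bucketing UDF
index_to_label = IndexToString(
    inputCol='prediction',
    outputCol='prediction_label',
    labels=label_indexer_model.labels
)
predictions = index_to_label.transform(predictions)

In [None]:
predictions.select(
    'label',
    'features',
    'rawPrediction',
    'prediction',
    'prediction_label',
    'probability'
).toPandas()\
.head(10)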