Extra material: Wine quality prediction with \(L_1\) regression#

Preamble: Install Pyomo and a solver#

The following cell sets up and verifies a global solver for the notebook. When run on Google Colab, it installs Pyomo and the HiGHS solver; when run elsewhere, it assumes both have already been installed. It then selects HiGHS through the appsi module, checks that the solver is available, and stores the solver interface in a global object SOLVER for later use.

import sys
 
if 'google.colab' in sys.modules:
    %pip install pyomo >/dev/null 2>/dev/null
    %pip install highspy >/dev/null 2>/dev/null
 
solver = 'appsi_highs'
 
import pyomo.environ as pyo
SOLVER = pyo.SolverFactory(solver)

assert SOLVER.available(), f"Solver {solver} is not available."

Problem description#

Regression analysis aims to fit a predictive model to a dataset; when executed successfully, the model can generate valuable forecasts for new data points. This notebook demonstrates how linear programming, in the form of Least Absolute Deviation (LAD) regression, can construct a linear model to predict wine quality from physicochemical attributes. The example uses a well-known dataset from the machine learning community.

In a 2009 article, Cortez et al. gathered a comprehensive set of physical, chemical, and sensory quality metrics for an extensive range of red and white wines produced in Portugal. The dataset was subsequently contributed to the UCI Machine Learning Repository.

The next code cell downloads the red wine data directly from this repository.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

wines = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
    sep=";",
)
display(wines)
|      | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|------|---------------|------------------|-------------|----------------|-----------|---------------------|----------------------|---------|------|-----------|---------|---------|
| 0    | 7.4  | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4  | 5 |
| 1    | 7.8  | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8  | 5 |
| 2    | 7.8  | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8  | 5 |
| 3    | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8  | 6 |
| 4    | 7.4  | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4  | 5 |
| ...  | ...  | ...   | ...  | ... | ...   | ...  | ...  | ...     | ...  | ...  | ...  | ... |
| 1594 | 6.2  | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
| 1595 | 5.9  | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
| 1596 | 6.3  | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
| 1597 | 5.9  | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |
| 1598 | 6.0  | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 6 |

1599 rows × 12 columns
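
Before fitting anything, a quick numerical summary reveals the very different scales of the features, which will matter later when interpreting regression coefficients. A minimal sketch using pandas:

# summary statistics (count, mean, std, quartiles) for each of the 12 columns
wines.describe().transpose()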

Mean Absolute Deviation (MAD)#

Given \(n\) repeated observations of a response variable \(y_i\) (in this case, the wine quality), the mean absolute deviation (MAD) of \(y_i\) from the mean value \(\bar{y}\) is

\[\text{MAD}\,(y) = \frac{1}{n} \sum_{i=1}^n | y_i - \bar{y}|\]
def MAD(df):
    return (df - df.mean()).abs().mean()


print(f'MAD = {MAD(wines["quality"]):.5f}')

plt.rcParams["font.size"] = 14
colors = plt.get_cmap("tab20c")
fig, ax = plt.subplots(figsize=(12, 7))
ax = wines["quality"].plot(alpha=0.7, color=colors(0.0))
ax.axhline(wines["quality"].mean(), color=colors(0.2), ls="--", lw=3)

mad = MAD(wines["quality"])
ax.axhline(wines["quality"].mean() + mad, color=colors(0.4), lw=3)
ax.axhline(wines["quality"].mean() - mad, color=colors(0.4), lw=3)
ax.legend(["data", "mean", "MAD"])
ax.set_xlabel("Observations")
ax.set_ylabel("Quality")
plt.tight_layout()
plt.show()
MAD = 0.68318
[Figure: quality observations plotted per sample, with the mean (dashed line) and mean ± MAD bands]

A preliminary look at the data#

The data consists of 1,599 measurements of eleven physical and chemical characteristics plus an integer measure of sensory quality recorded on a scale from 3 to 8. Histograms provide insight into the values and variability of the dataset.

plt.rcParams["font.size"] = 13
colors = plt.get_cmap("tab20c")
fig, axes = plt.subplots(3, 4, figsize=(12, 8), sharey=True)
for ax, column in zip(axes.flatten(), wines.columns):
    wines[column].hist(ax=ax, bins=30, color=colors(0.0), alpha=0.7)
    ax.axvline(wines[column].mean(), color=colors(0.2), label="mean")
    ax.set_title(column)
    ax.legend()

plt.tight_layout()
[Figure: grid of histograms, one per column, each with a vertical line marking the column mean]

Which features influence reported wine quality?#

The art of regression is to identify the features that have explanatory value for a response of interest. This is where a person with deep knowledge of an application area, in this case an experienced oenologist, has a head start over the naive data scientist. In the absence of such expertise, we proceed by examining the correlations among the variables in the dataset.

_ = wines.corr()["quality"].plot(kind="bar", grid=True, color=colors(0.0), alpha=0.7)
plt.tight_layout()
[Figure: bar chart of the correlation of each feature with quality]
wines[["volatile acidity", "density", "alcohol", "quality"]].corr()
|                  | volatile acidity | density   | alcohol   | quality   |
|------------------|------------------|-----------|-----------|-----------|
| volatile acidity | 1.000000         | 0.022026  | -0.202288 | -0.390558 |
| density          | 0.022026         | 1.000000  | -0.496180 | -0.174919 |
| alcohol          | -0.202288        | -0.496180 | 1.000000  | 0.476166  |
| quality          | -0.390558        | -0.174919 | 0.476166  | 1.000000  |

Collectively, these figures suggest that alcohol is the strongest single correlate of quality and flag several additional features as candidates for explanatory variables.
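
To make this concrete, the features can be ranked by the absolute value of their correlation with quality; a minimal sketch using pandas, dropping the trivial self-correlation:

# rank features by the strength (absolute value) of their correlation with quality
print(wines.corr()["quality"].drop("quality").abs().sort_values(ascending=False))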

LAD line fitting to identify features#

An alternative approach is to perform a series of single-feature LAD regressions to determine which features have the largest impact on reducing the mean absolute deviation of the residuals.

\[ \begin{align*} \min_{a,\, b} \ \frac{1}{n} \sum_{i=1}^n \left| y_i - a x_i - b \right| \end{align*} \]

This computation has been presented in a prior notebook.

def lad_fit_1(df, y_col, x_col):
    m = pyo.ConcreteModel("LAD Regression Model")

    N = len(df)
    m.I = pyo.RangeSet(N)

    @m.Param(m.I)
    def y(m, i):
        return df.loc[i - 1, y_col]

    @m.Param(m.I)
    def X(m, i):
        return df.loc[i - 1, x_col]

    # slope and intercept of the fitted line
    m.a = pyo.Var()
    m.b = pyo.Var(domain=pyo.Reals)

    # nonnegative variables for the positive and negative parts of each residual
    m.e_pos = pyo.Var(m.I, domain=pyo.NonNegativeReals)
    m.e_neg = pyo.Var(m.I, domain=pyo.NonNegativeReals)

    @m.Expression(m.I)
    def prediction(m, i):
        return m.a * m.X[i] + m.b

    # residual = prediction - observation, split into positive and negative parts
    @m.Constraint(m.I)
    def prediction_error(m, i):
        return m.e_pos[i] - m.e_neg[i] == m.prediction[i] - m.y[i]

    # minimize the mean absolute deviation: |residual| = e_pos + e_neg at optimality
    @m.Objective(sense=pyo.minimize)
    def MAD(m):
        return sum(m.e_pos[i] + m.e_neg[i] for i in m.I) / N

    return m


m = lad_fit_1(wines, "quality", "alcohol")
SOLVER.solve(m)
print(f"The mean absolute deviation for a single-feature regression is {m.MAD():0.5f}")
The mean absolute deviation for a single-feature regression is 0.54117
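
As a cross-check, LAD regression coincides with median (0.5-quantile) regression, so the same fit can be reproduced without a linear programming solver. A minimal sketch, assuming the statsmodels package is available:

# cross-check: median regression (q = 0.5) minimizes the same L1 objective
import statsmodels.api as sm

X = sm.add_constant(wines["alcohol"])
res = sm.QuantReg(wines["quality"], X).fit(q=0.5)
print(f"statsmodels MAD = {np.abs(res.resid).mean():.5f}")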

This calculation is repeated for each of the eleven candidate features (excluding quality itself, which would fit trivially) to determine which are the best candidates to explain deviations in wine quality.

variables = {}
# exclude "quality" itself, which would fit with zero error
for i in wines.columns.drop("quality"):
    m = lad_fit_1(wines, "quality", i)
    SOLVER.solve(m)
    variables[i] = m.MAD()

# benchmark: the MAD of quality about its own mean, computed earlier
mad = MAD(wines["quality"])
colors = plt.get_cmap("tab20c")
fig, ax = plt.subplots()
pd.Series(variables).plot(kind="bar", ax=ax, grid=True, alpha=0.7, color=colors(0.0))
ax.axhline(mad, color=colors(0.4), lw=3)
ax.set_title("MADs for single-feature regressions")
plt.tight_layout()
plt.show()
[Figure: bar chart of MADs for the single-feature regressions, with a horizontal line marking the MAD(y) benchmark]
wines["prediction"] = [m.prediction[i]() for i in m.I]

fig1, ax1 = plt.subplots(figsize=(6, 5))
wines["quality"].hist(label="data", alpha=0.7, color=plt.get_cmap("tab20c")(0))
ax1.set_xlabel("Quality")
ax1.set_ylabel("Counts")
plt.tight_layout()
plt.show()

fig2, ax2 = plt.subplots(figsize=(6, 6))
wines.plot(
    x="quality",
    y="prediction",
    kind="scatter",
    alpha=0.7,
    color=plt.get_cmap("tab20c")(0),
    ax=ax2,
)
ax2.set_xlabel("Quality")
ax2.set_ylabel("Prediction")
ax2.set_aspect("equal", "box")
min_val = min(ax2.get_xlim()[0], ax2.get_ylim()[0])
max_val = max(ax2.get_xlim()[1], ax2.get_ylim()[1])
ax2.set_xlim(min_val, max_val)
ax2.set_ylim(min_val, max_val)
plt.tight_layout()
plt.show()
[Figures: histogram of quality counts, and scatter plot of prediction versus quality for the single-feature fit]

Multivariate \(L_1\)-regression#

Let us now perform a full multivariate \(L_1\)-regression on the wine dataset to predict the wine quality \(y\) using the provided wine features. We aim to find the coefficients \(m_j\) and the intercept \(b\) that minimize the mean absolute deviation (MAD) by solving the following problem:

\[\begin{split} \begin{align*} \text{MAD}\,(\hat{y}) = \min_{m, \, b} \quad & \frac{1}{n} \sum_{i=1}^n | y_i - \hat{y}_i| \\ \\ \text{s. t.}\quad & \hat{y}_i = \sum_{j=1}^J x_{i, j} m_j + b & \forall i = 1, \dots, n, \end{align*} \end{split}\]

where \(x_{i, j}\) is the value of the \(j\)-th explanatory variable, i.e., one of the 11 physical and chemical characteristics, for the \(i\)-th wine. After linearizing the absolute values in the objective with pairs of nonnegative variables, this can be implemented in Pyomo as a linear optimization problem as follows:

def l1_fit(df, y_col, x_cols):
    m = pyo.ConcreteModel("L1 Regression Model")

    N = len(df)
    m.I = pyo.RangeSet(N)
    m.J = pyo.Set(initialize=x_cols)

    @m.Param(m.I)
    def y(m, i):
        return df.loc[i - 1, y_col]

    @m.Param(m.I, m.J)
    def X(m, i, j):
        return df.loc[i - 1, j]

    # regression coefficients (one per feature) and intercept
    m.a = pyo.Var(m.J)
    m.b = pyo.Var(domain=pyo.Reals)

    # nonnegative variables for the positive and negative parts of each residual
    m.e_pos = pyo.Var(m.I, domain=pyo.NonNegativeReals)
    m.e_neg = pyo.Var(m.I, domain=pyo.NonNegativeReals)

    @m.Expression(m.I)
    def prediction(m, i):
        return sum(m.a[j] * m.X[i, j] for j in m.J) + m.b

    # residual = prediction - observation, split into positive and negative parts
    @m.Constraint(m.I)
    def prediction_error(m, i):
        return m.e_pos[i] - m.e_neg[i] == m.prediction[i] - m.y[i]

    # minimize the mean absolute deviation
    @m.Objective(sense=pyo.minimize)
    def MAD(m):
        return sum(m.e_pos[i] + m.e_neg[i] for i in m.I) / N

    return m


m = l1_fit(
    wines,
    "quality",
    [
        "alcohol",
        "volatile acidity",
        "citric acid",
        "sulphates",
        "total sulfur dioxide",
        "density",
        "fixed acidity",
    ],
)
SOLVER.solve(m)
print(f"MAD = {m.MAD():0.5f}\n")

for j in m.J:
    print(f"{j: <20}  {pyo.value(m.a[j]):9.5f}")

wines["prediction"] = [pyo.value(m.prediction[i]) for i in m.I]

fig1, ax1 = plt.subplots(figsize=(7, 7))
wines["quality"].hist(label="data", alpha=0.7, color=plt.get_cmap("tab20c")(0))
ax1.set_xlabel("Quality")
ax1.set_ylabel("Counts")
plt.tight_layout()
plt.show()

fig2, ax2 = plt.subplots(figsize=(7, 7))
wines.plot(
    x="quality",
    y="prediction",
    kind="scatter",
    alpha=0.7,
    color=plt.get_cmap("tab20c")(0),
    ax=ax2,
)
ax2.set_xlabel("Quality")
ax2.set_ylabel("Prediction")
ax2.set_aspect("equal", "box")
min_val = min(ax2.get_xlim()[0], ax2.get_ylim()[0])
max_val = max(ax2.get_xlim()[1], ax2.get_ylim()[1])
ax2.set_xlim(min_val, max_val)
ax2.set_ylim(min_val, max_val)
plt.tight_layout()
plt.show()
MAD = 0.49980

alcohol                 0.34242
volatile acidity       -0.98062
citric acid            -0.28928
sulphates               0.90609
total sulfur dioxide   -0.00219
density               -18.50083
fixed acidity           0.06382
[Figures: histogram of quality counts, and scatter plot of prediction versus quality for the multivariate fit]
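
The large magnitude of the density coefficient mostly reflects the narrow numerical range of that feature rather than outsized importance. A minimal sketch, standardizing each feature to zero mean and unit standard deviation before refitting, makes the coefficients directly comparable; since standardization is an invertible affine change of variables, the optimal MAD is unchanged:

# rescaling experiment: standardize the features, then refit the same model
features = list(m.J)
scaled = wines.copy()
scaled[features] = (scaled[features] - scaled[features].mean()) / scaled[features].std()

m_scaled = l1_fit(scaled, "quality", features)
SOLVER.solve(m_scaled)
for j in m_scaled.J:
    print(f"{j: <20}  {pyo.value(m_scaled.a[j]):9.5f}")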

How do these models perform?#

A successful regression model should demonstrate a substantial reduction from \(\text{MAD}\,(y)\) to \(\text{MAD}\,(\hat{y})\); the value of \(\text{MAD}\,(y)\) sets the benchmark for the regression. The linear models clearly have some ability to explain the observed deviations in wine quality. Tabulating the results using the MAD statistic, we find:

| Regressors               | MAD   |
|--------------------------|-------|
| none                     | 0.683 |
| alcohol only             | 0.541 |
| seven selected features  | 0.500 |
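
These MAD values are computed in-sample, on the same data used for fitting. A minimal sketch of an out-of-sample check, under the assumption that a simple random 80/20 split is representative (the split fraction and random_state are arbitrary choices):

# holdout check: fit on 80% of the data, evaluate MAD on the remaining 20%
features = ["alcohol", "volatile acidity", "citric acid", "sulphates",
            "total sulfur dioxide", "density", "fixed acidity"]

train = wines.sample(frac=0.8, random_state=1)
test = wines.drop(train.index)
train = train.reset_index(drop=True)  # l1_fit expects a 0..N-1 index

m_train = l1_fit(train, "quality", features)
SOLVER.solve(m_train)

# apply the fitted coefficients and intercept to the held-out wines
coef = pd.Series({j: pyo.value(m_train.a[j]) for j in m_train.J})
pred = test[features] @ coef + pyo.value(m_train.b)
print(f"out-of-sample MAD = {(test['quality'] - pred).abs().mean():.5f}")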

Are these models good enough to replace human judgment of wine quality? The reader can be the judge.