🧠 AI with Python – 🍷 Wine Quality Prediction (RandomForest + SHAP)
Posted on: January 29, 2026
Description:
Predicting wine quality is a classic real-world machine learning problem that combines tabular data modeling with model interpretability.
Wine quality depends on multiple physicochemical properties, and understanding how these factors influence quality scores is just as important as making accurate predictions.
In this project, we build a Random Forest regression model to predict wine quality and then use SHAP (SHapley Additive exPlanations) to explain why the model makes certain predictions.
Understanding the Problem
The wine quality dataset contains laboratory measurements such as acidity, sugar, sulphates, and alcohol content.
The target variable, quality, is a numeric score assigned by human tasters.
This makes the task a regression problem, with additional challenges:
- features interact non-linearly
- multiple features contribute simultaneously
- predictions must be explainable, not just accurate
1. Loading the Wine Quality Dataset
We begin by loading the wine quality dataset in CSV format.
import pandas as pd
df = pd.read_csv("winequality-red.csv", sep=";")
df.head()
Each row represents a wine sample with measured chemical properties and a corresponding quality score.
2. Exploring the Target Variable
Before modeling, it’s useful to understand how quality scores are distributed.
print(df["quality"].value_counts().sort_index())
Wine quality typically ranges between 3 and 8, with most samples clustered around mid-range values.
3. Preparing Features and Target
We separate input features from the target variable.
X = df.drop("quality", axis=1)
y = df["quality"]
All features are numeric, making them well-suited for tree-based models.
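A quick sanity check along these lines can confirm that no column needs encoding before training. The tiny frame below is a made-up stand-in for the wine dataframe, since the full CSV is not bundled here:

```python
import pandas as pd

# Stand-in for the wine dataframe: a few rows with the same kind of columns
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8],
    "alcohol": [9.4, 9.8],
    "quality": [5, 5],
})

# select_dtypes keeps only numeric columns; if the counts match,
# every feature can go straight into a tree-based model
numeric_cols = df.select_dtypes(include="number").columns
print(len(numeric_cols) == df.shape[1])  # → True
```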
4. Train/Test Split
We split the dataset to evaluate how well the model generalizes to unseen samples.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42
)
5. Training a Random Forest Regressor
Wine quality depends on complex, non-linear interactions between features.
Random Forest models are well-suited for capturing such patterns.
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(
    n_estimators=300,
    max_depth=12,
    random_state=42
)
model.fit(X_train, y_train)
Random Forests also provide strong performance without extensive feature engineering.
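Before reaching for SHAP, it is worth noting that Random Forests already expose impurity-based importances via `feature_importances_`. A minimal sketch on synthetic data (the shapes, coefficients, and noise level here are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
# Target depends strongly on feature 0, weakly on feature 1, not at all on feature 2
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1;
# feature 0 should dominate the ranking
print(rf.feature_importances_.round(3))
print(rf.feature_importances_.argmax())  # → 0
```

These built-in importances are computed from impurity reductions on the training data and can be biased toward high-cardinality features, which is one motivation for the SHAP analysis later in this post.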
6. Evaluating Model Performance
We evaluate predictions using regression metrics.
from sklearn.metrics import mean_absolute_error, r2_score
y_pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
- MAE shows the average error in predicted quality score
- R² indicates how much variance in quality is explained by the model
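To make both metrics concrete, here is a tiny hand-checkable example with made-up quality scores:

```python
from sklearn.metrics import mean_absolute_error, r2_score

y_true = [3, 4, 5]       # actual quality scores (invented)
y_hat = [3.5, 4, 4.5]    # hypothetical predictions

# MAE: mean of |error| = (0.5 + 0 + 0.5) / 3 ≈ 0.333
print(round(mean_absolute_error(y_true, y_hat), 3))  # → 0.333

# R²: 1 - SS_res / SS_tot = 1 - 0.5 / 2.0 = 0.75
print(r2_score(y_true, y_hat))  # → 0.75
```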
7. Explaining Predictions with SHAP
High-performing models are not enough — we must also understand their decisions.
SHAP provides a unified framework to explain model predictions.
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
SHAP assigns each feature a contribution value for every prediction.
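The defining property of these contributions is additivity: for each sample, the expected model output plus the per-feature contributions sums to the actual prediction. For a toy linear model, exact Shapley values have a closed form (the weight times the feature's deviation from its mean), which makes the property easy to verify by hand. The model and numbers below are invented for illustration:

```python
import numpy as np

# Toy linear model f(x) = 2*x1 + 3*x2, with a small "background" dataset
w = np.array([2.0, 3.0])
X_bg = np.array([[1.0, 0.0],
                 [3.0, 2.0],
                 [2.0, 4.0]])

x = np.array([4.0, 1.0])           # the sample to explain
base_value = (X_bg @ w).mean()     # expected model output over the background

# For a linear model with independent features, the exact Shapley value
# of feature i is w_i * (x_i - mean_i)
phi = w * (x - X_bg.mean(axis=0))

# Additivity: base value + contributions recovers the prediction f(x)
print(base_value + phi.sum() == x @ w)  # → True
```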
8. Global Feature Importance with SHAP
To understand which features influence wine quality the most, we use a SHAP summary plot.
shap.summary_plot(shap_values, X_test)
This visualization reveals:
- which features matter most overall
- whether higher or lower values increase predicted quality
Alcohol content and sulphates often emerge as strong predictors.
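The ranking that the summary plot visualizes can also be computed numerically as the mean absolute SHAP value per feature. The array below is a made-up stand-in for `shap_values` with three hypothetical features:

```python
import numpy as np

# Hypothetical SHAP values: rows are samples, columns are features
shap_vals = np.array([[ 0.40, -0.05,  0.10],
                      [-0.30,  0.02,  0.15],
                      [ 0.50, -0.01, -0.20]])
feature_names = ["alcohol", "pH", "sulphates"]

# Global importance: mean absolute contribution per feature,
# sorted from most to least influential
importance = np.abs(shap_vals).mean(axis=0)
ranking = [feature_names[i] for i in np.argsort(importance)[::-1]]
print(ranking)  # → ['alcohol', 'sulphates', 'pH']
```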
Key Takeaways
- Wine quality prediction is a strong real-world regression use case.
- Random Forest models handle non-linear feature interactions effectively.
- MAE and R² are essential metrics for evaluating regression performance.
- SHAP explains both global and individual model decisions.
- Combining strong performance with interpretability brings models much closer to production readiness.
Conclusion
Wine quality prediction demonstrates how machine learning can deliver both accurate predictions and transparent explanations.
By pairing a powerful ensemble model with SHAP-based interpretability, we gain insight into not just what the model predicts, but why it predicts it.
This project highlights an end-to-end workflow for explainable machine learning on tabular data, making it a valuable addition to the AI with Python – Real-World Mini Projects (Advanced) series.
Code Snippet:
# 📦 Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
import shap
# 🧩 Load the Wine Quality Dataset
df = pd.read_csv("winequality-red.csv", sep=";")
print("Dataset Shape:", df.shape)
print("\nQuality Distribution:")
print(df["quality"].value_counts().sort_index())
# 🔍 Separate Features and Target
X = df.drop("quality", axis=1)
y = df["quality"]
# ✂️ Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42
)
# 🌲 Train Random Forest Regressor
model = RandomForestRegressor(
    n_estimators=300,
    max_depth=12,
    random_state=42,
    n_jobs=-1
)
model.fit(X_train, y_train)
# 📊 Evaluate Model Performance
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("\nModel Performance:")
print("MAE:", round(mae, 3))
print("R² Score:", round(r2, 3))
# 📈 Actual vs Predicted Visualization
plt.figure(figsize=(6, 5))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.xlabel("Actual Wine Quality")
plt.ylabel("Predicted Wine Quality")
plt.title("Wine Quality: Actual vs Predicted")
plt.grid(True)
plt.show()
# 🔎 SHAP Explainability
shap.initjs()
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# 📊 SHAP Summary Plot (Global Feature Importance)
shap.summary_plot(shap_values, X_test)
# 🔍 SHAP Explanation for a Single Prediction (Optional)
sample_index = 0
print("\nExplaining prediction for sample index:", sample_index)
print("Actual Quality:", y_test.iloc[sample_index])
print("Predicted Quality:", round(y_pred[sample_index], 2))
shap.force_plot(
    explainer.expected_value,
    shap_values[sample_index],
    X_test.iloc[sample_index],
    matplotlib=True
)