⚡️ Saturday ML Spark – 🧪 A/B Testing for ML Models
Posted on: June 20, 2026
Description:
Building a better machine learning model is only part of the challenge. The real question is: How do we know the new model is actually better in production?
A model that performs well during development may not necessarily perform better when exposed to real users, real traffic, and real-world data. This is why machine learning teams use A/B testing before replacing production models.
In this project, we explore how A/B testing helps compare machine learning models objectively and reduce deployment risk.
Why A/B Testing Is Necessary
Imagine you currently have a model running in production, generating predictions and supporting business decisions every day. Your team develops a new model that, during offline evaluation, appears to have:
- higher accuracy
- better recall
- a stronger F1-score
The obvious question becomes: Should we replace the current model immediately?
The answer is usually no. Offline gains do not always carry over to live traffic, so production changes should be backed by evidence from a controlled comparison.
What Is A/B Testing?
A/B testing is a controlled experiment where two versions of a system are compared.
For machine learning:
Model A
Current production model.
Model B
New candidate model.
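Both models are scored on the same held-out data (offline) or on comparable, randomly assigned slices of live traffic (in production), so that differences in the metrics reflect the models rather than the inputs.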
The Basic Workflow
Current Production Model (A)
↓
Candidate Model (B)
↓
Compare Metrics
↓
Select Winner
The goal is to determine whether the candidate model delivers meaningful improvement.
Training the Baseline Model
The first model represents the current production system.
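from sklearn.linear_model import LogisticRegression

# max_iter matches the full listing; the default limit may not converge on unscaled features
model_a = LogisticRegression(max_iter=5000)
model_a.fit(X_train, y_train)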
This establishes the baseline performance.
Training the Candidate Model
The second model represents the proposed replacement.
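from sklearn.ensemble import RandomForestClassifier

# A fixed random_state keeps the comparison reproducible between runs
model_b = RandomForestClassifier(n_estimators=200, random_state=42)
model_b.fit(X_train, y_train)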
The candidate may use:
- a different algorithm
- new features
- improved training data
- updated hyperparameters
Generating Predictions
Both models are evaluated using the same test dataset.
pred_a = model_a.predict(X_test)
pred_b = model_b.predict(X_test)
Using identical data ensures the comparison remains fair.
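Because both models score the same test rows, their mistakes can also be compared row by row. The sketch below shows one way to do that with an exact McNemar test, which asks whether the rows where only one model is correct favour one side more than chance would explain. This test is not part of the original walkthrough; the function name `mcnemar_exact` and the toy labels are illustrative assumptions.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (a_only, b_only, p_value), where a_only counts rows that only
    model A classified correctly and b_only counts rows only model B did.
    """
    a_only = b_only = 0
    for truth, a, b in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = (a == truth), (b == truth)
        if a_ok and not b_ok:
            a_only += 1
        elif b_ok and not a_ok:
            b_only += 1

    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0  # the models never disagree

    k = min(a_only, b_only)
    # Under the null hypothesis, each disagreement favours A or B with p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Toy labels for illustration only (not results from the article's dataset).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
pred_a = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
pred_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

a_only, b_only, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(a_only, b_only, round(p_value, 4))
```

With only a handful of disagreements the p-value stays large, which is a reminder that a small test set can rarely separate two similar models.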
Comparing Performance Metrics
Multiple metrics should be evaluated.
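from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Score both models against the same labels with each metric
for name, pred in [("Model A", pred_a), ("Model B", pred_b)]:
    print(name, "accuracy:", accuracy_score(y_test, pred))
    print(name, "precision:", precision_score(y_test, pred))
    print(name, "recall:", recall_score(y_test, pred))
    print(name, "F1:", f1_score(y_test, pred))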
Each metric highlights different aspects of model performance.
Why Accuracy Alone Is Not Enough
A model can achieve excellent accuracy while still making costly mistakes.
Consider:
- fraud detection
- medical diagnosis
- credit risk assessment
In these systems, precision and recall may matter more than overall accuracy. This is why A/B testing typically evaluates multiple metrics simultaneously.
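The gap between accuracy and recall is easiest to see on imbalanced data. The following is a minimal sketch with synthetic labels (10 positives in 1,000 rows, chosen purely for illustration) and a hypothetical "model" that never predicts the positive class. The helper names `confusion_counts` and `summarize` are assumptions, not part of the article's code.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Compute accuracy, precision and recall, using 0.0 when undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Synthetic, illustrative labels: 10 fraud cases among 1,000 transactions.
y_true = [1] * 10 + [0] * 990
always_legit = [0] * 1000  # a "model" that never flags fraud

report = summarize(y_true, always_legit)
print(report)
```

Accuracy comes out at 99% here even though every fraud case is missed, so recall is zero.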
Visualising Results
Performance comparisons become easier when visualised.
results.set_index("Model").plot(kind="bar")
Visualisation helps identify:
- metric improvements
- trade-offs
- regression risks
before deployment decisions are made.
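Alongside a chart, a small per-metric delta table can make trade-offs and regressions explicit. This is a sketch only: `metric_deltas`, the tolerance value, and the placeholder scores are all assumptions for illustration and should be replaced with the values from your own results table.

```python
def metric_deltas(baseline, candidate, tolerance=0.0):
    """Label each metric as improved, regressed or unchanged.

    A change counts only if it exceeds `tolerance` in absolute value.
    """
    rows = []
    for name in baseline:
        delta = candidate[name] - baseline[name]
        if delta > tolerance:
            status = "improved"
        elif delta < -tolerance:
            status = "regressed"
        else:
            status = "unchanged"
        rows.append((name, baseline[name], candidate[name], delta, status))
    return rows


# Placeholder scores for illustration only; they are not measured results.
scores_a = {"Accuracy": 0.95, "Precision": 0.96, "Recall": 0.94, "F1 Score": 0.95}
scores_b = {"Accuracy": 0.96, "Precision": 0.93, "Recall": 0.99, "F1 Score": 0.96}

deltas = metric_deltas(scores_a, scores_b, tolerance=0.005)
for name, a, b, delta, status in deltas:
    print(f"{name:<10} A={a:.3f}  B={b:.3f}  delta={delta:+.3f}  {status}")
```

In this made-up case the candidate gains recall at the cost of precision, which is exactly the kind of trade-off a single headline metric can hide.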
Choosing the Winning Model
A simple approach is selecting the model with the highest F1-score.
winner = results.loc[
    results["F1 Score"].idxmax()
]
However, real-world systems may also consider:
- latency
- infrastructure cost
- explainability
- business KPIs
before declaring a winner.
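These extra considerations can be folded into an explicit promotion rule. The sketch below gates the candidate on a minimum F1 gain and a latency budget; the class `ModelStats`, the function `choose_winner`, both thresholds, and all numbers are hypothetical and would need to be set by the team running the experiment.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    name: str
    f1: float
    p95_latency_ms: float


def choose_winner(current, candidate, min_f1_gain=0.01, latency_budget_ms=100.0):
    """Promote the candidate only if it clears both the quality and latency gates.

    Returns the winning model's name and the list of reasons the candidate
    was rejected (empty when the candidate wins).
    """
    reasons = []
    if candidate.f1 - current.f1 < min_f1_gain:
        reasons.append("F1 gain below threshold")
    if candidate.p95_latency_ms > latency_budget_ms:
        reasons.append("latency over budget")
    winner = current if reasons else candidate
    return winner.name, reasons


# Hypothetical numbers for illustration only.
stats_a = ModelStats("Model A", f1=0.950, p95_latency_ms=12.0)
stats_b = ModelStats("Model B", f1=0.965, p95_latency_ms=140.0)

decision = choose_winner(stats_a, stats_b)
print(decision)
```

Returning the rejection reasons alongside the winner keeps the decision auditable, which helps when the outcome has to be explained to stakeholders.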
How A/B Testing Works in Production
Offline testing is only the first step.
In production, traffic is typically split between the two models, for example:
90% → Current Model
10% → Candidate Model
or
50% → Model A
50% → Model B
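One common way to implement such a split is to hash a stable identifier, so that each user consistently sees the same model across requests. This is a minimal sketch, not a description of any particular serving platform: the function `assign_variant`, the salt string, and the `user-<n>` identifiers are assumptions for illustration.

```python
import hashlib


def assign_variant(user_id, candidate_share=0.10, salt="ab-test-v1"):
    """Deterministically map a user to "A" (current) or "B" (candidate)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a uniform value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < candidate_share else "A"


# Simulate 10,000 hypothetical users and check the realised split.
assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
share_b = assignments.count("B") / len(assignments)
print(f"Share routed to B: {share_b:.3f}")
```

Changing the salt reshuffles the assignment, which is useful when starting a new experiment so that the same users are not always in the candidate group.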
Performance is then monitored using real user interactions. This provides stronger evidence than offline evaluation alone.
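Once live outcomes are collected for each group, a significance test helps separate a real difference from noise. The sketch below uses a two-sided two-proportion z-test on success counts (for example, clicks or correct decisions). It is one reasonable choice rather than the only one, and the function name and all counts are hypothetical.

```python
from math import erf, sqrt


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two success rates.

    Returns (z, p_value); z is positive when group B has the higher rate.
    """
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # both groups are all-success or all-failure
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical counts for illustration only: a 90/10 split of 10,000 requests.
z, p_value = two_proportion_z_test(success_a=900, n_a=9000, success_b=120, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The smaller the candidate's share of traffic, the longer the experiment needs to run before a difference of a given size becomes detectable.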
Benefits of A/B Testing
A/B testing helps teams:
- reduce deployment risk
- validate improvements
- compare models fairly
- avoid regressions
- make data-driven decisions
It is a widely used practice for validating model changes before a full rollout.
Where A/B Testing Is Used
A/B testing is common in:
- recommendation systems
- search ranking models
- advertising platforms
- fraud detection systems
- personalization engines
Any system that evolves continuously can benefit from controlled experimentation.
Key Takeaways
- A/B testing compares production and candidate models objectively.
- Multiple metrics should be evaluated, not accuracy alone.
- Controlled experiments reduce deployment risk.
- F1-score combines precision and recall into a single number, which makes it a convenient, though incomplete, comparison metric.
- A/B testing is a core practice in production ML systems.
Conclusion
Deploying a new machine learning model should never be based solely on offline evaluation results. A/B testing provides a structured way to compare model versions, validate improvements, and minimize risk before full deployment. By relying on measurable outcomes rather than assumptions, teams can confidently evolve their ML systems over time.
This strengthens the ML Systems track in Saturday ML Spark ⚡️, focusing on practical deployment and experimentation strategies used in real-world machine learning systems.
Code Snippet:
# Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
accuracy_score,
precision_score,
recall_score,
f1_score,
confusion_matrix
)
# =========================================================
# Load Dataset
# =========================================================
data = load_breast_cancer()
X = pd.DataFrame(
data.data,
columns=data.feature_names
)
y = data.target
# =========================================================
# Split Dataset
# =========================================================
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.30,
random_state=42,
stratify=y
)
# =========================================================
# Model A – Current Production Model
# =========================================================
model_a = LogisticRegression(
max_iter=5000
)
model_a.fit(
X_train,
y_train
)
# =========================================================
# Model B – Candidate Model
# =========================================================
model_b = RandomForestClassifier(
n_estimators=200,
random_state=42
)
model_b.fit(
X_train,
y_train
)
# =========================================================
# Generate Predictions
# =========================================================
pred_a = model_a.predict(X_test)
pred_b = model_b.predict(X_test)
# =========================================================
# Compare Model Metrics
# =========================================================
results = pd.DataFrame({
    "Model": [
        "Model A (Logistic Regression)",
        "Model B (Random Forest)"
    ],
    "Accuracy": [
        accuracy_score(y_test, pred_a),
        accuracy_score(y_test, pred_b)
    ],
    "Precision": [
        precision_score(y_test, pred_a),
        precision_score(y_test, pred_b)
    ],
    "Recall": [
        recall_score(y_test, pred_a),
        recall_score(y_test, pred_b)
    ],
    "F1 Score": [
        f1_score(y_test, pred_a),
        f1_score(y_test, pred_b)
    ]
})
print("=== A/B Testing Results ===\n")
print(results)
# =========================================================
# Visualize Metric Comparison
# =========================================================
plot_df = results.set_index("Model")
plot_df.plot(
kind="bar",
figsize=(10, 5)
)
plt.title("A/B Testing ML Models")
plt.ylabel("Score")
plt.ylim(0, 1.05)
plt.grid(
axis="y",
linestyle="--",
alpha=0.5
)
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
# =========================================================
# Select Winning Model
# =========================================================
winner = results.loc[
results["F1 Score"].idxmax(),
"Model"
]
winner_score = results["F1 Score"].max()
print("\n=== Winner ===")
print("Winning Model:", winner)
print("Winning F1 Score:", round(winner_score, 4))
# =========================================================
# Compare F1 Score Difference
# =========================================================
f1_difference = (
results["F1 Score"].iloc[1]
- results["F1 Score"].iloc[0]
)
print(
"\nF1 Score Difference:",
round(f1_difference, 4)
)
# =========================================================
# Confusion Matrices
# =========================================================
cm_a = confusion_matrix(
y_test,
pred_a
)
cm_b = confusion_matrix(
y_test,
pred_b
)
print("\nConfusion Matrix - Model A")
print(cm_a)
print("\nConfusion Matrix - Model B")
print(cm_b)
# =========================================================
# Save Results
# =========================================================
results.to_csv(
"ab_testing_results.csv",
index=False
)
print(
"\nResults saved to ab_testing_results.csv"
)
# =========================================================
# 📂 Load Saved Results
# =========================================================
saved_results = pd.read_csv(
"ab_testing_results.csv"
)
print("\nSaved Results Preview:\n")
print(saved_results)
# =========================================================
# Simple Deployment Decision
# =========================================================
# Row 0 is Model A and row 1 is Model B in the results table
if results.loc[1, "F1 Score"] > results.loc[0, "F1 Score"]:
    print("\nDeploy Model B")
else:
    print("\nKeep Model A in Production")