🧠 AI with Python – 🧪 A/B Testing ML Models
Posted on: June 18, 2026
Description:
Deploying a new machine learning model directly into production can be risky. Even if the new model performs better during development, there is no guarantee it will perform better in real-world conditions.
This is why production ML systems often rely on A/B testing before fully replacing an existing model.
A/B testing allows teams to compare two model versions using measurable metrics and real-world data, helping them make safer deployment decisions.
In this project, we explore how A/B testing works for machine learning systems and how to compare multiple models objectively.
Why A/B Testing Is Important
Suppose you have a machine learning model currently serving users.
A new model has been developed that appears to perform better during evaluation.
The question becomes:
Should we replace the current model?
Comparing offline evaluation metrics alone is not enough.
The new model may:
- overfit historical data
- behave differently in production
- introduce unexpected biases
- perform poorly on unseen patterns
A/B testing provides evidence before making deployment decisions.
What Is A/B Testing?
A/B testing compares two versions of a system.
For ML systems:
Model A
Current production model.
Model B
New candidate model.
Both models are evaluated on the same held-out dataset (offline) or on comparable slices of live traffic (online).
Performance metrics are then compared.
A Simple Workflow
Current Model (A)
↓
Candidate Model (B)
↓
Compare Metrics
↓
Select Winner
The goal is to determine whether the new model delivers meaningful improvement.
1. Train Model A
Model A represents the currently deployed model.
model_a = LogisticRegression(
    max_iter=5000  # matches the full script; the default of 100 may not converge on unscaled features
)
model_a.fit(
    X_train,
    y_train
)
This acts as the production baseline.
2. Train Model B
Model B is the proposed replacement.
model_b = RandomForestClassifier(
    n_estimators=200,
    random_state=42  # fixed seed so the comparison is reproducible
)
model_b.fit(
    X_train,
    y_train
)
The candidate model may use a different algorithm or configuration.
3. Generate Predictions
Both models make predictions on identical data.
pred_a = model_a.predict(X_test)
pred_b = model_b.predict(X_test)
This ensures a fair comparison.
4. Compare Metrics
Multiple evaluation metrics should be considered.
# Shown for Model A; repeat with pred_b for Model B
accuracy_score(y_test, pred_a)
precision_score(y_test, pred_a)
recall_score(y_test, pred_a)
f1_score(y_test, pred_a)
Each metric reveals different aspects of model quality.
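The four calls above can be gathered into one small helper so that both models are scored in exactly the same way. This is only a minimal sketch: `score_model` is a hypothetical helper name (it is not part of scikit-learn), and the short label lists are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)


def score_model(y_true, y_pred):
    """Return the four metrics used in this post as a dict (hypothetical helper)."""
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, zero_division=0),
        "Recall": recall_score(y_true, y_pred, zero_division=0),
        "F1 Score": f1_score(y_true, y_pred, zero_division=0),
    }


# Toy labels, made up for illustration only
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

scores = score_model(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In the project itself, the same helper could be called once with `pred_a` and once with `pred_b`, which keeps the metric definitions identical for both models.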
Why Accuracy Alone Is Not Enough
A model can achieve high accuracy while performing poorly on minority classes.
This is particularly important for:
- fraud detection
- medical diagnosis
- anomaly detection
- risk prediction
Metrics such as precision, recall, and F1-score often provide a more complete picture.
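A quick way to see this is a "model" that always predicts the majority class. The sketch below uses synthetic labels (95 negatives and 5 positives, invented purely for illustration) and compares accuracy with recall on the positive class.

```python
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels, made up for illustration: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall:   {recall:.2f}")
```

Accuracy looks strong here even though not a single positive case is found, which is exactly the situation that matters in fraud or medical settings.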
5. Visualise Results
Visualising model performance makes comparisons easier.
results.set_index("Model").plot(kind="bar")
Charts quickly reveal the strengths, weaknesses, and trade-offs between competing models.
6. Select the Winner
Once metrics are available, the best-performing model can be selected. Here, the F1 score is used as the deciding metric.
winner = results.loc[
    results["F1 Score"].idxmax(),
    "Model"
]
The winning model becomes a candidate for deployment.
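A small gap in F1 on one test set can be noise rather than a meaningful improvement. One common way to check this, which is not part of the original project, is a percentile bootstrap over the test rows. The sketch below is written under that assumption: `bootstrap_f1_difference` is a hypothetical helper, and the labels and predictions are synthetic stand-ins for `y_test`, `pred_a`, and `pred_b`.

```python
import numpy as np
from sklearn.metrics import f1_score


def bootstrap_f1_difference(y_true, pred_a, pred_b, n_rounds=500, seed=42):
    """Percentile bootstrap interval for F1(B) - F1(A) on the same test rows."""
    y_true = np.asarray(y_true)
    pred_a = np.asarray(pred_a)
    pred_b = np.asarray(pred_b)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = np.empty(n_rounds)
    for i in range(n_rounds):
        # Resample test rows with replacement; both models are scored on the same rows
        idx = rng.integers(0, n, size=n)
        f1_a = f1_score(y_true[idx], pred_a[idx], zero_division=0)
        f1_b = f1_score(y_true[idx], pred_b[idx], zero_division=0)
        diffs[i] = f1_b - f1_a
    low, high = np.percentile(diffs, [2.5, 97.5])
    return low, high


# Synthetic stand-ins for y_test, pred_a and pred_b (made up for illustration)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
pred_a = np.where(rng.random(200) < 0.20, 1 - y_true, y_true)  # roughly 20% of labels flipped
pred_b = np.where(rng.random(200) < 0.10, 1 - y_true, y_true)  # roughly 10% of labels flipped

low, high = bootstrap_f1_difference(y_true, pred_a, pred_b)
print(f"95% interval for F1(B) - F1(A): [{low:.3f}, {high:.3f}]")
if low > 0:
    print("B scores higher than A across most resamples.")
elif high < 0:
    print("A scores higher than B across most resamples.")
else:
    print("The interval contains 0; the difference may be noise.")
```

If the interval contains zero, it is usually safer to treat the two models as tied on this metric and look at other criteria before promoting Model B.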
How Real-World A/B Testing Works
Production ML systems often split traffic between the two models, for example:
90% → Current Model
10% → Candidate Model
or
50% → Model A
50% → Model B
Performance is measured on live traffic before a final decision is made.
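A split like this is often implemented by hashing a stable identifier, so the same user always reaches the same model. The sketch below shows one possible approach under that assumption; `assign_model`, the `candidate_share` parameter, and the user IDs are all hypothetical and not taken from any particular serving framework.

```python
import hashlib


def assign_model(user_id: str, candidate_share: float = 0.10) -> str:
    """Deterministically map a user to "A" or "B" (hypothetical helper)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1)
    bucket = int(digest[:8], 16) / 16**8
    return "B" if bucket < candidate_share else "A"


# Made-up user IDs for illustration
users = [f"user-{i}" for i in range(10_000)]
assignments = [assign_model(u) for u in users]
share_b = assignments.count("B") / len(assignments)

print(f"Share routed to Model B: {share_b:.3f}")
```

Hashing rather than drawing a random number per request keeps each user's experience consistent and makes the assignment reproducible when results are analysed later. Setting `candidate_share=0.50` gives the 50/50 split.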
Benefits of A/B Testing
A/B testing helps:
- reduce deployment risk
- validate improvements
- compare model versions fairly
- make data-driven decisions
- prevent performance regressions
It is one of the most common practices in mature ML organizations.
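To validate an improvement on live traffic, teams commonly check whether the observed gap between the two groups is larger than chance would explain. The sketch below uses a two-proportion z-test, which is one standard option and not something the original project implements. The function name and all counts are made up for illustration.

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Two-sided z-test for a difference between two success rates."""
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    # Pooled rate under the assumption that both models perform the same
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts for illustration: correct predictions out of requests served
z, p_value = two_proportion_z_test(
    success_a=8_550, total_a=9_000,  # Model A on 90% of traffic
    success_b=970, total_b=1_000,    # Model B on 10% of traffic
)

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the gap is unlikely to be chance alone, but the threshold and the minimum sample size are best fixed before the experiment starts.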
Where A/B Testing Is Used
A/B testing is common in:
- recommendation systems
- search ranking models
- advertising platforms
- fraud detection systems
- personalisation engines
Any production ML system that evolves over time can benefit from controlled experimentation.
Key Takeaways
- A/B testing compares two model versions objectively.
- Model A typically represents the production model.
- Model B represents the candidate replacement.
- Multiple metrics should be evaluated, not accuracy alone.
- A/B testing reduces risk when deploying new models.
Conclusion
A/B testing is one of the safest ways to introduce improvements into production machine learning systems. Rather than replacing models based solely on offline evaluation, teams can compare model versions using controlled experiments and measurable outcomes. This leads to more reliable deployments and better long-term model performance.
Code Snippet:
# 📦 Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)
# =========================================================
# 🧩 Load Dataset
# =========================================================
data = load_breast_cancer()
X = pd.DataFrame(
    data.data,
    columns=data.feature_names
)
y = data.target
# =========================================================
# ✂️ Split Data
# =========================================================
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.30,
    random_state=42,
    stratify=y
)
# =========================================================
# 🤖 Model A (Production Model)
# =========================================================
model_a = LogisticRegression(
    max_iter=5000
)
model_a.fit(
    X_train,
    y_train
)
# =========================================================
# 🚀 Model B (Candidate Model)
# =========================================================
model_b = RandomForestClassifier(
    n_estimators=200,
    random_state=42
)
model_b.fit(
    X_train,
    y_train
)
# =========================================================
# 📊 Generate Predictions
# =========================================================
pred_a = model_a.predict(X_test)
pred_b = model_b.predict(X_test)
# =========================================================
# 📈 Compare Metrics
# =========================================================
results = pd.DataFrame({
    "Model": [
        "Model A (Logistic Regression)",
        "Model B (Random Forest)"
    ],
    "Accuracy": [
        accuracy_score(y_test, pred_a),
        accuracy_score(y_test, pred_b)
    ],
    "Precision": [
        precision_score(y_test, pred_a),
        precision_score(y_test, pred_b)
    ],
    "Recall": [
        recall_score(y_test, pred_a),
        recall_score(y_test, pred_b)
    ],
    "F1 Score": [
        f1_score(y_test, pred_a),
        f1_score(y_test, pred_b)
    ]
})
print("=== A/B Testing Results ===\n")
print(results)
# =========================================================
# 📊 Visualize Results
# =========================================================
plot_df = results.set_index("Model")
plot_df.plot(
    kind="bar",
    figsize=(10, 5)
)
plt.title("A/B Testing ML Models")
plt.ylabel("Score")
plt.ylim(0, 1.05)
plt.xticks(rotation=0)
plt.grid(
    axis="y",
    linestyle="--",
    alpha=0.5
)
plt.tight_layout()
plt.show()
# =========================================================
# 🏆 Select Winning Model
# =========================================================
winner = results.loc[
    results["F1 Score"].idxmax(),
    "Model"
]
winner_score = results["F1 Score"].max()
print("\n=== Winner ===")
print(f"Winning Model: {winner}")
print(f"Best F1 Score: {winner_score:.4f}")
# =========================================================
# 📊 Performance Difference
# =========================================================
# Positive values mean Model B scored higher than Model A
f1_difference = (
    results["F1 Score"].iloc[1]
    - results["F1 Score"].iloc[0]
)
print(
    f"\nF1 Score Difference (B - A): {f1_difference:.4f}"
)
# =========================================================
# 💾 Save Results
# =========================================================
results.to_csv(
    "ab_testing_results.csv",
    index=False
)
print(
    "\nResults saved to ab_testing_results.csv"
)
# =========================================================
# 📂 Load Results Back
# =========================================================
saved_results = pd.read_csv(
    "ab_testing_results.csv"
)
print("\nSaved Results Preview:\n")
print(saved_results)