⚡️ Saturday ML Spark – 🧪 A/B Testing for ML Models

Posted on: June 20, 2026

Description:

Building a better machine learning model is only part of the challenge. The real question is: How do we know the new model is actually better in production?

A model that performs well during development may not necessarily perform better when exposed to real users, real traffic, and real-world data. This is why machine learning teams use A/B testing before replacing production models.

In this project, we explore how A/B testing helps compare machine learning models objectively and reduce deployment risk.

Why A/B Testing Is Necessary

Imagine you currently have a model running in production. The model is generating predictions and supporting business decisions every day. During offline evaluation, your team's new model appears to have:

higher accuracy
better recall
stronger F1-score

The obvious question becomes: Should we replace the current model immediately?

The answer is usually no. Production systems require evidence before major changes are introduced.

What Is A/B Testing?

A/B testing is a controlled experiment where two versions of a system are compared.

For machine learning:

Model A

Current production model.

Model B

New candidate model.

Offline, both models score the same held-out data; in production, each serves a comparable slice of live traffic. Either way, the comparison is fair only if the two models see equivalent inputs.

The Basic Workflow

Current Production Model (A)
            ↓
      Candidate Model (B)
            ↓
      Compare Metrics
            ↓
      Select Winner

The goal is to determine whether the candidate model delivers meaningful improvement.

Training the Baseline Model

The first model represents the current production system.

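from sklearn.linear_model import LogisticRegression

# max_iter is raised from the default of 100 to help the solver
# converge on unscaled features
model_a = LogisticRegression(
    max_iter=5000
)

model_a.fit(
    X_train,
    y_train
)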

This establishes the baseline performance.

Training the Candidate Model

The second model represents the proposed replacement.

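from sklearn.ensemble import RandomForestClassifier

# A fixed random_state keeps the comparison reproducible
model_b = RandomForestClassifier(
    n_estimators=200,
    random_state=42
)

model_b.fit(
    X_train,
    y_train
)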

The candidate may use:

a different algorithm
new features
improved training data
updated hyperparameters

Generating Predictions

Both models are evaluated using the same test dataset.

pred_a = model_a.predict(X_test)

pred_b = model_b.predict(X_test)

Using identical data ensures the comparison remains fair.
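Because both models score the same rows, the gap between them can be resampled as a pair. The sketch below shows one way to check whether an F1 difference is larger than sampling noise: a paired bootstrap over test rows. It is an illustration, not part of the original project. The helper names `f1` and `paired_bootstrap_f1_diff`, the synthetic labels, and the choice of 1,000 resamples are all assumptions, and the hand-written `f1` stands in for scikit-learn's `f1_score` so the block runs without dependencies.

```python
import random


def f1(y_true, y_pred):
    """Binary F1 for 0/1 labels, treating 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap_f1_diff(y_true, pred_a, pred_b, n_resamples=1000, seed=42):
    """Return a 95% percentile interval for F1(B) - F1(A).

    Each resample draws one set of row indices and applies it to both
    models, so the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        pa = [pred_a[i] for i in idx]
        pb = [pred_b[i] for i in idx]
        diffs.append(f1(yt, pb) - f1(yt, pa))
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return lower, upper


# Synthetic stand-ins for y_test, pred_a and pred_b (assumed data)
data_rng = random.Random(0)
y_true = [data_rng.randint(0, 1) for _ in range(200)]
pred_a = [t if data_rng.random() < 0.85 else 1 - t for t in y_true]
pred_b = [t if data_rng.random() < 0.90 else 1 - t for t in y_true]

lower, upper = paired_bootstrap_f1_diff(y_true, pred_a, pred_b)
print(f"95% interval for F1(B) - F1(A): [{lower:.3f}, {upper:.3f}]")
```

If the interval sits entirely above zero, the candidate's offline advantage is less likely to be an artifact of which rows happened to land in the test set.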

Comparing Performance Metrics

Multiple metrics should be evaluated.

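from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Shown for Model A; repeat with pred_b for Model B
accuracy_score(y_test, pred_a)
precision_score(y_test, pred_a)
recall_score(y_test, pred_a)
f1_score(y_test, pred_a)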

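Each metric highlights a different aspect of performance: accuracy is the share of all predictions that are correct, precision is the share of predicted positives that are truly positive, recall is the share of actual positives the model finds, and F1 is the harmonic mean of precision and recall.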

Why Accuracy Alone Is Not Enough

A model can achieve excellent accuracy while still making costly mistakes.

Consider:

fraud detection
medical diagnosis
credit risk assessment

In these systems, precision and recall may matter more than overall accuracy. This is why A/B testing typically evaluates multiple metrics simultaneously.
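A small worked example makes the point concrete. The numbers below are hypothetical: 1,000 transactions of which 10 are fraudulent, scored by a "model" that never flags anything. The helper `confusion_counts` is written out by hand for illustration only.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for 0/1 labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical data: 10 fraud cases (label 1) among 1,000 transactions
y_true = [1] * 10 + [0] * 990
# A degenerate model that predicts "not fraud" for everything
y_pred = [0] * 1000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
# accuracy=0.990, recall=0.000
```

The model is 99% accurate and still misses every fraud case, which is exactly the failure that recall exposes and accuracy hides.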

Visualising Results

Performance comparisons become easier when visualised.

# Index by model name so each model gets its own labelled group of bars
results.set_index("Model").plot(kind="bar")

Visualisation helps identify:

metric improvements
trade-offs
regression risks

before deployment decisions are made.

Choosing the Winning Model

A simple approach is selecting the model with the highest F1-score.

winner = results.loc[
    results["F1 Score"].idxmax()
]

However, real-world systems may also consider:

latency
infrastructure cost
explainability
business KPIs

before declaring a winner.
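One of these extra criteria, latency, can be folded into the decision as a guardrail. The sketch below is a minimal illustration under stated assumptions: the function names `median_latency_ms` and `choose_model`, the 0.01 minimum F1 gain, and the 1.5x latency budget are hypothetical values chosen for the example, not recommendations from the original project. In practice `predict_fn` could be something like `model_a.predict`.

```python
import statistics
import time


def median_latency_ms(predict_fn, batch, n_runs=20):
    """Time predict_fn(batch) n_runs times and return the median in milliseconds."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict_fn(batch)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def choose_model(f1_a, f1_b, latency_a_ms, latency_b_ms,
                 min_f1_gain=0.01, max_latency_ratio=1.5):
    """Return "B" only if it beats A by a margin and stays within the latency budget."""
    gain_ok = (f1_b - f1_a) >= min_f1_gain
    latency_ok = latency_b_ms <= latency_a_ms * max_latency_ratio
    return "B" if gain_ok and latency_ok else "A"


# Stand-in predictor so the example runs without a trained model
def dummy_predict(batch):
    return [1 if x > 0 else 0 for x in batch]


latency = median_latency_ms(dummy_predict, list(range(-500, 500)))
print(f"median latency: {latency:.4f} ms")

# Hypothetical scores and latencies
print(choose_model(0.95, 0.97, latency_a_ms=2.0, latency_b_ms=2.5))   # B
print(choose_model(0.95, 0.97, latency_a_ms=2.0, latency_b_ms=10.0))  # A
```

The second call keeps Model A even though Model B has the higher F1-score, because the candidate is five times slower than the budget allows.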

How A/B Testing Works in Production

Offline testing is only the first step.

Production systems often split live traffic between the two models. A cautious rollout might start with:

90% → Current Model
10% → Candidate Model

A balanced experiment uses an even split:

50% → Model A
50% → Model B

Performance is then monitored using real user interactions. This provides stronger evidence than offline evaluation alone.
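A traffic split of this kind is commonly implemented by hashing a stable identifier, so the same user always reaches the same model. The following is a minimal sketch of that idea, not a description of any particular serving platform; the function name `assign_variant`, the `salt` value, and the `user-…` identifiers are assumptions made for the example.

```python
import hashlib


def assign_variant(user_id, candidate_share=0.10, salt="ab-test-v1"):
    """Deterministically map a user to "A" (current) or "B" (candidate)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # First 8 hex digits as an integer, scaled to a value in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < candidate_share else "A"


# Simulate 10,000 hypothetical users under a 90/10 split
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}")] += 1

print(counts)
```

Changing the salt reshuffles the assignment for a new experiment, and raising `candidate_share` to 0.50 gives the even split.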

Benefits of A/B Testing

A/B testing helps teams:

reduce deployment risk
validate improvements
compare models fairly
avoid regressions
make data-driven decisions

It is a widely used practice among teams that deploy models regularly.
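To make "validate improvements" concrete, the outcome rates from the two traffic groups can be compared with a significance test. The sketch below uses a two-sided two-proportion z-test, which is one common choice among several. The function name `two_proportion_z_test` and the conversion counts are hypothetical.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two success rates.

    Returns (z, p_value), where z > 0 means group B has the higher rate.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups are all-success or all-failure: no detectable difference
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical outcome of a 90/10 split:
# Model A served 9,000 users with 450 conversions (5.0%)
# Model B served 1,000 users with 65 conversions (6.5%)
z, p_value = two_proportion_z_test(450, 9000, 65, 1000)
print(f"z={z:.3f}, p-value={p_value:.4f}")
```

A small p-value suggests the difference in rates is unlikely to be chance alone, though the threshold and the required sample size should be fixed before the experiment starts.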

Where A/B Testing Is Used

A/B testing is common in:

recommendation systems
search ranking models
advertising platforms
fraud detection systems
personalization engines

Any system that evolves continuously can benefit from controlled experimentation.

Key Takeaways

A/B testing compares production and candidate models objectively.
Multiple metrics should be evaluated, not accuracy alone.
Controlled experiments reduce deployment risk.
F1-score balances precision and recall, which makes it a convenient single comparison metric.
A/B testing is a core practice in production ML systems.

Conclusion

Deploying a new machine learning model should never be based solely on offline evaluation results. A/B testing provides a structured way to compare model versions, validate improvements, and minimize risk before full deployment. By relying on measurable outcomes rather than assumptions, teams can confidently evolve their ML systems over time.

This strengthens the ML Systems track in Saturday ML Spark ⚡️, focusing on practical deployment and experimentation strategies used in real-world machine learning systems.

Code Snippet:

# Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix
)


# =========================================================
# Load Dataset
# =========================================================

data = load_breast_cancer()

X = pd.DataFrame(
    data.data,
    columns=data.feature_names
)

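# Labels: 0 = malignant, 1 = benign. The scikit-learn metrics used below
# treat 1 (benign) as the positive class by default.
y = data.target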


# =========================================================
# Split Dataset
# =========================================================

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.30,
    random_state=42,
    stratify=y
)


# =========================================================
# Model A – Current Production Model
# =========================================================

model_a = LogisticRegression(
    max_iter=5000
)

model_a.fit(
    X_train,
    y_train
)


# =========================================================
# Model B – Candidate Model
# =========================================================

model_b = RandomForestClassifier(
    n_estimators=200,
    random_state=42
)

model_b.fit(
    X_train,
    y_train
)


# =========================================================
# Generate Predictions
# =========================================================

pred_a = model_a.predict(X_test)

pred_b = model_b.predict(X_test)


# =========================================================
# Compare Model Metrics
# =========================================================

results = pd.DataFrame({
    "Model": [
        "Model A (Logistic Regression)",
        "Model B (Random Forest)"
    ],

    "Accuracy": [
        accuracy_score(y_test, pred_a),
        accuracy_score(y_test, pred_b)
    ],

    "Precision": [
        precision_score(y_test, pred_a),
        precision_score(y_test, pred_b)
    ],

    "Recall": [
        recall_score(y_test, pred_a),
        recall_score(y_test, pred_b)
    ],

    "F1 Score": [
        f1_score(y_test, pred_a),
        f1_score(y_test, pred_b)
    ]
})

print("=== A/B Testing Results ===\n")
print(results)


# =========================================================
# Visualize Metric Comparison
# =========================================================

plot_df = results.set_index("Model")

plot_df.plot(
    kind="bar",
    figsize=(10, 5)
)

plt.title("A/B Testing ML Models")
plt.ylabel("Score")
plt.ylim(0, 1.05)

plt.grid(
    axis="y",
    linestyle="--",
    alpha=0.5
)

plt.xticks(rotation=0)

plt.tight_layout()
plt.show()


# =========================================================
# Select Winning Model
# =========================================================

winner = results.loc[
    results["F1 Score"].idxmax(),
    "Model"
]

winner_score = results["F1 Score"].max()

print("\n=== Winner ===")
print("Winning Model:", winner)
print("Winning F1 Score:", round(winner_score, 4))


# =========================================================
# Compare F1 Score Difference
# =========================================================

f1_difference = (
    results["F1 Score"].iloc[1]
    - results["F1 Score"].iloc[0]
)

print(
    "\nF1 Score Difference:",
    round(f1_difference, 4)
)


# =========================================================
# Confusion Matrices
# =========================================================

cm_a = confusion_matrix(
    y_test,
    pred_a
)

cm_b = confusion_matrix(
    y_test,
    pred_b
)

print("\nConfusion Matrix - Model A")
print(cm_a)

print("\nConfusion Matrix - Model B")
print(cm_b)


# =========================================================
# Save Results
# =========================================================

results.to_csv(
    "ab_testing_results.csv",
    index=False
)

print(
    "\nResults saved to ab_testing_results.csv"
)


# =========================================================
# Load Saved Results
# =========================================================

saved_results = pd.read_csv(
    "ab_testing_results.csv"
)

print("\nSaved Results Preview:\n")
print(saved_results)


# =========================================================
# Simple Deployment Decision
# =========================================================

if (
    results.loc[1, "F1 Score"]
    >
    results.loc[0, "F1 Score"]
):
    # Offline evidence only; a live traffic split should follow
    print(
        "\nModel B wins offline: promote to a live A/B test"
    )
else:
    print(
        "\nKeep Model A in Production"
    )
