🧠 AI with Python – 🧪 A/B Testing ML Models

Posted on: June 18, 2026

Description:

Deploying a new machine learning model directly into production can be risky. Even if the new model performs better during development, there is no guarantee it will perform better in real-world conditions.

This is why production ML systems often rely on A/B testing before fully replacing an existing model.

A/B testing allows teams to compare two model versions using measurable metrics and real-world data, helping them make safer deployment decisions.

In this project, we explore how A/B testing works for machine learning systems and how to compare multiple models objectively.

Why A/B Testing Is Important

Suppose you have a machine learning model currently serving users.

A new model has been developed that appears to perform better during evaluation.

The question becomes:

Should we replace the current model?

Comparing training or offline evaluation metrics alone is not enough.

The new model may:

overfit historical data
behave differently in production
introduce unexpected biases
perform poorly on unseen patterns

A/B testing provides evidence before making deployment decisions.

What Is A/B Testing?

A/B testing compares two versions of a system.

For ML systems:

Model A

Current production model.

Model B

New candidate model.

Both models are evaluated on the same dataset (offline) or on comparable shares of live traffic (in production).

Performance metrics are then compared.

A Simple Workflow

Current Model (A)
        ↓
Candidate Model (B)
        ↓
Compare Metrics
        ↓
Select Winner

The goal is to determine whether the new model delivers meaningful improvement.

1. Train Model A

Model A represents the currently deployed model.

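from sklearn.linear_model import LogisticRegression

# max_iter is raised (as in the full script) so the solver converges on unscaled features
model_a = LogisticRegression(max_iter=5000)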

model_a.fit(
    X_train,
    y_train
)

This acts as the production baseline.

2. Train Model B

Model B is the proposed replacement.

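from sklearn.ensemble import RandomForestClassifier

# Fixed random_state (as in the full script) keeps the comparison reproducible
model_b = RandomForestClassifier(n_estimators=200, random_state=42)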

model_b.fit(
    X_train,
    y_train
)

The candidate model may use a different algorithm or configuration.

3. Generate Predictions

Both models make predictions on identical data.

pred_a = model_a.predict(X_test)

pred_b = model_b.predict(X_test)

This ensures a fair comparison.

4. Compare Metrics

Multiple evaluation metrics should be considered.

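from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Shown for Model A; repeat with pred_b for Model B
accuracy_score(y_test, pred_a)
precision_score(y_test, pred_a)
recall_score(y_test, pred_a)
f1_score(y_test, pred_a)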

Each metric reveals different aspects of model quality.

Why Accuracy Alone Is Not Enough

A model can achieve high accuracy while performing poorly on minority classes.

This is particularly important for:

fraud detection
medical diagnosis
anomaly detection
risk prediction

Metrics such as precision, recall, and F1-score often provide a more complete picture.
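
The point about minority classes can be illustrated with a small sketch. The labels below are hypothetical (95 negatives, 5 positives), and the "model" is a deliberately degenerate one that always predicts the majority class. The metrics are computed by hand so the example runs without any dependencies; undefined ratios are treated as 0 here, which is an assumption of this sketch.

```python
# Hypothetical imbalanced labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 100


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)
f1 = safe_div(2 * precision * recall, precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.95 precision=0.00 recall=0.00 f1=0.00
```

The classifier never finds a single positive case, yet its accuracy is 95%. Recall and F1 expose the problem immediately.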

5. Visualise Results

Visualising model performance makes comparisons easier.

# Index by model name so each group of bars is labelled with its model
results.set_index("Model").plot(kind="bar")

Charts quickly reveal the strengths, weaknesses, and trade-offs of the competing models.

6. Select the Winner

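Once metrics are available, the model with the highest score on the chosen metric (here, F1) can be selected.

# The column name must match the results table built in the full script
winner = results.loc[
    results["F1 Score"].idxmax()
]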

The winning model becomes a candidate for deployment.
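
A higher score on one test set does not by itself show that the gap is larger than noise. One common check for two classifiers scored on the same samples is McNemar's test, which looks only at the samples where exactly one model is correct. The sketch below is an illustration rather than part of the original project: the function name `mcnemar_exact` and the toy labels are hypothetical, and it uses the exact binomial form of the test with only the standard library.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (a_only, b_only, p_value), where a_only counts samples that only
    Model A classified correctly and b_only counts those only Model B got right.
    """
    a_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    b_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if b == t and a != t)
    n = a_only + b_only
    if n == 0:
        # The models never disagree on correctness: no evidence either way.
        return a_only, b_only, 1.0
    k = min(a_only, b_only)
    # Under the null hypothesis the discordant samples split 50/50,
    # so the smaller count follows Binomial(n, 0.5). Double the tail for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Hypothetical toy data: 20 samples, all with true label 1.
y_true = [1] * 20
pred_a = [1] * 10 + [1] * 1 + [0] * 7 + [0] * 2  # Model A: 11 of 20 correct
pred_b = [1] * 10 + [0] * 1 + [1] * 7 + [0] * 2  # Model B: 17 of 20 correct

a_only, b_only, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(f"only A correct: {a_only}, only B correct: {b_only}, p-value: {p_value:.4f}")
# only A correct: 1, only B correct: 7, p-value: 0.0703
```

In this toy case Model B looks clearly better (85% vs 55% accuracy), yet with only 20 samples the p-value stays above the conventional 0.05 threshold. With the variables from this article, the call would be `mcnemar_exact(y_test, pred_a, pred_b)`.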

How Real-World A/B Testing Works

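Production ML systems split live traffic between the two models. A cautious rollout might start by sending only a small share to the candidate:

90% → Current Model

10% → Candidate Model

An even split exposes more users to the candidate but gathers evidence faster:

50% → Model A

50% → Model B

Performance is measured on live traffic before a final decision is made.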
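
One way such a split could be implemented is deterministic hashing of a user identifier, so that the same user always reaches the same model. The following is a minimal sketch under stated assumptions: the function `assign_model`, the `salt` value, and the `user-N` identifiers are all hypothetical, and a real serving stack would typically handle routing in its gateway or feature-flag layer.

```python
import hashlib


def assign_model(user_id: str, candidate_share: float = 0.10,
                 salt: str = "ab-test-1") -> str:
    """Deterministically map a user to "A" (current) or "B" (candidate).

    candidate_share is the fraction of users routed to Model B.
    salt is a hypothetical experiment name; changing it reshuffles users.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < candidate_share else "A"


# Simulate 10,000 hypothetical users with a 90/10 split.
users = [f"user-{i}" for i in range(10_000)]
assignments = [assign_model(u) for u in users]
share_b = assignments.count("B") / len(assignments)

print(f"Share routed to candidate: {share_b:.3f}")

# The same user always lands on the same model.
print(assign_model("user-42") == assign_model("user-42"))
```

Hashing rather than random sampling per request keeps each user's experience consistent across sessions, which also keeps the two groups cleanly separated when outcomes are measured.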

Benefits of A/B Testing

A/B testing helps:

reduce deployment risk
validate improvements
compare model versions fairly
make data-driven decisions
prevent performance regressions

It is a common practice in mature ML organizations.
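
To make "validate improvements" concrete for live traffic, a two-proportion z-test is one simple option when each request ends in a binary outcome (for example, a click or a correct prediction). This is a sketch, not the method used in this project: the function name and the counts below are hypothetical, and the normal approximation it relies on assumes reasonably large samples in both groups.

```python
from math import erf, sqrt


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two success rates.

    Returns (z, p_value). A positive z means Model B's rate is higher.
    """
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    # Pooled rate under the null hypothesis that both models perform equally.
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical live-traffic counts: 5,000 requests per model.
z, p_value = two_proportion_z_test(
    success_a=450, n_a=5000,   # Model A: 9.0% success rate
    success_b=500, n_b=5000,   # Model B: 10.0% success rate
)

print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```

A small p-value suggests the observed gap is unlikely to be chance alone. Whether the gap is large enough to justify a rollout remains a product decision.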

Where A/B Testing Is Used

A/B testing is common in:

recommendation systems
search ranking models
advertising platforms
fraud detection systems
personalisation engines

Any production ML system that evolves over time can benefit from controlled experimentation.

Key Takeaways

A/B testing compares two model versions objectively.
Model A typically represents the production model.
Model B represents the candidate replacement.
Multiple metrics should be evaluated, not accuracy alone.
A/B testing reduces risk when deploying new models.

Conclusion

A/B testing is one of the safest ways to introduce improvements into production machine learning systems. Rather than replacing models based solely on offline evaluation, teams can compare model versions using controlled experiments and measurable outcomes. This leads to more reliable deployments and better long-term model performance.

Code Snippet:

# 📦 Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)


# =========================================================
# 🧩 Load Dataset
# =========================================================

data = load_breast_cancer()

X = pd.DataFrame(
    data.data,
    columns=data.feature_names
)

y = data.target


# =========================================================
# ✂️ Split Data
# =========================================================

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.30,
    random_state=42,
    stratify=y
)


# =========================================================
# 🤖 Model A (Production Model)
# =========================================================

model_a = LogisticRegression(
    max_iter=5000
)

model_a.fit(
    X_train,
    y_train
)


# =========================================================
# 🚀 Model B (Candidate Model)
# =========================================================

model_b = RandomForestClassifier(
    n_estimators=200,
    random_state=42
)

model_b.fit(
    X_train,
    y_train
)


# =========================================================
# 📊 Generate Predictions
# =========================================================

pred_a = model_a.predict(X_test)

pred_b = model_b.predict(X_test)


# =========================================================
# 📈 Compare Metrics
# =========================================================

results = pd.DataFrame({
    "Model": [
        "Model A (Logistic Regression)",
        "Model B (Random Forest)"
    ],

    "Accuracy": [
        accuracy_score(y_test, pred_a),
        accuracy_score(y_test, pred_b)
    ],

    "Precision": [
        precision_score(y_test, pred_a),
        precision_score(y_test, pred_b)
    ],

    "Recall": [
        recall_score(y_test, pred_a),
        recall_score(y_test, pred_b)
    ],

    "F1 Score": [
        f1_score(y_test, pred_a),
        f1_score(y_test, pred_b)
    ]
})

print("=== A/B Testing Results ===\n")
print(results)


# =========================================================
# 📊 Visualize Results
# =========================================================

plot_df = results.set_index("Model")

plot_df.plot(
    kind="bar",
    figsize=(10, 5)
)

plt.title("A/B Testing ML Models")
plt.ylabel("Score")
plt.ylim(0, 1.05)

plt.xticks(rotation=0)

plt.grid(
    axis="y",
    linestyle="--",
    alpha=0.5
)

plt.tight_layout()
plt.show()


# =========================================================
# 🏆 Select Winning Model
# =========================================================

winner = results.loc[
    results["F1 Score"].idxmax(),
    "Model"
]

winner_score = results["F1 Score"].max()

print("\n=== Winner ===")
print(f"Winning Model: {winner}")
print(f"Best F1 Score: {winner_score:.4f}")


# =========================================================
# 📊 Performance Difference
# =========================================================

f1_difference = (
    results["F1 Score"].iloc[1]
    - results["F1 Score"].iloc[0]
)

print(
    f"\nF1 Score Difference: {f1_difference:.4f}"
)


# =========================================================
# 💾 Save Results
# =========================================================

results.to_csv(
    "ab_testing_results.csv",
    index=False
)

print(
    "\nResults saved to ab_testing_results.csv"
)


# =========================================================
# 📂 Load Results Back
# =========================================================

saved_results = pd.read_csv(
    "ab_testing_results.csv"
)

print("\nSaved Results Preview:\n")
print(saved_results)

