AW Dev Rethought

🕵️ Debugging is like being the detective in a crime movie where you are also the murderer - Filipe Fortes

🧠 AI with Python – 🧪 A/B Testing ML Models


Description:

Deploying a new machine learning model directly into production can be risky. Even if the new model performs better during development, there is no guarantee it will perform better in real-world conditions.

This is why production ML systems often rely on A/B testing before fully replacing an existing model.

A/B testing allows teams to compare two model versions using measurable metrics and real-world data, helping them make safer deployment decisions.

In this project, we explore how A/B testing works for machine learning systems and how to compare multiple models objectively.


Why A/B Testing Is Important

Suppose you have a machine learning model currently serving users.

A new model has been developed that appears to perform better during evaluation.

The question becomes:

Should we replace the current model?

Simply comparing offline evaluation metrics is not enough.

The new model may:

  • overfit historical data
  • behave differently in production
  • introduce unexpected biases
  • perform poorly on unseen patterns

A/B testing provides evidence before making deployment decisions.


What Is A/B Testing?

A/B testing compares two versions of a system.

For ML systems:

Model A

Current production model.

Model B

New candidate model.

Both models are evaluated using the same dataset or traffic distribution.

Performance metrics are then compared.


A Simple Workflow

Current Model (A)
        ↓
 Candidate Model (B)
        ↓
 Compare Metrics
        ↓
 Select Winner

The goal is to determine whether the new model delivers meaningful improvement.


1. Train Model A

Model A represents the currently deployed model.

# A higher max_iter helps the solver converge on unscaled features
model_a = LogisticRegression(max_iter=5000)

model_a.fit(
    X_train,
    y_train
)

This acts as the production baseline.


2. Train Model B

Model B is the proposed replacement.

model_b = RandomForestClassifier()

model_b.fit(
    X_train,
    y_train
)

The candidate model may use a different algorithm or configuration.


3. Generate Predictions

Both models make predictions on identical data.

pred_a = model_a.predict(X_test)

pred_b = model_b.predict(X_test)

This ensures a fair comparison.


4. Compare Metrics

Multiple evaluation metrics should be considered.

# Model A shown here; repeat with pred_b for Model B
accuracy_score(y_test, pred_a)
precision_score(y_test, pred_a)
recall_score(y_test, pred_a)
f1_score(y_test, pred_a)

Each metric reveals different aspects of model quality.


Why Accuracy Alone Is Not Enough

A model can achieve high accuracy while performing poorly on minority classes.

This is particularly important for:

  • fraud detection
  • medical diagnosis
  • anomaly detection
  • risk prediction

Metrics such as precision, recall, and F1-score often provide a more complete picture.
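
To make the point above concrete, here is a minimal sketch using made-up labels (95 negatives, 5 positives) and a hypothetical "model" that always predicts the majority class. The helper names `confusion_counts` and `safe_div` are illustrative assumptions, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
# Hypothetical imbalanced labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A naive "model" that always predicts the majority class.
y_pred = [0] * 100


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (a convention for this sketch)."""
    return numerator / denominator if denominator else 0.0


tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)
f1 = safe_div(2 * precision * recall, precision + recall)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 Score:  {f1:.2f}")
```

In this toy case accuracy is 0.95, yet recall and F1 are both 0 because not a single positive case is detected.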


5. Visualise Results

Visualising model performance makes comparisons easier.

results.set_index("Model").plot(kind="bar")

Charts quickly reveal:

  • strengths
  • weaknesses
  • trade-offs

between competing models.


6. Select the Winner

Once metrics are available, the model with the best score on the chosen metric (here, F1) can be selected.

winner = results.loc[
    results["F1 Score"].idxmax()
]

The winning model becomes a candidate for deployment.
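
A higher score on one test set does not by itself show that the gap is real; on a small test set it may be noise. One possible check, which is not part of the original project, is an exact McNemar test on the paired predictions: it looks only at the samples where exactly one model is correct. The sketch below uses the standard library and toy labels; the function name `mcnemar_exact` and the data are assumptions for illustration.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (a_only, b_only, p_value), where a_only counts samples that only
    model A classified correctly and b_only counts samples only B got right.
    """
    triples = list(zip(y_true, pred_a, pred_b))
    a_only = sum(1 for t, a, b in triples if a == t and b != t)
    b_only = sum(1 for t, a, b in triples if a != t and b == t)

    n = a_only + b_only
    if n == 0:
        # The models never disagree on correctness, so there is no evidence either way.
        return a_only, b_only, 1.0

    # Under the null hypothesis, each discordant sample favours A or B
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Toy data (hypothetical): 20 positive samples.
# Model A misses the first six; model B misses only the seventh.
y_true = [1] * 20
toy_pred_a = [0] * 6 + [1] * 14
toy_pred_b = [1] * 6 + [0] + [1] * 13

a_only, b_only, p_value = mcnemar_exact(y_true, toy_pred_a, toy_pred_b)
print(f"Only A correct: {a_only}, only B correct: {b_only}, p-value: {p_value:.3f}")
```

In this toy case B wins six of the seven disagreements, yet the p-value is 0.125, so the evidence is still weak at a conventional 0.05 threshold. With the variables from this walkthrough, the call would be `mcnemar_exact(y_test, pred_a, pred_b)`.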


How Real-World A/B Testing Works

Production ML systems often route traffic like:

90% → Current Model

10% → Candidate Model

or

50% → Model A

50% → Model B
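
A split like this is often implemented by hashing a stable identifier, so that the same user always reaches the same model. The following is a minimal sketch under that assumption; the function name `assign_model`, the `salt` value, and the user IDs are all hypothetical and not taken from any specific serving framework.

```python
import hashlib


def assign_model(user_id: str, candidate_share: float = 0.10, salt: str = "ab-test-v1") -> str:
    """Deterministically map a user to model "A" (current) or "B" (candidate).

    candidate_share is the fraction of users routed to model B.
    salt is a hypothetical experiment name; changing it reshuffles the buckets.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Turn the first 32 bits of the hash into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < candidate_share else "A"


# Simulate a 90/10 split over hypothetical user IDs.
users = [f"user-{i}" for i in range(10_000)]
assignments = [assign_model(u) for u in users]
share_b = assignments.count("B") / len(assignments)

print(f"Share routed to Model B: {share_b:.3f}")
```

Because the assignment depends only on the salt and the user ID, a returning user sees consistent behaviour for the duration of the experiment, and the observed share should land close to the configured 10%.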

Performance is measured on live traffic before a final decision is made.
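
One common way to judge a live-traffic comparison is a two-proportion z-test on a success metric, such as the share of requests with a correct or accepted prediction. The sketch below is an illustration under assumptions: the function name `two_proportion_z_test` and the counts are invented, and the normal approximation is only reasonable for fairly large samples.

```python
from math import erfc, sqrt


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, p_value) for the difference in success rates, B minus A.

    Uses the pooled-variance normal approximation and a two-sided p-value.
    """
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        # All outcomes are identical, so there is nothing to distinguish.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal beyond |z|.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Hypothetical counts from a 90/10 split:
# Model A served 9,000 requests with 900 successes,
# Model B served 1,000 requests with 120 successes.
z, p_value = two_proportion_z_test(900, 9000, 120, 1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

A positive z means the candidate's observed rate is higher; whether the p-value is small enough to act on depends on a threshold the team should fix before the experiment starts.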


Benefits of A/B Testing

A/B testing helps:

  • reduce deployment risk
  • validate improvements
  • compare model versions fairly
  • make data-driven decisions
  • prevent performance regressions

It is a common practice in mature ML organizations.


Where A/B Testing Is Used

A/B testing is common in:

  • recommendation systems
  • search ranking models
  • advertising platforms
  • fraud detection systems
  • personalisation engines

Any production ML system that evolves over time can benefit from controlled experimentation.


Key Takeaways

  1. A/B testing compares two model versions objectively.
  2. Model A typically represents the production model.
  3. Model B represents the candidate replacement.
  4. Multiple metrics should be evaluated, not accuracy alone.
  5. A/B testing reduces risk when deploying new models.

Conclusion

A/B testing is one of the safest ways to introduce improvements into production machine learning systems. Rather than replacing models based solely on offline evaluation, teams can compare model versions using controlled experiments and measurable outcomes. This leads to more reliable deployments and better long-term model performance.


Code Snippet:

# 📦 Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)


# =========================================================
# 🧩 Load Dataset
# =========================================================

data = load_breast_cancer()

X = pd.DataFrame(
    data.data,
    columns=data.feature_names
)

y = data.target


# =========================================================
# ✂️ Split Data
# =========================================================

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.30,
    random_state=42,
    stratify=y
)


# =========================================================
# 🤖 Model A (Production Model)
# =========================================================

model_a = LogisticRegression(
    max_iter=5000
)

model_a.fit(
    X_train,
    y_train
)


# =========================================================
# 🚀 Model B (Candidate Model)
# =========================================================

model_b = RandomForestClassifier(
    n_estimators=200,
    random_state=42
)

model_b.fit(
    X_train,
    y_train
)


# =========================================================
# 📊 Generate Predictions
# =========================================================

pred_a = model_a.predict(X_test)

pred_b = model_b.predict(X_test)


# =========================================================
# 📈 Compare Metrics
# =========================================================

results = pd.DataFrame({
    "Model": [
        "Model A (Logistic Regression)",
        "Model B (Random Forest)"
    ],

    "Accuracy": [
        accuracy_score(y_test, pred_a),
        accuracy_score(y_test, pred_b)
    ],

    "Precision": [
        precision_score(y_test, pred_a),
        precision_score(y_test, pred_b)
    ],

    "Recall": [
        recall_score(y_test, pred_a),
        recall_score(y_test, pred_b)
    ],

    "F1 Score": [
        f1_score(y_test, pred_a),
        f1_score(y_test, pred_b)
    ]
})

print("=== A/B Testing Results ===\n")
print(results)


# =========================================================
# 📊 Visualise Results
# =========================================================

plot_df = results.set_index("Model")

plot_df.plot(
    kind="bar",
    figsize=(10, 5)
)

plt.title("A/B Testing ML Models")
plt.ylabel("Score")
plt.ylim(0, 1.05)

plt.xticks(rotation=0)

plt.grid(
    axis="y",
    linestyle="--",
    alpha=0.5
)

plt.tight_layout()
plt.show()


# =========================================================
# 🏆 Select Winning Model
# =========================================================

winner = results.loc[
    results["F1 Score"].idxmax(),
    "Model"
]

winner_score = results["F1 Score"].max()

print("\n=== Winner ===")
print(f"Winning Model: {winner}")
print(f"Best F1 Score: {winner_score:.4f}")


# =========================================================
# 📊 Performance Difference
# =========================================================

f1_difference = (
    results["F1 Score"].iloc[1]
    - results["F1 Score"].iloc[0]
)

# Positive values mean Model B scored higher than Model A
print(
    f"\nF1 Score Difference (B - A): {f1_difference:.4f}"
)


# =========================================================
# 💾 Save Results
# =========================================================

results.to_csv(
    "ab_testing_results.csv",
    index=False
)

print(
    "\nResults saved to ab_testing_results.csv"
)


# =========================================================
# 📂 Load Results Back
# =========================================================

saved_results = pd.read_csv(
    "ab_testing_results.csv"
)

print("\nSaved Results Preview:\n")
print(saved_results)
