AW Dev Rethought

🧠 AI with Python – 📈 Monitoring Model Performance Over Time


Description:

Training a machine learning model is only the first step in building an ML system. Once a model is deployed, the real challenge begins: ensuring it continues to perform well as data, users, and business conditions evolve.

A model that achieves excellent accuracy today may gradually become less effective over time. This is why model performance monitoring is a critical part of every production machine learning system.

In this project, we explore how to track model performance metrics over time and identify early signs of degradation.


Why Monitoring Is Important

Many ML projects focus heavily on training and evaluation but pay little attention to what happens after deployment.

In reality, deployed models encounter:

  • changing user behaviour
  • new data distributions
  • evolving business rules
  • seasonal trends
  • unexpected anomalies

Without monitoring, performance issues can remain hidden until they start impacting users or business outcomes.


What Is Model Performance Monitoring?

Model performance monitoring is the process of continuously measuring how well a machine learning model performs after deployment.

The goal is to answer questions such as:

  • Is accuracy declining?
  • Are predictions becoming less reliable?
  • Is the model becoming less confident?
  • Has the data changed significantly?
  • Should the model be retrained?

Monitoring provides visibility into the long-term health of an ML system.


Tracking Key Metrics

A production ML system typically tracks several performance metrics. Common examples include:

  • Accuracy
  • Precision
  • Recall
  • F1 Score
  • Confidence Scores

These metrics provide different perspectives on model quality.


1. Tracking Accuracy Over Time

Accuracy is often the first metric teams monitor.

accuracy = [
    0.96, 0.95, 0.94, 0.93,
    0.92, 0.91, 0.89
]

A gradual decline may indicate:

  • data drift
  • feature drift
  • concept drift

Monitoring trends is often more useful than looking at a single value.
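One way to turn "watch the trend" into something measurable is to smooth the series and estimate its slope. The sketch below does this with the accuracy values above using only the standard library. The helper names `rolling_mean` and `trend_slope` and the `SLOPE_THRESHOLD` value are illustrative assumptions, not part of the original project.

```python
# Daily accuracy values from the example above.
accuracy = [0.96, 0.95, 0.94, 0.93, 0.92, 0.91, 0.89]


def rolling_mean(values, window):
    """Return the mean of each consecutive `window`-sized slice."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def trend_slope(values):
    """Least-squares slope of values against their index (change per step)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


smoothed = rolling_mean(accuracy, window=3)
slope = trend_slope(accuracy)

print("3-day rolling mean:", [round(v, 4) for v in smoothed])
print("Slope per day:", round(slope, 4))

# Hypothetical rule: flag a sustained decline steeper than 0.005 per day.
SLOPE_THRESHOLD = -0.005
if slope < SLOPE_THRESHOLD:
    print("Downward trend detected")
```

A slope-based check reacts to a steady decline even while every individual value is still above a fixed threshold, which is why it can complement a simple cutoff.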


2. Monitoring Precision and Recall

Accuracy alone does not tell the whole story.

A model may maintain accuracy while precision or recall deteriorates.

precision = [...]
recall = [...]

Tracking multiple metrics helps reveal hidden issues.

This is particularly important for:

  • fraud detection
  • healthcare systems
  • recommendation engines

where different types of errors have different consequences.
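To illustrate how accuracy can hide a problem, the sketch below computes all three metrics from confusion-matrix counts for two periods. The counts are made up for illustration (100 positives and 900 negatives in each period), and `classification_metrics` is a hypothetical helper rather than code from the project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up weekly counts for a dataset with 100 positives and 900 negatives.
week_1 = classification_metrics(tp=80, fp=30, fn=20, tn=870)
week_2 = classification_metrics(tp=60, fp=10, fn=40, tn=890)

for label, metrics in [("Week 1", week_1), ("Week 2", week_2)]:
    summary = ", ".join(f"{name}={value:.3f}" for name, value in metrics.items())
    print(f"{label}: {summary}")
```

In this example accuracy is 0.95 in both weeks, yet recall falls from 0.80 to 0.60: the model misses twice as many positive cases, and an accuracy-only dashboard would not show it.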


3. Monitoring F1 Score

F1 Score balances precision and recall.

f1_score = [...]

It provides a more complete picture of model quality when datasets are imbalanced.

A declining F1 score often signals overall model degradation.
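For reference, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below computes it directly; the function name `f1_from` and the input values are chosen only for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# When precision and recall are close, F1 sits between them.
balanced = f1_from(0.95, 0.94)

# When they diverge, F1 is pulled toward the weaker of the two.
lopsided = f1_from(0.90, 0.50)
arithmetic_mean = (0.90 + 0.50) / 2

print("Balanced F1:", round(balanced, 4))
print("Lopsided F1:", round(lopsided, 4))
print("Arithmetic mean for comparison:", round(arithmetic_mean, 4))
```

Because the harmonic mean penalises a gap between the two inputs, a drop in either precision or recall shows up in F1 more strongly than it would in a plain average.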


4. Monitoring Prediction Confidence

Prediction confidence is another valuable signal.

avg_confidence = [...]

If confidence steadily decreases:

  • the model may be encountering unfamiliar data
  • input distributions may have shifted
  • retraining may be necessary

Because confidence can be computed without ground-truth labels, its trend can reveal problems before a drop in accuracy can be measured.
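A minimal sketch of how an average-confidence value might be derived, assuming the model outputs one probability vector per prediction and that "confidence" means the probability of the top class. The batches, the `confidence_summary` helper, and the 0.65 cutoff are all illustrative assumptions. The signal is only as trustworthy as the model's calibration, so it is best read alongside the other metrics.

```python
def confidence_summary(probabilities, low_threshold=0.65):
    """Summarise top-class confidence for a batch of probability vectors."""
    top = [max(row) for row in probabilities]
    return {
        "avg_confidence": sum(top) / len(top),
        # Share of predictions whose top-class probability is below the cutoff.
        "low_confidence_share": sum(p < low_threshold for p in top) / len(top),
    }


# Made-up class probabilities for two batches of four predictions each.
earlier_batch = [[0.97, 0.03], [0.05, 0.95], [0.92, 0.08], [0.10, 0.90]]
recent_batch = [[0.70, 0.30], [0.45, 0.55], [0.62, 0.38], [0.20, 0.80]]

earlier = confidence_summary(earlier_batch)
recent = confidence_summary(recent_batch)

print("Earlier batch:", earlier)
print("Recent batch:", recent)
```

Tracking the share of low-confidence predictions next to the average helps distinguish a broad decline from a small cluster of unfamiliar inputs.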


5. Visualising Trends

Monitoring systems commonly visualise metrics using dashboards.

plt.plot(
    performance_data["date"],
    performance_data["accuracy"]
)

Trend analysis makes it easier to spot:

  • gradual degradation
  • sudden failures
  • unusual spikes
  • seasonal patterns

Visualisation is a core part of production ML observability.


6. Creating Monitoring Alerts

Production systems usually include automated alerts.

if latest_accuracy < 0.90:
    print("Alert")

Real-world implementations may trigger:

  • email notifications
  • Slack alerts
  • PagerDuty incidents
  • monitoring dashboards

This enables teams to respond quickly when performance declines.
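The single `if` check above can be generalised to several metrics at once. The sketch below is one possible shape for that: it compares the latest value of each metric with a minimum and returns the alert messages, leaving delivery (email, Slack, and so on) to whatever notification system a team already uses. The function name `check_thresholds` and the threshold values are assumptions for illustration.

```python
def check_thresholds(latest_metrics, thresholds):
    """Return one alert message per metric that is missing or below its minimum."""
    alerts = []
    for name, minimum in thresholds.items():
        value = latest_metrics.get(name)
        if value is None:
            alerts.append(f"{name}: no value reported")
        elif value < minimum:
            alerts.append(f"{name} = {value:.3f} is below threshold {minimum:.2f}")
    return alerts


# Illustrative latest values and minimum acceptable levels.
latest_metrics = {"accuracy": 0.87, "f1_score": 0.845, "avg_confidence": 0.84}
thresholds = {"accuracy": 0.90, "f1_score": 0.85, "avg_confidence": 0.80}

alerts = check_thresholds(latest_metrics, thresholds)

if alerts:
    for message in alerts:
        print("Alert:", message)
else:
    print("All monitored metrics are within their thresholds")
```

Returning messages instead of printing inside the function keeps the check easy to test and lets the caller decide where alerts are sent.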


Common Causes of Performance Degradation

Data Drift

The distribution of incoming data changes over time.

Feature Drift

Individual features behave differently than during training.

Concept Drift

The relationship between inputs and outputs changes.

Business Changes

New products, users, or processes invalidate previous assumptions.

Monitoring helps identify each of these issues.
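Performance metrics show that something degraded; a distribution comparison can help show what changed in the inputs. One commonly used measure is the Population Stability Index (PSI), sketched below for a single numeric feature. This is a swapped-in technique, not part of the original project: the data is synthetic, the function name is hypothetical, and the 0.1 and 0.25 cutoffs are a widely quoted rule of thumb rather than a fixed standard.

```python
import math
import random


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a reference sample and a newer sample of one feature."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            # Values outside the reference range fall into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e_shares = bin_shares(expected)
    a_shares = bin_shares(actual)
    return sum((a - e) * math.log(a / e) for a, e in zip(a_shares, e_shares))


# Synthetic feature values: training data, similar traffic, shifted traffic.
rng = random.Random(42)
training = [rng.gauss(50, 10) for _ in range(5000)]
similar = [rng.gauss(50, 10) for _ in range(5000)]
shifted = [rng.gauss(58, 10) for _ in range(5000)]

psi_similar = population_stability_index(training, similar)
psi_shifted = population_stability_index(training, shifted)

print("PSI (similar traffic):", round(psi_similar, 4))
print("PSI (shifted traffic):", round(psi_shifted, 4))

# Rule of thumb (an assumption): below 0.1 stable, above 0.25 significant shift.
for label, psi in [("similar", psi_similar), ("shifted", psi_shifted)]:
    status = "significant shift" if psi > 0.25 else "stable" if psi < 0.1 else "moderate shift"
    print(f"{label}: {status}")
```

A check like this covers data and feature drift. Concept drift changes the relationship between inputs and labels, so it generally needs labelled outcomes to detect.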


Where Performance Monitoring Is Used

Model monitoring is essential in:

  • fraud detection platforms
  • recommendation systems
  • healthcare ML applications
  • financial risk models
  • customer analytics systems

Any production ML system requires ongoing performance tracking.


Key Takeaways

  1. Model performance should be monitored continuously after deployment.
  2. Accuracy alone is not sufficient for production monitoring.
  3. Precision, recall, F1-score, and confidence provide additional insights.
  4. Monitoring helps detect drift and degradation early.
  5. Performance tracking is a fundamental MLOps practice.

Conclusion

Deploying a model is not the end of the machine learning lifecycle. Real-world ML systems require continuous monitoring to ensure they remain accurate, reliable, and aligned with changing business conditions. By tracking performance metrics over time, teams can identify issues early and maintain healthy production systems.

This strengthens the ML Systems track in the AI with Python series, which focuses on the operational practices that keep machine learning models performing effectively long after deployment.


Code Snippet:

# 📦 Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt


# =========================================================
# 📝 Create Performance Tracking Data
# =========================================================

performance_data = pd.DataFrame({
    "date": pd.date_range(
        start="2025-01-01",
        periods=10,
        freq="D"
    ),

    "accuracy": [
        0.96, 0.96, 0.95, 0.94, 0.94,
        0.92, 0.91, 0.90, 0.89, 0.87
    ],

    "precision": [
        0.95, 0.95, 0.94, 0.93, 0.92,
        0.91, 0.90, 0.89, 0.87, 0.85
    ],

    "recall": [
        0.94, 0.93, 0.93, 0.92, 0.91,
        0.90, 0.88, 0.87, 0.86, 0.84
    ],

    "f1_score": [
        0.945, 0.94, 0.935, 0.925, 0.915,
        0.905, 0.89, 0.88, 0.865, 0.845
    ],

    "avg_confidence": [
        0.94, 0.94, 0.93, 0.92, 0.91,
        0.90, 0.89, 0.88, 0.86, 0.84
    ]
})


# =========================================================
# 🔍 View Monitoring Data
# =========================================================

print("Model Performance Tracking Data:\n")
print(performance_data)


# =========================================================
# 📈 Plot Accuracy Over Time
# =========================================================

plt.figure(figsize=(8, 4))

plt.plot(
    performance_data["date"],
    performance_data["accuracy"],
    marker="o"
)

plt.title("Model Accuracy Over Time")
plt.xlabel("Date")
plt.ylabel("Accuracy")
plt.grid(True)
plt.tight_layout()
plt.show()


# =========================================================
# 📈 Plot Precision, Recall, and F1-score
# =========================================================

plt.figure(figsize=(8, 4))

plt.plot(
    performance_data["date"],
    performance_data["precision"],
    marker="o",
    label="Precision"
)

plt.plot(
    performance_data["date"],
    performance_data["recall"],
    marker="o",
    label="Recall"
)

plt.plot(
    performance_data["date"],
    performance_data["f1_score"],
    marker="o",
    label="F1 Score"
)

plt.title("Classification Metrics Over Time")
plt.xlabel("Date")
plt.ylabel("Score")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


# =========================================================
# 📉 Plot Average Confidence Over Time
# =========================================================

plt.figure(figsize=(8, 4))

plt.plot(
    performance_data["date"],
    performance_data["avg_confidence"],
    marker="o"
)

plt.title("Average Prediction Confidence Over Time")
plt.xlabel("Date")
plt.ylabel("Average Confidence")
plt.grid(True)
plt.tight_layout()
plt.show()


# =========================================================
# 🚨 Detect Performance Drop
# =========================================================

latest_accuracy = performance_data["accuracy"].iloc[-1]
latest_f1 = performance_data["f1_score"].iloc[-1]

print("\n=== Performance Alerts ===")

if latest_accuracy < 0.90:
    print("⚠️ Alert: Accuracy dropped below threshold")
else:
    print("✅ Accuracy is within healthy range")

if latest_f1 < 0.85:
    print("⚠️ Alert: F1-score dropped below threshold")
else:
    print("✅ F1-score is within healthy range")


# =========================================================
# 📊 Calculate Performance Change
# =========================================================

accuracy_change = (
    performance_data["accuracy"].iloc[-1]
    - performance_data["accuracy"].iloc[0]
)

f1_change = (
    performance_data["f1_score"].iloc[-1]
    - performance_data["f1_score"].iloc[0]
)

confidence_change = (
    performance_data["avg_confidence"].iloc[-1]
    - performance_data["avg_confidence"].iloc[0]
)

print("\n=== Performance Change Summary ===")
print("Accuracy Change:", round(accuracy_change, 4))
print("F1-score Change:", round(f1_change, 4))
print("Confidence Change:", round(confidence_change, 4))


# =========================================================
# 💾 Save Monitoring Data
# =========================================================

performance_data.to_csv(
    "model_performance_tracking.csv",
    index=False
)

print("\nPerformance tracking data saved successfully.")


# =========================================================
# 📂 Read Saved Monitoring Data
# =========================================================

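# parse_dates restores the "date" column as datetimes instead of plain strings
saved_data = pd.read_csv(
    "model_performance_tracking.csv",
    parse_dates=["date"]
)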

print("\nSaved Monitoring Data Preview:\n")
print(saved_data.head())
