AW Dev Rethought

🧠 AI with Python – 🔄 Retraining Strategies (Batch vs Online Learning)


Description:

A machine learning model is not a one-time asset that can be trained and forgotten. Once deployed, the world around the model continues to change.

Customer behaviour evolves. Business processes change. Market conditions shift. New data patterns emerge. As a result, model performance gradually degrades unless the model is updated. This is where retraining strategies become important.

In this project, we explore two common approaches used in production ML systems:

  • Batch Retraining
  • Online Learning

Both aim to keep models accurate, but they do so in very different ways.


Why Retraining Is Necessary

Machine learning models learn patterns from historical data.

Over time, those patterns may become outdated because of:

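  • concept drift (the relationship between inputs and the target changes)
  • feature drift (the distribution of the input features changes)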
  • changing user behaviour
  • new products or services
  • evolving business environments

Without retraining, even highly accurate models can become ineffective.
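
As a rough illustration of how feature drift might be spotted, the sketch below compares each feature's distribution in a reference sample against a more recent sample using a two-sample Kolmogorov-Smirnov test from SciPy. The function name drifted_features, the alpha threshold, and the synthetic data are assumptions made for this example. The check only looks at input distributions, so it would not catch concept drift on its own.

```python
import numpy as np
from scipy.stats import ks_2samp


def drifted_features(reference, current, alpha=0.01):
    """Return indices of columns whose distribution differs between the samples.

    Hypothetical helper for illustration; alpha is an assumed threshold.
    """
    flagged = []
    for col in range(reference.shape[1]):
        result = ks_2samp(reference[:, col], current[:, col])
        if result.pvalue < alpha:
            flagged.append(col)
    return flagged


# Synthetic data: three features, with a shift injected into feature 1 only.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(500, 3))
current = rng.normal(0.0, 1.0, size=(500, 3))
current[:, 1] += 2.0

flagged = drifted_features(reference, current)
print("Drifted feature indices:", flagged)
```

A flagged feature is a prompt to investigate, not proof that the model has degraded. Model quality still has to be checked against labelled data.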


What Is Batch Retraining?

Batch retraining involves rebuilding a model periodically using a larger, updated dataset.

The workflow usually looks like:

Collect New Data
        ↓
Combine Historical Data
        ↓
Retrain Model
        ↓
Deploy New Version

The old model is replaced by a newly trained model.
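
The workflow above can be sketched as a single function. This is a minimal sketch under stated assumptions: the name batch_retrain, the synthetic dataset, and the choice to scale features in a pipeline are all illustrative and not part of the original project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def batch_retrain(X_history, y_history, X_new, y_new):
    """Combine historical and new data, then train a fresh model on all of it."""
    X_all = np.vstack([X_history, X_new])
    y_all = np.concatenate([y_history, y_new])
    # A brand-new estimator is built each time; nothing is carried over.
    model = make_pipeline(StandardScaler(), SGDClassifier(random_state=42))
    model.fit(X_all, y_all)
    return model, X_all, y_all


# Synthetic stand-in for "historical" and "newly collected" records.
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_history, y_history = X[:400], y[:400]
X_new, y_new = X[400:], y[400:]

new_model, X_all, y_all = batch_retrain(X_history, y_history, X_new, y_new)
print("Rows used for retraining:", len(X_all))
```

Returning the combined arrays makes it easy to carry them forward as the history for the next retraining cycle.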


Training the Initial Model

We first train a model using available data.

batch_model.fit(
    X_initial,
    y_initial
)

This represents the initial production model.


Periodic Retraining

After collecting new data, we retrain the model.

batch_model.fit(
    X,
    y
)

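With the default settings, calling fit again trains the model from scratch on the complete updated dataset. Here X and y contain both the initial and the newly collected records.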


Advantages of Batch Retraining

Batch retraining provides:

  • stable training process
  • access to full historical context
  • often higher accuracy
  • easier validation and testing

It is widely used in traditional ML pipelines.
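
One reason validation is easier with batch retraining is that the old and new models can be compared on the same held-out data before anything is deployed. The sketch below shows one possible promotion gate. The helper make_model, the variable names, and the "at least as accurate" rule are assumptions for illustration, not a prescribed policy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold out a validation set that neither model ever trains on.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Pretend only the first half of the training data existed at first deployment.
half = len(X_train) // 2


def make_model():
    """Hypothetical factory so both models share the same configuration."""
    return make_pipeline(StandardScaler(), SGDClassifier(random_state=42))


current_model = make_model().fit(X_train[:half], y_train[:half])
candidate_model = make_model().fit(X_train, y_train)

current_acc = accuracy_score(y_val, current_model.predict(X_val))
candidate_acc = accuracy_score(y_val, candidate_model.predict(X_val))

# Promote the candidate only if it is at least as accurate on held-out data.
deployed_model = candidate_model if candidate_acc >= current_acc else current_model
print("Current:", round(current_acc, 4), "Candidate:", round(candidate_acc, 4))
```

In practice the comparison metric and the promotion rule depend on the application; accuracy is used here only to keep the example short.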


Limitations of Batch Retraining

However, batch retraining:

  • requires more compute resources
  • may take longer to execute
  • updates only at scheduled intervals

A model may remain outdated between retraining cycles.


What Is Online Learning?

Online learning updates the model continuously as new data arrives.

Instead of retraining from scratch, the model learns incrementally.

The workflow becomes:

New Data Arrives
        ↓
Update Model
        ↓
Continue Serving

The model evolves continuously.
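
The loop above might look like the following sketch, where a generator stands in for data arriving over time. The name stream_batches, the batch size, and the synthetic dataset are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier


def stream_batches(X, y, batch_size):
    """Yield consecutive mini-batches, standing in for data arriving over time."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]


X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = SGDClassifier(random_state=0)
classes = np.unique(y)

updates = 0
for X_batch, y_batch in stream_batches(X, y, batch_size=50):
    # The same model object keeps serving; only its weights are adjusted.
    model.partial_fit(X_batch, y_batch, classes=classes)
    updates += 1

print("Incremental updates applied:", updates)
```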


Initial Online Training

online_model.partial_fit(
    X_initial,
    y_initial,
    classes=[0, 1]
)

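The model starts with an initial training phase. The classes argument is required on the first call to partial_fit, because a later batch may not contain every label.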


Incremental Updates

As new data arrives:

online_model.partial_fit(
    X_batch,
    y_batch
)

Each call adjusts the existing weights using only the new batch, so the model is not rebuilt from scratch.
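
Gradient-based learners such as SGDClassifier are sensitive to feature scale, and in a streaming setting the scaling statistics also have to be learned incrementally. The sketch below is one possible way to do that, using StandardScaler.partial_fit alongside the model. It is an addition for illustration and is not part of the original script, which trains on unscaled features.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

scaler = StandardScaler()
model = SGDClassifier(random_state=42)
classes = np.unique(y)

batch_size = 50
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]

    scaler.partial_fit(X_batch)            # update the running mean and variance
    X_scaled = scaler.transform(X_batch)   # scale using statistics seen so far
    model.partial_fit(X_scaled, y_batch, classes=classes)

print("Samples seen by scaler:", scaler.n_samples_seen_)
```

Early batches are scaled with less reliable statistics than later ones, which is a trade-off of learning the scaler online.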


Advantages of Online Learning

Online learning offers:

  • continuous adaptation
  • lower retraining costs
  • support for streaming data
  • faster reaction to changing environments

It is especially useful when data changes rapidly.


Limitations of Online Learning

Online learning can also introduce challenges:

  • greater sensitivity to noisy data
  • harder debugging
  • more complex monitoring
  • risk of learning undesirable patterns

Careful monitoring becomes essential.
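
One common way to monitor an online model is "test-then-train" evaluation: each incoming batch is scored before the model learns from it, and a rolling average of those scores is tracked. The sketch below assumes labels arrive together with each batch. The window size and ALERT_THRESHOLD are hypothetical values chosen for the example.

```python
from collections import deque

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
model = SGDClassifier(random_state=1)
classes = np.unique(y)

batch_size = 50
# Warm up on the first batch so the model can make predictions at all.
model.partial_fit(X[:batch_size], y[:batch_size], classes=classes)

recent_scores = deque(maxlen=5)   # rolling window of per-batch accuracies
alerts = []                       # batch offsets where the rolling score dipped
ALERT_THRESHOLD = 0.6             # hypothetical threshold; tune per application

for start in range(batch_size, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]

    # Test first: score the batch before the model has learned from it.
    batch_acc = float(np.mean(model.predict(X_batch) == y_batch))
    recent_scores.append(batch_acc)
    rolling_acc = sum(recent_scores) / len(recent_scores)
    if rolling_acc < ALERT_THRESHOLD:
        alerts.append(start)

    # Then train on the same batch.
    model.partial_fit(X_batch, y_batch)

print("Rolling accuracy over the last batches:", round(rolling_acc, 4))
```

Because every score comes from data the model had not yet seen, a sustained drop in the rolling value is a more honest warning sign than accuracy measured after the update.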


Batch vs Online Learning

Batch Retraining

Best when:

  • data changes slowly
  • training resources are available
  • model stability is important

Examples:

  • monthly forecasting
  • customer segmentation
  • demand prediction

Online Learning

Best when:

  • data changes rapidly
  • streaming data exists
  • real-time adaptation is required

Examples:

  • recommendation systems
  • fraud detection
  • ad-click prediction
  • personalization systems

How Companies Use Retraining

Many production systems use a hybrid approach.

Example:

  • online updates throughout the day
  • full batch retraining weekly

This combines adaptability with stability.
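
A hybrid schedule could be sketched as follows: incremental updates on every mini-batch, with a full retrain on all accumulated data at a fixed interval. The interval FULL_RETRAIN_EVERY, the in-memory buffer, and the synthetic data are assumptions for illustration. A real system would typically read the accumulated data from storage and validate the new model before swapping it in.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1200, n_features=10, random_state=7)
classes = np.unique(y)

batch_size = 100
FULL_RETRAIN_EVERY = 4   # hypothetical: full retrain after every 4 mini-batches

model = SGDClassifier(random_state=7)
seen_X, seen_y = [], []
full_retrains = 0

for i, start in enumerate(range(0, len(X), batch_size), start=1):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    seen_X.append(X_batch)
    seen_y.append(y_batch)

    # Cheap incremental update between full retrains.
    model.partial_fit(X_batch, y_batch, classes=classes)

    if i % FULL_RETRAIN_EVERY == 0:
        # Periodic full retrain on everything accumulated so far.
        fresh = SGDClassifier(random_state=7)
        fresh.fit(np.vstack(seen_X), np.concatenate(seen_y))
        model = fresh   # later partial_fit calls continue from these weights
        full_retrains += 1

print("Full retrains performed:", full_retrains)
```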


Why Retraining Matters in MLOps

Retraining is a core part of machine learning operations because it helps:

  • maintain model accuracy
  • combat concept drift
  • respond to changing environments
  • extend model lifespan

Without retraining, model performance tends to decline as the data drifts away from what the model was trained on.


Key Takeaways

  1. Machine learning models require updates after deployment.
  2. Batch retraining rebuilds models periodically using accumulated data.
  3. Online learning updates models continuously as new data arrives.
  4. Each strategy has different trade-offs in cost, speed, and adaptability.
  5. Retraining is a critical component of production ML systems.

Conclusion

Machine learning models operate in environments that constantly change. Choosing the right retraining strategy is essential for maintaining performance over time. Batch retraining offers stability and comprehensive learning, while online learning provides rapid adaptation to new information. Understanding when to use each approach is a key skill in building reliable production ML systems.


Code Snippet:

# 📦 Import Required Libraries
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.linear_model import SGDClassifier


# =========================================================
# 🧩 Load Dataset
# =========================================================

data = load_breast_cancer()

X = pd.DataFrame(
    data.data,
    columns=data.feature_names
)

y = data.target


# =========================================================
# ✂️ Split Initial and Future Data
# =========================================================

# Initial training data
# Future data simulates newly arriving records

X_initial, X_future, y_initial, y_future = train_test_split(
    X,
    y,
    test_size=0.30,
    random_state=42,
    stratify=y
)


# =========================================================
# 🧠 PART 1 – BATCH RETRAINING
# =========================================================

print("=== Batch Retraining ===\n")

# ---------------------------------------------------------
# Initial Model Training
# ---------------------------------------------------------

batch_model = SGDClassifier(
    random_state=42
)

batch_model.fit(
    X_initial,
    y_initial
)

# ---------------------------------------------------------
# Initial Evaluation
# ---------------------------------------------------------

initial_predictions = batch_model.predict(
    X_future
)

initial_accuracy = accuracy_score(
    y_future,
    initial_predictions
)

print(
    "Initial Accuracy:",
    round(initial_accuracy, 4)
)


# ---------------------------------------------------------
# Simulate Batch Retraining
# ---------------------------------------------------------

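# New data has arrived.
# Retrain using the full available dataset.
# Note: X_future is part of X, so the accuracy below is measured on
# records the retrained model has already seen and is optimistic.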

batch_model.fit(
    X,
    y
)

retrained_predictions = batch_model.predict(
    X_future
)

retrained_accuracy = accuracy_score(
    y_future,
    retrained_predictions
)

print(
    "Batch Retrained Accuracy:",
    round(retrained_accuracy, 4)
)


# =========================================================
# 🧠 PART 2 – ONLINE LEARNING
# =========================================================

print("\n=== Online Learning ===\n")

# ---------------------------------------------------------
# Create Online Model
# ---------------------------------------------------------

online_model = SGDClassifier(
    random_state=42
)

# ---------------------------------------------------------
# Initial Training
# ---------------------------------------------------------

online_model.partial_fit(
    X_initial,
    y_initial,
    classes=[0, 1]
)

# ---------------------------------------------------------
# Initial Accuracy
# ---------------------------------------------------------

online_initial_predictions = online_model.predict(
    X_future
)

online_initial_accuracy = accuracy_score(
    y_future,
    online_initial_predictions
)

print(
    "Initial Online Accuracy:",
    round(online_initial_accuracy, 4)
)


# ---------------------------------------------------------
# Simulate Streaming Updates
# ---------------------------------------------------------

batch_size = 20

for start in range(
    0,
    len(X_future),
    batch_size
):

    end = start + batch_size

    X_batch = X_future.iloc[start:end]
    # y_future is a NumPy array (from data.target), so it has no .iloc;
    # slice it directly instead.
    y_batch = y_future[start:end]

    online_model.partial_fit(
        X_batch,
        y_batch
    )


# ---------------------------------------------------------
# Evaluate Updated Online Model
# (X_future was just used for the updates, so this score is optimistic)
# ---------------------------------------------------------

online_predictions = online_model.predict(
    X_future
)

online_accuracy = accuracy_score(
    y_future,
    online_predictions
)

print(
    "Online Learning Accuracy:",
    round(online_accuracy, 4)
)


# =========================================================
# 📊 Comparison Summary
# =========================================================

summary = pd.DataFrame({
    "Strategy": [
        "Batch Retraining",
        "Online Learning"
    ],

    "Accuracy": [
        retrained_accuracy,
        online_accuracy
    ]
})

print("\n=== Strategy Comparison ===")
print(summary)


# =========================================================
# 💾 Save Results
# =========================================================

summary.to_csv(
    "retraining_strategy_comparison.csv",
    index=False
)

print(
    "\nResults saved to retraining_strategy_comparison.csv"
)


# =========================================================
# 📂 Load Results Back
# =========================================================

saved_results = pd.read_csv(
    "retraining_strategy_comparison.csv"
)

print("\nSaved Results:\n")
print(saved_results)
