🧠 AI with Python – 🔄 Retraining Strategies (Batch vs Online Learning)
Posted on: June 16, 2026
Description:
A machine learning model is not a one-time asset that can be trained and forgotten. Once deployed, the world around the model continues to change.
Customer behaviour evolves. Business processes change. Market conditions shift. New data patterns emerge. As a result, model performance gradually degrades unless the model is updated. This is where retraining strategies become important.
In this project, we explore two common approaches used in production ML systems:
- Batch Retraining
- Online Learning
Both aim to keep models accurate, but they do so in very different ways.
Why Retraining Is Necessary
Machine learning models learn patterns from historical data.
Over time, those patterns may become outdated because of:
- concept drift
- feature drift
- changing user behaviour
- new products or services
- evolving business environments
Without retraining, even highly accurate models can become ineffective.
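To make the idea of concept drift concrete, here is a minimal sketch on synthetic data. The labelling rules (x0 + x1 before drift, x0 - x1 after) and the make_data helper are assumptions invented for illustration; they are not part of the project code.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_data(n, drifted=False):
    """Hypothetical helper: two Gaussian features with a labelling rule that changes after drift."""
    X = rng.normal(size=(n, 2))
    # Before drift the label depends on x0 + x1; after drift it depends on x0 - x1.
    score = X[:, 0] - X[:, 1] if drifted else X[:, 0] + X[:, 1]
    return X, (score > 0).astype(int)

X_train, y_train = make_data(2000)
X_same, y_same = make_data(1000)                 # same pattern as training
X_drift, y_drift = make_data(1000, drifted=True) # pattern has changed

model = SGDClassifier(random_state=42).fit(X_train, y_train)

acc_same = accuracy_score(y_same, model.predict(X_same))
acc_drift = accuracy_score(y_drift, model.predict(X_drift))
print("Accuracy, same pattern:   ", round(acc_same, 4))
print("Accuracy, drifted pattern:", round(acc_drift, 4))
```

The model itself never changes between the two evaluations; only the relationship between features and labels does, which is exactly the situation retraining is meant to address.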
What Is Batch Retraining?
Batch retraining involves rebuilding a model periodically using a larger, updated dataset.
The workflow usually looks like:
Collect New Data
↓
Combine Historical Data
↓
Retrain Model
↓
Deploy New Version
The old model is replaced by a newly trained model.
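The workflow above can be sketched roughly as follows. The batch_retrain function, the synthetic dataset, and the rule of promoting the new model only when it does at least as well on a validation set are all assumptions made for this sketch; real pipelines usually add versioning and more thorough checks.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def batch_retrain(X_hist, y_hist, X_new, y_new, current_model, X_val, y_val):
    """Hypothetical helper: combine data, retrain from scratch, keep the better model."""
    # Combine historical data with the newly collected data.
    X_all = np.vstack([X_hist, X_new])
    y_all = np.concatenate([y_hist, y_new])

    # Retrain a brand-new model on the combined dataset.
    candidate = SGDClassifier(random_state=42).fit(X_all, y_all)

    # Compare old and new models on data neither was trained on.
    current_acc = accuracy_score(y_val, current_model.predict(X_val))
    candidate_acc = accuracy_score(y_val, candidate.predict(X_val))

    # "Deploy" here simply means returning the model that should serve next.
    deployed = candidate if candidate_acc >= current_acc else current_model
    return deployed, current_acc, candidate_acc

X, y = make_classification(n_samples=1500, n_features=10, random_state=0)
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
X_hist, X_new, y_hist, y_new = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest
)

current = SGDClassifier(random_state=42).fit(X_hist, y_hist)
deployed, current_acc, candidate_acc = batch_retrain(
    X_hist, y_hist, X_new, y_new, current, X_val, y_val
)
print("Current model accuracy:  ", round(current_acc, 4))
print("Candidate model accuracy:", round(candidate_acc, 4))
print("Candidate promoted:", deployed is not current)
```

Holding back a validation set that neither model was trained on is what makes the comparison meaningful.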
Training the Initial Model
We first train a model using available data.
batch_model.fit(
    X_initial,
    y_initial
)
This represents the initial production model.
Periodic Retraining
After collecting new data, we retrain the model.
batch_model.fit(
    X,
    y
)
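Calling fit again re-initializes the model (scikit-learn's default, warm_start=False), so it is rebuilt from scratch on the complete updated dataset rather than adjusted from its previous state.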
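As a small check of what a second call to fit does, the sketch below compares a model that was fitted twice with one fitted once on the same full data. It assumes scikit-learn's default warm_start=False and uses a synthetic dataset; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_initial, y_initial = X[:300], y[:300]

# Model A: fitted on the initial data, then refitted on everything.
refit_model = SGDClassifier(random_state=42)
refit_model.fit(X_initial, y_initial)
refit_model.fit(X, y)

# Model B: fitted once, directly on everything.
fresh_model = SGDClassifier(random_state=42)
fresh_model.fit(X, y)

# If fit starts from scratch, the earlier fit leaves no trace in model A.
same = np.allclose(refit_model.coef_, fresh_model.coef_)
print("Identical coefficients:", same)
```

In other words, batch retraining with fit does not carry anything over from the previous model unless warm starting is enabled explicitly.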
Advantages of Batch Retraining
Batch retraining provides:
- stable training process
- access to full historical context
- often higher accuracy
- easier validation and testing
It is widely used in traditional ML pipelines.
Limitations of Batch Retraining
However, batch retraining:
- requires more compute resources
- may take longer to execute
- updates only at scheduled intervals
A model may remain outdated between retraining cycles.
What Is Online Learning?
Online learning updates the model continuously as new data arrives.
Instead of retraining from scratch, the model learns incrementally.
The workflow becomes:
New Data Arrives
↓
Update Model
↓
Continue Serving
The model evolves continuously.
Initial Online Training
online_model.partial_fit(
    X_initial,
    y_initial,
    classes=[0, 1]
)
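The model starts with an initial training phase. The classes argument is required on the first call to partial_fit, because an early batch may not contain every label the model will see later.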
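The role of the classes argument is easiest to see with a first batch that contains only one label. The sketch below uses small synthetic batches invented for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# A first mini-batch that happens to contain only class 0 (synthetic data).
X_first = rng.normal(loc=-2.0, size=(20, 3))
y_first = np.zeros(20, dtype=int)

# Without `classes`, the very first partial_fit call is rejected.
try:
    SGDClassifier(random_state=42).partial_fit(X_first, y_first)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# With `classes`, the model knows about both labels from the start.
model = SGDClassifier(random_state=42)
model.partial_fit(X_first, y_first, classes=[0, 1])

# A later batch can then introduce class 1 without any special handling.
X_later = rng.normal(loc=2.0, size=(20, 3))
y_later = np.ones(20, dtype=int)
model.partial_fit(X_later, y_later)

print("Known classes:", model.classes_)
```

After the first call, classes can be omitted, which is why the incremental updates only pass the features and labels.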
Incremental Updates
As new data arrives:
online_model.partial_fit(
    X_batch,
    y_batch
)
Each partial_fit call makes a single pass over the batch and adjusts the existing weights, so the model learns without rebuilding itself entirely.
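One common way to measure an online model fairly is to score each incoming batch before training on it, sometimes called test-then-train evaluation. The sketch below is an illustration on synthetic data; the batch size and dataset are assumptions, not values from the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_initial, y_initial = X[:200], y[:200]
X_stream, y_stream = X[200:], y[200:]

online_model = SGDClassifier(random_state=42)
online_model.partial_fit(X_initial, y_initial, classes=np.unique(y))

batch_size = 50
batch_accuracies = []
for start in range(0, len(X_stream), batch_size):
    X_batch = X_stream[start:start + batch_size]
    y_batch = y_stream[start:start + batch_size]

    # Test first: score the batch before the model has seen its labels.
    batch_accuracies.append(online_model.score(X_batch, y_batch))

    # Then train: fold the batch into the model.
    online_model.partial_fit(X_batch, y_batch)

print("Batches processed:", len(batch_accuracies))
print("Mean test-then-train accuracy:", round(float(np.mean(batch_accuracies)), 4))
```

Because every score is computed on data the model has not yet learned from, the running average is a more honest picture than evaluating on the same records used for the updates.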
Advantages of Online Learning
Online learning offers:
- continuous adaptation
- lower retraining costs
- support for streaming data
- faster reaction to changing environments
It is especially useful when data changes rapidly.
Limitations of Online Learning
Online learning can also introduce challenges:
- greater sensitivity to noisy data
- harder debugging
- more complex monitoring
- risk of learning undesirable patterns
Careful monitoring becomes essential.
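One possible safeguard against noisy or harmful updates is to apply each update to a copy of the model and keep it only if accuracy on a fixed holdout set does not fall too far. The guarded_update helper, the max_drop threshold, and the flipped-label batch below are all hypothetical choices for this sketch, not a standard scikit-learn feature.

```python
import copy
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

def guarded_update(model, X_batch, y_batch, X_holdout, y_holdout, max_drop=0.05):
    """Hypothetical helper: try partial_fit on a copy, keep it only if holdout accuracy survives."""
    before = model.score(X_holdout, y_holdout)
    candidate = copy.deepcopy(model)
    candidate.partial_fit(X_batch, y_batch)
    after = candidate.score(X_holdout, y_holdout)
    if after >= before - max_drop:
        return candidate, True   # update accepted
    return model, False          # update rejected, previous model kept

X, y = make_classification(n_samples=1200, n_features=10, random_state=0)
X_initial, y_initial = X[:400], y[:400]
X_holdout, y_holdout = X[400:700], y[400:700]
X_clean, y_clean = X[700:900], y[700:900]
X_noisy, y_noisy = X[900:], 1 - y[900:]   # labels deliberately flipped

model = SGDClassifier(random_state=42)
model.partial_fit(X_initial, y_initial, classes=[0, 1])
baseline = model.score(X_holdout, y_holdout)

model, accepted_clean = guarded_update(model, X_clean, y_clean, X_holdout, y_holdout)
model, accepted_noisy = guarded_update(model, X_noisy, y_noisy, X_holdout, y_holdout)

print("Baseline holdout accuracy:", round(baseline, 4))
print("Clean batch accepted:", accepted_clean)
print("Noisy batch accepted:", accepted_noisy)
print("Final holdout accuracy:", round(model.score(X_holdout, y_holdout), 4))
```

The guard limits how much damage any single batch can do, at the cost of an extra evaluation per update and the need to keep the holdout set representative.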
Batch vs Online Learning
Batch Retraining
Best when:
- data changes slowly
- training resources are available
- model stability is important
Examples:
- monthly forecasting
- customer segmentation
- demand prediction
Online Learning
Best when:
- data changes rapidly
- streaming data exists
- real-time adaptation is required
Examples:
- recommendation systems
- fraud detection
- ad-click prediction
- personalization systems
How Companies Use Retraining
Many production systems use a hybrid approach.
Example:
- online updates throughout the day
- full batch retraining weekly
This combines adaptability with stability.
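A hybrid schedule can be sketched as a loop that applies a cheap online update for every batch and a full retrain at a fixed interval. The interval (refit_every), batch size, and synthetic data below are assumptions standing in for "throughout the day" and "weekly".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2200, n_features=10, random_state=0)
X_initial, y_initial = X[:200], y[:200]
X_stream, y_stream = X[200:], y[200:]

batch_size = 50      # one online update per batch
refit_every = 10     # hypothetical: full retrain after every 10 online updates

# Everything seen so far is kept for the periodic full retrain.
X_seen, y_seen = [X_initial], [y_initial]

model = SGDClassifier(random_state=42)
model.partial_fit(X_initial, y_initial, classes=[0, 1])

full_refits = 0
for i, start in enumerate(range(0, len(X_stream), batch_size), start=1):
    X_batch = X_stream[start:start + batch_size]
    y_batch = y_stream[start:start + batch_size]

    # Online step: incremental update on the newest batch only.
    model.partial_fit(X_batch, y_batch)
    X_seen.append(X_batch)
    y_seen.append(y_batch)

    # Batch step: periodically rebuild the model from all accumulated data.
    if i % refit_every == 0:
        model = SGDClassifier(random_state=42).fit(
            np.vstack(X_seen), np.concatenate(y_seen)
        )
        full_refits += 1

print("Online updates:", i)
print("Full retrains: ", full_refits)
```

The online steps keep the model current between retrains, while each full retrain resets any error the incremental updates may have accumulated.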
Why Retraining Matters in MLOps
Retraining is a core part of machine learning operations because it helps:
- maintain model accuracy
- combat concept drift
- respond to changing environments
- extend model lifespan
Without retraining, model performance tends to decline as the data the model sees in production moves away from the data it was trained on.
Key Takeaways
- Machine learning models require updates after deployment.
- Batch retraining rebuilds models periodically using accumulated data.
- Online learning updates models continuously as new data arrives.
- Each strategy has different trade-offs in cost, speed, and adaptability.
- Retraining is a critical component of production ML systems.
Conclusion
Machine learning models operate in environments that constantly change. Choosing the right retraining strategy is essential for maintaining performance over time. Batch retraining offers stability and comprehensive learning, while online learning provides rapid adaptation to new information. Understanding when to use each approach is a key skill in building reliable production ML systems.
Code Snippet:
# 📦 Import Required Libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import SGDClassifier
# =========================================================
# 🧩 Load Dataset
# =========================================================
data = load_breast_cancer()
X = pd.DataFrame(
data.data,
columns=data.feature_names
)
y = data.target
# =========================================================
# ✂️ Split Initial and Future Data
# =========================================================
# Initial training data
# Future data simulates newly arriving records
X_initial, X_future, y_initial, y_future = train_test_split(
X,
y,
test_size=0.30,
random_state=42,
stratify=y
)
# =========================================================
# 🧠 PART 1 – BATCH RETRAINING
# =========================================================
print("=== Batch Retraining ===\n")
# ---------------------------------------------------------
# Initial Model Training
# ---------------------------------------------------------
batch_model = SGDClassifier(
random_state=42
)
batch_model.fit(
X_initial,
y_initial
)
# ---------------------------------------------------------
# Initial Evaluation
# ---------------------------------------------------------
initial_predictions = batch_model.predict(
X_future
)
initial_accuracy = accuracy_score(
y_future,
initial_predictions
)
print(
"Initial Accuracy:",
round(initial_accuracy, 4)
)
# ---------------------------------------------------------
# Simulate Batch Retraining
# ---------------------------------------------------------
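# New data has arrived.
# Retrain using the full available dataset.
# Note: X_future is part of X, so the accuracy measured below is on
# data the retrained model has already seen and will look optimistic.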
batch_model.fit(
X,
y
)
retrained_predictions = batch_model.predict(
X_future
)
retrained_accuracy = accuracy_score(
y_future,
retrained_predictions
)
print(
"Batch Retrained Accuracy:",
round(retrained_accuracy, 4)
)
# =========================================================
# 🧠 PART 2 – ONLINE LEARNING
# =========================================================
print("\n=== Online Learning ===\n")
# ---------------------------------------------------------
# Create Online Model
# ---------------------------------------------------------
online_model = SGDClassifier(
random_state=42
)
# ---------------------------------------------------------
# Initial Training
# ---------------------------------------------------------
online_model.partial_fit(
X_initial,
y_initial,
classes=[0, 1]
)
# ---------------------------------------------------------
# Initial Accuracy
# ---------------------------------------------------------
online_initial_predictions = online_model.predict(
X_future
)
online_initial_accuracy = accuracy_score(
y_future,
online_initial_predictions
)
print(
"Initial Online Accuracy:",
round(online_initial_accuracy, 4)
)
# ---------------------------------------------------------
# Simulate Streaming Updates
# ---------------------------------------------------------
batch_size = 20
for start in range(
    0,
    len(X_future),
    batch_size
):
    end = start + batch_size
    X_batch = X_future.iloc[start:end]
    # y_future is a NumPy array (from data.target), so slice it directly
    y_batch = y_future[start:end]
    online_model.partial_fit(
        X_batch,
        y_batch
    )
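# ---------------------------------------------------------
# Evaluate Updated Online Model
# ---------------------------------------------------------
# Note: X_future was just used for the streaming updates, so this
# score is measured on seen data and will look optimistic.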
online_predictions = online_model.predict(
X_future
)
online_accuracy = accuracy_score(
y_future,
online_predictions
)
print(
"Online Learning Accuracy:",
round(online_accuracy, 4)
)
# =========================================================
# 📊 Comparison Summary
# =========================================================
summary = pd.DataFrame({
"Strategy": [
"Batch Retraining",
"Online Learning"
],
"Accuracy": [
retrained_accuracy,
online_accuracy
]
})
print("\n=== Strategy Comparison ===")
print(summary)
# =========================================================
# 💾 Save Results
# =========================================================
summary.to_csv(
"retraining_strategy_comparison.csv",
index=False
)
print(
"\nResults saved to retraining_strategy_comparison.csv"
)
# =========================================================
# 📂 Load Results Back
# =========================================================
saved_results = pd.read_csv(
"retraining_strategy_comparison.csv"
)
print("\nSaved Results:\n")
print(saved_results)