AW Dev Rethought

🕵️ Debugging is like being the detective in a crime movie where you are also the murderer - Filipe Fortes

⚡️ Saturday ML Spark – 🔄 Train vs Inference Pipeline Consistency


Description:

Many machine learning projects achieve excellent results during development but perform poorly after deployment. Surprisingly, the model itself is often not the problem. A common reason is pipeline inconsistency: the preprocessing applied during training differs from what happens during inference.

In this project, we explore why train and inference consistency matters and how pipelines help prevent one of the most common production ML mistakes.


Understanding the Problem

Machine learning models learn patterns from the data they receive during training.

If the training data is transformed in one way and production data is transformed differently, the model begins receiving inputs it was never trained to understand.

Even a highly accurate model can produce unreliable predictions under these conditions.


What Is Pipeline Consistency?

Pipeline consistency means that the exact same preprocessing steps used during training, with the same parameters learned from the training data, must also be applied during inference.

This includes:

  • scaling
  • encoding
  • feature engineering
  • missing value handling
  • feature selection

The model should see data in the same format during both phases.
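As a rough illustration of how several of the steps listed above can live in one object, here is a minimal sketch. The toy columns `income` and `region`, the labels, and the choice of median imputation are assumptions made for the example, not part of the original project.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric column with a gap, one categorical column.
train = pd.DataFrame({
    "income": [30.0, 45.0, np.nan, 80.0, 52.0, 61.0],
    "region": ["north", "south", "south", "east", "north", "east"],
})
y_train = [0, 0, 1, 1, 0, 1]

# Missing value handling and scaling for the numeric column.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Encoding for the categorical column; unseen categories are ignored at inference.
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

full_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])
full_pipeline.fit(train, y_train)

# New rows go through the same fitted imputer, scaler, and encoder.
new_rows = pd.DataFrame({"income": [np.nan, 70.0], "region": ["west", "north"]})
new_predictions = full_pipeline.predict(new_rows)
print(new_predictions)
```

The missing income in the new rows is filled with the median learned from the training data, not a value recomputed from the incoming batch.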


The Common Mistake

Consider a workflow where training data is scaled.

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)

model.fit(X_train_scaled, y_train)

The model learns patterns from scaled values.

However, during inference:

model.predict(X_test)

raw values are passed directly to the model.

This creates a mismatch between training and prediction environments.


Why This Causes Problems

The model was trained on scaled feature values but receives raw feature values during inference.

As a result:

  • prediction quality drops
  • model behaviour becomes unstable
  • production performance differs from testing results

This is one of the most common causes of ML deployment failures.
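To make the mismatch concrete, the sketch below compares the feature statistics the model sees during training with those of unscaled test data. It assumes the same breast cancer dataset and split settings as the code snippet at the end of this post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# What the model saw during training: roughly zero mean and unit variance.
train_mean = X_train_scaled.mean(axis=0)
train_std = X_train_scaled.std(axis=0)

# What it receives if inference skips the scaler: values in their original units.
raw_test_mean = X_test.mean(axis=0)

print("Largest |mean| in scaled training features:", abs(train_mean).max())
print("Largest |mean| in raw test features:", abs(raw_test_mean).max())
```

The learned coefficients were fitted against the first scale, so applying them to the second produces scores the model was never calibrated for.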


The Correct Solution: Pipelines

Scikit-learn Pipelines solve this problem elegantly.

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression())
])

The pipeline stores both:

  • preprocessing logic
  • model logic

as a single reusable workflow.


Training the Pipeline

Instead of manually scaling data, fit the pipeline directly on the raw training data:

pipeline.fit(X_train, y_train)

The pipeline automatically learns:

  • scaling parameters
  • model parameters

together.
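One way to see that both sets of parameters are stored in the fitted object is to inspect its steps. This is a small sketch that fits on the full breast cancer dataset purely for inspection; the variable names `fitted_scaler` and `fitted_model` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=5000)),
])
pipeline.fit(X, y)

# Each step keeps what it learned during the single fit call.
fitted_scaler = pipeline.named_steps["scaler"]
fitted_model = pipeline.named_steps["model"]

print("Stored feature means:", fitted_scaler.mean_.shape)
print("Stored model coefficients:", fitted_model.coef_.shape)
```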


Inference Using the Same Pipeline

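Predictions become a single call on raw inputs:

predictions = pipeline.predict(X_test)

The pipeline applies the scaler fitted during training before generating predictions, so the model receives inputs in the same form it was trained on, provided the raw inputs keep the same columns.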
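As a sanity check, the pipeline's prediction can be compared against the manual two-step equivalent. The sketch below assumes the same dataset and split settings as the code snippet at the end of this post.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=5000)),
])
pipeline.fit(X_train, y_train)

# One call: the pipeline scales and predicts.
pipeline_predictions = pipeline.predict(X_test)

# Manual equivalent: apply the fitted scaler, then the fitted model.
X_test_scaled = pipeline.named_steps["scaler"].transform(X_test)
manual_predictions = pipeline.named_steps["model"].predict(X_test_scaled)

print(np.array_equal(pipeline_predictions, manual_predictions))
```

The two paths agree because the pipeline does exactly the manual steps; it just removes the opportunity to forget one.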


Why Pipelines Are Important in Production

Pipelines help:

  • prevent preprocessing mistakes
  • simplify deployment
  • improve reproducibility
  • reduce maintenance effort
  • ensure consistent model behaviour

They are considered a best practice in modern ML systems.


Real-World Examples

Pipeline consistency is critical in:

  • fraud detection systems
  • recommendation engines
  • healthcare ML
  • financial risk models
  • customer analytics platforms

Any mismatch between training and inference can lead to incorrect business decisions.


Key Takeaways

  1. Training and inference must use identical preprocessing steps.
  2. Inconsistent transformations can significantly degrade model performance.
  3. Scaling, encoding, and feature engineering should be part of the pipeline.
  4. Pipelines bundle preprocessing and modeling into a single workflow.
  5. Pipeline consistency is a fundamental production ML practice.

Conclusion

Many machine learning failures occur not because the model is poor, but because the data presented during inference differs from the data used during training. Maintaining train and inference pipeline consistency ensures reliable predictions and smoother deployments.

This strengthens the ML Systems track in Saturday ML Spark ⚡️, focusing on the engineering practices that make machine learning systems reliable in production.


Code Snippet:

# 📦 Import Required Libraries
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


# 🧩 Load Dataset
data = load_breast_cancer()

X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target


# ✂️ Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)


# =========================================================
# ❌ WRONG APPROACH – INCONSISTENT PIPELINE
# =========================================================

print("=== Wrong Approach ===")

# Training data is scaled
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)

wrong_model = LogisticRegression(max_iter=5000)

wrong_model.fit(X_train_scaled, y_train)

# Test data is NOT scaled
wrong_predictions = wrong_model.predict(X_test)

print(
    "Accuracy with inconsistent preprocessing:",
    accuracy_score(y_test, wrong_predictions)
)


# =========================================================
# ✅ CORRECT APPROACH – PIPELINE CONSISTENCY
# =========================================================

print("\n=== Correct Approach ===")

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=5000))
])

pipeline.fit(X_train, y_train)

predictions = pipeline.predict(X_test)

print(
    "Pipeline Accuracy:",
    accuracy_score(y_test, predictions)
)


# =========================================================
# 🚀 INFERENCE ON NEW DATA
# =========================================================

new_samples = X_test.iloc[:5]

new_predictions = pipeline.predict(new_samples)

print("\nNew Sample Predictions:")
print(new_predictions)


# =========================================================
# 🔍 VIEW PIPELINE COMPONENTS
# =========================================================

print("\nPipeline Steps:")
for step_name, step_obj in pipeline.named_steps.items():
    print(f"{step_name}: {step_obj}")


# =========================================================
# 💾 SAVE PIPELINE (OPTIONAL)
# =========================================================

import joblib

joblib.dump(
    pipeline,
    "train_inference_pipeline.pkl"
)

print("\nPipeline saved successfully.")


# =========================================================
# 📂 LOAD PIPELINE (OPTIONAL)
# =========================================================

loaded_pipeline = joblib.load(
    "train_inference_pipeline.pkl"
)

loaded_predictions = loaded_pipeline.predict(
    X_test.iloc[:3]
)

print("\nPredictions from Loaded Pipeline:")
print(loaded_predictions)
