AW Dev Rethought

🌟 The best way to predict the future is to invent it - Alan Kay

⚡️ Saturday ML Spark – 💾 Save & Load Models with joblib


Description:

Training a machine learning model can take time and computational resources. In real-world systems, we don’t retrain a model every time we need predictions — instead, we save the trained model and reuse it.

In this project, we explore how to persist trained models using joblib, a lightweight and efficient tool for serializing Python objects.


Understanding the Problem

When you train a model:

  • The model learns parameters
  • Those parameters exist only in memory
  • Once the program ends, they are lost

To use a model in production — APIs, dashboards, batch systems — we need a way to store and reload it.

That’s where model persistence comes in.


Why joblib?

While Python’s built-in pickle can serialize objects, joblib is optimized for:

  • Large NumPy arrays
  • scikit-learn models
  • Faster serialization
  • Efficient disk storage

It’s widely used in production ML workflows.
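Joblib's strength with array-heavy objects can be seen in a short sketch. The file name and compression level below are illustrative choices, not part of the original post:

```python
import os
import tempfile

import numpy as np
import joblib

# A large NumPy array: the kind of payload joblib is optimized for.
arr = np.zeros((1000, 100))

path = os.path.join(tempfile.mkdtemp(), "arr.joblib")
joblib.dump(arr, path, compress=3)   # compress=3: moderate on-disk compression
restored = joblib.load(path)

print(np.array_equal(arr, restored))  # True
```

The `compress` argument trades a little CPU time for a much smaller file, which matters when persisting large arrays or forests of trees.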


1. Train a Machine Learning Model

We begin by training a model as usual.

from sklearn.ensemble import RandomForestClassifier

# X_train and y_train are assumed to be prepared beforehand
# (the full snippet at the end of this post builds them with train_test_split)
model = RandomForestClassifier(
    n_estimators=200,
    random_state=42
)

model.fit(X_train, y_train)

At this point, the model exists only in memory.


2. Save the Trained Model

We serialize the model into a file.

import joblib

joblib.dump(model, "random_forest_model.pkl")

This creates a .pkl file (the extension is only a convention; joblib accepts any file name) containing:

  • Model parameters
  • Learned weights
  • Configuration settings
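The same `dump` call also persists entire scikit-learn pipelines, so preprocessing travels with the model. A minimal sketch, with an illustrative file name and a different estimator chosen just for brevity:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaler + classifier saved as one object, so the exact
# preprocessing is reproduced at load time.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

joblib.dump(pipe, "iris_pipeline.pkl")
reloaded = joblib.load("iris_pipeline.pkl")
print(reloaded.predict(X[:3]))  # predictions from the restored pipeline
```

Saving the pipeline rather than the bare model avoids the classic bug where production data is fed to a model without the scaling it was trained on.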

3. Load the Saved Model

Later — even in a different script — we can reload it.

loaded_model = joblib.load("random_forest_model.pkl")

No retraining required.


4. Use the Loaded Model for Predictions

preds = loaded_model.predict(X_test)

The predictions will match those from the original trained model.
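That equivalence can be checked directly. Here is a self-contained sketch; the dataset and file name are chosen for illustration:

```python
import numpy as np
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

joblib.dump(model, "rf_check.pkl")
loaded = joblib.load("rf_check.pkl")

# The round-tripped model reproduces the original predictions exactly.
print(np.array_equal(model.predict(X_test), loaded.predict(X_test)))  # True
```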


Why Model Persistence Matters

Saving models enables:

  • Deployment in APIs (FastAPI, Flask, etc.)
  • Sharing models across teams
  • Reproducible ML workflows
  • Faster inference pipelines

Model persistence is the bridge between experimentation and real-world systems.
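In serving code, this often takes the form of loading the persisted model a single time at startup and reusing it for every request. A minimal sketch of that pattern; all names here are illustrative:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Setup only: persist a small model so the sketch is self-contained.
X, y = load_iris(return_X_y=True)
joblib.dump(LogisticRegression(max_iter=1000).fit(X, y), "model.pkl")

_MODEL = None

def get_model(path="model.pkl"):
    """Lazily load the persisted model once and cache it."""
    global _MODEL
    if _MODEL is None:
        _MODEL = joblib.load(path)
    return _MODEL

print(get_model() is get_model())  # same cached object on every call -> True
```

An API handler (FastAPI, Flask, or similar) would call `get_model()` per request, paying the deserialization cost only on the first call.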


Key Takeaways

  1. joblib efficiently saves scikit-learn models.
  2. Saved models can be reused without retraining.
  3. .pkl files store model state and parameters.
  4. Critical for deployment and production systems.
  5. A foundational ML engineering skill.

Conclusion

Saving and loading models with joblib is a simple yet essential technique in practical machine learning. It ensures that trained models can be reused, deployed, and shared efficiently — making it a core component of production-ready ML systems.

This completes another topic in Saturday ML Spark ⚡️ – Advanced & Practical.


Code Snippet:

import joblib
import pandas as pd

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target


X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)


model = RandomForestClassifier(
    n_estimators=200,
    random_state=42
)

model.fit(X_train, y_train)


joblib.dump(model, "random_forest_model.pkl")


loaded_model = joblib.load("random_forest_model.pkl")

preds = loaded_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))
