AW Dev Rethought

"Truth can only be found in one place: the code." - Robert C. Martin

🧠 AI with Python – ⚖️ Handling Imbalanced Data


Description:

In many real-world machine learning problems, the data is not evenly distributed. One class dominates, while the other — often the more important one — appears rarely.

This is known as class imbalance, and it can silently break your model’s effectiveness.

In this project, we explore how to handle imbalanced data using two practical approaches: SMOTE and class_weight.


Understanding the Problem

In an imbalanced dataset:

  • Majority class → appears frequently
  • Minority class → appears rarely

For example:

  • Fraud detection → very few fraudulent transactions
  • Medical diagnosis → rare disease cases
  • Churn prediction → fewer churned users

A model trained on such data may achieve high accuracy while completely ignoring the minority class.


Why Imbalance Is a Problem

Consider a dataset with:

  • 90% class A
  • 10% class B

A model predicting only class A achieves 90% accuracy, but fails to detect class B completely.

This is why metrics like:

  • Precision
  • Recall
  • F1 Score

are far more important than accuracy in such cases.
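To see why accuracy misleads here, consider a minimal sketch: a "model" that always predicts the majority class on a hypothetical 90/10 dataset. It scores 90% accuracy while its recall for the minority class is exactly zero. (The dataset below is illustrative, built just for this demonstration.)

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical 90/10 imbalanced labels; features are irrelevant here
y = np.array([0] * 900 + [1] * 100)
X = np.zeros((1000, 1))

# A classifier that always predicts the most frequent class
majority_model = DummyClassifier(strategy="most_frequent")
majority_model.fit(X, y)
pred = majority_model.predict(X)

print(accuracy_score(y, pred))             # 0.9
print(recall_score(y, pred, pos_label=1))  # 0.0 — every minority sample missed
```

High accuracy, zero recall: exactly the failure mode described above.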


Baseline Model (No Handling)

We first train a model without handling imbalance.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

This usually results in:

  • high accuracy
  • poor recall for the minority class

Approach 1: SMOTE

SMOTE (Synthetic Minority Oversampling Technique) generates new synthetic examples for the minority class.

from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

Then train the model:

model.fit(X_train_smote, y_train_smote)

This helps the model learn better patterns for the minority class.


Approach 2: class_weight

Instead of modifying the dataset, we modify how the model learns.

model = LogisticRegression(class_weight="balanced")
model.fit(X_train, y_train)

This increases the penalty for misclassifying minority samples.
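Under the hood, "balanced" assigns each class i the weight n_samples / (n_classes * count_i), so rarer classes get proportionally larger weights. A small sketch using scikit-learn's helper, with the same hypothetical 90/10 labels:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical 90/10 labels
y = np.array([0] * 900 + [1] * 100)

# weight_i = n_samples / (n_classes * count_i)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
print(weights)  # [1000/(2*900), 1000/(2*100)] = [0.5555..., 5.0]
```

Misclassifying one minority sample now costs the model about nine times as much as misclassifying a majority sample.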


SMOTE vs class_weight

Both methods aim to improve performance on the minority class, but they work differently:

  • SMOTE
    • modifies the dataset
    • increases minority samples
    • may introduce synthetic noise
  • class_weight
    • keeps the dataset unchanged
    • adjusts model learning
    • simpler and faster

Why This Matters

Handling imbalance is critical in:

  • fraud detection systems
  • healthcare models
  • anomaly detection
  • recommendation systems

Ignoring imbalance can lead to models that appear accurate but fail in real-world usage.


Key Takeaways

  1. Imbalanced data biases models toward the majority class.
  2. Accuracy is not a reliable metric for such problems.
  3. SMOTE generates synthetic minority samples.
  4. class_weight adjusts model learning without changing data.
  5. Both methods improve real-world model usefulness.

Conclusion

Class imbalance is one of the most common challenges in machine learning. By using techniques like SMOTE and class_weight, we can build models that better capture rare but critical patterns, making them far more useful in practice.

This is a key topic in the Advanced ML track of the AI with Python series — helping you move from basic models to real-world problem solving.


Code Snippet:

# 📦 Import Required Libraries
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

from imblearn.over_sampling import SMOTE


# 🧩 Create an Imbalanced Dataset
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=6,
    n_redundant=2,
    weights=[0.90, 0.10],   # 90% majority, 10% minority
    random_state=42
)

X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
y = pd.Series(y, name="target")


# ✂️ Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)


# =========================================================
# 🚨 Baseline Model (No Imbalance Handling)
# =========================================================

baseline_model = LogisticRegression(max_iter=5000)
baseline_model.fit(X_train, y_train)

baseline_pred = baseline_model.predict(X_test)

print("=== Baseline Model ===")
print(classification_report(y_test, baseline_pred))
print(confusion_matrix(y_test, baseline_pred))


# =========================================================
# 🔁 Approach 1 – SMOTE
# =========================================================

smote = SMOTE(random_state=42)

X_train_smote, y_train_smote = smote.fit_resample(
    X_train,
    y_train
)

smote_model = LogisticRegression(max_iter=5000)
smote_model.fit(X_train_smote, y_train_smote)

smote_pred = smote_model.predict(X_test)

print("\n=== SMOTE Model ===")
print(classification_report(y_test, smote_pred))
print(confusion_matrix(y_test, smote_pred))


# =========================================================
# ⚖️ Approach 2 – class_weight
# =========================================================

weighted_model = LogisticRegression(
    max_iter=5000,
    class_weight="balanced"
)

weighted_model.fit(X_train, y_train)

weighted_pred = weighted_model.predict(X_test)

print("\n=== class_weight Model ===")
print(classification_report(y_test, weighted_pred))
print(confusion_matrix(y_test, weighted_pred))
