AW Dev Rethought

"Truth can only be found in one place: the code." - Robert C. Martin

🧠 AI with Python – ⚖️ Handling Imbalanced Data


Description:

In many real-world machine learning problems, the data is not evenly distributed. One class dominates, while the other — often the more important one — appears rarely.

This is known as class imbalance, and it can silently break your model’s effectiveness.

In this project, we explore how to handle imbalanced data using two practical approaches: SMOTE and class_weight.


Understanding the Problem

In an imbalanced dataset:

  • Majority class → appears frequently
  • Minority class → appears rarely

For example:

  • Fraud detection → very few fraudulent transactions
  • Medical diagnosis → rare disease cases
  • Churn prediction → fewer churned users

A model trained on such data may achieve high accuracy while completely ignoring the minority class.


Why Imbalance Is a Problem

Consider a dataset with:

  • 90% class A
  • 10% class B

A model predicting only class A achieves 90% accuracy, but fails to detect class B completely.

This is why metrics like:

  • Precision
  • Recall
  • F1 Score

are far more important than accuracy in such cases.
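To see why accuracy misleads here, consider a minimal sketch: a "model" that always predicts the majority class on a hypothetical 90/10 dataset. It scores 90% accuracy while its recall for the minority class is exactly zero. (The dataset below is illustrative, built just for this demonstration.)

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical 90/10 imbalanced labels; features are irrelevant here
y = np.array([0] * 900 + [1] * 100)
X = np.zeros((1000, 1))

# A classifier that always predicts the most frequent class
majority_model = DummyClassifier(strategy="most_frequent")
majority_model.fit(X, y)
pred = majority_model.predict(X)

print(accuracy_score(y, pred))             # 0.9
print(recall_score(y, pred, pos_label=1))  # 0.0 — every minority sample missed
```

High accuracy, zero recall: exactly the failure mode described above.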


Baseline Model (No Handling)

We first train a model without handling imbalance.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

This usually results in:

  • high accuracy
  • poor recall for the minority class

Approach 1: SMOTE

SMOTE (Synthetic Minority Oversampling Technique) generates new synthetic examples for the minority class.

from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

Then train the model:

model.fit(X_train_smote, y_train_smote)

This helps the model learn better patterns for the minority class.


Approach 2: class_weight

Instead of modifying the dataset, we modify how the model learns.

model = LogisticRegression(class_weight="balanced")
model.fit(X_train, y_train)

This increases the penalty for misclassifying minority samples.
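Under the hood, "balanced" assigns each class i the weight n_samples / (n_classes * count_i), so rarer classes get proportionally larger weights. A small sketch using scikit-learn's helper, with the same hypothetical 90/10 labels:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical 90/10 labels
y = np.array([0] * 900 + [1] * 100)

# weight_i = n_samples / (n_classes * count_i)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
print(weights)  # [1000/(2*900), 1000/(2*100)] = [0.5555..., 5.0]
```

Misclassifying one minority sample now costs the model about nine times as much as misclassifying a majority sample.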


SMOTE vs class_weight

Both methods aim to improve performance on the minority class, but they work differently:

  • SMOTE
    • modifies the dataset
    • increases minority samples
    • may introduce synthetic noise
  • class_weight
    • keeps the dataset unchanged
    • adjusts model learning
    • simpler and faster

Why This Matters

Handling imbalance is critical in:

  • fraud detection systems
  • healthcare models
  • anomaly detection
  • recommendation systems

Ignoring imbalance can lead to models that appear accurate but fail in real-world usage.


Key Takeaways

  1. Imbalanced data biases models toward the majority class.
  2. Accuracy is not a reliable metric for such problems.
  3. SMOTE generates synthetic minority samples.
  4. class_weight adjusts model learning without changing data.
  5. Both methods improve real-world model usefulness.

Conclusion

Class imbalance is one of the most common challenges in machine learning. By using techniques like SMOTE and class_weight, we can build models that better capture rare but critical patterns, making them far more useful in practice.

This is a key topic in the Advanced ML track of the AI with Python series — helping you move from basic models to real-world problem solving.


Code Snippet:

# 📦 Import Required Libraries
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

from imblearn.over_sampling import SMOTE


# 🧩 Create an Imbalanced Dataset
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=6,
    n_redundant=2,
    weights=[0.90, 0.10],   # 90% majority, 10% minority
    random_state=42
)

X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
y = pd.Series(y, name="target")


# ✂️ Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)


# =========================================================
# 🚨 Baseline Model (No Imbalance Handling)
# =========================================================

baseline_model = LogisticRegression(max_iter=5000)
baseline_model.fit(X_train, y_train)

baseline_pred = baseline_model.predict(X_test)

print("=== Baseline Model ===")
print(classification_report(y_test, baseline_pred))
print(confusion_matrix(y_test, baseline_pred))


# =========================================================
# 🔁 Approach 1 – SMOTE
# =========================================================

smote = SMOTE(random_state=42)

X_train_smote, y_train_smote = smote.fit_resample(
    X_train,
    y_train
)

smote_model = LogisticRegression(max_iter=5000)
smote_model.fit(X_train_smote, y_train_smote)

smote_pred = smote_model.predict(X_test)

print("\n=== SMOTE Model ===")
print(classification_report(y_test, smote_pred))
print(confusion_matrix(y_test, smote_pred))


# =========================================================
# ⚖️ Approach 2 – class_weight
# =========================================================

weighted_model = LogisticRegression(
    max_iter=5000,
    class_weight="balanced"
)

weighted_model.fit(X_train, y_train)

weighted_pred = weighted_model.predict(X_test)

print("\n=== class_weight Model ===")
print(classification_report(y_test, weighted_pred))
print(confusion_matrix(y_test, weighted_pred))
