🧠 AI with Python – ⚖️ Handling Imbalanced Data
Posted on: May 5, 2026
Description:
In many real-world machine learning problems, the data is not evenly distributed. One class dominates, while the other — often the more important one — appears rarely.
This is known as class imbalance, and it can silently break your model’s effectiveness.
In this project, we explore how to handle imbalanced data using two practical approaches: SMOTE and class_weight.
Understanding the Problem
In an imbalanced dataset:
- Majority class → appears frequently
- Minority class → appears rarely
For example:
- Fraud detection → very few fraudulent transactions
- Medical diagnosis → rare disease cases
- Churn prediction → fewer churned users
A model trained on such data may achieve high accuracy while completely ignoring the minority class.
Why Imbalance Is a Problem
Consider a dataset with:
- 90% class A
- 10% class B
A model that predicts only class A achieves 90% accuracy, yet completely fails to detect class B.
This is why metrics like:
- Precision
- Recall
- F1 Score
are far more important than accuracy in such cases.
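The accuracy trap above is easy to demonstrate. As a quick sketch (the dataset and variable names here are illustrative, separate from the full snippet later in this post), a baseline that always predicts the majority class scores around 90% accuracy while its recall on the minority class is exactly zero:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Illustrative 90/10 dataset, matching the example in the text
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# A "model" that always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
pred = dummy.predict(X)

print(accuracy_score(y, pred))  # high (~0.90) despite learning nothing
print(recall_score(y, pred))    # 0.0 — the minority class is never detected
```

This is why recall and F1 on the minority class tell you far more than the headline accuracy number.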
Baseline Model (No Handling)
We first train a model without handling imbalance.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
This usually results in:
- high accuracy
- poor recall for the minority class
Approach 1: SMOTE
SMOTE (Synthetic Minority Oversampling Technique) generates new synthetic examples for the minority class.
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
Then train the model:
model.fit(X_train_smote, y_train_smote)
This helps the model learn better patterns for the minority class.
Approach 2: class_weight
Instead of modifying the dataset, we modify how the model learns.
model = LogisticRegression(class_weight="balanced")
model.fit(X_train, y_train)
This increases the penalty for misclassifying minority samples.
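Under the hood, `class_weight="balanced"` weights each class by `n_samples / (n_classes * count(class))`, so errors on rare classes cost proportionally more. A small sketch with hand-made 90/10 labels (illustrative, not part of the snippet below) shows the weights scikit-learn computes:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels: 90 majority, 10 minority
y_train = np.array([0] * 90 + [1] * 10)

# Same formula LogisticRegression(class_weight="balanced") applies:
# n_samples / (n_classes * count(class))
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_train
)
print(weights)  # roughly [0.556, 5.0] — minority errors weigh ~9x more
```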
SMOTE vs class_weight
Both methods aim to improve performance on the minority class, but they work differently:
- SMOTE
  - modifies the dataset
  - increases minority samples
  - may introduce synthetic noise
- class_weight
  - keeps the dataset unchanged
  - adjusts model learning
  - simpler and faster
Why This Matters
Handling imbalance is critical in:
- fraud detection systems
- healthcare models
- anomaly detection
- recommendation systems
Ignoring imbalance can lead to models that appear accurate but fail in real-world usage.
Key Takeaways
- Imbalanced data biases models toward the majority class.
- Accuracy is not a reliable metric for such problems.
- SMOTE generates synthetic minority samples.
- class_weight adjusts model learning without changing data.
- Both methods improve real-world model usefulness.
Conclusion
Class imbalance is one of the most common challenges in machine learning. By using techniques like SMOTE and class_weight, we can build models that better capture rare but critical patterns, making them far more useful in practice.
This is a key topic in the Advanced ML track of the AI with Python series — helping you move from basic models to real-world problem solving.
Code Snippet:
# 📦 Import Required Libraries
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
# 🧩 Create an Imbalanced Dataset
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=6,
    n_redundant=2,
    weights=[0.90, 0.10],  # 90% majority, 10% minority
    random_state=42
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
y = pd.Series(y, name="target")
# ✂️ Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)
# =========================================================
# 🚨 Baseline Model (No Imbalance Handling)
# =========================================================
baseline_model = LogisticRegression(max_iter=5000)
baseline_model.fit(X_train, y_train)
baseline_pred = baseline_model.predict(X_test)
print("=== Baseline Model ===")
print(classification_report(y_test, baseline_pred))
print(confusion_matrix(y_test, baseline_pred))
# =========================================================
# 🔁 Approach 1 – SMOTE
# =========================================================
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(
    X_train,
    y_train
)
smote_model = LogisticRegression(max_iter=5000)
smote_model.fit(X_train_smote, y_train_smote)
smote_pred = smote_model.predict(X_test)
print("\n=== SMOTE Model ===")
print(classification_report(y_test, smote_pred))
print(confusion_matrix(y_test, smote_pred))
# =========================================================
# ⚖️ Approach 2 – class_weight
# =========================================================
weighted_model = LogisticRegression(
    max_iter=5000,
    class_weight="balanced"
)
weighted_model.fit(X_train, y_train)
weighted_pred = weighted_model.predict(X_test)
print("\n=== class_weight Model ===")
print(classification_report(y_test, weighted_pred))
print(confusion_matrix(y_test, weighted_pred))