⚡️ Saturday ML Sparks – ROC Curve & AUC Comparison 📈🧠


Description:

Understanding the Problem

Most classification models output a probability score.

But choosing a fixed threshold like 0.5 may not always be ideal — especially in:

  • imbalanced datasets
  • medical predictions
  • fraud detection
  • risk assessment

ROC (Receiver Operating Characteristic) curves show the trade-off between True Positive Rate (TPR) and False Positive Rate (FPR) at every possible threshold; a short worked sketch follows below to make these rates concrete.

AUC (Area Under the ROC Curve) gives a single number summarizing performance:

  • AUC = 1.0 → perfect classifier
  • AUC = 0.5 → random guessing
  • Higher AUC = better ability to separate classes
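
To make these rates concrete before we touch the real dataset, here is a minimal sketch on a tiny hand-made set of labels and scores (the numbers are invented purely for illustration): it computes TPR and FPR at a few thresholds and then the AUC with scikit-learn.

import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented purely for illustration
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

def tpr_fpr(y_true, scores, threshold):
    pred = (scores >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    tn = np.sum((pred == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

for t in [0.3, 0.5, 0.7]:
    tpr, fpr = tpr_fpr(y_true, scores, t)
    print(f"threshold={t:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")

# AUC summarizes this trade-off over all thresholds in one number
print("AUC:", roc_auc_score(y_true, scores))

Lowering the threshold pushes both TPR and FPR up; the ROC curve traces exactly this movement.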

1. Load Dataset & Split into Train/Test Sets

We’ll use the Breast Cancer dataset, a classic binary classification benchmark.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target

# stratify=y keeps the class ratio the same in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

2. Train Two Models for Comparison

We’ll compare Logistic Regression vs Random Forest.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# max_iter is raised so the solver converges on these unscaled features
log_reg = LogisticRegression(max_iter=5000)
rf = RandomForestClassifier(n_estimators=300, random_state=42)

log_reg.fit(X_train, y_train)
rf.fit(X_train, y_train)

3. Compute ROC Curves & AUC Scores

from sklearn.metrics import roc_curve, roc_auc_score

# Predicted probabilities for the positive class
log_proba = log_reg.predict_proba(X_test)[:, 1]
rf_proba = rf.predict_proba(X_test)[:, 1]

log_fpr, log_tpr, _ = roc_curve(y_test, log_proba)
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_proba)

log_auc = roc_auc_score(y_test, log_proba)
rf_auc = roc_auc_score(y_test, rf_proba)

print("Logistic Regression AUC:", log_auc)
print("Random Forest AUC:", rf_auc)

4. Plot ROC Curves

import matplotlib.pyplot as plt

plt.figure(figsize=(7,6))

plt.plot(log_fpr, log_tpr, label=f"Logistic Regression (AUC = {log_auc:.3f})")
plt.plot(rf_fpr, rf_tpr, label=f"Random Forest (AUC = {rf_auc:.3f})")

plt.plot([0,1], [0,1], "k--", label="Random Guess")

plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve Comparison")
plt.legend()
plt.grid(True)
plt.show()

This visualization immediately shows which model separates classes better.
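
If you also want to turn the curve back into a concrete operating point, roc_curve returns the thresholds alongside the curve points. Here is a small sketch continuing from the variables above; it uses the TPR minus FPR rule of thumb (sometimes called Youden's J), which is an addition of mine rather than part of the comparison itself.

import numpy as np
from sklearn.metrics import roc_curve

# Thresholds come back alongside the curve points
log_fpr, log_tpr, log_thresholds = roc_curve(y_test, log_proba)

# Pick the threshold where the curve sits farthest above the diagonal (max TPR - FPR)
best_idx = np.argmax(log_tpr - log_fpr)
print("Candidate threshold:", log_thresholds[best_idx])
print("TPR / FPR there:", log_tpr[best_idx], "/", log_fpr[best_idx])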


Key Takeaways

  1. ROC curves show performance across all thresholds, not just at 0.5.
  2. AUC is a powerful single-number metric that summarizes separation quality (see the sketch after this list for its ranking interpretation).
  3. A higher AUC means the model ranks positives above negatives more reliably, independent of any particular threshold.
  4. Logistic Regression often performs well on (near-)linearly separable problems, while Random Forests can capture non-linear patterns and may push AUC higher.
  5. ROC curves are essential for medical, fraud, risk, and imbalanced classification tasks.
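
Takeaway 2 has a useful ranking interpretation: AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below checks that pairwise definition against roc_auc_score on synthetic scores (the score distributions are assumptions made only for this demonstration).

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic scores: positives tend to score a bit higher than negatives (illustrative only)
y = np.array([0] * 500 + [1] * 500)
scores = np.concatenate([rng.normal(0.4, 0.15, 500), rng.normal(0.6, 0.15, 500)])

# Fraction of (positive, negative) pairs ranked correctly, counting ties as half
pos, neg = scores[y == 1], scores[y == 0]
pairwise = np.mean(pos[:, None] > neg[None, :]) + 0.5 * np.mean(pos[:, None] == neg[None, :])

print("Pairwise estimate:", pairwise)
print("roc_auc_score    :", roc_auc_score(y, scores))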

Conclusion

ROC and AUC are fundamental to evaluating classification models beyond accuracy and precision/recall.

They help visualize threshold behavior, compare classifiers, and understand model robustness.

By comparing Logistic Regression and Random Forest, we see how different models behave across probability thresholds — a critical insight for real-world deployments.


Code Snippet:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score


# Load the Breast Cancer dataset (binary classification)
data = load_breast_cancer()
X, y = data.data, data.target

# Split train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)


# Initialize models
log_reg = LogisticRegression(max_iter=5000)
rf = RandomForestClassifier(n_estimators=300, random_state=42)

# Train
log_reg.fit(X_train, y_train)
rf.fit(X_train, y_train)


# Get predicted probabilities for the positive class
log_proba = log_reg.predict_proba(X_test)[:, 1]
rf_proba = rf.predict_proba(X_test)[:, 1]

# ROC curve points
log_fpr, log_tpr, _ = roc_curve(y_test, log_proba)
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_proba)

# AUC scores
log_auc = roc_auc_score(y_test, log_proba)
rf_auc = roc_auc_score(y_test, rf_proba)

print("Logistic Regression AUC:", log_auc)
print("Random Forest AUC:", rf_auc)


plt.figure(figsize=(7,6))

plt.plot(log_fpr, log_tpr, label=f"Logistic Regression (AUC = {log_auc:.3f})")
plt.plot(rf_fpr, rf_tpr, label=f"Random Forest (AUC = {rf_auc:.3f})")

# Random baseline
plt.plot([0,1], [0,1], "k--", label="Random Guess")

plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve Comparison")
plt.legend()
plt.grid(True)
plt.show()
