⚡️ Saturday ML Sparks – ROC Curve & AUC Comparison 📈🧠
Posted on: November 22, 2025
Description:
Understanding the Problem
Most classification models output a probability score.
But choosing a fixed threshold like 0.5 may not always be ideal — especially in:
- imbalanced datasets
- medical predictions
- fraud detection
- risk assessment
ROC curves show the trade-off between True Positive Rate (TPR) and False Positive Rate (FPR) at every possible threshold; a short sketch of how one such point is computed follows the list below.
AUC gives a single number summarizing performance:
- AUC = 1.0 → perfect classifier
- AUC = 0.5 → random guessing
- Higher AUC = better ability to separate classes
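To make the TPR/FPR trade-off concrete, here is a minimal sketch that computes one ROC point at a single cutoff. The labels, scores, and the 0.5 threshold are made-up illustrative values, not taken from the dataset used below.
import numpy as np
# Hypothetical labels and predicted probabilities (illustrative values only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_scores = np.array([0.9, 0.4, 0.65, 0.8, 0.3, 0.55, 0.2, 0.1])
threshold = 0.5
y_pred = (y_scores >= threshold).astype(int)
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
tpr = tp / (tp + fn)  # True Positive Rate (sensitivity / recall)
fpr = fp / (fp + tn)  # False Positive Rate (1 - specificity)
print(f"At threshold {threshold}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")
Sweeping the threshold from 1 down to 0 and recording (FPR, TPR) at each step traces out the full ROC curve.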
1. Load Dataset & Split into Train/Test Sets
We’ll use the Breast Cancer dataset — a classic binary classification dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
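Because we passed stratify=y, the class proportions should be roughly preserved in both splits. A quick optional sanity check (not part of the original walkthrough) looks like this:
import numpy as np
# Class counts should have roughly the same ratio in train and test
print("Train class counts:", np.bincount(y_train))
print("Test class counts:", np.bincount(y_test))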
2. Train Two Models for Comparison
We’ll compare Logistic Regression vs Random Forest.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
log_reg = LogisticRegression(max_iter=5000)
rf = RandomForestClassifier(n_estimators=300, random_state=42)
log_reg.fit(X_train, y_train)
rf.fit(X_train, y_train)
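A single train/test split can give a noisy estimate of performance. As an optional extra, not part of the steps above, a cross-validated AUC (using scikit-learn's built-in "roc_auc" scorer) gives a more stable picture:
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated AUC on the training data for each model
cv_auc_log = cross_val_score(log_reg, X_train, y_train, cv=5, scoring="roc_auc")
cv_auc_rf = cross_val_score(rf, X_train, y_train, cv=5, scoring="roc_auc")
print("LogReg CV AUC: {:.3f} +/- {:.3f}".format(cv_auc_log.mean(), cv_auc_log.std()))
print("RF CV AUC: {:.3f} +/- {:.3f}".format(cv_auc_rf.mean(), cv_auc_rf.std()))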
3. Compute ROC Curves & AUC Scores
from sklearn.metrics import roc_curve, roc_auc_score
# Predicted probabilities for the positive class
log_proba = log_reg.predict_proba(X_test)[:, 1]
rf_proba = rf.predict_proba(X_test)[:, 1]
log_fpr, log_tpr, _ = roc_curve(y_test, log_proba)
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_proba)
log_auc = roc_auc_score(y_test, log_proba)
rf_auc = roc_auc_score(y_test, rf_proba)
print("Logistic Regression AUC:", log_auc)
print("Random Forest AUC:", rf_auc)
4. Plot ROC Curves
import matplotlib.pyplot as plt
plt.figure(figsize=(7,6))
plt.plot(log_fpr, log_tpr, label=f"Logistic Regression (AUC = {log_auc:.3f})")
plt.plot(rf_fpr, rf_tpr, label=f"Random Forest (AUC = {rf_auc:.3f})")
plt.plot([0,1], [0,1], "k--", label="Random Guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve Comparison")
plt.legend()
plt.grid(True)
plt.show()
This visualization immediately shows which model separates classes better.
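If you would rather not assemble the plot by hand, scikit-learn (version 1.0 and newer) also ships RocCurveDisplay, which draws a fitted model's ROC curve, AUC included, in a single call; a minimal sketch:
from sklearn.metrics import RocCurveDisplay
import matplotlib.pyplot as plt
# Draw both curves on a shared axis directly from the fitted estimators
ax = plt.gca()
RocCurveDisplay.from_estimator(log_reg, X_test, y_test, ax=ax, name="Logistic Regression")
RocCurveDisplay.from_estimator(rf, X_test, y_test, ax=ax, name="Random Forest")
ax.plot([0, 1], [0, 1], "k--", label="Random Guess")
ax.legend()
plt.show()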
Key Takeaways
- ROC curves show performance across all thresholds, not just at 0.5.
- AUC is a powerful single-number metric that summarizes separation quality.
- A higher AUC means the model ranks positive cases above negative ones more reliably; the metric itself is threshold-independent.
- Logistic Regression often performs well on problems with roughly linear decision boundaries, while Random Forests capture non-linear patterns, which can boost AUC.
- ROC curves are essential for medical, fraud, risk, and imbalanced classification tasks (a small imbalanced-data sketch follows this list).
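To make the imbalanced-data point concrete, here is a small optional sketch on a synthetic, heavily skewed dataset (generated with make_classification, not the Breast Cancer data used above). Accuracy can look deceptively high because of the majority class, while AUC reflects how well the two classes are actually ranked:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
# Synthetic dataset with roughly 5% positives
X_imb, y_imb = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=42)
clf = LogisticRegression(max_iter=5000).fit(Xtr, ytr)
proba = clf.predict_proba(Xte)[:, 1]
print("Accuracy:", accuracy_score(yte, clf.predict(Xte)))  # often dominated by the majority class
print("ROC AUC :", roc_auc_score(yte, proba))  # captures ranking quality across all thresholds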
Conclusion
ROC and AUC are fundamental to evaluating classification models beyond accuracy and precision/recall.
They help visualize threshold behavior, compare classifiers, and understand model robustness.
By comparing Logistic Regression and Random Forest, we see how different models behave across probability thresholds — a critical insight for real-world deployments.
Code Snippet:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
# Load the Breast Cancer dataset (binary classification)
data = load_breast_cancer()
X, y = data.data, data.target
# Split train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
# Initialize models
log_reg = LogisticRegression(max_iter=5000)
rf = RandomForestClassifier(n_estimators=300, random_state=42)
# Train
log_reg.fit(X_train, y_train)
rf.fit(X_train, y_train)
# Get predicted probabilities for the positive class
log_proba = log_reg.predict_proba(X_test)[:, 1]
rf_proba = rf.predict_proba(X_test)[:, 1]
# ROC curve points
log_fpr, log_tpr, _ = roc_curve(y_test, log_proba)
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_proba)
# AUC scores
log_auc = roc_auc_score(y_test, log_proba)
rf_auc = roc_auc_score(y_test, rf_proba)
print("Logistic Regression AUC:", log_auc)
print("Random Forest AUC:", rf_auc)
plt.figure(figsize=(7,6))
plt.plot(log_fpr, log_tpr, label=f"Logistic Regression (AUC = {log_auc:.3f})")
plt.plot(rf_fpr, rf_tpr, label=f"Random Forest (AUC = {rf_auc:.3f})")
# Random baseline
plt.plot([0,1], [0,1], "k--", label="Random Guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve Comparison")
plt.legend()
plt.grid(True)
plt.show()