AW Dev Rethought

"Truth can only be found in one place: the code." – Robert C. Martin

⚡️ Saturday ML Spark – ⚔️ LightGBM vs RandomForest


Description:

When working with tabular data, two of the most commonly used ensemble models are RandomForest and LightGBM. Both are powerful and capable of delivering strong performance, but they work very differently under the hood.

In this project, we compare these two models to understand their behavior, performance, and when to use each.


Understanding the Problem

Choosing the right model is not just about accuracy. It also involves:

  • training speed
  • scalability
  • robustness
  • ability to capture complex patterns

RandomForest and LightGBM approach these challenges using different strategies.


RandomForest – Bagging Approach

RandomForest is based on bagging (Bootstrap Aggregation).

  • It builds multiple decision trees independently
  • Each tree sees a random subset of data
  • Final prediction is an average (or majority vote)

rf_model = RandomForestClassifier(n_estimators=200)
rf_model.fit(X_train, y_train)

This approach reduces variance and is highly stable.
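To make the bagging idea concrete, here is a minimal hand-rolled sketch (not part of the original snippet): several decision trees are trained on bootstrap samples and combined by majority vote. The dataset is synthetic, and the tree count and feature subsampling are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy dataset, for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(42)
trees = []
for _ in range(25):
    # Bootstrap: sample training rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Majority vote across the independent trees
votes = np.stack([t.predict(X_test) for t in trees])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("Ensemble accuracy:", (ensemble_pred == y_test).mean())
```

Because each tree sees a different bootstrap sample (and a random feature subset at each split), their errors are partly independent, and averaging them lowers the variance of the final prediction.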


LightGBM – Boosting Approach

LightGBM is based on gradient boosting.

  • Trees are built sequentially
  • Each new tree corrects errors of the previous one
  • Focus is on improving difficult predictions

lgbm_model = LGBMClassifier(n_estimators=200)
lgbm_model.fit(X_train, y_train)

This approach often leads to higher accuracy.
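The "each new tree corrects the errors of the previous one" idea can be sketched by hand. The toy example below (not from the original post, and using regression for simplicity) fits each shallow tree to the residuals of the current ensemble; the learning rate and tree depth are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy 1-D regression problem, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
pred = np.full_like(y, y.mean())  # start from a constant prediction
trees = []
for _ in range(100):
    residuals = y - pred               # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)             # each new tree fits the residuals
    pred += learning_rate * tree.predict(X)
    trees.append(tree)

print("Final training MSE:", np.mean((y - pred) ** 2))
```

Each iteration nudges the ensemble toward the remaining error, which is why boosting concentrates effort on the hardest examples. LightGBM adds histogram-based splitting and leaf-wise growth on top of this basic loop, which is where its speed comes from.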


Performance Comparison

We evaluate both models on the same dataset.

rf_pred = rf_model.predict(X_test)
lgbm_pred = lgbm_model.predict(X_test)

Typical observations:

  • LightGBM → faster training, often better accuracy
  • RandomForest → more stable, less sensitive to tuning

Key Differences

🔹 RandomForest

  • independent trees
  • robust and less prone to overfitting
  • easy to use
  • slower on large datasets

🔹 LightGBM

  • sequential learning
  • faster and more efficient
  • better performance on large data
  • requires careful tuning

When to Use What

  • Use RandomForest when:
    • you need a reliable baseline
    • interpretability matters
    • dataset is small to medium
  • Use LightGBM when:
    • performance is critical
    • dataset is large
    • you want faster training

Why This Comparison Matters

In real-world ML systems:

  • model choice impacts latency and cost
  • training time affects iteration speed
  • performance affects business outcomes

Understanding these trade-offs helps make better decisions.


Key Takeaways

  1. RandomForest uses bagging; LightGBM uses boosting.
  2. LightGBM is typically faster and more efficient.
  3. RandomForest is simpler and more stable.
  4. Both models perform well on tabular data.
  5. Model selection depends on data size and requirements.

Conclusion

RandomForest and LightGBM are both essential tools for machine learning practitioners. While RandomForest offers simplicity and stability, LightGBM provides speed and high performance. Choosing between them depends on the specific needs of your problem.

This comparison strengthens your understanding in Saturday ML Spark ⚡️ – Advanced & Practical, helping you move from using models to selecting the right one.


Code Snippet:

# 📦 Import Required Libraries
import pandas as pd
import time

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier


# 🧩 Load Dataset
data = load_breast_cancer()

X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target


# ✂️ Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)


# 🌲 Train RandomForest Model
rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=6,
    random_state=42
)

start_rf = time.time()
rf_model.fit(X_train, y_train)
end_rf = time.time()


# 🚀 Train LightGBM Model
lgbm_model = LGBMClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=6,
    random_state=42
)

start_lgbm = time.time()
lgbm_model.fit(X_train, y_train)
end_lgbm = time.time()


# 📊 Evaluate Both Models
rf_pred = rf_model.predict(X_test)
lgbm_pred = lgbm_model.predict(X_test)

print("=== RandomForest ===")
print("Accuracy:", accuracy_score(y_test, rf_pred))
print("Training Time:", round(end_rf - start_rf, 4), "seconds")
print(classification_report(y_test, rf_pred))

print("\n=== LightGBM ===")
print("Accuracy:", accuracy_score(y_test, lgbm_pred))
print("Training Time:", round(end_lgbm - start_lgbm, 4), "seconds")
print(classification_report(y_test, lgbm_pred))
