AW Dev Rethought

"Code is read far more often than it is written." - Guido van Rossum

⚡️ Saturday ML Spark – 🚨 Anomaly Detection with Isolation Forest


Description:

In many real-world systems, the most important data points are often the rarest ones.

Fraudulent transactions, system failures, unusual user behaviour, and data corruption all fall under the category of anomalies.

In this project, we use Isolation Forest, an unsupervised machine learning algorithm designed specifically to detect such rare and unusual patterns in data.


Understanding the Problem

Anomaly detection differs from traditional supervised learning:

  • there are no labels telling us what is normal or abnormal
  • anomalies are rare and diverse
  • defining explicit rules is often impractical

This makes anomaly detection a natural fit for unsupervised learning, where the model learns patterns directly from the data.


Why Isolation Forest?

Isolation Forest is based on a simple but powerful idea:

Anomalies are easier to isolate than normal points.

Instead of modeling normal behaviour explicitly, Isolation Forest:

  • recursively splits the data at random
  • measures how many splits it takes to isolate each point
  • flags points that become isolated quickly as anomalies

This makes it efficient and scalable for real-world datasets.
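To build intuition before reaching for the library, here is a toy sketch of the isolation idea (a simplified, hypothetical illustration, not scikit-learn's actual implementation): repeatedly cut the data at a random value of a random feature, keep the side containing one chosen point, and count how many cuts it takes until that point stands alone.

import numpy as np

def isolation_depth(X, point, rng, max_depth=50):
    # Count how many random splits it takes to isolate `point` from X
    data = X
    depth = 0
    while len(data) > 1 and depth < max_depth:
        f = rng.integers(X.shape[1])                  # random feature
        lo, hi = data[:, f].min(), data[:, f].max()
        if lo == hi:
            break
        split = rng.uniform(lo, hi)                   # random cut point
        # keep only the side of the cut that still contains `point`
        data = data[data[:, f] < split] if point[f] < split else data[data[:, f] >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
cluster = rng.normal(0, 0.5, size=(200, 2))           # dense "normal" cluster
X = np.vstack([cluster, [[5.0, 5.0]]])                # plus one obvious outlier

# Averaged over many random partitions, the outlier isolates far sooner
print(np.mean([isolation_depth(X, X[0], rng) for _ in range(100)]))    # normal point
print(np.mean([isolation_depth(X, X[-1], rng) for _ in range(100)]))   # outlier

The real algorithm builds an ensemble of such random trees on subsamples and converts the average path length into an anomaly score.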


1. Creating a Dataset with Anomalies

To demonstrate the concept, we start with synthetic data that includes both normal points and anomalies.

from sklearn.datasets import make_blobs
import numpy as np
import pandas as pd

# 300 normal points forming one tight cluster
X_normal, _ = make_blobs(
    n_samples=300,
    centers=1,
    cluster_std=0.6,
    random_state=42
)

# 20 anomalies scattered uniformly across the feature space
rng = np.random.RandomState(42)
X_anomaly = rng.uniform(low=-6, high=6, size=(20, 2))

# Combine into a single unlabeled dataset
X = np.vstack([X_normal, X_anomaly])
df = pd.DataFrame(X, columns=["feature_1", "feature_2"])

This setup mimics real-world scenarios where anomalies are few and scattered.
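A quick look at the combined frame (a minimal sanity check, continuing from the code above) confirms the mix of 300 normal points and 20 injected anomalies:

print(df.shape)   # (320, 2): 300 normal + 20 anomalous rows
print(df.head())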


2. Training the Isolation Forest Model

We train the model without providing any labels.

from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(
    n_estimators=200,
    contamination=0.06,
    random_state=42
)

iso_forest.fit(df)

The contamination parameter sets the expected proportion of anomalies in the data; the model uses it to choose the score threshold that separates normal points from anomalies.
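Because contamination directly controls how many points get flagged, it is worth checking how sensitive the results are to it. A minimal sketch, reusing the df built above:

# Illustrative check: higher contamination flags more points
for c in [0.02, 0.06, 0.10]:
    model = IsolationForest(n_estimators=200, contamination=c, random_state=42)
    labels = model.fit_predict(df[["feature_1", "feature_2"]])
    print(f"contamination={c}: {(labels == -1).sum()} points flagged")

In practice, contamination is usually an estimate; when the true anomaly rate is unknown, it is common to inspect the score distribution rather than rely on a fixed threshold.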


3. Detecting Anomalies

Once trained, the model classifies each point as normal or anomalous.

df["anomaly"] = iso_forest.predict(df)
  • 1 → normal data point
  • 1 → anomaly
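As a quick sanity check (a minimal snippet, continuing from the code above), the label counts should roughly match the 6% contamination setting:

# Count normal (1) vs anomalous (-1) predictions
print(df["anomaly"].value_counts())
# Expect roughly 300 normal points and about 20 flagged anomalies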

4. Visualizing Detected Anomalies

Visualization helps validate and interpret anomaly detection results.

import matplotlib.pyplot as plt

normal = df[df["anomaly"] == 1]
anomaly = df[df["anomaly"] == -1]

plt.scatter(normal["feature_1"], normal["feature_2"], alpha=0.6)
plt.scatter(anomaly["feature_1"], anomaly["feature_2"], color="red")
plt.title("Anomaly Detection using Isolation Forest")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

The anomalies appear clearly separated from the main cluster.


5. Understanding Anomaly Scores

Isolation Forest also provides a continuous anomaly score.

df["anomaly_score"] = iso_forest.decision_function(df)

Lower scores indicate stronger anomalies, allowing further ranking and investigation.
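For example, sorting by the score surfaces the strongest anomalies first (a minimal sketch using the columns defined above):

# Lowest scores are the strongest anomalies, so ascending sort puts them first
top_suspects = df.sort_values("anomaly_score").head(10)
print(top_suspects)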


Key Takeaways

  1. Anomaly detection focuses on identifying rare and unusual patterns.
  2. Isolation Forest is an unsupervised algorithm — no labels required.
  3. Anomalies are detected by how quickly they are isolated.
  4. Visualization plays a key role in validating results.
  5. Common use cases include fraud detection, system monitoring, and security analytics.

Conclusion

Isolation Forest offers a simple yet powerful approach to anomaly detection in real-world datasets.

By focusing on isolation rather than distance or density, it scales efficiently and performs well even when anomalies are scarce.

This project demonstrates how unsupervised machine learning can uncover hidden risks and irregularities — making Isolation Forest a valuable tool in the Saturday ML Spark ⚡️ series.


Code Snippet:

# 📦 Import Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs


# 🧩 Generate Sample Data (Normal + Anomalies)
# Normal data points
X_normal, _ = make_blobs(
    n_samples=300,
    centers=1,
    cluster_std=0.6,
    random_state=42
)

# Anomalous data points
rng = np.random.RandomState(42)
X_anomaly = rng.uniform(low=-6, high=6, size=(20, 2))

# Combine into a single dataset
X = np.vstack([X_normal, X_anomaly])

df = pd.DataFrame(X, columns=["feature_1", "feature_2"])


# 🔍 Visualize Raw Data
plt.figure(figsize=(6, 5))
plt.scatter(df["feature_1"], df["feature_2"], alpha=0.6)
plt.title("Raw Data (Normal + Anomalies)")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()


# 🧠 Train Isolation Forest Model
iso_forest = IsolationForest(
    n_estimators=200,
    contamination=0.06,   # expected proportion of anomalies
    random_state=42
)

iso_forest.fit(df)


# 🚨 Predict Anomalies
# -1 → anomaly, 1 → normal
df["anomaly"] = iso_forest.predict(df)


# 📊 Visualize Detected Anomalies
plt.figure(figsize=(6, 5))

normal = df[df["anomaly"] == 1]
anomaly = df[df["anomaly"] == -1]

plt.scatter(
    normal["feature_1"],
    normal["feature_2"],
    label="Normal",
    alpha=0.6
)

plt.scatter(
    anomaly["feature_1"],
    anomaly["feature_2"],
    color="red",
    label="Anomaly"
)

plt.title("Anomaly Detection using Isolation Forest")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()


# 🔎 Anomaly Scores (Optional Inspection)
df["anomaly_score"] = iso_forest.decision_function(df)

print("Top detected anomalies:")
print(
    df[df["anomaly"] == -1]
    .sort_values("anomaly_score")
    .head()
)
