⚡️ Saturday ML Sparks – Clustering with KMeans 🔷🧠


Description:

Unsupervised learning is all about uncovering hidden patterns in data when we don’t have labels.

Among all unsupervised algorithms, KMeans is one of the simplest, fastest, and most widely used techniques for tasks like:

  • customer segmentation
  • anomaly detection
  • grouping similar behaviors
  • pattern discovery in large datasets

In this Saturday ML Spark, we explore how KMeans divides data into clusters and how to visualize the results.


Understanding the Problem

Unlike supervised learning, where each sample has a known target label, unsupervised learning operates without predefined classes.

KMeans attempts to:

  1. Group data into k clusters
  2. Assign each point to the nearest cluster centroid
  3. Iteratively refine centroids until convergence

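The result is a partition of the data in which every point belongs to the cluster whose centroid is closest. Each iteration lowers the total squared distance between points and their centroids, so the algorithm settles on compact groups. This is useful when labels aren’t available.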
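The three steps above can be sketched directly in NumPy. This is a minimal illustration of the standard assign-and-update loop (often called Lloyd's algorithm), not scikit-learn's implementation. The function name lloyd_kmeans, the random initialization, and the toy data are all assumptions made for the example.

```python
import numpy as np


def lloyd_kmeans(X, k, n_iters=100, seed=0):
    """Hypothetical helper: plain assign-and-update KMeans loop."""
    rng = np.random.default_rng(seed)
    # Simple initialization (an assumption): k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)

        # Move each centroid to the mean of its points; an empty cluster
        # keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Final assignment against the final centroids.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), centroids


# Toy data (illustrative only): two well-separated groups in 2-D.
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2)),
])
toy_labels, toy_centroids = lloyd_kmeans(toy, k=2)
print(toy_centroids)
```

On data like this the two centroids should land near the two group centers. Real implementations add smarter initialization and multiple restarts, which is why the rest of this post uses scikit-learn.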


1. Generate Synthetic Unlabeled Data

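We create a dataset with three well-separated clusters. make_blobs also returns the true blob index of each sample (y_true), but KMeans never sees it; we fit on X alone.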

from sklearn.datasets import make_blobs
import numpy as np

X, y_true = make_blobs(
    n_samples=300,
    centers=3,
    cluster_std=0.60,
    random_state=42
)

2. Apply KMeans Clustering

Initialize KMeans and fit it to the data.

from sklearn.cluster import KMeans

# n_init is set explicitly because its default differs between scikit-learn versions.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)

KMeans learns:

  • the cluster assignments
  • the centroid positions
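To make "what KMeans learns" concrete, the sketch below refits the same model and inspects the fitted attributes. It assumes the make_blobs settings used in this post. The manual inertia calculation is added here for illustration and is not part of the original walkthrough.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(kmeans.labels_.shape)           # (300,): one cluster index per sample
print(kmeans.cluster_centers_.shape)  # (3, 2): one centroid per cluster
print(kmeans.n_iter_)                 # iterations used by the best run

# inertia_ is the sum of squared distances from each sample to its centroid.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = float(((X - assigned_centroids) ** 2).sum())
print(kmeans.inertia_, manual_inertia)
```

The two inertia values should agree up to floating-point error, which confirms that the labels and centroids describe the same partition.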

3. Retrieve Cluster Labels and Centroids

After training, we extract the results.

labels = kmeans.labels_
centroids = kmeans.cluster_centers_

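labels holds one integer per sample (0 to k-1) naming the cluster that sample belongs to, and centroids holds one row per cluster. The numbering is arbitrary: cluster 0 has no special meaning and need not match any original class index.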
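Because this dataset is synthetic, the true blob indices are available as a sanity check. The sketch below compares the cluster labels with y_true using the adjusted Rand index, which ignores how the clusters happen to be numbered. This check is an addition for illustration; with real unlabeled data there is no y_true to compare against.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# How many samples landed in each cluster.
cluster_sizes = np.bincount(labels, minlength=3)
print(cluster_sizes)

# 1.0 means the clustering matches the true blobs up to renaming of clusters.
ari = adjusted_rand_score(y_true, labels)
print(ari)
```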


4. Visualize the Clusters

Plot the grouped points and the learned centroids.

import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis")
plt.scatter(
    centroids[:, 0], centroids[:, 1],
    c="red", s=200, marker="X"
)
plt.title("KMeans Clustering Result")
plt.show()

Visualizing clusters makes the grouping intuitive and interpretable.


5. Predict Clusters for New Points

A fitted KMeans model can also assign new, unseen points to a cluster: predict returns the index of the nearest learned centroid for each point.

new_points = np.array([[0, 2], [3, 4], [-1, -2]])
preds = kmeans.predict(new_points)
print(preds)

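This is useful for segmentation tasks where new records, such as new customers, need to be placed into existing segments without refitting the model.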
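To see what predict does, the sketch below recomputes the assignments by hand as the nearest centroid under Euclidean distance. It reuses the data and model settings from this post; the manual distance computation is an illustrative addition.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

new_points = np.array([[0, 2], [3, 4], [-1, -2]])
preds = kmeans.predict(new_points)

# Distance from every new point to every centroid, then pick the closest.
diffs = new_points[:, None, :] - kmeans.cluster_centers_[None, :, :]
manual = np.linalg.norm(diffs, axis=2).argmin(axis=1)

print(preds)
print(manual)
```

Both arrays should be identical. Note that predict always returns some cluster, even for a point far from all centroids.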


Key Takeaways

  1. KMeans is a foundational unsupervised algorithm used widely in data science.
  2. It discovers natural groupings in data without needing labels.
  3. Each centroid is the mean of the points in its cluster, which makes it a compact summary of that group.
  4. Visualization helps interpret clustering results clearly.
  5. KMeans is efficient and scales to large datasets, which makes it a practical fit for segmentation tasks in business and analytics, provided k is chosen with care.

Conclusion

Clustering is an essential part of unsupervised learning, and KMeans provides a simple yet powerful way to uncover structure in unlabeled datasets. Whether you’re analyzing customers, patterns, behaviors, or high-dimensional data, it is often a sensible first algorithm to try.

With just a few lines of code, KMeans can surface groupings that are hard to spot by inspection, which makes it a handy tool for exploration and discovery in machine learning.


Code Snippet:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans


X, y_true = make_blobs(
    n_samples=300,
    centers=3,
    cluster_std=0.60,
    random_state=42
)


# n_init is set explicitly because its default differs between scikit-learn versions.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)


labels = kmeans.labels_
centroids = kmeans.cluster_centers_


plt.figure(figsize=(8, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap="viridis")
plt.scatter(centroids[:, 0], centroids[:, 1], c="red", s=200, marker="X")
plt.title("KMeans Clustering Result")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()


new_points = np.array([[0, 2], [3, 4], [-1, -2]])
preds = kmeans.predict(new_points)
print("Cluster predictions for new points:", preds)
