⚡️ Saturday ML Sparks – Clustering with KMeans 🔷🧠
Posted on: December 13, 2025
Description:
Unsupervised learning is all about uncovering hidden patterns in data when we don’t have labels.
Among all unsupervised algorithms, KMeans is one of the simplest, fastest, and most widely used techniques for tasks like:
- customer segmentation
- anomaly detection
- grouping similar behaviors
- pattern discovery in large datasets
In this Saturday ML Spark, we explore how KMeans divides data into clusters and how to visualize the results.
Understanding the Problem
Unlike supervised learning, where each sample has a known target label, unsupervised learning operates without predefined classes.
KMeans attempts to:
- Group data into k clusters
- Assign each point to the nearest cluster centroid
- Iteratively refine centroids until convergence
The result is a natural grouping of data based on similarity, which is incredibly useful when labels aren’t available; a minimal sketch of this assign-and-update loop follows below.
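To make those steps concrete, here is a tiny NumPy sketch of one assignment-and-update round (an illustrative toy, not how scikit-learn implements KMeans internally, and it ignores edge cases such as empty clusters):
import numpy as np

def kmeans_step(X, centroids):
    # assignment step: each point gets the index of its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # update step: each centroid moves to the mean of the points assigned to it
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
    return labels, new_centroids
Repeating this step until the centroids stop moving is, at its core, what the algorithm does; scikit-learn adds smarter initialization (k-means++) and multiple restarts on top.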
1. Generate Synthetic Unlabeled Data
We create a dataset with three natural clusters.
from sklearn.datasets import make_blobs
import numpy as np

# 300 points drawn from three well-separated Gaussian blobs; y_true is ignored during clustering
X, y_true = make_blobs(
    n_samples=300,
    centers=3,
    cluster_std=0.60,
    random_state=42,
)
2. Apply KMeans Clustering
Initialize KMeans and fit it to the data.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # n_init set explicitly; its default differs across scikit-learn versions
kmeans.fit(X)
KMeans learns:
- the cluster assignments
- the centroid positions
3. Retrieve Cluster Labels and Centroids
After training, we extract the results.
labels = kmeans.labels_               # cluster index (0, 1, or 2) for every sample
centroids = kmeans.cluster_centers_   # shape (3, 2): one centroid per cluster
These labels represent the cluster each sample belongs to.
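As a quick sanity check (our own addition, not strictly required), we can count how many samples fell into each cluster:
print(np.bincount(labels))  # with these well-separated blobs, roughly 100 points per cluster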
4. Visualize the Clusters
Plot the grouped points and the learned centroids.
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis")
plt.scatter(
    centroids[:, 0], centroids[:, 1],
    c="red", s=200, marker="X"
)
plt.title("KMeans Clustering Result")
plt.show()
Visualizing clusters makes the grouping intuitive and interpretable.
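We knew to set n_clusters=3 because we generated the blobs ourselves. With real data, a common heuristic (sketched below under our own assumptions, continuing from the variables defined above) is to fit KMeans for several values of k and look for the "elbow" where inertia_, the within-cluster sum of squared distances, stops dropping sharply:
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 7), inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")
plt.title("Elbow Method (illustrative)")
plt.show()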
5. Predict Clusters for New Points
A fitted KMeans model can also assign new, unseen data points to the nearest learned cluster.
new_points = np.array([[0, 2], [3, 4], [-1, -2]])
preds = kmeans.predict(new_points)
print(preds)
This is useful for real-world segmentation tasks.
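The same fitted model also supports the anomaly-detection use case mentioned at the top: the distance from a point to its nearest centroid can act as a simple anomaly score. A minimal sketch (our own illustration, continuing from the objects above):
distances = kmeans.transform(X)               # shape (n_samples, 3): distance to every centroid
anomaly_score = distances.min(axis=1)         # distance to the nearest centroid
outlier_idx = np.argsort(anomaly_score)[-5:]  # e.g. the 5 points farthest from any centroid
print(outlier_idx, anomaly_score[outlier_idx])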
Key Takeaways
- KMeans is a foundational unsupervised algorithm used widely in data science.
- It discovers natural groupings in data without needing labels.
- Centroids represent “cluster centers” — key to understanding the grouping.
- Visualization helps interpret clustering results clearly.
- KMeans is efficient, scalable, and well suited to segmentation tasks in business and analytics (see the note on feature scaling below).
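One practical note for such business data (an assumption on our part; the synthetic blobs above did not need it): KMeans is distance-based, so features measured on very different scales, say age in years next to income in dollars, should usually be standardized first. A minimal sketch using a scikit-learn pipeline:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# standardize features, then cluster; shown here on the blob data purely to illustrate the pattern
model = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=42))
model.fit(X)
print(model.predict(new_points))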
Conclusion
Clustering is an essential part of unsupervised learning, and KMeans provides a simple yet powerful way to uncover structure in unlabeled datasets. Whether you’re analyzing customers, patterns, behaviors, or high-dimensional data, this algorithm is often the first choice and a strong baseline.
With just a few lines of code, KMeans reveals meaningful insights that would otherwise remain hidden — a perfect tool for exploration and discovery in machine learning.
Code Snippet:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
X, y_true = make_blobs(
    n_samples=300,
    centers=3,
    cluster_std=0.60,
    random_state=42,
)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
plt.figure(figsize=(8, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap="viridis")
plt.scatter(centroids[:, 0], centroids[:, 1], c="red", s=200, marker="X")
plt.title("KMeans Clustering Result")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()
new_points = np.array([[0, 2], [3, 4], [-1, -2]])
preds = kmeans.predict(new_points)
print("Cluster predictions for new points:", preds)