AW Dev Rethought

Code is read far more often than it is written - Guido van Rossum

⚡️ Saturday ML Sparks – t-SNE & UMAP Visualization 📊🧠


Description:

High-dimensional datasets are common in machine learning — images, embeddings, text vectors, sensor readings — but visualizing them directly is impossible.

This is where manifold learning techniques like t-SNE and UMAP become incredibly valuable.

In this Saturday ML Spark, we explore how these techniques project complex, high-dimensional data into 2D space, revealing structure, clusters, and relationships that are otherwise hidden.


Understanding the Problem

When data has dozens or hundreds of features:

  • patterns become hard to interpret
  • clusters are not visually obvious
  • PCA may not capture non-linear structure

While PCA is linear, t-SNE and UMAP are non-linear techniques designed specifically for visualization and exploration.
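To make the linear baseline concrete, here is a minimal sketch of a 2D PCA projection. It assumes the scikit-learn Digits dataset and standard scaling used in the steps of this post; the variable names are illustrative only.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same dataset and scaling as the rest of the post.
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Linear projection onto the two directions of largest variance.
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

# Fraction of the total variance that survives in the 2D view.
variance_kept = pca.explained_variance_ratio_.sum()
print(f"PCA output shape: {X_pca.shape}")
print(f"Variance kept by 2 components: {variance_kept:.1%}")
```

If two components keep only a small share of the variance, a linear view is probably hiding structure, which is the situation where t-SNE and UMAP tend to help.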

They help answer questions like:

  • Are there natural clusters in my data?
  • Do classes overlap?
  • Is my embedding meaningful?

1. Load a High-Dimensional Dataset

We use the Digits dataset, where each image is represented by 64 features.

from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data
y = digits.target

This dataset is ideal for demonstrating visualization techniques.
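Before embedding anything, it can help to check what was actually loaded. The short sketch below prints the basic shape of the data; it only uses attributes that scikit-learn's Digits loader exposes.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

print(X.shape)              # (n_samples, n_features)
print(digits.images.shape)  # the same data as 8x8 images
print(np.unique(y))         # class labels
print(X.min(), X.max())     # pixel intensity range

# Each row of X is one 8x8 image flattened to 64 values.
first_row_matches = np.array_equal(X[0], digits.images[0].ravel())
print(first_row_matches)
```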


2. Standardize the Features

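Both methods are built on pairwise distances, so features with larger numeric ranges would otherwise dominate the embedding. Standardizing gives every feature zero mean and unit variance.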

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

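Both t-SNE and UMAP generally benefit from features on a comparable scale. The Digits pixels already share a 0 to 16 range, so the step matters more for datasets that mix units.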
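A quick sanity check on the scaled matrix is sketched below. It assumes the same Digits data; the handling of constant columns (pixels that are blank in every image) is worth seeing once, because they have no variance to rescale.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Columns with zero variance cannot be rescaled; StandardScaler leaves them at 0.
constant = X.std(axis=0) == 0
print(f"Constant features: {constant.sum()} of {X.shape[1]}")

means = X_scaled.mean(axis=0)
stds = X_scaled.std(axis=0)
print(f"Largest absolute column mean: {np.abs(means).max():.2e}")
print(f"Std of non-constant columns: {stds[~constant].min():.3f} to {stds[~constant].max():.3f}")
```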


3. Visualizing with t-SNE

t-SNE focuses on preserving local neighborhood relationships, making it excellent for identifying tight clusters.

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

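t-SNE is computationally expensive (scikit-learn's default Barnes-Hut approximation scales roughly as O(N log N)), and the layout changes with perplexity and the random seed, but the results are usually easy to read visually.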
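Because perplexity has a strong effect on the picture, it is worth trying more than one value. The sketch below is one way to do that; the 300-point subsample and the three perplexity values are arbitrary choices made here to keep the run short, not recommendations from the original post.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Arbitrary subsample so the sweep finishes quickly.
X_small, y_small = X_scaled[:300], y[:300]

embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    embeddings[perplexity] = tsne.fit_transform(X_small)
    print(f"perplexity={perplexity:>2}  final KL divergence={tsne.kl_divergence_:.3f}")
```

The KL divergence values are not directly comparable across perplexities, since each setting defines a different target distribution. Plotting each entry of `embeddings` side by side is usually the more informative comparison.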


4. Visualizing with UMAP

UMAP is typically faster than t-SNE, scales better to large datasets, and tends to retain more of the global structure alongside local neighborhoods.

import umap  # provided by the umap-learn package (pip install umap-learn)

umap_model = umap.UMAP(n_components=2, random_state=42)
X_umap = umap_model.fit_transform(X_scaled)

UMAP is often preferred for large datasets and repeated experimentation.


5. Comparing the Visualizations

Both techniques reveal structure, but in different ways:

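  • t-SNE tends to produce compact, well-separated clusters, though cluster sizes and the gaps between clusters should not be read as meaningful
  • UMAP tends to show smoother transitions and keeps more of the global layout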
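Visual impressions can be backed with a number. The sketch below uses scikit-learn's `trustworthiness` score, which measures how well local neighborhoods survive the projection. The helper name `local_structure_score`, the 500-point subsample, and the PCA baseline are assumptions added here for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)[:500]  # subsample for speed


def local_structure_score(X_high, X_low, n_neighbors=10):
    """Trustworthiness in [0, 1]; higher means fewer false neighbors in 2D."""
    return trustworthiness(X_high, X_low, n_neighbors=n_neighbors)


candidates = {
    "PCA": PCA(n_components=2, random_state=42).fit_transform(X_scaled),
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_scaled),
}

scores = {name: local_structure_score(X_scaled, emb) for name, emb in candidates.items()}
for name, score in scores.items():
    print(f"{name}: trustworthiness = {score:.3f}")
```

A UMAP embedding of the same rows can be scored the same way by passing it as `X_low`. The score only covers local structure, so it complements the plots rather than replacing them.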

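Treat both primarily as exploratory tools rather than feature engineering steps. scikit-learn's TSNE has no transform method for unseen data, and although UMAP can embed new points, a 2D embedding distorts distances enough that any model trained on it should be validated carefully.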
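A quick look at the scikit-learn API illustrates why t-SNE in particular is awkward as a training feature: the estimator learns coordinates only for the points it was fitted on. The check below compares it with PCA, which is used here purely as a contrast.

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

availability = {}
for estimator in (PCA(n_components=2), TSNE(n_components=2)):
    name = type(estimator).__name__
    # An estimator needs a transform method to project unseen samples.
    availability[name] = hasattr(estimator, "transform")
    print(f"{name}: transform available = {availability[name]}")
```

Without a transform step there is no way to place test-time samples into the same 2D space, short of refitting on everything.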


When to Use t-SNE vs UMAP

  • Use t-SNE when exploring small to medium datasets
  • Use UMAP when working with large datasets or embeddings
  • Use PCA when interpretability and speed matter
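The interpretability point about PCA can be shown directly: each component is an explicit vector of pixel weights, and the projection is a plain matrix product. This is a small sketch on the unscaled Digits data; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA(n_components=2).fit(X)

# Each component holds 64 pixel weights, so it can be viewed as an 8x8 image.
loadings = pca.components_.reshape(2, 8, 8)
print(loadings.shape)

# The projection is a centered matrix product, which is cheap and easy to explain.
manual = (X - pca.mean_) @ pca.components_.T
matches_library = np.allclose(manual, pca.transform(X))
print(matches_library)
```

t-SNE and UMAP have no equivalent of `components_`, so their axes cannot be traced back to input features in this way.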

These tools complement each other rather than compete.


Key Takeaways

  1. t-SNE and UMAP enable visualization of high-dimensional data.
  2. Both are non-linear techniques used mainly for exploration rather than as training features.
  3. Feature scaling usually improves embedding quality, especially when features use different units.
  4. t-SNE emphasizes local structure; UMAP balances local and global structure.
  5. These techniques are essential for understanding embeddings and clusters.

Conclusion

t-SNE and UMAP provide powerful ways to see your data — turning abstract vectors into intuitive visual patterns.

They are indispensable tools for exploratory analysis, helping you validate assumptions, inspect embeddings, and gain intuition before modeling.

As part of the Saturday ML Spark series, these visualization techniques complete the unsupervised learning toolkit — preparing you for deeper analysis, clustering, and representation learning.


Code Snippet:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
import umap  # provided by the umap-learn package (pip install umap-learn)


# Load the Digits dataset: 64 pixel features per image, labels 0-9
digits = load_digits()
X = digits.data
y = digits.target


# Standardize features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


# Project to 2D with t-SNE
tsne = TSNE(
    n_components=2,
    perplexity=30,
    random_state=42
)

X_tsne = tsne.fit_transform(X_scaled)


# Plot the t-SNE embedding, colored by digit label
plt.figure(figsize=(8, 5))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap="tab10", s=15)
plt.title("t-SNE Visualization of Digits Dataset")
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()


# Project to 2D with UMAP
umap_model = umap.UMAP(
    n_components=2,
    n_neighbors=15,
    min_dist=0.1,
    random_state=42
)

X_umap = umap_model.fit_transform(X_scaled)


# Plot the UMAP embedding, colored by digit label
plt.figure(figsize=(8, 5))
plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap="tab10", s=15)
plt.title("UMAP Visualization of Digits Dataset")
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()
