MNIST Embedding with SG-t-SNE-Pi¶
This notebook reproduces the point-cloud MNIST demo from the Julia SGtSNEpi.jl package.
Workflow:
- Load MNIST (70,000 digits, 28×28 pixels)
- Extract HOG features (324-dimensional)
- Embed into 2D with SG-t-SNE-Pi
- Evaluate and visualize
Note: The embedding step takes ~10–15 minutes on a modern laptop. First run is slower due to Numba JIT compilation.
In [1]:
import numpy as np
from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784", version=1, as_frame=False, parser="liac-arff")
X_raw, y = mnist.data.astype(np.float64), mnist.target.astype(int)
print(f"Raw data shape: {X_raw.shape}")
print(f"Labels: {sorted(set(y))}")
Raw data shape: (70000, 784)
Labels: [np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9)]
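Each row of `X_raw` is a flattened 28×28 image, so any per-image processing starts with a reshape. A minimal standalone sketch (using a synthetic row in place of real MNIST data, so it runs without the download):

```python
import numpy as np

def to_image(flat_row):
    """Reshape one flattened MNIST row (784,) back to a 28x28 image."""
    assert flat_row.shape == (784,)
    return flat_row.reshape(28, 28)

# Synthetic stand-in for one row of X_raw (pixel values in [0, 255]).
row = np.arange(784, dtype=np.float64) % 256
img = to_image(row)
print(img.shape)  # → (28, 28)
```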
In [2]:
from skimage.feature import hog
def extract_hog_features(images, pixels_per_cell=(7, 7)):
"""Extract HOG features from flattened 28x28 images."""
features = []
for img in images:
feat = hog(
img.reshape(28, 28),
pixels_per_cell=pixels_per_cell,
cells_per_block=(2, 2),
orientations=9,
)
features.append(feat)
return np.array(features)
F = extract_hog_features(X_raw)
print(f"HOG feature shape: {F.shape}")
HOG feature shape: (70000, 324)
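The 324 dimensions follow directly from the HOG geometry: 28 / 7 = 4 cells per axis, a sliding 2×2 block grid gives 3 × 3 = 9 block positions, and each block concatenates 2 × 2 cells with 9 orientation bins each:

```python
# Dimensionality of the HOG descriptor for a 28x28 image with
# pixels_per_cell=(7, 7), cells_per_block=(2, 2), orientations=9.
cells = 28 // 7           # 4 cells along each axis
blocks = cells - 2 + 1    # 3 valid 2x2 block positions per axis
n_features = blocks * blocks * (2 * 2) * 9  # 9 blocks * 4 cells * 9 bins
print(n_features)  # → 324
```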
Parameter Mapping¶
The following table maps Julia SGtSNEpi.jl parameters to our Python API:
| Julia | Python | Value |
|---|---|---|
| `k = 3*u = 30` | `n_neighbors=30` | 30 |
| `u = 10` | `lambda_=10` | 10 |
| `max_iter = 1000` | `max_iter=1000` | 1000 |
| `early_exag = 250` | `early_exag=250` | 250 |
| `alpha = 12` | `alpha=12` | 12 |
| `eta = 200` | `eta=200` | 200 |
In [3]:
from pysgtsnepi import SGtSNEpi
model = SGtSNEpi(
d=2,
lambda_=10,
n_neighbors=30,
max_iter=1000,
early_exag=250,
alpha=12,
eta=200,
random_state=0,
)
Y = model.fit_transform(F)
print(f"Embedding shape: {Y.shape}")
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Embedding shape: (70000, 2)
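Since the fit takes 10–15 minutes, it is worth persisting `Y` so downstream cells can be rerun without refitting. A sketch with `np.savez` (the file name is arbitrary, and small synthetic arrays stand in for `Y` and `y` so the snippet runs on its own):

```python
import numpy as np

# Small synthetic stand-ins for the embedding Y and labels y.
Y_demo = np.random.RandomState(0).randn(100, 2)
y_demo = np.arange(100) % 10

# Save both arrays to one compressed-style archive, then reload.
np.savez("mnist_sgtsnepi_embedding.npz", Y=Y_demo, labels=y_demo)
loaded = np.load("mnist_sgtsnepi_embedding.npz")
print(loaded["Y"].shape)  # → (100, 2)
```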
In [4]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import silhouette_score
from sklearn.manifold import trustworthiness
# kNN accuracy in embedding space (fit and scored on the same points, i.e. in-sample)
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(Y, y)
knn_acc = knn.score(Y, y)
# Silhouette score (subsample for speed)
rng = np.random.RandomState(42)
idx = rng.choice(len(Y), size=5000, replace=False)
sil = silhouette_score(Y[idx], y[idx])
# Trustworthiness
trust = trustworthiness(F[idx], Y[idx], n_neighbors=10)
print(f"kNN accuracy (k=10): {knn_acc:.4f}")
print(f"Silhouette score (5k subsample): {sil:.4f}")
print(f"Trustworthiness (5k subsample, k=10): {trust:.4f}")
kNN accuracy (k=10): 0.9736
Silhouette score (5k subsample): 0.3968
Trustworthiness (5k subsample, k=10): 0.9774
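Note that the kNN accuracy above is computed on the same points the classifier was fit on, so it is an optimistic, in-sample figure. A hedged sketch of a held-out variant, demonstrated here on synthetic 2-D blobs so it runs without the real embedding:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def holdout_knn_accuracy(Y, labels, k=10, test_size=0.2, seed=0):
    """kNN accuracy on embedding points never seen during fitting."""
    Y_tr, Y_te, y_tr, y_te = train_test_split(
        Y, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    knn = KNeighborsClassifier(n_neighbors=k).fit(Y_tr, y_tr)
    return knn.score(Y_te, y_te)

# Synthetic 2-D "embedding": two well-separated Gaussian blobs.
rng = np.random.RandomState(0)
Y_demo = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 8.0])
labels_demo = np.array([0] * 100 + [1] * 100)
acc = holdout_knn_accuracy(Y_demo, labels_demo, k=5)
print(f"{acc:.2f}")  # clearly separable blobs: accuracy near 1.0
```

On the real embedding, `holdout_knn_accuracy(Y, y)` would give a less biased counterpart to the in-sample 0.9736.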
In [5]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12, 12), dpi=167)
scatter = ax.scatter(
Y[:, 0], Y[:, 1],
c=y, cmap="tab10",
s=0.5, alpha=0.6,
rasterized=True,
)
ax.set_aspect("equal")
ax.set_title("SG-t-SNE-Pi embedding of MNIST (70k digits, HOG features)")
ax.set_xlabel("Dimension 1")
ax.set_ylabel("Dimension 2")
cbar = fig.colorbar(scatter, ax=ax, ticks=range(10), shrink=0.8)
cbar.set_label("Digit")
plt.tight_layout()
plt.show()
Results¶
The embedding should show well-separated digit clusters, comparable to the Julia reference. SG-t-SNE-Pi is designed for graph embedding (not just point clouds), so the kNN graph construction and lambda equalization steps are critical for quality.
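For intuition on the lambda equalization step: the SG-t-SNE-Π paper describes rescaling each column of the sparse similarity matrix by an exponent γ so that the column entries sum to λ. The sketch below is an illustrative reimplementation of that idea via bisection, not pysgtsnepi's actual internal code:

```python
import numpy as np

def rescale_column(p, lam, iters=60):
    """Find gamma so that sum(p**gamma) == lam, via bisection.

    p: nonzero entries of one column of a column-stochastic similarity
    matrix (each in (0, 1), summing to 1). For 1 <= lam <= len(p),
    sum(p**gamma) falls monotonically from len(p) at gamma=0 to 1 at
    gamma=1, so a solution exists in [0, 1].
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.sum(p ** mid) > lam:
            lo = mid  # sum too large -> increase gamma
        else:
            hi = mid
    return p ** (0.5 * (lo + hi))

# Uniform column over 30 neighbors (matching n_neighbors=30), lambda=10.
p = np.full(30, 1.0 / 30)
q = rescale_column(p, lam=10.0)
print(round(q.sum(), 4))  # → 10.0
```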
Comparison with Julia SGtSNEpi.jl:
- Same algorithm, same parameters, same HOG feature pipeline
- Minor numerical differences due to different kNN implementations (PyNNDescent vs. NearestNeighborDescent.jl)
- Visual quality should be equivalent