MNIST Embedding with SG-t-SNE-Pi¶
This notebook reproduces the point-cloud MNIST demo from the Julia SGtSNEpi.jl package.
Workflow:
- Load MNIST (70,000 digits, 28×28 pixels)
- Extract HOG features (324-dimensional)
- Embed into 2D with SG-t-SNE-Pi
- Evaluate and visualize
Note: The embedding step takes ~10–15 minutes on a modern laptop. First run is slower due to Numba JIT compilation.
In [1]:
import numpy as np
from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784", version=1, as_frame=False, parser="liac-arff")
X_raw, y = mnist.data.astype(np.float64), mnist.target.astype(int)
print(f"Raw data shape: {X_raw.shape}")
print(f"Labels: {sorted(set(y))}")
Raw data shape: (70000, 784)
Labels: [np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9)]
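Each row of `X_raw` is a flattened 28×28 image, so any per-image processing starts with a reshape. A minimal standalone sketch (using a synthetic row in place of real MNIST data, so it runs without the download):

```python
import numpy as np

def to_image(flat_row):
    """Reshape one flattened MNIST row (784,) back to a 28x28 image."""
    assert flat_row.shape == (784,)
    return flat_row.reshape(28, 28)

# Synthetic stand-in for one row of X_raw (pixel values in [0, 255]).
row = np.arange(784, dtype=np.float64) % 256
img = to_image(row)
print(img.shape)  # → (28, 28)
```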
In [2]:
from skimage.feature import hog
def extract_hog_features(images, pixels_per_cell=(7, 7)):
"""Extract HOG features from flattened 28x28 images."""
features = []
for img in images:
feat = hog(
img.reshape(28, 28),
pixels_per_cell=pixels_per_cell,
cells_per_block=(2, 2),
orientations=9,
)
features.append(feat)
return np.array(features)
F = extract_hog_features(X_raw)
print(f"HOG feature shape: {F.shape}")
HOG feature shape: (70000, 324)
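The 324 dimensions follow directly from the HOG geometry: 28 / 7 = 4 cells per axis, a sliding 2×2 block grid gives 3 × 3 = 9 block positions, and each block concatenates 2 × 2 cells with 9 orientation bins each:

```python
# Dimensionality of the HOG descriptor for a 28x28 image with
# pixels_per_cell=(7, 7), cells_per_block=(2, 2), orientations=9.
cells = 28 // 7           # 4 cells along each axis
blocks = cells - 2 + 1    # 3 valid 2x2 block positions per axis
n_features = blocks * blocks * (2 * 2) * 9  # 9 blocks * 4 cells * 9 bins
print(n_features)  # → 324
```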
Parameter Mapping¶
The following table maps Julia SGtSNEpi.jl parameters to our Python API:
| Julia | Python | Value |
|---|---|---|
| `k = 3*u = 30` | `n_neighbors=30` | 30 |
| `u = 10` | `lambda_=10` | 10 |
| `max_iter = 1000` | `max_iter=1000` | 1000 |
| `early_exag = 250` | `early_exag=250` | 250 |
| `alpha = 12` | `alpha=12` | 12 |
| `eta = 200` | `eta=200` | 200 |
In [3]:
from pysgtsnepi import SGtSNEpi
model = SGtSNEpi(
d=2,
lambda_=10,
n_neighbors=30,
max_iter=1000,
early_exag=250,
alpha=12,
eta=200,
random_state=0,
)
Y = model.fit_transform(F)
print(f"Embedding shape: {Y.shape}")
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Embedding shape: (70000, 2)
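Since the fit takes 10–15 minutes, it is worth persisting `Y` so downstream cells can be rerun without refitting. A sketch with `np.savez` (the file name is arbitrary, and small synthetic arrays stand in for `Y` and `y` so the snippet runs on its own):

```python
import numpy as np

# Small synthetic stand-ins for the embedding Y and labels y.
Y_demo = np.random.RandomState(0).randn(100, 2)
y_demo = np.arange(100) % 10

# Save both arrays to one compressed-style archive, then reload.
np.savez("mnist_sgtsnepi_embedding.npz", Y=Y_demo, labels=y_demo)
loaded = np.load("mnist_sgtsnepi_embedding.npz")
print(loaded["Y"].shape)  # → (100, 2)
```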
In [4]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import silhouette_score
from sklearn.manifold import trustworthiness
# kNN accuracy in embedding space (fit and scored on the same points, i.e. in-sample)
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(Y, y)
knn_acc = knn.score(Y, y)
# Silhouette score (subsample for speed)
rng = np.random.RandomState(42)
idx = rng.choice(len(Y), size=5000, replace=False)
sil = silhouette_score(Y[idx], y[idx])
# Trustworthiness
trust = trustworthiness(F[idx], Y[idx], n_neighbors=10)
print(f"kNN accuracy (k=10): {knn_acc:.4f}")
print(f"Silhouette score (5k subsample): {sil:.4f}")
print(f"Trustworthiness (5k subsample, k=10): {trust:.4f}")
kNN accuracy (k=10): 0.9736
Silhouette score (5k subsample): 0.3968
Trustworthiness (5k subsample, k=10): 0.9774
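Note that the kNN accuracy above is computed on the same points the classifier was fit on, so it is an optimistic, in-sample figure. A hedged sketch of a held-out variant, demonstrated here on synthetic 2-D blobs so it runs without the real embedding:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def holdout_knn_accuracy(Y, labels, k=10, test_size=0.2, seed=0):
    """kNN accuracy on embedding points never seen during fitting."""
    Y_tr, Y_te, y_tr, y_te = train_test_split(
        Y, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    knn = KNeighborsClassifier(n_neighbors=k).fit(Y_tr, y_tr)
    return knn.score(Y_te, y_te)

# Synthetic 2-D "embedding": two well-separated Gaussian blobs.
rng = np.random.RandomState(0)
Y_demo = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 8.0])
labels_demo = np.array([0] * 100 + [1] * 100)
acc = holdout_knn_accuracy(Y_demo, labels_demo, k=5)
print(f"{acc:.2f}")  # clearly separable blobs: accuracy near 1.0
```

On the real embedding, `holdout_knn_accuracy(Y, y)` would give a less biased counterpart to the in-sample 0.9736.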
In [5]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12, 12), dpi=167)
scatter = ax.scatter(
Y[:, 0], Y[:, 1],
c=y, cmap="tab10",
s=0.5, alpha=0.6,
rasterized=True,
)
ax.set_aspect("equal")
ax.set_title("SG-t-SNE-Pi embedding of MNIST (70k digits, HOG features)")
ax.set_xlabel("Dimension 1")
ax.set_ylabel("Dimension 2")
cbar = fig.colorbar(scatter, ax=ax, ticks=range(10), shrink=0.8)
cbar.set_label("Digit")
plt.tight_layout()
plt.show()
Results¶
The embedding should show well-separated digit clusters, comparable to the Julia reference. SG-t-SNE-Pi is designed for graph embedding (not just point clouds), so the kNN graph construction and lambda equalization steps are critical for quality.
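For intuition on the lambda equalization step: the SG-t-SNE-Π paper describes rescaling each column of the sparse similarity matrix by an exponent γ so that the column entries sum to λ. The sketch below is an illustrative reimplementation of that idea via bisection, not pysgtsnepi's actual internal code:

```python
import numpy as np

def rescale_column(p, lam, iters=60):
    """Find gamma so that sum(p**gamma) == lam, via bisection.

    p: nonzero entries of one column of a column-stochastic similarity
    matrix (each in (0, 1), summing to 1). For 1 <= lam <= len(p),
    sum(p**gamma) falls monotonically from len(p) at gamma=0 to 1 at
    gamma=1, so a solution exists in [0, 1].
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.sum(p ** mid) > lam:
            lo = mid  # sum too large -> increase gamma
        else:
            hi = mid
    return p ** (0.5 * (lo + hi))

# Uniform column over 30 neighbors (matching n_neighbors=30), lambda=10.
p = np.full(30, 1.0 / 30)
q = rescale_column(p, lam=10.0)
print(round(q.sum(), 4))  # → 10.0
```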
Comparison with Julia SGtSNEpi.jl:
- Same algorithm, same parameters, same HOG feature pipeline
- Minor numerical differences due to different kNN implementations (PyNNDescent vs. NearestNeighborDescent.jl)
- Visual quality should be equivalent