Compress your static word embeddings without training

Apply Zipfian whitening for dimensionality reduction. Compress embeddings to ~16% of their original size with no training and minimal downstream loss.

TL;DR

Rethinking dimensionality reduction on static word embeddings

Static word embeddings are having a small revival, as in Model2Vec and WordLlama. They are fast, cheap to serve, and easy to deploy compared to large transformer models, but they still eat RAM, so compression matters. Matryoshka representation learning is popular and strong, yet it requires contrastive training. A training-free approach (here "training-free" means no backprop; for example, a simple SVD-based method) would be easier to ship, but the default is still PCA. In this blog post, we show that using Zipfian Whitening to compress static word embeddings works very well.

Whitening (& PCA)

Before introducing Zipfian whitening, let's start with plain whitening, the baseline this note builds on. Say we have an embedding matrix \(X \in \mathbb{R}^{V \times D}\) with rows \(x_w \in \mathbb{R}^{D}\), and we want compressed vectors \(z_w \in \mathbb{R}^{D'}\). Whitening here is PCA followed by scaling the retained components so they are uncorrelated and have unit variance. In this pipeline you center by the mean, form the covariance, eigendecompose, keep the top \(D'\) eigenpairs \(U_{D'}\) and \(S_{D'}\), and scale by the inverse square roots of those eigenvalues:

\[\begin{aligned} \mu &= {\color{#2475B0}{\frac{1}{V}}} \sum_w x_w \\ \Sigma &= {\color{#2475B0}{\frac{1}{V}}} \sum_w (x_w - \mu)(x_w - \mu)^\top \\ &= U S U^\top,\quad (U_{D'}, S_{D'}) = \text{top-}D'\text{ eigenpairs} \\ W &= U_{D'} \operatorname{diag}\!\Big(\frac{1}{\sqrt{S_{D'} + \varepsilon}}\Big) \\ z_w^{\text{white}} &= W^\top (x_w - \mu) \end{aligned}\]

The issue is that both whitening and PCA above implicitly weight every word equally (\({\color{#2475B0}{1/V}}\)), ignoring token probabilities. High-frequency words tilt the covariance and drown out rare, informative ones.
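For concreteness, here is a minimal NumPy sketch of the uniform whitening above; the function name and the eps default are our additions, and the Zipfian version later changes only the weights.

import numpy as np

def uniform_whiten(X, D_out, eps=1e-8):
    # Every word gets the same weight 1/V, exactly the blue term above.
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / X.shape[0]               # uniform covariance
    U, S, _ = np.linalg.svd(Sigma)               # symmetric PSD, so SVD = eigendecomposition
    W = U[:, :D_out] / np.sqrt(S[:D_out] + eps)  # scale top-D_out components to unit variance
    return Xc @ W                                # [V, D_out] whitened embeddings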

Zipfian whitening

To fix the equal-weight issue, Zipfian Whitening keeps the same steps but considers word probabilities \(\color{#d33}{p(w)}\):

\[\begin{aligned} \mu_p &= \sum_w {\color{#d33}{p(w)}} x_w \\ \Sigma_p &= \sum_w {\color{#d33}{p(w)}}\, (x_w - \mu_p)(x_w - \mu_p)^\top \\ &= U S U^\top,\quad (U_{D'}, S_{D'}) = \text{top-}D'\text{ eigenpairs} \\ W_p &= U_{D'} \operatorname{diag}\!\Big(\frac{1}{\sqrt{S_{D'} + \varepsilon}}\Big) \\ z_w &= W_p^\top (x_w - \mu_p) \end{aligned}\]

The only change is swapping the uniform weight (\({\color{#2475B0}{1/V}}\)) for actual token probability \(\color{#d33}{p(w)}\) when taking expectations; this down-weights frequent words and preserves rare, informative ones. Compared with the original Zipfian whitening paper, the twist here is using it explicitly as a dimensionality-reduction method, like PCA.

You can plug in any \(\color{#d33}{p(w)}\) you like—estimated from a general corpus (e.g., Wikipedia) or from the target task. Task-specific \(\color{#d33}{p(w)}\) is usually stronger when you have enough text, acting as lightweight test-time adaptation.
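As one hypothetical way to get \(\color{#d33}{p(w)}\): count how often each vocabulary word appears in whatever text you have (a general corpus or the task's own inputs) and normalize. The whitespace tokenizer and the smoothing constant below are our assumptions, not part of the method.

from collections import Counter
import numpy as np

def estimate_word_probs(texts, vocab, alpha=1.0):
    # Count whitespace tokens; any tokenizer that matches the embedding vocab works.
    counts = Counter(tok for text in texts for tok in text.lower().split())
    # alpha > 0 keeps unseen vocabulary words from getting exactly zero weight.
    freq = np.array([counts[w] + alpha for w in vocab], dtype=np.float64)
    return freq / freq.sum()                   # p[V], sums to 1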

Pseudocode

# Inputs: X[V, D] embeddings, p[V] word probabilities (sum to 1), target dim D_out
# Output: Z[V, D_out] compressed + whitened embeddings
import numpy as np

eps = 1e-8                                 # guards the inverse square root of tiny eigenvalues
mu = p @ X                                 # p-weighted mean, since p sums to 1
Xc = X - mu
Sigma = Xc.T @ (Xc * p[:, None])           # p-weighted covariance; avoids building the V x V np.diag(p)
U, S, _ = np.linalg.svd(Sigma)             # Sigma is symmetric PSD, so SVD = eigendecomposition
U_d, S_d = U[:, :D_out], S[:D_out]         # keep the top-D_out eigenpairs
W = U_d @ np.diag(1.0 / np.sqrt(S_d + eps))
Z = Xc @ W                                 # [V, D_out] compressed embeddings
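A quick sanity check, assuming Z, p, and D_out from the block above are in scope: under the Zipfian weights the compressed vectors should have approximately zero mean, unit variance, and no correlations.

# p-weighted second moment of Z; should be close to the identity matrix.
Sigma_z = Z.T @ (Z * p[:, None])
print(np.abs(Sigma_z - np.eye(D_out)).max())   # expect a value near zero
print(np.abs(p @ Z).max())                     # p-weighted mean, also near zero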

Experiments

Setup

We evaluate whether Zipfian whitening holds up when dimensions shrink.

Results

Figure: STS dimension sweep for Zipfian whitening vs. baselines. STS across dimensions: Zipfian whitening with task p(w) tracks the original 300d accuracy even at 50d and stays strongest through 300d; generic p(w) also beats the baselines.

Why it works

Conclusions

Using Zipfian whitening as a dimensionality-reduction method gives a training-free, frequency-aware drop-in for compressing static embeddings: swapping uniform weights for task-aware \(p(w)\) keeps STS accuracy even at 50d. Using task \(p(w)\) is the strongest option; a generic corpus still beats PCA/uniform whitening. If you have target text, compute \(p(w)\), whiten once, and ship smaller embeddings without retraining.

Acknowledgements

We thank Sho Yokoi, Yuji Yamamoto, and Hayato Tsukagoshi for discussions and insightful feedback.