-
-
Save lgalke/febaaa1313d9c11f3bc8240defed8390 to your computer and use it in GitHub Desktop.
""" | |
All-but-the-Top: Simple and Effective Postprocessing for Word Representations | |
Paper: https://arxiv.org/abs/1702.01417 | |
Last Updated: Fri 15 Nov 2019 11:47:00 AM CET | |
**Prior version had serious issues, please excuse any inconveniences.** | |
""" | |
import numpy as np | |
from sklearn.decomposition import PCA | |
def all_but_the_top(v, D): | |
""" | |
Arguments: | |
:v: word vectors of shape (n_words, n_dimensions) | |
:D: number of principal components to subtract | |
""" | |
# 1. Subtract mean vector | |
v_tilde = v - np.mean(v, axis=0) | |
# 2. Compute the first `D` principal components | |
# on centered embedding vectors | |
u = PCA(n_components=D).fit(v_tilde).components_ # [D, emb_size] | |
# Subtract first `D` principal components | |
# [vocab_size, emb_size] @ [emb_size, D] @ [D, emb_size] -> [vocab_size, emb_size] | |
return v_tilde - (v @ u.T @ u) |
Thanks - trying this version (that ends return v - (v_tilde @ u.T @ u)
) has the expected behavior in my evaluations!
@gojomo I just double checked with the paper itself and it should be v_tilde - (v @ u.T @ u)
instead of v - (v_tilde @...
.
Applying PCA to centered / non-centered versions should not make a difference. The important thing is to subtract the mean from the embeddings to match the paper. v_tilde
holds the centered version in this gist so now we return the embeddings - mean - first D principal components
.
@s1998: I think this last point is also not considered in your implementation. In line 11 you do not subtract from the centered version. Am I missing something?
@lgalke You are correct, it should be mean centered embeddings that I need to subtract from. I'll fix that, thanks.
I adapted the code to match the version of this implementation. Changes are also ported back in vec4ir.