Comparing Text Similarity Measures & Text Embedding Methods
def tagged_document(list_of_list_of_words):
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i])

training_data = list(tagged_document(data))
model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)
model.build_vocab(training_data)
model.train(training_data, total_examples=model.corpus_count, epochs=model.epochs)
def cos_similarity(x, y):
    """Return the cosine similarity between two lists, rounded to 3 decimals."""
    numerator = sum(a * b for a, b in zip(x, y))
    # squared_sum (defined alongside euclidean_distance) returns the rounded L2 norm
    denominator = squared_sum(x) * squared_sum(y)
    return round(numerator / float(denominator), 3)

cos_similarity(embeddings[0], embeddings[1])
# OUTPUT
# 0.891
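As a cross-check on the function above, here is a minimal sketch of the same cosine computation that skips the intermediate rounding of the norms and guards against zero vectors. The name `cosine_unrounded` and the toy vectors are illustrative assumptions, not part of the original gist.

```python
from math import sqrt

def cosine_unrounded(x, y):
    """Cosine similarity of two equal-length numeric sequences (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Cosine is undefined for a zero vector; returning 0.0 is a convention chosen here.
        return 0.0
    return dot / (norm_x * norm_y)

# Toy vectors, assumed for illustration only
parallel = cosine_unrounded([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # same direction
orthogonal = cosine_unrounded([1.0, 0.0], [0.0, 1.0])           # perpendicular
print(parallel, orthogonal)
```

Parallel vectors should score very close to 1 and perpendicular ones exactly 0, regardless of vector length.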
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(headlines)
arr = X.toarray()
create_heatmap(cosine_similarity(arr))
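To make the bag-of-words step less opaque, the sketch below approximates what the CountVectorizer plus cosine pipeline computes for a single pair of texts, using only the standard library. The tokenizer (lowercasing, tokens of two or more word characters) is meant to mirror CountVectorizer's default settings; `tokenize`, `bow_cosine`, and the two sample strings are hypothetical.

```python
import re
from collections import Counter
from math import sqrt

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def bow_cosine(text_a, text_b):
    """Cosine similarity between the raw term-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    # Only terms present in both texts contribute to the dot product
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Three of the four tokens are shared, so the score is 3 / (2 * 2)
score = bow_cosine("crypto funds see inflows", "crypto funds see outflows")
print(score)  # 0.75
```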
import spacy

nlp = spacy.load('en_core_web_md')
docs = [nlp(headline) for headline in headlines]
from simple_elmo import ElmoModel

model = ElmoModel()
model.load("/content/209.zip")
sentence = "After stealing gold from the bank vault, the bank robber was seen fishing on the river bank."
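import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Truncate each headline to 20 characters for readable axis labels
labels = [headline[:20] for headline in headlines]

def create_heatmap(similarity, cmap="YlGnBu"):
    """Plot a square similarity matrix as a heatmap labelled with the headlines."""
    df = pd.DataFrame(similarity)
    df.columns = labels
    df.index = labels
    fig, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(df, cmap=cmap, ax=ax)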
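from math import exp

def distance_to_similarity(distance):
    """Map a distance in [0, inf) to a similarity in (0, 1]; same as exp(-distance)."""
    return 1/exp(distance)

distance_to_similarity(distance)
# OUTPUT (as originally recorded)
# 0.8450570465624478
# Note: for the distance 1.8646982721454675 printed by the euclidean_distance example,
# 1/exp(distance) is about 0.155; the recorded 0.845... equals 1 - 1/exp(distance).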
wget http://vectors.nlpl.eu/repository/20/209.zip
python -m spacy download en_core_web_md
elmo_vectors = model.get_elmo_vectors(sentence, layers="average")
print(f"Tensor shape: {elmo_vectors.shape}")
# OUTPUT
# Tensor shape: (1, 92, 1024)
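import numpy as np

# Each slice spans the four positions of one "bank" in the sentence
# (character offsets 29-32, 45-48 and 87-90); average the vectors over them.
vault = np.sum(elmo_vectors[0][29:33], axis=0) / 4   # "bank" in "bank vault"
robber = np.sum(elmo_vectors[0][45:49], axis=0) / 4  # "bank" in "bank robber"
river = np.sum(elmo_vectors[0][87:91], axis=0) / 4   # "bank" in "river bank"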
from math import sqrt, pow, exp

def squared_sum(x):
    """Return the Euclidean (L2) norm of x, rounded to 3 decimals."""
    return round(sqrt(sum([a*a for a in x])), 3)

def euclidean_distance(x, y):
    """Return the Euclidean distance between two lists."""
    return sqrt(sum(pow(a-b, 2) for a, b in zip(x, y)))
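For reference, Python 3.8 and later ship `math.dist`, which computes the same Euclidean distance. The sketch below checks the hand-written formula against it on a 3-4-5 right triangle; the helper name `euclidean` and the two points are illustrative assumptions.

```python
from math import dist, sqrt

def euclidean(x, y):
    """Same formula as euclidean_distance, restated so this block runs on its own."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Legs of length 3 and 4, so the hypotenuse (the distance) should be 5
p, q = [0.0, 0.0], [3.0, 4.0]
manual = euclidean(p, q)
builtin = dist(p, q)
print(manual, builtin)
```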
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('stsb-roberta-large')
embeddings = [nlp(sentence).vector for sentence in sentences]
distance = euclidean_distance(embeddings[0], embeddings[1])
print(distance)
# OUTPUT
# 1.8646982721454675
import gensim
import gensim.downloader as api

dataset = api.load("text8")
data = [i for i in dataset]
import tensorflow as tf
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
headlines = [
    # Crypto
    'Investors unfazed by correction as crypto funds see $154 million inflows',
    'Bitcoin, Ethereum prices continue descent, but crypto funds see inflows',
    # Inflation
    'The surge in euro area inflation during the pandemic: transitory but with upside risks',
    "Inflation: why it's temporary and raising interest rates will do more harm than good",
    # Common
    'Will Cryptocurrency Protect Against Inflation?']
pip install transformers sentence-transformers
def jaccard_similarity(x, y):
    """Return the Jaccard similarity between two lists."""
    intersection_cardinality = len(set(x) & set(y))
    union_cardinality = len(set(x) | set(y))
    return intersection_cardinality / float(union_cardinality)
vectors = [model.infer_vector([word for word in sent]).reshape(1,-1) for sent in sentences]
similarity = []
for i in range(len(sentences)):
    row = []
    for j in range(len(sentences)):
        row.append(cosine_similarity(vectors[i],vectors[j])[0][0])
    similarity.append(row)
create_heatmap(similarity)
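# sklearn's cosine_similarity expects 2-D inputs, so reshape each 1-D vector to (1, n)
# and read the single value out of the resulting 1x1 matrix.
diff_bank_1 = cosine_similarity(vault.reshape(1, -1), river.reshape(1, -1))[0][0]
diff_bank_2 = cosine_similarity(river.reshape(1, -1), robber.reshape(1, -1))[0][0]
same_bank = cosine_similarity(vault.reshape(1, -1), robber.reshape(1, -1))[0][0]
print('Vector similarity for *similar* meanings: %.2f' % same_bank)
print('Vector similarity for *different* meanings: %.2f' % diff_bank_1)
print('Vector similarity for *different* meanings: %.2f' % diff_bank_2)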
sentences = ["The bottle is empty",
             "There is nothing in the bottle"]
# Lowercase and split on spaces to get one token list per sentence
sentences = [sent.lower().split(" ") for sent in sentences]
jaccard_similarity(sentences[0], sentences[1])
# OUTPUT
# 0.42857142857142855
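The value 0.4286 can be traced by hand: the two token sets share three words out of seven distinct words overall. The sketch below spells that out with plain set operations; the variable names are illustrative.

```python
# Token sets for the two lowercased sentences
a = set("the bottle is empty".split())
b = set("there is nothing in the bottle".split())

shared = a & b      # tokens present in both: 'the', 'bottle', 'is'
combined = a | b    # all distinct tokens across both sentences
ratio = len(shared) / len(combined)
print(len(shared), len(combined), ratio)
```

Three shared tokens over seven distinct tokens gives 3/7, which matches the output above.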
embeddings = model.encode(sentences, convert_to_tensor=True)
similarity = []
for i in range(len(sentences)):
    row = []
    for j in range(len(sentences)):
        row.append(util.pytorch_cos_sim(embeddings[i], embeddings[j]).item())
    similarity.append(row)
create_heatmap(similarity)
embeddings = model(text)
similarity = cosine_similarity(embeddings)
create_heatmap(similarity)
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(headlines)
arr = X.toarray()
create_heatmap(cosine_similarity(arr))
similarity = []
for i in range(len(docs)):
    row = []
    for j in range(len(docs)):
        row.append(docs[i].similarity(docs[j]))
    similarity.append(row)
create_heatmap(similarity)
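The same nested loop builds the similarity matrix in several snippets of this gist. One way to factor it out is a small helper that accepts any pairwise scoring function. `pairwise_matrix` and the toy scorer `char_overlap` are hypothetical names used only for illustration.

```python
def pairwise_matrix(items, score):
    """Return a square list of lists with score(items[i], items[j]) at [i][j]."""
    return [[score(a, b) for b in items] for a in items]

# Toy scorer (assumed): Jaccard overlap of the characters in two words
def char_overlap(a, b):
    return len(set(a) & set(b)) / len(set(a) | set(b))

matrix = pairwise_matrix(["bank", "tank", "river"], char_overlap)
for row in matrix:
    print([round(v, 2) for v in row])
```

With spaCy documents, the loop above would reduce to `pairwise_matrix(docs, lambda a, b: a.similarity(b))`, and the result can be passed to `create_heatmap` unchanged.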
print(docs[0].vector)
Author abetancort commented on Jan 6, 2025.