Last active
May 25, 2023 21:24
fasttext_similarity_weirdness.ipynb
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"name": "fasttext_similarity_weirdness.ipynb", | |
"provenance": [], | |
"collapsed_sections": [], | |
"authorship_tag": "ABX9TyN/LQY3jVFwwrNjNQZUeOc0", | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/avidale/c6b1d13b32a36f19750cd01148560561/fasttext_similarity_weirdness.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "GOaB-MD8XHmZ", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"In this stub, I want to demonstrate some weirdness that happens when we use the gensim fasttext model to search for similar words." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Sb0Xoq8OUrUf", | |
"colab_type": "code", | |
"outputId": "b6f70c67-7dd8-4ac0-cb4d-4be85014f4ae", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 204 | |
} | |
}, | |
"source": [ | |
"!wget http://vectors.nlpl.eu/repository/20/181.zip" | |
], | |
"execution_count": 5, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"--2020-03-03 19:54:09-- http://vectors.nlpl.eu/repository/20/181.zip\n", | |
"Resolving vectors.nlpl.eu (vectors.nlpl.eu)... 129.240.189.225\n", | |
"Connecting to vectors.nlpl.eu (vectors.nlpl.eu)|129.240.189.225|:80... connected.\n", | |
"HTTP request sent, awaiting response... 200 OK\n", | |
"Length: 2622716250 (2.4G) [application/zip]\n", | |
"Saving to: ‘181.zip’\n", | |
"\n", | |
"181.zip 100%[===================>] 2.44G 23.0MB/s in 1m 54s \n", | |
"\n", | |
"2020-03-03 19:56:09 (22.0 MB/s) - ‘181.zip’ saved [2622716250/2622716250]\n", | |
"\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "RCnoZQ0AUzpB", | |
"colab_type": "code", | |
"outputId": "8dc1ae30-e184-4dac-81eb-5b0efc79050c", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 136 | |
} | |
}, | |
"source": [ | |
"!unzip 181.zip" | |
], | |
"execution_count": 6, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Archive: 181.zip\n", | |
" inflating: meta.json \n", | |
" inflating: model.model \n", | |
" inflating: model.model.vectors_ngrams.npy \n", | |
" inflating: model.model.vectors.npy \n", | |
" inflating: model.model.vectors_vocab.npy \n", | |
" inflating: README \n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "eqxHi7oUVBrL", | |
"colab_type": "code", | |
"outputId": "49af8c00-85a8-4f8b-e93a-b22e7b5d3187", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 68 | |
} | |
}, | |
"source": [ | |
"!ls" | |
], | |
"execution_count": 7, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"181.zip model.model.vectors_ngrams.npy README\n", | |
"meta.json model.model.vectors.npy\t sample_data\n", | |
"model.model model.model.vectors_vocab.npy\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "a_lX93QSVyCl", | |
"colab_type": "code", | |
"outputId": "28c19cc7-6acb-467a-bed6-d26c6aaa9840", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 511 | |
} | |
}, | |
"source": [ | |
"!pip install gensim==3.8.1" | |
], | |
"execution_count": 8, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Collecting gensim==3.8.1\n", | |
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/d1/dd/112bd4258cee11e0baaaba064060eb156475a42362e59e3ff28e7ca2d29d/gensim-3.8.1-cp36-cp36m-manylinux1_x86_64.whl (24.2MB)\n", | |
"\u001b[K |████████████████████████████████| 24.2MB 1.6MB/s \n", | |
"\u001b[?25hRequirement already satisfied: scipy>=0.18.1 in /usr/local/lib/python3.6/dist-packages (from gensim==3.8.1) (1.4.1)\n", | |
"Requirement already satisfied: six>=1.5.0 in /usr/local/lib/python3.6/dist-packages (from gensim==3.8.1) (1.12.0)\n", | |
"Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.6/dist-packages (from gensim==3.8.1) (1.9.0)\n", | |
"Requirement already satisfied: numpy>=1.11.3 in /usr/local/lib/python3.6/dist-packages (from gensim==3.8.1) (1.17.5)\n", | |
"Requirement already satisfied: boto3 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.8.1->gensim==3.8.1) (1.11.15)\n", | |
"Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.8.1->gensim==3.8.1) (2.21.0)\n", | |
"Requirement already satisfied: boto>=2.32 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.8.1->gensim==3.8.1) (2.49.0)\n", | |
"Requirement already satisfied: botocore<1.15.0,>=1.14.15 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.8.1->gensim==3.8.1) (1.14.15)\n", | |
"Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.8.1->gensim==3.8.1) (0.9.4)\n", | |
"Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.8.1->gensim==3.8.1) (0.3.3)\n", | |
"Requirement already satisfied: urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.8.1->gensim==3.8.1) (1.24.3)\n", | |
"Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.8.1->gensim==3.8.1) (3.0.4)\n", | |
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.8.1->gensim==3.8.1) (2019.11.28)\n", | |
"Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.8.1->gensim==3.8.1) (2.8)\n", | |
"Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /usr/local/lib/python3.6/dist-packages (from botocore<1.15.0,>=1.14.15->boto3->smart-open>=1.8.1->gensim==3.8.1) (2.6.1)\n", | |
"Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/lib/python3.6/dist-packages (from botocore<1.15.0,>=1.14.15->boto3->smart-open>=1.8.1->gensim==3.8.1) (0.15.2)\n", | |
"Installing collected packages: gensim\n", | |
" Found existing installation: gensim 3.6.0\n", | |
" Uninstalling gensim-3.6.0:\n", | |
" Successfully uninstalled gensim-3.6.0\n", | |
"Successfully installed gensim-3.8.1\n" | |
], | |
"name": "stdout" | |
}, | |
{ | |
"output_type": "display_data", | |
"data": { | |
"application/vnd.colab-display-data+json": { | |
"pip_warning": { | |
"packages": [ | |
"gensim" | |
] | |
} | |
} | |
}, | |
"metadata": { | |
"tags": [] | |
} | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "vtSWKrx1VavY", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"import gensim" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "v_UnFRKZU56Q", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"model = gensim.models.fasttext.FastTextKeyedVectors.load('model.model')" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "2X3prObOVd-Y", | |
"colab_type": "code", | |
"outputId": "1262c8b9-d409-4e6d-ec07-147489be475f", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
} | |
}, | |
"source": [ | |
"word = 'покемошечка'\n", | |
"word in model.vocab # we are deliberately taking an OOV word to demonstrate that similarity is incorrect with ngrams" | |
], | |
"execution_count": 3, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"False" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 3 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "wfZ8mNRPVoop", | |
"colab_type": "code", | |
"outputId": "0d01c04f-a34f-4226-b2fc-670cccc2feb7", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 187 | |
} | |
}, | |
"source": [ | |
"model.most_similar(word)" | |
], | |
"execution_count": 4, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"[('юлечка', 0.7381488680839539),\n", | |
" ('лялечка', 0.7292031645774841),\n", | |
" ('алечка', 0.708588182926178),\n", | |
" ('кошечка', 0.7078714370727539),\n", | |
" ('илюшечка', 0.7053546905517578),\n", | |
" ('лешечка', 0.701703667640686),\n", | |
" ('лилечка', 0.7000791430473328),\n", | |
" ('сашечка', 0.6995923519134521),\n", | |
" ('лёнечка', 0.6978040933609009),\n", | |
" ('лелечка', 0.6871213316917419)]" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 4 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "aBN2DSmobhEl", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Result is:\n", | |
"```\n", | |
"[('юлечка', 0.7381488680839539),\n", | |
" ('лялечка', 0.7292031645774841),\n", | |
" ('алечка', 0.708588182926178),\n", | |
" ('кошечка', 0.7078714370727539),\n", | |
" ('илюшечка', 0.7053546905517578),\n", | |
" ('лешечка', 0.701703667640686),\n", | |
" ('лилечка', 0.7000791430473328),\n", | |
" ('сашечка', 0.6995923519134521),\n", | |
" ('лёнечка', 0.6978040933609009),\n", | |
" ('лелечка', 0.6871213316917419)]\n", | |
" ```" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "3lF4YBhNVsFX", | |
"colab_type": "code", | |
"outputId": "8cfa2de3-e05a-40cc-a816-9439aafb0c5b", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
} | |
}, | |
"source": [ | |
"model.cosine_similarities(model['юлечка'], model['покемошечка'].reshape(1, -1))" | |
], | |
"execution_count": 5, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"array([0.74520236], dtype=float32)" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 5 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "prY-cPs5batJ", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Result is:\n", | |
"```\n", | |
"array([0.74520236], dtype=float32)\n", | |
"```" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "hySf8Kv4YxR0", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"What happens: the cosine similarities used for neighbor retrieval differ from the similarities calculated directly from word vectors. In the example above, `most_similar` reports 0.7381 for 'юлечка', while the direct computation gives 0.7452.\n", | |
"\n", | |
"Why it happens:\n", | |
"* normally, to build a vector for an OOV word, fasttext averages the vectors of its n-grams\n", | |
"* but if we pass `use_norm=True`, fasttext averages the *L2-normalized* n-gram vectors instead ([code](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L2090)). And that is wrong!\n", | |
"* when we look up the most similar words, exactly this option, `use_norm=True`, is used ([code](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L831)). How unfortunate!\n", | |
"* why averaging normalized vectors is wrong: it was never done when the model was trained and is normally never done when the model is applied, so the resulting vectors are most probably meaningless\n", | |
"* how to do it right: *first* average the n-gram vectors, and *then* normalize the result\n", | |
"\n", | |
"Call to action: rewrite the `word_vec` method of `FastTextKeyedVectors` so that it applies averaging and normalization in the right order." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "8rkv5ZNtWoxe", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"What we see: the word similarities used during the search do not match the cosine similarity computed directly from the word vectors." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "tk0PkBM1XW5F", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Now, why this happens:\n", | |
"* normally, when computing the vector of an OOV word, fasttext averages the n-gram vectors\n", | |
"* but if you pass `use_norm=True`, the L2-normalized n-gram vectors are averaged instead, and that is wrong!\n", | |
"* `most_similar` uses exactly `use_norm=True`\n", | |
"* how to do it right: first average the vectors, then normalize\n", | |
"\n", | |
"Here is the code: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L2090\n", | |
"\n", | |
"Why the current behavior is wrong: if the n-gram vectors are normalized before averaging, each one is divided by its own norm (and the norms differ!), so their mean is something the model has seen neither during training nor (in the normal scenario) at inference. And it is most likely not very meaningful." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "qHo6OMtI-v0u", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 51 | |
}, | |
"outputId": "8a6d5cb7-eab0-4f7a-9c47-3f1a23068226" | |
}, | |
"source": [ | |
"word = 'some_oov_word'\n", | |
"pairs = model.most_similar(word)\n", | |
"top_neighbor, top_simil = pairs[0]\n", | |
"print(top_simil)\n", | |
"print(model.cosine_similarities(model[word], model[top_neighbor].reshape(1, -1))[0])" | |
], | |
"execution_count": 6, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"0.7857677936553955\n", | |
"0.81707764\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "pQSCKJuT-3zf", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
} | |
] | |
} |
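The order-of-operations issue described in the notebook (averaging L2-normalized n-gram vectors versus averaging first and normalizing afterwards) can be illustrated without downloading the model. The following is a minimal NumPy sketch on made-up two-dimensional "n-gram vectors"; the function names and the toy data are assumptions for illustration only and are not gensim's actual implementation.

```python
import numpy as np


def l2_normalize(v):
    """Scale a vector (or each row of a matrix) to unit L2 norm."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def average_then_normalize(ngram_vectors):
    """Order proposed in the notebook: average raw n-gram vectors, then normalize."""
    return l2_normalize(ngram_vectors.mean(axis=0))


def normalize_then_average(ngram_vectors):
    """Order criticized in the notebook: normalize each n-gram vector, then average.

    The final normalization only rescales the result; it does not change its
    direction, so it does not affect cosine similarity.
    """
    return l2_normalize(l2_normalize(ngram_vectors).mean(axis=0))


# Hypothetical n-gram vectors with very different norms (toy data).
ngram_vectors = np.array([[10.0, 0.0],
                          [0.0, 1.0]])

right = average_then_normalize(ngram_vectors)
wrong = normalize_then_average(ngram_vectors)

# Both results are unit vectors, so their dot product is their cosine similarity.
cosine = float(right @ wrong)
print("average, then normalize:", right)
print("normalize, then average:", wrong)
print("cosine between the two: ", cosine)
```

When all n-gram vectors have the same norm, the two orders give the same direction; the discrepancy grows as the norms diverge, because normalizing first removes the extra weight that longer vectors would otherwise carry in the average.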