@myazdani
Created December 6, 2020 17:38
approximate-7UP-embedding.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "approximate-7UP-embedding.ipynb",
"provenance": [],
"authorship_tag": "ABX9TyNYkN8Gtxx0xDXgMSLdU3uP",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/myazdani/f9945ae5864ba894987b4c43ac0a4bd4/approximate-7up-embedding.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3TUXQR2cG3F9"
},
"source": [
"# Approximate OOV word emebddings with average word embeddings of similar terms\n",
"\n",
"If you are using word embeddings from SpaCy and a term is out-of-vocabulary, one easy idea is to approximate it with the average embeddings of terms that *are* in vocabulary. \n",
"\n",
"This ddemo notebook tries out this idea on an example term that we pretend is OOV: `7UP`\n",
"\n",
"\n",
"`7UP` is actually in our dictionary but we are going to pretend that it's not and approximate it. Then we will see how are approximation compares to the actual embedding for `7UP` to get a sense for how well our approximation works. \n",
"\n",
"Note: \n",
"To use the large SpaCy language model in Colab you have to download it and then restart the runtime. Just the way things are 🤷‍♂️"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "M7TkxyGHF3M4",
"outputId": "ad08550c-783b-4cbe-d8eb-26677602dbea"
},
"source": [
"!python -m spacy download en_core_web_lg"
],
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": [
"Requirement already satisfied: en_core_web_lg==2.2.5 from https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz#egg=en_core_web_lg==2.2.5 in /usr/local/lib/python3.6/dist-packages (2.2.5)\n",
"Requirement already satisfied: spacy>=2.2.2 in /usr/local/lib/python3.6/dist-packages (from en_core_web_lg==2.2.5) (2.2.4)\n",
"Requirement already satisfied: thinc==7.4.0 in /usr/local/lib/python3.6/dist-packages (from spacy>=2.2.2->en_core_web_lg==2.2.5) (7.4.0)\n",
"Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.6/dist-packages (from spacy>=2.2.2->en_core_web_lg==2.2.5) (2.23.0)\n",
"Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.6/dist-packages (from spacy>=2.2.2->en_core_web_lg==2.2.5) (2.0.4)\n",
"Requirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.6/dist-packages (from spacy>=2.2.2->en_core_web_lg==2.2.5) (1.1.3)\n",
"Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /usr/local/lib/python3.6/dist-packages (from spacy>=2.2.2->en_core_web_lg==2.2.5) (0.8.0)\n",
"Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.6/dist-packages (from spacy>=2.2.2->en_core_web_lg==2.2.5) (1.0.4)\n",
"Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.6/dist-packages (from spacy>=2.2.2->en_core_web_lg==2.2.5) (1.0.0)\n",
"Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from spacy>=2.2.2->en_core_web_lg==2.2.5) (50.3.2)\n",
"Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.6/dist-packages (from spacy>=2.2.2->en_core_web_lg==2.2.5) (1.18.5)\n",
"Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.6/dist-packages (from spacy>=2.2.2->en_core_web_lg==2.2.5) (4.41.1)\n",
"Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.6/dist-packages (from spacy>=2.2.2->en_core_web_lg==2.2.5) (1.0.4)\n",
"Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from spacy>=2.2.2->en_core_web_lg==2.2.5) (3.0.4)\n",
"Requirement already satisfied: blis<0.5.0,>=0.4.0 in /usr/local/lib/python3.6/dist-packages (from spacy>=2.2.2->en_core_web_lg==2.2.5) (0.4.1)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_lg==2.2.5) (2020.11.8)\n",
"Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_lg==2.2.5) (1.24.3)\n",
"Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_lg==2.2.5) (2.10)\n",
"Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_lg==2.2.5) (3.0.4)\n",
"Requirement already satisfied: importlib-metadata>=0.20; python_version < \"3.8\" in /usr/local/lib/python3.6/dist-packages (from catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->en_core_web_lg==2.2.5) (2.0.0)\n",
"Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.6/dist-packages (from importlib-metadata>=0.20; python_version < \"3.8\"->catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->en_core_web_lg==2.2.5) (3.4.0)\n",
"\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
"You can now load the model via spacy.load('en_core_web_lg')\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "X0Y_1A2HGDPH"
},
"source": [
"import spacy\n",
"nlp = spacy.load(\"en_core_web_lg\")\n",
"\n",
"from sklearn.preprocessing import normalize\n",
"import numpy as np"
],
"execution_count": 2,
"outputs": []
},
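{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quick sanity check: the intro claims `7UP` really is in the model's vocabulary. spaCy's `Vocab.has_vector` reports whether a string has a real vector in the loaded model, so we can confirm that `7UP` does while a made-up token does not (the string `zzzqux` below is just a hypothetical OOV example)."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Confirm that \"7UP\" has a real vector in en_core_web_lg,\n",
"# while a made-up string does not (hypothetical OOV example).\n",
"print(\"'7UP' in vocabulary:\", nlp.vocab.has_vector(\"7UP\"))\n",
"print(\"'zzzqux' in vocabulary:\", nlp.vocab.has_vector(\"zzzqux\"))"
],
"execution_count": null,
"outputs": []
},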
{
"cell_type": "code",
"metadata": {
"id": "Whjg9TT6GISG"
},
"source": [
"# helper functions\n",
"w2vec = lambda x: nlp(x).vector\n",
"word_dist = lambda x,y: np.sum((normalize(x[np.newaxis,:]) - \n",
" normalize(y[np.newaxis,:]))**2)"
],
"execution_count": 3,
"outputs": []
},
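{
"cell_type": "markdown",
"metadata": {},
"source": [
"Side note on the distance: `word_dist` is the squared Euclidean distance between L2-normalized vectors, which for unit vectors works out to `2 * (1 - cosine similarity)`, so ranking terms by `word_dist` is the same as ranking them by cosine similarity. A quick illustrative check with two arbitrary in-vocabulary terms:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# For unit vectors a and b, ||a - b||^2 = 2 - 2 * (a . b),\n",
"# so word_dist equals 2 * (1 - cosine similarity).\n",
"a, b = w2vec(\"pepsi\"), w2vec(\"sprite\")\n",
"cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))\n",
"print(\"word_dist:       \", word_dist(a, b))\n",
"print(\"2 * (1 - cosine):\", 2 * (1 - cos))"
],
"execution_count": null,
"outputs": []
},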
{
"cell_type": "code",
"metadata": {
"id": "_k5LW00aGPv7"
},
"source": [
"## Apprpoximate \"7-UP\" with a combination of other sugary drinks\n",
"sugary_drinks = [\"pepsi\", \"sprite\", \"coke\", \"soda\"] \n",
"\n",
"sevenup_hat = np.mean([w2vec(drink) for drink in sugary_drinks], axis = 0)"
],
"execution_count": 4,
"outputs": []
},
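{
"cell_type": "markdown",
"metadata": {},
"source": [
"The averaging above can be wrapped in a small reusable helper. This is just a sketch of the same idea (the name `approx_embedding` is illustrative, not a spaCy API): given a list of in-vocabulary neighbor terms, return the mean of their vectors."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Sketch of a reusable helper for the same averaging trick.\n",
"# approx_embedding is an illustrative name, not a spaCy API.\n",
"def approx_embedding(similar_terms):\n",
"    \"\"\"Approximate an OOV term's vector as the mean of its neighbors' vectors.\"\"\"\n",
"    return np.mean([w2vec(term) for term in similar_terms], axis=0)\n",
"\n",
"# Should reproduce sevenup_hat exactly.\n",
"assert np.allclose(approx_embedding(sugary_drinks), sevenup_hat)"
],
"execution_count": null,
"outputs": []
},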
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "fxsg9cKyGSgB",
"outputId": "b550c096-cde2-4198-9110-eeed87b36384"
},
"source": [
"sevenup = w2vec(\"7UP\")\n",
"print(\"Distance between '7UP' and approximate '7UP':\", \n",
" word_dist(sevenup, sevenup_hat))\n",
"for term in sugary_drinks + [\"water\", \"soda water\", \"banana\"]:\n",
" print(f\"Distance between '7UP' and '{term}'\", \n",
" word_dist(sevenup, w2vec(term)))"
],
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"text": [
"Distance between '7UP' and approximate '7UP': 0.8895961\n",
"Distance between '7UP' and 'pepsi' 0.9475707\n",
"Distance between '7UP' and 'sprite' 1.1475383\n",
"Distance between '7UP' and 'coke' 1.2436303\n",
"Distance between '7UP' and 'soda' 1.1630545\n",
"Distance between '7UP' and 'water' 1.8564961\n",
"Distance between '7UP' and 'soda water' 1.4359412\n",
"Distance between '7UP' and 'banana' 1.7483511\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "3_12XoxMGwD7"
},
"source": [
""
],
"execution_count": null,
"outputs": []
}
]
}