{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/firmai/e557460de6880cf3c71c1cb5ad80cd8d/doc2vec-example.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LCgVnQopb6TI"
},
"source": [
"# Doc2Vec demonstration \n",
"\n",
"In this notebook, let us take a look at how to \"learn\" document embeddings and use them for text classification. We will be using the dataset of \"Sentiment and Emotion in Text\" from [Kaggle](https://www.kaggle.com/c/sa-emotions/data).\n",
"\n",
"\"In a variation on the popular task of sentiment analysis, this dataset contains labels for the emotional content (such as happiness, sadness, and anger) of texts. Hundreds to thousands of examples across 13 labels. A subset of this data is used in an experiment we uploaded to Microsoft’s Cortana Intelligence Gallery.\"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "KX5dKXdcaENd",
"outputId": "956f503d-1a2c-4af1-aad5-a5da021ae29b"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting nltk==3.5\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/92/75/ce35194d8e3022203cca0d2f896dbb88689f9b3fce8e9f9cff942913519d/nltk-3.5.zip (1.4MB)\n",
"\u001b[K |████████████████████████████████| 1.4MB 5.1MB/s \n",
"\u001b[?25hRequirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from nltk==3.5) (7.1.2)\n",
"Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from nltk==3.5) (1.0.1)\n",
"Requirement already satisfied: regex in /usr/local/lib/python3.7/dist-packages (from nltk==3.5) (2019.12.20)\n",
"Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from nltk==3.5) (4.41.1)\n",
"Building wheels for collected packages: nltk\n",
" Building wheel for nltk (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for nltk: filename=nltk-3.5-cp37-none-any.whl size=1434691 sha256=a68222bfb8c06405a2c5f264264ffa3daf49f5d73637541f961a024360751028\n",
" Stored in directory: /root/.cache/pip/wheels/ae/8c/3f/b1fe0ba04555b08b57ab52ab7f86023639a526d8bc8d384306\n",
"Successfully built nltk\n",
"Installing collected packages: nltk\n",
" Found existing installation: nltk 3.2.5\n",
" Uninstalling nltk-3.2.5:\n",
" Successfully uninstalled nltk-3.2.5\n",
"Successfully installed nltk-3.5\n",
"Requirement already satisfied: pandas==1.1.5 in /usr/local/lib/python3.7/dist-packages (1.1.5)\n",
"Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas==1.1.5) (2018.9)\n",
"Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas==1.1.5) (2.8.1)\n",
"Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.7/dist-packages (from pandas==1.1.5) (1.19.5)\n",
"Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas==1.1.5) (1.15.0)\n",
"Collecting gensim==3.8.3\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/5c/4e/afe2315e08a38967f8a3036bbe7e38b428e9b7a90e823a83d0d49df1adf5/gensim-3.8.3-cp37-cp37m-manylinux1_x86_64.whl (24.2MB)\n",
"\u001b[K |████████████████████████████████| 24.2MB 1.3MB/s \n",
"\u001b[?25hRequirement already satisfied: numpy>=1.11.3 in /usr/local/lib/python3.7/dist-packages (from gensim==3.8.3) (1.19.5)\n",
"Requirement already satisfied: scipy>=0.18.1 in /usr/local/lib/python3.7/dist-packages (from gensim==3.8.3) (1.4.1)\n",
"Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.7/dist-packages (from gensim==3.8.3) (5.1.0)\n",
"Requirement already satisfied: six>=1.5.0 in /usr/local/lib/python3.7/dist-packages (from gensim==3.8.3) (1.15.0)\n",
"Installing collected packages: gensim\n",
" Found existing installation: gensim 3.6.0\n",
" Uninstalling gensim-3.6.0:\n",
" Successfully uninstalled gensim-3.6.0\n",
"Successfully installed gensim-3.8.3\n",
"Collecting scikit-learn==0.21.3\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/9f/c5/e5267eb84994e9a92a2c6a6ee768514f255d036f3c8378acfa694e9f2c99/scikit_learn-0.21.3-cp37-cp37m-manylinux1_x86_64.whl (6.7MB)\n",
"\u001b[K |████████████████████████████████| 6.7MB 5.1MB/s \n",
"\u001b[?25hRequirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn==0.21.3) (1.0.1)\n",
"Requirement already satisfied: numpy>=1.11.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn==0.21.3) (1.19.5)\n",
"Requirement already satisfied: scipy>=0.17.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn==0.21.3) (1.4.1)\n",
"Installing collected packages: scikit-learn\n",
" Found existing installation: scikit-learn 0.22.2.post1\n",
" Uninstalling scikit-learn-0.22.2.post1:\n",
" Successfully uninstalled scikit-learn-0.22.2.post1\n",
"Successfully installed scikit-learn-0.21.3\n"
]
}
],
"source": [
"# To install only the requirements of this notebook, uncomment the lines below and run this cell\n",
"\n",
"# ===========================\n",
"\n",
"!pip install nltk==3.5\n",
"!pip install pandas==1.1.5\n",
"!pip install gensim==3.8.3\n",
"!pip install scikit-learn==0.21.3\n",
"\n",
"# ==========================="
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "CIlwQe1S4EpL"
},
"outputs": [],
"source": [
"# To install the requirements for the entire chapter, uncomment the lines below and run this cell\n",
"\n",
"# ===========================\n",
"\n",
"# try:\n",
"# import google.colab\n",
"# !curl https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/ch4-requirements.txt | xargs -n 1 -L 1 pip install\n",
"# except ModuleNotFoundError:\n",
"# !pip install -r \"ch4-requirements.txt\"\n",
"\n",
"# ==========================="
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "hSB6W1seb6TJ",
"outputId": "e93459c9-fd82-4d22-852b-819faeb430a6"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to /root/nltk_data...\n",
"[nltk_data] Unzipping corpora/stopwords.zip.\n"
]
}
],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"import pandas as pd\n",
"import nltk\n",
"nltk.download('stopwords')\n",
"from nltk.tokenize import TweetTokenizer\n",
"from nltk.corpus import stopwords\n",
"from sklearn.model_selection import train_test_split\n",
"from gensim.models.doc2vec import Doc2Vec, TaggedDocument"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "NGAFbmrA4EpM",
"outputId": "f78def1c-c291-4fba-dd41-f24f1456757c"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2021-07-16 08:27:55-- https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/Data/Sentiment%20and%20Emotion%20in%20Text/train_data.csv\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 2479133 (2.4M) [text/plain]\n",
"Saving to: ‘DATAPATH/train_data.csv’\n",
"\n",
"train_data.csv 100%[===================>] 2.36M --.-KB/s in 0.1s \n",
"\n",
"2021-07-16 08:27:55 (22.4 MB/s) - ‘DATAPATH/train_data.csv’ saved [2479133/2479133]\n",
"\n",
"--2021-07-16 08:27:55-- https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/Data/Sentiment%20and%20Emotion%20in%20Text/test_data.csv\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 783640 (765K) [text/plain]\n",
"Saving to: ‘DATAPATH/test_data.csv’\n",
"\n",
"test_data.csv 100%[===================>] 765.27K --.-KB/s in 0.05s \n",
"\n",
"2021-07-16 08:27:55 (15.4 MB/s) - ‘DATAPATH/test_data.csv’ saved [783640/783640]\n",
"\n",
"total 3.2M\n",
"drwxr-xr-x 2 root root 4.0K Jul 16 08:27 .\n",
"drwxr-xr-x 1 root root 4.0K Jul 16 08:27 ..\n",
"-rw-r--r-- 1 root root 766K Jul 16 08:27 test_data.csv\n",
"-rw-r--r-- 1 root root 2.4M Jul 16 08:27 train_data.csv\n"
]
}
],
"source": [
"#Load the dataset and explore.\n",
"try:\n",
" from google.colab import files\n",
" !wget -P DATAPATH https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/Data/Sentiment%20and%20Emotion%20in%20Text/train_data.csv\n",
" !wget -P DATAPATH https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/Data/Sentiment%20and%20Emotion%20in%20Text/test_data.csv\n",
" !ls -lah DATAPATH\n",
" filepath = \"DATAPATH/train_data.csv\"\n",
"except ModuleNotFoundError:\n",
" filepath = \"Data/Sentiment and Emotion in Text/train_data.csv\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 221
},
"id": "lSvnHBYPb6TQ",
"outputId": "b992755a-470e-470b-eb59-e4225711f252"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(30000, 2)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sentiment</th>\n",
" <th>content</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>empty</td>\n",
" <td>@tiffanylue i know i was listenin to bad habi...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>sadness</td>\n",
" <td>Layin n bed with a headache ughhhh...waitin o...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>sadness</td>\n",
" <td>Funeral ceremony...gloomy friday...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>enthusiasm</td>\n",
" <td>wants to hang out with friends SOON!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>neutral</td>\n",
" <td>@dannycastillo We want to trade with someone w...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sentiment content\n",
"0 empty @tiffanylue i know i was listenin to bad habi...\n",
"1 sadness Layin n bed with a headache ughhhh...waitin o...\n",
"2 sadness Funeral ceremony...gloomy friday...\n",
"3 enthusiasm wants to hang out with friends SOON!\n",
"4 neutral @dannycastillo We want to trade with someone w..."
]
},
"execution_count": 5,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv(filepath)\n",
"print(df.shape)\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "5JEI6SH7b6TU",
"outputId": "7c4bccf9-3c39-4e43-cde8-3989a7a002d0"
},
"outputs": [
{
"data": {
"text/plain": [
"worry 7433\n",
"neutral 6340\n",
"sadness 4828\n",
"happiness 2986\n",
"love 2068\n",
"surprise 1613\n",
"hate 1187\n",
"fun 1088\n",
"relief 1021\n",
"empty 659\n",
"enthusiasm 522\n",
"boredom 157\n",
"anger 98\n",
"Name: sentiment, dtype: int64"
]
},
"execution_count": 6,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"df['sentiment'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "CHajyKpmb6TY",
"outputId": "bbb05164-f107-4b7c-fedb-145a3b2d1ca3"
},
"outputs": [
{
"data": {
"text/plain": [
"(16759, 2)"
]
},
"execution_count": 7,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"#Let us take the top 3 categories and leave out the rest.\n",
"shortlist = ['neutral', \"happiness\", \"worry\"]\n",
"df_subset = df[df['sentiment'].isin(shortlist)]\n",
"df_subset.shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "m2oiZzU5b6Tf"
},
"source": [
"# Text pre-processing:\n",
"Tweets are different. Somethings to consider:\n",
"- Removing @mentions, and urls perhaps?\n",
"- using NLTK Tweet tokenizer instead of a regular one\n",
"- stopwords, numbers as usual."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Rl-FfMdLb6Th",
"outputId": "818e0510-afdb-4732-fe69-c6119ca695c1"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"16759 16759\n"
]
}
],
"source": [
"#strip_handles removes personal information such as twitter handles, which don't\n",
"#contribute to emotion in the tweet. preserve_case=False converts everything to lowercase.\n",
"tweeter = TweetTokenizer(strip_handles=True,preserve_case=False)\n",
"mystopwords = set(stopwords.words(\"english\"))\n",
"\n",
"#Function to tokenize tweets, remove stopwords and numbers. \n",
"#Keeping punctuations and emoticon symbols could be relevant for this task!\n",
"def preprocess_corpus(texts):\n",
" def remove_stops_digits(tokens):\n",
" #Nested function that removes stopwords and digits from a list of tokens\n",
" return [token for token in tokens if token not in mystopwords and not token.isdigit()]\n",
" #This return statement below uses the above function to process twitter tokenizer output further. \n",
" return [remove_stops_digits(tweeter.tokenize(content)) for content in texts]\n",
"\n",
"#df_subset contains only the three categories we chose. \n",
"mydata = preprocess_corpus(df_subset['content'])\n",
"mycats = df_subset['sentiment']\n",
"print(len(mydata), len(mycats))"
]
},
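{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (added for illustration; not part of the original notebook): running `preprocess_corpus` on a made-up tweet should strip the handle, lowercase the text, and drop stopwords and digits, while keeping punctuation and emoticon symbols."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(preprocess_corpus([\"@SomeUser I LOVED the movie, 10 out of 10 !!!\"]))\n",
"#Expect something like: [['loved', 'movie', ',', '!', '!', '!']]"
]
},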
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "rsGwfVebb6Tl",
"outputId": "c19bc96f-513c-45b6-d476-b95899ab7eca"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model Saved\n"
]
}
],
"source": [
"#Split data into train and test, following the usual process\n",
"train_data, test_data, train_cats, test_cats = train_test_split(mydata,mycats,random_state=1234)\n",
"\n",
"#prepare training data in doc2vec format:\n",
"train_doc2vec = [TaggedDocument((d), tags=[str(i)]) for i, d in enumerate(train_data)]\n",
"#Train a doc2vec model to learn tweet representations. Use only training data!!\n",
"model = Doc2Vec(vector_size=50, alpha=0.025, min_count=5, dm =1, epochs=100)\n",
"model.build_vocab(train_doc2vec)\n",
"model.train(train_doc2vec, total_examples=model.corpus_count, epochs=model.epochs)\n",
"model.save(\"d2v.model\")\n",
"print(\"Model Saved\")"
]
},
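{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before using the embeddings for classification, we can peek at what the model learned (an added sketch, not part of the original notebook). In gensim 3.8.x, the learned document vectors live in `model.docvecs`, keyed by the string tags we assigned above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(model.docvecs['0'][:10]) #first 10 dimensions of the learned vector for training doc '0'\n",
"print(model.docvecs.most_similar('0', topn=3)) #tags of the 3 most similar training docs\n",
"print(len(model.wv.vocab)) #number of distinct words that met min_count"
]
},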
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "hTqo26Vsb6Ts",
"outputId": "cd16346c-ca81-4dc7-c269-d9ccf83a774d"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" happiness 0.37 0.45 0.41 713\n",
" neutral 0.46 0.53 0.49 1595\n",
" worry 0.58 0.46 0.51 1882\n",
"\n",
" accuracy 0.48 4190\n",
" macro avg 0.47 0.48 0.47 4190\n",
"weighted avg 0.50 0.48 0.49 4190\n",
"\n"
]
}
],
"source": [
"#Infer the feature representation for training and test data using the trained model\n",
"model= Doc2Vec.load(\"d2v.model\")\n",
"#infer in multiple steps to get a stable representation. \n",
"train_vectors = [model.infer_vector(list_of_tokens, steps=50) for list_of_tokens in train_data]\n",
"test_vectors = [model.infer_vector(list_of_tokens, steps=50) for list_of_tokens in test_data]\n",
"\n",
"#Use any regular classifier like logistic regression\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"myclass = LogisticRegression(class_weight=\"balanced\") #because classes are not balanced. \n",
"myclass.fit(train_vectors, train_cats)\n",
"\n",
"preds = myclass.predict(test_vectors)\n",
"from sklearn.metrics import classification_report, confusion_matrix\n",
"print(classification_report(test_cats, preds))\n",
"\n",
"#print(confusion_matrix(test_cats,preds))"
]
}
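,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a final end-to-end sketch (added for illustration; not part of the original notebook), we can classify a single unseen tweet by reusing the pieces trained above: `preprocess_corpus`, the loaded `Doc2Vec` model, and the `myclass` logistic regression classifier. The sample tweet text is made up."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"new_tweet = \"ugh, stuck in traffic again and already late for work\" #hypothetical example text\n",
"new_tokens = preprocess_corpus([new_tweet])[0] #apply the same preprocessing as the training data\n",
"new_vector = model.infer_vector(new_tokens, steps=50) #same inference settings as above\n",
"print(myclass.predict([new_vector])) #predicts one of 'happiness', 'neutral', 'worry'"
]
}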
],
"metadata": {
"accelerator": "GPU",
"colab": {
"name": "Doc2Vec Example.ipynb",
"provenance": [],
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 0
}