Last active
February 27, 2023 22:19
Guided LDA using gensim
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Guided LDA using gensim" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<em>Aleem Juma</em> \n", | |
"<em>March 6, 2019</em>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The gensim package for Python is a well-known library of text processing routines. One of the models included in the package is Latent Dirichlet Allocation (LDA), a topic modeling framework. LDA is normally used as an unsupervised learning method in which topics are identified from word co-occurrence probabilities; however, the gensim implementation also lets us seed terms with prior topic probabilities. This turns a fully unsupervised training method into a semi-supervised one. It is semi-supervised because we are not tagging all terms or documents with topic probabilities, just a few, but it turns out that's enough to push the model in a certain direction." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In this writeup I will show how to build an LDA model in gensim with seed words, and plot the resulting topic probability distribution that has been assigned to words. I will then train further models with seed probabilities and explore how this leads the model to different topic probability distributions." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Importing libraries" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We will of course need `gensim`, and we will be working with matrix manipulation so we will need `numpy`. `nltk` provides support functions for language processing and we will be visualizing results with `matplotlib`." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import warnings\n", | |
"warnings.filterwarnings(action='ignore', category=UserWarning)\n", | |
"import gensim\n", | |
"import numpy as np\n", | |
"import nltk\n", | |
"from nltk.corpus import stopwords\n", | |
"from nltk.stem import WordNetLemmatizer\n", | |
"import matplotlib.pyplot as plt\n", | |
"%matplotlib inline" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"There are a few additional data packages that `nltk` needs for the stop word list, the POS tagging feature and the WordNet lemmatizer. We have to make sure those are downloaded." | |
] | |
}, | |
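{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# uncomment on first run to fetch the nltk data used below\n", | |
"# nltk.download('stopwords')\n", | |
"# nltk.download('averaged_perceptron_tagger')\n", | |
"# nltk.download('wordnet')" | |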
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let's set up a simple corpus of text strings to work with. The first two seem to relate to food, the next two to animals, and the last one a bit of both." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"5" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"txt = [\n", | |
" 'I like to eat broccoli and bananas.',\n", | |
" 'I munched a banana and spinach smoothie for breakfast.',\n", | |
" 'Chinchillas and kittens are cute.',\n", | |
" 'My sister adopted a kitten yesterday.',\n", | |
" 'Look at this cute hamster munching on a piece of broccoli.'\n", | |
"]\n", | |
"len(txt)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The first step in most language processing analytics is preprocessing the text strings. We will be running the following on the text: \n", | |
"* `gensim.utils.simple_preprocess` - does a bunch of pre-processing steps such as tokenizing, removing punctuation and converting to lower case \n", | |
"* lemmatization - turning a word into its base form, e.g. 'shaving' -> 'shave' (note this is not the same thing as stemming, which would turn 'shaving' to 'shav')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In order to perform the lemmatization, we use the WordNet lexical database provided through the `nltk` package. The `WordNetLemmatizer.lemmatize()` function takes a part-of-speech tag to tell it whether we're passing a noun, verb, adjective or adverb. This makes a difference: for example, 'shaving' could be used as a verb, but it could also be used as a noun, as in \"wood shaving\". If we pass 'shaving' and indicate that it's a noun, the lemmatization function should leave it alone since it's already in its base form. As a verb, however, it should be reduced to 'shave'. \n", | |
"Identifying the part of speech of any particular word is not easy, but `nltk` again comes to the rescue with a part-of-speech tagger that returns a suitable tag for each word in a given text." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The twist is that the `nltk.pos_tag` function returns the Penn Treebank tag for the word but we just want whether the word is a noun, verb, adjective or adverb. We need a short simplification routine to translate from the Penn tag to a simpler tag." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# simplify Penn tags to n (NOUN), v (VERB), a (ADJECTIVE) or r (ADVERB)\n", | |
"def simplify(penn_tag):\n", | |
"    pre = penn_tag[0]\n", | |
"    if (pre == 'J'):\n", | |
"        return 'a'\n", | |
"    elif (pre == 'R'):\n", | |
"        return 'r'\n", | |
"    elif (pre == 'V'):\n", | |
"        return 'v'\n", | |
"    else:\n", | |
"        return 'n'" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now we're ready to perform some simple preprocessing on the text." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def preprocess(text):\n", | |
"    stop_words = stopwords.words('english')\n", | |
"    toks = gensim.utils.simple_preprocess(str(text), deacc=True)\n", | |
"    wn = WordNetLemmatizer()\n", | |
"    return [wn.lemmatize(tok, simplify(pos)) for tok, pos in nltk.pos_tag(toks) if tok not in stop_words]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[['like', 'eat', 'broccoli', 'banana'],\n", | |
" ['munch', 'banana', 'spinach', 'smoothie', 'breakfast'],\n", | |
" ['chinchilla', 'kitten', 'cute'],\n", | |
" ['sister', 'adopt', 'kitten', 'yesterday'],\n", | |
" ['look', 'cute', 'hamster', 'munch', 'piece', 'broccoli']]" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"corp = [preprocess(line) for line in txt]\n", | |
"corp" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As you can see, words have been reduced to their base form and converted to lower case. We have also omitted 'stop words' like 'this', 'on' and 'are', which don't contribute much to the topic probability." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"17" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"dictionary = gensim.corpora.Dictionary(corp)\n", | |
"len(dictionary)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We have a vocabulary of 17 words after removing stop words." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The LDA algorithm implementation in `gensim` reads the strings in a 'bag of words' format. This structure lists each distinct word in the sentence once, along with the number of times it occurs in the sentence. The gensim dictionary conveniently provides the `doc2bow` function that converts a line into its respective 'bow' format." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[[(0, 1), (1, 1), (2, 1), (3, 1)],\n", | |
" [(0, 1), (4, 1), (5, 1), (6, 1), (7, 1)],\n", | |
" [(8, 1), (9, 1), (10, 1)],\n", | |
" [(10, 1), (11, 1), (12, 1), (13, 1)],\n", | |
" [(1, 1), (5, 1), (9, 1), (14, 1), (15, 1), (16, 1)]]" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"bow = [dictionary.doc2bow(line) for line in corp]\n", | |
"bow" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The model parameter that holds the prior belief about how terms are allocated to topics is called `eta` in the gensim implementation. When it is not provided, gensim assumes a symmetric (even) prior across terms and topics; when it is given as the keyword `'auto'`, gensim learns an asymmetric prior from the corpus. The question we need to ask is, if we provide a non-uniform matrix as the eta parameter, does that affect the topic distribution assigned to terms and documents?" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"To find out, we'll first train a topic model on the corpus of sentences we set up, using the `'auto'` keyword. We will then train a model using a prior distribution skewed in the same direction as the auto-generated model, just for fun, and then train another model using a prior distribution to try to push the model in the opposite direction." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We'll set up a function that displays the probability distribution calculated by the algorithm so that we can see how the topics have been allocated across terms." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def viz_model(model, modeldict):\n", | |
"    ntopics = model.num_topics\n", | |
"    # top words associated with the resulting topics\n", | |
"    topics = ['Topic {}: {}'.format(t,modeldict[w]) for t in range(ntopics) for w,p in model.get_topic_terms(t, topn=1)]\n", | |
"    terms = [modeldict[w] for w in modeldict.keys()]\n", | |
"    fig,ax=plt.subplots()\n", | |
"    ax.imshow(model.get_topics()) # plot the numpy matrix\n", | |
"    ax.set_xticks(list(modeldict.keys())) # set up the x-axis (one tick per term id)\n", | |
"    ax.set_xticklabels(terms, rotation=90)\n", | |
"    ax.set_yticks(np.arange(ntopics)) # set up the y-axis\n", | |
"    ax.set_yticklabels(topics)\n", | |
"    plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We will run the following function for each test. It trains a model with our prior distribution (or `'auto'`), plots the model, prints the top terms for each topic and shows the topic allocation for each sentence in our corpus." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def test_eta(eta, dictionary, ntopics, print_topics=True, print_dist=True):\n", | |
"    np.random.seed(42) # set the random seed for repeatability\n", | |
"    bow = [dictionary.doc2bow(line) for line in corp] # get the bow-format lines with the set dictionary\n", | |
"    with (np.errstate(divide='ignore')): # ignore divide-by-zero warnings\n", | |
"        model = gensim.models.ldamodel.LdaModel(\n", | |
"            corpus=bow, id2word=dictionary, num_topics=ntopics,\n", | |
"            random_state=42, chunksize=100, eta=eta,\n", | |
"            eval_every=-1, update_every=1,\n", | |
"            passes=150, alpha='auto', per_word_topics=True)\n", | |
"    # visualize the model term topics\n", | |
"    viz_model(model, dictionary)\n", | |
"    # note: log_perplexity returns the per-word likelihood bound, not the perplexity itself\n", | |
"    print('Perplexity: {:.2f}'.format(model.log_perplexity(bow)))\n", | |
"    if print_topics:\n", | |
"        # display the top terms for each topic\n", | |
"        for topic in range(ntopics):\n", | |
"            print('Topic {}: {}'.format(topic, [dictionary[w] for w,p in model.get_topic_terms(topic, topn=3)]))\n", | |
"    if print_dist:\n", | |
"        # display the topic probabilities for each document\n", | |
"        for line,bag in zip(txt,bow):\n", | |
"            doc_topics = ['({}, {:.1%})'.format(topic, prob) for topic,prob in model.get_document_topics(bag)]\n", | |
"            print('{} {}'.format(line, doc_topics))\n", | |
"    return model" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Before we try a custom prior distribution, let's see how the model does without one, passing `'auto'` so that gensim chooses the prior itself." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Perplexity: -2.98\n", | |
"Topic 0: ['banana', 'broccoli', 'munch']\n", | |
"Topic 1: ['kitten', 'cute', 'broccoli']\n", | |
"I like to eat broccoli and bananas. ['(0, 99.1%)']\n", | |
"I munched a banana and spinach smoothie for breakfast. ['(0, 99.2%)']\n", | |
"Chinchillas and kittens are cute. ['(1, 99.1%)']\n", | |
"My sister adopted a kitten yesterday. ['(1, 99.4%)']\n", | |
"Look at this cute hamster munching on a piece of broccoli. ['(1, 99.6%)']\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"<gensim.models.ldamodel.LdaModel at 0x2b8c796eb70>" | |
] | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"test_eta('auto',dictionary,ntopics=2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Not bad. Without any guidance, the model correctly separates the first two sentences and the next two into different topics, but fails to identify that the last sentence contains elements of both: it assigns it almost entirely (99.6%) to topic 1." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"To define a prior distribution, we need to create a numpy matrix with the same number of rows and columns as topics and terms, respectively. We then populate that matrix with our prior distribution. To do this we pre-populate all the matrix elements with 1, then overwrite the elements that correspond to our 'guided' term-topic distribution with a really high number, and finally normalize each term's column so that it sums to 1 across topics." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def create_eta(priors, etadict, ntopics):\n", | |
"    eta = np.full(shape=(ntopics, len(etadict)), fill_value=1.0) # create a float (ntopics, nterms) matrix and fill with 1\n", | |
"    for word, topic in priors.items(): # for each word in the list of priors\n", | |
"        keyindex = [index for index,term in etadict.items() if term==word] # look up the word in the dictionary\n", | |
"        if (len(keyindex)>0): # if it's in the dictionary\n", | |
"            eta[topic,keyindex[0]] = 1e7 # put a large number in there\n", | |
"    eta = np.divide(eta, eta.sum(axis=0)) # normalize so that the probabilities sum to 1 over all topics\n", | |
"    return eta" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let's start with a set of priors that assigns words to the same topics the unguided model found." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Perplexity: -1.30\n", | |
"Topic 0: ['broccoli', 'banana', 'munch']\n", | |
"Topic 1: ['kitten', 'cute', 'sister']\n", | |
"I like to eat broccoli and bananas. ['(0, 96.3%)', '(1, 3.7%)']\n", | |
"I munched a banana and spinach smoothie for breakfast. ['(0, 97.0%)', '(1, 3.0%)']\n", | |
"Chinchillas and kittens are cute. ['(0, 4.8%)', '(1, 95.2%)']\n", | |
"My sister adopted a kitten yesterday. ['(0, 3.7%)', '(1, 96.3%)']\n", | |
"Look at this cute hamster munching on a piece of broccoli. ['(0, 40.4%)', '(1, 59.6%)']\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"<gensim.models.ldamodel.LdaModel at 0x2b8c7c99710>" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"apriori_original = {\n", | |
" 'banana':0, 'broccoli':0, 'munch':0,\n", | |
" 'cute':1, 'kitten':1 # we'll leave out broccoli from this one!\n", | |
"}\n", | |
"eta = create_eta(apriori_original, dictionary, 2)\n", | |
"test_eta(eta, dictionary, 2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"When we guide the distribution to allocate a few words towards the 'foody' topic and others towards the 'animaly' topic, we actually get a more pronounced distribution in the same topic allocation direction, and the last sentence is now split between both topics (roughly 40% and 60%)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Perplexity: -1.46\n", | |
"Topic 0: ['kitten', 'cute', 'chinchilla']\n", | |
"Topic 1: ['broccoli', 'banana', 'munch']\n", | |
"I like to eat broccoli and bananas. ['(0, 8.3%)', '(1, 91.7%)']\n", | |
"I munched a banana and spinach smoothie for breakfast. ['(0, 6.9%)', '(1, 93.1%)']\n", | |
"Chinchillas and kittens are cute. ['(0, 86.8%)', '(1, 13.2%)']\n", | |
"My sister adopted a kitten yesterday. ['(0, 89.3%)', '(1, 10.7%)']\n", | |
"Look at this cute hamster munching on a piece of broccoli. ['(0, 22.9%)', '(1, 77.1%)']\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"<gensim.models.ldamodel.LdaModel at 0x2b8c7c4aef0>" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"apriori_opposite = {\n", | |
" 'cute':0, 'kitten':0,\n", | |
" 'banana':1, 'broccoli':1, 'munch':1\n", | |
"}\n", | |
"eta = create_eta(apriori_opposite, dictionary, 2)\n", | |
"test_eta(eta, dictionary, 2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Success! We've pushed the model in the opposite direction. Terms that were previously assigned to topic 0 are now topic 1, and vice-versa. However, it looks like the model struggled with this a bit: the document probabilities are not as clear-cut, and the last sentence is now mostly assigned to the food topic. What if we push a little harder?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Perplexity: -1.17\n", | |
"Topic 0: ['kitten', 'cute', 'look']\n", | |
"Topic 1: ['munch', 'banana', 'broccoli']\n", | |
"I like to eat broccoli and bananas. ['(0, 4.1%)', '(1, 95.9%)']\n", | |
"I munched a banana and spinach smoothie for breakfast. ['(0, 3.3%)', '(1, 96.7%)']\n", | |
"Chinchillas and kittens are cute. ['(0, 94.6%)', '(1, 5.4%)']\n", | |
"My sister adopted a kitten yesterday. ['(0, 95.8%)', '(1, 4.2%)']\n", | |
"Look at this cute hamster munching on a piece of broccoli. ['(0, 50.0%)', '(1, 50.0%)']\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"<gensim.models.ldamodel.LdaModel at 0x2b8c7cefda0>" | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"apriori_harder = {\n", | |
" 'cute':0, 'kitten':0, 'hamster':0, 'chinchilla':0, 'look':0,\n", | |
" 'banana':1, 'broccoli':1, 'piece':1, 'breakfast':1, 'munch':1\n", | |
"}\n", | |
"eta = create_eta(apriori_harder, dictionary, 2)\n", | |
"test_eta(eta, dictionary, 2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
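"Seeding additional terms has clearly polarized the model: each of the four single-topic sentences is now assigned to its topic with roughly 95% probability, while the final sentence, which mixes seed words from both topics, splits 50/50." | |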
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We have seen that by providing a prior term-topic distribution (the `eta` matrix built above) to the model, we can guide LDA towards a useful topic model." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"However, I have one last question. Since the LDA training algorithm is iterative, does the order of the words in the dictionary affect the result? Let's find out. First, let's look at the dictionary we have right now." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"['banana', 'broccoli', 'eat', 'like', 'breakfast', 'munch', 'smoothie', 'spinach', 'chinchilla', 'cute', 'kitten', 'adopt', 'sister', 'yesterday', 'hamster', 'look', 'piece']\n" | |
] | |
} | |
], | |
"source": [ | |
"print([dictionary[w] for w in dictionary.keys()])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let's create a second dictionary that contains the same words in a different order (here, alphabetical)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"['adopt', 'banana', 'breakfast', 'broccoli', 'chinchilla', 'cute', 'eat', 'hamster', 'kitten', 'like', 'look', 'munch', 'piece', 'sister', 'smoothie', 'spinach', 'yesterday']\n" | |
] | |
} | |
], | |
"source": [ | |
"# Pass every word as one document; the printed ids come out in alphabetical order\n", | |
"dictionary2 = gensim.corpora.Dictionary(\n", | |
" [['banana', 'broccoli', 'eat', 'like', 'breakfast', 'munch', 'smoothie', 'spinach', 'chinchilla',\n", | |
" 'cute', 'kitten', 'adopt', 'sister', 'yesterday', 'hamster', 'look', 'piece']]\n", | |
")\n", | |
"print([dictionary2[w] for w in dictionary2.keys()])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Perplexity: -1.17\n", | |
"Topic 0: ['kitten', 'cute', 'look']\n", | |
"Topic 1: ['munch', 'banana', 'broccoli']\n", | |
"I like to eat broccoli and bananas. ['(0, 4.1%)', '(1, 95.9%)']\n", | |
"I munched a banana and spinach smoothie for breakfast. ['(0, 3.3%)', '(1, 96.7%)']\n", | |
"Chinchillas and kittens are cute. ['(0, 94.6%)', '(1, 5.4%)']\n", | |
"My sister adopted a kitten yesterday. ['(0, 95.8%)', '(1, 4.2%)']\n", | |
"Look at this cute hamster munching on a piece of broccoli. ['(0, 50.0%)', '(1, 50.0%)']\n" | |
] | |
}, | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Perplexity: -1.17\n", | |
"Topic 0: ['cute', 'kitten', 'chinchilla']\n", | |
"Topic 1: ['munch', 'banana', 'broccoli']\n", | |
"I like to eat broccoli and bananas. ['(0, 3.9%)', '(1, 96.1%)']\n", | |
"I munched a banana and spinach smoothie for breakfast. ['(0, 3.2%)', '(1, 96.8%)']\n", | |
"Chinchillas and kittens are cute. ['(0, 94.8%)', '(1, 5.2%)']\n", | |
"My sister adopted a kitten yesterday. ['(0, 96.0%)', '(1, 4.0%)']\n", | |
"Look at this cute hamster munching on a piece of broccoli. ['(0, 50.0%)', '(1, 50.0%)']\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"<gensim.models.ldamodel.LdaModel at 0x2b8c796e898>" | |
] | |
}, | |
"execution_count": 18, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
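"# Same seed words, original dictionary ordering\n", | |
"eta = create_eta(apriori_harder, dictionary, 2)\n", | |
"test_eta(eta, dictionary, 2)\n", | |
"\n", | |
"# Same seed words, reordered dictionary\n", | |
"eta = create_eta(apriori_harder, dictionary2, 2)\n", | |
"test_eta(eta, dictionary2, 2)" | |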
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
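"The two models are almost identical. The perplexity is the same (-1.17), the document topic probabilities differ by at most 0.2 percentage points, and the only visible change in the topic terms is in topic 0, where 'cute' and 'kitten' swap places and the third term changes from 'look' to 'chinchilla'." | |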
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.6" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
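The dictionary-order question explored in the last cells can also be looked at from the prior's side. The following is a minimal sketch in plain Python, not the notebook's `create_eta`: `build_eta`, `prior_by_word`, and the `base` and `boost` values are hypothetical names and numbers chosen for illustration. It suggests that reordering the dictionary only permutes the columns of the seeded prior, so each word keeps the same per-topic prior.

```python
# Hypothetical helper (an assumption, not the notebook's create_eta):
# build a num_topics x num_terms prior from seed words and a token -> id map.
def build_eta(seeds, token2id, num_topics, base=1.0, boost=1e7):
    eta = [[base] * len(token2id) for _ in range(num_topics)]
    for word, topic in seeds.items():
        if word in token2id:
            # Give the seeded (topic, word) cell a much larger prior weight.
            eta[topic][token2id[word]] = boost
    return eta


# Hypothetical helper: re-index the prior by word instead of by column id.
def prior_by_word(eta, token2id):
    return {w: tuple(row[i] for row in eta) for w, i in token2id.items()}


words = ['banana', 'broccoli', 'munch', 'cute', 'kitten']
seeds = {'cute': 0, 'kitten': 0, 'banana': 1, 'broccoli': 1, 'munch': 1}

# Two orderings of the same vocabulary: insertion order and alphabetical order.
order_a = {w: i for i, w in enumerate(words)}
order_b = {w: i for i, w in enumerate(sorted(words))}

eta_a = build_eta(seeds, order_a, 2)
eta_b = build_eta(seeds, order_b, 2)

# The raw matrices differ (columns are permuted), but every word carries
# the same per-topic prior under both orderings.
same_prior = prior_by_word(eta_a, order_a) == prior_by_word(eta_b, order_b)
print(eta_a != eta_b, same_prior)
```

Under these assumptions, any difference between the two trained models would come from the training procedure itself rather than from the prior, which fits the near-identical results reported above.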