Semantic similarity chatbot (with movie dialog). Gist mirror of Colab notebook here: https://colab.research.google.com/drive/1XlmtcyMdPRQC6bw2HQYb3UPtVGKqUJ0a Code examples released under CC0 https://creativecommons.org/choose/zero/, other text released under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"name": "semantic-similarity-chatbot.ipynb", | |
"version": "0.3.2", | |
"views": {}, | |
"default_view": {}, | |
"provenance": [], | |
"collapsed_sections": [] | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
} | |
}, | |
"cells": [ | |
{ | |
"metadata": { | |
"id": "8R0T0ei52FXS", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"# Semantic similarity chatbot (with movie dialog)\n", | |
"\n", | |
"By [Allison Parrish](http://www.decontextualize.com/)\n", | |
"\n", | |
"![bot screenshot](http://static.decontextualize.com/snaps/semantic-similarity-chatbot.png)\n", | |
"\n", | |
"I teach [programming, arts and design](https://itp.nyu.edu/) and a perennial project idea is to make a chatbot that mimics someone or somethingβa famous author, a historical figure, or even the student's own e-mails or messaging logs. This notebook and the software described herein is intended to give those students some sample code to work with and a bit of a head start on concepts and architecture. (In particular, this material was inspired by conversations I had with [Utsav Chadha](https://itp.nyu.edu/thesis2018/#/student/utsav-chadha) and [Nouf Aljowaysir](https://itp.nyu.edu/thesis2018/#/student/nouf-aljowaysir) during the Spring 2018 semester at ITP.)\n", | |
"\n", | |
"In the notebook, I'll show how the chatbot works and build an example chatbot using the [Cornell Movie Dialog Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). Even if you don't know anything about programming or natural language processing or machine learning or whatever, you can step through the cells in this notebook and play around with the chatbot itself at the very end.\n", | |
"\n", | |
"> **TLDR version**: To run the chatbot, just keep hitting shift+enter until you reach the end. (A bunch of stuff needs to download and build, so it'll take a few minutes. Sorry.) If you're using Google Colab, there will be a little chat widget right in the notebook. If you're in Jupyter Notebook, there will be a link you can click on to open the chat in a new browser window.\n", | |
"\n", | |
"> **Content warning**: The Cornell Movie Dialog Corpus has dialog from many movies, including some with potentially objectionable content. When playing around with this code, you might see text from the dialog of these films, including (in some cases) violent language and slurs directed at marginalized groups. If you make a chatbot with this code and this corpus and make it available to a wide audience, consider including a content warning similar to this one and/or filtering the corpus and output of the bot to exclude words and sentiments like this.\n", | |
"\n", | |
"## Making a chatbot the easy way\n", | |
"\n", | |
"There are [lots](https://www.rivescript.com/) [of](https://rasa.com/) [ways](https://botpress.io/) to author chatbots, but many of them are oriented toward particular use cases (i.e., automating customer service), and require extensive hand-authoring of content or hand-labelling of data. Others (i.e., those that use seq2seq) require you to train a neural network from scratch, which is fine if you're into that kind of thing, but can sometimes feel like a rotten way to spend your money and your afternoon (or weekend, or month, or whatever).\n", | |
"\n", | |
"The chatbot in this notebook won't pass a Turing test or push percentage points on any machine learning accuracy evaluations, but it's (a) easy to understand (b) works with any corpus (c) doesn't require training a new model and (d) uncannily faithful to whatever source material you give it while still being amusingly bizarre. From a technical perspective, you can think of it as a sort of low-rent version of [Talk to Books](https://books.google.com/talktobooks/), which (as I understand it) works along similar principles.\n", | |
"\n", | |
"So how does this chatbot work? To answer that question we have to think about how *conversations* work.\n", | |
"\n", | |
"### Defining the conversation\n", | |
"\n", | |
"For the purposes of this chatbot, let's make a very simple \"toy\" definition of conversation. We'll say that a conversation consists of *two people taking turns at making utterances.* We'll call any individual utterance a *turn*. When one participant finishes their turn, the next participant can take their own turn; we'll call this second turn a *response* to the first. The conversation continues this way, with each turn being a response to the previous turn, until it comes to an end (usually due to a mutual agreement reached by the participants, which in the case of our chatbot, means whenever the human gets sick of chatting and closes the browser tab).\n", | |
"\n", | |
"To illustrate, here's a simple conversation I just invented between two participants, A and B. The first column numbers the turns, the second column labels the participant, and the third column gives the text of the turn:\n", | |
"\n", | |
"| # | P | Text |\n", | |
"|-|-|:-|\n", | |
"| 1 | A | Hello. |\n", | |
"| 2 | B | Good to see you! |\n", | |
"| 3 | A | I'm reading a tutorial on semantic similarity and chatbots. It's quite interesting. |\n", | |
"| 4 | B | Thanks for letting me know. |\n", | |
"| 5 | A | Any time. Well, I gotta go. |\n", | |
"| 6 | B | Talk to you soon! |\n", | |
"| 7 | A | Goodbye. |\n", | |
"\n", | |
"This fascinating conversation has seven turns. Turn 2 is the response to turn 1, turn 3 is the response to turn 2, etc.\n", | |
"\n", | |
"> *Note:* I said this was a \"toy\" definition for a reasonβconversations are actually *way* more complicated than this. If you're interested in how conversations actually work, check out [conversation analysis](https://en.wikipedia.org/wiki/Conversation_analysis), a whole subfield of linguistics devoted to this kind of thing.\n", | |
"\n", | |
"### Taking a turn\n", | |
"\n", | |
"At a certain basic level, the job of a chatbot at any moment in a conversation is to produce a conversational turn that seems to plausibly be in response to the turn that preceded it. There are a number of different ways to solve this problem. Our strategy is going to be the following:\n", | |
"\n", | |
"1. Make a database of conversations and the turns that constitute them;\n", | |
"2. Assign a *vector* to each turn that corresponds to its meaning (more on this in a second);\n", | |
"3. When asked to respond to a conversational turn from the user, display the *response* to the turn in the database most similar in meaning to the user's turn.\n", | |
"\n", | |
"For example, take the conversation that I invented earlier. Imagine putting all of these turns into the database and assigning each turn a vector representing its meaning. Our chatbot now has a database of six possible responses (not counting the first turn, since it began the conversation and wasn't in response to any other turn). If the user typed in something like...\n", | |
"\n", | |
" > Howdy!\n", | |
" \n", | |
"... our chatbot would then search its database for the turn closest in meaning to `Howdy!` Maybe that turn is turn #1 (`Hello.`). The chatbot would then display the turn that happened *in response* to turn #1 (i.e., turn #2, `Good to see you!`). If the user typed in...\n", | |
"\n", | |
" > Thank you for the great conversation!\n", | |
" \n", | |
"... our chatbot would find the turn in its database closest in meaning, maybe turn #4 (`Thanks for letting me know.`), and then print out its associated response (turn #5, `Any time. Well, I gotta go.`). The final transcript of this imaginary (and admittedly a little contrived) conversation, with the human's turn labelled with `H` and the bot as `B`:\n", | |
"\n", | |
" H: Howdy!\n", | |
" B: Hello.\n", | |
" H: Thank you for the great conversation!\n", | |
" B: Any time. Well, I gotta go!\n", | |
" \n", | |
"Perfectly plausible!\n", | |
"\n", | |
"So you can think of this semantic similarity chatbot as a kind of search engine. When you type something into the chat, the chatbot *searches its database for the most appropriate response*.\n", | |
"\n", | |
"### Word vectors\n", | |
"\n", | |
"\"This is all well and good,\" you say. \"But how do you make a computer program that knows how similar in meaning two sentences are? How do you even *measure* similarity in meaning?\" Figuring out a way to measure similarity in meaning is one of the classic problems in computational linguistics, and it's still very much an open problem. But there are certain easy-to-use techniques that are \"good enough\" for our purposes. In particular, we're going to use *word vectors*.\n", | |
"\n", | |
"[I've written a more detailed introduction to word vectors here](https://github.com/aparrish/rwet/blob/master/understanding-word-vectors.ipynb), if you want the whole story. But the short version is this: using machine learning techniques and a lot of data, it's possible to assign each word a sequence of numbers (i.e., a vector) that encodes the word's meaning. (Actually, it's encoding the word's *distribution*, or all of the other words that the word is usually seen alongside. But it turns out that this is a good substitute for representing a word's meaning.) \n", | |
"\n", | |
"A word vector looks a lot like the Cartesian X, Y coordinates you likely studied in school, except that they usually have many hundreds of dimensions, not just two. (More dimensions means more information about the word's distribution.) For example, here's the vector for the word \"cheese\" using the fifty-dimensional pre-trained vectors from GloVe:\n", | |
"\n", | |
" -0.053903 -0.30871 -1.3285 -0.43342 0.31779 1.5224 -0.6965 -0.037086 -0.83784 0.074107 -0.30532 -0.1783 1.2337 0.085473 0.17362 -0.19001 0.36907 0.49454 -0.024311 -1.0535 0.5237 -1.1489 0.95093 1.1538 -0.52286 -0.14931 -0.97614 1.3912 0.79875 -0.72134 1.5411 -0.15928 -0.30472 1.7265 0.13124 -0.054023 -0.74212 1.675 1.9502 -0.53274 1.1359 0.20027 0.02245 -0.39379 1.0609 1.585 0.17889 0.43556 0.68161 0.066202\n", | |
"\n", | |
"Experts have made [large databases of word vectors available for people to download and use](https://nlp.stanford.edu/projects/glove/), so that you don't have to train them yourself. (Though [you can train them yourself if you want to](https://radimrehurek.com/gensim/models/word2vec.html).)\n", | |
"\n", | |
"### Sentence vectors\n", | |
"\n", | |
"Importantly, two words with similar meanings will also have similar vectors (meaning, more or less, that all of the numbers in the vectors are similar in value). So you can tell if two words are synonymous by checking the similarity between their vectors.\n", | |
"\n", | |
"But what about the meaning of *entire sentences*? This is a little bit more difficult, and there are a number of different and sophisticated solutions (including Google's [Universal Sentence Encoder](https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/2) and [doc2vec](https://radimrehurek.com/gensim/models/doc2vec.html)). It turns out, though, that you can get a pretty good vector for a sentence simply by *averaging together the vectors for the words in the sentence*. We'll call such vectors *sentence vectors* or *summary vectors*.\n", | |
"\n", | |
"Intuitively, this makes sense: finding the average is a time-tested method in statistics of characterizing a data set. It's apparently no different with word vectors. This method has the additional benefits of being fast and easy to explain.\n", | |
"\n", | |
"## Writing the code\n", | |
"\n", | |
"With your understanding of these concepts, we can actually start writing some code. For our semantic similarity chatbot, we need:\n", | |
"\n", | |
"* Pre-trained word vectors\n", | |
"* A corpus of conversations\n", | |
"* Some code to parse conversations into turns and map each turn to its response\n", | |
"* Some code that can average the word vectors in some text to produce a sentence vector\n", | |
"* A database that will allow us to store sentence vectors and look them up by similarity\n", | |
"* Some code to take an incoming conversational turn, turn it into a sentence vector, and then look up the most similar vector in the database\n", | |
"\n", | |
"Let's take these one-by-one." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "1gQ6PaStQ171", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"### Pre-trained word vectors\n", | |
"\n", | |
"We're going to use [spaCy](https://spacy.io), a wonderful Python library for natural language processing, both to tokenize text (i.e., turn text into a list of words) and for its database of word vectors. To install spaCy, run the cell below:" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "XYaOlN9zRAn4", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 1287 | |
}, | |
"outputId": "27570d71-a626-4e33-dd27-c0fb007dde04", | |
"executionInfo": { | |
"status": "ok", | |
"timestamp": 1531845768040, | |
"user_tz": 240, | |
"elapsed": 369653, | |
"user": { | |
"displayName": "Allison Parrish", | |
"photoUrl": "//lh3.googleusercontent.com/-oIUlh1dj3RI/AAAAAAAAAAI/AAAAAAAAAAA/oSjOZttxipI/s50-c-k-no/photo.jpg", | |
"userId": "106202567649533873126" | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"!pip install spacy" | |
], | |
"execution_count": 1, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Collecting spacy\n", | |
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/3c/31/e60f88751e48851b002f78a35221d12300783d5a43d4ef12fbf10cca96c3/spacy-2.0.11.tar.gz (17.6MB)\n", | |
"\u001b[K 100% |ββββββββββββββββββββββββββββββββ| 17.6MB 1.8MB/s \n", | |
"\u001b[?25hRequirement already satisfied: numpy>=1.7 in /usr/local/lib/python3.6/dist-packages (from spacy) (1.14.5)\n", | |
"Collecting murmurhash<0.29,>=0.28 (from spacy)\n", | |
" Downloading https://files.pythonhosted.org/packages/5e/31/c8c1ecafa44db30579c8c457ac7a0f819e8b1dbc3e58308394fff5ff9ba7/murmurhash-0.28.0.tar.gz\n", | |
"Collecting cymem<1.32,>=1.30 (from spacy)\n", | |
" Downloading https://files.pythonhosted.org/packages/f8/9e/273fbea507de99166c11cd0cb3fde1ac01b5bc724d9a407a2f927ede91a1/cymem-1.31.2.tar.gz\n", | |
"Collecting preshed<2.0.0,>=1.0.0 (from spacy)\n", | |
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/1b/ac/7c17b1fd54b60972785b646d37da2826311cca70842c011c4ff84fbe95e0/preshed-1.0.0.tar.gz (89kB)\n", | |
"\u001b[K 100% |ββββββββββββββββββββββββββββββββ| 92kB 19.7MB/s \n", | |
"\u001b[?25hCollecting thinc<6.11.0,>=6.10.1 (from spacy)\n", | |
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/55/fd/e9f36081e6f53699943381858848f3b4d759e0dd03c43b98807dde34c252/thinc-6.10.2.tar.gz (1.2MB)\n", | |
"\u001b[K 100% |ββββββββββββββββββββββββββββββββ| 1.2MB 13.1MB/s \n", | |
"\u001b[?25hCollecting plac<1.0.0,>=0.9.6 (from spacy)\n", | |
" Downloading https://files.pythonhosted.org/packages/9e/9b/62c60d2f5bc135d2aa1d8c8a86aaf84edb719a59c7f11a4316259e61a298/plac-0.9.6-py2.py3-none-any.whl\n", | |
"Collecting pathlib (from spacy)\n", | |
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/ac/aa/9b065a76b9af472437a0059f77e8f962fe350438b927cb80184c32f075eb/pathlib-1.0.1.tar.gz (49kB)\n", | |
"\u001b[K 100% |ββββββββββββββββββββββββββββββββ| 51kB 14.1MB/s \n", | |
"\u001b[?25hCollecting ujson>=1.35 (from spacy)\n", | |
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/16/c4/79f3409bc710559015464e5f49b9879430d8f87498ecdc335899732e5377/ujson-1.35.tar.gz (192kB)\n", | |
"\u001b[K 100% |ββββββββββββββββββββββββββββββββ| 194kB 13.1MB/s \n", | |
"\u001b[?25hCollecting dill<0.3,>=0.2 (from spacy)\n", | |
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/6f/78/8b96476f4ae426db71c6e86a8e6a81407f015b34547e442291cd397b18f3/dill-0.2.8.2.tar.gz (150kB)\n", | |
"\u001b[K 100% |ββββββββββββββββββββββββββββββββ| 153kB 23.2MB/s \n", | |
"\u001b[?25hCollecting regex==2017.4.5 (from spacy)\n", | |
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/36/62/c0c0d762ffd4ffaf39f372eb8561b8d491a11ace5a7884610424a8b40f95/regex-2017.04.05.tar.gz (601kB)\n", | |
"\u001b[K 100% |ββββββββββββββββββββββββββββββββ| 604kB 18.5MB/s \n", | |
"\u001b[?25hCollecting wrapt (from thinc<6.11.0,>=6.10.1->spacy)\n", | |
" Downloading https://files.pythonhosted.org/packages/a0/47/66897906448185fcb77fc3c2b1bc20ed0ecca81a0f2f88eda3fc5a34fc3d/wrapt-1.10.11.tar.gz\n", | |
"Collecting tqdm<5.0.0,>=4.10.0 (from thinc<6.11.0,>=6.10.1->spacy)\n", | |
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/93/24/6ab1df969db228aed36a648a8959d1027099ce45fad67532b9673d533318/tqdm-4.23.4-py2.py3-none-any.whl (42kB)\n", | |
"\u001b[K 100% |ββββββββββββββββββββββββββββββββ| 51kB 16.8MB/s \n", | |
"\u001b[?25hCollecting cytoolz<0.9,>=0.8 (from thinc<6.11.0,>=6.10.1->spacy)\n", | |
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/0f/e6/ccc124714dcc1bd511e64ddafb4d5d20ada2533b92e3173a4cf09e0d0831/cytoolz-0.8.2.tar.gz (386kB)\n", | |
"\u001b[K 100% |ββββββββββββββββββββββββββββββββ| 389kB 21.6MB/s \n", | |
"\u001b[?25hRequirement already satisfied: six<2.0.0,>=1.10.0 in /usr/local/lib/python3.6/dist-packages (from thinc<6.11.0,>=6.10.1->spacy) (1.11.0)\n", | |
"Requirement already satisfied: termcolor in /usr/local/lib/python3.6/dist-packages (from thinc<6.11.0,>=6.10.1->spacy) (1.1.0)\n", | |
"Collecting msgpack-python (from thinc<6.11.0,>=6.10.1->spacy)\n", | |
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/8a/20/6eca772d1a5830336f84aca1d8198e5a3f4715cd1c7fc36d3cc7f7185091/msgpack-python-0.5.6.tar.gz (138kB)\n", | |
"\u001b[K 100% |ββββββββββββββββββββββββββββββββ| 143kB 24.7MB/s \n", | |
"\u001b[?25hCollecting msgpack-numpy==0.4.1 (from thinc<6.11.0,>=6.10.1->spacy)\n", | |
" Downloading https://files.pythonhosted.org/packages/2e/43/393e30e2768b0357541ac95891f96b80ccc4d517e0dd2fa3042fc8926538/msgpack_numpy-0.4.1-py2.py3-none-any.whl\n", | |
"Requirement already satisfied: toolz>=0.8.0 in /usr/local/lib/python3.6/dist-packages (from cytoolz<0.9,>=0.8->thinc<6.11.0,>=6.10.1->spacy) (0.9.0)\n", | |
"Building wheels for collected packages: spacy, murmurhash, cymem, preshed, thinc, pathlib, ujson, dill, regex, wrapt, cytoolz, msgpack-python\n", | |
" Running setup.py bdist_wheel for spacy ... \u001b[?25l-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/" | |
], | |
"name": "stdout" | |
}, | |
{ | |
"output_type": "stream", | |
"text": [ | |
"\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \bdone\n", | |
"\u001b[?25h Stored in directory: /content/.cache/pip/wheels/fb/00/28/75c85d5135e7d9a100639137d1847d41e914ed16c962d467e4\n", | |
" Running setup.py bdist_wheel for murmurhash ... \u001b[?25l-\b \b\\\b \b|\b \b/\b \bdone\n", | |
"\u001b[?25h Stored in directory: /content/.cache/pip/wheels/b8/94/a4/f69f8664cdc1098603df44771b7fec5fd1b3d8364cdd83f512\n", | |
" Running setup.py bdist_wheel for cymem ... \u001b[?25l-\b \b\\\b \b|\b \bdone\n", | |
"\u001b[?25h Stored in directory: /content/.cache/pip/wheels/55/8d/4a/f6328252aa2aaec0b1cb906fd96a1566d77f0f67701071ad13\n", | |
" Running setup.py bdist_wheel for preshed ... \u001b[?25l-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \bdone\n", | |
"\u001b[?25h Stored in directory: /content/.cache/pip/wheels/8f/85/06/2d132fb649a6bbcab22487e4147880a55b0dd0f4b18fdfd6b5\n", | |
" Running setup.py bdist_wheel for thinc ... \u001b[?25l-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \bdone\n", | |
"\u001b[?25h Stored in directory: /content/.cache/pip/wheels/d8/5c/3e/9acf5d9974fb1c9e7b467563ea5429c9325f67306e93147961\n", | |
" Running setup.py bdist_wheel for pathlib ... \u001b[?25l-\b \bdone\n", | |
"\u001b[?25h Stored in directory: /content/.cache/pip/wheels/f9/b2/4a/68efdfe5093638a9918bd1bb734af625526e849487200aa171\n", | |
" Running setup.py bdist_wheel for ujson ... \u001b[?25l-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \bdone\n", | |
"\u001b[?25h Stored in directory: /content/.cache/pip/wheels/28/77/e4/0311145b9c2e2f01470e744855131f9e34d6919687550f87d1\n", | |
" Running setup.py bdist_wheel for dill ... \u001b[?25l-\b \b\\\b \bdone\n", | |
"\u001b[?25h Stored in directory: /content/.cache/pip/wheels/e2/5d/17/f87cb7751896ac629b435a8696f83ee75b11029f5d6f6bda72\n", | |
" Running setup.py bdist_wheel for regex ... \u001b[?25l-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \bdone\n", | |
"\u001b[?25h Stored in directory: /content/.cache/pip/wheels/75/07/38/3c16b529d50cb4e0cd3dbc7b75cece8a09c132692c74450b01\n", | |
" Running setup.py bdist_wheel for wrapt ... \u001b[?25l-\b \b\\" | |
], | |
"name": "stdout" | |
}, | |
{ | |
"output_type": "stream", | |
"text": [ | |
"\b \bdone\r\n", | |
"\u001b[?25h Stored in directory: /content/.cache/pip/wheels/48/5d/04/22361a593e70d23b1f7746d932802efe1f0e523376a74f321e\n", | |
" Running setup.py bdist_wheel for cytoolz ... \u001b[?25l-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \bdone\n", | |
"\u001b[?25h Stored in directory: /content/.cache/pip/wheels/f8/b1/86/c92e4d36b690208fff8471711b85eaa6bc6d19860a86199a09\n", | |
" Running setup.py bdist_wheel for msgpack-python ... \u001b[?25l-\b \b\\\b \b|\b \b/\b \bdone\n", | |
"\u001b[?25h Stored in directory: /content/.cache/pip/wheels/d5/de/86/7fa56fda12511be47ea0808f3502bc879df4e63ab168ec0406\n", | |
"Successfully built spacy murmurhash cymem preshed thinc pathlib ujson dill regex wrapt cytoolz msgpack-python\n", | |
"Installing collected packages: murmurhash, cymem, preshed, wrapt, tqdm, cytoolz, plac, dill, pathlib, msgpack-python, msgpack-numpy, thinc, ujson, regex, spacy\n", | |
"Successfully installed cymem-1.31.2 cytoolz-0.8.2 dill-0.2.8.2 msgpack-numpy-0.4.1 msgpack-python-0.5.6 murmurhash-0.28.0 pathlib-1.0.1 plac-0.9.6 preshed-1.0.0 regex-2017.4.5 spacy-2.0.11 thinc-6.10.2 tqdm-4.23.4 ujson-1.35 wrapt-1.10.11\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "5q_HHp3xRNtA", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"It turns out that spaCy requires a \"model\" file, which is a bundle of statistical information that allows the library to parse text into words and parts of speech. While spaCy comes with a model when you install it, that model does *not* include word vectors, so you'll need to download a model that does include them. For English, I recommend `en_core_web_lg`, which you can download and install by running the cell below. (The model file is fairly large and might take a while to download.)" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "sDFEXEfhRMCB", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 243 | |
}, | |
"outputId": "770dd844-48b9-484c-a55f-5066c4f41056", | |
"executionInfo": { | |
"status": "ok", | |
"timestamp": 1531845894258, | |
"user_tz": 240, | |
"elapsed": 126096, | |
"user": { | |
"displayName": "Allison Parrish", | |
"photoUrl": "//lh3.googleusercontent.com/-oIUlh1dj3RI/AAAAAAAAAAI/AAAAAAAAAAA/oSjOZttxipI/s50-c-k-no/photo.jpg", | |
"userId": "106202567649533873126" | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"!python -m spacy download en_core_web_lg" | |
], | |
"execution_count": 2, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz\n", | |
"\u001b[?25l Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz (852.3MB)\n", | |
"\u001b[K 54% |ββββββββββββββββββ | 462.2MB 62.3MB/s eta 0:00:07" | |
], | |
"name": "stdout" | |
}, | |
{ | |
"output_type": "stream", | |
"text": [ | |
"\u001b[K 100% |ββββββββββββββββββββββββββββββββ| 852.3MB 71.2MB/s \n", | |
"\u001b[?25hInstalling collected packages: en-core-web-lg\n", | |
" Running setup.py install for en-core-web-lg ... \u001b[?25l-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \bdone\n", | |
"\u001b[?25hSuccessfully installed en-core-web-lg-2.0.0\n", | |
"\n", | |
"\u001b[93m Linking successful\u001b[0m\n", | |
" /usr/local/lib/python3.6/dist-packages/en_core_web_lg -->\n", | |
" /usr/local/lib/python3.6/dist-packages/spacy/data/en_core_web_lg\n", | |
"\n", | |
" You can now load the model via spacy.load('en_core_web_lg')\n", | |
"\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "3I9yIqLWSSA9", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"The code in the following cell loads `spacy` and the model you just downloaded:" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "1RY5ytQYSKZf", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"import spacy\n", | |
"nlp = spacy.load('en_core_web_lg')" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"id": "YMYW6B-KSXK3", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"You can look up the word vector for a particular word using spaCy right out the box like so:" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "3J9-D9EESjbE", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 1076 | |
}, | |
"outputId": "d6d272ef-828c-479d-9d3f-e204b0345431", | |
"executionInfo": { | |
"status": "ok", | |
"timestamp": 1531846614563, | |
"user_tz": 240, | |
"elapsed": 582, | |
"user": { | |
"displayName": "Allison Parrish", | |
"photoUrl": "//lh3.googleusercontent.com/-oIUlh1dj3RI/AAAAAAAAAAI/AAAAAAAAAAA/oSjOZttxipI/s50-c-k-no/photo.jpg", | |
"userId": "106202567649533873126" | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"nlp.vocab['cheese'].vector # replace cheese with whatever word you want!" | |
], | |
"execution_count": 8, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"array([-5.5252e-01, 1.8894e-01, 6.8737e-01, -1.9789e-01, 7.0575e-02,\n", | |
" 1.0075e+00, 5.1789e-02, -1.5603e-01, 3.1941e-01, 1.1702e+00,\n", | |
" -4.7248e-01, 4.2867e-01, -4.2025e-01, 2.4803e-01, 6.8194e-01,\n", | |
" -6.7488e-01, 9.2401e-02, 1.3089e+00, -3.6278e-02, 2.0098e-01,\n", | |
" 7.6005e-01, -6.6718e-02, -7.7794e-02, 2.3844e-01, -2.4351e-01,\n", | |
" -5.4164e-01, -3.3540e-01, 2.9805e-01, 3.5269e-01, -8.0594e-01,\n", | |
" -4.3611e-01, 6.1535e-01, 3.4212e-01, -3.3603e-01, 3.3282e-01,\n", | |
" 3.8065e-01, 5.7427e-02, 9.9918e-02, 1.2525e-01, 1.1039e+00,\n", | |
" 3.6678e-02, 3.0490e-01, -1.4942e-01, 3.2912e-01, 2.3300e-01,\n", | |
" 4.3395e-01, 1.5666e-01, 2.2778e-01, -2.5830e-02, 2.4334e-01,\n", | |
" -5.8136e-02, -1.3486e-01, 2.4521e-01, -3.3459e-01, 4.2839e-01,\n", | |
" -4.8181e-01, 1.3403e-01, 2.6049e-01, 8.9933e-02, -9.3770e-02,\n", | |
" 3.7672e-01, -2.9558e-02, 4.3841e-01, 6.1212e-01, -2.5720e-01,\n", | |
" -7.8506e-01, 2.3880e-01, 1.3399e-01, -7.9315e-02, 7.0582e-01,\n", | |
" 3.9968e-01, 6.7779e-01, -2.0474e-03, 1.9785e-02, -4.2059e-01,\n", | |
" -5.3858e-01, -5.2155e-02, 1.7252e-01, 2.7547e-01, -4.4482e-01,\n", | |
" 2.3595e-01, -2.3445e-01, 3.0103e-01, -5.5096e-01, -3.1159e-02,\n", | |
" -3.4433e-01, 1.2386e+00, 1.0317e+00, -2.2728e-01, -9.5207e-03,\n", | |
" -2.5432e-01, -2.9792e-01, 2.5934e-01, -1.0421e-01, -3.3876e-01,\n", | |
" 4.2470e-01, 5.8335e-04, 1.3093e-01, 2.8786e-01, 2.3474e-01,\n", | |
" 2.5905e-02, -6.4359e-01, 6.1330e-02, 6.3842e-01, 1.4705e-01,\n", | |
" -6.1594e-01, 2.5097e-01, -4.4872e-01, 8.6825e-01, 9.9555e-02,\n", | |
" -4.4734e-02, -7.4239e-01, -5.9147e-01, -5.4929e-01, 3.8108e-01,\n", | |
" 5.5177e-02, -1.0487e-01, -1.2838e-01, 6.0521e-03, 2.8743e-01,\n", | |
" 2.1592e-01, 7.2871e-02, -3.1644e-01, -4.3321e-01, 1.8682e-01,\n", | |
" 6.7274e-02, 2.8115e-01, -4.6222e-02, -9.6803e-02, 5.6091e-01,\n", | |
" -6.7762e-01, -1.6645e-01, 1.5553e-01, 5.2301e-01, -3.0058e-01,\n", | |
" -3.7291e-01, 8.7895e-02, -1.7963e-01, -4.4193e-01, -4.4607e-01,\n", | |
" -2.4122e+00, 3.3738e-01, 6.2416e-01, 4.2787e-01, -2.5386e-01,\n", | |
" -6.1683e-01, -7.0097e-01, 4.9303e-01, 3.6916e-01, -9.7499e-02,\n", | |
" 6.1411e-01, -4.7572e-03, 4.3916e-01, -2.1551e-01, -5.6745e-01,\n", | |
" -4.0278e-01, 2.9459e-01, -3.0850e-01, 1.0103e-01, 7.9741e-02,\n", | |
" -6.3811e-01, 2.4781e-01, -4.4546e-01, 1.0828e-01, -2.3624e-01,\n", | |
" -5.0838e-01, -1.7001e-01, -7.8735e-01, 3.4073e-01, -3.1830e-01,\n", | |
" 4.5286e-01, -9.5118e-02, 2.0772e-01, -8.0183e-02, -3.7982e-01,\n", | |
" -4.9949e-01, 4.0759e-02, -3.7724e-01, -8.9705e-02, -6.8187e-01,\n", | |
" 2.2106e-01, -3.9931e-01, 3.2329e-01, -3.6180e-01, -7.2093e-01,\n", | |
" -6.3404e-01, 4.3125e-01, -4.9743e-01, -1.7395e-01, -3.8779e-01,\n", | |
" -3.2556e-01, 1.4423e-01, -8.3401e-02, -2.2994e-01, 2.7793e-01,\n", | |
" 4.9112e-01, 6.4511e-01, -7.8945e-02, 1.1171e-01, 3.7264e-01,\n", | |
" 1.3070e-01, -6.1607e-02, -4.3501e-01, 2.8999e-02, 5.6224e-01,\n", | |
" 5.8012e-02, 4.7078e-02, 4.2770e-01, 7.3245e-01, -2.1150e-02,\n", | |
" 1.1988e-01, 7.8823e-02, -1.9106e-01, 3.5278e-02, -3.1102e-01,\n", | |
" 1.3209e-01, -2.8606e-01, -1.5649e-01, -6.4339e-01, 4.4599e-01,\n", | |
" -3.0912e-01, 4.4520e-01, -3.6774e-01, 2.7327e-01, 6.7833e-01,\n", | |
" -8.3830e-02, -4.5120e-01, 1.0754e-01, -4.5908e-01, 1.5095e-01,\n", | |
" -4.5856e-01, 3.4465e-01, 7.8013e-02, -2.8319e-01, -2.8149e-02,\n", | |
" 2.4404e-01, -7.1345e-01, 5.2834e-02, -2.8085e-01, 2.5344e-02,\n", | |
" 4.2979e-02, 1.5663e-01, -7.4647e-01, -1.1301e+00, 4.4135e-01,\n", | |
" 3.1444e-01, -1.0018e-01, -5.3526e-01, -9.0601e-01, -6.4954e-01,\n", | |
" 4.2664e-02, -7.9927e-02, 3.2905e-01, -3.0797e-01, -1.9190e-02,\n", | |
" 4.2765e-01, 3.1460e-01, 2.9051e-01, -2.7386e-01, 6.8483e-01,\n", | |
" 1.9395e-02, -3.2884e-01, -4.8239e-01, -1.5747e-01, -1.6036e-01,\n", | |
" 4.9164e-01, -7.0352e-01, -3.5591e-01, -7.4887e-01, -5.2827e-01,\n", | |
" 4.4983e-02, 5.9247e-02, 4.6224e-01, 8.9697e-02, -7.5618e-01,\n", | |
" 6.3682e-01, 9.0680e-02, 6.8830e-02, 1.8296e-01, 1.0754e-01,\n", | |
" 6.7811e-01, -1.4716e-01, 1.7029e-01, -5.2630e-01, 1.9268e-01,\n", | |
" 9.3130e-01, 8.0363e-01, 6.1324e-01, -3.0494e-01, 2.0236e-01,\n", | |
" 5.8520e-01, 2.6484e-01, -4.5863e-01, 2.1035e-03, -5.6990e-01,\n", | |
" -4.9092e-01, 4.2511e-01, -1.0954e+00, 1.7124e-01, 2.2495e-01],\n", | |
" dtype=float32)" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 8 | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "Z06aIT2uGxe5", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"It might not look much, but that list of three hundred numbers is spaCy's idea of what \"cheese\" means." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "Q65ocsUUSsJV", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"### Parsing a corpus of conversations\n", | |
"\n", | |
"So now we need some data for the bot. In particular, we need some conversations: the text of the turns along with information about which turn is in response to which. Fortunately, some researchers at Cornell University have made available a very interesting corpus of conversations: [The Cornell Movie Dialog Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html), containing \"220,579 conversational exchanges between 10,292 pairs of movie characters.\" Very cool. The data is stored in several plain text files, which you can download by running the following cells:" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "U4DzyPM9AFjK", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 69 | |
}, | |
"outputId": "ca272586-d010-42a4-ec32-194adf4f00db", | |
"executionInfo": { | |
"status": "ok", | |
"timestamp": 1531845918579, | |
"user_tz": 240, | |
"elapsed": 8701, | |
"user": { | |
"displayName": "Allison Parrish", | |
"photoUrl": "//lh3.googleusercontent.com/-oIUlh1dj3RI/AAAAAAAAAAI/AAAAAAAAAAA/oSjOZttxipI/s50-c-k-no/photo.jpg", | |
"userId": "106202567649533873126" | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"!curl -L -O http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip" | |
], | |
"execution_count": 4, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
" % Total % Received % Xferd Average Speed Time Time Time Current\r\n", | |
" Dload Upload Total Spent Left Speed\n", | |
"100 9684k 100 9684k 0 0 1614k 0 0:00:06 0:00:06 --:--:-- 2687k\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "YnAPBxSnAV89", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 278 | |
}, | |
"outputId": "da8360d8-7610-4166-b52a-c40ad48e8db2", | |
"executionInfo": { | |
"status": "ok", | |
"timestamp": 1531845920459, | |
"user_tz": 240, | |
"elapsed": 1741, | |
"user": { | |
"displayName": "Allison Parrish", | |
"photoUrl": "//lh3.googleusercontent.com/-oIUlh1dj3RI/AAAAAAAAAAI/AAAAAAAAAAA/oSjOZttxipI/s50-c-k-no/photo.jpg", | |
"userId": "106202567649533873126" | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"!unzip cornell_movie_dialogs_corpus.zip" | |
], | |
"execution_count": 5, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Archive: cornell_movie_dialogs_corpus.zip\r\n", | |
" creating: cornell movie-dialogs corpus/\r\n", | |
" inflating: cornell movie-dialogs corpus/.DS_Store \r\n", | |
" creating: __MACOSX/\r\n", | |
" creating: __MACOSX/cornell movie-dialogs corpus/\r\n", | |
" inflating: __MACOSX/cornell movie-dialogs corpus/._.DS_Store \r\n", | |
" inflating: cornell movie-dialogs corpus/chameleons.pdf \r\n", | |
" inflating: __MACOSX/cornell movie-dialogs corpus/._chameleons.pdf \r\n", | |
" inflating: cornell movie-dialogs corpus/movie_characters_metadata.txt \n", | |
" inflating: cornell movie-dialogs corpus/movie_conversations.txt \n", | |
" inflating: cornell movie-dialogs corpus/movie_lines.txt \n", | |
" inflating: cornell movie-dialogs corpus/movie_titles_metadata.txt \n", | |
" inflating: cornell movie-dialogs corpus/raw_script_urls.txt \n", | |
" inflating: cornell movie-dialogs corpus/README.txt \n", | |
" inflating: __MACOSX/cornell movie-dialogs corpus/._README.txt \n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "1dAMYW22GNAJ", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"We'll be working with two files from this corpus. One file (`movie_lines.txt`) has the movie lines themselves, associated with a short unique identifier; another file (`movie_conversations.txt`) has lists of which lines occurred together in conversations, in the order in which they occurred. The following two cells parse these two files and create lookup dictionaries that associate unique IDs to lines (`movie_lines`) and each line to the line that follows it (`responses`)." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "MHr4xBS_AfBT", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"movie_lines = {}\n", | |
"for line in open(\"./cornell movie-dialogs corpus/movie_lines.txt\",\n", | |
" encoding=\"latin1\"):\n", | |
" line = line.strip()\n", | |
" parts = line.split(\" +++$+++ \")\n", | |
" if len(parts) == 5:\n", | |
" movie_lines[parts[0]] = parts[4]\n", | |
" else:\n", | |
" movie_lines[parts[0]] = \"\"" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"id": "arbpjBT5Aj7K", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"import json\n", | |
"responses = {}\n", | |
"for line in open(\"./cornell movie-dialogs corpus/movie_conversations.txt\",\n", | |
" encoding=\"latin1\"):\n", | |
" line = line.strip()\n", | |
" parts = line.split(\" +++$+++ \")\n", | |
" line_ids = json.loads(parts[3].replace(\"'\", '\"'))\n", | |
" for first, second in zip(line_ids[:-1], line_ids[1:]):\n", | |
" responses[first] = second" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"id": "Kdjf5_NTG6_O", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"Just to make sure everything works, the cell below prints out five random pairs of conversational turns from the corpus:" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "eWJd0N2oAsAu", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 278 | |
}, | |
"outputId": "62a8f2f6-4500-49da-d8a4-4faa87555e62", | |
"executionInfo": { | |
"status": "ok", | |
"timestamp": 1531846625857, | |
"user_tz": 240, | |
"elapsed": 711, | |
"user": { | |
"displayName": "Allison Parrish", | |
"photoUrl": "//lh3.googleusercontent.com/-oIUlh1dj3RI/AAAAAAAAAAI/AAAAAAAAAAA/oSjOZttxipI/s50-c-k-no/photo.jpg", | |
"userId": "106202567649533873126" | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"import random\n", | |
"for pair in random.sample(responses.items(), 5):\n", | |
" print(\"A:\", movie_lines[pair[0]])\n", | |
" print(\"B:\", movie_lines[pair[1]])\n", | |
" print()" | |
], | |
"execution_count": 9, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"A: She could be out. She could be sick in bed for all we know.\n", | |
"B: Okay. Okay. I'll bet there's...Look at this.\n", | |
"\n", | |
"A: I... You're not going to understand this.\n", | |
"B: Don't treat me like I'm stupid. It pisses me off.\n", | |
"\n", | |
"A: Oh, Miles. You're drunk.\n", | |
"B: Just some local Pinot, you know, then a little Burgundy. That old Cotes de Beaune!\n", | |
"\n", | |
"A: You talked to Stifler?\n", | |
"B: Well...I needed to find you. We are gonna have to practice that song.\n", | |
"\n", | |
"A: Take the elevator to the very bottom, go left, down the crewman's passage, then make a right.\n", | |
"B: Bottom, left, right. I have it.\n", | |
"\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "Vl38pCn4HlL_", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"### Making a sentence vector\n", | |
"\n", | |
"To make the sentence vector for each line of dialog, we're going to use spaCy. The function `sentence_mean` below takes the spaCy object that we loaded earlier (`nlp`) and uses it to tokenize the string that you pass into the function (i.e., break it up into words). It then uses numpy's `mean()` function to find the average of the vectors, producing a new vector. The shape of the resulting vector (i.e., the number of dimensions) should be the same as the shape of the individual word vectors.\n", | |
"\n", | |
"(Note: I disabled the `tagger` and `parser` parts of spaCy's pipeline to improve performance. We're not using part of speech tags or dependency relations in this chatbot, so there's no reason to spend time calculating them.)" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "JizJee4YBAdJ", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"outputId": "dcca908d-9af3-457f-d574-e18330f08808", | |
"executionInfo": { | |
"status": "ok", | |
"timestamp": 1531846645422, | |
"user_tz": 240, | |
"elapsed": 480, | |
"user": { | |
"displayName": "Allison Parrish", | |
"photoUrl": "//lh3.googleusercontent.com/-oIUlh1dj3RI/AAAAAAAAAAI/AAAAAAAAAAA/oSjOZttxipI/s50-c-k-no/photo.jpg", | |
"userId": "106202567649533873126" | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"import numpy as np\n", | |
"def sentence_mean(nlp, s):\n", | |
" if s == \"\":\n", | |
" s = \" \"\n", | |
" doc = nlp(s, disable=['tagger', 'parser'])\n", | |
" return np.mean(np.array([w.vector for w in doc]), axis=0)\n", | |
"sentence_mean(nlp, \"This... is a test.\").shape" | |
], | |
"execution_count": 10, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"(300,)" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 10 | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "sN9qQhQvJG-L", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"### Similarity lookups\n", | |
"\n", | |
"Now that we have conversational turns and a way to vectorize those turns, we can make our database for semantic similarity lookup! The kind of \"database\" we'll need to use for this is an [approximate nearest neighbors](https://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximation_methods) lookup, which allows you to store items along with the vector that represents them, and then do fast searches to find items with similar vectors (even items that weren't in the original dataset).\n", | |
"\n", | |
"[I made a Python library to make it easy to build databases like this](https://pypi.org/project/simpleneighbors/) called Simple Neighbors. It's a lightweight wrapper around the industrial-strength approximate nearest neighbors lookup library called [Annoy](https://pypi.python.org/pypi/annoy). To install Simple Neighbors, run the cell below:" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "ppulERZ5Tz5a", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 228 | |
}, | |
"outputId": "a1505337-f89c-44ce-9751-3244e6bb0299", | |
"executionInfo": { | |
"status": "ok", | |
"timestamp": 1531846664230, | |
"user_tz": 240, | |
"elapsed": 10274, | |
"user": { | |
"displayName": "Allison Parrish", | |
"photoUrl": "//lh3.googleusercontent.com/-oIUlh1dj3RI/AAAAAAAAAAI/AAAAAAAAAAA/oSjOZttxipI/s50-c-k-no/photo.jpg", | |
"userId": "106202567649533873126" | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"!pip install simpleneighbors" | |
], | |
"execution_count": 11, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Collecting simpleneighbors\n", | |
" Downloading https://files.pythonhosted.org/packages/a2/8e/b8ca38e4305bdf5c4cac5d9bf4b65022a2d3641a978b28ce92f9e4063c7b/simpleneighbors-0.0.1-py2.py3-none-any.whl\n", | |
"Collecting annoy (from simpleneighbors)\n", | |
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/f1/9a/3db2737d76a66201873dd0a4301df4774ed16127139efa3db313cdbca04b/annoy-1.12.0.tar.gz (632kB)\n", | |
"\u001b[K 100% |ββββββββββββββββββββββββββββββββ| 634kB 5.8MB/s \n", | |
"\u001b[?25hBuilding wheels for collected packages: annoy\n", | |
" Running setup.py bdist_wheel for annoy ... \u001b[?25l-\b \b\\\b \b|\b \b/\b \b-\b \bdone\n", | |
"\u001b[?25h Stored in directory: /content/.cache/pip/wheels/02/2c/74/05c1a37da305f1a8cc94d846dbc2b0b01dd43afe00d1f9c191\n", | |
"Successfully built annoy\n", | |
"Installing collected packages: annoy, simpleneighbors\n", | |
"Successfully installed annoy-1.12.0 simpleneighbors-0.0.1\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "-_sFq49LKISH", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"The cell below makes a new Simple Neighbors object called `nns` and initializes it with 300 dimensions (the shape of the word vectors in spaCy, and also the shape of our summary vectors). It then samples ten thousand random conversational turns from the Cornell corpus, finds sentence vectors for each of them, and adds them to the database. (The `np.any()` line just checks to make sure that we don't add any vectors that are all zeroes by accidentβthis can mess up the nearest-neighbor search.)\n", | |
"\n", | |
"Notes on the code below:\n", | |
"\n", | |
"* I decided to just sample ten thousand turns so that the index will build faster. You can change this number to your liking!\n", | |
"* It only adds *turns that have responses* to the database (i.e., keys in the `responses` lookup). Because of the way the bot works, we don't need to keep track of the last turn of a conversation, since it (by definition) will have no replies." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "X8jODdF-BHxR", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 211 | |
}, | |
"outputId": "12d40d9f-a888-4aa8-a4ef-256e1f3b3fd1", | |
"executionInfo": { | |
"status": "ok", | |
"timestamp": 1531846727885, | |
"user_tz": 240, | |
"elapsed": 60123, | |
"user": { | |
"displayName": "Allison Parrish", | |
"photoUrl": "//lh3.googleusercontent.com/-oIUlh1dj3RI/AAAAAAAAAAI/AAAAAAAAAAA/oSjOZttxipI/s50-c-k-no/photo.jpg", | |
"userId": "106202567649533873126" | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"from simpleneighbors import SimpleNeighbors\n", | |
"\n", | |
"nns = SimpleNeighbors(300)\n", | |
"for i, line_id in enumerate(random.sample(list(responses.keys()), 10000)):\n", | |
" # show progress\n", | |
" if i % 1000 == 0: print(i, line_id, movie_lines[line_id])\n", | |
" line_text = movie_lines[line_id]\n", | |
" summary_vector = sentence_mean(nlp, line_text)\n", | |
" if np.any(summary_vector):\n", | |
" nns.add_one(line_id, summary_vector)\n", | |
"nns.build()" | |
], | |
"execution_count": 12, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"0 L147263 You're not helping.\n", | |
"1000 L14186 An angel, when she was having one of her headaches.\n", | |
"2000 L399703 Nah... scotch.\n", | |
"3000 L447170 That COULD BE TOLD.\n", | |
"4000 L327382 Why fucking not! I deserve it.\n", | |
"5000 L603751 It's spectacular...\n", | |
"6000 L59033 Will you do something for me?\n", | |
"7000 L505431 That's a game isn't it? Anyway... There's been some interesting developments.\n", | |
"8000 L106937 Come on, Pop, all I want to know is one thing. Just one thing after he made such a big deal out of it. I bet it wasn't a big deal. Was it, Caesar?\n", | |
"9000 L159864 Don't shoot, man, don't shoot!\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "4UOeTbDtL8l3", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"Let's take it for a spin! The code in the following cell finds the turn most similar to the string in the variable `sentence`. (You can change this string to whatever you want.) It then uses the Simple Neighbors object to find the turn in the database with the most similar vector, and then uses the `responses` lookup to find the *response* to that turn. That response will be our bot's output." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "l656oBJoBLaa", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 121 | |
}, | |
"outputId": "e63de4fb-6056-42f7-898b-f6eede2c8480", | |
"executionInfo": { | |
"status": "ok", | |
"timestamp": 1531846747831, | |
"user_tz": 240, | |
"elapsed": 444, | |
"user": { | |
"displayName": "Allison Parrish", | |
"photoUrl": "//lh3.googleusercontent.com/-oIUlh1dj3RI/AAAAAAAAAAI/AAAAAAAAAAA/oSjOZttxipI/s50-c-k-no/photo.jpg", | |
"userId": "106202567649533873126" | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"sentence = \"I like making bots.\"\n", | |
"picked = nns.nearest(sentence_mean(nlp, sentence), 5)[0]\n", | |
"response_line_id = responses[picked]\n", | |
"\n", | |
"print(\"Your line:\\n\\t\", sentence)\n", | |
"print(\"Most similar turn:\\n\\t\", movie_lines[picked])\n", | |
"print(\"Response to most similar turn:\\n\\t\", movie_lines[response_line_id])" | |
], | |
"execution_count": 14, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Your line:\n", | |
"\t I like making bots.\n", | |
"Most similar turn:\n", | |
"\t I like that.\n", | |
"Response to most similar turn:\n", | |
"\t I still think we should have met them first.\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "J2AUaRgVPQco", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"## Putting it all together\n", | |
"\n", | |
"The code above is all you need to make a conversational chatbot based on semantic similarity. But there's a lot of stuff to keep track of! So I wrote a little bit of \"glue code\" to make it even easier. You can [see the source code on GitHub](https://github.com/aparrish/semanticsimilaritychatbot/); all the important stuff is [in this file](https://github.com/aparrish/semanticsimilaritychatbot/blob/master/semanticsimilaritychatbot/__init__.py). I'm going to use this library to rewrite the code above in just a few lines, and then we'll use the resulting object to make a chatbot you can use in the browser.\n", | |
"\n", | |
"First, download and install the library:" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "TI4sCHjmQFfu", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 558 | |
}, | |
"outputId": "0376c5e7-9ff7-41f6-f08e-53268633f232", | |
"executionInfo": { | |
"status": "ok", | |
"timestamp": 1531846771685, | |
"user_tz": 240, | |
"elapsed": 3876, | |
"user": { | |
"displayName": "Allison Parrish", | |
"photoUrl": "//lh3.googleusercontent.com/-oIUlh1dj3RI/AAAAAAAAAAI/AAAAAAAAAAA/oSjOZttxipI/s50-c-k-no/photo.jpg", | |
"userId": "106202567649533873126" | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"!pip install https://github.com/aparrish/semanticsimilaritychatbot/archive/master.zip" | |
], | |
"execution_count": 15, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Collecting https://github.com/aparrish/semanticsimilaritychatbot/archive/master.zip\n", | |
" Downloading https://github.com/aparrish/semanticsimilaritychatbot/archive/master.zip\n", | |
"\u001b[K - 10kB 12.8MB/s\n", | |
"Requirement already satisfied: simpleneighbors in /usr/local/lib/python3.6/dist-packages (from semanticsimilaritychatbot==0.0.1) (0.0.1)\n", | |
"Requirement already satisfied: spacy in /usr/local/lib/python3.6/dist-packages (from semanticsimilaritychatbot==0.0.1) (2.0.11)\n", | |
"Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from semanticsimilaritychatbot==0.0.1) (1.14.5)\n", | |
"Requirement already satisfied: annoy in /usr/local/lib/python3.6/dist-packages (from simpleneighbors->semanticsimilaritychatbot==0.0.1) (1.12.0)\n", | |
"Requirement already satisfied: preshed<2.0.0,>=1.0.0 in /usr/local/lib/python3.6/dist-packages (from spacy->semanticsimilaritychatbot==0.0.1) (1.0.0)\n", | |
"Requirement already satisfied: cymem<1.32,>=1.30 in /usr/local/lib/python3.6/dist-packages (from spacy->semanticsimilaritychatbot==0.0.1) (1.31.2)\n", | |
"Requirement already satisfied: thinc<6.11.0,>=6.10.1 in /usr/local/lib/python3.6/dist-packages (from spacy->semanticsimilaritychatbot==0.0.1) (6.10.2)\n", | |
"Requirement already satisfied: pathlib in /usr/local/lib/python3.6/dist-packages (from spacy->semanticsimilaritychatbot==0.0.1) (1.0.1)\n", | |
"Requirement already satisfied: dill<0.3,>=0.2 in /usr/local/lib/python3.6/dist-packages (from spacy->semanticsimilaritychatbot==0.0.1) (0.2.8.2)\n", | |
"Requirement already satisfied: ujson>=1.35 in /usr/local/lib/python3.6/dist-packages (from spacy->semanticsimilaritychatbot==0.0.1) (1.35)\n", | |
"Requirement already satisfied: regex==2017.4.5 in /usr/local/lib/python3.6/dist-packages (from spacy->semanticsimilaritychatbot==0.0.1) (2017.4.5)\n", | |
"Requirement already satisfied: plac<1.0.0,>=0.9.6 in /usr/local/lib/python3.6/dist-packages (from spacy->semanticsimilaritychatbot==0.0.1) (0.9.6)\n", | |
"Requirement already satisfied: murmurhash<0.29,>=0.28 in /usr/local/lib/python3.6/dist-packages (from spacy->semanticsimilaritychatbot==0.0.1) (0.28.0)\n", | |
"Requirement already satisfied: cytoolz<0.9,>=0.8 in /usr/local/lib/python3.6/dist-packages (from thinc<6.11.0,>=6.10.1->spacy->semanticsimilaritychatbot==0.0.1) (0.8.2)\n", | |
"Requirement already satisfied: msgpack-numpy==0.4.1 in /usr/local/lib/python3.6/dist-packages (from thinc<6.11.0,>=6.10.1->spacy->semanticsimilaritychatbot==0.0.1) (0.4.1)\n", | |
"Requirement already satisfied: tqdm<5.0.0,>=4.10.0 in /usr/local/lib/python3.6/dist-packages (from thinc<6.11.0,>=6.10.1->spacy->semanticsimilaritychatbot==0.0.1) (4.23.4)\n", | |
"Requirement already satisfied: msgpack-python in /usr/local/lib/python3.6/dist-packages (from thinc<6.11.0,>=6.10.1->spacy->semanticsimilaritychatbot==0.0.1) (0.5.6)\n", | |
"Requirement already satisfied: termcolor in /usr/local/lib/python3.6/dist-packages (from thinc<6.11.0,>=6.10.1->spacy->semanticsimilaritychatbot==0.0.1) (1.1.0)\n", | |
"Requirement already satisfied: wrapt in /usr/local/lib/python3.6/dist-packages (from thinc<6.11.0,>=6.10.1->spacy->semanticsimilaritychatbot==0.0.1) (1.10.11)\n", | |
"Requirement already satisfied: six<2.0.0,>=1.10.0 in /usr/local/lib/python3.6/dist-packages (from thinc<6.11.0,>=6.10.1->spacy->semanticsimilaritychatbot==0.0.1) (1.11.0)\n", | |
"Requirement already satisfied: toolz>=0.8.0 in /usr/local/lib/python3.6/dist-packages (from cytoolz<0.9,>=0.8->thinc<6.11.0,>=6.10.1->spacy->semanticsimilaritychatbot==0.0.1) (0.9.0)\n", | |
"Building wheels for collected packages: semanticsimilaritychatbot\n", | |
" Running setup.py bdist_wheel for semanticsimilaritychatbot ... \u001b[?25l-\b \bdone\n", | |
"\u001b[?25h Stored in directory: /tmp/pip-ephem-wheel-cache-3wy7pf9p/wheels/f7/af/8e/8a8fbef31bfbfc3b935425efa03db03825795d85f4e23f8255\n", | |
"Successfully built semanticsimilaritychatbot\n", | |
"Installing collected packages: semanticsimilaritychatbot\n", | |
"Successfully installed semanticsimilaritychatbot-0.0.1\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "nPGClLIPQYBw", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"Then create a chatbot object, passing in the spaCy language object (`nlp`) and the number of dimensions:" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "xWbiYvA-K3xv", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"from semanticsimilaritychatbot import SemanticSimilarityChatbot\n", | |
"chatbot = SemanticSimilarityChatbot(nlp, 300)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"id": "TvsLgSCKQfIF", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"The `.add_pair()` method in the object takes two strings: a turn and the response to that turn. We'll get these from the `responses` and `movie_lines` lookups, again sampling ten thousand pairs at random. This cell will take a little while to run:" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "XaEYCz70KyPg", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"sample_n = 10000\n", | |
"for first_id, second_id in random.sample(list(responses.items()), sample_n):\n", | |
" chatbot.add_pair(movie_lines[first_id], movie_lines[second_id])\n", | |
"chatbot.build()" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"id": "FixHpUnoRLdC", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"Once you've built the database, the `.response_for()` method returns a plausible response from the database, based on semantic similarity. Try it out by changing the text between the quotation marks:" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "B-sTY8OUK1ju", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"outputId": "63534634-5381-4a88-8540-0ecc531eb4bd", | |
"executionInfo": { | |
"status": "ok", | |
"timestamp": 1531846865164, | |
"user_tz": 240, | |
"elapsed": 467, | |
"user": { | |
"displayName": "Allison Parrish", | |
"photoUrl": "//lh3.googleusercontent.com/-oIUlh1dj3RI/AAAAAAAAAAI/AAAAAAAAAAA/oSjOZttxipI/s50-c-k-no/photo.jpg", | |
"userId": "106202567649533873126" | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"print(chatbot.response_for(\"Hello computer!\"))" | |
], | |
"execution_count": 19, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Hi there! You alright?\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "jdEXLwEHMKAh", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"To add variety, the `.response_for()` method actually selects randomly among several similar turns. You can change the number of turns it chooses from by passing a second parameter (a number) to the method. In general, the higher the number, the greater the chance is that you'll get an unusual result:" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "gmDFQy-2MiCr", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 538 | |
}, | |
"outputId": "ae605aa1-c467-4d72-9ad9-bd9557b622e9", | |
"executionInfo": { | |
"status": "ok", | |
"timestamp": 1531847230905, | |
"user_tz": 240, | |
"elapsed": 293, | |
"user": { | |
"displayName": "Allison Parrish", | |
"photoUrl": "//lh3.googleusercontent.com/-oIUlh1dj3RI/AAAAAAAAAAI/AAAAAAAAAAA/oSjOZttxipI/s50-c-k-no/photo.jpg", | |
"userId": "106202567649533873126" | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"my_turn = \"The weather's nice today, don't you think?\"\n", | |
"for i in range(5, 51, 5):\n", | |
" print(\"picking from\", i, \"possible responses:\")\n", | |
" print(chatbot.response_for(my_turn, i))\n", | |
" print()" | |
], | |
"execution_count": 27, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"picking from 5 possible responses:\n", | |
"Well, I'd like to be lady-like and think it over.\n", | |
"\n", | |
"picking from 10 possible responses:\n", | |
"What is it. Tammy?\n", | |
"\n", | |
"picking from 15 possible responses:\n", | |
"Everybody does?\n", | |
"\n", | |
"picking from 20 possible responses:\n", | |
"What sign?\n", | |
"\n", | |
"picking from 25 possible responses:\n", | |
"I don't know. What do you feel like doing?\n", | |
"\n", | |
"picking from 30 possible responses:\n", | |
"Have you ever considered piracy? You'd make a wonderful Dread Pirate Roberts.\n", | |
"\n", | |
"picking from 35 possible responses:\n", | |
"Ohhh. You're in therapy too, Marty?\n", | |
"\n", | |
"picking from 40 possible responses:\n", | |
"Yeah?\n", | |
"\n", | |
"picking from 45 possible responses:\n", | |
"Kid stuff or not, it doesn't happen every day, I want to heat it - and if you won't say it, you can sing it...\n", | |
"\n", | |
"picking from 50 possible responses:\n", | |
"Right. I promised my mother.\n", | |
"\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "tZw0DgPiRgz4", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"The Semantic Similarity Chatbot object has a `.save()` method that saves the pre-built database to disk, using a filename prefix you supply. (It saves three different files: `<prefix>.annoy`, `<prefix>-data.pkl`, and `<prefix>-chatbot.pkl`)." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "cPgATDpnLTYv", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"chatbot.save(\"movielines-10k-sample\")" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"id": "FVM_GcjoR9pG", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"You can use a previously-saved database using the `.load()` class method, like so. (This means you don't have to build the database again: you can just load it and start calling `.response_for()`.)" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "Wutboh4MLkja", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"chatbot = SemanticSimilarityChatbot.load(\"movielines-10k-sample\", nlp)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"id": "-6IOQ9DPNrOR", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"outputId": "40663413-60d1-45ba-9586-5ccfda5cd659", | |
"executionInfo": { | |
"status": "ok", | |
"timestamp": 1531847248390, | |
"user_tz": 240, | |
"elapsed": 338, | |
"user": { | |
"displayName": "Allison Parrish", | |
"photoUrl": "//lh3.googleusercontent.com/-oIUlh1dj3RI/AAAAAAAAAAI/AAAAAAAAAAA/oSjOZttxipI/s50-c-k-no/photo.jpg", | |
"userId": "106202567649533873126" | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"print(chatbot.response_for(\"I'm going to go get some coffee.\"))" | |
], | |
"execution_count": 30, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Instant rice...?\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "7Dp0gmuzSkG_", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"If you're using this notebook on Google Colab, the following cell will download all of the files from the pre-built bot to your computer so you can use them later. (Note that you'll still have to download and install spaCy for the chatbot to work.) If you're running the notebook locally with Jupyter Notebook, the files will end up in the same directory as the notebook file itself." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "kCeo-cHdSUxg", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"from google.colab import files\n", | |
"files.download('movielines-10k-sample.annoy')\n", | |
"files.download('movielines-10k-sample-data.pkl')\n", | |
"files.download('movielines-10k-sample-chatbot.pkl')" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"id": "M1wElvooTBjp", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"## Making it interactive\n", | |
"\n", | |
"If you're using this notebook in Google Colab, the following cell will create a little interactive interface for chatting with the bot that you just built. Run the two cells below and start typing into the box." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "WSjtkXigBuRo", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"chatbot_html = \"\"\"\n", | |
"<style type=\"text/css\">#log p { margin: 5px; font-family: sans-serif; }</style>\n", | |
"<div id=\"log\"\n", | |
" style=\"box-sizing: border-box;\n", | |
" width: 600px;\n", | |
" height: 32em;\n", | |
" border: 1px grey solid;\n", | |
" padding: 2px;\n", | |
" overflow: scroll;\">\n", | |
"</div>\n", | |
"<input type=\"text\" id=\"typehere\" placeholder=\"type here!\"\n", | |
" style=\"box-sizing: border-box;\n", | |
" width: 600px;\n", | |
" margin-top: 5px;\">\n", | |
"<script>\n", | |
"function paraWithText(t) {\n", | |
" let tn = document.createTextNode(t);\n", | |
" let ptag = document.createElement('p');\n", | |
" ptag.appendChild(tn);\n", | |
" return ptag;\n", | |
"}\n", | |
"document.querySelector('#typehere').onchange = async function() {\n", | |
" let inputField = document.querySelector('#typehere');\n", | |
" let val = inputField.value;\n", | |
" inputField.value = \"\";\n", | |
" let resp = await getResp(val);\n", | |
" let objDiv = document.getElementById(\"log\");\n", | |
" objDiv.appendChild(paraWithText('π: ' + val));\n", | |
" objDiv.appendChild(paraWithText('π€: ' + resp));\n", | |
" objDiv.scrollTop = objDiv.scrollHeight;\n", | |
"};\n", | |
"async function colabGetResp(val) {\n", | |
" let resp = await google.colab.kernel.invokeFunction(\n", | |
" 'notebook.get_response', [val], {});\n", | |
" return resp.data['application/json']['result'];\n", | |
"}\n", | |
"async function webGetResp(val) {\n", | |
" let resp = await fetch(\"/response.json?sentence=\" + \n", | |
" encodeURIComponent(val));\n", | |
" let data = await resp.json();\n", | |
" return data['result'];\n", | |
"}\n", | |
"</script>\n", | |
"\"\"\"" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"id": "SHilVz_Yy3Th", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 492 | |
}, | |
"outputId": "5b29357f-29f4-4450-a827-c3383dd67e66", | |
"executionInfo": { | |
"status": "ok", | |
"timestamp": 1531847293290, | |
"user_tz": 240, | |
"elapsed": 646, | |
"user": { | |
"displayName": "Allison Parrish", | |
"photoUrl": "//lh3.googleusercontent.com/-oIUlh1dj3RI/AAAAAAAAAAI/AAAAAAAAAAA/oSjOZttxipI/s50-c-k-no/photo.jpg", | |
"userId": "106202567649533873126" | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"import IPython\n", | |
"from google.colab import output\n", | |
"\n", | |
"display(IPython.display.HTML(chatbot_html + \\\n", | |
" \"<script>let getResp = colabGetResp;</script>\"))\n", | |
"\n", | |
"def get_response(val):\n", | |
" resp = chatbot.response_for(val)\n", | |
" return IPython.display.JSON({'result': resp})\n", | |
"\n", | |
"output.register_callback('notebook.get_response', get_response)" | |
], | |
"execution_count": 32, | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"data": { | |
"text/html": [ | |
"\n", | |
"<style type=\"text/css\">#log p { margin: 5px; font-family: sans-serif; }</style>\n", | |
"<div id=\"log\"\n", | |
" style=\"box-sizing: border-box;\n", | |
" width: 600px;\n", | |
" height: 32em;\n", | |
" border: 1px grey solid;\n", | |
" padding: 2px;\n", | |
" overflow: scroll;\">\n", | |
"</div>\n", | |
"<input type=\"text\" id=\"typehere\" placeholder=\"type here!\"\n", | |
" style=\"box-sizing: border-box;\n", | |
" width: 600px;\n", | |
" margin-top: 5px;\">\n", | |
"<script>\n", | |
"function paraWithText(t) {\n", | |
" let tn = document.createTextNode(t);\n", | |
" let ptag = document.createElement('p');\n", | |
" ptag.appendChild(tn);\n", | |
" return ptag;\n", | |
"}\n", | |
"document.querySelector('#typehere').onchange = async function() {\n", | |
" let inputField = document.querySelector('#typehere');\n", | |
" let val = inputField.value;\n", | |
" inputField.value = \"\";\n", | |
" let resp = await getResp(val);\n", | |
" let objDiv = document.getElementById(\"log\");\n", | |
" objDiv.appendChild(paraWithText('π: ' + val));\n", | |
" objDiv.appendChild(paraWithText('π€: ' + resp));\n", | |
" objDiv.scrollTop = objDiv.scrollHeight;\n", | |
"};\n", | |
"async function colabGetResp(val) {\n", | |
" let resp = await google.colab.kernel.invokeFunction(\n", | |
" 'notebook.get_response', [val], {});\n", | |
" return resp.data['application/json']['result'];\n", | |
"}\n", | |
"async function webGetResp(val) {\n", | |
" let resp = await fetch(\"/response.json?sentence=\" + \n", | |
" encodeURIComponent(val));\n", | |
" let data = await resp.json();\n", | |
" return data['result'];\n", | |
"}\n", | |
"</script>\n", | |
"<script>let getResp = colabGetResp;</script>" | |
], | |
"text/plain": [ | |
"<IPython.core.display.HTML object>" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
} | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "Z1CJg2lG67KB", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"If you're not using Colab, try the following two cells to install [Flask](http://flask.pocoo.org) and run a little web server from your notebook that lets you chat with the bot. Click on the link that appears below the second cell to open up the chat in a new window." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "a4hcd0wU4jGK", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 350 | |
}, | |
"outputId": "499f750c-7cd6-4496-a191-db60b0016672", | |
"executionInfo": { | |
"status": "ok", | |
"timestamp": 1531847407776, | |
"user_tz": 240, | |
"elapsed": 4532, | |
"user": { | |
"displayName": "Allison Parrish", | |
"photoUrl": "//lh3.googleusercontent.com/-oIUlh1dj3RI/AAAAAAAAAAI/AAAAAAAAAAA/oSjOZttxipI/s50-c-k-no/photo.jpg", | |
"userId": "106202567649533873126" | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"!pip install flask" | |
], | |
"execution_count": 33, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Collecting flask\n", | |
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/7f/e7/08578774ed4536d3242b14dacb4696386634607af824ea997202cd0edb4b/Flask-1.0.2-py2.py3-none-any.whl (91kB)\n", | |
"\u001b[K 100% |ββββββββββββββββββββββββββββββββ| 92kB 4.1MB/s \n", | |
"\u001b[?25hCollecting click>=5.1 (from flask)\n", | |
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/34/c1/8806f99713ddb993c5366c362b2f908f18269f8d792aff1abfd700775a77/click-6.7-py2.py3-none-any.whl (71kB)\n", | |
"\u001b[K 100% |ββββββββββββββββββββββββββββββββ| 71kB 9.8MB/s \n", | |
"\u001b[?25hRequirement already satisfied: Werkzeug>=0.14 in /usr/local/lib/python3.6/dist-packages (from flask) (0.14.1)\n", | |
"Collecting itsdangerous>=0.24 (from flask)\n", | |
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/dc/b4/a60bcdba945c00f6d608d8975131ab3f25b22f2bcfe1dab221165194b2d4/itsdangerous-0.24.tar.gz (46kB)\n", | |
"\u001b[K 100% |ββββββββββββββββββββββββββββββββ| 51kB 13.9MB/s \n", | |
"\u001b[?25hRequirement already satisfied: Jinja2>=2.10 in /usr/local/lib/python3.6/dist-packages (from flask) (2.10)\n", | |
"Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/dist-packages (from Jinja2>=2.10->flask) (1.0)\n", | |
"Building wheels for collected packages: itsdangerous\n", | |
" Running setup.py bdist_wheel for itsdangerous ... \u001b[?25l-\b \bdone\n", | |
"\u001b[?25h Stored in directory: /content/.cache/pip/wheels/2c/4a/61/5599631c1554768c6290b08c02c72d7317910374ca602ff1e5\n", | |
"Successfully built itsdangerous\n", | |
"Installing collected packages: click, itsdangerous, flask\n", | |
"Successfully installed click-6.7 flask-1.0.2 itsdangerous-0.24\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "25bjOkzX4dC-", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"from flask import Flask, request, jsonify\n", | |
"app = Flask(__name__)\n", | |
"@app.route(\"/response.json\")\n", | |
"def response():\n", | |
" sentence = request.args['sentence']\n", | |
" return jsonify(\n", | |
" {'result': chatbot.response_for(sentence)})\n", | |
"@app.route(\"/\")\n", | |
"def home():\n", | |
" return chatbot_html + \"<script>let getResp = webGetResp;</script>\"\n", | |
"app.run()" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"id": "iH-If5m07h8_", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"## Some things to try\n", | |
"\n", | |
"If you enjoyed following along, here are some things to try:\n", | |
"\n", | |
"* Use the metadata file that comes with the Cornell corpus to make a chatbot that only uses lines from a particular genre of movie. (How is a comedy chatbot different from an action chatbot?)\n", | |
"* Use a different corpus of conversation altogether. Your own chat logs? Conversational exchanges from a novel? Transcripts of interviews on news programs?\n", | |
"* Incorporate some context from the conversation when vectorizing the turns. (You might, for example, include the average of not just the given turn but also the turn that preceded it.)" | |
] | |
} | |
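{ | |
"metadata": { | |
"id": "ctxSketchMd", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"The next cell is a minimal sketch of that last suggestion, not part of the chatbot library itself. It assumes the `nlp` object loaded earlier in the notebook, and the helper name `contextual_vector` is made up for illustration: each turn is represented as a weighted average of its own spaCy summary vector and the vector of the turn that preceded it, so that similarity lookups take a little conversational context into account." | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "ctxSketchCode", | |
"colab_type": "code", | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
} | |
} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"def contextual_vector(turn, previous_turn, context_weight=0.5):\n", | |
"    # hypothetical helper (not part of the chatbot library): blend the\n", | |
"    # turn's spaCy summary vector with the vector of the preceding turn\n", | |
"    turn_vec = nlp(turn).vector\n", | |
"    context_vec = nlp(previous_turn).vector\n", | |
"    return (1 - context_weight) * turn_vec + context_weight * context_vec\n", | |
"\n", | |
"# e.g., vectorize a reply in the context of the turn it responds to\n", | |
"vec = contextual_vector(\"I don't know.\", \"What do you feel like doing?\")\n", | |
"print(vec.shape)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
} | |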
] | |
} |