Created
March 18, 2019 14:49
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Text analysis with Python\n", | |
"\n", | |
"\n", | |
"Copyright 2019 Allen Downey\n", | |
"\n", | |
"[MIT License](https://opensource.org/licenses/MIT)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%matplotlib inline\n", | |
"\n", | |
"import matplotlib.pyplot as plt" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Word Frequencies\n", | |
"----------------\n", | |
"\n", | |
"Let's look at frequencies of words, bigrams and trigrams in a text.\n", | |
"\n", | |
"The following function reads lines from a file and splits them into words:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def iterate_words(filename):\n", | |
"    \"\"\"Read lines from a file and split them into words.\"\"\"\n", | |
"    # The with block closes the file once the generator is exhausted\n", | |
"    with open(filename) as f:\n", | |
"        for line in f:\n", | |
"            # split() already discards surrounding whitespace\n", | |
"            for word in line.split():\n", | |
"                yield word" | |
] | |
}, | |
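To see what the generator yields, here is a minimal sketch that runs the same logic on a tiny temporary file. The sample text and the names `sample_path` and `sample_words` are made up for illustration. Note that punctuation stays attached to words and case is preserved, so `king.` and `king` would be counted separately.

```python
import os
import tempfile

def iterate_words(filename):
    """Read lines from a file and split them into words."""
    with open(filename) as f:
        for line in f:
            for word in line.split():
                yield word

# Write a two-line sample file (invented text, not from the book)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("Once upon a time\nthere was a king.\n")
    sample_path = tmp.name

sample_words = list(iterate_words(sample_path))
os.remove(sample_path)  # clean up the temporary file
print(sample_words)
```

If you wanted case-insensitive counts, one option would be to lowercase each word and strip punctuation before yielding it; the analysis here does not do that.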
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Here's an example using a book from Project Gutenberg. `wc` is a Counter of words, that is, a dictionary that maps from each word to the number of times it appears:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from collections import Counter" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# FAIRY TALES\n", | |
"# By The Brothers Grimm\n", | |
"# http://www.gutenberg.org/cache/epub/2591/pg2591.txt\n", | |
"wc = Counter(iterate_words('pg2591.txt'))" | |
] | |
}, | |
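If `Counter` is new to you, a toy example may help. This sketch uses a made-up sentence (`toy_words` and `toy_wc` are illustrative names) to show that a Counter behaves like a dictionary from items to counts, and that missing keys report a count of zero instead of raising an error.

```python
from collections import Counter

# A made-up sentence, split into words
toy_words = "the cat and the hat and the bat".split()
toy_wc = Counter(toy_words)

print(toy_wc['the'])           # number of times 'the' appears
print(toy_wc['dog'])           # words that never appear have count 0
print(toy_wc.most_common(2))   # the two most frequent (word, count) pairs
```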
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Here are the 20 most common words:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[('the', 6507),\n", | |
" ('and', 5250),\n", | |
" ('to', 2707),\n", | |
" ('a', 1932),\n", | |
" ('he', 1817),\n", | |
" ('of', 1450),\n", | |
" ('was', 1337),\n", | |
" ('in', 1080),\n", | |
" ('she', 1049),\n", | |
" ('that', 1021),\n", | |
" ('his', 1014),\n", | |
" ('you', 941),\n", | |
" ('it', 881),\n", | |
" ('her', 880),\n", | |
" ('had', 827),\n", | |
" ('I', 755),\n", | |
" ('they', 751),\n", | |
" ('for', 721),\n", | |
" ('with', 720),\n", | |
" ('as', 718)]" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"wc.most_common(20)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Word frequencies in natural languages follow a predictable pattern called Zipf's law (which is an instance of Stigler's law, which is also an instance of Stigler's law).\n", | |
"\n", | |
"We can see the pattern by lining up the words in descending order of frequency and plotting their counts (6507, 5250, 2707, ...) versus their ranks (1st, 2nd, 3rd, ...):" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def counter_ranks(wc):\n", | |
"    \"\"\"Returns ranks (starting at 1) and counts as tuples.\"\"\"\n", | |
"    counts = sorted(wc.values(), reverse=True)\n", | |
"    return zip(*enumerate(counts, start=1))" | |
] | |
}, | |
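The `zip(*enumerate(...))` idiom is compact but a little cryptic, so here it is unpacked step by step on made-up counts. This sketch starts ranks at 1, which is an assumption chosen to match the "1st, 2nd, 3rd" wording above; the variable names are illustrative.

```python
# Three made-up counts, sorted from most to least frequent
counts = sorted([2, 4, 1], reverse=True)

# enumerate pairs each count with its rank; start=1 makes ranks 1-based
pairs = list(enumerate(counts, start=1))

# zip(*pairs) transposes the list of pairs into two parallel tuples
ranks, sorted_counts = zip(*pairs)

print(pairs)
print(ranks)
print(sorted_counts)
```

One practical reason to prefer 1-based ranks: a rank of 0 has no logarithm, so it cannot be drawn on a log-scaled axis.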
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAY4AAAEWCAYAAABxMXBSAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3XucHGWd7/HPd2bIhXBJAgFDEkjQ7CpegDhCAFeFuCFBj0GPuEFXAkZzPMuuut4Wzp49KBdvLwXl6CJRouEOsnKIgGIOV1kFMkBECJeEa8ZAMjAJQUIgl9/+UU8nlaF6unsynZ6ZfN+vV6erfvVU1fN0dfrX9dQz1YoIzMzMqtXU6AqYmVn/4sRhZmY1ceIwM7OaOHGYmVlNnDjMzKwmThxmZlYTJw6rmqSvSbq00fWw6km6TdKne7DeeEkhqSXN/1rSrN6vYeNJ+rmksxtdj/7EiaOfknS6pBu7xJaWic3csbXbMSS9T1J7o+uxM4iI6RExv9H1sL7BiaP/ugM4SlIzgKQ3ALsAk7rE3pTKVk0ZvzdySt+8B/o++5Kdvf19mT8c+q9FZInikDT/HuBW4NEusccjYgWApCMlLZL0Yno+srSx1KVxjqT/BNYBB0qaIOl2SS9JWgjs3V2FJM2QtFjSWkmPS5qW4vtJWiCpU9IySZ/JrbNNN0HXswhJT0n6sqQHUr2vkjRE0jDg18B+kv6SHvt1qc9kSc+VEmmKfVjSA2m6SdJpqa4vSLpa0si0rNRVM1vSM8Atab+XprJr0mu4b66e78/tZ0u3XnfrFbyGT0n6l1THlyW15Or4kqQlkj6cK3+ypDslfVfSaklPSppeZtuj0+v45e6OY5l1t3R5VdqnpD0lXSTpWUl/lnR27svMGyXdkl6L5yVdJml4d+3vUg9JOk/SqvR+eEDS29KyoZK+J+nptOxOSUPTsl+k98KLku6Q9NZu2vrB9D5eI+n3kt5R6+s10Dlx9FMR8RpwN1lyID3/DrizS+wOgPSBeANwPrAXcC5wg6S9cpv9JDAH2B14GrgcuJcsYZwFlO3jlnQYcDHwFWB42vdTafEVQDuwH/BR4BuSptTQ3I8B04AJwDuAkyPiZWA6sCIidkuPFfmVIuIu4GXgmFz446ldAJ8Djgfem+q2GvhRl32/F3gLcCxZ+/cExpG9hp8FXqmi/rWudyLwAWB4RGwEHgf+Jm3j68Clkkbnyh9O9oVhb+A7wEWSlN+gpPHA7cAPI+K7VdS5ku72OR/YSHa2eygwFShdZxHwTbLX+y1kr8nXumy7a/vzppK9t/6K7H32d8ALadl3gXcCRwIjga8Cm9OyXwMTgX2A+4DLiholaRIwD/gfZMfqQmCBpMHdvxw7mYjwo58+yP7DXZum/0j2H2Nal9isNP1J4J4u6/+B7EMY4DbgzNyy/cn+8w/LxS4HLi1TlwuB8wri44BNwO652DeBn6fpnwNn55a9D2jPzT8F/H1u/jvAj4vKlqnX2cC8NL07WSI5IM0/DEzJlR0NbABagPFAAAfmln8K+D3wjoL9PAW8v8uxubTSemW286kKZRYDM9L0ycCy3LJdU73fkDuu56btnljDe6vU/pbcdj5daZ/AvsCrwNDc8hOBW8vs53jg/mrbT/Yl4DFgMtCUizeRJeODq2jb8FTfPbu+B4ELgLO6lH8UeG8t/zcH+sNnHP3bHcC7JY0ARkXEUrIPqCNT7G1svb6xH9lZRN7TwJjc/PLc9H7A6si+2efLlzOO7JtxV/sBnRHxUjf7reS53PQ6YLca1r0c+Ej6xvgR4L6IKLXjAODa1CWxhiyRbCL78CvJvyaXADcBV0paIek7knapog61rpffJ5JOynWdrCE7rvluwy2vT0SsS5P51+gTwJ+Ba6qoa7XK7fMAsi7UZ3P1vZDsmz6S9pF0ZerCWgtcyuu7QJdTRkTcAvyQ7MxwpaS5kvZI2xhCwXtQUr
Okb6XuvrVsPRMu6no9APhSqe6p/uPI3seWOHH0b38g676YA/wnQESsBVak2IqIeDKVXUH2nyJvf7IPlJL8rZKfBUYou5aQL1/OcuCNBfEVwEhJu5fZ78tk31hL3tDNPrqqeGvniFhClqims203VanO0yNieO4xJCIKX5OI2BARX4+Ig8i6Qz4InFSpHRXW67Zdkg4AfgL8I7BXRAwHHiTr8qnW14DngcuVu95TJ8vJzjj2zr2me0RE6ZrCN8na946I2AP4e17flm6Pa0ScHxHvBN5K1mX1FbL2raf4PfhxYAbwfrL/L+NTvOg1XA6c0+U9sWtEXNFtq3cyThz9WES8ArQBXyS7vlFyZ4rlR1PdCPyVpI+nC65/BxwEXF9m20+nbX9d0iBJ7wb+WzfVuQg4RdIUZRedx0h6c0QsJzsL+qayi8TvAGaztY95MXCcpJHKRoF9oYaXYCWwl6Q9K5S7nOx6xnuAX+TiPwbOSR/OSBolaUa5jUg6WtLb04fvWrJurU25dsyUtIukVrJrOdWsV8kwsg/SjrStU8jOOGqxATghbesSpRFzyi7g31bjtroVEc8CvwW+J2mP9F54o6T3piK7A38B1kgaQ/ahXzVJ75J0eDpje5ksWWyKiM1k1ybOVTYYo1nSEelMc3eyZPYCWXL/Rje7+Anw2bQPSRom6QNdvvjs9Jw4+r/byboB7szFfpdiWxJHRLxA9k33S2T/gb4KfDAinu9m2x8nuwjaCZxBdvG7UETcA5wCnAe8mOpVOsM5kexb3grgWuCMiFiYll1Cdi3mKbIPnKu6b+42+3yE7ML7E6lboVx3whVk10Nu6dLeHwALgN9Kegm4K7W3nDeQdfesJevWup2sqwXg38i+7a4mu4B9eZXrVWrjEuB7ZGeXK4G3k84uaxHZYIqPkL0v5qXkMa4n26rCScAgYAnZ63EN2fUjyF6bSWTvkRuAX9a47T3IPtxXk51JvkB2URzgy8CfyEYcdgLfJvuMuziV/XOq013lNh4RbcBnyLrDVgPLyK7pWI7SxR8z28lIWkw2OOCFioXNcpw4zMysJu6qMjOzmjhxmJlZTZw4zMysJgPyJmJ77713jB8/vtHVMDPrV+69997nI2JUpXIDMnGMHz+etra2RlfDzKxfkdTd3SG2cFeVmZnVxInDzMxq4sRhZmY1ceIwM7OaOHGYmVlNnDjMzKwmA3I4bk/d9cQLbN6c3bvr8AP3ormplp88MDPbOThx5Jzys0W8siH7mYQlZx7LroP88piZdeWuKjMzq4kTRxm+27yZWTEnjhz5koaZWUVOHGZmVhMnjjLcU2VmVsyJI8c9VWZmlTlxmJlZTZw4yggPqzIzK+TEkSMPqzIzq8iJw8zMauLEUYY7qszMijlx5LijysyssromDknDJV0j6RFJD0s6QtJISQslLU3PI1JZSTpf0jJJD0ialNvOrFR+qaRZ9ayzmZl1r95nHD8AfhMRbwYOBh4GTgNujoiJwM1pHmA6MDE95gAXAEgaCZwBHA4cBpxRSjb15EFVZmbF6pY4JO0BvAe4CCAiXouINcAMYH4qNh84Pk3PAC6OzF3AcEmjgWOBhRHRGRGrgYXAtPpUui5bNTMbUOp5xnEg0AH8TNL9kn4qaRiwb0Q8C5Ce90nlxwDLc+u3p1i5uJmZNUA9E0cLMAm4ICIOBV5ma7dUkaLv+9FNfNuVpTmS2iS1dXR09KS+FfZgZmZQ38TRDrRHxN1p/hqyRLIydUGRnlflyo/LrT8WWNFNfBsRMTciWiOiddSoUT2qsHuqzMwqq1viiIjngOWS/jqFpgBLgAVAaWTULOC6NL0AOCmNrpoMvJi6sm4CpkoakS6KT00xMzNrgHr/qPY/AZdJGgQ8AZxClqyuljQbeAY4IZW9ETgOWAasS2WJiE5JZwGLUrkzI6KzzvUm3FdlZlaorokjIhYDrQWLphSUDeDUMtuZB8zr3dq9nu9VZWZWmf9y3MzMauLEUYb/AN
DMrJgTR457qszMKnPiMDOzmjhxlOGeKjOzYk4cOe6pMjOrzInDzMxq4sRRRnhYlZlZISeOHP8BoJlZZU4cZmZWEyeOMtxRZWZWzIkjxx1VZmaVOXGU4WvjZmbFnDhyfG3czKwyJw4zM6uJE0cZ/iEnM7NiThzbcF+VmVklThxmZlYTJ45y3FNlZlbIiSPHo6rMzCpz4jAzs5o4cZThniozs2J1TRySnpL0J0mLJbWl2EhJCyUtTc8jUlySzpe0TNIDkibltjMrlV8qaVbd6luvDZuZDSA74ozj6Ig4JCJa0/xpwM0RMRG4Oc0DTAcmpscc4ALIEg1wBnA4cBhwRinZmJnZjteIrqoZwPw0PR84Phe/ODJ3AcMljQaOBRZGRGdErAYWAtPqXUnfq8rMrFi9E0cAv5V0r6Q5KbZvRDwLkJ73SfExwPLcuu0pVi6+DUlzJLVJauvo6OhRZT2qysysspY6b/+oiFghaR9goaRHuilb9LEd3cS3DUTMBeYCtLa2+nzBzKxO6nrGEREr0vMq4FqyaxQrUxcU6XlVKt4OjMutPhZY0U28rnyvKjOzYnVLHJKGSdq9NA1MBR4EFgClkVGzgOvS9ALgpDS6ajLwYurKugmYKmlEuig+NcV6v84eV2VmVlE9u6r2Ba5VduGgBbg8In4jaRFwtaTZwDPACan8jcBxwDJgHXAKQER0SjoLWJTKnRkRnXWst5mZdaNuiSMingAOLoi/AEwpiAdwapltzQPm9XYdu+NRVWZmxfyX4zkeVWVmVpkTh5mZ1cSJowz3VJmZFXPiyHFPlZlZZU4cZmZWEyeOMsLDqszMCjlx5MjDqszMKnLiMDOzmjhxlOGeKjOzYk4cZmZWEycOMzOriROHmZnVxIkjx4OqzMwqc+IwM7OaOHGU4VFVZmbFnDhy3FVlZlaZE4eZmdXEiaOM8I3VzcwKOXHkyDdWNzOryInDzMxq4sRRhkdVmZkVq3vikNQs6X5J16f5CZLulrRU0lWSBqX44DS/LC0fn9vG6Sn+qKRj61fXem3ZzGzg2BFnHJ8HHs7Nfxs4LyImAquB2Sk+G1gdEW8CzkvlkHQQMBN4KzAN+HdJzTug3mZmVqCuiUPSWOADwE/TvIBjgGtSkfnA8Wl6RponLZ+Sys8AroyIVyPiSWAZcFg96w14TJWZWRn1PuP4PvBVYHOa3wtYExEb03w7MCZNjwGWA6TlL6byW+IF62whaY6kNkltHR0dPaqse6rMzCqrW+KQ9EFgVUTcmw8XFI0Ky7pbZ2sgYm5EtEZE66hRo2qub8H2tnsbZmYDUUsdt30U8CFJxwFDgD3IzkCGS2pJZxVjgRWpfDswDmiX1ALsCXTm4iX5dczMbAer2xlHRJweEWMjYjzZxe1bIuITwK3AR1OxWcB1aXpBmictvyWyr/0LgJlp1NUEYCJwTz3qLA+rMjOrqJ5nHOX8C3ClpLOB+4GLUvwi4BJJy8jONGYCRMRDkq4GlgAbgVMjYlO9K+mOKjOzYjskcUTEbcBtafoJCkZFRcR64IQy658DnFO/GmZ8vmFmVpn/ctzMzGrixFGGB1WZmRVz4shzX5WZWUVOHGZmVhMnjrLcV2VmVqSqxCHpqGpi/Z17qszMKqv2jOP/VhkzM7MBrtu/45B0BHAkMErSF3OL9gAG9K3NParKzKxYpT8AHATslsrtnouvZettQwYM33LEzKyybhNHRNwO3C7p5xHx9A6qk5mZ9WHV3nJksKS5wPj8OhFxTD0q1Re4p8rMrFi1ieMXwI/Jfsmv7jcYbBR3VJmZVVZt4tgYERfUtSZmZtYvVDsc91eS/kHSaEkjS4+61qzBPKrKzKxYtWccpR9Y+kouFsCBvVudxvKgKjOzyqpKHBExod4VMTOz/qGqxCHppKJ4RFzcu9XpO8LjqszMClXbVfWu3PQQYApwHzCgEoc8rsrMrKJqu6r+KT8vaU/gkrrUyMzM+rSe3lZ9HTCxNyvS13hUlZlZsW
qvcfyKrX9M3Qy8Bbi6XpVqFI+qMjOrrNprHN/NTW8Eno6I9jrUx8zM+riquqrSzQ4fIbtD7gjgtUrrSBoi6R5Jf5T0kKSvp/gESXdLWirpKkmDUnxwml+Wlo/Pbev0FH9U0rG1N7N27qoyMytW7S8Afgy4BzgB+Bhwt6RKt1V/FTgmIg4GDgGmSZoMfBs4LyImAquB2an8bGB1RLwJOC+VQ9JBwEzgrcA04N8lDejfAjEz68uqvTj+r8C7ImJWRJwEHAb8W3crROYvaXaX9AjgGOCaFJ8PHJ+mZ6R50vIpyn4gYwZwZUS8GhFPAsvS/s3MrAGqTRxNEbEqN/9CNetKapa0GFgFLAQeB9ZExMZUpB0Yk6bHAMsB0vIXgb3y8YJ18vuaI6lNUltHR0eVzSrPfwBoZlas2sTxG0k3STpZ0snADcCNlVaKiE0RcQgwluws4S1FxdJz0Zim6CbedV9zI6I1IlpHjRpVqWqF/AuAZmaVVfrN8TcB+0bEVyR9BHg32Qf5H4DLqt1JRKyRdBswGRguqSWdVYwFVqRi7cA4oF1SC7An0JmLl+TXMTOzHazSGcf3gZcAIuKXEfHFiPhnsrON73e3oqRRkoan6aHA+4GHgVvZ+nvls4Dr0vQCtt6F96PALRERKT4zjbqaQPaHh/dU38Se8agqM7Nilf6OY3xEPNA1GBFt+eGyZYwG5qcRUE3A1RFxvaQlwJWSzgbuBy5K5S8CLpG0jOxMY2ba10OSrgaWkP0NyakRUZdfIXRHlZlZZZUSx5Bulg3tbsWUcA4tiD9BwaioiFhPNty3aFvnAOd0W1MzM9shKnVVLZL0ma5BSbOBe+tTJTMz68sqnXF8AbhW0ifYmihagUHAh+tZsUbwoCozs8q6TRwRsRI4UtLRwNtS+IaIuKXuNTMzsz6p2t/juJVsNNROw6OqzMyK9fT3OAYkd1WZmVXmxGFmZjVx4ijD96oyMyvmxJEj/wmgmVlFThxmZlYTJ44yPKrKzKyYE0eOR1WZmVXmxFGGTzjMzIo5ceT4hMPMrDInDjMzq4kTRxnhq+NmZoWcOPJ8ddzMrCInDjMzq4kTRxnuqDIzK+bEkeOOKjOzypw4zMysJk4cZXhQlZlZsbolDknjJN0q6WFJD0n6fIqPlLRQ0tL0PCLFJel8ScskPSBpUm5bs1L5pZJm1a/O9dqymdnAUc8zjo3AlyLiLcBk4FRJBwGnATdHxETg5jQPMB2YmB5zgAsgSzTAGcDhwGHAGaVkY2ZmO17dEkdEPBsR96Xpl4CHgTHADGB+KjYfOD5NzwAujsxdwHBJo4FjgYUR0RkRq4GFwLR61TvXgvrvwsysH9oh1zgkjQcOBe4G9o2IZyFLLsA+qdgYYHlutfYUKxfvuo85ktoktXV0dPSsnj1ay8xs51L3xCFpN+A/gC9ExNruihbEopv4toGIuRHRGhGto0aN6lllzcysoromDkm7kCWNyyLilym8MnVBkZ5XpXg7MC63+lhgRTfxuvKoKjOzYvUcVSXgIuDhiDg3t2gBUBoZNQu4Lhc/KY2umgy8mLqybgKmShqRLopPTbF61LkemzUzG1Ba6rjto4BPAn+StDjF/hfwLeBqSbOBZ4AT0rIbgeOAZcA64BSAiOiUdBawKJU7MyI661hvMzPrRt0SR0TcSfnrzVMKygdwapltzQPm9V7tKnNPlZlZMf/leI47qszMKnPiMDOzmjhxlOFRVWZmxZw4cjyoysysMicOMzOriRNHGeG+KjOzQk4cOfK4KjOzipw4zMysJk4cZbijysysmBNHnnuqzMwqcuIwM7OaOHGU4UFVZmbFnDhy3FNlZlaZE4eZmdXEiaOM8LgqM7NCThw5vleVmVllThxmZlYTJ45y3FNlZlbIiSPH96oyM6vMicPMzGrixFGGe6rMzIrVLXFImidplaQHc7GRkhZKWpqeR6S4JJ0vaZmkByRNyq0zK5VfKmlWveqb7aueWzczGxjqecbxc2
Bal9hpwM0RMRG4Oc0DTAcmpscc4ALIEg1wBnA4cBhwRinZmJlZY9QtcUTEHUBnl/AMYH6ang8cn4tfHJm7gOGSRgPHAgsjojMiVgMLeX0y6jXNTVtPOV7btLleuzEz69d29DWOfSPiWYD0vE+KjwGW58q1p1i5eF0M3aV5y/SrGzbVazdmZv1aX7k4XnR1IbqJv34D0hxJbZLaOjo6elSJoYO2Jo5XnDjMzArt6MSxMnVBkZ5XpXg7MC5Xbiywopv460TE3IhojYjWUaNG9ahy+TOO9RvcVWVmVmRHJ44FQGlk1Czgulz8pDS6ajLwYurKugmYKmlEuig+NcXqYkgucbzyms84zMyKtNRrw5KuAN4H7C2pnWx01LeAqyXNBp4BTkjFbwSOA5YB64BTACKiU9JZwKJU7syI6HrBvddskzjcVWVmVqhuiSMiTiyzaEpB2QBOLbOdecC8XqxaWdt2VTlxmJkV6SsXx/uEoYO2vhzuqjIzK+bEkZPvqlq/0YnDzKyIE0fOthfHParKzKyIE0eOr3GYmVXmxJEz1KOqzMwqcuLIGeIzDjOzipw4crYZVeXEYWZWyIkjx385bmZWmRNHji+Om5lV5sSR41uOmJlV5sSR47vjmplV5sSRM2xwy5bfHX9p/QZe2+jkYWbWlRNHzqCWJkbuOgiAzQFrXnmtwTUyM+t7nDi62OZXAD2yyszsdZw4utg1lzjWOXGYmb2OE0cXuw/ZZct058vuqjIz68qJo4vxew3bMv3H9jUNrImZWd/kxNHFxH132zJ9x2MdDayJmVnf5MTRxbvGj9gy/djKv3hIrplZF04cXRw6bmvi6Hz5Na5qW97A2piZ9T1OHF00NYnPvveNW+b/z3UP8rul7rIyMytx4ijwySMOYPfBLQBEwOz5bXz7N4+wcu36BtfMzKzxFBGNrkNVJE0DfgA0Az+NiG+VK9va2hptbW3btb8lK9Zy8s/uYdVLr24Tf+OoYUzYexj7jxzGfsOHsNdugxg2qIU9h+7CroNaGDqomWGDmxnc0szgliZ2HdSMSvcxMTPrwyTdGxGtFcv1h8QhqRl4DPhboB1YBJwYEUuKyvdG4gBY3rmOz1zcxiPPvbRd22luEs1NoiU9D2puYnBLE5JoagIhmgRNEkrP+WkJBrc0bZmGbB22TLNNfMu0Urnc8my6FC2VKY6T31aKtTQ10dLc80TY0zW3J/luV9ru4crajr32tKnb087t+W7T07Zu1z57vO6OPy7bs9ee7vNDB4/hsAkje7C/6hJHS49qteMdBiyLiCcAJF0JzAAKE0dvGTdyV2743N/w24ee46d3Psni5WvYtLn2RLtpc7Bpc+A/JzSzHeHNb9ijR4mjWv0lcYwB8sOb2oHD8wUkzQHmAOy///69tuPmJjH97aOZ/vbRrF2/gSc7XuaZznU807mOVWvX07luA39Zv4GX1m/klQ2bePnV7Pm1jZtZ99omXvVwXjMbYPpL4ig6Ydvmq39EzAXmQtZVVY9K7DFkFw4eN5yDxw2vep2I7Gxj4+atz69u2MSGzcHmzUEEbI5ID4DseXMEmzdvXfbqxs2UehUjYkvjI2DLXFAYjxQvdUvGln+yMlu3m19/232Ulry2Kat3TwQ9XG87juZ2rdvjffZ8pz1ec7vauR317eGq2/MftOf73PHtzPa743f6rvH1O9uA/pM42oFxufmxwIoG1aUmkmhpFi3NueDQXcqWNzPr6/rLcNxFwERJEyQNAmYCCxpcJzOznVK/OOOIiI2S/hG4iWw47ryIeKjB1TIz2yn1i8QBEBE3Ajc2uh5mZju7/tJVZWZmfYQTh5mZ1cSJw8zMauLEYWZmNekX96qqlaQO4Ont2MTewPO9VJ2+ZiC3DQZ2+wZy28Dt6wsOiIhRlQoNyMSxvSS1VXOjr/5oILcNBnb7BnLbwO3rT9xVZWZmNXHiMDOzmjhxFJvb6ArU0UBuGwzs9g3ktoHb12/4GoeZmdXEZxxmZl
YTJw4zM6uJE0eOpGmSHpW0TNJpja5PNSSNk3SrpIclPSTp8yk+UtJCSUvT84gUl6TzUxsfkDQpt61ZqfxSSbMa1aYikpol3S/p+jQ/QdLdqa5XpdvtI2lwml+Wlo/PbeP0FH9U0rGNacnrSRou6RpJj6TjeMRAOX6S/jm9Lx+UdIWkIf352EmaJ2mVpAdzsV47VpLeKelPaZ3zpe35pfM6igg/sus8zcDjwIHAIOCPwEGNrlcV9R4NTErTuwOPAQcB3wFOS/HTgG+n6eOAX5P9quJk4O4UHwk8kZ5HpOkRjW5frp1fBC4Hrk/zVwMz0/SPgf+Zpv8B+HGanglclaYPSsd0MDAhHevmRrcr1W0+8Ok0PQgYPhCOH9lPPj8JDM0ds5P787ED3gNMAh7MxXrtWAH3AEekdX4NTG/0+7PwdWh0BfrKIx2sm3LzpwOnN7pePWjHdcDfAo8Co1NsNPBomr4QODFX/tG0/ETgwlx8m3INbtNY4GbgGOD69J/qeaCl67Ej+82WI9J0SyqnrsczX67BbdsjfbiqS7zfH7+UOJanD8iWdOyO7e/HDhjfJXH0yrFKyx7Jxbcp15ce7qraqvQmL2lPsX4jndofCtwN7BsRzwKk531SsXLt7Mvt/z7wVWBzmt8LWBMRG9N8vq5b2pGWv5jK99X2HQh0AD9LXXE/lTSMAXD8IuLPwHeBZ4BnyY7FvQycY1fSW8dqTJruGu9znDi2KupL7DdjlSXtBvwH8IWIWNtd0YJYdBNvKEkfBFZFxL35cEHRqLCsT7aP7Jv1JOCCiDgUeJmsu6OcftO+1Nc/g6x7aT9gGDC9oGh/PXaV1NqeftNOJ46t2oFxufmxwIoG1aUmknYhSxqXRcQvU3ilpNFp+WhgVYqXa2dfbf9RwIckPQVcSdZd9X1guKTSL1jm67qlHWn5nkAnfbd97UB7RNyd5q8hSyQD4fi9H3gyIjoiYgPwS+BIBs6xK+mtY9WeprvG+xwnjq0WARPTiI9BZBfnFjS4ThWlURcXAQ9HxLm5RQuA0miNWWTXPkrxk9KIj8nAi+n0+iZgqqQR6Zvi1BRrqIg4PSLGRsR4smNyS0R8ArgV+Ggq1rV9pXZ/NJWPFJ+ZRu5MACaSXYjBAiV8AAACZklEQVRsqIh4Dlgu6a9TaAqwhIFx/J4BJkvaNb1PS20bEMcup1eOVVr2kqTJ6fU6KbetvqXRF1n60oNsFMRjZKM2/rXR9amyzu8mO519AFicHseR9Q3fDCxNzyNTeQE/Sm38E9Ca29angGXpcUqj21bQ1vexdVTVgWQfHsuAXwCDU3xIml+Wlh+YW/9fU7sfpQ+NVgEOAdrSMfx/ZCNtBsTxA74OPAI8CFxCNjKq3x474Aqy6zUbyM4QZvfmsQJa02v1OPBDugya6CsP33LEzMxq4q4qMzOriROHmZnVxInDzMxq4sRhZmY1ceIwM7OaOHGYbSdJmyQtTneA/ZWk4duxrdsktfZm/cx6mxOH2fZ7JSIOiYi3kf2l86mNrpBZPTlxmPWuP5BuTCdpN0k3S7ov/cbCjBQfr+x3N36Sfqvit5KG5jciqUnSfElnN6ANZt1y4jDrJZKayW6rUbpVzXrgwxExCTga+F7uh3kmAj+KiLcCa4D/nttUC3AZ8FhE/O8dUnmzGjhxmG2/oZIWAy+Q/fbEwhQX8A1JDwD/n+xMZN+07MmIWJym7yX7jYeSC8l+7+GcelfcrCecOMy23ysRcQhwANkv+JWucXwCGAW8My1fSXY/JoBXc+tvIjvLKPk9cLSkIZj1QU4cZr0kIl4EPgd8Od3qfk+y3xLZIOlossRSjYuAG4Ff5G4/btZnOHGY9aKIuJ/s97Fnkl2naJXURnb28UgN2zkXuA+4RJL/n1qf4rvjmplZTfxNxszMauLEYWZmNXHiMDOzmjhxmJlZTZw4zMysJk4cZmZWEycOMzOryX8BVQeFkurV4uUAAAAASUVORK5CYII=\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"ranks, counts = counter_ranks(wc)\n", | |
"plt.plot(ranks, counts, linewidth=3)\n", | |
"plt.xlabel('Rank')\n", | |
"plt.ylabel('Count')\n", | |
"plt.title('Word count versus rank, linear scale');" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Huh. Maybe that's not so clear after all. The problem is that the counts drop off very quickly. If we use the highest count to scale the figure, most of the other counts are indistinguishable from zero.\n", | |
"\n", | |
"Also, there are more than 10,000 distinct words, but most of them appear only a few times, so we are wasting most of the space in the figure in a regime where nothing is happening.\n", | |
"\n", | |
"This kind of thing happens a lot. A common way to deal with it is to compute the log of the quantities or to plot them on a log scale:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYgAAAEaCAYAAAAL7cBuAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3XecVNX5x/HPs42ll6V3FhREUFQExAioqFgQW2yx/gxookZjYmISk1ijMVGjUaNGDbFiiYlSLFgAC0oTQQGlCgtI721h9/n9MbM6uzuzO7vs1P2+X695sXPPuXeeOTvMs/ece88xd0dERKSsjEQHICIiyUkJQkREwlKCEBGRsJQgREQkLCUIEREJSwlCRETCUoIQAMzsFjN7NtFxSPTMbJKZ/bga+3U2MzezrBjElDKfo2AbdEt0HMlMCSIJmdlvzGxCmW0LI2w7P77RxYeZDTGzgkTHIVKbKUEkpynA0WaWCWBmrYFs4PAy27oF60bNAvR7DxGLv6ST8TVFqkpfFMlpOoGE0Cf4fBDwPvBVmW2L3X0VgJkNNLPpZrYl+O/AkoMFuyLuNLOPgJ1Avpl1MbPJZrbNzCYCzSsKyMxGmNlsM9tqZovNbFhwe1sze93MNprZIjMbGbLPaDO7I+R5qbMCM1tmZr80sznBuF80s1wzqw+8AbQ1s+3BR9sy8Qwws29LEmZw25lmNif4c4aZ3RSMdYOZvWRmzYJlJV0sV5jZcuC94Os+G6y7OdiGrULiHBryOt91o1S0X5g2XGZmvw7GuMPMskJi3GZm88zszJD6l5nZh2b2VzPbZGZLzezkCMduE2zHX1b0e4ywb0W/w7pm9u/g6883s19V5czOzE43sy+DbTPJzA4KKTvczD4LvveXg7//OyIcp1vw87rFzNab2YshZQeb2cRg/GvM7LfB7f3MbGrwtVeb2UNmlhPh+HWC7bw8eIxHzaxutO8zXSlBJCF3LwQ+JZAECP77AfBhmW1TAIJffOOBB4E84D5gvJnlhRz2YmAU0BD4BngemEkgMdwOXBopHjPrBzwN3Ag0Cb72smDxC0AB0BY4B/iTmR1fhbd7LjAM6AIcAlzm7juAk4FV7t4g+FgVupO7fwLsAI4L2Xxh8H0B/Aw4AxgcjG0T8HCZ1x4MHAScROD9NwY6EGjDq4BdUcRf1f0uAE4Fmrj7PmAxcEzwGLcCz5pZm5D6/Qn8YdAcuAd40sws9IBm1hmYDDzk7n+NIuayKvod/hHoDOQDJwAXRXtQMzsweOzrgRbABGCsmeUEv6j/C4wGmgXrnRnhUBD4jL4NNAXaA38PvkZD4B3gzWD83YB3g/sUAT8n0HZHAccDP41w/D8DBxL4A6wb0A74Q7TvNW25ux5J+ABuAf4b/Plz4AACX6Sh2y4N/nwxMK3M/lMJfNkCTAJuCynrCOwD6odsex54NkIsjwH3h9negcB/woYh2+4CRgd/Hg3cEVI2BCgIeb4MuCjk+T3Ao+HqRojrDuCp4M8NCSSMTsHn84HjQ+q2AfYCWQS+8BzIDyn/P+Bj4JAwr7MMGFrmd/NsZftFOM7/VVJnNjAi+PNlwKKQsnrBuFuH/F7vCx73gip8tkref1YUv8MlwEkhZT+u6PdSpm1+D7wUUpYBrAz+bgcFf7aQ8g9DPy9ljvs08DjQvsz2C4DPonzf1xP8/xN87gSSgQU/O11Dyo4Clkbbpun60BlE8poC/MDMmgIt3H0hgS+igcFtvfh+/KEtgbOCUN8Q+CuoxIqQn9sCmzzwl3po/Ug6EPhLt6y2wEZ331bB61bm25CfdwINqrDv88BZZlYHOAuY5e4l76MT8N9g98JmAgmjCAjt/gltk2eAt4AxZrbKzO4xs+woYqjqfqGviZldYoGuu5I4e1G6u++79nH3ncEfQ9voRwS+aF+JItZwKvsdti0T83c/m9mPQroA34hw7O8+V+5eHNy/XbBspQ
e/jcseO4xfEfginxbssvq/4PZIn03M7EAzG2eBrsitwJ8I35XagkDynRnye3gzuL1WU4JIXlMJdDuMAj4CcPetwKrgtlXuvjRYdxWBL8RQHQl8cZQI/Y+4Gmhqgb7+0PqRrAC6htm+CmgWPM0P97o7CPzHK9G6gtcoq9Jpht19HoEvoJMp3b1UEvPJ7t4k5JHr7mHbxN33uvut7t4TGAicBlxS2fuoZL8K35eZdQL+CVwD5Ll7E+ALAl+E0boFWA88byHjMVVQ2e9wNYEunRIdSn5w9+f8+y7AcGMjpT6Xwa6xDsFjrwbaleku60AE7v6tu49097bAlcAjFrhENdJnE+AfwALgAHdvBPyW8G27nkC34MEhn5XG7l6VP1bSkhJEknL3XcAM4AYC4w8lPgxuC716aQJwoJldGBz4PA/oCYyLcOxvgse+Ndgf/ANgeAXhPAlcbmbHW2Dwt52Z9XD3FQTOau6ywGDtIcAVwHPB/WYDp5hZMwtcdXV9FZpgDZBnZo0rqfc8gfGGQcDLIdsfBe4MfgljZi3MbESkg5jZsWbWO/glu5VAd1RRyPs438yyzawvgX76aParTH0CCWNd8FiXEziDqIq9wA+Dx3rGgleoWWAgfVJlO0fxO3wJ+I2ZNTWzdgSSWbReAk4Nfm6ygV8Ae4KvN5VAO10T/MyOAPpFOpCZ/dDMShLVJgLtVkTgM97azK4PDjQ3NLP+wXoNCfxOtptZD+AnEdqgmECivt/MWgZfr52ZnVSF95qWlCCS22SgJYGkUOKD4LbvEoS7byDwl+svgA0ETsdPc/f1FRz7QgIDoBsJDEQ+Hamiu08DLgfuB7YE4yr5y/ACAn3aqwgMOv7R3ScGy54hMFayjMAA43dXnlTG3RcQGLhcEjztbxuh6gsE+rTfK/N+HwBeB942s23AJ8H3G0lrAt00Wwl0R00GSm74+j2Bv1I3ERhIfj7K/Sp7j/OAewl8Wa4BehM8W6wKD1zUcBaBz8VTwSTRoQrHquh3eBuBAeylBAaDXyHwJR9NXF8RGNT+O4G/0ocDw929MCTmK4DNwXrjKjj2kcCnZradwO/1OndfGuwaOyF47G+BhcCxwX1+SeBzvo1AAqjo8/drYBHwSbA76h2gezTvM51Z6S5AEUkHZjabwCD9hho+7k+A8919cE0eN3jsTwlcpPCvmj62VI/OIETSkLv3qYnkYIH7K44Odi12J3CW+t/9jxDMbLCZtQ52MV1K4DLnN2vi2FIzkipBmFl9M5tpZqclOhYRASCHwGXO24D3gNeAR2ro2N0JdEFuIZB4znH31TV0bKkBMe1iMrOnCPSNr3X3XiHbhxHoI84EnnD3u4PbbyNwxciX7h52gFVEROIj1gliELAdeLokQQSv9viawMBSAYFpJS4gcF10cyAXWK8EISKSWDGdMMzdpwSnAQjVj8DdoUsAzGwMMILAzT/1CVyeucvMJgQvPxMRkQRIxIyS7Sh9x2QB0N/dr4HABGUEziDCJgczG0XgRjHq169/RI8ePWIbrYhImpk5c+Z6d6/0TvFEJIhwdzKG3tE6uqKd3f1xAnOy0LdvX58xY0aNBiciku7MrKKpdb6TiKuYCih9S317AjfoRM3MhpvZ41u2bKnRwERE5HuJSBDTgQMssB5BDnA+gTsjo+buY919VOPGlc3CICIi1RXTBGFmLxCYRqC7mRWY2RUemAP/GgIzYM4nMB3wl7GMQ0REqi7WVzFdEGH7BAITzFWLmQ0HhnfrpvXGRURiJanupI6WuphERGIvJROEiIjEXkomCF3FJCISeymZINTFJCISeymZIEREJPZSMkGoi0lEJPZSekW5vM4H+Sm/H12tfZvVz+HMw9oxpHvLmg1KRCTJmdlMd+9bWb1EzMVUY7bv2ccHCytadrlir81exemHtuWW0w+mWf2cGoxMRCT1pWQXU016/fNVnHDfZMbNWUUqn02JiNS0lEwQJWMQNXW8DTsKueb5z/jJs7NYu2
13TR1WRCSlpfQYRI9effzxVydWeb+dhUU8+O5C5q3eWq6sSb1s/ji8J2f0aYdZuJnJRURSW7RjECmdIPZnPYi9RcU8OmkxD763kL1F5dvguB4tufPMXrRpXHd/wxQRSSrRJoiU7GKqCdmZGVx7/AGMu/YYDm1f/oa79xas5cT7pjBm2nKNTYhIrVRrE0SJ7q0b8p+fDOQ3J/cgJ6t0c2zbs4+bXp3LxU9OY8XGnQmKUEQkMVIyQdT0jXJZmRlcObgrb1x3DH07NS1X/uGi9Zz0tyk8PXUZxcU6mxCR2qHWjkFEUlzs/HvqMu558yt27S0qV96vSzPuOfsQOjevX6OvKyISLxqDqKaMDOPyo7vw1vWDOCo/r1z5tKUbGfbAFJ74YAlFOpsQkTSmBBFBx7x6PPfj/tx5Zi8a1Cl9w/nuvcXcMX4+5zz6MYvWbktQhCIisaUEUYGMDONH/Tvx1s8HMejAFuXKP1u+mVMe+JCH31/EvqLiBEQoIhI7ShBRaNekLv++/Ej++sNDaZRb+myisKiYv7z1FWc88hHzw9x4JyKSqpQgomRmnHNEeybeMJihB7UqV/7Fyq0M//uH3D/xawr36WxCRFJfSiaIRK4H0apRLv+85AgeOL8PTetllyrbV+w88O5CTn/oQ6Yv2xj32EREapIuc90P67fv4Y+vf8n4OavDlp92SBtuOrkH7ZvWi3NkIiKR6TLXOGjeoA4PX3g4j150OM0b1ClXPm7Oao6/dzL3vv0VO/bsS0CEIiLVpwRRA4b1asPEnw/irMPalSvbs6+Yv7+3iOPuncR/ZhboTmwRSRlKEDWkaf0c7juvDy+OGsDBbRuVK1+zdQ+/ePlzznjkI2ZofEJEUoASRA3rn5/H69f8gHvOPiRst9Ocgi2c8+hUrn3hM1Zu3pWACEVEoqMEEQOZGca5R3Zg0o1D+MmQruRklm/msZ+v4ri/TuI+jU+ISJLSVUxxsHzDTu56Yz5vfPFt2PJWjepw40k9OPHgVjTKzQ5bR0SkpmhFuST0yZIN3DZ2XtilTks0b5BDl+b16ZxXny4t6pPfvD6dg89zszPjGK2IpKu0ThBmNhwY3q1bt5ELFy5MdDhVUlTsvDJzBX956yvWby+s0r7tmtTl2B4tuOGE7jSrnxOjCEUk3aV1giiRamcQobbt3svD7y/mqQ+XUljFif5aNKzDX845hCHdW8YoOhFJZ0oQKWL5hp08MmkR05dtZPnGnewtiv73cfGATvzmlB7Uy8mqvLKISJASRAraV1TMys27WLp+R7nHys27CPerym9en/vO60OfDk3iH7CIpCQliDSze28Rj7y/iIfeX0TZm7EzM4xrj+vG1cd2IzvMJbUiIqE0F1Oayc3O5IYTu/PyVQPpnFd68r+iYudv7yzk7H98zCszC9i4o2qD3yIi4egMIgXt2LOPOyfM5/lPl4ctzzA4olNTTujZiqEHtSK/RYM4RygiyUxdTLXAewvW8KtX5rJ++54K6x3cthF/OrM3h2qcQkRQF1OtcFyPVrx1/TEMO7h1hfW+XLWVC/75CR8vXh+nyEQkHShBpLi8BnV49OIjGHftD/jZ8QdwUJvyM8kC7Cws4vJ/Tef9BWvjHKGIpCp1MaWhgk07eWfeGt6et4aPF28oVZadaTxw/mGc0rtNgqITkURTF1Mt1r5pPS47ugvPjxzAb0/pUapsb5FzzfOzeGVmQYKiE5FUkTRnEGZ2EHAd0Bx4193/Udk+OoOIzjOffMPv//dFue159XPIb1Gfri0akN+iPvnNG9C1ZQM6NK1Llu6nEElb0Z5BxHSOBjN7CjgNWOvuvUK2DwMeADKBJ9z9bnefD1xlZhnAP2MZV21z8YBO1MvO5MZXPi91k92GHYVs2FHI9GWbStXPzjTaNqlLgzpZ1MvJpEGdLPp1yeOiAR1pqOnIRWqNmJ5BmNkgYDvwdEmCMLNM4GvgBKAAmA5c4O7zzOx04CbgIXd/vrLj6wyiat6Yu5
qfjfmsSvM9hWpaL5urj+3GRQM6aepxkRSWFGMQ7j4FKLsAcz9gkbsvcfdCYAwwIlj/dXcfCPwolnHVVif3bsOzV/TnyM5Nw65yV5lNO/dyx/j5HPfXSbw0YwX7qjgLrYiklkRMA9oOWBHyvADob2ZDgLOAOsCESDub2ShgFEDHjh1jF2Wa6p+fx8tXDaSo2CnYtJMl63aweN12Fgf/XbJuR6U33q3asptfvTKHxyYvZmjPVmSakdegDif3ak3bJnXj9E5EJNZiPkhtZp2BcSFdTD8ETnL3HwefXwz0c/drq3psdTHFxpZde1m3bQ+7CovYWbiPqUs28MQHS9leydrZdbIyuHJwV64anK8pyEWSWFIMUkdQAHQIed4eWFWVA4SsKFeTcUlQ47rZNK77/WB0//w8Lh7QiUcmLeaZqd9EXOBoz75iHnx3If+ZWcA/L+lLz7bhb9oTkdSQiGsZpwMHmFkXM8sBzgder8oB3H2su49q3LhxTAKU8vIa1OH3p/Xk/RuH8MMj2pNhkeuu3LyL8x6bypSv15Esl1GLSNXF+iqmF4AhBO5tWAP80d2fNLNTgL8RuMz1KXe/szrHVxdT4ixdv4MPFq5j+5597Cos4oVpyyOusZ3fvD7nHtmBC47sSKO6WZhVkF1EJObSejbXkC6mkQsXLkx0OEJgje273lgQcQryEmaQnZlBvZxMBnbN41cn9aBz8/pxilJEIM0TRAmdQSQXd+cfkxdzz5tfRb1PTmYGJxzcih90a84ZfdpRN0f3V4jEmhKEJMyEuat56L1FLFq3ncJ90d8r0aJhHW4a1oOzj2gfw+hEJK0ThLqYUsfGHYU898k3vDRzBas276ao7ILaYbw4agD98/PiEJ1I7ZTWCaKEziBST1Gxs7eomFdnreSvb38Vdv3skw5uxWMXV/rZFZFqSub7IKQWy8wwMjMyubB/R846vB1zCrYw+uOlTJj77Xd13pm/ltVbdtGmse7KFkmklDyDUBdTenF3Trx/CgvXbv9uW4ZBq0a5tG1Sl7ZN6pLfvD5nH96ejnn1EhipSHpQF5OklH9/vIw/vv5lpfUa182mU149TundhisH5eueCpFqSIrZXEWidebh7Whar/K1Jrbs2sucgi3c/cYCRn+8LPaBidRiShCSFBrlZvPEpX05ulsezRvUiWqfW8fO45FJiyiO4sooEam6lOxi0hhE+tu9t4hvt+xm1eZdzP92Gy/PWMGCb7eFrXvxgE7cNuJgdTeJREljEJJ2du8t4o7x83j2k/LTefTp0ISrBufTv0seTevnJCA6kdShy1wl7eRmZ/K7U3oyf/U2Zn5Teh3t2Ss2c9WzszCDU3u34YHzDyOzoilnRaRSGoOQlFI3J5OXrzyKP5/dm3A9Su4wbs5qxs2p0hIjIhKGEoSknIwM47wjO/LmdYMY0r1F2DofLlwf56hE0k9KJggzG25mj2/ZsiXRoUgCdW/dkNGX92PctT+gW8sGpcpmLt8UYS8RiVZKJgitKCeherVrzGtXH11qzGHJuh0s+HZrAqMSSX0pmSBEyqpfJ4uD2jQste2sRz7mtdkrExSRSOpTgpC0MfSgVqWe7yws4roxs3lk0qIERSSS2pQgJG38dEg3zjqsXbnt97z5FfdN/DoBEYmkNiUISRs5WRnce+6h/OnM3uRklv5oP/juQsZMW04q3xgqEm9KEJJWzIwL+3fkmSv6kZtd+uN906tzOe/xT9i6e2+CohNJLSmZIHSZq1Smf34ej150RLnt05Zu5JBb3uaLlfrsiFQmJROELnOVaAzp3pJTe7cJW3ba3z/k6udm8cbc1ewrKo5zZCKpQXMxSVr78zmHkN+iPn9/r/yVTOPnrmb83NV0bFaPq4/tyuADW9K6cW4CohRJTprNVWqF9xes5cdPz6CokrUjhh7Uij8O70mHZlraVNKXVpQTCXFsj5ZMvnEId5/VmwNbNYhY7535azjmnvcZP2d1HKMTSU5KEFJrtG9aj/
P7dWTCz47hiUv6MrBrXsS6Vz8fGJ8Qqc3UxSS12pyCzTw2eQnjIySD8/p24JcndadFw+iWQRVJBVpRTqQK1m7bzdB7J7N1976w5acd0obbR/TSanWSFjQGIVIFLRvmMvsPJ3LpUZ3Clo+bs5p+f3qHJz5YoruxpdZIyQShG+UkFjIyjFtH9OKZK/rRJszlrnuLnDvGz+eqZ2eyY0/4Mw2RdKIuJpEwtuzcyyOTFvHYlCVhy+tmZ3LXWb0Z0actFm7tU5EkpjEIkRqwessubnjxc6Yu2RC2vE5WBlN/czzNNDYhKURjECI1oE3jurwwagBv/3wQ+c3rlyvfs6+YxyOcZYikOiUIkSgc2KohY0YN4Kj88vdOvPmF7peQ9KQEIRKllo1yeWHUAO45+5BS25dt2MmlT01j3iqtgS3pRQlCpIp+2Lc9eWXGHCZ/vY5THvyAG16crfUmJG0oQYhUkZnxy5O6hy179bOVHHLL25zx8Ee8OH05G3cUxjk6kZoTVYIws6Oj2SZSW1zQryPPXtGfxnWzw5bPXrGZX/9nLkfcMZGfPjeThWu2xTlCkf0X7RnE36PcJlJr/OCA5rxx3TEMyG8WsY47TJj7LSfcP4X/zCyIY3Qi+6/CBYPM7ChgINDCzG4IKWoEZMYyMJFU0LZJXcaMOooN2/dw38Svee7T5RHr/uLlzxkzfTkPX3g4LRtpYSJJfhXeKGdmg4EhwFXAoyFF24Cx7r4wptFVQjfKSbLZV1TM+LmreWf+Wt6dv4adhUVh67VtnMuVg7tybt8O1M3R31oSXzV6J7WZdXL3b2oksopf5wzgVKAl8LC7v11RfSUISWaF+4r59X/m8N/PVlZY739XH02fDk3iFJVIzd9JXcfMHjezt83svZJHlIE8ZWZrzeyLMtuHmdlXZrbIzG4CcPf/uftI4DLgvChjE0lKOVkZ3HfuoZx+aNsK653zj491D4UkpWgTxMvAZ8DNwI0hj2iMBoaFbjCzTOBh4GSgJ3CBmfUMqXJzsFwkpZkZD15wGO/+YjDn9e0Qts6+Yufet7+Kc2QilYs2Qexz93+4+zR3n1nyiGZHd58CbCyzuR+wyN2XuHshMAYYYQF/Bt5w91nhjmdmo8xshpnNWLduXZThiyRW1xYN+PM5h7D0rlP42fEHlCt/d8Falq3fkYDIRCKLNkGMNbOfmlkbM2tW8tiP120HrAh5XhDcdi0wFDjHzK4Kt6O7P+7ufd29b4sWLfYjBJH4MzNuOOFAlvzplHJlQ/46id/+dy7Fxak7w7Kkl2gTxKUEupQ+BmYGH/szOhxuAn139wfd/Qh3v8rdHw1TRyQtZGQYvzzxwHLbn/90OUPvm5yAiETKiypBuHuXMI/8/XjdAiC0Q7Y9sCranbWinKSDSwZ2pmGd8rciLVm/g353vkORziQkwaK9zPWScNvd/emoXsSsMzDO3XsFn2cBXwPHAyuB6cCF7v5lVFEH6TJXSXVrtu5m6L2T2RZhCdMHzu/DiD7t4hyVpLuavsz1yJDHMcAtwOlRBvICMBXobmYFZnaFu+8DrgHeAuYDL1UlOegMQtJFq0a5zL31JK4fWn7gGuC6MbM57t5JpPLKj5K6qrXkqJk1Bp5x96iSRKzoDELSyZMfLuX2cfMili+4fRi52brrWvZfrJcc3QmE/5NHRKrlih904cVRAyKW9/j9m1wxenocI5LaLtrpvsea2evBx3jgK+C12IZWYTzqYpK01D8/jwW3D6NTXr2w5e8uWMtJ90+Jc1RSW0U7SD045Ok+4Bt3T/jcxepiknT2+JTF/GnCgrBlF/TryF1n9Y5zRJIuarSLyd0nAwuAhkBTQMtkicTYqEFdeX5k/7BlL0xbztF3RzUdmki1RdvFdC4wDfghcC7wqZmdE8vARAQGdm3OsrtPpV2TuuXKVm7exRMfLElAVFJbRDtI/TvgSHe/1N0vITCX0u9jF1bFNAYhtc1HNx0Xdvsd4+ezdffeOEcjtUW0CS
LD3deGPN9QhX1rnLuPdfdRjRs3TlQIInG39K7y8zcBHHLL25q/SWIi2i/5N83sLTO7zMwuA8YDE2IXloiUZWYsuvPksGX5v53AxHlrlCikRlWYIMysm5kd7e43Ao8BhwCHErgz+vE4xCciIbIyM3j0oiPClo18ega//e/cOEck6ayyM4i/EVh/Gnd/1d1vcPefEzh7+Fusg4tEYxBSmw3r1Tri1Bxjpq9gxEMfxjkiSVeVJYjO7j6n7EZ3nwF0jklEUdAYhNR21w89kJtPPShs2ecFWzji9olxjkjSUWUJIreCsvLX3YlI3Pz4mHxm3Dw0bNmGHYV0vmk8Yz9fpYn+pNoqSxDTzWxk2Y1mdgWBRYNEJIGaN6gTceAa4NoXPqPLbyYw9vNVWl9CqqzCqTbMrBXwXwJ3TpckhL5ADnCmu38b8wgroKk2RL7X+abxldZZdvepcYhEkl2NTLXh7mvcfSBwK7As+LjV3Y9KZHLQILVIeUvvOoWzD29fYZ3ON41n/fY9cYpIUl211oNIFjqDEClv045Cbnp1Dm99uSZinfvPO5QzD6s4mUj6ivYMQglCJE1t3lnIhf/8lHmrt4Ytz29Rn3dvGIyZxTkySbRYLxgkIkmuSb0cJlx3DBcP6BS2fMm6HZyotSWkAkoQImnu9jN6RVypbuHa7XS+aTzbNOGfhKEEIVILlKxU17Redtjy3re8zYJvw3dFSe2lBCFSS+RmZ/LZH04kJzP8f/thf/uADxeuj3NUksxSMkHoMleR6pt320kc16Nl2LKLnvyUw2+fyJad6nKSFE0QmotJpPqyMjN46rIjefLS8BexbNxRyKG3vc3yDTvjHJkkm5RMECKy/44/qBUf/vrYiOWD/vI+T324lC27dDZRWylBiNRi7ZvW44NfHUvD3Kyw5beNm8eht77N/z5byb6i4jhHJ4mmBCFSy3VoVo+5t5zEdceHX2MC4PoXZ9Ptd2+wcvOuOEYmiaYEISIA/PyEA7ltxMEV1jn67vd468uEztEpcaQEISLfueSozvzv6qM5oGWDiHWufGYmN7w0W+tM1AJKECJSSp8OTZh4w2Ce/3H/iHVenbWSS56axl6NS6Q1JQgRCWtgt+YsuvNkLujXMWz5BwtmzFAJAAAOoklEQVTXc8TtE1m0dnucI5N4SckEoRvlROIjKzODu87qzQsjw8/ltHX3PobeN5lPlmyIc2QSDymZIHSjnEh8HdU1j6m/OS5i+fmPf8J9E79mV2FRHKOSWEvJBCEi8demcV2m/24oA7vmhS1/8N2FnPb3D3S/RBpRghCRqLVoWIfnRw7gd6ccFLZ88bod/OS5WXGOSmJFCUJEqmzkoHzuOqt32LKJ89Zw5TMzdIVTGlCCEJFquaBfRyb+fFDYsre+XMNPnp3Fmq274xyV1CQlCBGptgNaNWTGzUPDlr0zfw3H/Pl9PlqkNSZSlRKEiOyX5g3q8M4Ng8OWFRYV86MnPuXm/83VpbApSAlCRPZbt5YNmPa74yOWP/vJci564lO+XrMtjlHJ/lKCEJEa0bJhLi9deRT9ujQLW76v2Dnx/inc+PLnbN2tNSZSgRKEiNSYfl2a8dKVR0Vc0hTg5ZkF3PPmgjhGJdVlqTwjY9++fX3GjBmJDkNEyijcV8zHi9fzyKTFTFu6MWydDs3q0qphLjee1J3++eFvvpPYMLOZ7h5+zdkQSXMGYWb5Zvakmb2S6FhEZP/kZGUwpHtLnri0L4e0Dz8lzoqNu5jxzSZ++tws3X2dpGKaIMzsKTNba2ZflNk+zMy+MrNFZnYTgLsvcfcrYhmPiMRXo9xsXrv6aN68/piIdTbsKOTEv03hmudnMXWxrnRKJrE+gxgNDAvdYGaZwMPAyUBP4AIz6xnjOEQkQcyMHq0b8btTDiIrw8LWWbJuB+PmrObSp6axbtueOEcokcQ0Qbj7FKBsB2Q/YFHwjKEQGAOMiPaYZjbKzGaY2Yx169bVYLQiEksjB+Xz+R9PZM
qNx9Klef2wdQqLirlt3DxmfrOR4uLUHR9NF4kYg2gHrAh5XgC0M7M8M3sUOMzMfhNpZ3d/3N37unvfFi1axDpWEalB9etk0TGvHn8Y3pPszPBnE2M/X8XZ/5jKnRPmxzk6KSsRCSLcp8LdfYO7X+XuXd39rrhHJSJxc2z3lnx003H8+//60b5p3bB1nvv0Gw1eJ1hWAl6zAOgQ8rw9sKoqBzCz4cDwbt261WRcIhJHLRvm0rJhLjefehBXPVt+ivDde4u58J+f0qhuFq0a5TJqUD6d8sJ3TUlsxPw+CDPrDIxz917B51nA18DxwEpgOnChu39Z1WPrPgiR9PDa7JVM/modr362MmKdznn1eP+XQzAL3zUl0UuK+yDM7AVgKtDdzArM7Ap33wdcA7wFzAdeqmpy0JrUIullRJ923Hden4j3TAAs27CTtbrCKa50J7WIJI3/fbaSn780m0hfS0d3y6Nudha52Rmc2rsNJ/duE98A00S0ZxBKECKSVFZs3MmCbwOzvt427ktWbNwVse7LVx3FkZ3DTw4okSVFF1OsqItJJH11aFaPE3q24oSerehcyaD0p1pjIqZSMkG4+1h3H9W4ceT+ShFJfZcc1ZnMCHdfAyxdv5NZyzcxa/kmvly1RZfF1jB1MYlIUlu5eRfzVm3F3Xln/hpemlEQsW5+i/q8fOVR5DWoE8cIU09adzGJSO3RrkldTujZihMPbs0h7ZtUWHfJuh1MmLs6TpGlv5RMEBqDEKmd+nSoOEEAuhS2BqVkgtAYhEjt1KtdY+4/71CO7pZHnw5N6NOhCW0b55aqs357ISs372Ll5l2s365ksT80BiEiKe1fHy3l1rHzIpb369yMJy7rS6Pc7DhGldw0BiEitUKdrMwKy6ct28i4zzUuUR0pmSA0BiEiJXq0aVhpndVbIt9sJ5GlZILQGISIlDisQxN+d8pBdGvZgLaNc2nbOJeGuaUnqi7cp/sjqiMR032LiNQYM2PkoHxGDsr/btsTHyzhjvHfLzi0Zdde1m7bXWq/utmZNNS4RIWUIEQk7dTJKt05Mmb6CsZMX1Fqmxmc3Ks1fzvvMHKyUrIzJebUKiKSdrIzK/9qc4cJc7/lo0Xr4xBRakrJBKFBahGpyIGtKx+4LrFyswawI0nJBKFBahGpyGEdmvCrYd3pnFeP5g3qlHrkZpf+2isqTt17wWJNYxAiknbMjJ8O6cZPh5Rft/7WsV/yr4+Wffd8nxJERCl5BiEiUl1ZZaYP1xThkekMQkRqlawyA9jj5qxm0drtEet3bFaPS4/uXCun6lCCEJFapewZxNyVW5i7suILXmYt38S/Lu8Xy7CSUkp2MekqJhGpruqcCXywcD2pPLFpdaVkgtBVTCJSXScd3Jom9aqWJPYVO7VxLFtdTCJSq3TMq8fb1w9i6pIN7NkbeYD6plfnlEoKRcVe4frY6UgJQkRqnZaNchnRp12FdW5+7YtSk/wVq4tJREQAMq302UJtvKFOCUJEJIyy3UlFOoMQERGAssMNRUW1L0FoDEJEJIyyN9Q9NmUJ9XIqXt60hAGHdmjCoANbxCCy+FGCEBEJI6PMGMSjkxdX+Rh3ntmLH/XvVFMhxZ26mEREwoj2bKEi4+esroFIEiclE4TupBaRWBvWq/V+H2P33qIaiCRxUrKLyd3HAmP79u07MtGxiEh6+vWwHhzctlGFE/mVtXrLbl6ZWfDd81S/MjYlE4SISKxlZlilN9OV9dnyTaUSRKrP35SSXUwiIsmo7MB2qp9BKEGIiNSQsgnCSe0MoQQhIlJDyuQHilN8sTolCBGRGlK+i0lnECIiQvkziBTPD0oQIiI1RWMQIiISVtkJ/nQVk4iIAGAagxARkXDKnkGkeH5Injupzaw+8AhQCExy9+cSHJKISJWk21VMMU0QZvYUcBqw1t17hWwfBjwAZAJPuPvdwFnAK+4+1sxeBJQgRCSllL2KqWDTLgbe9W7MXm9Yrzb8YXjPmB0/1m
cQo4GHgKdLNphZJvAwcAJQAEw3s9eB9sDcYLXUngJRRGqlsmcQRcXOqi27Y/Z6m3cVxuzYEOMxCHefAmwss7kfsMjdl7h7ITAGGEEgWbSvLC4zG2VmM8xsxrp162IRtohIteQ1yCE70yqvmCISMUjdDlgR8rwguO1V4Gwz+wcwNtLO7v64u/d1974tWqT2cn4ikl7q5WTxixO7p02SSMQgdbiWc3ffAVwe1QHMhgPDu3XrVqOBiYjsr6sGd+WSozqxaefemL9Wvez9X/WuIolIEAVAh5Dn7YFVVTmAFgwSkWRWLyeLejlJc5FotSWii2k6cICZdTGzHOB84PUExCEiIhWIaYIwsxeAqUB3MyswsyvcfR9wDfAWMB94yd2/rOJxtSa1iEiMWSovide3b1+fMWNGosMQEUkpZjbT3ftWVk9TbYiISFgpmSDUxSQiEnspmSDcfay7j2rcuHGiQxERSVspPQZhZuuAb0I2NQa2RPm8ObA+RqGVfd2a3KeiepHKwm2vSltB7NqrOm0V7X6xaqtw21L9s1VZnVh9tlKxrSqrlwr/Dzu5e+V3Grt72jyAx6N9DsyIVxw1uU9F9SKVhdtelbaKZXtVp62i3S9WbVVZe6XiZ6uyOrH6bKViW1VWLxX/H0Z6pGQXUwXKTtFR2fN4xVGT+1RUL1JZuO2p3FbR7hertgq3LZnba3/bqqLydPts1db/h2GldBfT/jCzGR7FZV4SoPaKntoqemqrqol3e6XbGURVPJ7oAFKM2it6aqvoqa2qJq7tVWvPIEREpGK1+QxCREQqoAQhIiJhKUGIiEhYShBBZlbfzP5tZv80sx8lOp5kZmb5Zvakmb2S6FhSgZmdEfxcvWZmJyY6nmRmZgeZ2aNm9oqZ/STR8SS74PfWTDM7LRbHT+sEYWZPmdlaM/uizPZhZvaVmS0ys5uCm88CXnH3kcDpcQ82warSVh5YT/yKxESaHKrYXv8Lfq4uA85LQLgJVcW2mu/uVwHnArXu8tcqfmcB/Bp4KVbxpHWCAEYDw0I3mFkm8DBwMtATuMDMehJY2a5kreyiOMaYLEYTfVtJ9drr5mB5bTOaKrSVmZ0OfAi8G98wk8JoomwrMxsKzAPWxCqYtE4Q7j4F2Fhmcz9gUfCv4EJgDDCCwFKo7YN10rpdwqliW9V6VWkvC/gz8Ia7z4p3rIlW1c+Wu7/u7gOBWtfVW8W2OhYYAFwIjDSzGv/eSv1FU6uuHd+fKUAgMfQHHgQeMrNTSfDt7UkkbFuZWR5wJ3CYmf3G3e9KSHTJJ9Jn61pgKNDYzLq5+6OJCC7JRPpsDSHQ3VsHmJCAuJJR2LZy92sAzOwyYL27F9f0C9fGBGFhtrm77wAuj3cwSS5SW20Arop3MCkgUns9SOAPEPlepLaaBEyKbyhJL2xbffeD++hYvXCt60ohkH07hDxvD6xKUCzJTm1VNWqv6KmtopewtqqNCWI6cICZdTGzHOB84PUEx5Ss1FZVo/aKntoqeglrq7ROEGb2AjAV6G5mBWZ2hbvvA64B3gLmAy+5+5eJjDMZqK2qRu0VPbVV9JKtrTRZn4iIhJXWZxAiIlJ9ShAiIhKWEoSIiISlBCEiImEpQYiISFhKECIiEpYShEiUzKzIzGab2RdmNtbMmuzHsSaZWa2bzlpSixKESPR2uXsfd+9FYMbNqxMdkEgsKUGIVM9UArNsYmYNzOxdM5tlZnPNbERwe2czmx9cTe5LM3vbzOqGHsTMMoIrGd6RgPcgUiElCJEqCi7gcjzfz4ezGzjT3Q8nMEf/vWZWMgPnAcDD7n4wsBk4O+RQWcBzwNfufnNcghepAiUIkejVNbPZwAagGTAxuN2AP5nZHOAdAmcWrYJlS919dvDnmUDnkOM9Bnzh7nfGOnCR6lCCEIneLnfvA3QCcvh+DOJHQAvgiGD5GiA3WLYnZP8iSq/B8jFwrJnlIpKElCBEqsjdtwA/A35pZtlAY2Ctu+81s2MJJJBoPE
lg1bSXzaw2Lt4lSU4JQqQa3P0z4HMCc/M/B/Q1sxkEziYWVOE49wGzgGdisaawyP7QdN8iIhKW/mIREZGwlCBERCQsJQgREQlLCUJERMJSghARkbCUIEREJCwlCBERCUsJQkREwvp//U2J4a16kGYAAAAASUVORK5CYII=\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"ranks, counts = counter_ranks(wc)\n", | |
"plt.plot(ranks, counts, linewidth=4)\n", | |
"plt.xlabel('Rank')\n", | |
"plt.ylabel('Count')\n", | |
"plt.xscale('log')\n", | |
"plt.yscale('log')\n", | |
"plt.title('Word count versus rank, log-log scale');" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This (approximately) straight line is characteristic of Zipf's law.\n", | |
"\n", | |
"n-grams\n", | |
"-------\n", | |
"\n", | |
"On to the next topic: bigrams and trigrams." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from itertools import tee\n", | |
"\n", | |
"def pairwise(iterator):\n", | |
"    \"\"\"Iterates through a sequence in overlapping pairs.\n", | |
"    \n", | |
"    If the sequence is 1, 2, 3, 4, the result is (1, 2), (2, 3), (3, 4).\n", | |
"    \"\"\"\n", | |
"    a, b = tee(iterator)\n", | |
"    next(b, None)\n", | |
"    return zip(a, b)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"`bigrams` is a Counter that maps from each pair of consecutive words to the number of times it appears:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"bigrams = Counter(pairwise(iterate_words('pg2591.txt')))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"And here are the 20 most common:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[(('to', 'the'), 444),\n", | |
" (('in', 'the'), 399),\n", | |
" (('of', 'the'), 369),\n", | |
" (('and', 'the'), 349),\n", | |
" (('into', 'the'), 294),\n", | |
" (('said', 'the'), 251),\n", | |
" (('on', 'the'), 199),\n", | |
" (('and', 'when'), 168),\n", | |
" (('he', 'had'), 164),\n", | |
" (('he', 'was'), 164),\n", | |
" (('to', 'be'), 163),\n", | |
" (('it', 'was'), 152),\n", | |
" (('Then', 'the'), 151),\n", | |
" (('I', 'will'), 149),\n", | |
" (('that', 'he'), 143),\n", | |
" (('at', 'the'), 142),\n", | |
" (('came', 'to'), 138),\n", | |
" (('and', 'he'), 135),\n", | |
" (('she', 'was'), 129),\n", | |
" (('all', 'the'), 125)]" | |
] | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"bigrams.most_common(20)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Similarly, we can iterate through the trigrams:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
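"def triplewise(iterator):\n", | |
"    \"\"\"Iterates through a sequence in overlapping triples.\"\"\"\n", | |
"    a, b, c = tee(iterator, 3)\n", | |
"    # Advance b by one and c by two; the default avoids StopIteration on short input\n", | |
"    next(b, None)\n", | |
"    next(c, None)\n", | |
"    next(c, None)\n", | |
"    return zip(a, b, c)" | |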
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"And make a Counter:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"trigrams = Counter(triplewise(iterate_words('pg2591.txt')))\n", | |
"\n", | |
"# Uncomment this line to run the analysis with Elvis Presley lyrics\n", | |
"#trigrams = Counter(triplewise(iterate_words('lyrics-elvis-presley.txt')))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Here are the 20 most common:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[(('came', 'to', 'the'), 65),\n", | |
" (('and', 'when', 'he'), 50),\n", | |
" (('out', 'of', 'the'), 50),\n", | |
" (('said', 'to', 'the'), 34),\n", | |
" (('he', 'came', 'to'), 33),\n", | |
" (('and', 'when', 'she'), 33),\n", | |
" (('went', 'into', 'the'), 32),\n", | |
" (('went', 'to', 'the'), 31),\n", | |
" (('and', 'said', 'to'), 31),\n", | |
" (('one', 'of', 'the'), 30),\n", | |
" (('came', 'to', 'a'), 30),\n", | |
" (('and', 'as', 'he'), 29),\n", | |
" (('they', 'came', 'to'), 29),\n", | |
" (('he', 'did', 'not'), 28),\n", | |
" (('there', 'was', 'a'), 28),\n", | |
" (('that', 'he', 'had'), 28),\n", | |
" (('and', 'I', 'will'), 27),\n", | |
" (('that', 'it', 'was'), 25),\n", | |
" (('and', 'at', 'last'), 24),\n", | |
" (('and', 'when', 'the'), 24)]" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"trigrams.most_common(20)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Markov analysis\n", | |
"\n", | |
"And now for a little fun. I'll make a dictionary that maps from each word pair to a Counter of the words that can follow." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from collections import defaultdict\n", | |
"\n", | |
"d = defaultdict(Counter)\n", | |
"for a, b, c in trigrams:\n", | |
"    d[a, b][c] += trigrams[a, b, c]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now we can look up a pair and see what might come next:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Counter({'ran': 2,\n", | |
" 'on': 1,\n", | |
" 'of': 2,\n", | |
" 'that': 1,\n", | |
" 'came,': 1,\n", | |
" 'streamed': 1,\n", | |
" 'fell': 1,\n", | |
" 'might': 1,\n", | |
" 'ran.': 1})" | |
] | |
}, | |
"execution_count": 16, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"d['the', 'blood']" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Here are the most common words that follow \"into the\":" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[('forest', 15),\n", | |
" ('forest,', 13),\n", | |
" ('garden', 9),\n", | |
" ('kitchen,', 8),\n", | |
" ('cellar', 8),\n", | |
" ('wide', 7),\n", | |
" ('room,', 7),\n", | |
" ('water,', 7),\n", | |
" ('wood', 6),\n", | |
" ('kitchen', 6)]" | |
] | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"d['into', 'the'].most_common(10)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Here are the 10 most common words that follow \"said the\":" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[('old', 13),\n", | |
" ('man,', 12),\n", | |
" ('little', 10),\n", | |
" ('fisherman,', 8),\n", | |
" ('father,', 7),\n", | |
" ('ass,', 6),\n", | |
" ('other;', 5),\n", | |
" ('wife,', 5),\n", | |
" ('fish;', 5),\n", | |
" ('fish.', 5)]" | |
] | |
}, | |
"execution_count": 18, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"d['said', 'the'].most_common(10)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The following function chooses a random word from the suffixes in a Counter, with probability proportional to each word's count:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import random\n", | |
"\n", | |
"def choice(counter):\n", | |
"    \"\"\"Chooses a random element, weighted by its count.\"\"\"\n", | |
"    return random.choice(list(counter.elements()))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"'fox,'" | |
] | |
}, | |
"execution_count": 20, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"choice(d['said', 'the'])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Given a prefix, we can choose a random suffix:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"'fisherman,'" | |
] | |
}, | |
"execution_count": 21, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"prefix = 'said', 'the'\n", | |
"suffix = choice(d[prefix])\n", | |
"suffix" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Then we can shift the words and compute the next prefix:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"('the', 'fisherman,')" | |
] | |
}, | |
"execution_count": 22, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"prefix = prefix[1], suffix\n", | |
"prefix" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Repeating this process, we can generate random new text that has the same correlation structure between words as the original:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"'how happily we shall be your waiting-maid any longer.' So they went up to the forest. Ah! what a blockhead that brother of the sick; the virtues of all one daughter. Although the little birds are singing; you walk gravely along as if they have a lad who takes care of this agreement violates the law of the mill went 'Click clack, click clack, click clack, click clack.' The bird settled on the ground, he thought to find the way homewards free from the roof with his hand into his ear and tell all she had scarcely touched her sister, " | |
] | |
} | |
], | |
"source": [ | |
"for i in range(100):\n", | |
"    suffix = choice(d[prefix])\n", | |
"    print(suffix, end=' ')\n", | |
"    prefix = prefix[1], suffix" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"With a prefix of two words, we typically get text that flirts with sensibility." | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.7" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 1 | |
} |
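The Markov analysis in the notebook can be condensed into a minimal, self-contained sketch. It is not the notebook's own code: the names `build_suffix_map` and `generate` are hypothetical, and the toy sentence stands in for the Project Gutenberg text. Two design choices differ from the notebook and are assumptions on my part. First, it uses `random.choices` with weights instead of expanding `counter.elements()` into a list. Second, it stops early when a prefix has no recorded suffix, which can happen if the generated prefix is the last word pair of the source text.

```python
import random
from collections import Counter, defaultdict


def build_suffix_map(words):
    """Map each pair of consecutive words to a Counter of the words that follow."""
    words = list(words)
    suffix_map = defaultdict(Counter)
    for a, b, c in zip(words, words[1:], words[2:]):
        suffix_map[a, b][c] += 1
    return suffix_map


def generate(suffix_map, prefix, n, rng=random):
    """Generate up to n words, stopping early if the prefix has no known suffix."""
    result = []
    for _ in range(n):
        # .get() avoids creating empty entries in the defaultdict
        counter = suffix_map.get(prefix)
        if not counter:
            break  # dead end: no word was ever observed after this pair
        candidates = list(counter)
        weights = list(counter.values())
        # Draw one suffix with probability proportional to its count
        suffix = rng.choices(candidates, weights=weights)[0]
        result.append(suffix)
        # Shift the window: drop the first word, append the new one
        prefix = prefix[1], suffix
    return result


# Toy input (illustrative only, not from the book)
words = "the cat sat on the mat and the cat ran to the mat".split()
suffix_map = build_suffix_map(words)
text = generate(suffix_map, ("the", "cat"), 20, random.Random(1))
print(" ".join(text))
```

Passing a seeded `random.Random` instance makes the output reproducible. Sampling with weights gives the same distribution as choosing uniformly from `counter.elements()`, without building the expanded list on every step.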