Last active
December 14, 2015 08:49
{
"metadata": {
"name": "Virginia Cities"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "I wanted to see which letters appear most frequently in the names of Virginia's cities. Counting by hand, or even typing out the names and then using a computer, was too much work, so I turned to Python and the BeautifulSoup and Requests packages."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "I started with some basic setup. `Counter` is a counted set (a multiset): it tracks how many times each item has been added. `pprint` prints lists and other containers more readably by adding line breaks and indentation. [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) is an easy-to-use HTML parser and navigator. [Requests](http://docs.python-requests.org/en/latest/) makes it easy to download things from the Internet."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from collections import Counter\nimport string\nfrom pprint import PrettyPrinter\n\nfrom bs4 import BeautifulSoup\nimport requests\n\n# A prettier way to print\npprint = PrettyPrinter().pprint",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 71
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Wikipedia provides a nice [list](http://en.wikipedia.org/wiki/Cities_in_virginia) of the cities in Virginia. I find all the tables (`<table>` elements) on the page, and the second one is the one I want. I then get a list of the text of all links (`<a>` elements) in that table."
},
{
"cell_type": "code",
"collapsed": false,
"input": "city_response = requests.get('http://en.wikipedia.org/wiki/Cities_in_virginia')\ncity_soup = BeautifulSoup(city_response.content)\n\ntables = city_soup.find_all('table')\n# Python starts counting at 0 like many other programming languages\ntable = tables[1]\n# There are some blank links, so drop those\ncities = [link.string for link in table.find_all('a') if link.string]",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 72
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now I run through the list of cities and add the letters of each name to a `Counter`. Then I print out how many cities there are and a list of the letters used, sorted by how often they were used."
},
{
"cell_type": "code",
"collapsed": false,
"input": "city_count = Counter()\nfor city in cities:\n    # Add each letter of the lowercase city name to the Counter\n    city_count.update(city.lower())\n\nprint(\"{} cities\".format(len(cities)))\npprint(city_count.most_common())",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "39 cities\n[('a', 35),\n ('r', 31),\n ('o', 30),\n ('n', 29),\n ('e', 27),\n ('s', 25),\n ('l', 24),\n ('i', 23),\n ('t', 17),\n ('h', 14),\n ('u', 11),\n ('c', 11),\n ('b', 10),\n ('f', 10),\n ('g', 10),\n ('p', 9),\n ('m', 9),\n ('d', 8),\n ('k', 7),\n ('v', 6),\n ('w', 6),\n (' ', 6),\n ('x', 4),\n ('y', 2),\n ('q', 1)]\n"
}
],
"prompt_number": 73
},
{
"cell_type": "markdown",
"metadata": {},
"source": "I guess space (' ') counts as a letter, for those cities that are two words. Now I want to see which letters aren't used. To do this I make a set of the letters that were found and subtract it from a set of all lowercase letters.\n\nSidenote: Confused from earlier where I said that `Counter` was a counted set? Well, it counts like one, but it's really a `dict` subclass rather than a *set*, so it can't be subtracted from a plain `set` directly. Converting it with `set()` keeps just the letters and drops the counts."
},
{
"cell_type": "code",
"collapsed": false,
"input": "not_in_cities = set(string.ascii_lowercase) - set(city_count)\npprint(not_in_cities)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "{'z', 'j'}\n"
}
],
"prompt_number": 74
},
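{
"cell_type": "markdown",
"metadata": {},
"source": "To make the sidenote concrete, here's a small sketch. `demo_count` and the sample name 'norfolk' are placeholders I made up for illustration; they aren't part of the analysis. Subtracting a `Counter` from a plain `set` should raise a `TypeError`, while wrapping the `Counter` in `set()` first keeps only its keys and lets the subtraction go through."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from collections import Counter\nimport string\n\n# Hypothetical toy data, only for illustration\ndemo_count = Counter('norfolk')\n\ntry:\n    # A plain set can't have a Counter subtracted from it\n    set(string.ascii_lowercase) - demo_count\nexcept TypeError as error:\n    print('set - Counter fails: {}'.format(error))\n\n# set() keeps only the Counter's keys, so this subtraction works\nprint(sorted(set(string.ascii_lowercase) - set(demo_count)))",
"language": "python",
"metadata": {},
"outputs": []
},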
{
"cell_type": "markdown",
"metadata": {},
"source": "Bet you can't name a Virginia city that contains the letter 'J'!\n\nAnd because cities are pretty big, let's look at incorporated towns too. Again, Wikipedia has a [list](http://en.wikipedia.org/wiki/List_of_towns_in_Virginia) of them that I'll use. This list isn't in a table, so it's a bit harder to pull out."
},
{
"cell_type": "code",
"collapsed": false,
"input": "town_response = requests.get('http://en.wikipedia.org/wiki/List_of_towns_in_Virginia')\ntown_soup = BeautifulSoup(town_response.content)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 75
},
{
"cell_type": "markdown",
"metadata": {},
"source": "First I grab the main content area. Wikipedia displays the actual town names as list items (`<li>` elements) within unordered lists (`<ul>` elements), one list for each letter of the alphabet. Because there are some other unordered lists in the content, I keep only the unordered lists that sit directly inside the content area, one level down (this is starting to get into the specifics of HTML, but it'll be over soon). Then I go through each list and save the link text for the first link I find in each list item. After that, the process is the same as for the cities."
},
{
"cell_type": "code",
"collapsed": false,
"input": "town_content = town_soup.find('div', id='mw-content-text')\nlists = [element for element in town_content.find_all('ul') if element.parent == town_content]\ntowns = []\nfor ul in lists:\n    for element in ul.find_all('li'):\n        town_link = element.a\n        # The references list has rel=\"nofollow\", and we don't want the references\n        if town_link is not None and 'rel' not in town_link.attrs:\n            towns.append(town_link.string)\n\n# Start from an empty Counter so each town's letters are counted exactly once\ntown_count = Counter()\nfor town in towns:\n    town_count.update(town.lower())\n\nprint(\"{} towns\".format(len(towns)))\npprint(town_count.most_common())",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "190 towns\n[('e', 310),\n ('a', 278),\n ('l', 272),\n ('n', 258),\n ('o', 248),\n ('r', 232),\n ('i', 216),\n ('t', 200),\n ('s', 174),\n ('c', 154),\n ('h', 102),\n ('d', 100),\n ('u', 94),\n ('b', 88),\n (' ', 82),\n ('g', 82),\n ('m', 70),\n ('p', 66),\n ('w', 66),\n ('v', 64),\n ('y', 64),\n ('k', 60),\n ('f', 38),\n ('x', 16),\n ('j', 6),\n ('.', 4),\n ('q', 4),\n ('z', 2)]\n"
}
],
"prompt_number": 76
},
{
"cell_type": "code",
"collapsed": false,
"input": "not_in_towns = set(string.ascii_lowercase) - set(town_count)\npprint(not_in_towns)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "set()\n"
}
],
"prompt_number": 77
},
{
"cell_type": "markdown",
"metadata": {},
"source": "`set()` is the empty set, which means that every letter is used (the '.'s are from St. Charles and St. Paul). Now I want to see what the counts are for the cities and towns combined. I create a new `Counter` from the cities' counts and then add the towns'."
},
{
"cell_type": "code",
"collapsed": false,
"input": "city_town_count = Counter(city_count)\ncity_town_count.update(town_count)\npprint(city_town_count.most_common())",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[('e', 337),\n ('a', 313),\n ('l', 296),\n ('n', 287),\n ('o', 278),\n ('r', 263),\n ('i', 239),\n ('t', 217),\n ('s', 199),\n ('c', 165),\n ('h', 116),\n ('d', 108),\n ('u', 105),\n ('b', 98),\n ('g', 92),\n (' ', 88),\n ('m', 79),\n ('p', 75),\n ('w', 72),\n ('v', 70),\n ('k', 67),\n ('y', 66),\n ('f', 48),\n ('x', 20),\n ('j', 6),\n ('q', 5),\n ('.', 4),\n ('z', 2)]\n"
}
],
"prompt_number": 78
}
],
"metadata": {}
}
]
}