@paxswill
Last active December 14, 2015 08:49
{
"metadata": {
"name": "Virginia Cities"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "So I wanted to see which letters appear most frequently in the city names in Virgnia. Manually counting it all out, or even typing out the names and then using a computer was too much work, so I turned to Python and the BeautifulSoup and Requests packages."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "I started with some basic setup. `Counter` is a counted set. `pprint` provides a nicer way to print lists and other containers by adding line breaks and such. [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) is a really easy to use HTML parser+navigator. [Requests](http://docs.python-requests.org/en/latest/) makes it easy to download things from the Internet."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from collections import Counter\nimport string\nfrom pprint import PrettyPrinter\n\nfrom bs4 import BeautifulSoup\nimport requests\n\n# A prettier way to print\npprint = PrettyPrinter().pprint",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 71
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Wikipedia provides a nice [list](http://en.wikipedia.org/wiki/Cities_in_virginia) of the cities in Virginia. I find all the tables (`<table>` elements) on the page, and the second one is the one I want. I then get a list of the text of all links (`<a>` elements) in that table."
},
{
"cell_type": "code",
"collapsed": false,
"input": "city_response = requests.get('http://en.wikipedia.org/wiki/Cities_in_virginia')\ncity_soup = BeautifulSoup(city_response.content)\n\ntables = city_soup.find_all('table')\n# Python starts counting at 0 like many other programming languages\ntable = tables[1]\n# There are some blank links, so drop those\ncities = [link.string for link in table.find_all('a') if link.string]",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 72
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now run through the list of cities, and add each one to a `Counter`. Then I print out how many cities there are and a list of the letters used, sorted by how often they were used."
},
{
"cell_type": "code",
"collapsed": false,
"input": "city_count = Counter()\nfor city in cities:\n # Add each letter of the lowercase city name to the Counter\n city_count.update(city.lower())\n\nprint(\"{} cities\".format(len(cities)))\npprint(city_count.most_common())",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "39 cities\n[('a', 35),\n ('r', 31),\n ('o', 30),\n ('n', 29),\n ('e', 27),\n ('s', 25),\n ('l', 24),\n ('i', 23),\n ('t', 17),\n ('h', 14),\n ('u', 11),\n ('c', 11),\n ('b', 10),\n ('f', 10),\n ('g', 10),\n ('p', 9),\n ('m', 9),\n ('d', 8),\n ('k', 7),\n ('v', 6),\n ('w', 6),\n (' ', 6),\n ('x', 4),\n ('y', 2),\n ('q', 1)]\n"
}
],
"prompt_number": 73
},
{
"cell_type": "markdown",
"metadata": {},
"source": "I guess space (' ') counts as a letter, for those cities that are two words. Now I want to see which letters aren't used . To do this I make a set of the letters that were found and subtract it from a set of all lowercase characters.\n\nSidenote: Confused from earlier where I said that `Counter` was a counted set? Well, it is but it isn't a *set* so the set operators won't work on it. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "not_in_cities = set(string.ascii_lowercase) - set(city_count)\npprint(not_in_cities)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "{'z', 'j'}\n"
}
],
"prompt_number": 74
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Bet you can't name a city that contains the letter 'J'!\n\nAnd because they're pretty big, let's look at incorporated towns. Again, Wikipedia has a [list](http://en.wikipedia.org/wiki/List_of_towns_in_Virginia) of them that I'll use. This list isn't in a table, so it's a bit harder to pull out."
},
{
"cell_type": "code",
"collapsed": false,
"input": "town_response = requests.get('http://en.wikipedia.org/wiki/List_of_towns_in_Virginia')\ntown_soup = BeautifulSoup(town_response.content)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 75
},
{
"cell_type": "markdown",
"metadata": {},
"source": "First I grab the main content area. Wikipedia displays the actual town names as elements (`<li> elements) within unordered lists ('<ul>` elements), one for each letter in the alphabet. Because there are some other unordered lists in the content, I find all unordered lists that are only one level down (this is starting to get into the specifics of HTML, but it'll be over soon). Then I go through each list, and save the link text for the first link I find in each list item. After than, the process is the same as the cities."
},
{
"cell_type": "code",
"collapsed": false,
"input": "town_content = town_soup.find('div', id='mw-content-text')\nlists = [element for element in town_content.find_all('ul') if element.parent == town_content]\ntowns = []\nfor ul in lists:\n for element in ul.find_all('li'):\n town_link = element.a\n # The references list has rel=\"nofollow\", and we don't want the references\n if 'rel' not in town_link.attrs:\n towns.append(town_link.string)\n\nfor town in towns:\n town_count.update(town.lower())\n\nprint(\"{} towns\".format(len(towns)))\npprint(town_count.most_common())",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "190 towns\n[('e', 310),\n ('a', 278),\n ('l', 272),\n ('n', 258),\n ('o', 248),\n ('r', 232),\n ('i', 216),\n ('t', 200),\n ('s', 174),\n ('c', 154),\n ('h', 102),\n ('d', 100),\n ('u', 94),\n ('b', 88),\n (' ', 82),\n ('g', 82),\n ('m', 70),\n ('p', 66),\n ('w', 66),\n ('v', 64),\n ('y', 64),\n ('k', 60),\n ('f', 38),\n ('x', 16),\n ('j', 6),\n ('.', 4),\n ('q', 4),\n ('z', 2)]\n"
}
],
"prompt_number": 76
},
{
"cell_type": "code",
"collapsed": false,
"input": "not_in_towns = set(string.ascii_lowercase) - set(town_count)\npprint(not_in_towns)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "set()\n"
}
],
"prompt_number": 77
},
{
"cell_type": "markdown",
"metadata": {},
"source": "`set()` means that every letter is used (the '.'s are from St. Charles and St. Paul). Now I want to see what the counts are for the cities and towns combined. I create a new `Counter` from the cities` and then add the towns'."
},
{
"cell_type": "code",
"collapsed": false,
"input": "city_town_count = Counter(city_count)\ncity_town_count.update(town_count)\npprint(city_town_count.most_common())",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[('e', 337),\n ('a', 313),\n ('l', 296),\n ('n', 287),\n ('o', 278),\n ('r', 263),\n ('i', 239),\n ('t', 217),\n ('s', 199),\n ('c', 165),\n ('h', 116),\n ('d', 108),\n ('u', 105),\n ('b', 98),\n ('g', 92),\n (' ', 88),\n ('m', 79),\n ('p', 75),\n ('w', 72),\n ('v', 70),\n ('k', 67),\n ('y', 66),\n ('f', 48),\n ('x', 20),\n ('j', 6),\n ('q', 5),\n ('.', 4),\n ('z', 2)]\n"
}
],
"prompt_number": 78
}
],
"metadata": {}
}
]
}