paxswill · December 14, 2015 08:49
diff --git a/Virgina-City-LetterFreq.json b/Virgina-City-LetterFreq.json
 {
 "metadata": {
  "name": "Virginia Cities"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": ""
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": ""
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": ""
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "So I wanted to see which letters appear most frequently in the city names in Virgnia. Manually counting it all out, or even typing out the names and then using a computer was too much work, so I turned to Python and the BeautifulSoup and Requests packages."
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "I started with some basic setup. `Counter` is a counted set. `pprint` provides a nicer way to print lists and other containers by adding line breaks and such. [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) is a really easy to use HTML parser+navigator. [Requests](http://docs.python-requests.org/en/latest/) makes it easy to download things from the Internet."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "from collections import Counter\nimport string\nfrom pprint import PrettyPrinter\n\nfrom bs4 import BeautifulSoup\nimport requests\n\n# A prettier way to print\npprint = PrettyPrinter().pprint",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 71
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Wikipedia provides a nice [list](http://en.wikipedia.org/wiki/Cities_in_virginia) of the cities in Virginia. I find all the tables (`<table>` elements) on the page, and the second one is the one I want. I then get a list of the text of all links (`<a>` elements) in that table."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "city_response = requests.get('http://en.wikipedia.org/wiki/Cities_in_virginia')\ncity_soup = BeautifulSoup(city_response.content)\n\ntables = city_soup.find_all('table')\n# Python starts counting at 0 like many other programming languages\ntable = tables[1]\n# There are some blank links, so drop those\ncities = [link.string for link in table.find_all('a') if link.string]",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 72
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Now run through the list of cities, and add each one to a `Counter`. Then I print out how many cities there are and a list of the letters used, sorted by how often they were used."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "city_count = Counter()\nfor city in cities:\n    # Add each letter of the lowercase city name to the Counter\n    city_count.update(city.lower())\n\nprint(\"{} cities\".format(len(cities)))\npprint(city_count.most_common())",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "39 cities\n[('a', 35),\n ('r', 31),\n ('o', 30),\n ('n', 29),\n ('e', 27),\n ('s', 25),\n ('l', 24),\n ('i', 23),\n ('t', 17),\n ('h', 14),\n ('u', 11),\n ('c', 11),\n ('b', 10),\n ('f', 10),\n ('g', 10),\n ('p', 9),\n ('m', 9),\n ('d', 8),\n ('k', 7),\n ('v', 6),\n ('w', 6),\n (' ', 6),\n ('x', 4),\n ('y', 2),\n ('q', 1)]\n"
      }
     ],
     "prompt_number": 73
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "I guess space (' ') counts as a letter, for those cities that are two words. Now I want to see which letters aren't used . To do this I make a set of the letters that were found and subtract it from a set of all lowercase characters.\n\nSidenote: Confused from earlier where I said that `Counter` was a counted set? Well, it is but it isn't a *set* so the set operators won't work on it. "
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "not_in_cities = set(string.ascii_lowercase) - set(city_count)\npprint(not_in_cities)",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "{'z', 'j'}\n"
      }
     ],
     "prompt_number": 74
    },
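    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "To make the earlier sidenote concrete, here is a minimal sketch using two made-up names (`ab` and `bc` are illustrative, not taken from the data). It assumes only the standard library and shows that converting a `Counter` to a `set` of its keys works, while subtracting a `Counter` directly from a `set` raises a `TypeError`."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "from collections import Counter\nimport string\n\n# Hypothetical example data, not taken from the Wikipedia lists\nexample_count = Counter()\nfor name in ['ab', 'bc']:\n    example_count.update(name)\n\n# Counter is a dict subclass, so set(...) keeps just its keys (the letters seen)\nseen = set(example_count)\nunused = set(string.ascii_lowercase) - seen\nprint(sorted(seen))\nprint(len(unused))\n\n# Subtracting a Counter directly from a set raises TypeError\ntry:\n    set(string.ascii_lowercase) - example_count\nexcept TypeError as error:\n    print('TypeError:', error)",
     "language": "python",
     "metadata": {},
     "outputs": []
    },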
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Bet you can't name a city that contains the letter 'J'!\n\nAnd because they're pretty big, let's look at incorporated towns. Again, Wikipedia has a [list](http://en.wikipedia.org/wiki/List_of_towns_in_Virginia) of them that I'll use. This list isn't in a table, so it's a bit harder to pull out."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "town_response = requests.get('http://en.wikipedia.org/wiki/List_of_towns_in_Virginia')\ntown_soup = BeautifulSoup(town_response.content)",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 75
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "First I grab the main content area. Wikipedia displays the actual town names as elements (`<li> elements) within unordered lists ('<ul>` elements), one for each letter in the alphabet. Because there are some other unordered lists in the content, I find all unordered lists that are only one level down (this is starting to get into the specifics of HTML, but it'll be over soon). Then I go through each list, and save the link text for the first link I find in each list item. After than, the process is the same as the cities."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "town_content = town_soup.find('div', id='mw-content-text')\nlists = [element for element in town_content.find_all('ul') if element.parent == town_content]\ntowns = []\nfor ul in lists:\n    for element in ul.find_all('li'):\n        town_link = element.a\n        # The references list has rel=\"nofollow\", and we don't want the references\n        if 'rel' not in town_link.attrs:\n            towns.append(town_link.string)\n\nfor town in towns:\n    town_count.update(town.lower())\n\nprint(\"{} towns\".format(len(towns)))\npprint(town_count.most_common())",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "190 towns\n[('e', 310),\n ('a', 278),\n ('l', 272),\n ('n', 258),\n ('o', 248),\n ('r', 232),\n ('i', 216),\n ('t', 200),\n ('s', 174),\n ('c', 154),\n ('h', 102),\n ('d', 100),\n ('u', 94),\n ('b', 88),\n (' ', 82),\n ('g', 82),\n ('m', 70),\n ('p', 66),\n ('w', 66),\n ('v', 64),\n ('y', 64),\n ('k', 60),\n ('f', 38),\n ('x', 16),\n ('j', 6),\n ('.', 4),\n ('q', 4),\n ('z', 2)]\n"
      }
     ],
     "prompt_number": 76
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "not_in_towns = set(string.ascii_lowercase) - set(town_count)\npprint(not_in_towns)",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "set()\n"
      }
     ],
     "prompt_number": 77
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "`set()` means that every letter is used (the '.'s are from St. Charles and St. Paul). Now I want to see what the counts are for the cities and towns combined. I create a new `Counter` from the cities` and then add the towns'."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "city_town_count = Counter(city_count)\ncity_town_count.update(town_count)\npprint(city_town_count.most_common())",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "[('e', 337),\n ('a', 313),\n ('l', 296),\n ('n', 287),\n ('o', 278),\n ('r', 263),\n ('i', 239),\n ('t', 217),\n ('s', 199),\n ('c', 165),\n ('h', 116),\n ('d', 108),\n ('u', 105),\n ('b', 98),\n ('g', 92),\n (' ', 88),\n ('m', 79),\n ('p', 75),\n ('w', 72),\n ('v', 70),\n ('k', 67),\n ('y', 66),\n ('f', 48),\n ('x', 20),\n ('j', 6),\n ('q', 5),\n ('.', 4),\n ('z', 2)]\n"
      }
     ],
     "prompt_number": 78
    }
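    ,
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "As a small aside on combining counts, here is a sketch with made-up counters (`first` and `second` are hypothetical and not part of the data above). `Counter.update` adds counts in place, which is the approach used above, while `+` returns a new `Counter` and leaves both inputs untouched. For positive counts like these, the two give the same result."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "from collections import Counter\n\n# Hypothetical counts for illustration only\nfirst = Counter('aab')\nsecond = Counter('abc')\n\n# Copy the first Counter, then add the second set of counts in place\ncombined = Counter(first)\ncombined.update(second)\n\n# Alternative: '+' builds a new Counter without modifying either input\nsummed = first + second\n\nprint(combined == summed)\nprint(combined.most_common(1))",
     "language": "python",
     "metadata": {},
     "outputs": []
    }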
   ],
   "metadata": {}
  }
 ]
 }