@phdkiran
Last active November 3, 2016 20:25
MLtext3/submissions/03_basic_regex_homework.ipynb
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# Basic Regex Homework"
},
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "# for Python 2: use print only as a function\nfrom __future__ import print_function",
"execution_count": 1,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Homework 1: FAA tower closures\n\nA list of FAA tower closures has been copied from a [PDF](http://www.faa.gov/news/media/fct_closed.pdf) into the file **`faa.txt`**, which is stored in the **`data`** directory of the course repository."
},
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "# read the file into a single string\nwith open('../data/faa.txt') as f:\n    data = f.read()",
"execution_count": 2,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "# check the number of characters\nlen(data)",
"execution_count": 3,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "5574"
},
"metadata": {},
"execution_count": 3
}
]
},
{
"metadata": {
"trusted": false,
"collapsed": false
},
"cell_type": "code",
"source": "# examine the first 500 characters\nprint(data[0:500])",
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "FAA Contract Tower Closure List\n(149 FCTs)\n3-22-2013\nLOC\nID Facility Name City State\nDHN DOTHAN RGNL DOTHAN AL\nTCL TUSCALOOSA RGNL TUSCALOOSA AL\nFYV DRAKE FIELD FAYETTEVILLE AR\nTXK TEXARKANA RGNL-WEBB FIELD TEXARKANA AR\nGEU GLENDALE MUNI GLENDALE AZ\nGYR PHOENIX GOODYEAR GOODYEAR AZ\nIFP LAUGHLIN/BULLHEAD INTL BULLHEAD CITY AZ\nRYN RYAN FIELD TUCSON AZ\nFUL FULLERTON MUNI FULLERTON CA\nMER CASTLE ATWATER CA\nOXR OXNARD OXNARD CA\nRAL RIVERSIDE MUNI RIVERSIDE CA\nRNM RAMONA RAMONA CA\nSAC SACRAMENTO EXECU\n"
}
]
},
{
"metadata": {
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# examine the last 500 characters\nprint(data[-500:])",
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": " YAKIMA WA\nCWA CENTRAL WISCONSIN MOSINEE WI\nEAU CHIPPEWA VALLEY RGNL EAU CLAIRE WI\nENW KENOSHA RGNL KENOSHA WI\nPage 3 of 4\nFAA Contract Tower Closure List\n(149 FCTs)\n3-22-2013\nLOC\nID Facility Name City State\nJVL SOUTHERN WISCONSIN RGNL JANESVILLE WI\nLSE LA CROSSE MUNI LA CROSSE WI\nMWC LAWRENCE J TIMMERMAN MILWAUKEE WI\nOSH WITTMAN RGNL OSHKOSH WI\nUES WAUKESHA COUNTY WAUKESHA WI\nHLG WHEELING OHIO CO WHEELING WV\nLWB GREENBRIER VALLEY LEWISBURG WV\nPKB MID-OHIO VALLEY RGNL PARKERSBURG WV\nPage 4 of 4\n\n"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Your assignment is to **create a list of tuples** containing the **tower IDs** and the **states** they are located in.\n\nHere is the **expected output:**\n\n> `faa = [('DHN', 'AL'), ('TCL', 'AL'), ..., ('PKB', 'WV')]`"
},
{
"metadata": {
"trusted": false,
"collapsed": false
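},
"cell_type": "code",
"source": "import re\n\n# capture the tower ID at the start of a line and the two-character state at the end;\n# re.M makes $ match at the end of every line, and header/footer lines are skipped\n# because they do not end in a space followed by exactly two word characters\n# named-group alternative: (?P<FAAcode>\\w{3}) .* (?P<state>\\w{2})$\nfaa = re.findall(r'(\\w+) .* (\\w\\w)$', data, flags=re.M)\nfaa[:5]",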
"execution_count": 24,
"outputs": [
{
"execution_count": 24,
"metadata": {},
"output_type": "execute_result",
"data": {
"text/plain": "[('DHN', 'AL'), ('TCL', 'AL'), ('FYV', 'AR'), ('TXK', 'AR'), ('GEU', 'AZ')]"
}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "As a **bonus task**, use regular expressions to extract the **number of closures** listed in the second line of the file (149), and then use an **assertion** to check that the number of closures is equal to the length of the `faa` list."
},
{
"metadata": {
"trusted": false,
"collapsed": false
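},
"cell_type": "code",
"source": "# capture the count from the '(149 FCTs)' line near the top of the file\nnum = re.search(r'^\\((?P<num_match>\\d+) FCTs\\)$', data, flags=re.M)\nnum_closures = num.group('num_match')\nprint(num_closures)\n\n# the count stated in the file should equal the number of extracted towers\nassert int(num_closures) == len(faa), 'list length should match the closure count stated in the file'",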
"execution_count": 40,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "149\n"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Homework 2: Stack Overflow reputation\n\nI have downloaded my **Stack Overflow reputation history** into the file **`reputation.txt`**, which is stored in the **`data`** directory of the course repository. (If you are a Stack Overflow user with a reputation of 10 or more, you should be able to [download your own reputation history](http://stackoverflow.com/reputation).)\n\nWe are only interested in the lines that **begin with two dashes**, such as:\n\n> `-- 2012-08-30 rep +5 = 6`\n\nThat line can be interpreted as follows: \"On 2012-08-30, my reputation increased by 5, bringing my reputation total to 6.\""
},
{
"metadata": {
"trusted": false,
"collapsed": false
},
"cell_type": "code",
"source": "# read the file into a single string\nwith open('../data/reputation.txt', mode='r') as fd:\n    reps = fd.read()\n\n# preview the first 100 characters\nreps[:100]",
"execution_count": 56,
"outputs": [
{
"execution_count": 56,
"metadata": {},
"output_type": "execute_result",
"data": {
"text/plain": "'total votes: 36\\n 2 12201376 (5)\\n-- 2012-08-30 rep +5 = 6 \\n 2 13822612 (10)\\n-- 2012-12-1'"
}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Your assignment is to **create a list of tuples** containing only these dated entries, including the **date**, **reputation change** (regardless of whether it is positive/negative/zero), and **running total**.\n\nHere is the **expected output:**\n\n> `rep = [('2012-08-30', '+5', '6'), ('2012-12-11', '+10', '16'), ..., ('2015-10-14', '-1', '317')]`"
},
{
"metadata": {
"trusted": false,
"collapsed": false
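},
"cell_type": "code",
"source": "# match only lines that begin with two dashes;\n# groups: date (YYYY-MM-DD), signed reputation change, running total\nreps_all = re.findall(pattern=r'^-- (\\d{4}-\\d{2}-\\d{2}) rep ([+-]\\d+) \\s+ = (\\d+)', string=reps, flags=re.M)\nreps_all",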
"execution_count": 61,
"outputs": [
{
"execution_count": 61,
"metadata": {},
"output_type": "execute_result",
"data": {
"text/plain": "[('2012-08-30', '+5', '6'),\n ('2012-12-11', '+10', '16'),\n ('2013-03-20', '+10', '26'),\n ('2014-03-19', '+2', '28'),\n ('2014-05-11', '+2', '30'),\n ('2014-05-12', '+12', '42'),\n ('2014-06-12', '+10', '52'),\n ('2014-06-26', '+10', '62'),\n ('2014-09-03', '+10', '72'),\n ('2014-11-14', '+10', '82'),\n ('2014-11-18', '+2', '84'),\n ('2014-12-08', '+2', '86'),\n ('2014-12-09', '+10', '96'),\n ('2014-12-12', '+2', '98'),\n ('2014-12-24', '+10', '108'),\n ('2015-02-20', '+10', '118'),\n ('2015-03-28', '+10', '128'),\n ('2015-04-26', '+10', '138'),\n ('2015-05-05', '+10', '148'),\n ('2015-05-26', '+10', '158'),\n ('2015-05-27', '+20', '178'),\n ('2015-07-03', '+10', '188'),\n ('2015-08-21', '+10', '308'),\n ('2015-09-07', '+10', '318'),\n ('2015-10-14', '-1', '317')]"
}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "As a **bonus task**, convert this list of tuples into a **pandas DataFrame**. It should have appropriate column names, and the second and third columns should be of type integer (rather than string/object)."
},
{
"metadata": {
"trusted": false,
"collapsed": false
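},
"cell_type": "code",
"source": "import pandas as pd\nfrom datetime import datetime\n\ndf = pd.DataFrame(data=reps_all, columns=('date', 'rep change', 'running total'))\n\ndef apply_dtypes(df, list_):\n    # cast each column to the type of the corresponding example value in list_\n    for i, col in enumerate(df.columns):\n        df[col] = df[col].astype(type(list_[i]))\n    return df\n\n# example values: a datetime for the date column, ints for the two numeric columns\n# note: the date column still reports object dtype in the output below\ndf_ = apply_dtypes(df, [datetime.now(), 1, 1])\ndf_.dtypes",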
"execution_count": 129,
"outputs": [
{
"execution_count": 129,
"metadata": {},
"output_type": "execute_result",
"data": {
"text/plain": "date object\nrep change int32\nrunning total int32\ndtype: object"
}
}
]
}
],
"metadata": {
"language_info": {
"file_extension": ".py",
"codemirror_mode": {
"version": 3,
"name": "ipython"
},
"mimetype": "text/x-python",
"version": "3.5.1",
"nbconvert_exporter": "python",
"name": "python",
"pygments_lexer": "ipython3"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"gist": {
"id": "d961766f0c7ac2b4c084febd470acd29",
"data": {
"description": "MLtext3/submissions/03_basic_regex_homework.ipynb",
"public": true
}
},
"_draft": {
"nbviewer_url": "https://gist.github.com/d961766f0c7ac2b4c084febd470acd29"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
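
The two extraction tasks in the notebook can be tried without the course data files. The following is a minimal sketch, not the notebook author's solution: the sample strings are abridged from the outputs shown in the notebook (the `(3 FCTs)` count and the `Page 1 of 1` footer are adjusted to fit the sample), the function names are hypothetical, and the patterns assume three-character tower IDs, uppercase state codes, and an optional sign on the reputation change.

```python
import re

# Abridged sample in the layout shown by the notebook's print(data[0:500]) output.
# Assumption: the "(3 FCTs)" count and "Page 1 of 1" footer are adjusted for this sample.
FAA_SAMPLE = """\
FAA Contract Tower Closure List
(3 FCTs)
3-22-2013
LOC
ID Facility Name City State
DHN DOTHAN RGNL DOTHAN AL
TCL TUSCALOOSA RGNL TUSCALOOSA AL
FYV DRAKE FIELD FAYETTEVILLE AR
Page 1 of 1
"""

# Abridged sample built from lines and tuples shown in the notebook's outputs.
REP_SAMPLE = """\
total votes: 36
 2 12201376 (5)
-- 2012-08-30 rep +5 = 6
 2 13822612 (10)
-- 2012-12-11 rep +10 = 16
-- 2015-10-14 rep -1 = 317
"""


def extract_towers(text):
    """Return (tower ID, state) tuples, one per airport line."""
    # ^ and $ anchor to each line because of re.M; header and footer
    # lines do not end in a space plus two uppercase letters.
    return re.findall(r'^(\w{3}) .* ([A-Z]{2})$', text, flags=re.M)


def extract_closure_count(text):
    """Return the integer N from the '(N FCTs)' line."""
    match = re.search(r'^\((\d+) FCTs\)$', text, flags=re.M)
    if match is None:
        raise ValueError('no "(N FCTs)" line found')
    return int(match.group(1))


def extract_reputation(text):
    """Return (date, change, running total) with the numbers as ints."""
    rows = re.findall(
        r'^-- (\d{4}-\d{2}-\d{2}) +rep +([+-]?\d+) += +(\d+)',
        text,
        flags=re.M,
    )
    # int() accepts a leading sign, so '+5' and '-1' convert directly
    return [(date, int(change), int(total)) for date, change, total in rows]


towers = extract_towers(FAA_SAMPLE)
# same check as the bonus task: stated count equals extracted count
assert extract_closure_count(FAA_SAMPLE) == len(towers)
print(towers)
print(extract_reputation(REP_SAMPLE))
```

Anchoring with `^` and using literal spaces rather than `\s` keeps each match on a single line, since `\s` also matches newlines. Converting the captured strings with `int()` at extraction time is one way to get integer columns before building a DataFrame.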