Last active
November 3, 2016 20:25
MLtext3/submissions/03_basic_regex_homework.ipynb
{ | |
"cells": [ | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "# Basic Regex Homework" | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": true | |
}, | |
"cell_type": "code", | |
"source": "# for Python 2: use print only as a function\nfrom __future__ import print_function", | |
"execution_count": 1, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## Homework 1: FAA tower closures\n\nA list of FAA tower closures has been copied from a [PDF](http://www.faa.gov/news/media/fct_closed.pdf) into the file **`faa.txt`**, which is stored in the **`data`** directory of the course repository." | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": true | |
}, | |
"cell_type": "code", | |
"source": "# read the file into a single string\nwith open('../data/faa.txt') as f:\n    data = f.read()", | |
"execution_count": 2, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "# check the number of characters\nlen(data)", | |
"execution_count": 3, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": "5574" | |
}, | |
"metadata": {}, | |
"execution_count": 3 | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"trusted": false, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "# examine the first 500 characters\nprint(data[0:500])", | |
"execution_count": 3, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": "FAA Contract Tower Closure List\n(149 FCTs)\n3-22-2013\nLOC\nID Facility Name City State\nDHN DOTHAN RGNL DOTHAN AL\nTCL TUSCALOOSA RGNL TUSCALOOSA AL\nFYV DRAKE FIELD FAYETTEVILLE AR\nTXK TEXARKANA RGNL-WEBB FIELD TEXARKANA AR\nGEU GLENDALE MUNI GLENDALE AZ\nGYR PHOENIX GOODYEAR GOODYEAR AZ\nIFP LAUGHLIN/BULLHEAD INTL BULLHEAD CITY AZ\nRYN RYAN FIELD TUCSON AZ\nFUL FULLERTON MUNI FULLERTON CA\nMER CASTLE ATWATER CA\nOXR OXNARD OXNARD CA\nRAL RIVERSIDE MUNI RIVERSIDE CA\nRNM RAMONA RAMONA CA\nSAC SACRAMENTO EXECU\n" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"trusted": false, | |
"collapsed": true | |
}, | |
"cell_type": "code", | |
"source": "# examine the last 500 characters\nprint(data[-500:])", | |
"execution_count": 4, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": " YAKIMA WA\nCWA CENTRAL WISCONSIN MOSINEE WI\nEAU CHIPPEWA VALLEY RGNL EAU CLAIRE WI\nENW KENOSHA RGNL KENOSHA WI\nPage 3 of 4\nFAA Contract Tower Closure List\n(149 FCTs)\n3-22-2013\nLOC\nID Facility Name City State\nJVL SOUTHERN WISCONSIN RGNL JANESVILLE WI\nLSE LA CROSSE MUNI LA CROSSE WI\nMWC LAWRENCE J TIMMERMAN MILWAUKEE WI\nOSH WITTMAN RGNL OSHKOSH WI\nUES WAUKESHA COUNTY WAUKESHA WI\nHLG WHEELING OHIO CO WHEELING WV\nLWB GREENBRIER VALLEY LEWISBURG WV\nPKB MID-OHIO VALLEY RGNL PARKERSBURG WV\nPage 4 of 4\n\n" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Your assignment is to **create a list of tuples** containing the **tower IDs** and the **states** they are located in.\n\nHere is the **expected output:**\n\n> `faa = [('DHN', 'AL'), ('TCL', 'AL'), ..., ('PKB', 'WV')]`" | |
}, | |
{ | |
"metadata": { | |
"trusted": false, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "import re\n\n# capture the tower ID at the start of a line and the two-character state at its end;\n# the header and page-footer lines shown above do not end in a space followed by\n# exactly two word characters, so they are skipped\nfaa = re.findall(r'(\\w+) .* (\\w\\w)$', data, flags=re.M)\nfaa[:5]", | |
"execution_count": 24, | |
"outputs": [ | |
{ | |
"execution_count": 24, | |
"metadata": {}, | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": "[('DHN', 'AL'), ('TCL', 'AL'), ('FYV', 'AR'), ('TXK', 'AR'), ('GEU', 'AZ')]" | |
} | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "As a **bonus task**, use regular expressions to extract the **number of closures** listed in the second line of the file (149), and then use an **assertion** to check that the number of closures is equal to the length of the `faa` list." | |
}, | |
{ | |
"metadata": { | |
"trusted": false, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "# capture the closure count from the '(149 FCTs)' line\nnum = re.search(r'^\\((?P<num_match>\\d+) FCTs\\)$', data, flags=re.M)\nnum_closures = int(num.group('num_match'))\nprint(num_closures)\nassert num_closures == len(faa), 'closure count should match the length of faa'", | |
"execution_count": 40, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": "149\n" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## Homework 2: Stack Overflow reputation\n\nI have downloaded my **Stack Overflow reputation history** into the file **`reputation.txt`**, which is stored in the **`data`** directory of the course repository. (If you are a Stack Overflow user with a reputation of 10 or more, you should be able to [download your own reputation history](http://stackoverflow.com/reputation).)\n\nWe are only interested in the lines that **begin with two dashes**, such as:\n\n> `-- 2012-08-30 rep +5 = 6`\n\nThat line can be interpreted as follows: \"On 2012-08-30, my reputation increased by 5, bringing my reputation total to 6.\"" | |
}, | |
{ | |
"metadata": { | |
"trusted": false, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "# read the file into a single string and preview the first 100 characters\nwith open('../data/reputation.txt', mode='r') as fd:\n    reps = fd.read()\nreps[:100]", | |
"execution_count": 56, | |
"outputs": [ | |
{ | |
"execution_count": 56, | |
"metadata": {}, | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": "'total votes: 36\\n 2 12201376 (5)\\n-- 2012-08-30 rep +5 = 6 \\n 2 13822612 (10)\\n-- 2012-12-1'" | |
} | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Your assignment is to **create a list of tuples** containing only these dated entries, including the **date**, **reputation change** (regardless of whether it is positive/negative/zero), and **running total**.\n\nHere is the **expected output:**\n\n> `rep = [('2012-08-30', '+5', '6'), ('2012-12-11', '+10', '16'), ..., ('2015-10-14', '-1', '317')]`" | |
}, | |
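{ | |
"metadata": { | |
"trusted": false, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "# the sign is optional so that zero changes also match, and \\s+ tolerates\n# variable padding between the change and the '='\nreps_all = re.findall(pattern=r'^-- (\\d{4}-\\d{2}-\\d{2}) rep ([+-]?\\d+)\\s+= (\\d+)', string=reps, flags=re.M)\nreps_all", | |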
"execution_count": 61, | |
"outputs": [ | |
{ | |
"execution_count": 61, | |
"metadata": {}, | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": "[('2012-08-30', '+5', '6'),\n ('2012-12-11', '+10', '16'),\n ('2013-03-20', '+10', '26'),\n ('2014-03-19', '+2', '28'),\n ('2014-05-11', '+2', '30'),\n ('2014-05-12', '+12', '42'),\n ('2014-06-12', '+10', '52'),\n ('2014-06-26', '+10', '62'),\n ('2014-09-03', '+10', '72'),\n ('2014-11-14', '+10', '82'),\n ('2014-11-18', '+2', '84'),\n ('2014-12-08', '+2', '86'),\n ('2014-12-09', '+10', '96'),\n ('2014-12-12', '+2', '98'),\n ('2014-12-24', '+10', '108'),\n ('2015-02-20', '+10', '118'),\n ('2015-03-28', '+10', '128'),\n ('2015-04-26', '+10', '138'),\n ('2015-05-05', '+10', '148'),\n ('2015-05-26', '+10', '158'),\n ('2015-05-27', '+20', '178'),\n ('2015-07-03', '+10', '188'),\n ('2015-08-21', '+10', '308'),\n ('2015-09-07', '+10', '318'),\n ('2015-10-14', '-1', '317')]" | |
} | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "As a **bonus task**, convert this list of tuples into a **pandas DataFrame**. It should have appropriate column names, and the second and third columns should be of type integer (rather than string/object)." | |
}, | |
{ | |
"metadata": { | |
"trusted": false, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "import pandas as pd\n\ndf = pd.DataFrame(data=reps_all, columns=['date', 'rep change', 'running total'])\n\n# convert the two numeric columns from strings to integers;\n# the date column is left as strings (object dtype)\ndf = df.astype({'rep change': int, 'running total': int})\ndf.dtypes", | |
"execution_count": 129, | |
"outputs": [ | |
{ | |
"execution_count": 129, | |
"metadata": {}, | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": "date object\nrep change int32\nrunning total int32\ndtype: object" | |
} | |
} | |
] | |
} | |
], | |
"metadata": { | |
"language_info": { | |
"file_extension": ".py", | |
"codemirror_mode": { | |
"version": 3, | |
"name": "ipython" | |
}, | |
"mimetype": "text/x-python", | |
"version": "3.5.1", | |
"nbconvert_exporter": "python", | |
"name": "python", | |
"pygments_lexer": "ipython3" | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3", | |
"language": "python" | |
}, | |
"gist": { | |
"id": "d961766f0c7ac2b4c084febd470acd29", | |
"data": { | |
"description": "MLtext3/submissions/03_basic_regex_homework.ipynb", | |
"public": true | |
} | |
}, | |
"_draft": { | |
"nbviewer_url": "https://gist.github.com/d961766f0c7ac2b4c084febd470acd29" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |