Last active
November 4, 2016 03:56
MLtext3/submissions/04_intermediate_regex_homework-Copy1.ipynb
{ | |
"cells": [ | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "# Intermediate Regex Homework" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## UFO sightings\n\nThe [ufo-reports](https://github.com/planetsig/ufo-reports) GitHub repository contains reports of UFO sightings downloaded from the [National UFO Reporting Center](http://www.nuforc.org/) website. One of the data fields is the **duration of the sighting**, which includes **free-form text**. These are some example entries:\n\n- 45 minutes\n- 1-2 hrs\n- 20 seconds\n- 1/2 hour\n- about 3 mins\n- minutes\n- one hour?\n- 5min" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Here is **how to read in the file:**\n\n- Use the pandas **`read_csv()`** function to read directly from this [URL](https://raw.githubusercontent.com/planetsig/ufo-reports/master/csv-data/ufo-scrubbed-geocoded-time-standardized.csv).\n- Use the **`header=None`** parameter to specify that the data does not have a header row.\n- Use the **`nrows=100`** parameter to specify that you only want to read in the first 100 rows.\n- Save the relevant Series as a Python list, just like we did in a class exercise." | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": true | |
}, | |
"cell_type": "code", | |
"source": "import pandas as pd\nimport re", | |
"execution_count": 1, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "ufo = pd.read_csv('https://raw.githubusercontent.com/planetsig/ufo-reports/master/csv-data/ufo-scrubbed-geocoded-time-standardized.csv',\n header=None, nrows=100)\nufo.head(2)", | |
"execution_count": 2, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": " 0 1 2 3 4 5 6 \\\n0 10/10/1949 20:30 san marcos tx us cylinder 2700 45 minutes \n1 10/10/1949 21:00 lackland afb tx NaN light 7200 1-2 hrs \n\n 7 8 9 \\\n0 This event took place in early fall around 194... 4/27/2004 29.883056 \n1 1949 Lackland AFB, TX. Lights racing acros... 12/16/2005 29.384210 \n\n 10 \n0 -97.941111 \n1 -98.581082 ", | |
"text/html": "<div>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>0</th>\n <th>1</th>\n <th>2</th>\n <th>3</th>\n <th>4</th>\n <th>5</th>\n <th>6</th>\n <th>7</th>\n <th>8</th>\n <th>9</th>\n <th>10</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>10/10/1949 20:30</td>\n <td>san marcos</td>\n <td>tx</td>\n <td>us</td>\n <td>cylinder</td>\n <td>2700</td>\n <td>45 minutes</td>\n <td>This event took place in early fall around 194...</td>\n <td>4/27/2004</td>\n <td>29.883056</td>\n <td>-97.941111</td>\n </tr>\n <tr>\n <th>1</th>\n <td>10/10/1949 21:00</td>\n <td>lackland afb</td>\n <td>tx</td>\n <td>NaN</td>\n <td>light</td>\n <td>7200</td>\n <td>1-2 hrs</td>\n <td>1949 Lackland AFB&#44 TX. Lights racing acros...</td>\n <td>12/16/2005</td>\n <td>29.384210</td>\n <td>-98.581082</td>\n </tr>\n </tbody>\n</table>\n</div>" | |
}, | |
"metadata": {}, | |
"execution_count": 2 | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "#saving to a list\nd_list = ufo[6].tolist() ", | |
"execution_count": 3, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Your assignment is to **normalize the duration data for the first 100 rows** by splitting each entry into two parts:\n\n- The first part should be a **number**: either a whole number (such as '45') or a decimal (such as '0.5').\n- The second part should be a **unit of time**: either 'hr' or 'min' or 'sec'.\n\nThe expected output is a **list of tuples**, containing the **original (unedited) string**, the **number**, and the **unit of time**. Here is what the output should look like:\n\n> `clean_durations = [('45 minutes', '45', 'min'), ('1-2 hrs', '1', 'hr'), ('20 seconds', '20', 'sec'), ...]`" | |
}, | |
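The number-plus-unit split described above can be sketched with a single regular expression. This is only an illustration under stated assumptions, not the notebook's own solution: `DURATION_RE`, `UNIT_NAMES`, and `normalize_simple` are hypothetical names, and the sketch deliberately returns `None` for fractions (such as '1/2 hour') and for entries without digits (such as 'one hour?'), which need extra handling.

```python
import re

# Hypothetical pattern: a whole or decimal number, optional whitespace, then a unit prefix.
# The lookbehind skips digits that are the tail of a number or the denominator of a fraction.
DURATION_RE = re.compile(
    r"(?<![\d./])(\d+(?:\.\d+)?)\s*(hr|hour|min|sec)", flags=re.IGNORECASE
)

# Map each matched unit prefix to the normalized unit required by the assignment.
UNIT_NAMES = {"hr": "hr", "hour": "hr", "min": "min", "sec": "sec"}


def normalize_simple(entry):
    """Return (original, number, unit), or None if no number+unit pair is found."""
    match = DURATION_RE.search(entry)
    if match is None:
        return None
    number, unit = match.groups()
    # The original string is returned unedited, as the assignment asks.
    return (entry, number, UNIT_NAMES[unit.lower()])


print(normalize_simple("45 minutes"))
print(normalize_simple("1-2 hrs"))
print(normalize_simple("one hour?"))
```

For a range such as '1-2 hrs' the sketch picks the number adjacent to the unit, which stays within the given range.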
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "d_list[:3]", | |
"execution_count": 8, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": "['45 minutes', '1-2 hrs', '20 seconds']" | |
}, | |
"metadata": {}, | |
"execution_count": 8 | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "multi = \"\"\"\n    (?P<duration>[\\d+/-]+)  #Normal or range durations\n    [+/ -]*                 #Match space, +, / or -\n    (?P<units>\\w+)$         #Units\n    \"\"\"\npattern = re.compile(multi, re.VERBOSE)\n\nsub_dict = {'several': '30', 'few': '5', 'couple': '2', 'one': '1', 'hour': 'hr', 'min': 'min', 'sec': 'sec', 'hr': 'hr', 'or more min': 'min', '1min 39s': '99 sec'}\n\ndef sub_pattern(s):\n    for k, v in sub_dict.items():\n        pattern = k + r'(\\S)*'\n        # flags must be passed by keyword: the fourth positional argument of re.sub is count\n        s = re.sub(pattern, v, s, flags=re.I)\n#        print(pattern, s)\n#    s = re.sub(r'min(\\S)*', r'min', s, flags=re.I)\n#    s = re.sub(r'hour(\\S)*|hr(\\S)*', r'hr', s, flags=re.I)\n    return s\n\n\nsub_pattern('mins')\n\nnomatch = []\ndef find_pattern(s):\n    unit = None\n    match = pattern.search(s)\n    if match:\n        d = match.groupdict()\n        #logic for units\n        if d['units'].startswith('min'):\n            unit = 'min'\n        elif d['units'].startswith(('hr', 'hour')):\n            unit = 'hr'\n        elif d['units'].startswith('sec'):\n            unit = 'sec'\n        else:\n            print('no known unit for: {0} in {1}'.format(d['units'], s))\n            # the converted result is only printed; the unconverted tuple is still returned below\n            converted = find_pattern(sub_pattern(s))\n            print('converted to: ', converted)\n#            nomatch.append(s)\n        return (s, d['duration'], unit or d['units'])\n    else:\n        print('match failed for: ', s)\n        converted = sub_pattern(s)\n        print('redoing match on: ', converted)\n        # the retry returns the substituted string, not the original entry\n        return find_pattern(converted)", | |
"execution_count": 9, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": true | |
}, | |
"cell_type": "code", | |
"source": "", | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "clean_durations = [find_pattern(x) for x in d_list]\nlen(clean_durations)\n# sum([1 for x in clean_durations if x is None])\n# nomatch\nclean_durations", | |
"execution_count": 10, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "match failed for: several minutes\nredoing match on: 30 min\nmatch failed for: 5 min.\nredoing match on: 5 min\nmatch failed for: 30 min.\nredoing match on: 30 min\nmatch failed for: 20 sec.\nredoing match on: 20 sec\nmatch failed for: one hour?\nredoing match on: 1 hr\nmatch failed for: 4.5 or more min.\nredoing match on: 4.5 min\nmatch failed for: 30mins.\nredoing match on: 30min\nmatch failed for: couple minutes\nredoing match on: 2 min\nmatch failed for: few minutes\nredoing match on: 5 min\nmatch failed for: 2 sec.\nredoing match on: 2 sec\nmatch failed for: 1 hour(?)\nredoing match on: 1 hr\nno known unit for: s in 1min. 39s\nconverted to: ('99 sec', '99', 'sec')\nmatch failed for: 2 min.\nredoing match on: 2 min\nmatch failed for: 45min.\nredoing match on: 45min\nmatch failed for: 5-10 min.\nredoing match on: 5-10 min\nmatch failed for: several minutes\nredoing match on: 30 min\nmatch failed for: 30 min.\nredoing match on: 30 min\n", | |
"name": "stdout" | |
}, | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": "[('45 minutes', '45', 'min'),\n ('1-2 hrs', '1-2', 'hr'),\n ('20 seconds', '20', 'sec'),\n ('1/2 hour', '1/2', 'hr'),\n ('15 minutes', '15', 'min'),\n ('5 minutes', '5', 'min'),\n ('about 3 mins', '3', 'min'),\n ('20 minutes', '20', 'min'),\n ('3 minutes', '3', 'min'),\n ('30 min', '30', 'min'),\n ('5 min', '5', 'min'),\n ('3 minutes', '3', 'min'),\n ('30 min', '30', 'min'),\n ('3 minutes', '3', 'min'),\n ('30 seconds', '30', 'sec'),\n ('20minutes', '20', 'min'),\n ('2 minutes', '2', 'min'),\n ('20-30 min', '20-30', 'min'),\n ('20 sec', '20', 'sec'),\n ('45 minutes', '45', 'min'),\n ('20 minutes', '20', 'min'),\n ('1 hr', '1', 'hr'),\n ('5-6 minutes', '5-6', 'min'),\n ('1 minute', '1', 'min'),\n ('3 seconds', '3', 'sec'),\n ('30 seconds', '30', 'sec'),\n ('approx: 30 seconds', '30', 'sec'),\n ('5min', '5', 'min'),\n ('15 minutes', '15', 'min'),\n ('4.5 min', '5', 'min'),\n ('3 minutes', '3', 'min'),\n ('30min', '30', 'min'),\n ('3 min', '3', 'min'),\n ('5 minutes', '5', 'min'),\n ('3 to 5 min', '5', 'min'),\n ('2min', '2', 'min'),\n ('1 minute', '1', 'min'),\n ('2 min', '2', 'min'),\n ('15-20 seconds', '15-20', 'sec'),\n ('10min', '10', 'min'),\n ('3 minutes', '3', 'min'),\n ('10 minutes', '10', 'min'),\n ('5 min', '5', 'min'),\n ('1 minute', '1', 'min'),\n ('2 sec', '2', 'sec'),\n ('approx 5 min', '5', 'min'),\n ('1 minute', '1', 'min'),\n ('3min', '3', 'min'),\n ('2 minutes', '2', 'min'),\n ('30 minutes', '30', 'min'),\n ('10 minutes', '10', 'min'),\n ('1 hr', '1', 'hr'),\n ('10 seconds', '10', 'sec'),\n ('1min. 
39s', '39', 's'),\n ('30 seconds', '30', 'sec'),\n ('20 minutes', '20', 'min'),\n ('8 seconds', '8', 'sec'),\n ('less than 1 min', '1', 'min'),\n ('1 hour', '1', 'hr'),\n ('2 minutes', '2', 'min'),\n ('5 seconds', '5', 'sec'),\n ('~1 hour', '1', 'hr'),\n ('2 min', '2', 'min'),\n ('1 minute', '1', 'min'),\n ('3sec', '3', 'sec'),\n ('5 min', '5', 'min'),\n ('5 min', '5', 'min'),\n ('1 minute', '1', 'min'),\n ('4 hours', '4', 'hr'),\n ('30 seconds', '30', 'sec'),\n ('<5 minutes', '5', 'min'),\n ('1-hour', '1-', 'hr'),\n ('5 minutes', '5', 'min'),\n ('10 to 15 sec', '15', 'sec'),\n ('30 +/- min', '30', 'min'),\n ('10 minutes', '10', 'min'),\n ('45min', '45', 'min'),\n ('< 1 min', '1', 'min'),\n ('10 minutes', '10', 'min'),\n ('2 seconds', '2', 'sec'),\n ('2 hours', '2', 'hr'),\n ('15 seconds', '15', 'sec'),\n ('1 hour', '1', 'hr'),\n ('5-10 min', '5-10', 'min'),\n ('10 seconds', '10', 'sec'),\n ('1 hour', '1', 'hr'),\n ('45 secs', '45', 'sec'),\n ('60-90 sec', '60-90', 'sec'),\n ('3 hours', '3', 'hr'),\n ('5 min', '5', 'min'),\n ('30 min', '30', 'min'),\n ('4 minutes', '4', 'min'),\n ('45 minutes', '45', 'min'),\n ('3 minutes', '3', 'min'),\n ('10 seconds', '10', 'sec'),\n ('30seconds', '30', 'sec'),\n ('45 seconds', '45', 'sec'),\n ('15 seconds', '15', 'sec'),\n ('30 min', '30', 'min'),\n ('4-5 seconds', '4-5', 'sec')]" | |
}, | |
"metadata": {}, | |
"execution_count": 10 | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Here are the **\"rules\" and guiding principles** for this assignment:\n\n- The normalized duration does not have to be exactly correct, but it must be at least **within the given range**. For example:\n - If the duration is '20-30 min', acceptable answers include '20 min' and '30 min'.\n - If the duration is '1/2 hour', the only acceptable answer is '0.5 hr'.\n- When a number is not given, you should make a **\"reasonable\" substitution for the words**. For example:\n - If the duration is ' minutes', you can approximate this as '5 min'.\n - If the duration is 'couple minutes', you can approximate this as '2 min'.\n- You are not allowed to **skip any entries**. (Your list of tuples should have a length of 100.)\n- Try to use **as few substitutions as possible**, and make your regular expression **as simple as possible**.\n- Just because you don't get an error doesn't mean that your code was successful. Instead, you should **check each entry by hand** to see if it produced an acceptable result." | |
}, | |
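The "reasonable substitution" rule above can be sketched as a small word-to-number table applied in one pass. The table values and the names `WORD_NUMBERS`, `WORD_RE`, and `substitute_words` are assumptions made for illustration; any approximation that satisfies the rules would do.

```python
import re

# Assumed approximations for number words; adjust the values as you see fit.
WORD_NUMBERS = {"one": "1", "couple": "2", "few": "5", "several": "5"}

# One alternation with word boundaries, so 'one' does not match inside 'someone'.
WORD_RE = re.compile(r"\b(" + "|".join(WORD_NUMBERS) + r")\b", flags=re.IGNORECASE)


def substitute_words(entry):
    """Replace each known number word with its assumed numeric value."""
    return WORD_RE.sub(lambda m: WORD_NUMBERS[m.group(1).lower()], entry)


print(substitute_words("couple minutes"))
print(substitute_words("one hour?"))
```

Keeping the substituted text separate from the original entry makes it easy to return the unedited string in the final tuple and to check each result by hand.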
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**Bonus tasks:**\n\n- Try reading in **more than 100 rows**, and see if your code still produces the correct results.\n- When a range is specified (such as '1-2 hrs' or '10 to 15 sec'), **calculate the exact midpoint** ('1.5 hr' or '12.5 sec') to use in your normalized data." | |
}, | |
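For the midpoint bonus task, one possible approach is to capture both ends of the range and average them. The sketch below assumes ranges are written either as 'a-b' or as 'a to b'; `RANGE_RE` and `midpoint` are hypothetical names, not part of the assignment.

```python
import re

# Two numbers (whole or decimal) separated by '-' or the word 'to'.
RANGE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)")


def midpoint(text):
    """Return the midpoint of the first range in text, or None if there is no range."""
    match = RANGE_RE.search(text)
    if match is None:
        return None
    low, high = (float(g) for g in match.groups())
    return (low + high) / 2


print(midpoint("1-2 hrs"))
print(midpoint("10 to 15 sec"))
```

Entries without a second number, such as '1-hour' or '45 minutes', return `None` and can fall through to the single-number case.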
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "import numpy as np\n\ndef average(x):\n    # mean of the numbers in a '-' separated range; fractions such as '1/2' fall back to mathops\n    try:\n        l = [float(y) for y in x.split('-') if y]\n        return np.mean(l)\n    except ValueError:\n        return mathops(x)\n\ndef mathops(x):\n    # evaluate a simple 'a/b' fraction\n    l = [int(y) for y in x.split('/') if y]\n    return l[0]/l[1]\n\n\naverage('1-2')\nmathops('1/2')", | |
"execution_count": 11, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": "0.5" | |
}, | |
"metadata": {}, | |
"execution_count": 11 | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "final_list = [(x, average(y), z) for (x, y, z) in clean_durations]\nassert len(final_list) == ufo.shape[0]\nfinal_list", | |
"execution_count": 12, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": "[('45 minutes', 45.0, 'min'),\n ('1-2 hrs', 1.5, 'hr'),\n ('20 seconds', 20.0, 'sec'),\n ('1/2 hour', 0.5, 'hr'),\n ('15 minutes', 15.0, 'min'),\n ('5 minutes', 5.0, 'min'),\n ('about 3 mins', 3.0, 'min'),\n ('20 minutes', 20.0, 'min'),\n ('3 minutes', 3.0, 'min'),\n ('30 min', 30.0, 'min'),\n ('5 min', 5.0, 'min'),\n ('3 minutes', 3.0, 'min'),\n ('30 min', 30.0, 'min'),\n ('3 minutes', 3.0, 'min'),\n ('30 seconds', 30.0, 'sec'),\n ('20minutes', 20.0, 'min'),\n ('2 minutes', 2.0, 'min'),\n ('20-30 min', 25.0, 'min'),\n ('20 sec', 20.0, 'sec'),\n ('45 minutes', 45.0, 'min'),\n ('20 minutes', 20.0, 'min'),\n ('1 hr', 1.0, 'hr'),\n ('5-6 minutes', 5.5, 'min'),\n ('1 minute', 1.0, 'min'),\n ('3 seconds', 3.0, 'sec'),\n ('30 seconds', 30.0, 'sec'),\n ('approx: 30 seconds', 30.0, 'sec'),\n ('5min', 5.0, 'min'),\n ('15 minutes', 15.0, 'min'),\n ('4.5 min', 5.0, 'min'),\n ('3 minutes', 3.0, 'min'),\n ('30min', 30.0, 'min'),\n ('3 min', 3.0, 'min'),\n ('5 minutes', 5.0, 'min'),\n ('3 to 5 min', 5.0, 'min'),\n ('2min', 2.0, 'min'),\n ('1 minute', 1.0, 'min'),\n ('2 min', 2.0, 'min'),\n ('15-20 seconds', 17.5, 'sec'),\n ('10min', 10.0, 'min'),\n ('3 minutes', 3.0, 'min'),\n ('10 minutes', 10.0, 'min'),\n ('5 min', 5.0, 'min'),\n ('1 minute', 1.0, 'min'),\n ('2 sec', 2.0, 'sec'),\n ('approx 5 min', 5.0, 'min'),\n ('1 minute', 1.0, 'min'),\n ('3min', 3.0, 'min'),\n ('2 minutes', 2.0, 'min'),\n ('30 minutes', 30.0, 'min'),\n ('10 minutes', 10.0, 'min'),\n ('1 hr', 1.0, 'hr'),\n ('10 seconds', 10.0, 'sec'),\n ('1min. 
39s', 39.0, 's'),\n ('30 seconds', 30.0, 'sec'),\n ('20 minutes', 20.0, 'min'),\n ('8 seconds', 8.0, 'sec'),\n ('less than 1 min', 1.0, 'min'),\n ('1 hour', 1.0, 'hr'),\n ('2 minutes', 2.0, 'min'),\n ('5 seconds', 5.0, 'sec'),\n ('~1 hour', 1.0, 'hr'),\n ('2 min', 2.0, 'min'),\n ('1 minute', 1.0, 'min'),\n ('3sec', 3.0, 'sec'),\n ('5 min', 5.0, 'min'),\n ('5 min', 5.0, 'min'),\n ('1 minute', 1.0, 'min'),\n ('4 hours', 4.0, 'hr'),\n ('30 seconds', 30.0, 'sec'),\n ('<5 minutes', 5.0, 'min'),\n ('1-hour', 1.0, 'hr'),\n ('5 minutes', 5.0, 'min'),\n ('10 to 15 sec', 15.0, 'sec'),\n ('30 +/- min', 30.0, 'min'),\n ('10 minutes', 10.0, 'min'),\n ('45min', 45.0, 'min'),\n ('< 1 min', 1.0, 'min'),\n ('10 minutes', 10.0, 'min'),\n ('2 seconds', 2.0, 'sec'),\n ('2 hours', 2.0, 'hr'),\n ('15 seconds', 15.0, 'sec'),\n ('1 hour', 1.0, 'hr'),\n ('5-10 min', 7.5, 'min'),\n ('10 seconds', 10.0, 'sec'),\n ('1 hour', 1.0, 'hr'),\n ('45 secs', 45.0, 'sec'),\n ('60-90 sec', 75.0, 'sec'),\n ('3 hours', 3.0, 'hr'),\n ('5 min', 5.0, 'min'),\n ('30 min', 30.0, 'min'),\n ('4 minutes', 4.0, 'min'),\n ('45 minutes', 45.0, 'min'),\n ('3 minutes', 3.0, 'min'),\n ('10 seconds', 10.0, 'sec'),\n ('30seconds', 30.0, 'sec'),\n ('45 seconds', 45.0, 'sec'),\n ('15 seconds', 15.0, 'sec'),\n ('30 min', 30.0, 'min'),\n ('4-5 seconds', 4.5, 'sec')]" | |
}, | |
"metadata": {}, | |
"execution_count": 12 | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "markdown", | |
"source": "#testing\n'1-'.split('-')\nint('30 +/'.rstrip('+/'))\nint('1/2')" | |
} | |
], | |
"metadata": { | |
"language_info": { | |
"file_extension": ".py", | |
"codemirror_mode": { | |
"version": 3, | |
"name": "ipython" | |
}, | |
"mimetype": "text/x-python", | |
"version": "3.5.1", | |
"nbconvert_exporter": "python", | |
"name": "python", | |
"pygments_lexer": "ipython3" | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3", | |
"language": "python" | |
}, | |
"gist": { | |
"id": "feb8a658841e3e8946cfacd79fa286b8", | |
"data": { | |
"description": "MLtext3/submissions/04_intermediate_regex_homework-Copy1.ipynb", | |
"public": true | |
} | |
}, | |
"_draft": { | |
"nbviewer_url": "https://gist.github.com/feb8a658841e3e8946cfacd79fa286b8" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |