@phdkiran
Last active November 4, 2016 03:56
MLtext3/submissions/04_intermediate_regex_homework-Copy1.ipynb
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# Intermediate Regex Homework"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## UFO sightings\n\nThe [ufo-reports](https://github.com/planetsig/ufo-reports) GitHub repository contains reports of UFO sightings downloaded from the [National UFO Reporting Center](http://www.nuforc.org/) website. One of the data fields is the **duration of the sighting**, which includes **free-form text**. These are some example entries:\n\n- 45 minutes\n- 1-2 hrs\n- 20 seconds\n- 1/2 hour\n- about 3 mins\n- minutes\n- one hour?\n- 5min"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Here is **how to read in the file:**\n\n- Use the pandas **`read_csv()`** function to read directly from this [URL](https://raw.githubusercontent.com/planetsig/ufo-reports/master/csv-data/ufo-scrubbed-geocoded-time-standardized.csv).\n- Use the **`header=None`** parameter to specify that the data does not have a header row.\n- Use the **`nrows=100`** parameter to specify that you only want to read in the first 100 rows.\n- Save the relevant Series as a Python list, just like we did in a class exercise."
},
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "import pandas as pd\nimport re",
"execution_count": 1,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "ufo = pd.read_csv('https://raw.githubusercontent.com/planetsig/ufo-reports/master/csv-data/ufo-scrubbed-geocoded-time-standardized.csv',\n header=None, nrows=100)\nufo.head(2)",
"execution_count": 2,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": " 0 1 2 3 4 5 6 \\\n0 10/10/1949 20:30 san marcos tx us cylinder 2700 45 minutes \n1 10/10/1949 21:00 lackland afb tx NaN light 7200 1-2 hrs \n\n 7 8 9 \\\n0 This event took place in early fall around 194... 4/27/2004 29.883056 \n1 1949 Lackland AFB&#44 TX. Lights racing acros... 12/16/2005 29.384210 \n\n 10 \n0 -97.941111 \n1 -98.581082 ",
"text/html": "<div>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>0</th>\n <th>1</th>\n <th>2</th>\n <th>3</th>\n <th>4</th>\n <th>5</th>\n <th>6</th>\n <th>7</th>\n <th>8</th>\n <th>9</th>\n <th>10</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>10/10/1949 20:30</td>\n <td>san marcos</td>\n <td>tx</td>\n <td>us</td>\n <td>cylinder</td>\n <td>2700</td>\n <td>45 minutes</td>\n <td>This event took place in early fall around 194...</td>\n <td>4/27/2004</td>\n <td>29.883056</td>\n <td>-97.941111</td>\n </tr>\n <tr>\n <th>1</th>\n <td>10/10/1949 21:00</td>\n <td>lackland afb</td>\n <td>tx</td>\n <td>NaN</td>\n <td>light</td>\n <td>7200</td>\n <td>1-2 hrs</td>\n <td>1949 Lackland AFB&amp;#44 TX. Lights racing acros...</td>\n <td>12/16/2005</td>\n <td>29.384210</td>\n <td>-98.581082</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"execution_count": 2
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "#saving to a list\nd_list = ufo[6].tolist() ",
"execution_count": 3,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Your assignment is to **normalize the duration data for the first 100 rows** by splitting each entry into two parts:\n\n- The first part should be a **number**: either a whole number (such as '45') or a decimal (such as '0.5').\n- The second part should be a **unit of time**: either 'hr', 'min', or 'sec'.\n\nThe expected output is a **list of tuples**, containing the **original (unedited) string**, the **number**, and the **unit of time**. Here is what the output should look like:\n\n> `clean_durations = [('45 minutes', '45', 'min'), ('1-2 hrs', '1', 'hr'), ('20 seconds', '20', 'sec'), ...]`"
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "d_list[:3]",
"execution_count": 8,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "['45 minutes', '1-2 hrs', '20 seconds']"
},
"metadata": {},
"execution_count": 8
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "multi = r\"\"\"\n    (?P<duration>[\\d+/-]+)  # Normal or range durations\n    [+/ -]*                 # Match space or -\n    (?P<units>\\w+)$         # Units\n    \"\"\"\npattern = re.compile(multi, re.VERBOSE)\n\n# Word and unit substitutions, applied when the first match attempt fails or the unit is not recognized\nsub_dict = {'several': '30', 'few': '5', 'couple': '2' , 'one': '1', 'hour': 'hr', 'min': 'min', 'sec': 'sec', 'hr':'hr', 'or more min': 'min', '1min 39s': '99 sec'}\n\ndef sub_pattern(s):\n    for k,v in sub_dict.items():\n        pattern = k + r'(\\S)*'\n        # flags must be passed by keyword: the fourth positional argument of re.sub is count\n        s = re.sub(pattern, v, s, flags=re.I)\n#         print(pattern, s)\n#     s = re.sub(r'min(\\S)*', r'min', s, flags=re.I)\n#     s = re.sub(r'hour(\\S)*|hr(\\S)*', r'hr', s, flags=re.I)\n    return s\n    \n\nsub_pattern('mins')\n\nnomatch = []\ndef find_pattern(s):\n    unit = None\n    match = pattern.search(s)\n    if match:\n        d = match.groupdict()\n        # logic for units\n        if d['units'].startswith('min'):\n            unit = 'min'\n        elif d['units'].startswith(('hr', 'hour')):\n            unit = 'hr'\n        elif d['units'].startswith('sec'):\n            unit = 'sec'\n        else:\n            print('no known unit for: {0} in {1}'.format(d['units'], s))\n            # converted is only printed here; the tuple returned below still uses the original match\n            converted = find_pattern(sub_pattern(s))\n            print('converted to: ', converted)\n#             nomatch.append(s)\n        return (s, d['duration'], unit or d['units'])\n    else:\n        print('match failed for: ', s)\n        converted = sub_pattern(s)\n        print('redoing match on: ', converted)\n        # the tuple returned from here carries the substituted string, not the original entry\n        return find_pattern(converted)",
"execution_count": 9,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "clean_durations = [find_pattern(x) for x in d_list]\nlen(clean_durations)\n# sum([1 for x in clean_durations if x is None])\n# nomatch\nclean_durations",
"execution_count": 10,
"outputs": [
{
"output_type": "stream",
"text": "match failed for: several minutes\nredoing match on: 30 min\nmatch failed for: 5 min.\nredoing match on: 5 min\nmatch failed for: 30 min.\nredoing match on: 30 min\nmatch failed for: 20 sec.\nredoing match on: 20 sec\nmatch failed for: one hour?\nredoing match on: 1 hr\nmatch failed for: 4.5 or more min.\nredoing match on: 4.5 min\nmatch failed for: 30mins.\nredoing match on: 30min\nmatch failed for: couple minutes\nredoing match on: 2 min\nmatch failed for: few minutes\nredoing match on: 5 min\nmatch failed for: 2 sec.\nredoing match on: 2 sec\nmatch failed for: 1 hour(?)\nredoing match on: 1 hr\nno known unit for: s in 1min. 39s\nconverted to: ('99 sec', '99', 'sec')\nmatch failed for: 2 min.\nredoing match on: 2 min\nmatch failed for: 45min.\nredoing match on: 45min\nmatch failed for: 5-10 min.\nredoing match on: 5-10 min\nmatch failed for: several minutes\nredoing match on: 30 min\nmatch failed for: 30 min.\nredoing match on: 30 min\n",
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": "[('45 minutes', '45', 'min'),\n ('1-2 hrs', '1-2', 'hr'),\n ('20 seconds', '20', 'sec'),\n ('1/2 hour', '1/2', 'hr'),\n ('15 minutes', '15', 'min'),\n ('5 minutes', '5', 'min'),\n ('about 3 mins', '3', 'min'),\n ('20 minutes', '20', 'min'),\n ('3 minutes', '3', 'min'),\n ('30 min', '30', 'min'),\n ('5 min', '5', 'min'),\n ('3 minutes', '3', 'min'),\n ('30 min', '30', 'min'),\n ('3 minutes', '3', 'min'),\n ('30 seconds', '30', 'sec'),\n ('20minutes', '20', 'min'),\n ('2 minutes', '2', 'min'),\n ('20-30 min', '20-30', 'min'),\n ('20 sec', '20', 'sec'),\n ('45 minutes', '45', 'min'),\n ('20 minutes', '20', 'min'),\n ('1 hr', '1', 'hr'),\n ('5-6 minutes', '5-6', 'min'),\n ('1 minute', '1', 'min'),\n ('3 seconds', '3', 'sec'),\n ('30 seconds', '30', 'sec'),\n ('approx: 30 seconds', '30', 'sec'),\n ('5min', '5', 'min'),\n ('15 minutes', '15', 'min'),\n ('4.5 min', '5', 'min'),\n ('3 minutes', '3', 'min'),\n ('30min', '30', 'min'),\n ('3 min', '3', 'min'),\n ('5 minutes', '5', 'min'),\n ('3 to 5 min', '5', 'min'),\n ('2min', '2', 'min'),\n ('1 minute', '1', 'min'),\n ('2 min', '2', 'min'),\n ('15-20 seconds', '15-20', 'sec'),\n ('10min', '10', 'min'),\n ('3 minutes', '3', 'min'),\n ('10 minutes', '10', 'min'),\n ('5 min', '5', 'min'),\n ('1 minute', '1', 'min'),\n ('2 sec', '2', 'sec'),\n ('approx 5 min', '5', 'min'),\n ('1 minute', '1', 'min'),\n ('3min', '3', 'min'),\n ('2 minutes', '2', 'min'),\n ('30 minutes', '30', 'min'),\n ('10 minutes', '10', 'min'),\n ('1 hr', '1', 'hr'),\n ('10 seconds', '10', 'sec'),\n ('1min. 
39s', '39', 's'),\n ('30 seconds', '30', 'sec'),\n ('20 minutes', '20', 'min'),\n ('8 seconds', '8', 'sec'),\n ('less than 1 min', '1', 'min'),\n ('1 hour', '1', 'hr'),\n ('2 minutes', '2', 'min'),\n ('5 seconds', '5', 'sec'),\n ('~1 hour', '1', 'hr'),\n ('2 min', '2', 'min'),\n ('1 minute', '1', 'min'),\n ('3sec', '3', 'sec'),\n ('5 min', '5', 'min'),\n ('5 min', '5', 'min'),\n ('1 minute', '1', 'min'),\n ('4 hours', '4', 'hr'),\n ('30 seconds', '30', 'sec'),\n ('<5 minutes', '5', 'min'),\n ('1-hour', '1-', 'hr'),\n ('5 minutes', '5', 'min'),\n ('10 to 15 sec', '15', 'sec'),\n ('30 +/- min', '30', 'min'),\n ('10 minutes', '10', 'min'),\n ('45min', '45', 'min'),\n ('< 1 min', '1', 'min'),\n ('10 minutes', '10', 'min'),\n ('2 seconds', '2', 'sec'),\n ('2 hours', '2', 'hr'),\n ('15 seconds', '15', 'sec'),\n ('1 hour', '1', 'hr'),\n ('5-10 min', '5-10', 'min'),\n ('10 seconds', '10', 'sec'),\n ('1 hour', '1', 'hr'),\n ('45 secs', '45', 'sec'),\n ('60-90 sec', '60-90', 'sec'),\n ('3 hours', '3', 'hr'),\n ('5 min', '5', 'min'),\n ('30 min', '30', 'min'),\n ('4 minutes', '4', 'min'),\n ('45 minutes', '45', 'min'),\n ('3 minutes', '3', 'min'),\n ('10 seconds', '10', 'sec'),\n ('30seconds', '30', 'sec'),\n ('45 seconds', '45', 'sec'),\n ('15 seconds', '15', 'sec'),\n ('30 min', '30', 'min'),\n ('4-5 seconds', '4-5', 'sec')]"
},
"metadata": {},
"execution_count": 10
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Here are the **\"rules\" and guiding principles** for this assignment:\n\n- The normalized duration does not have to be exactly correct, but it must be at least **within the given range**. For example:\n - If the duration is '20-30 min', acceptable answers include '20 min' and '30 min'.\n - If the duration is '1/2 hour', the only acceptable answer is '0.5 hr'.\n- When a number is not given, you should make a **\"reasonable\" substitution for the words**. For example:\n - If the duration is ' minutes', you can approximate this as '5 min'.\n - If the duration is 'couple minutes', you can approximate this as '2 min'.\n- You are not allowed to **skip any entries**. (Your list of tuples should have a length of 100.)\n- Try to use **as few substitutions as possible**, and make your regular expression **as simple as possible**.\n- Just because you don't get an error doesn't mean that your code was successful. Instead, you should **check each entry by hand** to see if it produced an acceptable result."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Bonus tasks:**\n\n- Try reading in **more than 100 rows**, and see if your code still produces the correct results.\n- When a range is specified (such as '1-2 hrs' or '10 to 15 sec'), **calculate the exact midpoint** ('1.5 hr' or '12.5 sec') to use in your normalized data."
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
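"source": "import numpy as np\n\ndef average(x):\n    # Midpoint of a range such as '1-2'; a single number such as '45' or '1-' is returned as a float\n#     print(x)\n    try:\n        l = [float(y) for y in x.split('-') if y]\n        return np.mean(l)\n    except ValueError:\n        # float() fails on fractions such as '1/2', so evaluate them separately\n        return mathops(x)\n    \ndef mathops(x):\n    # Evaluate a simple fraction such as '1/2'\n    l = [int(y) for y in x.split('/') if y]\n    return l[0]/l[1]\n    \n\naverage('1-2')\nmathops('1/2')",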
"execution_count": 11,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "0.5"
},
"metadata": {},
"execution_count": 11
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "final_list = [(x, average(y), z) for (x, y, z) in clean_durations]\nassert len(final_list) == ufo.shape[0]\nfinal_list",
"execution_count": 12,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "[('45 minutes', 45.0, 'min'),\n ('1-2 hrs', 1.5, 'hr'),\n ('20 seconds', 20.0, 'sec'),\n ('1/2 hour', 0.5, 'hr'),\n ('15 minutes', 15.0, 'min'),\n ('5 minutes', 5.0, 'min'),\n ('about 3 mins', 3.0, 'min'),\n ('20 minutes', 20.0, 'min'),\n ('3 minutes', 3.0, 'min'),\n ('30 min', 30.0, 'min'),\n ('5 min', 5.0, 'min'),\n ('3 minutes', 3.0, 'min'),\n ('30 min', 30.0, 'min'),\n ('3 minutes', 3.0, 'min'),\n ('30 seconds', 30.0, 'sec'),\n ('20minutes', 20.0, 'min'),\n ('2 minutes', 2.0, 'min'),\n ('20-30 min', 25.0, 'min'),\n ('20 sec', 20.0, 'sec'),\n ('45 minutes', 45.0, 'min'),\n ('20 minutes', 20.0, 'min'),\n ('1 hr', 1.0, 'hr'),\n ('5-6 minutes', 5.5, 'min'),\n ('1 minute', 1.0, 'min'),\n ('3 seconds', 3.0, 'sec'),\n ('30 seconds', 30.0, 'sec'),\n ('approx: 30 seconds', 30.0, 'sec'),\n ('5min', 5.0, 'min'),\n ('15 minutes', 15.0, 'min'),\n ('4.5 min', 5.0, 'min'),\n ('3 minutes', 3.0, 'min'),\n ('30min', 30.0, 'min'),\n ('3 min', 3.0, 'min'),\n ('5 minutes', 5.0, 'min'),\n ('3 to 5 min', 5.0, 'min'),\n ('2min', 2.0, 'min'),\n ('1 minute', 1.0, 'min'),\n ('2 min', 2.0, 'min'),\n ('15-20 seconds', 17.5, 'sec'),\n ('10min', 10.0, 'min'),\n ('3 minutes', 3.0, 'min'),\n ('10 minutes', 10.0, 'min'),\n ('5 min', 5.0, 'min'),\n ('1 minute', 1.0, 'min'),\n ('2 sec', 2.0, 'sec'),\n ('approx 5 min', 5.0, 'min'),\n ('1 minute', 1.0, 'min'),\n ('3min', 3.0, 'min'),\n ('2 minutes', 2.0, 'min'),\n ('30 minutes', 30.0, 'min'),\n ('10 minutes', 10.0, 'min'),\n ('1 hr', 1.0, 'hr'),\n ('10 seconds', 10.0, 'sec'),\n ('1min. 
39s', 39.0, 's'),\n ('30 seconds', 30.0, 'sec'),\n ('20 minutes', 20.0, 'min'),\n ('8 seconds', 8.0, 'sec'),\n ('less than 1 min', 1.0, 'min'),\n ('1 hour', 1.0, 'hr'),\n ('2 minutes', 2.0, 'min'),\n ('5 seconds', 5.0, 'sec'),\n ('~1 hour', 1.0, 'hr'),\n ('2 min', 2.0, 'min'),\n ('1 minute', 1.0, 'min'),\n ('3sec', 3.0, 'sec'),\n ('5 min', 5.0, 'min'),\n ('5 min', 5.0, 'min'),\n ('1 minute', 1.0, 'min'),\n ('4 hours', 4.0, 'hr'),\n ('30 seconds', 30.0, 'sec'),\n ('<5 minutes', 5.0, 'min'),\n ('1-hour', 1.0, 'hr'),\n ('5 minutes', 5.0, 'min'),\n ('10 to 15 sec', 15.0, 'sec'),\n ('30 +/- min', 30.0, 'min'),\n ('10 minutes', 10.0, 'min'),\n ('45min', 45.0, 'min'),\n ('< 1 min', 1.0, 'min'),\n ('10 minutes', 10.0, 'min'),\n ('2 seconds', 2.0, 'sec'),\n ('2 hours', 2.0, 'hr'),\n ('15 seconds', 15.0, 'sec'),\n ('1 hour', 1.0, 'hr'),\n ('5-10 min', 7.5, 'min'),\n ('10 seconds', 10.0, 'sec'),\n ('1 hour', 1.0, 'hr'),\n ('45 secs', 45.0, 'sec'),\n ('60-90 sec', 75.0, 'sec'),\n ('3 hours', 3.0, 'hr'),\n ('5 min', 5.0, 'min'),\n ('30 min', 30.0, 'min'),\n ('4 minutes', 4.0, 'min'),\n ('45 minutes', 45.0, 'min'),\n ('3 minutes', 3.0, 'min'),\n ('10 seconds', 10.0, 'sec'),\n ('30seconds', 30.0, 'sec'),\n ('45 seconds', 45.0, 'sec'),\n ('15 seconds', 15.0, 'sec'),\n ('30 min', 30.0, 'min'),\n ('4-5 seconds', 4.5, 'sec')]"
},
"metadata": {},
"execution_count": 12
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "markdown",
"source": "#testing\n'1-'.split('-')\nint('30 +/'.rstrip('+/'))\nint('1/2')"
}
],
"metadata": {
"language_info": {
"file_extension": ".py",
"codemirror_mode": {
"version": 3,
"name": "ipython"
},
"mimetype": "text/x-python",
"version": "3.5.1",
"nbconvert_exporter": "python",
"name": "python",
"pygments_lexer": "ipython3"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"gist": {
"id": "feb8a658841e3e8946cfacd79fa286b8",
"data": {
"description": "MLtext3/submissions/04_intermediate_regex_homework-Copy1.ipynb",
"public": true
}
},
"_draft": {
"nbviewer_url": "https://gist.github.com/feb8a658841e3e8946cfacd79fa286b8"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
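
The assignment rules and the midpoint bonus task in the notebook above can also be handled in a single regex pass. The following is only a minimal sketch under stated assumptions, not the notebook author's method: the names `normalize_duration`, `DURATION_RE`, `WORD_RE`, `WORD_NUMBERS`, and `UNITS` are hypothetical, the word-to-number values are rough guesses, and the pattern was checked only against the example strings shown here.

```python
import re
from fractions import Fraction

# Hypothetical substitutions for number words; the values are rough guesses.
WORD_NUMBERS = {"several": "5", "few": "3", "couple": "2", "one": "1"}
WORD_RE = re.compile(r"\b(" + "|".join(WORD_NUMBERS) + r")\b", re.IGNORECASE)

# Map the start of a unit word to the normalized unit.
UNITS = {"hr": "hr", "hour": "hr", "min": "min", "sec": "sec"}

DURATION_RE = re.compile(
    r"""
    (?P<low>\d+/\d+|\d+(?:\.\d+)?)              # first number: 1/2, 45 or 4.5
    (?:\s*(?:-|to)\s*(?P<high>\d+(?:\.\d+)?))?  # optional upper end of a range
    \D*?                                        # skip filler such as spaces or 'or more'
    (?P<unit>hr|hour|min|sec)                   # start of the unit word
    """,
    re.VERBOSE | re.IGNORECASE,
)


def normalize_duration(text):
    """Return (original string, number as a string, unit) for one duration entry."""
    # Replace number words on a working copy so the original string is preserved.
    working = WORD_RE.sub(lambda m: WORD_NUMBERS[m.group(0).lower()], text)
    match = DURATION_RE.search(working)
    if match is None:
        raise ValueError("could not parse duration: {!r}".format(text))
    # Fraction accepts '45', '4.5' and '1/2' alike.
    value = float(Fraction(match.group("low")))
    if match.group("high") is not None:
        # For a range, use the exact midpoint.
        value = (value + float(match.group("high"))) / 2
    unit = UNITS[match.group("unit").lower()]
    return (text, "{:g}".format(value), unit)


examples = ["45 minutes", "1-2 hrs", "1/2 hour", "10 to 15 sec", "one hour?"]
clean = [normalize_duration(s) for s in examples]
```

Entries with no number and no recognized number word (for example a bare unit) raise `ValueError` in this sketch, so they would still need a substitution rule of their own, and each result should still be checked by hand as the assignment asks.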