phdkiran · November 4, 2016 03:56
diff --git a/04_intermediate_regex_homework-Copy1.ipynb b/04_intermediate_regex_homework-Copy1.ipynb
 {
  "cells": [
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "# Intermediate Regex Homework"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "## UFO sightings\n\nThe [ufo-reports](https://github.com/planetsig/ufo-reports) GitHub repository contains reports of UFO sightings downloaded from the [National UFO Reporting Center](http://www.nuforc.org/) website. One of the data fields is the **duration of the sighting**, which includes **free-form text**. These are some example entries:\n\n- 45 minutes\n- 1-2 hrs\n- 20 seconds\n- 1/2 hour\n- about 3 mins\n-  minutes\n- one hour?\n- 5min"
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Here is **how to read in the file:**\n\n- Use the pandas **`read_csv()`** function to read directly from this [URL](https://raw.githubusercontent.com/planetsig/ufo-reports/master/csv-data/ufo-scrubbed-geocoded-time-standardized.csv).\n- Use the **`header=None`** parameter to specify that the data does not have a header row.\n- Use the **`nrows=100`** parameter to specify that you only want to read in the first 100 rows.\n- Save the relevant Series as a Python list, just like we did in a class exercise."
    },
    {
      "metadata": {
        "trusted": true,
        "collapsed": true
      },
      "cell_type": "code",
      "source": "import pandas as pd\nimport re",
      "execution_count": 1,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "collapsed": false
      },
      "cell_type": "code",
      "source": "ufo = pd.read_csv('https://raw.githubusercontent.com/planetsig/ufo-reports/master/csv-data/ufo-scrubbed-geocoded-time-standardized.csv',\n                 header=None, nrows=100)\nufo.head(2)",
      "execution_count": 2,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": "                 0             1   2    3         4     5           6   \\\n0  10/10/1949 20:30    san marcos  tx   us  cylinder  2700  45 minutes   \n1  10/10/1949 21:00  lackland afb  tx  NaN     light  7200     1-2 hrs   \n\n                                                  7           8          9   \\\n0  This event took place in early fall around 194...   4/27/2004  29.883056   \n1  1949 Lackland AFB&#44 TX.  Lights racing acros...  12/16/2005  29.384210   \n\n          10  \n0 -97.941111  \n1 -98.581082  ",
            "text/html": "<div>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>0</th>\n      <th>1</th>\n      <th>2</th>\n      <th>3</th>\n      <th>4</th>\n      <th>5</th>\n      <th>6</th>\n      <th>7</th>\n      <th>8</th>\n      <th>9</th>\n      <th>10</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>10/10/1949 20:30</td>\n      <td>san marcos</td>\n      <td>tx</td>\n      <td>us</td>\n      <td>cylinder</td>\n      <td>2700</td>\n      <td>45 minutes</td>\n      <td>This event took place in early fall around 194...</td>\n      <td>4/27/2004</td>\n      <td>29.883056</td>\n      <td>-97.941111</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>10/10/1949 21:00</td>\n      <td>lackland afb</td>\n      <td>tx</td>\n      <td>NaN</td>\n      <td>light</td>\n      <td>7200</td>\n      <td>1-2 hrs</td>\n      <td>1949 Lackland AFB&amp;#44 TX.  Lights racing acros...</td>\n      <td>12/16/2005</td>\n      <td>29.384210</td>\n      <td>-98.581082</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
          },
          "metadata": {},
          "execution_count": 2
        }
      ]
    },
    {
      "metadata": {
        "trusted": true,
        "collapsed": false
      },
      "cell_type": "code",
      "source": "#saving to a list\nd_list = ufo[6].tolist() ",
      "execution_count": 3,
      "outputs": []
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Your assignment is to **normalize the duration data for the first 100 rows** by splitting each entry into two parts:\n\n- The first part should be a **number**: either a whole number (such as '45') or a decimal (such as '0.5').\n- The second part should be a **unit of time**: either 'hr' or 'min' or 'sec'\n\nThe expected output is a **list of tuples**, containing the **original (unedited) string**, the **number**, and the **unit of time**. Here is what the output should look like:\n\n> `clean_durations = [('45 minutes', '45', 'min'), ('1-2 hrs', '1', 'hr'), ('20 seconds', '20', 'sec'), ...]`"
    },
    {
      "metadata": {
        "trusted": true,
        "collapsed": false
      },
      "cell_type": "code",
      "source": "d_list[:3]",
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": "['45 minutes', '1-2 hrs', '20 seconds']"
          },
          "metadata": {},
          "execution_count": 8
        }
      ]
    },
    {
      "metadata": {
        "trusted": true,
        "collapsed": false
      },
      "cell_type": "code",
      "source": "# Verbose pattern: a number (possibly a range or fraction), optional filler, then a unit word at the end.\nmulti = r\"\"\"\n    (?P<duration>[\\d+/-]+)  # digits, optionally joined by '+', '/' or '-'\n    [+/ -]*                 # optional filler: space, '+', '/' or '-'\n    (?P<units>\\w+)$         # unit word, anchored at the end of the string\n    \"\"\"\npattern = re.compile(multi, re.VERBOSE)\n\n# Number words and unit spellings to substitute before retrying a failed match.\nsub_dict = {'several': '30', 'few': '5', 'couple': '2' , 'one': '1', 'hour': 'hr', 'min': 'min', 'sec': 'sec', 'hr':'hr', 'or more min': 'min', '1min 39s': '99 sec'}\n\ndef sub_pattern(s):\n    for k, v in sub_dict.items():\n        # Replace the key plus any trailing non-space characters (e.g. 'mins.' -> 'min').\n        # flags must be passed by keyword: the fourth positional argument of re.sub is count.\n        s = re.sub(k + r'\\S*', v, s, flags=re.I)\n    return s\n\n\nsub_pattern('mins')  # quick check\n\nnomatch = []\ndef find_pattern(s):\n    unit = None\n    match = pattern.search(s)\n    if match:\n        d = match.groupdict()\n        # Map the matched unit word to 'min', 'hr' or 'sec'.\n        if d['units'].startswith('min'):\n            unit = 'min'\n        elif d['units'].startswith(('hr', 'hour')):\n            unit = 'hr'\n        elif d['units'].startswith('sec'):\n            unit = 'sec'\n        else:\n            # Unknown unit: the converted result is only printed;\n            # the tuple returned below still carries the unrecognized unit.\n            print('no known unit for: {0} in {1}'.format(d['units'], s))\n            converted = find_pattern(sub_pattern(s))\n            print('converted to: ', converted)\n        return (s, d['duration'], unit or d['units'])\n    else:\n        # No match: substitute and retry. The retried tuple holds the substituted string,\n        # and an input that stays unchanged and unmatched recurses until RecursionError.\n        print('match failed for: ', s)\n        converted = sub_pattern(s)\n        print('redoing match on: ', converted)\n        return find_pattern(converted)",
      "execution_count": 9,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "collapsed": true
      },
      "cell_type": "code",
      "source": "",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true,
        "collapsed": false
      },
      "cell_type": "code",
      "source": "clean_durations = [find_pattern(x) for x in d_list]\nlen(clean_durations)\n# sum([1 for x in clean_durations if x is None])\n# nomatch\nclean_durations",
      "execution_count": 10,
      "outputs": [
        {
          "output_type": "stream",
          "text": "match failed for:  several minutes\nredoing match on:  30 min\nmatch failed for:  5 min.\nredoing match on:  5 min\nmatch failed for:  30 min.\nredoing match on:  30 min\nmatch failed for:  20 sec.\nredoing match on:  20 sec\nmatch failed for:  one hour?\nredoing match on:  1 hr\nmatch failed for:  4.5 or more min.\nredoing match on:  4.5 min\nmatch failed for:  30mins.\nredoing match on:  30min\nmatch failed for:  couple minutes\nredoing match on:  2 min\nmatch failed for:  few minutes\nredoing match on:  5 min\nmatch failed for:  2 sec.\nredoing match on:  2 sec\nmatch failed for:  1 hour(?)\nredoing match on:  1 hr\nno known unit for: s in 1min. 39s\nconverted to:  ('99 sec', '99', 'sec')\nmatch failed for:  2 min.\nredoing match on:  2 min\nmatch failed for:  45min.\nredoing match on:  45min\nmatch failed for:  5-10 min.\nredoing match on:  5-10 min\nmatch failed for:  several minutes\nredoing match on:  30 min\nmatch failed for:  30 min.\nredoing match on:  30 min\n",
          "name": "stdout"
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": "[('45 minutes', '45', 'min'),\n ('1-2 hrs', '1-2', 'hr'),\n ('20 seconds', '20', 'sec'),\n ('1/2 hour', '1/2', 'hr'),\n ('15 minutes', '15', 'min'),\n ('5 minutes', '5', 'min'),\n ('about 3 mins', '3', 'min'),\n ('20 minutes', '20', 'min'),\n ('3  minutes', '3', 'min'),\n ('30 min', '30', 'min'),\n ('5 min', '5', 'min'),\n ('3 minutes', '3', 'min'),\n ('30 min', '30', 'min'),\n ('3 minutes', '3', 'min'),\n ('30 seconds', '30', 'sec'),\n ('20minutes', '20', 'min'),\n ('2 minutes', '2', 'min'),\n ('20-30 min', '20-30', 'min'),\n ('20 sec', '20', 'sec'),\n ('45 minutes', '45', 'min'),\n ('20 minutes', '20', 'min'),\n ('1 hr', '1', 'hr'),\n ('5-6 minutes', '5-6', 'min'),\n ('1 minute', '1', 'min'),\n ('3 seconds', '3', 'sec'),\n ('30 seconds', '30', 'sec'),\n ('approx: 30 seconds', '30', 'sec'),\n ('5min', '5', 'min'),\n ('15 minutes', '15', 'min'),\n ('4.5 min', '5', 'min'),\n ('3 minutes', '3', 'min'),\n ('30min', '30', 'min'),\n ('3 min', '3', 'min'),\n ('5 minutes', '5', 'min'),\n ('3 to 5 min', '5', 'min'),\n ('2min', '2', 'min'),\n ('1 minute', '1', 'min'),\n ('2 min', '2', 'min'),\n ('15-20 seconds', '15-20', 'sec'),\n ('10min', '10', 'min'),\n ('3 minutes', '3', 'min'),\n ('10 minutes', '10', 'min'),\n ('5 min', '5', 'min'),\n ('1 minute', '1', 'min'),\n ('2 sec', '2', 'sec'),\n ('approx 5 min', '5', 'min'),\n ('1 minute', '1', 'min'),\n ('3min', '3', 'min'),\n ('2 minutes', '2', 'min'),\n ('30 minutes', '30', 'min'),\n ('10 minutes', '10', 'min'),\n ('1 hr', '1', 'hr'),\n ('10 seconds', '10', 'sec'),\n ('1min. 39s', '39', 's'),\n ('30 seconds', '30', 'sec'),\n ('20 minutes', '20', 'min'),\n ('8 seconds', '8', 'sec'),\n ('less than 1 min', '1', 'min'),\n ('1 hour', '1', 'hr'),\n ('2 minutes', '2', 'min'),\n ('5 seconds', '5', 'sec'),\n ('~1 hour', '1', 'hr'),\n ('2 min', '2', 'min'),\n ('1 minute', '1', 'min'),\n ('3sec', '3', 'sec'),\n ('5 min', '5', 'min'),\n ('5 min', '5', 'min'),\n ('1 minute', '1', 'min'),\n ('4 hours', '4', 'hr'),\n ('30 seconds', '30', 'sec'),\n ('<5 minutes', '5', 'min'),\n ('1-hour', '1-', 'hr'),\n ('5 minutes', '5', 'min'),\n ('10 to 15 sec', '15', 'sec'),\n ('30 +/- min', '30', 'min'),\n ('10 minutes', '10', 'min'),\n ('45min', '45', 'min'),\n ('< 1 min', '1', 'min'),\n ('10 minutes', '10', 'min'),\n ('2 seconds', '2', 'sec'),\n ('2 hours', '2', 'hr'),\n ('15 seconds', '15', 'sec'),\n ('1 hour', '1', 'hr'),\n ('5-10 min', '5-10', 'min'),\n ('10 seconds', '10', 'sec'),\n ('1 hour', '1', 'hr'),\n ('45 secs', '45', 'sec'),\n ('60-90 sec', '60-90', 'sec'),\n ('3 hours', '3', 'hr'),\n ('5 min', '5', 'min'),\n ('30 min', '30', 'min'),\n ('4 minutes', '4', 'min'),\n ('45 minutes', '45', 'min'),\n ('3 minutes', '3', 'min'),\n ('10 seconds', '10', 'sec'),\n ('30seconds', '30', 'sec'),\n ('45 seconds', '45', 'sec'),\n ('15 seconds', '15', 'sec'),\n ('30 min', '30', 'min'),\n ('4-5 seconds', '4-5', 'sec')]"
          },
          "metadata": {},
          "execution_count": 10
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Here are the **\"rules\" and guiding principles** for this assignment:\n\n- The normalized duration does not have to be exactly correct, but it must be at least **within the given range**. For example:\n    - If the duration is '20-30 min', acceptable answers include '20 min' and '30 min'.\n    - If the duration is '1/2 hour', the only acceptable answer is '0.5 hr'.\n- When a number is not given, you should make a **\"reasonable\" substitution for the words**. For example:\n    - If the duration is ' minutes', you can approximate this as '5 min'.\n    - If the duration is 'couple minutes', you can approximate this as '2 min'.\n- You are not allowed to **skip any entries**. (Your list of tuples should have a length of 100.)\n- Try to use **as few substitutions as possible**, and make your regular expression **as simple as possible**.\n- Just because you don't get an error doesn't mean that your code was successful. Instead, you should **check each entry by hand** to see if it produced an acceptable result."
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "**Bonus tasks:**\n\n- Try reading in **more than 100 rows**, and see if your code still produces the correct results.\n- When a range is specified (such as '1-2 hrs' or '10 to 15 sec'), **calculate the exact midpoint** ('1.5 hr' or '12.5 sec') to use in your normalized data."
    },
    {
      "metadata": {
        "trusted": true,
        "collapsed": false
      },
      "cell_type": "code",
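      "source": "import numpy as np\n\ndef average(x):\n    # Midpoint of a range such as '1-2'; a single number is returned as a float.\n    try:\n        # Empty pieces (e.g. from '1-') are dropped before averaging.\n        l = [float(y) for y in x.split('-') if y]\n        return np.mean(l)\n    except ValueError:\n        # float() cannot parse a fraction such as '1/2', so evaluate it separately.\n        return mathops(x)\n\ndef mathops(x):\n    # Evaluate a simple fraction such as '1/2' (true division in Python 3).\n    l = [int(y) for y in x.split('/') if y]\n    return l[0]/l[1]\n\n\naverage('1-2')\nmathops('1/2')",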
      "execution_count": 11,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": "0.5"
          },
          "metadata": {},
          "execution_count": 11
        }
      ]
    },
    {
      "metadata": {
        "trusted": true,
        "collapsed": false
      },
      "cell_type": "code",
      "source": "final_list = [(x, average(y), z) for (x, y, z) in clean_durations]\nassert len(final_list) == ufo.shape[0]\nfinal_list",
      "execution_count": 12,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": "[('45 minutes', 45.0, 'min'),\n ('1-2 hrs', 1.5, 'hr'),\n ('20 seconds', 20.0, 'sec'),\n ('1/2 hour', 0.5, 'hr'),\n ('15 minutes', 15.0, 'min'),\n ('5 minutes', 5.0, 'min'),\n ('about 3 mins', 3.0, 'min'),\n ('20 minutes', 20.0, 'min'),\n ('3  minutes', 3.0, 'min'),\n ('30 min', 30.0, 'min'),\n ('5 min', 5.0, 'min'),\n ('3 minutes', 3.0, 'min'),\n ('30 min', 30.0, 'min'),\n ('3 minutes', 3.0, 'min'),\n ('30 seconds', 30.0, 'sec'),\n ('20minutes', 20.0, 'min'),\n ('2 minutes', 2.0, 'min'),\n ('20-30 min', 25.0, 'min'),\n ('20 sec', 20.0, 'sec'),\n ('45 minutes', 45.0, 'min'),\n ('20 minutes', 20.0, 'min'),\n ('1 hr', 1.0, 'hr'),\n ('5-6 minutes', 5.5, 'min'),\n ('1 minute', 1.0, 'min'),\n ('3 seconds', 3.0, 'sec'),\n ('30 seconds', 30.0, 'sec'),\n ('approx: 30 seconds', 30.0, 'sec'),\n ('5min', 5.0, 'min'),\n ('15 minutes', 15.0, 'min'),\n ('4.5 min', 5.0, 'min'),\n ('3 minutes', 3.0, 'min'),\n ('30min', 30.0, 'min'),\n ('3 min', 3.0, 'min'),\n ('5 minutes', 5.0, 'min'),\n ('3 to 5 min', 5.0, 'min'),\n ('2min', 2.0, 'min'),\n ('1 minute', 1.0, 'min'),\n ('2 min', 2.0, 'min'),\n ('15-20 seconds', 17.5, 'sec'),\n ('10min', 10.0, 'min'),\n ('3 minutes', 3.0, 'min'),\n ('10 minutes', 10.0, 'min'),\n ('5 min', 5.0, 'min'),\n ('1 minute', 1.0, 'min'),\n ('2 sec', 2.0, 'sec'),\n ('approx 5 min', 5.0, 'min'),\n ('1 minute', 1.0, 'min'),\n ('3min', 3.0, 'min'),\n ('2 minutes', 2.0, 'min'),\n ('30 minutes', 30.0, 'min'),\n ('10 minutes', 10.0, 'min'),\n ('1 hr', 1.0, 'hr'),\n ('10 seconds', 10.0, 'sec'),\n ('1min. 39s', 39.0, 's'),\n ('30 seconds', 30.0, 'sec'),\n ('20 minutes', 20.0, 'min'),\n ('8 seconds', 8.0, 'sec'),\n ('less than 1 min', 1.0, 'min'),\n ('1 hour', 1.0, 'hr'),\n ('2 minutes', 2.0, 'min'),\n ('5 seconds', 5.0, 'sec'),\n ('~1 hour', 1.0, 'hr'),\n ('2 min', 2.0, 'min'),\n ('1 minute', 1.0, 'min'),\n ('3sec', 3.0, 'sec'),\n ('5 min', 5.0, 'min'),\n ('5 min', 5.0, 'min'),\n ('1 minute', 1.0, 'min'),\n ('4 hours', 4.0, 'hr'),\n ('30 seconds', 30.0, 'sec'),\n ('<5 minutes', 5.0, 'min'),\n ('1-hour', 1.0, 'hr'),\n ('5 minutes', 5.0, 'min'),\n ('10 to 15 sec', 15.0, 'sec'),\n ('30 +/- min', 30.0, 'min'),\n ('10 minutes', 10.0, 'min'),\n ('45min', 45.0, 'min'),\n ('< 1 min', 1.0, 'min'),\n ('10 minutes', 10.0, 'min'),\n ('2 seconds', 2.0, 'sec'),\n ('2 hours', 2.0, 'hr'),\n ('15 seconds', 15.0, 'sec'),\n ('1 hour', 1.0, 'hr'),\n ('5-10 min', 7.5, 'min'),\n ('10 seconds', 10.0, 'sec'),\n ('1 hour', 1.0, 'hr'),\n ('45 secs', 45.0, 'sec'),\n ('60-90 sec', 75.0, 'sec'),\n ('3 hours', 3.0, 'hr'),\n ('5 min', 5.0, 'min'),\n ('30 min', 30.0, 'min'),\n ('4 minutes', 4.0, 'min'),\n ('45 minutes', 45.0, 'min'),\n ('3 minutes', 3.0, 'min'),\n ('10 seconds', 10.0, 'sec'),\n ('30seconds', 30.0, 'sec'),\n ('45 seconds', 45.0, 'sec'),\n ('15 seconds', 15.0, 'sec'),\n ('30 min', 30.0, 'min'),\n ('4-5 seconds', 4.5, 'sec')]"
          },
          "metadata": {},
          "execution_count": 12
        }
      ]
    },
    {
      "metadata": {
        "trusted": true,
        "collapsed": false
      },
      "cell_type": "markdown",
      "source": "#testing\n'1-'.split('-')\nint('30 +/'.rstrip('+/'))\nint('1/2')"
    }
  ],
  "metadata": {
    "language_info": {
      "file_extension": ".py",
      "codemirror_mode": {
        "version": 3,
        "name": "ipython"
      },
      "mimetype": "text/x-python",
      "version": "3.5.1",
      "nbconvert_exporter": "python",
      "name": "python",
      "pygments_lexer": "ipython3"
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "gist": {
      "id": "feb8a658841e3e8946cfacd79fa286b8",
      "data": {
        "description": "MLtext3/submissions/04_intermediate_regex_homework-Copy1.ipynb",
        "public": true
      }
    },
    "_draft": {
      "nbviewer_url": "https://gist.github.com/feb8a658841e3e8946cfacd79fa286b8"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
 }
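A possible variation on the approach in the notebook above, offered only as a minimal sketch and not as the submitted solution: a single function that keeps the original string in the tuple, reads ranges such as '10 to 15 sec' or '1-2 hrs' as a midpoint, and evaluates fractions such as '1/2'. The names `WORD_NUMBERS`, `UNITS`, `DURATION_RE` and `normalize_duration` are hypothetical, and the word-to-number values are illustrative guesses rather than anything taken from the assignment.

```python
import re
from fractions import Fraction

# Hypothetical substitutions for number words; the values are illustrative guesses.
WORD_NUMBERS = {"several": "5", "few": "3", "couple": "2", "one": "1"}

# Unit spellings mapped to the three normalized units.
UNITS = {"hr": "hr", "hour": "hr", "min": "min", "sec": "sec"}

# A number is an integer ("45"), a decimal ("4.5"), or a fraction ("1/2").
NUMBER = r"\d+(?:\.\d+)?(?:/\d+)?"

DURATION_RE = re.compile(
    r"""
    (?P<low>{num})                      # first number
    (?:\s*(?:-|to)\s*(?P<high>{num}))?  # optional upper end of a range
    .*?                                 # filler such as ' +/- ' or ' or more '
    (?P<unit>hr|hour|min|sec)           # start of the unit word
    """.format(num=NUMBER),
    re.VERBOSE,
)


def normalize_duration(text):
    """Return (original string, number, unit), or None if nothing matched."""
    cleaned = text.lower()
    for word, digits in WORD_NUMBERS.items():
        cleaned = re.sub(r"\b{}\b".format(word), digits, cleaned)
    match = DURATION_RE.search(cleaned)
    if match is None:
        return None
    low = Fraction(match.group("low"))
    # Without a range, the upper end equals the lower end.
    high = Fraction(match.group("high") or match.group("low"))
    midpoint = float((low + high) / 2)
    return (text, midpoint, UNITS[match.group("unit")])


for entry in ["45 minutes", "1-2 hrs", "1/2 hour", "10 to 15 sec", "one hour?"]:
    print(normalize_duration(entry))
```

Returning `None` instead of retrying recursively leaves it to the caller to decide what to do with entries that do not match, such as an entry with no recognizable unit.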