Last active
October 3, 2020 20:10
-
-
Save vaclavdekanovsky/4458418101fdeca5df08db468fe887a8 to your computer and use it in GitHub Desktop.
Create groups while upsampling to calculate the average in each group.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# Upsampling with average\n", | |
| "groups and indices paramters and how to use them to upsample to average" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 1, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "import pandas as pd\n", | |
| "import io" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 2, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "data=\"\"\"\n", | |
| "CAT|DATE|VALUE\n", | |
| "abc|0101|10\n", | |
| "abc|0103|20\n", | |
| "abc|0106|15\"\"\"\n", | |
| "df = pd.read_csv(io.StringIO(data), skiprows=1, sep=\"|\", dtype={\"DATE\":\"str\"})\n", | |
| "df[\"DATE\"] = pd.to_datetime(df[\"DATE\"], format=\"%m%d\")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 3, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "expected=\"\"\"\n", | |
| "CAT|DATE|VALUE\n", | |
| "abc|0102|10\n", | |
| "abc|0103|10\n", | |
| "abc|0104|5\n", | |
| "abc|0105|5\n", | |
| "abc|0106|5\"\"\"" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "The resample method has two attributes:\n", | |
| " \n", | |
| "* `indices` - showing where the input data appeared\n", | |
| "* `groups` - indexing each datatime value with its group index.\n", | |
| "\n", | |
| "https://pandas.pydata.org/pandas-docs/stable/reference/resampling.html" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 4, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "defaultdict(list,\n", | |
| " {Timestamp('1900-01-01 00:00:00', freq='D'): [0],\n", | |
| " Timestamp('1900-01-03 00:00:00', freq='D'): [1],\n", | |
| " Timestamp('1900-01-06 00:00:00', freq='D'): [2]})" | |
| ] | |
| }, | |
| "execution_count": 4, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "# the data came on first, third and sixth of Jan\n", | |
| "df.set_index(\"DATE\")\\\n", | |
| ".resample(\"D\").indices" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 5, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "{Timestamp('1900-01-01 00:00:00', freq='D'): 1,\n", | |
| " Timestamp('1900-01-02 00:00:00', freq='D'): 1,\n", | |
| " Timestamp('1900-01-03 00:00:00', freq='D'): 2,\n", | |
| " Timestamp('1900-01-04 00:00:00', freq='D'): 2,\n", | |
| " Timestamp('1900-01-05 00:00:00', freq='D'): 2,\n", | |
| " Timestamp('1900-01-06 00:00:00', freq='D'): 3}" | |
| ] | |
| }, | |
| "execution_count": 5, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "# the groups are \n", | |
| "\"\"\" \n", | |
| " Jan-1 Group 1\n", | |
| " Jan-2 Group 1\n", | |
| " Jan-3 Group 2\n", | |
| " Jan-4 Group 2\n", | |
| " Jan-5 Group 2\n", | |
| " Jan-6 Group 3\n", | |
| "\"\"\"\n", | |
| "df.set_index(\"DATE\")\\\n", | |
| ".resample(\"D\").groups" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "In reality our data from Jan-1 cover unknown period of time, the value from Jan-3 covers the third and second and the last value covers fourth, fifth and sixth of Jan. This is achived by `label` parameter of the resample method.\n" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 6, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "{Timestamp('1900-01-02 00:00:00', freq='D'): 1,\n", | |
| " Timestamp('1900-01-03 00:00:00', freq='D'): 1,\n", | |
| " Timestamp('1900-01-04 00:00:00', freq='D'): 2,\n", | |
| " Timestamp('1900-01-05 00:00:00', freq='D'): 2,\n", | |
| " Timestamp('1900-01-06 00:00:00', freq='D'): 2,\n", | |
| " Timestamp('1900-01-07 00:00:00', freq='D'): 3}" | |
| ] | |
| }, | |
| "execution_count": 6, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "df.set_index(\"DATE\").resample(\"D\", label=\"right\").groups" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Alternatively, we can turn the dictionary returned by the `.groups` parameter into a dataframe and shift the group ids by 1 (row) so that data match our scenario." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 7, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<div>\n", | |
| "<style scoped>\n", | |
| " .dataframe tbody tr th:only-of-type {\n", | |
| " vertical-align: middle;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe tbody tr th {\n", | |
| " vertical-align: top;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe thead th {\n", | |
| " text-align: right;\n", | |
| " }\n", | |
| "</style>\n", | |
| "<table border=\"1\" class=\"dataframe\">\n", | |
| " <thead>\n", | |
| " <tr style=\"text-align: right;\">\n", | |
| " <th></th>\n", | |
| " <th>group</th>\n", | |
| " </tr>\n", | |
| " </thead>\n", | |
| " <tbody>\n", | |
| " <tr>\n", | |
| " <th>1900-01-01</th>\n", | |
| " <td>NaN</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>1900-01-02</th>\n", | |
| " <td>1.0</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>1900-01-03</th>\n", | |
| " <td>1.0</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>1900-01-04</th>\n", | |
| " <td>2.0</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>1900-01-05</th>\n", | |
| " <td>2.0</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>1900-01-06</th>\n", | |
| " <td>2.0</td>\n", | |
| " </tr>\n", | |
| " </tbody>\n", | |
| "</table>\n", | |
| "</div>" | |
| ], | |
| "text/plain": [ | |
| " group\n", | |
| "1900-01-01 NaN\n", | |
| "1900-01-02 1.0\n", | |
| "1900-01-03 1.0\n", | |
| "1900-01-04 2.0\n", | |
| "1900-01-05 2.0\n", | |
| "1900-01-06 2.0" | |
| ] | |
| }, | |
| "execution_count": 7, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "# Calculate the number of hours in each group\n", | |
| "resampled_groups = df.set_index(\"DATE\").resample(\"D\").groups\n", | |
| "df_groups = pd.DataFrame(resampled_groups, index=[\"group\"]).T.shift(1)\n", | |
| "df_groups" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Knowing which dates belong to each group, let us calculate the number of occurences (days) between each measurement. " | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 8, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "group\n", | |
| "1.0 2\n", | |
| "2.0 3\n", | |
| "Name: count, dtype: int64" | |
| ] | |
| }, | |
| "execution_count": 8, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "s = df_groups.groupby(\"group\").size()\n", | |
| "s.name = \"count\"\n", | |
| "s" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "By joining the counts back to the groups we will see how many days were covered each date so that we can calculate the average - measured value at the end of the period/period lenght. " | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 9, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "1900-01-01 NaN\n", | |
| "1900-01-02 2.0\n", | |
| "1900-01-03 2.0\n", | |
| "1900-01-04 3.0\n", | |
| "1900-01-05 3.0\n", | |
| "1900-01-06 3.0\n", | |
| "Name: count, dtype: float64" | |
| ] | |
| }, | |
| "execution_count": 9, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "s_counts = df_groups.join(s, on=\"group\")[\"count\"]\n", | |
| "s_counts" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Mering this back to the original dataframe, with backfilled sum allow us to calculate the average. " | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 10, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<div>\n", | |
| "<style scoped>\n", | |
| " .dataframe tbody tr th:only-of-type {\n", | |
| " vertical-align: middle;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe tbody tr th {\n", | |
| " vertical-align: top;\n", | |
| " }\n", | |
| "\n", | |
| " .dataframe thead th {\n", | |
| " text-align: right;\n", | |
| " }\n", | |
| "</style>\n", | |
| "<table border=\"1\" class=\"dataframe\">\n", | |
| " <thead>\n", | |
| " <tr style=\"text-align: right;\">\n", | |
| " <th></th>\n", | |
| " <th>CAT</th>\n", | |
| " <th>VALUE</th>\n", | |
| " <th>count</th>\n", | |
| " <th>average</th>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>DATE</th>\n", | |
| " <th></th>\n", | |
| " <th></th>\n", | |
| " <th></th>\n", | |
| " <th></th>\n", | |
| " </tr>\n", | |
| " </thead>\n", | |
| " <tbody>\n", | |
| " <tr>\n", | |
| " <th>1900-01-01</th>\n", | |
| " <td>abc</td>\n", | |
| " <td>10</td>\n", | |
| " <td>NaN</td>\n", | |
| " <td>NaN</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>1900-01-02</th>\n", | |
| " <td>abc</td>\n", | |
| " <td>20</td>\n", | |
| " <td>2.0</td>\n", | |
| " <td>10.0</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>1900-01-03</th>\n", | |
| " <td>abc</td>\n", | |
| " <td>20</td>\n", | |
| " <td>2.0</td>\n", | |
| " <td>10.0</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>1900-01-04</th>\n", | |
| " <td>abc</td>\n", | |
| " <td>15</td>\n", | |
| " <td>3.0</td>\n", | |
| " <td>5.0</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>1900-01-05</th>\n", | |
| " <td>abc</td>\n", | |
| " <td>15</td>\n", | |
| " <td>3.0</td>\n", | |
| " <td>5.0</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <th>1900-01-06</th>\n", | |
| " <td>abc</td>\n", | |
| " <td>15</td>\n", | |
| " <td>3.0</td>\n", | |
| " <td>5.0</td>\n", | |
| " </tr>\n", | |
| " </tbody>\n", | |
| "</table>\n", | |
| "</div>" | |
| ], | |
| "text/plain": [ | |
| " CAT VALUE count average\n", | |
| "DATE \n", | |
| "1900-01-01 abc 10 NaN NaN\n", | |
| "1900-01-02 abc 20 2.0 10.0\n", | |
| "1900-01-03 abc 20 2.0 10.0\n", | |
| "1900-01-04 abc 15 3.0 5.0\n", | |
| "1900-01-05 abc 15 3.0 5.0\n", | |
| "1900-01-06 abc 15 3.0 5.0" | |
| ] | |
| }, | |
| "execution_count": 10, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "res = df.set_index(\"DATE\").resample(\"D\").bfill()\n", | |
| "res = res.join(s_counts)\n", | |
| "res[\"average\"] = res[\"VALUE\"]/res[\"count\"]\n", | |
| "res" | |
| ] | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "display_name": "Python 3", | |
| "language": "python", | |
| "name": "python3" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython3", | |
| "version": "3.8.3" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 4 | |
| } |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment