Skip to content

Instantly share code, notes, and snippets.

@vaclavdekanovsky
Last active October 3, 2020 20:10
Show Gist options
  • Select an option

  • Save vaclavdekanovsky/4458418101fdeca5df08db468fe887a8 to your computer and use it in GitHub Desktop.

Select an option

Save vaclavdekanovsky/4458418101fdeca5df08db468fe887a8 to your computer and use it in GitHub Desktop.
Create groups while upsampling to calculate the average in each group.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Upsampling with average\n",
"groups and indices paramters and how to use them to upsample to average"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import io"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"data=\"\"\"\n",
"CAT|DATE|VALUE\n",
"abc|0101|10\n",
"abc|0103|20\n",
"abc|0106|15\"\"\"\n",
"df = pd.read_csv(io.StringIO(data), skiprows=1, sep=\"|\", dtype={\"DATE\":\"str\"})\n",
"df[\"DATE\"] = pd.to_datetime(df[\"DATE\"], format=\"%m%d\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"expected=\"\"\"\n",
"CAT|DATE|VALUE\n",
"abc|0102|10\n",
"abc|0103|10\n",
"abc|0104|5\n",
"abc|0105|5\n",
"abc|0106|5\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The resample method has two attributes:\n",
" \n",
"* `indices` - showing where the input data appeared\n",
"* `groups` - indexing each datatime value with its group index.\n",
"\n",
"https://pandas.pydata.org/pandas-docs/stable/reference/resampling.html"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(list,\n",
" {Timestamp('1900-01-01 00:00:00', freq='D'): [0],\n",
" Timestamp('1900-01-03 00:00:00', freq='D'): [1],\n",
" Timestamp('1900-01-06 00:00:00', freq='D'): [2]})"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# the data came on first, third and sixth of Jan\n",
"df.set_index(\"DATE\")\\\n",
".resample(\"D\").indices"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{Timestamp('1900-01-01 00:00:00', freq='D'): 1,\n",
" Timestamp('1900-01-02 00:00:00', freq='D'): 1,\n",
" Timestamp('1900-01-03 00:00:00', freq='D'): 2,\n",
" Timestamp('1900-01-04 00:00:00', freq='D'): 2,\n",
" Timestamp('1900-01-05 00:00:00', freq='D'): 2,\n",
" Timestamp('1900-01-06 00:00:00', freq='D'): 3}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# the groups are \n",
"\"\"\" \n",
" Jan-1 Group 1\n",
" Jan-2 Group 1\n",
" Jan-3 Group 2\n",
" Jan-4 Group 2\n",
" Jan-5 Group 2\n",
" Jan-6 Group 3\n",
"\"\"\"\n",
"df.set_index(\"DATE\")\\\n",
".resample(\"D\").groups"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In reality our data from Jan-1 cover unknown period of time, the value from Jan-3 covers the third and second and the last value covers fourth, fifth and sixth of Jan. This is achived by `label` parameter of the resample method.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{Timestamp('1900-01-02 00:00:00', freq='D'): 1,\n",
" Timestamp('1900-01-03 00:00:00', freq='D'): 1,\n",
" Timestamp('1900-01-04 00:00:00', freq='D'): 2,\n",
" Timestamp('1900-01-05 00:00:00', freq='D'): 2,\n",
" Timestamp('1900-01-06 00:00:00', freq='D'): 2,\n",
" Timestamp('1900-01-07 00:00:00', freq='D'): 3}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.set_index(\"DATE\").resample(\"D\", label=\"right\").groups"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, we can turn the dictionary returned by the `.groups` parameter into a dataframe and shift the group ids by 1 (row) so that data match our scenario."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>group</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1900-01-01</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1900-01-02</th>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1900-01-03</th>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1900-01-04</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1900-01-05</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1900-01-06</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" group\n",
"1900-01-01 NaN\n",
"1900-01-02 1.0\n",
"1900-01-03 1.0\n",
"1900-01-04 2.0\n",
"1900-01-05 2.0\n",
"1900-01-06 2.0"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Calculate the number of hours in each group\n",
"resampled_groups = df.set_index(\"DATE\").resample(\"D\").groups\n",
"df_groups = pd.DataFrame(resampled_groups, index=[\"group\"]).T.shift(1)\n",
"df_groups"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Knowing which dates belong to each group, let us calculate the number of occurences (days) between each measurement. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"group\n",
"1.0 2\n",
"2.0 3\n",
"Name: count, dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s = df_groups.groupby(\"group\").size()\n",
"s.name = \"count\"\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By joining the counts back to the groups we will see how many days were covered each date so that we can calculate the average - measured value at the end of the period/period lenght. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1900-01-01 NaN\n",
"1900-01-02 2.0\n",
"1900-01-03 2.0\n",
"1900-01-04 3.0\n",
"1900-01-05 3.0\n",
"1900-01-06 3.0\n",
"Name: count, dtype: float64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s_counts = df_groups.join(s, on=\"group\")[\"count\"]\n",
"s_counts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mering this back to the original dataframe, with backfilled sum allow us to calculate the average. "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>CAT</th>\n",
" <th>VALUE</th>\n",
" <th>count</th>\n",
" <th>average</th>\n",
" </tr>\n",
" <tr>\n",
" <th>DATE</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1900-01-01</th>\n",
" <td>abc</td>\n",
" <td>10</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1900-01-02</th>\n",
" <td>abc</td>\n",
" <td>20</td>\n",
" <td>2.0</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1900-01-03</th>\n",
" <td>abc</td>\n",
" <td>20</td>\n",
" <td>2.0</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1900-01-04</th>\n",
" <td>abc</td>\n",
" <td>15</td>\n",
" <td>3.0</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1900-01-05</th>\n",
" <td>abc</td>\n",
" <td>15</td>\n",
" <td>3.0</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1900-01-06</th>\n",
" <td>abc</td>\n",
" <td>15</td>\n",
" <td>3.0</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" CAT VALUE count average\n",
"DATE \n",
"1900-01-01 abc 10 NaN NaN\n",
"1900-01-02 abc 20 2.0 10.0\n",
"1900-01-03 abc 20 2.0 10.0\n",
"1900-01-04 abc 15 3.0 5.0\n",
"1900-01-05 abc 15 3.0 5.0\n",
"1900-01-06 abc 15 3.0 5.0"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res = df.set_index(\"DATE\").resample(\"D\").bfill()\n",
"res = res.join(s_counts)\n",
"res[\"average\"] = res[\"VALUE\"]/res[\"count\"]\n",
"res"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment