Last active
April 11, 2023 23:40
A Few General Notes on Python Internals and Making It Fast
{ | |
"cells": [ | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# A Few General Notes on Python Internals and Fast Code\n", | |
"\n", | |
"Python is very efficient in terms of developer time. But when an application needs to be performant, it takes a few tricks to make Python go fast.\n", | |
"\n", | |
"This notebook collects some common tricks, along with some insights into how Python works under the hood, that may help you make better guesses about what is fast and what is slow.\n", | |
" \n", | |
"> When I teach courses on Python for scientific computing, I make this point [very early](http://nbviewer.ipython.org/github/jakevdp/2013_fall_ASTR599/blob/master/notebooks/11_EfficientNumpy.ipynb) in the course, and tell the students why: it boils down to Python being a dynamically typed, interpreted language, where values are stored not in dense buffers but in scattered objects. And then I talk about how to get around this by using NumPy, SciPy, and related tools for vectorization of operations and calling into compiled code, and go on from there.\n", | |
" \n", | |
"From Jake VanderPlas's post *Why Python is Slow: Looking Under the Hood*: https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/\n", | |
"\n", | |
"We will touch on:\n", | |
"- Numpy\n", | |
"- Pandas\n", | |
"- Numba\n", | |
"- Cython\n", | |
"- Polars\n", | |
"\n", | |
"Also, for a general frame of reference on how fast computers are, see this site co-authored by Julia Evans: https://computers-are-fast.github.io/" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# !pip install numpy cython numba pandas polars pillow ipykernel" | |
] | |
}, | |
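{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As a rough illustration of the \"scattered objects versus dense buffers\" point in the quote above, the sketch below estimates the memory used by a Python list of integers and by a Numpy array holding the same values. The names `values`, `arr`, `list_bytes`, and `array_bytes` are made up for this example. The list figure is only an estimate: `sys.getsizeof` reports each object's own size, and CPython shares some small integers, so treat the result as an order of magnitude rather than an exact measurement." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import sys\n", | |
"import numpy as np\n", | |
"\n", | |
"n = 1_000_000\n", | |
"values = list(range(n))\n", | |
"arr = np.arange(n, dtype=np.int64)\n", | |
"\n", | |
"# A list holds pointers; every int is a separate heap object with its own header.\n", | |
"list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)\n", | |
"# A Numpy array stores the raw 8-byte values back to back in a single buffer.\n", | |
"array_bytes = arr.nbytes\n", | |
"\n", | |
"print(f\"list + int objects: {list_bytes / 1e6:.1f} MB\")\n", | |
"print(f\"numpy int64 buffer: {array_bytes / 1e6:.1f} MB\")" | |
] | |
}, | |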
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## How Much Faster Can C Be Than Python?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"\n", | |
"def sum_list(l: list[int]) -> int:\n", | |
"    \"\"\"Sum a list with a plain Python loop.\"\"\"\n", | |
"    total = 0\n", | |
"    for i in l:\n", | |
"        total += i\n", | |
"    return total\n", | |
"\n", | |
"# The same million integers as a Python list and as a Numpy array\n", | |
"lp = list(range(1_000_000))\n", | |
"ln = np.array(lp)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%load_ext Cython" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%%cython --annotate\n", | |
"\n", | |
"cimport cython\n", | |
"\n", | |
"cpdef long sum_cython(long[:] l, int N):\n", | |
"    # Typed locals and a typed memoryview let Cython compile the loop to plain C\n", | |
"    cdef long total = 0\n", | |
"    cdef long i = 0\n", | |
"    for i in range(N):\n", | |
"        total += l[i]\n", | |
"    return total" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 57, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"39.8 ms ± 467 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%%timeit \n", | |
"sum_list(lp)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 58, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"580 µs ± 4.31 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%%timeit\n", | |
"sum_cython(ln, len(ln))" | |
] | |
}, | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Numpy Is C Under the Hood\n", | |
"\n", | |
"For very regular, structured data like n-dimensional arrays, Numpy avoids the overhead of per-element Python objects by storing values in contiguous, typed C arrays.\n", | |
"\n", | |
"It comes with plenty of built-in routines: https://numpy.org/doc/stable/reference/routines.array-manipulation.html\n", | |
"- Reshaping, transposing, changing dimensions\n", | |
"- Concatenating, splitting, appending, inserting, deleting\n", | |
"- Sorting, searching, counting\n", | |
"- Etc.\n", | |
"\n", | |
"Numpy's data type model (see https://numpy.org/doc/stable/reference/arrays.scalars.html#sized-aliases):\n", | |
"- integers - int8, int16, int32, int64\n", | |
"- unsigned integers - uint8, uint16, uint32, uint64\n", | |
"- floating point - float16, float32, float64, float128 (platform-dependent)\n", | |
"- complex floating point - complex64, complex128, complex256 (platform-dependent)\n", | |
"- boolean - bool\n", | |
"- string - str\n", | |
"- object - object\n", | |
"- datetime - datetime64, timedelta64\n", | |
"- more - void, bytes, unicode\n", | |
"\n", | |
"\n", | |
"\n", | |
"There are also structured dtypes, which pack several named fields into each array element, much like the rows of a table. See: https://numpy.org/doc/stable/user/basics.rec.html. I don't recommend using these for tabular computations (see below)." | |
] | |
}, | |
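{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"To make the dtype list above a little more concrete, here is a minimal sketch of how the choice of dtype sets the number of bytes stored per element. The array size and the variable names `a` and `small` are arbitrary choices for this illustration. The last lines show one trade-off of narrow integer types: arithmetic on arrays wraps around on overflow instead of raising an error." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"\n", | |
"for dtype in (np.int8, np.int32, np.int64, np.float32, np.float64):\n", | |
"    a = np.zeros(1_000_000, dtype=dtype)\n", | |
"    print(f\"{a.dtype.name:>8}: {a.itemsize} bytes per element, {a.nbytes / 1e6:.0f} MB total\")\n", | |
"\n", | |
"# Narrow integer dtypes wrap around silently on overflow: 127 + 1 becomes -128 in int8.\n", | |
"small = np.array([127], dtype=np.int8)\n", | |
"print(small + np.int8(1))" | |
] | |
}, | |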
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"340 µs ± 7.61 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%%timeit\n", | |
"np.sum(ln)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Avoid Structured Arrays for Tabular Computations\n", | |
"\n", | |
"> Users looking to manipulate tabular data, such as stored in csv files, may find other pydata projects more suitable, such as xarray, pandas, or DataArray. These provide a high-level interface for tabular data analysis and are better optimized for that use. For instance, the C-struct-like memory layout of structured arrays in numpy can lead to poor cache behavior in comparison.\n", | |
"> See: https://numpy.org/doc/stable/user/basics.rec.html" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>source</th>\n", | |
" <th>signal</th>\n", | |
" <th>geo_value</th>\n", | |
" <th>time_value</th>\n", | |
" <th>geo_type</th>\n", | |
" <th>time_type</th>\n", | |
" <th>direction</th>\n", | |
" <th>issue</th>\n", | |
" <th>lag</th>\n", | |
" <th>missing_value</th>\n", | |
" <th>missing_stderr</th>\n", | |
" <th>missing_sample_size</th>\n", | |
" <th>value</th>\n", | |
" <th>stderr</th>\n", | |
" <th>sample_size</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01001</td>\n", | |
" <td>20200122</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20200514</td>\n", | |
" <td>113</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>0.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01001</td>\n", | |
" <td>20200123</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20200514</td>\n", | |
" <td>112</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>0.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01001</td>\n", | |
" <td>20200124</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20200514</td>\n", | |
" <td>111</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>0.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01001</td>\n", | |
" <td>20200125</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20200514</td>\n", | |
" <td>110</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>0.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01001</td>\n", | |
" <td>20200126</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20200514</td>\n", | |
" <td>109</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>0.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>...</th>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>69072</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01133</td>\n", | |
" <td>20221113</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20221114</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>8875.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>69073</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01133</td>\n", | |
" <td>20221114</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20221115</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>8875.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>69074</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01133</td>\n", | |
" <td>20221115</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20221116</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>8875.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>69075</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01133</td>\n", | |
" <td>20221116</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20221117</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>8877.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>69076</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01133</td>\n", | |
" <td>20221117</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20221118</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>8877.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>69077 rows × 15 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" source signal geo_value time_value geo_type \n", | |
"0 jhu-csse confirmed_cumulative_num 01001 20200122 county \\\n", | |
"1 jhu-csse confirmed_cumulative_num 01001 20200123 county \n", | |
"2 jhu-csse confirmed_cumulative_num 01001 20200124 county \n", | |
"3 jhu-csse confirmed_cumulative_num 01001 20200125 county \n", | |
"4 jhu-csse confirmed_cumulative_num 01001 20200126 county \n", | |
"... ... ... ... ... ... \n", | |
"69072 jhu-csse confirmed_cumulative_num 01133 20221113 county \n", | |
"69073 jhu-csse confirmed_cumulative_num 01133 20221114 county \n", | |
"69074 jhu-csse confirmed_cumulative_num 01133 20221115 county \n", | |
"69075 jhu-csse confirmed_cumulative_num 01133 20221116 county \n", | |
"69076 jhu-csse confirmed_cumulative_num 01133 20221117 county \n", | |
"\n", | |
" time_type direction issue lag missing_value missing_stderr \n", | |
"0 day NaN 20200514 113 0 5 \\\n", | |
"1 day NaN 20200514 112 0 5 \n", | |
"2 day NaN 20200514 111 0 5 \n", | |
"3 day NaN 20200514 110 0 5 \n", | |
"4 day NaN 20200514 109 0 5 \n", | |
"... ... ... ... ... ... ... \n", | |
"69072 day NaN 20221114 1 0 5 \n", | |
"69073 day NaN 20221115 1 0 5 \n", | |
"69074 day NaN 20221116 1 0 5 \n", | |
"69075 day NaN 20221117 1 0 5 \n", | |
"69076 day NaN 20221118 1 0 5 \n", | |
"\n", | |
" missing_sample_size value stderr sample_size \n", | |
"0 5 0.0 NaN NaN \n", | |
"1 5 0.0 NaN NaN \n", | |
"2 5 0.0 NaN NaN \n", | |
"3 5 0.0 NaN NaN \n", | |
"4 5 0.0 NaN NaN \n", | |
"... ... ... ... ... \n", | |
"69072 5 8875.0 NaN NaN \n", | |
"69073 5 8875.0 NaN NaN \n", | |
"69074 5 8875.0 NaN NaN \n", | |
"69075 5 8877.0 NaN NaN \n", | |
"69076 5 8877.0 NaN NaN \n", | |
"\n", | |
"[69077 rows x 15 columns]" | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import pandas as pd\n", | |
"from typing import Iterable, Dict\n", | |
"\n", | |
"df = pd.read_csv(\"/home/dskel/Documents/Code/Delphi/delphi-dev/confirmed_cumulative_num_01_counties.csv\")\n", | |
"df = df.set_index([\"source\", \"signal\", \"geo_value\", \"time_value\"]).sort_index().reset_index()\n", | |
"df[\"time_value\"] = df[\"time_value\"].astype(int) \n", | |
"df[\"issue\"] = df[\"issue\"].astype(int) \n", | |
"df[\"source\"] = df[\"source\"].astype(\"category\")\n", | |
"df[\"signal\"] = df[\"signal\"].astype(\"category\")\n", | |
"df[\"geo_type\"] = df[\"geo_type\"].astype(\"category\")\n", | |
"df[\"time_type\"] = df[\"time_type\"].astype(\"category\")\n", | |
"df[\"geo_value\"] = df[\"geo_value\"].astype(str).str.zfill(5)\n", | |
"row_dicts = df.to_dict(orient=\"records\")\n", | |
"df" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'source': 'jhu-csse',\n", | |
" 'signal': 'confirmed_cumulative_num',\n", | |
" 'geo_value': '01001',\n", | |
" 'time_value': 20200122,\n", | |
" 'geo_type': 'county',\n", | |
" 'time_type': 'day',\n", | |
" 'direction': nan,\n", | |
" 'issue': 20200514,\n", | |
" 'lag': 113,\n", | |
" 'missing_value': 0,\n", | |
" 'missing_stderr': 5,\n", | |
" 'missing_sample_size': 5,\n", | |
" 'value': 0.0,\n", | |
" 'stderr': nan,\n", | |
" 'sample_size': nan}" | |
] | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"row_dicts[0]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([(b'', b'', b'', b'', b'', 20200122, 20200514, 113, 0, 5, 5, 0., nan, nan),\n", | |
" (b'', b'', b'', b'', b'', 20200123, 20200514, 112, 0, 5, 5, 0., nan, nan),\n", | |
" (b'', b'', b'', b'', b'', 20200124, 20200514, 111, 0, 5, 5, 0., nan, nan),\n", | |
" ...,\n", | |
" (b'', b'', b'', b'', b'', 20221115, 20221116, 1, 0, 5, 5, 8875., nan, nan),\n", | |
" (b'', b'', b'', b'', b'', 20221116, 20221117, 1, 0, 5, 5, 8877., nan, nan),\n", | |
" (b'', b'', b'', b'', b'', 20221117, 20221118, 1, 0, 5, 5, 8877., nan, nan)],\n", | |
" dtype=[('geo_value', 'S'), ('signal', 'S'), ('source', 'S'), ('geo_type', 'S'), ('time_type', 'S'), ('time_value', '<i4'), ('issue', '<i4'), ('lag', '<i4'), ('missing_value', 'i1'), ('missing_stderr', 'i1'), ('missing_sample_size', 'i1'), ('value', '<f8'), ('stderr', '<f8'), ('sample_size', '<f8')])" | |
] | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# NOTE: a bytes dtype without a length creates zero-width string fields,\n", | |
"# so the string columns end up stored as empty bytes (b'') in this array.\n", | |
"# np.bytes_ is the same type as np.string_, which was removed in NumPy 2.0.\n", | |
"ndarray_dtypes = np.dtype([\n", | |
"    (\"geo_value\", np.bytes_),\n", | |
"    (\"signal\", np.bytes_),\n", | |
"    (\"source\", np.bytes_),\n", | |
"    (\"geo_type\", np.bytes_),\n", | |
"    (\"time_type\", np.bytes_),\n", | |
"    (\"time_value\", \"i4\"),\n", | |
"    (\"issue\", \"i4\"),\n", | |
"    (\"lag\", \"i4\"),\n", | |
"    (\"missing_value\", \"i1\"),\n", | |
"    (\"missing_stderr\", \"i1\"),\n", | |
"    (\"missing_sample_size\", \"i1\"),\n", | |
"    (\"value\", float),\n", | |
"    (\"stderr\", float),\n", | |
"    (\"sample_size\", float),\n", | |
"])\n", | |
"\n", | |
"def dicts_to_structured_array(rows: Iterable[Dict]) -> np.ndarray:\n", | |
"    \"\"\"Numpy structured arrays are slow.\"\"\"\n", | |
"    # The dtype field names give the column order; \"direction\" is not a field, so it is dropped.\n", | |
"    return np.array([tuple(row[k] for k in ndarray_dtypes.names) for row in rows], dtype=ndarray_dtypes)\n", | |
"\n", | |
"structured_array = dicts_to_structured_array(row_dicts)\n", | |
"structured_array" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'geo_value': array([b'0', b'0', b'0', ..., b'0', b'0', b'0'], dtype='|S1'),\n", | |
" 'signal': array([b'c', b'c', b'c', ..., b'c', b'c', b'c'], dtype='|S1'),\n", | |
" 'source': array([b'j', b'j', b'j', ..., b'j', b'j', b'j'], dtype='|S1'),\n", | |
" 'geo_type': array([b'c', b'c', b'c', ..., b'c', b'c', b'c'], dtype='|S1'),\n", | |
" 'time_type': array([b'd', b'd', b'd', ..., b'd', b'd', b'd'], dtype='|S1'),\n", | |
" 'time_value': array([20200122, 20200123, 20200124, ..., 20221115, 20221116, 20221117],\n", | |
" dtype=int32),\n", | |
" 'issue': array([20200514, 20200514, 20200514, ..., 20221116, 20221117, 20221118],\n", | |
" dtype=int32),\n", | |
" 'lag': array([113, 112, 111, ..., 1, 1, 1], dtype=int32),\n", | |
" 'missing_value': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),\n", | |
" 'missing_stderr': array([5, 5, 5, ..., 5, 5, 5], dtype=int8),\n", | |
" 'missing_sample_size': array([5, 5, 5, ..., 5, 5, 5], dtype=int8),\n", | |
" 'value': array([ 0., 0., 0., ..., 8875., 8877., 8877.]),\n", | |
" 'stderr': array([nan, nan, nan, ..., nan, nan, nan]),\n", | |
" 'sample_size': array([nan, nan, nan, ..., nan, nan, nan])}" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"ndarray_dtypes = np.dtype([\n", | |
"    (\"geo_value\", np.bytes_),\n", | |
"    (\"signal\", np.bytes_),\n", | |
"    (\"source\", np.bytes_),\n", | |
"    (\"geo_type\", np.bytes_),\n", | |
"    (\"time_type\", np.bytes_),\n", | |
"    (\"time_value\", \"i4\"),\n", | |
"    (\"issue\", \"i4\"),\n", | |
"    (\"lag\", \"i4\"),\n", | |
"    (\"missing_value\", \"i1\"),\n", | |
"    (\"missing_stderr\", \"i1\"),\n", | |
"    (\"missing_sample_size\", \"i1\"),\n", | |
"    (\"value\", float),\n", | |
"    (\"stderr\", float),\n", | |
"    (\"sample_size\", float),\n", | |
"])\n", | |
"def dicts_to_arrays(rows: Iterable[Dict]) -> Dict[str, np.ndarray]:\n", | |
"    \"\"\"Convert dictionaries to a dictionary of Numpy arrays.\n", | |
"\n", | |
"    This is to get away from using structured arrays, which are slow for tabular computations.\n", | |
"    \"\"\"\n", | |
"    rows = list(rows)\n", | |
"    # Allocate one contiguous array per column, then fill it row by row\n", | |
"    arrays = {\n", | |
"        k: np.empty(len(rows), dtype=ndarray_dtypes[k])\n", | |
"        for k in ndarray_dtypes.names\n", | |
"    }\n", | |
"    for i, row in enumerate(rows):\n", | |
"        for k in arrays:\n", | |
"            arrays[k][i] = row[k]\n", | |
"    return arrays\n", | |
"\n", | |
"dict_arrays = dicts_to_arrays(row_dicts)\n", | |
"dict_arrays" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"145 µs ± 3.4 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%%timeit\n", | |
"structured_array[\"time_value\"].max()\n", | |
"structured_array[\"lag\"].sum()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"40.9 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%%timeit\n", | |
"dict_arrays[\"time_value\"].max()\n", | |
"dict_arrays[\"lag\"].sum()" | |
] | |
}, | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Pandas\n", | |
"\n", | |
"Pandas is built on top of Numpy. Let's look at how the data is structured under the hood.\n", | |
"\n", | |
"The data is stored in Blocks, which are 2D Numpy arrays that each hold columns of the same dtype.\n", | |
"\n", | |
"Two implications:\n", | |
"1. Appending rows is slow because it requires copying the entire array for every Block.\n", | |
"2. Appending columns is somewhat faster, since by default new Blocks are created separately from the existing data. Occasionally, though, a Block consolidation is triggered, which is slow. See here for more: https://uwekorn.com/2020/05/24/the-one-pandas-internal.html" | |
] | |
}, | |
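{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The first implication can be checked with a small timing sketch. This is an illustration under assumptions, not a benchmark: the toy rows and the helper names `append_one_at_a_time` and `build_once` are invented for this example, and the timings will vary with the machine and the Pandas version. The slow version grows a DataFrame with `pd.concat` inside a loop, so every iteration copies all the rows accumulated so far; the fast version collects the rows first and builds the DataFrame once." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import time\n", | |
"import pandas as pd\n", | |
"\n", | |
"rows = [{\"lag\": i, \"value\": float(i)} for i in range(2_000)]\n", | |
"\n", | |
"def append_one_at_a_time(rows):\n", | |
"    # Each concat allocates new arrays and copies every existing row.\n", | |
"    out = pd.DataFrame([rows[0]])\n", | |
"    for row in rows[1:]:\n", | |
"        out = pd.concat([out, pd.DataFrame([row])], ignore_index=True)\n", | |
"    return out\n", | |
"\n", | |
"def build_once(rows):\n", | |
"    # Collect the rows first, then let Pandas allocate each column once.\n", | |
"    return pd.DataFrame(rows)\n", | |
"\n", | |
"for fn in (append_one_at_a_time, build_once):\n", | |
"    start = time.perf_counter()\n", | |
"    result = fn(rows)\n", | |
"    print(f\"{fn.__name__}: {time.perf_counter() - start:.4f} s for {len(result)} rows\")" | |
] | |
}, | |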
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>source</th>\n", | |
" <th>signal</th>\n", | |
" <th>geo_value</th>\n", | |
" <th>time_value</th>\n", | |
" <th>geo_type</th>\n", | |
" <th>time_type</th>\n", | |
" <th>direction</th>\n", | |
" <th>issue</th>\n", | |
" <th>lag</th>\n", | |
" <th>missing_value</th>\n", | |
" <th>missing_stderr</th>\n", | |
" <th>missing_sample_size</th>\n", | |
" <th>value</th>\n", | |
" <th>stderr</th>\n", | |
" <th>sample_size</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01001</td>\n", | |
" <td>20200122</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20200514</td>\n", | |
" <td>113</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>0.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01001</td>\n", | |
" <td>20200123</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20200514</td>\n", | |
" <td>112</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>0.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01001</td>\n", | |
" <td>20200124</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20200514</td>\n", | |
" <td>111</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>0.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01001</td>\n", | |
" <td>20200125</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20200514</td>\n", | |
" <td>110</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>0.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01001</td>\n", | |
" <td>20200126</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20200514</td>\n", | |
" <td>109</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>0.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>...</th>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>69072</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01133</td>\n", | |
" <td>20221113</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20221114</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>8875.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>69073</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01133</td>\n", | |
" <td>20221114</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20221115</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>8875.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>69074</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01133</td>\n", | |
" <td>20221115</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20221116</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>8875.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>69075</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01133</td>\n", | |
" <td>20221116</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20221117</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>8877.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>69076</th>\n", | |
" <td>jhu-csse</td>\n", | |
" <td>confirmed_cumulative_num</td>\n", | |
" <td>01133</td>\n", | |
" <td>20221117</td>\n", | |
" <td>county</td>\n", | |
" <td>day</td>\n", | |
" <td>NaN</td>\n", | |
" <td>20221118</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>8877.0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>69077 rows × 15 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" source signal geo_value time_value geo_type \n", | |
"0 jhu-csse confirmed_cumulative_num 01001 20200122 county \\\n", | |
"1 jhu-csse confirmed_cumulative_num 01001 20200123 county \n", | |
"2 jhu-csse confirmed_cumulative_num 01001 20200124 county \n", | |
"3 jhu-csse confirmed_cumulative_num 01001 20200125 county \n", | |
"4 jhu-csse confirmed_cumulative_num 01001 20200126 county \n", | |
"... ... ... ... ... ... \n", | |
"69072 jhu-csse confirmed_cumulative_num 01133 20221113 county \n", | |
"69073 jhu-csse confirmed_cumulative_num 01133 20221114 county \n", | |
"69074 jhu-csse confirmed_cumulative_num 01133 20221115 county \n", | |
"69075 jhu-csse confirmed_cumulative_num 01133 20221116 county \n", | |
"69076 jhu-csse confirmed_cumulative_num 01133 20221117 county \n", | |
"\n", | |
" time_type direction issue lag missing_value missing_stderr \n", | |
"0 day NaN 20200514 113 0 5 \\\n", | |
"1 day NaN 20200514 112 0 5 \n", | |
"2 day NaN 20200514 111 0 5 \n", | |
"3 day NaN 20200514 110 0 5 \n", | |
"4 day NaN 20200514 109 0 5 \n", | |
"... ... ... ... ... ... ... \n", | |
"69072 day NaN 20221114 1 0 5 \n", | |
"69073 day NaN 20221115 1 0 5 \n", | |
"69074 day NaN 20221116 1 0 5 \n", | |
"69075 day NaN 20221117 1 0 5 \n", | |
"69076 day NaN 20221118 1 0 5 \n", | |
"\n", | |
" missing_sample_size value stderr sample_size \n", | |
"0 5 0.0 NaN NaN \n", | |
"1 5 0.0 NaN NaN \n", | |
"2 5 0.0 NaN NaN \n", | |
"3 5 0.0 NaN NaN \n", | |
"4 5 0.0 NaN NaN \n", | |
"... ... ... ... ... \n", | |
"69072 5 8875.0 NaN NaN \n", | |
"69073 5 8875.0 NaN NaN \n", | |
"69074 5 8875.0 NaN NaN \n", | |
"69075 5 8877.0 NaN NaN \n", | |
"69076 5 8877.0 NaN NaN \n", | |
"\n", | |
"[69077 rows x 15 columns]" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df[\"lag\"] = df[\"lag\"].astype(\"i4\")\n", | |
"df" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"BlockManager\n", | |
"Items: Index(['source', 'signal', 'geo_value', 'time_value', 'geo_type', 'time_type',\n", | |
" 'direction', 'issue', 'lag', 'missing_value', 'missing_stderr',\n", | |
" 'missing_sample_size', 'value', 'stderr', 'sample_size'],\n", | |
" dtype='object')\n", | |
"Axis 1: RangeIndex(start=0, stop=69077, step=1)\n", | |
"NumericBlock: [ 6 12 13 14], 4 x 69077, dtype: float64\n", | |
"NumericBlock: slice(9, 12, 1), 3 x 69077, dtype: int64\n", | |
"ExtensionBlock: slice(5, 6, 1), 1 x 69077, dtype: category\n", | |
"NumericBlock: slice(3, 4, 1), 1 x 69077, dtype: int64\n", | |
"ObjectBlock: slice(2, 3, 1), 1 x 69077, dtype: object\n", | |
"ExtensionBlock: slice(1, 2, 1), 1 x 69077, dtype: category\n", | |
"ExtensionBlock: slice(0, 1, 1), 1 x 69077, dtype: category\n", | |
"NumericBlock: slice(7, 8, 1), 1 x 69077, dtype: int64\n", | |
"ExtensionBlock: slice(4, 5, 1), 1 x 69077, dtype: category\n", | |
"NumericBlock: slice(8, 9, 1), 1 x 69077, dtype: int32" | |
] | |
}, | |
"execution_count": 16, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# This shows us the Numpy arrays that back the DataFrame\n", | |
"df._data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([20200122, 20200123, 20200124, ..., 20221115, 20221116, 20221117])" | |
] | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df[\"time_value\"].values" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([20200122, 20200123, 20200124, ..., 20221115, 20221116, 20221117])" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df[\"time_value\"].values.base" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[ nan, nan, nan, ..., nan, nan, nan],\n", | |
" [ 0., 0., 0., ..., 8875., 8877., 8877.],\n", | |
" [ nan, nan, nan, ..., nan, nan, nan],\n", | |
" [ nan, nan, nan, ..., nan, nan, nan]])" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
],
"source": [
"# We can see that value shares an array with the other float64 columns: direction, stderr, sample_size\n",
"df[\"value\"].values.base" | |
] | |
}, | |
{ | |
"attachments": { | |
"image.png": { | |
"image/png": "" | |
} | |
},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pandas Apply, Map, Transform Are A Performance Trap\n",
"\n",
"These methods exist so that you can pipe a custom function into a chain of Pandas operations.\n",
"\n",
"Under the hood, though, they typically call your function in a Python-level for loop (once per row, element, or group), which is slow compared to the built-in vectorized methods.\n",
"\n", | |
"\n", | |
"\n", | |
"See more here: https://pythonspeed.com/articles/pandas-vectorization/" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 162, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"assert (df.groupby(\"geo_value\").apply(lambda x: x[\"value\"].sum()) == df.groupby(\"geo_value\")[\"value\"].sum()).all()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 160, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"16.7 ms ± 355 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%%timeit \n", | |
"df.groupby(\"geo_value\").apply(lambda x: x[\"value\"].sum())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 161, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"3.38 ms ± 18 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%%timeit\n", | |
"df.groupby(\"geo_value\")[\"value\"].sum()" | |
] | |
}, | |
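{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The same trap shows up with row-wise `apply`. The cell below is a minimal sketch on a made-up DataFrame (`toy`, with hypothetical columns `a` and `b`, not the dataset above). It computes the same column sum twice: once with `apply(axis=1)`, which calls the lambda once per row in Python, and once with a vectorized expression. Wrapping each line in `%timeit` should show the gap; the exact numbers depend on your machine and Pandas version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data, only for illustration\n",
"toy = pd.DataFrame({\"a\": np.arange(10_000), \"b\": np.arange(10_000) * 2})\n",
"\n",
"# Row-wise apply: the lambda is called once per row, in Python\n",
"slow = toy.apply(lambda row: row[\"a\"] + row[\"b\"], axis=1)\n",
"\n",
"# Vectorized: a single call, with the loop running in compiled code\n",
"fast = toy[\"a\"] + toy[\"b\"]\n",
"\n",
"assert (slow == fast).all()"
]
},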
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Numba Can Speed Up Python Code With Low Effort\n",
"\n",
"Numba is a JIT compiler for Python. It can speed up Python code by compiling it to machine code. See: https://numba.readthedocs.io/en/stable/reference/jit-compilation.html#jit-functions\n",
"\n",
"\n",
"What it's useful for:\n",
"- numerical code\n",
"- code that uses NumPy arrays\n",
"- code with fixed-size data structures\n",
"\n",
"What it's not useful for:\n",
"- code that uses arbitrary Python objects, like plain lists and dictionaries\n",
"- code with variable-size data structures\n",
"- code that relies heavily on Python's built-in string type\n",
"- code that requires C extensions\n",
"\n",
"One caveat: support for datetimes and timedeltas is limited/buggy, see: https://github.com/numba/numba/issues/5780\n",
"\n", | |
"See more here: \n", | |
"- speeding up an Ising model using Numba https://matthewrocklin.com/blog/work/2015/02/28/Ising\n", | |
"- speeding up an Advent of Code problem using Numba https://github.com/dshemetov/advent-of-code-solutions/blob/main/src/advent2022/p11.py" | |
] | |
}, | |
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"import numba as nb\n",
"\n",
"# The first argument is the type signature; it is optional (without it, Numba compiles lazily on the first call)\n",
"# See here for the signature syntax and the supported types: https://numba.readthedocs.io/en/stable/reference/types.html#signatures\n",
"@nb.jit(\"int64(int64[:])\", cache=True, nopython=True)\n", | |
"def sum_numba(l: np.ndarray) -> int:\n", | |
" total = 0\n", | |
" for i in l:\n", | |
" total += i\n", | |
" return total" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 60, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"566 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%%timeit\n", | |
"sum_numba(ln)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"@nb.njit()\n", | |
"def diff(x: np.ndarray) -> np.ndarray:\n", | |
" return np.diff(x)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"diff (array(int64, 1d, C),)\n", | |
"--------------------------------------------------------------------------------\n", | |
"# File: /tmp/ipykernel_138759/4104025119.py\n", | |
"# --- LINE 1 --- \n", | |
"\n", | |
"@nb.njit()\n", | |
"\n", | |
"# --- LINE 2 --- \n", | |
"\n", | |
"def diff(x: np.ndarray) -> np.ndarray:\n", | |
"\n", | |
" # --- LINE 3 --- \n", | |
" # label 0\n", | |
" # x = arg(0, name=x) :: array(int64, 1d, C)\n", | |
" # $2load_global.0 = global(np: <module 'numpy' from '/home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/numpy/__init__.py'>) :: Module(<module 'numpy' from '/home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/numpy/__init__.py'>)\n", | |
" # $4load_method.1 = getattr(value=$2load_global.0, attr=diff) :: Function(<function diff at 0x7fc09027b880>)\n", | |
" # del $2load_global.0\n", | |
" # $8call_method.3 = call $4load_method.1(x, func=$4load_method.1, args=[Var(x, 4104025119.py:3)], kws=(), vararg=None, varkwarg=None, target=None) :: (array(int64, 1d, C), omitted(default=1)) -> array(int64, 1d, C)\n", | |
" # del x\n", | |
" # del $4load_method.1\n", | |
" # $10return_value.4 = cast(value=$8call_method.3) :: array(int64, 1d, C)\n", | |
" # del $8call_method.3\n", | |
" # return $10return_value.4\n", | |
"\n", | |
" return np.diff(x)\n", | |
"\n", | |
"\n", | |
"================================================================================\n" | |
] | |
} | |
], | |
"source": [ | |
"diff.inspect_types()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 135, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"ename": "TypingError", | |
"evalue": "Failed in nopython mode pipeline (step: nopython frontend)\nNo implementation of function Function(datetime64[]) found for signature:\n \n >>> <unknown function>(Literal[str](2023-03-02))\n \nThere are 2 candidate implementations:\n - Of which 1 did not match due to:\n Overload in function 'make_callable_template.<locals>.generic': File: numba/core/typing/templates.py: Line 174.\n With argument(s): '(unicode_type)':\n Rejected as the implementation raised a specific error:\n TypingError: Casting unicode_type to datetime64[] directly is unsupported.\n raised from /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/numba/core/typing/builtins.py:833\n - Of which 1 did not match due to:\n Overload in function 'make_callable_template.<locals>.generic': File: numba/core/typing/templates.py: Line 174.\n With argument(s): '(Literal[str](2023-03-02))':\n Rejected as the implementation raised a specific error:\n TypingError: Casting Literal[str](2023-03-02) to datetime64[] directly is unsupported.\n raised from /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/numba/core/typing/builtins.py:833\n\nDuring: resolving callee type: class(datetime64[])\nDuring: typing of call at /tmp/ipykernel_133357/1513295171.py (4)\n\n\nFile \"../../../../../../tmp/ipykernel_133357/1513295171.py\", line 4:\n<source missing, REPL/exec in use?>\n", | |
"output_type": "error", | |
"traceback": [ | |
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", | |
"\u001b[0;31mTypingError\u001b[0m Traceback (most recent call last)", | |
"Cell \u001b[0;32mIn[135], line 6\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[39m@nb\u001b[39m\u001b[39m.\u001b[39mjit(nopython\u001b[39m=\u001b[39m\u001b[39mTrue\u001b[39;00m)\n\u001b[1;32m 3\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39madd_dates\u001b[39m():\n\u001b[1;32m 4\u001b[0m \u001b[39mreturn\u001b[39;00m np\u001b[39m.\u001b[39mdatetime64(\u001b[39m\"\u001b[39m\u001b[39m2023-03-02\u001b[39m\u001b[39m\"\u001b[39m) \u001b[39m+\u001b[39m np\u001b[39m.\u001b[39mtimedelta64(\u001b[39m1\u001b[39m, \u001b[39m\"\u001b[39m\u001b[39mD\u001b[39m\u001b[39m\"\u001b[39m)\n\u001b[0;32m----> 6\u001b[0m add_dates()\n", | |
"File \u001b[0;32m~/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/numba/core/dispatcher.py:468\u001b[0m, in \u001b[0;36m_DispatcherBase._compile_for_args\u001b[0;34m(self, *args, **kws)\u001b[0m\n\u001b[1;32m 464\u001b[0m msg \u001b[39m=\u001b[39m (\u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m{\u001b[39;00m\u001b[39mstr\u001b[39m(e)\u001b[39m.\u001b[39mrstrip()\u001b[39m}\u001b[39;00m\u001b[39m \u001b[39m\u001b[39m\\n\u001b[39;00m\u001b[39m\\n\u001b[39;00m\u001b[39mThis error may have been caused \u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 465\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mby the following argument(s):\u001b[39m\u001b[39m\\n\u001b[39;00m\u001b[39m{\u001b[39;00margs_str\u001b[39m}\u001b[39;00m\u001b[39m\\n\u001b[39;00m\u001b[39m\"\u001b[39m)\n\u001b[1;32m 466\u001b[0m e\u001b[39m.\u001b[39mpatch_message(msg)\n\u001b[0;32m--> 468\u001b[0m error_rewrite(e, \u001b[39m'\u001b[39;49m\u001b[39mtyping\u001b[39;49m\u001b[39m'\u001b[39;49m)\n\u001b[1;32m 469\u001b[0m \u001b[39mexcept\u001b[39;00m errors\u001b[39m.\u001b[39mUnsupportedError \u001b[39mas\u001b[39;00m e:\n\u001b[1;32m 470\u001b[0m \u001b[39m# Something unsupported is present in the user code, add help info\u001b[39;00m\n\u001b[1;32m 471\u001b[0m error_rewrite(e, \u001b[39m'\u001b[39m\u001b[39munsupported_error\u001b[39m\u001b[39m'\u001b[39m)\n", | |
"File \u001b[0;32m~/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/numba/core/dispatcher.py:409\u001b[0m, in \u001b[0;36m_DispatcherBase._compile_for_args.<locals>.error_rewrite\u001b[0;34m(e, issue_type)\u001b[0m\n\u001b[1;32m 407\u001b[0m \u001b[39mraise\u001b[39;00m e\n\u001b[1;32m 408\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[0;32m--> 409\u001b[0m \u001b[39mraise\u001b[39;00m e\u001b[39m.\u001b[39mwith_traceback(\u001b[39mNone\u001b[39;00m)\n", | |
"\u001b[0;31mTypingError\u001b[0m: Failed in nopython mode pipeline (step: nopython frontend)\nNo implementation of function Function(datetime64[]) found for signature:\n \n >>> <unknown function>(Literal[str](2023-03-02))\n \nThere are 2 candidate implementations:\n - Of which 1 did not match due to:\n Overload in function 'make_callable_template.<locals>.generic': File: numba/core/typing/templates.py: Line 174.\n With argument(s): '(unicode_type)':\n Rejected as the implementation raised a specific error:\n TypingError: Casting unicode_type to datetime64[] directly is unsupported.\n raised from /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/numba/core/typing/builtins.py:833\n - Of which 1 did not match due to:\n Overload in function 'make_callable_template.<locals>.generic': File: numba/core/typing/templates.py: Line 174.\n With argument(s): '(Literal[str](2023-03-02))':\n Rejected as the implementation raised a specific error:\n TypingError: Casting Literal[str](2023-03-02) to datetime64[] directly is unsupported.\n raised from /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/numba/core/typing/builtins.py:833\n\nDuring: resolving callee type: class(datetime64[])\nDuring: typing of call at /tmp/ipykernel_133357/1513295171.py (4)\n\n\nFile \"../../../../../../tmp/ipykernel_133357/1513295171.py\", line 4:\n<source missing, REPL/exec in use?>\n" | |
] | |
} | |
],
"source": [
"# Constructing a datetime64 from a string inside a jitted function is unsupported, see: https://github.com/numba/numba/issues/5780\n",
"@nb.jit(nopython=True)\n", | |
"def add_dates():\n", | |
" return np.datetime64(\"2023-03-02\") + np.timedelta64(1, \"D\")\n", | |
"\n", | |
"add_dates()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 134, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"numpy.datetime64('2023-03-03')" | |
] | |
}, | |
"execution_count": 134, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
],
"source": [
"# Workaround: create the datetime64 and timedelta64 values outside the jitted function\n",
"start_date = np.datetime64(\"2023-03-02\")\n", | |
"day_delta = np.timedelta64(1, \"D\")\n", | |
"\n", | |
"@nb.jit(nopython=True)\n", | |
"def add_dates():\n", | |
" return start_date + day_delta\n", | |
"\n", | |
"add_dates()" | |
] | |
}, | |
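{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Another possible workaround is to pass the values in as arguments instead of capturing them as globals. This is a sketch that assumes your Numba version accepts NumPy `datetime64`/`timedelta64` scalars as arguments; the function name `add_delta` is made up for illustration. Numba treats globals as compile-time constants, so taking arguments keeps the function reusable for other dates."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numba as nb\n",
"import numpy as np\n",
"\n",
"@nb.njit\n",
"def add_delta(start, delta):\n",
"    # Both values are created outside the jitted code, so no string parsing happens here\n",
"    return start + delta\n",
"\n",
"add_delta(np.datetime64(\"2023-03-02\"), np.timedelta64(1, \"D\"))"
]
},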
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cython Can Speed Up Python Code With Moderate Effort\n",
"\n",
"Cython is a superset of Python that compiles to C; the resulting extension module can then be imported into your Python code.\n",
"\n",
"In this notebook, we've been using the convenient %%cython magic command, which compiles the code in the cell and imports it into the notebook.\n",
"There are many other compilation workflows for it, though; see: https://cython.readthedocs.io/en/stable/src/userguide/source_files_and_compilation.html\n",
"\n",
"It is useful for:\n",
"- the same kinds of numerical code as Numba, but with more flexibility (can import C and C++ libs, can use C++ classes, etc.)\n",
"\n",
"It is not useful for:\n",
"- quick speedups: it requires a lot more investment\n",
"- keeping things in Python: it can quickly turn into a full-on C/C++ coding project\n",
"\n", | |
"See:\n", | |
"- Speeding up an Ising model with Cython https://jakevdp.github.io/blog/2017/12/11/live-coding-cython-ising-model/" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 101, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"In file included from /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/numpy/core/include/numpy/ndarraytypes.h:1948,\n", | |
" from /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/numpy/core/include/numpy/ndarrayobject.h:12,\n", | |
" from /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/numpy/core/include/numpy/arrayobject.h:5,\n", | |
" from /home/dskel/.cache/ipython/cython/_cython_magic_9de3c4c32d43323bd19c6fd47083842c.c:769:\n", | |
"/home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning \"Using deprecated NumPy API, disable it with \" \"#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION\" [-Wcpp]\n", | |
" 17 | #warning \"Using deprecated NumPy API, disable it with \" \\\n", | |
" | ^~~~~~~\n" | |
] | |
} | |
],
"source": [
"%%cython\n",
"\n",
"cimport cython\n",
"\n",
"import numpy as np\n",
"cimport numpy as np\n",
"\n",
"from libc.math cimport exp\n",
"# RAND_MAX is defined in stdlib.h (not limits.h), so cimport it alongside rand\n",
"from libc.stdlib cimport rand, RAND_MAX\n",
"\n", | |
"\n", | |
"@cython.boundscheck(False)\n", | |
"@cython.wraparound(False)\n", | |
"def cy_ising_step(np.int64_t[:, :] field, float beta=0.4):\n", | |
" cdef int N = field.shape[0]\n", | |
" cdef int M = field.shape[1]\n", | |
" cdef int n_offset, m_offset, n, m\n", | |
" for n_offset in range(2):\n", | |
" for m_offset in range(2):\n", | |
" for n in range(n_offset, N, 2):\n", | |
" for m in range(m_offset, M, 2):\n", | |
" _cy_ising_update(field, n, m, beta)\n", | |
" return np.array(field)\n", | |
"\n", | |
"\n", | |
"@cython.boundscheck(False)\n", | |
"@cython.wraparound(False)\n", | |
"cdef _cy_ising_update(np.int64_t[:, :] field, int n, int m, float beta):\n", | |
" cdef int total = 0\n", | |
" cdef int N = field.shape[0]\n", | |
" cdef int M = field.shape[1]\n", | |
" cdef int i, j\n", | |
" for i in range(n-1, n+2):\n", | |
" for j in range(m-1, m+2):\n", | |
" if i == n and j == m:\n", | |
" continue\n", | |
" total += field[i % N, j % M]\n", | |
" cdef float dE = 2 * field[n, m] * total\n", | |
" if dE <= 0:\n", | |
" field[n, m] *= -1\n", | |
" elif exp(-dE * beta) * RAND_MAX > rand():\n", | |
" field[n, m] *= -1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 103, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"application/vnd.jupyter.widget-view+json": { | |
"model_id": "fc351e9613db4bdebbf2feb1f6b173cd", | |
"version_major": 2, | |
"version_minor": 0 | |
}, | |
"text/plain": [ | |
"interactive(children=(IntSlider(value=25, description='frame', max=50), Output()), _dom_classes=('widget-inter…" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"<function __main__.display_ising_sequence.<locals>._show(frame=(0, 50))>" | |
] | |
}, | |
"execution_count": 103, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"from ipywidgets import interact\n", | |
"from PIL import Image\n", | |
"\n", | |
"def random_spin_field(N, M):\n", | |
" return np.random.choice([-1, 1], size=(N, M))\n", | |
"\n", | |
"def display_spin_field(field):\n", | |
" return Image.fromarray(np.uint8((field + 1) * 0.5 * 255)) # 0 ... 255\n", | |
"\n", | |
"def display_ising_sequence(images):\n", | |
" def _show(frame=(0, len(images) - 1)):\n", | |
" return display_spin_field(images[frame])\n", | |
" return interact(_show)\n", | |
"\n", | |
"images = [random_spin_field(200, 200)]\n", | |
"for i in range(50):\n", | |
" images.append(cy_ising_step(images[-1].copy()))\n", | |
"display_ising_sequence(images)" | |
] | |
}, | |
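{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For a baseline to time the Cython version against, here is a rough pure-Python port of the same update. The names `py_ising_step` and `_py_ising_update` are made up for this sketch. It draws from NumPy's random generator instead of C's `rand` and updates the field in place, so it will not reproduce the Cython output exactly. Running `%timeit py_ising_step(field)` next to the Cython call on the same field size should show the difference."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def py_ising_step(field, beta=0.4):\n",
"    # Same checkerboard sweep as cy_ising_step, but with every loop running in the interpreter\n",
"    N, M = field.shape\n",
"    for n_offset in range(2):\n",
"        for m_offset in range(2):\n",
"            for n in range(n_offset, N, 2):\n",
"                for m in range(m_offset, M, 2):\n",
"                    _py_ising_update(field, n, m, beta)\n",
"    return field\n",
"\n",
"def _py_ising_update(field, n, m, beta):\n",
"    # Sum the eight neighbours, wrapping around the edges\n",
"    total = 0\n",
"    N, M = field.shape\n",
"    for i in range(n - 1, n + 2):\n",
"        for j in range(m - 1, m + 2):\n",
"            if i == n and j == m:\n",
"                continue\n",
"            total += field[i % N, j % M]\n",
"    dE = 2 * field[n, m] * total\n",
"    # Flip the spin if that lowers the energy, otherwise flip with probability exp(-dE * beta)\n",
"    if dE <= 0 or np.exp(-dE * beta) > np.random.random():\n",
"        field[n, m] *= -1\n",
"\n",
"field = np.random.choice([-1, 1], size=(50, 50))\n",
"field = py_ising_step(field)"
]
},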
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Polars Exists and Is Fast\n",
"\n",
"Polars is a DataFrame library that is written in Rust and has bindings for Python, Node.js, and R. It is very fast. See:\n",
"- Broad comparison between analytics libraries/databases: https://h2oai.github.io/db-benchmark/\n", | |
"- This comparison with Pandas: https://gist.github.com/koaning/5a0f3f27164859c42da5f20148ef3856#file-polars-ipynb\n", | |
"\n", | |
"Polars uses an Expressions syntax that feels (to me) like a blend between Pandas and SQL. See here: https://pola-rs.github.io/polars-book/user-guide/dsl/expressions.html\n", | |
"\n", | |
"Polars:\n", | |
"- supports lazy evaluation (unlike Pandas)\n", | |
"- supports parallel execution (unlike Pandas)\n", | |
"\n" | |
] | |
}, | |
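{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the expression syntax and lazy evaluation, on a made-up frame (`toy` and its contents are hypothetical, not the dataset loaded below). Nothing is computed until `collect()` is called, which gives Polars the chance to optimize the whole query first."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import polars as pl\n",
"\n",
"# Hypothetical toy data, only for illustration\n",
"toy = pl.DataFrame({\"signal\": [\"cases\", \"deaths\", \"cases\"], \"value\": [1.0, 0.0, 3.0]})\n",
"\n",
"# Build a lazy query: filter rows, then add an upper-cased copy of a column\n",
"query = (\n",
"    toy.lazy()\n",
"    .filter(pl.col(\"value\") > 0)\n",
"    .with_columns(pl.col(\"signal\").str.to_uppercase().alias(\"signal_upper\"))\n",
")\n",
"\n",
"# Nothing has run yet; collect() executes the query\n",
"query.collect()"
]
},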
{ | |
"cell_type": "code", | |
"execution_count": 89, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div><style>\n", | |
".dataframe > thead > tr > th,\n", | |
".dataframe > tbody > tr > td {\n", | |
" text-align: right;\n", | |
"}\n", | |
"</style>\n", | |
"<small>shape: (69077, 15)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>geo_value</th><th>signal</th><th>source</th><th>geo_type</th><th>time_type</th><th>time_value</th><th>direction</th><th>issue</th><th>lag</th><th>missing_value</th><th>missing_stderr</th><th>missing_sample_size</th><th>value</th><th>stderr</th><th>sample_size</th></tr><tr><td>i64</td><td>str</td><td>str</td><td>str</td><td>str</td><td>i64</td><td>str</td><td>i64</td><td>i64</td><td>i64</td><td>i64</td><td>i64</td><td>f64</td><td>str</td><td>str</td></tr></thead><tbody><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td></tr><tr><td>1003</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td></tr><tr><td>1005</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td></tr><tr><td>1007</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td></tr><tr><td>1009</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td></tr><tr><td>1011</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td></tr><tr><td>1013</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><t
d>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td></tr><tr><td>1015</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td></tr><tr><td>1017</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td></tr><tr><td>1019</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td></tr><tr><td>1021</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td></tr><tr><td>1023</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td></tr><tr><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td></tr><tr><td>1111</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>6057.0</td><td>null</td><td>null</td></tr><tr><td>1113</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>12125.0</td><td>null</td><td>null</td></tr><tr><td>1115</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><
td>1</td><td>0</td><td>5</td><td>5</td><td>30679.0</td><td>null</td><td>null</td></tr><tr><td>1117</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>72537.0</td><td>null</td><td>null</td></tr><tr><td>1119</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>3003.0</td><td>null</td><td>null</td></tr><tr><td>1121</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>26982.0</td><td>null</td><td>null</td></tr><tr><td>1123</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>13678.0</td><td>null</td><td>null</td></tr><tr><td>1125</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>66506.0</td><td>null</td><td>null</td></tr><tr><td>1127</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>22519.0</td><td>null</td><td>null</td></tr><tr><td>1129</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>4190.0</td><td>null</td><td>null</td></tr><tr><td>1131</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>3457.0</td><td>null</td><td>null</td></tr><tr><td>1133</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td>
<td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>8877.0</td><td>null</td><td>null</td></tr></tbody></table></div>" | |
], | |
"text/plain": [ | |
"shape: (69_077, 15)\n", | |
"┌───────────┬──────────────┬──────────┬──────────┬───┬────────────┬─────────┬────────┬─────────────┐\n", | |
"│ geo_value ┆ signal ┆ source ┆ geo_type ┆ … ┆ missing_sa ┆ value ┆ stderr ┆ sample_size │\n", | |
"│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ mple_size ┆ --- ┆ --- ┆ --- │\n", | |
"│ i64 ┆ str ┆ str ┆ str ┆ ┆ --- ┆ f64 ┆ str ┆ str │\n", | |
"│ ┆ ┆ ┆ ┆ ┆ i64 ┆ ┆ ┆ │\n", | |
"╞═══════════╪══════════════╪══════════╪══════════╪═══╪════════════╪═════════╪════════╪═════════════╡\n", | |
"│ 1001 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 5 ┆ 0.0 ┆ null ┆ null │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1003 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 5 ┆ 0.0 ┆ null ┆ null │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1005 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 5 ┆ 0.0 ┆ null ┆ null │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1007 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 5 ┆ 0.0 ┆ null ┆ null │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │\n", | |
"│ 1127 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 5 ┆ 22519.0 ┆ null ┆ null │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1129 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 5 ┆ 4190.0 ┆ null ┆ null │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1131 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 5 ┆ 3457.0 ┆ null ┆ null │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1133 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 5 ┆ 8877.0 ┆ null ┆ null │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"└───────────┴──────────────┴──────────┴──────────┴───┴────────────┴─────────┴────────┴─────────────┘" | |
] | |
}, | |
"execution_count": 89, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import polars as pl\n", | |
"\n", | |
"pldf = pl.read_csv(\"/home/dskel/Documents/Code/Delphi/delphi-dev/confirmed_cumulative_num_01_counties.csv\")\n", | |
"pldf" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 84, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div><style>\n", | |
".dataframe > thead > tr > th,\n", | |
".dataframe > tbody > tr > td {\n", | |
" text-align: right;\n", | |
"}\n", | |
"</style>\n", | |
"<small>shape: (69077,)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>time_value</th></tr><tr><td>i64</td></tr></thead><tbody><tr><td>20200122</td></tr><tr><td>20200122</td></tr><tr><td>20200122</td></tr><tr><td>20200122</td></tr><tr><td>20200122</td></tr><tr><td>20200122</td></tr><tr><td>20200122</td></tr><tr><td>20200122</td></tr><tr><td>20200122</td></tr><tr><td>20200122</td></tr><tr><td>20200122</td></tr><tr><td>20200122</td></tr><tr><td>…</td></tr><tr><td>20221117</td></tr><tr><td>20221117</td></tr><tr><td>20221117</td></tr><tr><td>20221117</td></tr><tr><td>20221117</td></tr><tr><td>20221117</td></tr><tr><td>20221117</td></tr><tr><td>20221117</td></tr><tr><td>20221117</td></tr><tr><td>20221117</td></tr><tr><td>20221117</td></tr><tr><td>20221117</td></tr></tbody></table></div>" | |
], | |
"text/plain": [ | |
"shape: (69_077,)\n", | |
"Series: 'time_value' [i64]\n", | |
"[\n", | |
"\t20200122\n", | |
"\t20200122\n", | |
"\t20200122\n", | |
"\t20200122\n", | |
"\t20200122\n", | |
"\t20200122\n", | |
"\t20200122\n", | |
"\t20200122\n", | |
"\t20200122\n", | |
"\t20200122\n", | |
"\t20200122\n", | |
"\t20200122\n", | |
"\t…\n", | |
"\t20221117\n", | |
"\t20221117\n", | |
"\t20221117\n", | |
"\t20221117\n", | |
"\t20221117\n", | |
"\t20221117\n", | |
"\t20221117\n", | |
"\t20221117\n", | |
"\t20221117\n", | |
"\t20221117\n", | |
"\t20221117\n", | |
"\t20221117\n", | |
"\t20221117\n", | |
"]" | |
] | |
}, | |
"execution_count": 84, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"pldf[\"time_value\"]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 90, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div><style>\n", | |
".dataframe > thead > tr > th,\n", | |
".dataframe > tbody > tr > td {\n", | |
" text-align: right;\n", | |
"}\n", | |
"</style>\n", | |
"<small>shape: (69077, 16)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>geo_value</th><th>signal</th><th>source</th><th>geo_type</th><th>time_type</th><th>time_value</th><th>direction</th><th>issue</th><th>lag</th><th>missing_value</th><th>missing_stderr</th><th>missing_sample_size</th><th>value</th><th>stderr</th><th>sample_size</th><th>signal_upper</th></tr><tr><td>i64</td><td>str</td><td>str</td><td>str</td><td>str</td><td>i64</td><td>str</td><td>i64</td><td>i64</td><td>i64</td><td>i64</td><td>i64</td><td>f64</td><td>str</td><td>str</td><td>str</td></tr></thead><tbody><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1003</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1005</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1007</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1009</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1011</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td
><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1013</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1015</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1017</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1019</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1021</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1023</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20200122</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td></tr><tr><td>1111</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>6057.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></t
r><tr><td>1113</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>12125.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1115</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>30679.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1117</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>72537.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1119</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>3003.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1121</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>26982.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1123</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>13678.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1125</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>66506.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1127</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>22519.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…<
/td></tr><tr><td>1129</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>4190.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1131</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>3457.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr><tr><td>1133</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>20221117</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>8877.0</td><td>null</td><td>null</td><td>"CONFIRMED_CUMU…</td></tr></tbody></table></div>" | |
], | |
"text/plain": [ | |
"shape: (69_077, 16)\n", | |
"┌───────────┬──────────────┬──────────┬──────────┬───┬─────────┬────────┬─────────────┬────────────┐\n", | |
"│ geo_value ┆ signal ┆ source ┆ geo_type ┆ … ┆ value ┆ stderr ┆ sample_size ┆ signal_upp │\n", | |
"│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ er │\n", | |
"│ i64 ┆ str ┆ str ┆ str ┆ ┆ f64 ┆ str ┆ str ┆ --- │\n", | |
"│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ str │\n", | |
"╞═══════════╪══════════════╪══════════╪══════════╪═══╪═════════╪════════╪═════════════╪════════════╡\n", | |
"│ 1001 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 0.0 ┆ null ┆ null ┆ CONFIRMED_ │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ CUMULATIVE │\n", | |
"│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ _NUM │\n", | |
"│ 1003 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 0.0 ┆ null ┆ null ┆ CONFIRMED_ │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ CUMULATIVE │\n", | |
"│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ _NUM │\n", | |
"│ 1005 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 0.0 ┆ null ┆ null ┆ CONFIRMED_ │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ CUMULATIVE │\n", | |
"│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ _NUM │\n", | |
"│ 1007 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 0.0 ┆ null ┆ null ┆ CONFIRMED_ │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ CUMULATIVE │\n", | |
"│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ _NUM │\n", | |
"│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │\n", | |
"│ 1127 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 22519.0 ┆ null ┆ null ┆ CONFIRMED_ │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ CUMULATIVE │\n", | |
"│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ _NUM │\n", | |
"│ 1129 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 4190.0 ┆ null ┆ null ┆ CONFIRMED_ │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ CUMULATIVE │\n", | |
"│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ _NUM │\n", | |
"│ 1131 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 3457.0 ┆ null ┆ null ┆ CONFIRMED_ │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ CUMULATIVE │\n", | |
"│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ _NUM │\n", | |
"│ 1133 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 8877.0 ┆ null ┆ null ┆ CONFIRMED_ │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ CUMULATIVE │\n", | |
"│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ _NUM │\n", | |
"└───────────┴──────────────┴──────────┴──────────┴───┴─────────┴────────┴─────────────┴────────────┘" | |
] | |
}, | |
"execution_count": 90, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"pldf.with_columns(\n", | |
" pl.col(\"signal\").str.to_uppercase().alias(\"signal_upper\"),\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 91, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div><style>\n", | |
".dataframe > thead > tr > th,\n", | |
".dataframe > tbody > tr > td {\n", | |
" text-align: right;\n", | |
"}\n", | |
"</style>\n", | |
"<small>shape: (69077, 16)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>geo_value</th><th>signal</th><th>source</th><th>geo_type</th><th>time_type</th><th>time_value</th><th>direction</th><th>issue</th><th>lag</th><th>missing_value</th><th>missing_stderr</th><th>missing_sample_size</th><th>value</th><th>stderr</th><th>sample_size</th><th>value_diff</th></tr><tr><td>i64</td><td>str</td><td>str</td><td>str</td><td>str</td><td>date</td><td>str</td><td>i64</td><td>i64</td><td>i64</td><td>i64</td><td>i64</td><td>f64</td><td>str</td><td>str</td><td>f64</td></tr></thead><tbody><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2020-01-22</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>null</td></tr><tr><td>1003</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2020-01-22</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>null</td></tr><tr><td>1005</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2020-01-22</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>null</td></tr><tr><td>1007</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2020-01-22</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>null</td></tr><tr><td>1009</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2020-01-22</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>null</td></tr><tr><td>1011</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2020-01-22</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td
>null</td><td>null</td></tr><tr><td>1013</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2020-01-22</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>null</td></tr><tr><td>1015</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2020-01-22</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>null</td></tr><tr><td>1017</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2020-01-22</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>null</td></tr><tr><td>1019</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2020-01-22</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>null</td></tr><tr><td>1021</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2020-01-22</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>null</td></tr><tr><td>1023</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2020-01-22</td><td>null</td><td>20200514</td><td>113</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>null</td></tr><tr><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td></tr><tr><td>1111</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2022-11-17</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>6057.0</td><td>null</td><td>null</td><td>0.0</td></tr><tr><td>1113</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2022-11-17</td><td>null</td><td>20
221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>12125.0</td><td>null</td><td>null</td><td>0.0</td></tr><tr><td>1115</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2022-11-17</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>30679.0</td><td>null</td><td>null</td><td>0.0</td></tr><tr><td>1117</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2022-11-17</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>72537.0</td><td>null</td><td>null</td><td>0.0</td></tr><tr><td>1119</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2022-11-17</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>3003.0</td><td>null</td><td>null</td><td>0.0</td></tr><tr><td>1121</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2022-11-17</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>26982.0</td><td>null</td><td>null</td><td>0.0</td></tr><tr><td>1123</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2022-11-17</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>13678.0</td><td>null</td><td>null</td><td>0.0</td></tr><tr><td>1125</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2022-11-17</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>66506.0</td><td>null</td><td>null</td><td>0.0</td></tr><tr><td>1127</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2022-11-17</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>22519.0</td><td>null</td><td>null</td><td>0.0</td></tr><tr><td>1129</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2022-11-17</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>4190.0</td><td>null</td><t
d>null</td><td>0.0</td></tr><tr><td>1131</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2022-11-17</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>3457.0</td><td>null</td><td>null</td><td>0.0</td></tr><tr><td>1133</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2022-11-17</td><td>null</td><td>20221118</td><td>1</td><td>0</td><td>5</td><td>5</td><td>8877.0</td><td>null</td><td>null</td><td>0.0</td></tr></tbody></table></div>" | |
], | |
"text/plain": [ | |
"shape: (69_077, 16)\n", | |
"┌───────────┬──────────────┬──────────┬──────────┬───┬─────────┬────────┬─────────────┬────────────┐\n", | |
"│ geo_value ┆ signal ┆ source ┆ geo_type ┆ … ┆ value ┆ stderr ┆ sample_size ┆ value_diff │\n", | |
"│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", | |
"│ i64 ┆ str ┆ str ┆ str ┆ ┆ f64 ┆ str ┆ str ┆ f64 │\n", | |
"╞═══════════╪══════════════╪══════════╪══════════╪═══╪═════════╪════════╪═════════════╪════════════╡\n", | |
"│ 1001 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 0.0 ┆ null ┆ null ┆ null │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1003 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 0.0 ┆ null ┆ null ┆ null │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1005 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 0.0 ┆ null ┆ null ┆ null │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1007 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 0.0 ┆ null ┆ null ┆ null │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │\n", | |
"│ 1127 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 22519.0 ┆ null ┆ null ┆ 0.0 │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1129 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 4190.0 ┆ null ┆ null ┆ 0.0 │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1131 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 3457.0 ┆ null ┆ null ┆ 0.0 │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1133 ┆ confirmed_cu ┆ jhu-csse ┆ county ┆ … ┆ 8877.0 ┆ null ┆ null ┆ 0.0 │\n", | |
"│ ┆ mulative_num ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"└───────────┴──────────────┴──────────┴──────────┴───┴─────────┴────────┴─────────────┴────────────┘" | |
] | |
}, | |
"execution_count": 91, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"pldf = pldf.with_columns(\n", | |
" pl.col(\"time_value\").cast(\"str\").str.strptime(pl.Date, fmt=\"%Y%m%d\")\n", | |
").with_columns(\n", | |
" pl.col(\"value\").diff().over([\"source\", \"signal\", \"geo_value\"]).alias(\"value_diff\")\n", | |
")\n", | |
"pldf" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 99, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"pldf = pldf.with_columns(\n", | |
" pldf.groupby_rolling(\"time_value\", period=\"7d\", by=[\"source\", \"signal\", \"geo_value\"]).agg([\n", | |
" pl.col(\"value_diff\").mean().alias(\"value_diff_smooth\"),\n", | |
" pl.col(\"issue\").max().alias(\"issue\"),\n", | |
" pl.col(\"lag\").max().alias(\"lag\"),\n", | |
" ])\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 100, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div><style>\n", | |
".dataframe > thead > tr > th,\n", | |
".dataframe > tbody > tr > td {\n", | |
" text-align: right;\n", | |
"}\n", | |
"</style>\n", | |
"<small>shape: (31, 17)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>geo_value</th><th>signal</th><th>source</th><th>geo_type</th><th>time_type</th><th>time_value</th><th>direction</th><th>issue</th><th>lag</th><th>missing_value</th><th>missing_stderr</th><th>missing_sample_size</th><th>value</th><th>stderr</th><th>sample_size</th><th>value_diff</th><th>value_diff_smooth</th></tr><tr><td>i64</td><td>str</td><td>str</td><td>str</td><td>str</td><td>date</td><td>str</td><td>i64</td><td>i64</td><td>i64</td><td>i64</td><td>i64</td><td>f64</td><td>str</td><td>str</td><td>f64</td><td>f64</td></tr></thead><tbody><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-01</td><td>null</td><td>20210401</td><td>37</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>21.285714</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-02</td><td>null</td><td>20210401</td><td>36</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>22.857143</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-03</td><td>null</td><td>20210401</td><td>35</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>20.142857</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-04</td><td>null</td><td>20210401</td><td>34</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>17.285714</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-05</td><td>null</td><td>20210401</td><td>33</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>15.0</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-
03-06</td><td>null</td><td>20210401</td><td>32</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>13.714286</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-07</td><td>null</td><td>20210401</td><td>31</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>11.857143</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-08</td><td>null</td><td>20210401</td><td>30</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>13.428571</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-09</td><td>null</td><td>20210401</td><td>29</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>9.714286</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-10</td><td>null</td><td>20210401</td><td>28</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>12.428571</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-11</td><td>null</td><td>20210401</td><td>27</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>12.142857</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-12</td><td>null</td><td>20210401</td><td>26</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>10.857143</td></tr><tr><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td><td>…</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-20</td><td>null</td><td>20210401</
td><td>18</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>13.428571</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-21</td><td>null</td><td>20210401</td><td>17</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>12.428571</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-22</td><td>null</td><td>20210401</td><td>16</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>6.571429</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-23</td><td>null</td><td>20210401</td><td>15</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>7.285714</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-24</td><td>null</td><td>20210401</td><td>14</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>7.142857</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-25</td><td>null</td><td>20210401</td><td>13</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>6.428571</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-26</td><td>null</td><td>20210401</td><td>12</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>6.428571</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-27</td><td>null</td><td>20210401</td><td>11</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>7.428571</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-
28</td><td>null</td><td>20210401</td><td>10</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>8.142857</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-29</td><td>null</td><td>20210401</td><td>9</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>8.571429</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-30</td><td>null</td><td>20210401</td><td>8</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>7.857143</td></tr><tr><td>1001</td><td>"confirmed_cumu…</td><td>"jhu-csse"</td><td>"county"</td><td>"day"</td><td>2021-03-31</td><td>null</td><td>20210401</td><td>7</td><td>0</td><td>5</td><td>5</td><td>0.0</td><td>null</td><td>null</td><td>0.0</td><td>8.0</td></tr></tbody></table></div>" | |
], | |
"text/plain": [ | |
"shape: (31, 17)\n", | |
"┌───────────┬─────────────┬──────────┬──────────┬───┬────────┬───────────┬────────────┬────────────┐\n", | |
"│ geo_value ┆ signal ┆ source ┆ geo_type ┆ … ┆ stderr ┆ sample_si ┆ value_diff ┆ value_diff │\n", | |
"│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ ze ┆ --- ┆ _smooth │\n", | |
"│ i64 ┆ str ┆ str ┆ str ┆ ┆ str ┆ --- ┆ f64 ┆ --- │\n", | |
"│ ┆ ┆ ┆ ┆ ┆ ┆ str ┆ ┆ f64 │\n", | |
"╞═══════════╪═════════════╪══════════╪══════════╪═══╪════════╪═══════════╪════════════╪════════════╡\n", | |
"│ 1001 ┆ confirmed_c ┆ jhu-csse ┆ county ┆ … ┆ null ┆ null ┆ 0.0 ┆ 21.285714 │\n", | |
"│ ┆ umulative_n ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ ┆ um ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1001 ┆ confirmed_c ┆ jhu-csse ┆ county ┆ … ┆ null ┆ null ┆ 0.0 ┆ 22.857143 │\n", | |
"│ ┆ umulative_n ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ ┆ um ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1001 ┆ confirmed_c ┆ jhu-csse ┆ county ┆ … ┆ null ┆ null ┆ 0.0 ┆ 20.142857 │\n", | |
"│ ┆ umulative_n ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ ┆ um ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1001 ┆ confirmed_c ┆ jhu-csse ┆ county ┆ … ┆ null ┆ null ┆ 0.0 ┆ 17.285714 │\n", | |
"│ ┆ umulative_n ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ ┆ um ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │\n", | |
"│ 1001 ┆ confirmed_c ┆ jhu-csse ┆ county ┆ … ┆ null ┆ null ┆ 0.0 ┆ 8.142857 │\n", | |
"│ ┆ umulative_n ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ ┆ um ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1001 ┆ confirmed_c ┆ jhu-csse ┆ county ┆ … ┆ null ┆ null ┆ 0.0 ┆ 8.571429 │\n", | |
"│ ┆ umulative_n ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ ┆ um ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1001 ┆ confirmed_c ┆ jhu-csse ┆ county ┆ … ┆ null ┆ null ┆ 0.0 ┆ 7.857143 │\n", | |
"│ ┆ umulative_n ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ ┆ um ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ 1001 ┆ confirmed_c ┆ jhu-csse ┆ county ┆ … ┆ null ┆ null ┆ 0.0 ┆ 8.0 │\n", | |
"│ ┆ umulative_n ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"│ ┆ um ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", | |
"└───────────┴─────────────┴──────────┴──────────┴───┴────────┴───────────┴────────────┴────────────┘" | |
] | |
}, | |
"execution_count": 100, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"from datetime import date\n", | |
"\n", | |
"pldf.filter(\n", | |
" (pl.col(\"signal\") == \"confirmed_cumulative_num\") &\n", | |
" (pl.col(\"geo_value\") == 1001) &\n", | |
" (pl.col(\"time_value\").is_between(date(2021, 3, 1), date(2021, 3, 31)))\n", | |
")" | |
] | |
}, | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Line Profiling Python Code\n", | |
"\n", | |
"This section uses the [line_profiler package](https://github.com/pyutils/line_profiler) to measure how much time a function spends on each of its lines." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 105, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%load_ext line_profiler" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def test():\n", | |
" df = pd.DataFrame(row_dicts)\n", | |
" df[\"time_value\"] = pd.to_datetime(df[\"time_value\"], format=\"%Y%m%d\")\n", | |
" df[\"issue\"] = pd.to_datetime(df[\"issue\"], format=\"%Y%m%d\")\n", | |
" df[\"geo_value\"] = df[\"geo_value\"].astype(str).str.zfill(5).astype(\"category\")\n", | |
" df[\"value_diff\"] = df.groupby([\"source\", \"signal\", \"geo_value\"])[\"value\"].diff()\n", | |
" return df" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 124, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Timer unit: 1e-09 s\n", | |
"\n", | |
"Total time: 0.258499 s\n", | |
"File: /tmp/ipykernel_133357/225339769.py\n", | |
"Function: test at line 1\n", | |
"\n", | |
"Line # Hits Time Per Hit % Time Line Contents\n", | |
"==============================================================\n", | |
" 1 def test():\n", | |
" 2 1 182160020.0 182160020.0 70.5 df = pd.DataFrame(row_dicts)\n", | |
" 3 1 33241850.0 33241850.0 12.9 df[\"time_value\"] = pd.to_datetime(df[\"time_value\"], format=\"%Y%m%d\")\n", | |
" 4 1 2311023.0 2311023.0 0.9 df[\"issue\"] = pd.to_datetime(df[\"issue\"], format=\"%Y%m%d\")\n", | |
" 5 1 31815835.0 31815835.0 12.3 df[\"geo_value\"] = df[\"geo_value\"].astype(str).str.zfill(5).astype(\"category\")\n", | |
" 6 1 8969550.0 8969550.0 3.5 df[\"value_diff\"] = df.groupby([\"source\", \"signal\", \"geo_value\"])[\"value\"].diff()\n", | |
" 7 1 223.0 223.0 0.0 return df" | |
] | |
} | |
], | |
"source": [ | |
"%lprun -s -f test test()" | |
] | |
}, | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Python also ships with a built-in profiler, cProfile, which reports time per function call rather than per line. See: https://docs.python.org/3/library/profile.html#module-cProfile" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" 356543 function calls (356170 primitive calls) in 0.289 seconds\n", | |
"\n", | |
" Ordered by: cumulative time\n", | |
" List reduced from 674 to 34 due to restriction <0.05>\n", | |
"\n", | |
" ncalls tottime percall cumtime percall filename:lineno(function)\n", | |
" 1 0.002 0.002 0.289 0.289 /tmp/ipykernel_138759/225339769.py:1(test)\n", | |
" 1 0.000 0.000 0.197 0.197 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/frame.py:640(__init__)\n", | |
" 1 0.000 0.000 0.180 0.180 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/internals/construction.py:484(nested_data_to_arrays)\n", | |
" 1 0.000 0.000 0.180 0.180 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/internals/construction.py:775(to_arrays)\n", | |
" 1 0.000 0.000 0.090 0.090 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/internals/construction.py:886(_list_of_dict_to_arrays)\n", | |
" 36 0.089 0.002 0.089 0.002 {pandas._libs.lib.maybe_convert_objects}\n", | |
" 1 0.000 0.000 0.089 0.089 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/internals/construction.py:923(_finalize_columns_and_data)\n", | |
" 1 0.000 0.000 0.089 0.089 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/internals/construction.py:1001(convert_object_array)\n", | |
" 1 0.000 0.000 0.089 0.089 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/internals/construction.py:1067(<listcomp>)\n", | |
" 15 0.000 0.000 0.089 0.006 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/internals/construction.py:1023(convert)\n", | |
" 1 0.022 0.022 0.059 0.059 {pandas._libs.lib.fast_unique_multiple_list_gen}\n", | |
" 2 0.000 0.000 0.039 0.020 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:687(to_datetime)\n", | |
" 69078 0.032 0.000 0.037 0.000 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/internals/construction.py:910(<genexpr>)\n", | |
" 2 0.000 0.000 0.036 0.018 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:352(_convert_listlike_datetimes)\n", | |
" 2 0.000 0.000 0.035 0.017 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:473(_array_strptime_with_fallback)\n", | |
" 1 0.000 0.000 0.033 0.033 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/strings/accessor.py:120(wrapper)\n", | |
" 1 0.000 0.000 0.033 0.033 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/strings/accessor.py:1619(zfill)\n", | |
" 1 0.000 0.000 0.033 0.033 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/strings/object_array.py:44(_str_map)\n", | |
" 1 0.013 0.013 0.030 0.030 {pandas._libs.lib.map_infer_mask}\n", | |
" 1 0.024 0.024 0.024 0.024 {pandas._libs.lib.dicts_to_array}\n", | |
" 2 0.021 0.010 0.023 0.011 {pandas._libs.tslibs.strptime.array_strptime}\n", | |
" 1 0.001 0.001 0.018 0.018 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/internals/construction.py:97(arrays_to_mgr)\n", | |
" 69077 0.013 0.000 0.017 0.000 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/strings/accessor.py:1683(<lambda>)\n", | |
" 1 0.000 0.000 0.016 0.016 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/internals/managers.py:2119(create_block_manager_from_column_arrays)\n", | |
" 14 0.006 0.000 0.012 0.001 {built-in method builtins.any}\n", | |
" 1 0.000 0.000 0.010 0.010 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/internals/managers.py:1823(_consolidate_inplace)\n", | |
" 1 0.000 0.000 0.010 0.010 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/internals/managers.py:2262(_consolidate)\n", | |
" 3 0.006 0.002 0.010 0.003 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/internals/managers.py:2279(_merge_blocks)\n", | |
" 1 0.000 0.000 0.009 0.009 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:3822(diff)\n", | |
" 1 0.000 0.000 0.009 0.009 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:3776(shift)\n", | |
" 3 0.000 0.000 0.009 0.003 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/algorithms.py:595(factorize)\n", | |
" 3 0.000 0.000 0.008 0.003 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/algorithms.py:533(factorize_array)\n", | |
" 1 0.000 0.000 0.008 0.008 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/groupby/ops.py:871(group_info)\n", | |
" 1 0.000 0.000 0.008 0.008 /home/dskel/Documents/Code/Delphi/delphi-dev/venv/lib/python3.10/site-packages/pandas/core/groupby/ops.py:886(_get_compressed_codes)\n", | |
"\n", | |
"\n" | |
] | |
} | |
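], | |
"source": [ | |
"import cProfile\n", | |
"import pstats\n", | |
"\n", | |
"# The context manager enables the profiler on entry and disables it on exit,\n", | |
"# so explicit enable()/disable()/create_stats() calls are not needed.\n", | |
"with cProfile.Profile() as pr:\n", | |
"    test()\n", | |
"\n", | |
"# Print the top 5% of entries, sorted by cumulative time.\n", | |
"pstats.Stats(pr).sort_stats(\"cumtime\").print_stats(.05)" | |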
] | |
}, | |
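{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The listing above is hard to scan because every row carries a full `site-packages` path. A minimal sketch of how `pstats` can shorten the paths with `strip_dirs()` and restrict the report to matching rows. The workload `slow_sum` is a hypothetical stand-in, not part of this notebook:\n", | |
"\n", | |
"```python\n", | |
"import cProfile\n", | |
"import io\n", | |
"import pstats\n", | |
"\n", | |
"\n", | |
"def slow_sum(n):\n", | |
"    # Hypothetical workload, used only to give the profiler something to measure.\n", | |
"    total = 0\n", | |
"    for i in range(n):\n", | |
"        total += i * i\n", | |
"    return total\n", | |
"\n", | |
"\n", | |
"with cProfile.Profile() as pr:\n", | |
"    slow_sum(100_000)\n", | |
"\n", | |
"# Write the report into a string buffer instead of stdout.\n", | |
"buffer = io.StringIO()\n", | |
"stats = pstats.Stats(pr, stream=buffer)\n", | |
"\n", | |
"# strip_dirs() drops leading directories from file names;\n", | |
"# the string argument keeps only rows whose entry matches the regex.\n", | |
"stats.strip_dirs().sort_stats(pstats.SortKey.CUMULATIVE).print_stats('slow_sum')\n", | |
"\n", | |
"report = buffer.getvalue()\n", | |
"print(report)\n", | |
"```\n", | |
"\n", | |
"A string passed to `print_stats` is treated as a regular expression matched against each `file:line(function)` entry, which can help narrow a long report to one library or one function." | |
] | |
}, | |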
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## The Dis Module Lets You See Python Bytecode\n", | |
"\n", | |
"The `dis` module disassembles a function into the CPython bytecode that the interpreter executes. See: https://docs.python.org/3/library/dis.html\n", | |
"\n", | |
"The output is more useful the better you know the [Python bytecode instructions](https://docs.python.org/3/library/dis.html#python-bytecode-instructions), but even a quick read can occasionally hint at what is fast in Python and what is slow." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 75, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" 5 0 LOAD_FAST 0 (a)\n", | |
" 2 LOAD_FAST 1 (b)\n", | |
" 4 ROT_TWO\n", | |
" 6 STORE_FAST 1 (b)\n", | |
" 8 STORE_FAST 0 (a)\n", | |
"\n", | |
" 6 10 LOAD_FAST 0 (a)\n", | |
" 12 LOAD_FAST 1 (b)\n", | |
" 14 BUILD_TUPLE 2\n", | |
" 16 RETURN_VALUE\n" | |
] | |
} | |
], | |
"source": [ | |
"import dis\n", | |
"\n", | |
"def swap(a: int, b: int) -> tuple[int, int]:\n", | |
"    b, a = a, b\n", | |
"    return a, b\n", | |
"\n", | |
"dis.dis(swap)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 77, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" 2 0 LOAD_FAST 0 (a)\n", | |
" 2 STORE_FAST 2 (c)\n", | |
"\n", | |
" 3 4 LOAD_FAST 1 (b)\n", | |
" 6 STORE_FAST 0 (a)\n", | |
"\n", | |
" 4 8 LOAD_FAST 2 (c)\n", | |
" 10 STORE_FAST 1 (b)\n", | |
"\n", | |
" 5 12 LOAD_FAST 0 (a)\n", | |
" 14 LOAD_FAST 1 (b)\n", | |
" 16 BUILD_TUPLE 2\n", | |
" 18 RETURN_VALUE\n" | |
] | |
} | |
], | |
"source": [ | |
"def swap2(a: int, b: int) -> tuple[int, int]:\n", | |
"    c = a\n", | |
"    a = b\n", | |
"    b = c\n", | |
"    return a, b\n", | |
"\n", | |
"dis.dis(swap2)" | |
] | |
}, | |
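{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"One way to turn the two disassemblies above into a quick comparison is to count the instructions and time both versions. This is a minimal sketch: `instruction_count` is a hypothetical helper, not something from the `dis` module, and the functions are redefined so the block runs on its own. Counts and timings depend on the Python version and machine, so no output is shown:\n", | |
"\n", | |
"```python\n", | |
"import dis\n", | |
"import timeit\n", | |
"\n", | |
"\n", | |
"def swap(a, b):\n", | |
"    b, a = a, b\n", | |
"    return a, b\n", | |
"\n", | |
"\n", | |
"def swap2(a, b):\n", | |
"    c = a\n", | |
"    a = b\n", | |
"    b = c\n", | |
"    return a, b\n", | |
"\n", | |
"\n", | |
"def instruction_count(func):\n", | |
"    # dis.get_instructions yields one Instruction per bytecode operation.\n", | |
"    return sum(1 for _ in dis.get_instructions(func))\n", | |
"\n", | |
"\n", | |
"for func in (swap, swap2):\n", | |
"    seconds = timeit.timeit(lambda: func(1, 2), number=100_000)\n", | |
"    print(f'{func.__name__}: {instruction_count(func)} instructions, {seconds:.4f} s')\n", | |
"```\n", | |
"\n", | |
"A shorter instruction listing is only a hint, not a guarantee: individual instructions differ in cost, so it is worth confirming with a timing like the one above." | |
] | |
}, | |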
{ | |
"cell_type": "code", | |
"execution_count": 164, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"()" | |
] | |
}, | |
"execution_count": 164, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"swap.__code__.co_names" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 165, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"('a', 'b')" | |
] | |
}, | |
"execution_count": 165, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"swap.__code__.co_varnames" | |
] | |
}, | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## You Can Change The Meaning of 42\n", | |
"\n", | |
"Taken from: https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/\n", | |
"\n", | |
"CPython caches small integers (the values from -5 to 256), so every variable holding one of these values points to the same object. This is a performance optimization: a new object does not have to be allocated each time a common value appears." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 65, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"IntStruct(ob_digit=42, refcount=1541)" | |
] | |
}, | |
"execution_count": 65, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
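], | |
"source": [ | |
"import ctypes\n", | |
"\n", | |
"# Mirrors the in-memory layout of a CPython int object. The layout is an\n", | |
"# implementation detail and may differ between Python versions and platforms.\n", | |
"class IntStruct(ctypes.Structure):\n", | |
"    _fields_ = [(\"ob_refcnt\", ctypes.c_long),\n", | |
"                (\"ob_type\", ctypes.c_void_p),\n", | |
"                (\"ob_size\", ctypes.c_ulong),\n", | |
"                (\"ob_digit\", ctypes.c_long)]\n", | |
"\n", | |
"    def __repr__(self):\n", | |
"        return (\"IntStruct(ob_digit={self.ob_digit}, \"\n", | |
"                \"refcount={self.ob_refcnt})\").format(self=self)\n", | |
"\n", | |
"# id() returns the object's memory address in CPython, so this reads the\n", | |
"# cached 42 object directly.\n", | |
"num = 42\n", | |
"IntStruct.from_address(id(num))" | |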
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 74, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"\u001b[0;31mSignature:\u001b[0m \u001b[0mid\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m/\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", | |
"\u001b[0;31mDocstring:\u001b[0m\n", | |
"Return the identity of an object.\n", | |
"\n", | |
"This is guaranteed to be unique among simultaneously existing objects.\n", | |
"(CPython uses the object's memory address.)\n", | |
"\u001b[0;31mType:\u001b[0m builtin_function_or_method" | |
] | |
} | |
], | |
"source": [ | |
"?id" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 73, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"True" | |
] | |
}, | |
"execution_count": 73, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"x = 42\n", | |
"y = 42\n", | |
"id(x) == id(y)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 66, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"False" | |
] | |
}, | |
"execution_count": 66, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"x = 1234\n", | |
"y = 1234\n", | |
"id(x) == id(y)" | |
] | |
}, | |
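{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The two comparisons above can be extended to probe where the cache ends. A minimal sketch, assuming CPython (other interpreters may behave differently); `same_object` is a hypothetical helper that builds both integers at runtime so the compiler cannot reuse a single constant for them:\n", | |
"\n", | |
"```python\n", | |
"def same_object(n):\n", | |
"    # Build both ints at runtime (str -> int) so neither comes from a\n", | |
"    # shared compile-time constant.\n", | |
"    a = int(str(n))\n", | |
"    b = int(str(n))\n", | |
"    return a is b\n", | |
"\n", | |
"\n", | |
"for n in (-6, -5, 256, 257):\n", | |
"    print(n, same_object(n))\n", | |
"```\n", | |
"\n", | |
"On CPython this is expected to report the same object only for values inside the cached range, which is also why `is` should not be used to compare integers by value." | |
] | |
}, | |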
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# WARNING: Never do this! It overwrites the shared, cached 42 object for the rest of the interpreter session.\n", | |
"id42 = id(42)\n", | |
"iptr = IntStruct.from_address(id42)\n", | |
"iptr.ob_digit = 1 # now Python's 42 contains a 1!\n", | |
"\n", | |
"42 == 1\n", | |
"# True" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "venv", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.10.9" | |
}, | |
"orig_nbformat": 4 | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |