Last active
July 18, 2022 22:46
-
-
Save tupui/9c61591395da3a0b008e4be0ebb465d7 to your computer and use it in GitHub Desktop.
Improving random sampling in Python
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"# Improving random sampling in Python\n", | |
"\n", | |
"Demo by Pamphile T. Roy for SciPy2022\n", | |
"\n", | |
"SPDX-License-Identifier: MIT" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%% md\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"Since SciPy 1.7, we introduced a QMC submodule in `scipy.stats.qmc`.\n", | |
"Here are some things we can do with it!\n", | |
"\n", | |
"A tutorial can be found in the documentation at:\n", | |
"https://scipy.github.io/devdocs/tutorial/stats.html#quasi-monte-carlo\n", | |
"\n", | |
"Let's check our version of SciPy and play with the QMC submodule." | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%% md\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"outputs": [], | |
"source": [ | |
"# ensure the most recent version of SciPy is installed\n", | |
"!pip install scipy>=1.7" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%%\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"outputs": [], | |
"source": [ | |
"from scipy.stats import qmc" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%%\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"First let's define a `rng` object to make the following reproducible. Please don't use this in production code. See the section bellow." | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%% md\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"\n", | |
"# seed = np.random.SeedSequence().entropy\n", | |
"# print(seed)\n", | |
"seed = 292114020772849599029278515437886320941\n", | |
"rng = np.random.default_rng(seed)" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%%\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Basics" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%% md\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"There are multiple `QMCEngine` that can be used to sample points in $\\mathcal{U} \\sim [0, 1)^d$." | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%% md\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"outputs": [], | |
"source": [ | |
"n, d = 64, 2\n", | |
"lhs_engine = qmc.LatinHypercube(d=d, centered=True, seed=rng) # centered to make nice bins\n", | |
"sample = lhs_engine.random(n=n)" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%%\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"outputs": [], | |
"source": [ | |
"import seaborn as sns\n", | |
"import pandas as pd\n", | |
"sns.pairplot(pd.DataFrame(sample), diag_kind=\"hist\", corner=True, diag_kws={\"bins\": 4})" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%%\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"With the samples, you can further compute quality metrics such as the discrepancy and also scale to bounds." | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%% md\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"outputs": [], | |
"source": [ | |
"qmc.discrepancy(sample)" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%%\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"outputs": [], | |
"source": [ | |
"scaled_sample = qmc.scale(sample, l_bounds=[-1, 10], u_bounds=[4, 22])\n", | |
"sns.pairplot(pd.DataFrame(scaled_sample), diag_kind=\"hist\", corner=True, diag_kws={\"bins\": 4})" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%%\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"The start of the show is the Sobol' sampler:" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%% md\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"outputs": [], | |
"source": [ | |
"sobol_engine = qmc.Sobol(d=d, scramble=False, seed=rng)\n", | |
"sample = sobol_engine.random(n=n)\n", | |
"qmc.discrepancy(sample)" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%%\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"outputs": [], | |
"source": [ | |
"sns.pairplot(pd.DataFrame(sample), diag_kind=\"hist\", corner=True, diag_kws={\"bins\": 4})" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%%\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"outputs": [], | |
"source": [ | |
"sobol_engine = qmc.Sobol(d=d, scramble=True, seed=rng)\n", | |
"sample = sobol_engine.random(n=n)\n", | |
"qmc.discrepancy(sample)" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%%\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"We spent quite some time on Sobol'. First to ensure it starts with 0, and then to have some warnings if users do not ask for $2^n$ points. Who would dare doing this!?" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%% md\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"outputs": [], | |
"source": [ | |
"sobol_engine = qmc.Sobol(d=3, scramble=False, seed=rng)\n", | |
"sample = sobol_engine.random(10)\n", | |
"sample" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%%\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Sample from another distribution\n", | |
"\n", | |
"The following is also new to SciPy. We can sample from any arbitrary distribution using a MC or QMC engine." | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%% md\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"Just defining a helper function to do a nice plot of the PDF" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%% md\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"import matplotlib.pyplot as plt\n", | |
"\n", | |
"\n", | |
"def plot_pdf(dist, sample):\n", | |
" fig, ax = plt.subplots()\n", | |
"\n", | |
" x = np.linspace(dist.ppf(0.01), dist.ppf(0.99), 100)\n", | |
" pdf = dist.pdf(x)\n", | |
" ax.plot(x, pdf, '-', lw=5, label='fisk pdf')\n", | |
"\n", | |
" delta = np.max(pdf) * 5e-2\n", | |
" ax.plot(sample, -delta - delta * rng.random(len(sample)), \"+k\")\n", | |
"\n", | |
" ax.hist(sample, density=True, histtype='stepfilled', alpha=0.2)\n", | |
" ax.legend(loc='best', frameon=False)\n", | |
" plt.show()" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%%\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"First we sample from an arbitrary distribution-here Fisk-using MC." | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%% md\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"outputs": [], | |
"source": [ | |
"from scipy.stats import fisk\n", | |
"\n", | |
"c = 3.9\n", | |
"dist = fisk(c)\n", | |
"\n", | |
"sample = dist.rvs(128, random_state=rng)\n", | |
"plot_pdf(dist, sample)" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%%\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"And now we can use `NumericalInverseHermite` to use any `QMCEngine`." | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%% md\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"outputs": [], | |
"source": [ | |
"from scipy.stats.sampling import NumericalInverseHermite\n", | |
"\n", | |
"fni = NumericalInverseHermite(dist)\n", | |
"sobol_engine = qmc.Sobol(d=1, scramble=True, seed=rng)\n", | |
"sample = fni.qrvs(128, qmc_engine=sobol_engine)\n", | |
"\n", | |
"\n", | |
"plot_pdf(dist, sample)" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%%\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"There is also `NumericalInverseHermite.rvs` which uses MC." | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%% md\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Integration convergence\n", | |
"\n", | |
"We all like convergence plots don't we?\n", | |
"\n", | |
"In the following, I evaluate a squared sum in 3 dimension as an example." | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%% md\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"outputs": [], | |
"source": [ | |
"from collections import namedtuple\n", | |
"\n", | |
"import numpy as np\n", | |
"import matplotlib.pyplot as plt\n", | |
"from scipy.stats import qmc\n", | |
"\n", | |
"n_conv = 99\n", | |
"min_m, max_m = 4, 11\n", | |
"ns_gen = 2 ** np.arange(min_m, max_m)\n", | |
"\n", | |
"\n", | |
"def art_2(sample):\n", | |
" \"\"\"3-D squared sum.\n", | |
"\n", | |
" True value 5/3 + 5*(5 - 1)/4\n", | |
"\n", | |
" Art B. Owen. On dropping the first Sobol' point. arXiv 2008.08051, 2020.\n", | |
" \"\"\"\n", | |
" return np.sum(sample, axis=1) ** 2\n", | |
"\n", | |
"\n", | |
"functions = namedtuple('functions', ['name', 'func', 'dim', 'ref'])\n", | |
"case = functions('Art 2', art_2, 5, 5 / 3 + 5 * (5 - 1) / 4)\n", | |
"\n", | |
"\n", | |
"def conv_method(sampler, func, n_samples, n_conv, ref):\n", | |
" samples = [sampler(n_samples) for _ in range(n_conv)]\n", | |
" samples = np.array(samples)\n", | |
"\n", | |
" evals = [np.sum(func(sample)) / n_samples for sample in samples]\n", | |
" squared_errors = (ref - np.array(evals)) ** 2\n", | |
" rmse = (np.sum(squared_errors) / n_conv) ** 0.5\n", | |
"\n", | |
" return rmse\n", | |
"\n", | |
"\n", | |
"# Analysis\n", | |
"sample_mc_rmse = []\n", | |
"sample_sobol_s_rmse = []\n", | |
"sample_sobol_rmse = []\n", | |
"\n", | |
"for ns in ns_gen:\n", | |
" # Monte Carlo\n", | |
" sampler_mc = lambda x: rng.random((x, case.dim))\n", | |
" conv_res = conv_method(sampler_mc, case.func, ns, n_conv, case.ref)\n", | |
" sample_mc_rmse.append(conv_res)\n", | |
"\n", | |
" # Sobol'\n", | |
" engine = qmc.Sobol(d=case.dim, scramble=False)\n", | |
" conv_res = conv_method(engine.random, case.func, ns, 1, case.ref)\n", | |
" sample_sobol_rmse.append(conv_res)\n", | |
"\n", | |
" engine = qmc.Sobol(d=case.dim, scramble=True)\n", | |
" conv_res = conv_method(engine.random, case.func, ns, n_conv, case.ref)\n", | |
" sample_sobol_s_rmse.append(conv_res)\n", | |
"\n", | |
"sample_mc_rmse = np.array(sample_mc_rmse)\n", | |
"sample_sobol_rmse = np.array(sample_sobol_rmse)\n", | |
"sample_sobol_s_rmse = np.array(sample_sobol_s_rmse)\n", | |
"\n", | |
"# Plot\n", | |
"fig, ax = plt.subplots()\n", | |
"\n", | |
"\n", | |
"# MC\n", | |
"ratio = sample_mc_rmse[0] / ns_gen[0] ** (-1 / 2)\n", | |
"ax.plot(ns_gen, ns_gen ** (-1 / 2) * ratio, ls='-', c='k')\n", | |
"\n", | |
"ax.scatter(ns_gen, sample_mc_rmse, label=\"MC\")\n", | |
"\n", | |
"# Sobol'\n", | |
"ratio = sample_sobol_rmse[0] / ns_gen[0] ** (-2/2)\n", | |
"ax.plot(ns_gen, ns_gen ** (-2/2) * ratio, ls='-', c='k')\n", | |
"\n", | |
"ratio = sample_sobol_s_rmse[0] / ns_gen[0] ** (-4/2)\n", | |
"ax.plot(ns_gen, ns_gen ** (-4/2) * ratio, ls='-', c='k')\n", | |
"\n", | |
"ax.scatter(ns_gen, sample_sobol_rmse, label=\"Sobol' unscrambled\")\n", | |
"ax.scatter(ns_gen, sample_sobol_s_rmse, label=\"Sobol' scrambled\")\n", | |
"\n", | |
"ax.set_xlabel(r'$N_s$')\n", | |
"ax.set_xscale('log')\n", | |
"ax.set_xticks(ns_gen)\n", | |
"ax.set_xticklabels([fr'$2^{{{ns}}}$'\n", | |
" for ns in np.arange(min_m, max_m)])\n", | |
"\n", | |
"ax.set_ylabel(r'$\\log (\\epsilon)$')\n", | |
"ax.set_yscale('log')\n", | |
"\n", | |
"ax.legend(loc='lower left')\n", | |
"fig.tight_layout()\n", | |
"plt.show()\n" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%%\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Random number generators and seed\n", | |
"\n", | |
"I have a final note on RNG with NumPy. You may have seen or used lot of things like `np.random.seed(123)` or `np.random.RandomState(123)`. Please don't.\n", | |
"\n", | |
"There are 3 problems:\n", | |
"\n", | |
"* Using a fix seed is in general dangerous in a sense that you can forget it and then instead of having a RNG, well it's not anymore random and hence just a MC draw. Also small values, such as commonly used 0, 123, etc. tend to produce bad entropy for the generator.\n", | |
"* Using `np.random.seed(...)` fix the global seed. But new code are not relying on this at all. Hence you might only fix a portion of the code.\n", | |
"* `np.random.RandomState` should not be used for new code because it's slower and has statistical issues. New code should use `np.random.Generator` which is better in every way.\n", | |
"\n", | |
"Have a look at the documentation for more details:\n", | |
"https://scipy.github.io/devdocs/tutorial/stats.html#random-number-generation" | |
], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%% md\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"outputs": [], | |
"source": [], | |
"metadata": { | |
"collapsed": false, | |
"pycharm": { | |
"name": "#%%\n" | |
} | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"colab": { | |
"collapsed_sections": [], | |
"include_colab_link": true, | |
"name": "mcm2021_scipy_demo.ipynb", | |
"provenance": [], | |
"toc_visible": true | |
}, | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.10.5" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment