Skip to content

Instantly share code, notes, and snippets.

@tupui
Last active July 18, 2022 22:46
Show Gist options
  • Save tupui/9c61591395da3a0b008e4be0ebb465d7 to your computer and use it in GitHub Desktop.
Save tupui/9c61591395da3a0b008e4be0ebb465d7 to your computer and use it in GitHub Desktop.
Improving random sampling in Python
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Improving random sampling in Python\n",
"\n",
"Demo by Pamphile T. Roy for SciPy2022\n",
"\n",
"SPDX-License-Identifier: MIT"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Since SciPy 1.7, we introduced a QMC submodule in `scipy.stats.qmc`.\n",
"Here are some things we can do with it!\n",
"\n",
"A tutorial can be found in the documentation at:\n",
"https://scipy.github.io/devdocs/tutorial/stats.html#quasi-monte-carlo\n",
"\n",
"Let's check our version of SciPy and play with the QMC submodule."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# ensure the most recent version of SciPy is installed\n",
"!pip install scipy>=1.7"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from scipy.stats import qmc"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"First let's define a `rng` object to make the following reproducible. Please don't use this in production code. See the section bellow."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# seed = np.random.SeedSequence().entropy\n",
"# print(seed)\n",
"seed = 292114020772849599029278515437886320941\n",
"rng = np.random.default_rng(seed)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Basics"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"There are multiple `QMCEngine` that can be used to sample points in $\\mathcal{U} \\sim [0, 1)^d$."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"n, d = 64, 2\n",
"lhs_engine = qmc.LatinHypercube(d=d, centered=True, seed=rng) # centered to make nice bins\n",
"sample = lhs_engine.random(n=n)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"import seaborn as sns\n",
"import pandas as pd\n",
"sns.pairplot(pd.DataFrame(sample), diag_kind=\"hist\", corner=True, diag_kws={\"bins\": 4})"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"With the samples, you can further compute quality metrics such as the discrepancy and also scale to bounds."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"qmc.discrepancy(sample)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"scaled_sample = qmc.scale(sample, l_bounds=[-1, 10], u_bounds=[4, 22])\n",
"sns.pairplot(pd.DataFrame(scaled_sample), diag_kind=\"hist\", corner=True, diag_kws={\"bins\": 4})"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"The start of the show is the Sobol' sampler:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"sobol_engine = qmc.Sobol(d=d, scramble=False, seed=rng)\n",
"sample = sobol_engine.random(n=n)\n",
"qmc.discrepancy(sample)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"sns.pairplot(pd.DataFrame(sample), diag_kind=\"hist\", corner=True, diag_kws={\"bins\": 4})"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"sobol_engine = qmc.Sobol(d=d, scramble=True, seed=rng)\n",
"sample = sobol_engine.random(n=n)\n",
"qmc.discrepancy(sample)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"We spent quite some time on Sobol'. First to ensure it starts with 0, and then to have some warnings if users do not ask for $2^n$ points. Who would dare doing this!?"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"sobol_engine = qmc.Sobol(d=3, scramble=False, seed=rng)\n",
"sample = sobol_engine.random(10)\n",
"sample"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Sample from another distribution\n",
"\n",
"The following is also new to SciPy. We can sample from any arbitrary distribution using a MC or QMC engine."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Just defining a helper function to do a nice plot of the PDF"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"\n",
"def plot_pdf(dist, sample):\n",
" fig, ax = plt.subplots()\n",
"\n",
" x = np.linspace(dist.ppf(0.01), dist.ppf(0.99), 100)\n",
" pdf = dist.pdf(x)\n",
" ax.plot(x, pdf, '-', lw=5, label='fisk pdf')\n",
"\n",
" delta = np.max(pdf) * 5e-2\n",
" ax.plot(sample, -delta - delta * rng.random(len(sample)), \"+k\")\n",
"\n",
" ax.hist(sample, density=True, histtype='stepfilled', alpha=0.2)\n",
" ax.legend(loc='best', frameon=False)\n",
" plt.show()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"First we sample from an arbitrary distribution-here Fisk-using MC."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from scipy.stats import fisk\n",
"\n",
"c = 3.9\n",
"dist = fisk(c)\n",
"\n",
"sample = dist.rvs(128, random_state=rng)\n",
"plot_pdf(dist, sample)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"And now we can use `NumericalInverseHermite` to use any `QMCEngine`."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from scipy.stats.sampling import NumericalInverseHermite\n",
"\n",
"fni = NumericalInverseHermite(dist)\n",
"sobol_engine = qmc.Sobol(d=1, scramble=True, seed=rng)\n",
"sample = fni.qrvs(128, qmc_engine=sobol_engine)\n",
"\n",
"\n",
"plot_pdf(dist, sample)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"There is also `NumericalInverseHermite.rvs` which uses MC."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Integration convergence\n",
"\n",
"We all like convergence plots don't we?\n",
"\n",
"In the following, I evaluate a squared sum in 3 dimension as an example."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from collections import namedtuple\n",
"\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from scipy.stats import qmc\n",
"\n",
"n_conv = 99\n",
"min_m, max_m = 4, 11\n",
"ns_gen = 2 ** np.arange(min_m, max_m)\n",
"\n",
"\n",
"def art_2(sample):\n",
" \"\"\"3-D squared sum.\n",
"\n",
" True value 5/3 + 5*(5 - 1)/4\n",
"\n",
" Art B. Owen. On dropping the first Sobol' point. arXiv 2008.08051, 2020.\n",
" \"\"\"\n",
" return np.sum(sample, axis=1) ** 2\n",
"\n",
"\n",
"functions = namedtuple('functions', ['name', 'func', 'dim', 'ref'])\n",
"case = functions('Art 2', art_2, 5, 5 / 3 + 5 * (5 - 1) / 4)\n",
"\n",
"\n",
"def conv_method(sampler, func, n_samples, n_conv, ref):\n",
" samples = [sampler(n_samples) for _ in range(n_conv)]\n",
" samples = np.array(samples)\n",
"\n",
" evals = [np.sum(func(sample)) / n_samples for sample in samples]\n",
" squared_errors = (ref - np.array(evals)) ** 2\n",
" rmse = (np.sum(squared_errors) / n_conv) ** 0.5\n",
"\n",
" return rmse\n",
"\n",
"\n",
"# Analysis\n",
"sample_mc_rmse = []\n",
"sample_sobol_s_rmse = []\n",
"sample_sobol_rmse = []\n",
"\n",
"for ns in ns_gen:\n",
" # Monte Carlo\n",
" sampler_mc = lambda x: rng.random((x, case.dim))\n",
" conv_res = conv_method(sampler_mc, case.func, ns, n_conv, case.ref)\n",
" sample_mc_rmse.append(conv_res)\n",
"\n",
" # Sobol'\n",
" engine = qmc.Sobol(d=case.dim, scramble=False)\n",
" conv_res = conv_method(engine.random, case.func, ns, 1, case.ref)\n",
" sample_sobol_rmse.append(conv_res)\n",
"\n",
" engine = qmc.Sobol(d=case.dim, scramble=True)\n",
" conv_res = conv_method(engine.random, case.func, ns, n_conv, case.ref)\n",
" sample_sobol_s_rmse.append(conv_res)\n",
"\n",
"sample_mc_rmse = np.array(sample_mc_rmse)\n",
"sample_sobol_rmse = np.array(sample_sobol_rmse)\n",
"sample_sobol_s_rmse = np.array(sample_sobol_s_rmse)\n",
"\n",
"# Plot\n",
"fig, ax = plt.subplots()\n",
"\n",
"\n",
"# MC\n",
"ratio = sample_mc_rmse[0] / ns_gen[0] ** (-1 / 2)\n",
"ax.plot(ns_gen, ns_gen ** (-1 / 2) * ratio, ls='-', c='k')\n",
"\n",
"ax.scatter(ns_gen, sample_mc_rmse, label=\"MC\")\n",
"\n",
"# Sobol'\n",
"ratio = sample_sobol_rmse[0] / ns_gen[0] ** (-2/2)\n",
"ax.plot(ns_gen, ns_gen ** (-2/2) * ratio, ls='-', c='k')\n",
"\n",
"ratio = sample_sobol_s_rmse[0] / ns_gen[0] ** (-4/2)\n",
"ax.plot(ns_gen, ns_gen ** (-4/2) * ratio, ls='-', c='k')\n",
"\n",
"ax.scatter(ns_gen, sample_sobol_rmse, label=\"Sobol' unscrambled\")\n",
"ax.scatter(ns_gen, sample_sobol_s_rmse, label=\"Sobol' scrambled\")\n",
"\n",
"ax.set_xlabel(r'$N_s$')\n",
"ax.set_xscale('log')\n",
"ax.set_xticks(ns_gen)\n",
"ax.set_xticklabels([fr'$2^{{{ns}}}$'\n",
" for ns in np.arange(min_m, max_m)])\n",
"\n",
"ax.set_ylabel(r'$\\log (\\epsilon)$')\n",
"ax.set_yscale('log')\n",
"\n",
"ax.legend(loc='lower left')\n",
"fig.tight_layout()\n",
"plt.show()\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Random number generators and seed\n",
"\n",
"I have a final note on RNG with NumPy. You may have seen or used lot of things like `np.random.seed(123)` or `np.random.RandomState(123)`. Please don't.\n",
"\n",
"There are 3 problems:\n",
"\n",
"* Using a fix seed is in general dangerous in a sense that you can forget it and then instead of having a RNG, well it's not anymore random and hence just a MC draw. Also small values, such as commonly used 0, 123, etc. tend to produce bad entropy for the generator.\n",
"* Using `np.random.seed(...)` fix the global seed. But new code are not relying on this at all. Hence you might only fix a portion of the code.\n",
"* `np.random.RandomState` should not be used for new code because it's slower and has statistical issues. New code should use `np.random.Generator` which is better in every way.\n",
"\n",
"Have a look at the documentation for more details:\n",
"https://scipy.github.io/devdocs/tutorial/stats.html#random-number-generation"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"include_colab_link": true,
"name": "mcm2021_scipy_demo.ipynb",
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment