# A gist copy of my PyCon talk.
import streamlit as st
import pandas as pd
st.title("A Careful Walk Through Probability Distributions with Python")
st.markdown("""
_By Eric J. Ma, for PyCon 2020_
Hey there! Thanks for stopping by.
I made this streamlit app to explain what probability distributions are.
It's part of my talk delivered at the virtual PyCon 2020 conference.
I hope you enjoy it!
------
""")
st.header("Probability distributions as objects")
st.markdown("""
For a Pythonista, the best way to think about a probability distribution
is as a Python object.
It's an object that can be configured when instantiated,
and we can do a number of things with it.
Let's think about it in concrete terms
by studying the **Normal** or **Gaussian** distribution.
To keep the pedagogy clean, I'm going to borrow some machinery
from the SciPy stats library.
""")
st.markdown("""
### Initialization
The Normal distribution is canonically initialized with two parameters,
`mu` and `sigma`.
`mu` controls the "central tendency" of the distribution,
i.e. where on the number line it is centered,
while `sigma` controls the "spread" of the distribution,
i.e. how wide the Normal distribution is.
""")
with st.echo():
    from scipy.stats import norm

    class Normal:
        def __init__(self, mu, sigma):
            self.mu = mu
            self.sigma = sigma
            self.dist = norm(loc=mu, scale=sigma)
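st.markdown("""
For example, a "standard" Normal, centered at 0 with unit spread,
would be instantiated like this:
```python
standard = Normal(mu=0.0, sigma=1.0)
```
""")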
st.markdown("""
## Probability Density Function
A probability distribution also contains
one function called the "probability density/mass function"
for continuous and discrete distributions respectively.
By definition, __area__ under the PDF/PMF must sum to 1.
This function has a characteristic shape,
such as the bell shape for a Gaussian,
and is defined over the number line for all valid values of the distribution
(also known as the support).
For example, the Gaussian has "support" from $-\infty$ to $+\infty$,
so the distribution is valid
for values of from negative to positive infinity.
The `logpdf` is nothing more than the log transform of the `pdf`.
""")
with st.echo():
    from scipy.stats import norm

    class Normal:
        def __init__(self, mu, sigma):
            self.mu = mu
            self.sigma = sigma
            self.dist = norm(loc=mu, scale=sigma)

        def pdf(self, x):
            """Evaluate likelihood of data `x` given the distribution."""
            return self.dist.pdf(x)

        def logpdf(self, x):
            """Evaluate log likelihood of data `x` given the distribution."""
            return self.dist.logpdf(x)
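st.markdown("""
As a quick sanity check, the standard Normal's density at its center
is $1/\sqrt{2\pi}$, which is about 0.3989:
```python
standard = Normal(mu=0.0, sigma=1.0)
standard.pdf(0.0)     # ~0.3989
standard.logpdf(0.0)  # ~-0.9189, i.e. log(0.3989)
```
""")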
st.markdown("""
### Drawing Numbers
The final piece we'll introduce is the ability to draw numbers from the distribution.
More pedantically, we're drawing numbers from the _support_ of the distribution,
i.e. only values that are valid for the distribution.
""")
with st.echo():
    from scipy.stats import norm

    class Normal:
        def __init__(self, mu, sigma):
            self.mu = mu
            self.sigma = sigma
            self.dist = norm(loc=mu, scale=sigma)

        def pdf(self, x):
            """Evaluate likelihood of data `x` given the distribution."""
            return self.dist.pdf(x)

        def logpdf(self, x):
            """Evaluate log likelihood of data `x` given the distribution."""
            return self.dist.logpdf(x)

        def draw(self, n):
            """Draw n values from the distribution."""
            return self.dist.rvs(n)
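st.markdown("""
Drawing five values from a standard Normal might look like this
(your numbers will differ, since the draws are random):
```python
standard = Normal(mu=0.0, sigma=1.0)
standard.draw(5)  # e.g. array([ 0.12, -1.05,  0.33,  1.71, -0.48])
```
""")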
st.markdown("""
## Visual Exploration
Enough said! Let's explore these ideas in a visual fashion now.
First off, go ahead and configure a Gaussian using the widgets on the sidebar!
""")
st.sidebar.markdown("## Configure your gaussian")
mu = st.sidebar.slider("mu: central tendency", min_value=-5., max_value=5., value=0., step=0.1)
sigma = st.sidebar.slider("sigma: spread", min_value=0.1, max_value=10.0, value=1.0, step=0.1)
st.sidebar.markdown("------")
gaussian = Normal(mu, sigma)
import numpy as np
import matplotlib.pyplot as plt
def plot_gaussian(gaussian):
    fig, ax = plt.subplots()
    boundwidth = 0.001
    minval, maxval = gaussian.dist.ppf([boundwidth, 1 - boundwidth])
    xs = np.linspace(minval, maxval, 1000)
    ys = gaussian.pdf(xs)
    data = pd.DataFrame({"x": xs, "pdf": ys})
    ax = data.plot(x="x", y="pdf", ax=ax)
    ax.set_xlim(data["x"].min() - 1, data["x"].max() + 1)
    ax.set_ylim(0, data["pdf"].max() * 2)
    ax.set_ylabel("Probability density function")
    ax.set_title("Your bespoke Gaussian!")
    ax.set_xlabel("support (or 'x')")
    return fig, ax, minval, maxval
fig, ax, minval, maxval = plot_gaussian(gaussian)
st.pyplot(fig)
"""### Probability and Likelihood
As mentioned above,
the Normal distribution has a probability density function.
That's the blue curve you see in the plot.
By definition,
the area under the probability density/mass function sums to one,
and so for a continuous distribution like the Gaussian,
"probability" is strictly defined _only_ for ranges,
and not for individual values.
Go ahead and calculate the total probability
for a range of x-values.
In particular, see how changing the width
changes the total probability of the x-values.
(In particular, bring the width to 0!)
"""
lower_quartile, upper_quartile = gaussian.dist.ppf([0.25, 0.75])
minrange, maxrange = st.slider("Range on the x-axis", min_value=minval, max_value=maxval, value=[lower_quartile, upper_quartile])
total_probability = gaussian.dist.cdf(maxrange) - gaussian.dist.cdf(minrange)
xs = np.linspace(minrange, maxrange, 1001)
fill_data = pd.DataFrame({
    "x": xs,
    "pdf": gaussian.pdf(xs),
    "lowerbound": np.zeros(len(xs)),
})
fig2, ax2, minval, maxval = plot_gaussian(gaussian)
ax2.set_title(f"Total Probability: {total_probability:.2f}")
ax2.fill_between(fill_data["x"], fill_data["lowerbound"], fill_data["pdf"], color="red")
st.pyplot(fig2)
st.markdown(f"""
On the other hand, we also use the probability density function
as a way of expressing "likelihood".
What's "likelihood"?
Basically, it's the "credibility points" idea
associated with a single number on the x-axis.
The higher the likelihood, the greater the amount of credibility points.
Likelihood, as expressed by the probability density function,
is not the **area** under the curve,
but the **height** of the curve.
As such, we can talk about the **likelihood of a single point**.
Go try it out below!
""")
fig2, ax2, minval, maxval = plot_gaussian(gaussian)
x_value = st.slider("Pick a value x to calculate its likelihood:", min_value=minval, max_value=maxval, value=(minval + maxval) / 2, step=(maxval - minval)/100)
like = gaussian.pdf(x_value)
ax2.vlines(x=x_value, ymin=0, ymax=like, color="red")
ax2.set_title(f"Likelihood: {like:.2f}")
st.pyplot(fig2)
st.markdown("""
These two demos hopefully make clear to you that we can never really speak of
the "probability of a number" in a continuous distribution.
However, we can talk about the "likelihood of a number".
### Drawing numbers from a probability distribution
Turns out, one other thing you can do with probability distributions
is to draw numbers from them.
Yes, they are random number generators!
(More generally, they are also called "generative models" of data.)
We can generate a sequence of draws;
all we have to do is call the `.draw()` method:
```python
xs = distribution.draw(1000)
```
Go ahead and request a number of "draws" from your Gaussian!
""")
num_draws = st.number_input("Number of draws", min_value=0, max_value=2000, value=0, step=250)
draws = gaussian.draw(num_draws)
fig3, ax3, minval, maxval = plot_gaussian(gaussian)
ax3.vlines(x=draws, ymin=0, ymax=gaussian.pdf(draws), alpha=0.1)
st.pyplot(fig3)
st.markdown("""
Notice how, as you increase the number of draws,
the draws concentrate in the regions that have
a higher PDF.
In other words, there is a higher "density" of points
drawn in those regions.
-------
""")
st.markdown("""
## Inferring parameter values given data
Previously, you saw "probability" as the area under the curve.
"Likelihood", however, is the _height_ of the curve.
We can calculate the likelihood of a single data point
by taking its value on the x-axis and drawing a line up to the height of the curve.
The total likelihood of multiple data points
is the product of all of their likelihoods.
That said, the numbers can get small really quickly on a computer,
so to prevent underflow, we typically take the sum of log likelihoods
rather than the product of likelihoods.
Using our distribution object above, it literally is nothing more than calling:
```python
distribution.logpdf(xs).sum()
```
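(To see why underflow matters: a typical likelihood for a standard Normal
is around 0.4, and $0.4^{1000}$ is about $10^{-398}$,
which is smaller than the smallest positive 64-bit float.
The corresponding sum of log likelihoods is a perfectly manageable -916 or so.)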
Let's play a game around this point.
I've got data for you, in the form of "draws" from a Gaussian.
However, I'm not telling you what the configuration of that Gaussian is.
Can you infer what the most likely configuration is?
Here are the values:
""")
data = [0.89, 0.81, 0.54, 0.50, -0.11, 1.19]
st.write(data)
st.markdown("And here's a Gaussian for you to play with. (See the sidebar.)")
mu = st.number_input("mu", min_value=-5., max_value=5., value=0., step=0.1)
sigma = st.number_input("sigma", min_value=0.1, max_value=10., value=1.0, step=0.1)
gaussian = Normal(mu, sigma)
fig4, ax4 = plt.subplots()
boundwidth = 0.001
minval, maxval = gaussian.dist.ppf([boundwidth, 1 - boundwidth])
xs = np.linspace(minval, maxval, 1000)
ys = gaussian.pdf(xs)
ax4.plot(xs, ys)
likelihoods = gaussian.pdf(data)
loglikes = gaussian.logpdf(data)
ax4.vlines(x=data, ymin=0, ymax=likelihoods)
ax4.scatter(x=data, y=likelihoods, color="red", marker="o")
ax4.set_ylim(0, max(ys) * 2)
ax4.set_title(f"Total log-likelihood: {loglikes.sum():.3f}")
st.pyplot(fig4)
st.markdown(
"""
If you tried to maximize the log likelihood,
then you were attempting "maximum likelihood estimation"!
This is a natural way of approaching estimation of parameters
given data.
""")
st.markdown(
"""
## What more from here?
As I wrote above, I have adopted Bayesian methods in my day-to-day work at NIBR.
Everything shown in this app has a natural extension to Bayesian methods.
If you're curious how Bayesian methods work computationally,
check out [this essay that I wrote][compbayes].
[compbayes]: https://ericmjl.github.io/essays-on-data-science/machine-learning/computational-bayesian-stats/
"""
)
st.sidebar.markdown("""
Made with ❤️ using [matplotlib][mpl], [numpy][np], [scipy stats library][sp], and [streamlit][st].
[mpl]: https://matplotlib.org/
[np]: https://numpy.org/doc/
[sp]: https://scipy.github.io/devdocs/stats.html#module-scipy.stats
[st]: https://docs.streamlit.io/
Check out the source of this app [on GitHub][gh].
Copyright 2020-present, Eric J. Ma.
[gh]: https://github.com/ericmjl/what-are-probability-distributions
""")
st.markdown(
"""
If you made it this far, congratulations!
If you liked this content and want to support my efforts creating
programming-oriented educational material for data scientists,
then please [send me a cup of coffee on Patreon][patreon]!
[patreon]: https://www.patreon.com/ericmjl
Meanwhile, I send out a programmer-oriented data science newsletter every month.
If you'd like to receive a curated newsletter of tools, tips and techniques,
come sign up at [TinyLetter][tinyletter].
[tinyletter]: https://tinyletter.com/ericmjl
"""
)
st.sidebar.markdown("""
Support me on [Patreon][patreon]!
[patreon]: https://patreon.com/ericmjl
""")
finished = st.button("Click me!")
if finished:
    st.balloons()