# -*- coding: utf-8 -*-
# <nbformat>3.0</nbformat>
# <headingcell level=1>
# "Chance to beat" or C2B empirical estimation
# <markdowncell>
# We define two independent random variables $A$ and $B$ with probability densities $f_A : \mathbb{R} \mapsto \mathbb{R}_+$ and $f_B : \mathbb{R} \mapsto \mathbb{R}_+$ respectively. The "chance to beat" probability, or $C2B$, is the probability that a sample from $A$ is greater than a sample from $B$, i.e.
#
#
# <center>$C2B(A,B) = \mathcal{P}[A > B]$.</center>
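# <markdowncell>
# As a quick illustration of this definition (not part of the original derivation), $C2B$ can be estimated by direct Monte Carlo sampling; the uniform distributions below are arbitrary choices for the sketch.
# <codecell>
```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative distributions: A is a uniform on [0.5, 1.5], B on [0, 1].
# For these, P[A > B] = 0.875 can be worked out geometrically.
a = rng.uniform(0.5, 1.5, size=100_000)
b = rng.uniform(0.0, 1.0, size=100_000)

# Fraction of pairs where the A sample beats the B sample.
c2b_mc = np.mean(a > b)
```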
# <headingcell level=2>
# First example with known gaussian variables
# <markdowncell>
# Let's draw the two pdfs from an example where $A\sim\mathcal{N}(\mu_A,\sigma_A)$ and $B\sim\mathcal{N}(\mu_B,\sigma_B)$ with $\mu_A = 3$, $\mu_B = 2$, $\sigma_A = 0.4$ and $\sigma_B = 0.6$.
# <codecell>
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats.distributions import norm

mu_a = 3; mu_b = 2
sigma_a = .4
sigma_b = .6
xs = np.linspace(0, 6, 100)
pdf_a_xs = norm.pdf(xs, loc=mu_a, scale=sigma_a)
pdf_b_xs = norm.pdf(xs, loc=mu_b, scale=sigma_b)
plt.figure(figsize=(12, 6))
plt.plot(xs, pdf_a_xs, xs, pdf_b_xs, lw=3, alpha=.75);
plt.legend(['$f_a$', '$f_b$']);
# <markdowncell>
# For a fixed value $a$ drawn from $A$ with probability density $f_A(a)$, the chance to beat a sample $b$ from $B$ is $P[b \le a] = P[B \le a] = F_B(a)$, where $F_B$ is the cumulative distribution function of $B$. Let's illustrate this with an arbitrary value for $a$.
# <codecell>
# Initialize figure and main axes
f, ax = plt.subplots(1, 1)
f.set_size_inches(12, 6)
ax.plot(xs, pdf_a_xs, xs, pdf_b_xs, lw=3, alpha=.75);
plt.legend(['$f_a$', '$f_b$'], prop={'size': 20});
# Evaluate the pdfs of A and B at a.
a = 2.75
f_A_a = norm.pdf(a, loc=mu_a, scale=sigma_a)
f_B_a = norm.pdf(a, loc=mu_b, scale=sigma_b)
ax.vlines(a, 0, f_A_a, color='blue', linestyles='dashed');
ax.hlines(f_A_a, 0, a, color='blue', linestyles='dashed');
ax.fill_between(xs[xs <= a], y1=0, y2=pdf_b_xs[xs <= a], color='green', alpha=.25);
ax.text(a / 2, f_A_a, s='$f_A(a)$', color='blue', size=20, ha='center', va='bottom');
ax.text(a, 0, s='$a$', color='blue', size=20, ha='center', va='top');
ax.text(mu_b, norm.pdf(mu_b, loc=mu_b, scale=sigma_b) / 2, s='$F_B(a)$', color='green', size=20, ha='center', va='bottom');
# <markdowncell>
# The probability $P[A > B]$ is the integral of $F_B(a)$ over every possible value of $a \in \mathbb{R}$, weighted by the pointwise density $f_A(a)$:
#
# <center> $$C2B(A,B) = P[A > B] = \int\limits_{a \in \mathbb{R}} F_B(a) f_A(a) da$$</center>
#
# Empirically (and for the purpose of plotting), we have defined a grid $(x_i)_{i=1,\cdots,n}$ of $n = 100$ equidistant points between $x_1 = 0$ and $x_n = 6$. The value $F_B(x_i)$ of the cumulative distribution of $B$ at a point $x_i$ can be numerically approximated by the cumulated sum of $f_B(x_j)$ for $j$ up to $i$:
#
# <center> $F_B(x_i) \approx \sum \limits_{j=1}^{i} f_B(x_j) \frac{x_n - x_1}{n} $ </center>
# <codecell>
dx = (xs[-1] - xs[0]) / len(xs)
approx_FB_xs = np.cumsum(pdf_b_xs * dx)
plt.figure(figsize=(12, 6))
plt.plot(xs, norm.cdf(xs, loc=mu_b, scale=sigma_b), 'k-')
plt.bar(xs, approx_FB_xs, width=dx, color='black', alpha=.1)
plt.legend(['$F_B(a)$', r'$\approx F_B(a)$'], loc='upper left', prop={'size': 20});
# <markdowncell>
# And then apply the same scheme to approximate $C2B(A,B)$:
# <center> $$ C2B(A,B) \approx \sum \limits_{i=1}^{n} F_B(x_i) f_A(x_i) \frac{x_n - x_1}{n} \approx \sum \limits_{i=1}^{n} \Big( \sum \limits_{j=1}^{i} f_B(x_j) \frac{x_n - x_1}{n} \Big) f_A(x_i) \frac{x_n - x_1}{n} $$</center>
# <codecell>
C2B = np.sum(approx_FB_xs * pdf_a_xs * (xs[-1] - xs[0]) / len(xs))
print('The C2B is {:.3f}'.format(C2B))
# <markdowncell>
# Let's define a generic function which calculates $C2B(A,B)$ given:
#
# * An array of probability measures of $A$ (pointwise or integrated over a bin, in any case accounting for the $dx$ part) on a set of points or bins spanning the whole domain of $A$.
# * An array of probability measures of $B$ on the same set, also accounting for the $dx$ part.
# <codecell>
def calculate_C2B(prob_measures_A, prob_measures_B):
    assert len(prob_measures_A) == len(prob_measures_B), "prob_measures_A and prob_measures_B must have the same length"
    return np.sum(np.cumsum(prob_measures_B) * prob_measures_A)

print(calculate_C2B((xs[-1] - xs[0]) / len(xs) * pdf_a_xs, (xs[-1] - xs[0]) / len(xs) * pdf_b_xs))
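# <markdowncell>
# For Gaussian variables there is also a closed form to check the grid approximation against: $A - B \sim \mathcal{N}(\mu_A - \mu_B, \sigma_A^2 + \sigma_B^2)$, so $P[A > B] = \Phi\big((\mu_A - \mu_B)/\sqrt{\sigma_A^2 + \sigma_B^2}\big)$. A sketch (the grid and parameters mirror the example above, but the comparison itself is an addition):
# <codecell>
```python
import numpy as np
from scipy.stats import norm

mu_a, sigma_a = 3.0, 0.4
mu_b, sigma_b = 2.0, 0.6

# A - B is Gaussian, so P[A > B] is a normal cdf evaluation.
exact_c2b = norm.cdf((mu_a - mu_b) / np.hypot(sigma_a, sigma_b))

# Grid-based approximation, mirroring calculate_C2B on the same grid.
xs = np.linspace(0, 6, 100)
dx = (xs[-1] - xs[0]) / len(xs)
approx_c2b = np.sum(np.cumsum(norm.pdf(xs, mu_b, sigma_b) * dx)
                    * norm.pdf(xs, mu_a, sigma_a) * dx)
```
The gap between the two values shrinks as the grid gets finer, which is a cheap way to sanity-check the discretization step.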
# <headingcell level=2>
# Samples from two unknown discrete distributions
# <markdowncell>
# In a more realistic context, we observe finite samples of sizes $n_a$ and $n_b$ from the unknown variables $A$ and $B$.
# <codecell>
n_a = 20
n_b = 15
a_samples = np.random.binomial(n=20, p=.7, size=n_a)
b_samples = np.random.binomial(n=30, p=.4, size=n_b)
plt.plot(a_samples, 'ob', b_samples, 'og', alpha=0.9)
plt.ylim([0, np.max(np.concatenate((a_samples, b_samples))) + 5])
plt.legend([r'$(A_i)_{i=1,\ldots,n_A}$', r'$(B_i)_{i=1,\ldots,n_B}$'], prop={'size': 20});
# <markdowncell>
# The $C2B(A,B)$, or $P[A > B]$, can also be computed in a combinatorial way:
#
# "For every sample $a$ drawn from $A$, enumerate all the smaller or equal samples from $B$.
# The probability $P[A > B]$ is the number of cases where the $A$ sample was actually greater than (or equal to) the $B$ sample, divided by the total number of pairs."
# <codecell>
C2B = np.sum([np.sum(b_samples <= a) for a in a_samples], dtype='float') / (n_a * n_b)
print('With the combinatorial method, we found that the C2B is {:.3f}'.format(C2B))
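# <markdowncell>
# As a side note, the nested comparison loop can also be written with NumPy broadcasting; a sketch with small hand-picked arrays (not the notebook's random draws):
# <codecell>
```python
import numpy as np

# Small illustrative samples with an easily checked answer.
a_samples = np.array([5, 7, 3, 9])
b_samples = np.array([4, 6, 8])

# Pairwise comparison matrix: entry (i, j) is True when b_j <= a_i.
pairwise = b_samples[None, :] <= a_samples[:, None]

# C2B is the fraction of pairs where the A sample wins.
c2b = pairwise.mean()
```
Broadcasting builds the full $n_a \times n_b$ boolean matrix in memory, so this trades memory for speed compared with the list-comprehension version.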
# <markdowncell>
# But we can apply the same reasoning as before, with the probability measures of $A$ and $B$ taken as sums of Dirac masses of weight $1 / n_a$ and $1 / n_b$ respectively.
#
# <center> $$ P[A>B] = \int\limits_{x \in \mathbb{R}} F_B(x) f_A(x) dx $$ </center>
#
# <center> $$ P[A>B] = \int\limits_{x \in \mathbb{R}} F_B(x) \frac{1}{n_a} \sum \limits_{i=1}^{n_a} \delta_{a_i}(x) dx = \frac{1}{n_a} \sum \limits_{i=1}^{n_a} F_B(a_i) $$ </center>
#
# <center> $$ P[A>B] = \frac{1}{n_a n_b} \sum \limits_{i=1}^{n_a} \sum \limits_{j=1}^{n_b} \mathbb{1}(b_j \le a_i) $$ </center>
#
# This can be visualized in a similar fashion to the Gaussian example above:
# <codecell>
grid = np.linspace(0, np.max(np.concatenate((a_samples, b_samples))), 1000)
# Empirical cdf of B, evaluated on the A samples and on a plotting grid.
FB_on_A_samples = np.array([np.sum(b_samples <= a) for a in a_samples], dtype='float') / n_b
FB_on_grid = np.array([np.sum(b_samples <= x) for x in grid], dtype='float') / n_b
plt.plot(grid, FB_on_grid, '-g')
plt.plot(a_samples, FB_on_A_samples, 'ob', alpha=.5)
plt.legend(['$F_B$', r'$(A_i)_{i=1,\cdots,n_A}$'], loc='upper left', prop={'size': 20});
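# <markdowncell>
# To close the loop, the generic `calculate_C2B` function from above can be reused on the empirical Dirac measures: place weight $1/n_a$ (resp. $1/n_b$) on each observed value over a shared grid. A sketch with small hand-picked samples (the grid and samples here are illustrative assumptions):
# <codecell>
```python
import numpy as np

def calculate_C2B(prob_measures_A, prob_measures_B):
    # Same generic function as defined earlier in the notebook.
    assert len(prob_measures_A) == len(prob_measures_B)
    return np.sum(np.cumsum(prob_measures_B) * prob_measures_A)

a_samples = np.array([5, 7, 3, 9])
b_samples = np.array([4, 6, 8])

# Empirical probability measures on a shared integer grid:
# each sample contributes weight 1/n to its bin.
grid = np.arange(0, 11)
measures_a = np.array([(a_samples == x).mean() for x in grid])
measures_b = np.array([(b_samples == x).mean() for x in grid])

c2b = calculate_C2B(measures_a, measures_b)
```
For these samples the result matches the combinatorial count exactly, since both methods enumerate the same pairs.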