Last active
August 15, 2022 20:41
-
-
Save dlebech/5e2885e852bdca38d99c27a55fb6636c to your computer and use it in GitHub Desktop.
Binomial probability calculation function in SQL (BigQuery)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
-- Public Domain CC0 license. https://creativecommons.org/publicdomain/zero/1.0/ | |
-- Calculate the probability of k successes for n trials with probability of success k, | |
-- using the binomial distribution. | |
-- Calculate the binomial coefficient using the "multiplicative formula" | |
CREATE OR REPLACE FUNCTION functions.binomial_coef(n INT64, k INT64) AS (( | |
-- k!/(n!*(n-k)!) | |
-- We're going to have a hard time doing factorials here, | |
-- but based on the "multiplicative formula" in Wiki, it should be possible: | |
-- https://en.wikipedia.org/wiki/Binomial_coefficient#Computing_the_value_of_binomial_coefficients | |
-- Taken together with the fact that we can do EXP(SUM(LOG(x))) to do product over multiple rows, we have: | |
SELECT | |
-- Binomial coefficient is always an integer, | |
-- but BigQuery is not super happy about EXP(SUM(LN(x))) trick, | |
-- so the ROUND is added to be safe :-) | |
ROUND(EXP(SUM(LN((n + 1 - i) / i)))) | |
FROM | |
-- Take advantage of symmetry (n // k) == (n // (n-k)) | |
UNNEST(GENERATE_ARRAY(1, IF(k > n - k, n - k, k), 1)) i | |
)); | |
-- Calculate the probability of k successes in n trials where the probability of a success is p. | |
-- This uses the probability density function for the binomial distribution. | |
-- See example below. | |
CREATE OR REPLACE FUNCTION functions.binomial_prob(n INT64, k INT64, p FLOAT64) AS ( | |
functions.binomial_coef(n, k) * POW(p, k) * POW(1-p, n-k) | |
); | |
-- The example from Wikipedia: https://en.wikipedia.org/wiki/Binomial_distribution#Example | |
-- Probability of getting 4 heads in 6 tosses with a coin that is biased and has heads probability 0.3 | |
-- Calculates to 0.05953499999999999 | |
SELECT functions.binomial_prob(6, 4, 0.3); |
Thanks for getting back. I liked the idea, btw. I tried another implementation, which also doesn't work.
My use case is a bit different from this actually - I needed a two sample test. I've implemented this in javascript (based on a Python implementation) and then added an approximation of the normal cdf to get the p-values that I found. I don't think I can share this though (corporate).
Aha, that makes sense @benman1 I'm happy to hear it worked out for you with a different solution 🙂
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
@benman1 thanks for the comment, and nicely spotted. My original use case didn't involve
n
larger than 20, so I never tested it for very large trial sizesn
😅The error occurs in the binomial coefficient calculation due to the way the product is calculated over multiple rows. For large trial sizes, the coefficient becomes incredibly high anyway and BigQuery might not be able to calculate it -- even with a different method. For example, the binomial coefficient for 300 and 150 (your second example) is already 9.37 * 10^88 [1].
I can't provide a compelling argument against using binomial probability for larger-than-1000 trial sizes since I'm not a math person, but I can conclude that the current calculation probably won't be feasible to use in any programming language for larger trial sizes -- it needs to be rewritten.
If you have a suggestion on how to get around this, I'm very curious to learn more about it though!
[1] 93,759,702,772,827,452,793,193,754,439,064,084,879,232,655,700,081,358,920,472,352,730,000,000,000,000,000,000,000,000 according to ref