-
-
Save dlebech/5e2885e852bdca38d99c27a55fb6636c to your computer and use it in GitHub Desktop.
-- Public Domain CC0 license. https://creativecommons.org/publicdomain/zero/1.0/ | |
-- Calculate the probability of k successes for n trials with probability of success k, | |
-- using the binomial distribution. | |
-- Calculate the binomial coefficient using the "multiplicative formula" | |
CREATE OR REPLACE FUNCTION functions.binomial_coef(n INT64, k INT64) AS (( | |
-- k!/(n!*(n-k)!) | |
-- We're going to have a hard time doing factorials here, | |
-- but based on the "multiplicative formula" in Wiki, it should be possible: | |
-- https://en.wikipedia.org/wiki/Binomial_coefficient#Computing_the_value_of_binomial_coefficients | |
-- Taken together with the fact that we can do EXP(SUM(LOG(x))) to do product over multiple rows, we have: | |
SELECT | |
-- Binomial coefficient is always an integer, | |
-- but BigQuery is not super happy about EXP(SUM(LN(x))) trick, | |
-- so the ROUND is added to be safe :-) | |
ROUND(EXP(SUM(LN((n + 1 - i) / i)))) | |
FROM | |
-- Take advantage of symmetry (n // k) == (n // (n-k)) | |
UNNEST(GENERATE_ARRAY(1, IF(k > n - k, n - k, k), 1)) i | |
)); | |
-- Calculate the probability of k successes in n trials where the probability of a success is p. | |
-- This uses the probability density function for the binomial distribution. | |
-- See example below. | |
CREATE OR REPLACE FUNCTION functions.binomial_prob(n INT64, k INT64, p FLOAT64) AS ( | |
functions.binomial_coef(n, k) * POW(p, k) * POW(1-p, n-k) | |
); | |
-- The example from Wikipedia: https://en.wikipedia.org/wiki/Binomial_distribution#Example | |
-- Probability of getting 4 heads in 6 tosses with a coin that is biased and has heads probability 0.3 | |
-- Calculates to 0.05953499999999999 | |
SELECT functions.binomial_prob(6, 4, 0.3); |
@benman1 thanks for the comment, and nicely spotted. My original use case didn't involve n
larger than 20, so I never tested it for very large trial sizes n
😅
The error occurs in the binomial coefficient calculation due to the way the product is calculated over multiple rows. For large trial sizes, the coefficient becomes incredibly high anyway and BigQuery might not be able to calculate it -- even with a different method. For example, the binomial coefficient for 300 and 150 (your second example) is already 9.37 * 10^88 [1].
I can't provide a compelling argument against using binomial probability for larger-than-1000 trial sizes since I'm not a math person, but I can conclude that the current calculation probably won't be feasible to use in any programming language for larger trial sizes -- it needs to be rewritten.
If you have a suggestion on how to get around this, I'm very curious to learn more about it though!
[1] 93,759,702,772,827,452,793,193,754,439,064,084,879,232,655,700,081,358,920,472,352,730,000,000,000,000,000,000,000,000 according to ref
Thanks for getting back. I liked the idea, btw. I tried another implementation, which also doesn't work.
My use case is a bit different from this actually - I needed a two sample test. I've implemented this in javascript (based on a Python implementation) and then added an approximation of the normal cdf to get the p-values that I found. I don't think I can share this though (corporate).
Aha, that makes sense @benman1 I'm happy to hear it worked out for you with a different solution 🙂
(300, 150, 0.5)
gives me0.0460275144190249