@joernhees
Created March 19, 2012 19:17
Small script that calculates the significance of term co-occurrences in a document corpus.
#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
"""
Created on Mar 19, 2012
@author: Joern Hees
"""
import math

from scipy.stats import poisson


def sig_cooccs(a, b, k, n):
    """ Calculates the significance of k observed co-occurrences of two terms
    A and B in a corpus of size n, where A occurs in a documents and B occurs
    in b documents (or scopes, depending on what you do).

    The significance is based on Poisson probabilities: given a, b and n we
    can calculate lambda l = a/n * b/n * n = a*b/n, which is the expected
    number of co-occurrences of A and B if they were statistically
    independent. In other words, l is the number of co-occurrences one would
    not be worried about, as it is just by chance that A and B would meet in
    l documents given a, b and n.

    With l we can calculate the probability of exactly x co-occurrences,
    P(#cooccs = x, lambda = l), but as we observed k co-occurrences we are
    interested in the probability of k or more co-occurrences,
    P(#cooccs >= k, lambda = l).

    The smaller this probability is, the more significant it is when A and B
    co-occur k times. To get the significance we take the negative log of
    that probability and divide it by the log of the corpus size n.

    @return: the significance value (float between 0 and inf)

    @see: Evert, Stefan. 2005.
        The Statistics of Word Cooccurrences - Word Pairs and Collocations.
        Pages 77ff, 91ff.
        Universitaet Stuttgart.
        http://elib.uni-stuttgart.de/opus/volltexte/2005/2371/

    >>> print sig_cooccs(1000,1000,650,10000)
    72.8162278963
    >>> print sig_cooccs(1000,1000,1000,10000)
    inf
    """
    # Expected number of co-occurrences under independence.
    l = float(a)*b/n
    # To avoid floating point precision problems we use logsf (the log
    # survival function). logsf(x, l) is the log of P(#cooccs > x, l), so we
    # pass k-1 to get the log of P(#cooccs >= k, l).
    return -poisson.logsf(k-1, l) / math.log(n)


if __name__ == '__main__':
    import doctest
    doctest.testmod()
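
The docstring above describes the significance as the negative log of P(#cooccs >= k, lambda = l), divided by log(n). As a rough cross-check that does not need SciPy, that tail probability can also be summed term by term in log space. The sketch below is not part of the original script: the names log_poisson_pmf, log_poisson_tail and sig_cooccs_sketch are hypothetical, and it assumes a, b > 0 and n > 1. Each Poisson term is scaled relative to the largest term in the tail, so the sum stays representable even when the probability itself would underflow to zero.

```python
import math


def log_poisson_pmf(x, lam):
    """Log of P(X = x) for X ~ Poisson(lam), lam > 0 (hypothetical helper)."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def log_poisson_tail(k, lam):
    """Log of P(X >= k) for X ~ Poisson(lam), lam > 0 (hypothetical helper)."""
    if k <= 0:
        return 0.0  # P(X >= 0) = 1
    # Largest term in the tail sits at the mode floor(lam), or at k if the
    # tail starts beyond the mode. All terms are scaled so that term(m) == 1.
    m = max(k, int(lam))
    total = 1.0
    # Terms above m: pmf(x + 1) / pmf(x) = lam / (x + 1), always < 1 here.
    term, x = 1.0, m
    while term > 1e-17 * total:
        term *= lam / (x + 1)
        total += term
        x += 1
    # Terms below m, down to k: pmf(x - 1) / pmf(x) = x / lam, always <= 1.
    term, x = 1.0, m
    while x > k and term > 1e-17 * total:
        term *= x / lam
        total += term
        x -= 1
    # Clamp at 0 so rounding can never report a probability above 1.
    return min(0.0, log_poisson_pmf(m, lam) + math.log(total))


def sig_cooccs_sketch(a, b, k, n):
    """Same quantity as sig_cooccs, computed without SciPy (a sketch)."""
    lam = float(a) * b / n  # expected co-occurrences under independence
    return -log_poisson_tail(k, lam) / math.log(n)
```

For the first docstring example, sig_cooccs_sketch(1000, 1000, 650, 10000) should come out close to the 72.816 shown there. For the second example (k = 1000) this sketch returns a large finite value rather than inf, presumably because the docstring's inf arises when the tail probability underflows to zero before the log is taken.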