Created
March 19, 2012 19:17
Small script that calculates the significance of term co-occurrences in a document corpus.
#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
"""
Created on Mar 19, 2012

@author: Joern Hees
"""
import math

from scipy.stats import poisson


def sig_cooccs(a, b, k, n):
    """ Calculates the significance of k observed co-occurrences of two terms
    A and B in a corpus of size n, where A occurs in a documents and B occurs
    in b documents (or scopes, depending on what you do).

    The significance is based on Poisson probabilities: Given a, b and n we
    can calculate lambda l = a/n * b/n * n = a*b/n, which is the expected
    number of co-occurrences of A and B if they were statistically
    independent. In other words, l is the number of co-occurrences one would
    not be worried about, as A and B would meet in l documents just by chance
    given a, b and n.

    With l we can calculate the probability of exactly x co-occurrences,
    P(#cooccs=x, lambda=l), but as we observed k co-occurrences we are
    interested in the probability of k or more co-occurrences,
    P(#cooccs >= k, lambda=l).

    The smaller this probability is, the more significant it is when A and B
    co-occur k times. To get the significance we take the negative log of
    that probability and divide it by the log of the corpus size n.

    @return: the significance value (float between 0 and inf)

    @see: Evert, Stefan. 2005.
        The Statistics of Word Cooccurrences - Word Pairs and Collocations.
        Pages 77ff, 91ff.
        Universitaet Stuttgart.
        http://elib.uni-stuttgart.de/opus/volltexte/2005/2371/ .

    >>> print sig_cooccs(1000,1000,650,10000)
    72.8162278963
    >>> print sig_cooccs(1000,1000,1000,10000)
    inf
    """
    # Expected number of co-occurrences under independence.
    l = float(a)*b/n
    # To avoid floating point precision problems, use logsf (the log survival
    # function). logsf(x, l) is the log of P(#cooccs > x, lambda=l), so we
    # pass k-1 to get the log of P(#cooccs >= k, lambda=l).
    return -poisson.logsf(k-1, l) / math.log(n)


if __name__ == '__main__':
    import doctest
    doctest.testmod()
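
The formula in the docstring can also be evaluated without SciPy, which may help when checking what the score actually measures. The following is only a minimal sketch, not part of the original script: the names `poisson_log_tail` and `sig_cooccs_logspace` are hypothetical, it assumes a, b and n are positive, and it sums the Poisson tail in log space with `math.lgamma` instead of calling `poisson.logsf`.

```python
from __future__ import print_function

import math


def poisson_log_tail(k, lam):
    """Return log P(X >= k) for X ~ Poisson(lam), assuming lam > 0.

    Hypothetical helper: sums the tail terms in log space (log-sum-exp),
    so the result stays finite even when the probability itself would
    underflow to 0.0 as a float.
    """
    if k <= 0:
        return 0.0  # P(X >= 0) = 1
    log_lam = math.log(lam)
    log_terms = []
    largest = -float("inf")
    x = k
    while True:
        # log of the Poisson probability of exactly x co-occurrences
        term = -lam + x * log_lam - math.lgamma(x + 1)
        log_terms.append(term)
        largest = max(largest, term)
        # Past the mode the terms only shrink; stop once they are negligible.
        if x > lam and term < largest - 40.0:
            break
        x += 1
    return largest + math.log(sum(math.exp(t - largest) for t in log_terms))


def sig_cooccs_logspace(a, b, k, n):
    """Same score as described in the docstring above, without SciPy."""
    lam = float(a) * b / n  # expected co-occurrences under independence
    return -poisson_log_tail(k, lam) / math.log(n)


# Should agree with the first doctest value (72.816...) up to rounding.
print(sig_cooccs_logspace(1000, 1000, 650, 10000))
```

For the second doctest case (k = 1000), this sketch returns a large finite score rather than inf, because the tail probability is never formed as a plain float. Whether that is preferable depends on how the scores are used afterwards.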