Last active
October 1, 2024 08:38
Yule's K and Yule's I for lexical diversity in Python 3 (quick & dirty)
import collections
import re


def tokenize(s):
    # Split on runs of anything other than letters, digits, hyphens, apostrophes, underscores.
    tokens = re.split(r"[^0-9A-Za-z\-'_]+", s)
    # re.split leaves empty strings at the edges of the text; drop them.
    return [tok for tok in tokens if tok]


def get_yules(s):
    """
    Returns a tuple with Yule's K and Yule's I.
    (cf. Oakes, M.P. 1998. Statistics for Corpus Linguistics.
    International Journal of Applied Linguistics, Vol 10 Issue 2)
    In production this needs exception handling.
    """
    tokens = tokenize(s)
    # Count case-insensitively: "The" and "the" are the same type.
    token_counter = collections.Counter(tok.upper() for tok in tokens)
    m1 = sum(token_counter.values())  # total number of tokens
    m2 = sum(freq ** 2 for freq in token_counter.values())  # sum of squared type frequencies
    i = (m1 * m1) / (m2 - m1)  # ZeroDivisionError if no type occurs more than once
    k = 1 / i * 10000
    return (k, i)
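To make the arithmetic concrete, here is a small worked example. It is only a sketch: `yule_moments` is a hypothetical helper (not part of the gist) that repeats the tokenizing and counting steps of `get_yules` so the two intermediate sums can be inspected, and the sample sentence is made up.

```python
import collections
import re


def yule_moments(text):
    """Hypothetical helper: return (m1, m2) the same way get_yules computes them."""
    tokens = [t for t in re.split(r"[^0-9A-Za-z\-'_]+", text) if t]
    counts = collections.Counter(t.upper() for t in tokens)
    m1 = sum(counts.values())                  # total tokens
    m2 = sum(f ** 2 for f in counts.values())  # sum of squared type frequencies
    return m1, m2


# 11 tokens: THE x3, SAT x2, and six words that occur once.
m1, m2 = yule_moments("the cat sat on the mat and the dog sat too")
i = (m1 * m1) / (m2 - m1)   # 121 / 8
k = 1 / i * 10000

print(m1, m2)        # 11 19
print(i)             # 15.125
print(round(k, 2))   # 661.16
```

A more repetitive text pushes `m2` up, which raises K and lowers I, so a higher I (lower K) indicates a more diverse vocabulary.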
Also, k = 10000/i 👍
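Along the same lines, K can be computed straight from the two sums without going through I, which also sidesteps the division by zero that occurs when every word is used exactly once. The following is a hedged sketch, not the gist author's code: `yules_k_and_i` is a hypothetical name, it takes per-type frequencies rather than raw text, and returning `None` for undefined values is an assumption made here for illustration.

```python
def yules_k_and_i(frequencies):
    """Hypothetical variant: compute (K, I) from an iterable of per-type frequencies.

    Returns None for any value that is undefined for the input.
    """
    freqs = list(frequencies)
    m1 = sum(freqs)
    m2 = sum(f * f for f in freqs)
    if m1 == 0:
        # No tokens at all: neither measure is defined.
        return None, None
    # K = 10000 / I, rearranged so it never divides by (m2 - m1).
    k = 10000 * (m2 - m1) / (m1 * m1)
    # I is undefined when no type repeats (m2 == m1).
    i = (m1 * m1) / (m2 - m1) if m2 != m1 else None
    return k, i


print(yules_k_and_i([3, 2, 1, 1, 1, 1, 1, 1]))  # K is about 661.16, I is 15.125
print(yules_k_and_i([1, 1, 1]))                 # (0.0, None)
```

Whether `None`, `float("inf")`, or a raised exception is the right signal depends on how the caller consumes the result.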
Thanks for pointing it out, forgot to remove the self after excerpting it from a class.
Hi, maybe you should replace tokenize(self, s) with tokenize(s)