Last active
November 25, 2017 23:42
Jaccard Similarity Function in Python 3 (builtins only)
"""Jaccard similarity of two sets using builtin Python 3 only."""

import doctest


def jaccard_similarity(x: set, y: set) -> float:
    """Get the Jaccard similarity of two sets.

    Example:
        >>> jaccard_similarity({1,2,3,4}, {2,3,5,7})
        0.3333333333333333
        >>> jaccard_similarity({1,2,3,4}, {2,4,6})
        0.4
        >>> jaccard_similarity({2,3,5,7}, {2,4,6})
        0.16666666666666666

    """

    intersection_cardinality = len(x.intersection(y))
    union_cardinality = len(x) + len(y) - intersection_cardinality
    return intersection_cardinality / union_cardinality


if __name__ == "__main__":
    doctest.testmod()
Line #20 is tricky. The formula for union_cardinality, while counter-intuitive, is precisely
cardinality(x) + cardinality(y) - cardinality(x intersect y)
. This is because a set's members are unique, so adding len(x) and len(y) counts every element that appears in both sets twice. Those doubly counted elements are exactly the intersection (whose elements are also unique), so subtracting its cardinality removes the duplicates and one finally obtains the correct union_cardinality without ever building the union set.
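A quick way to convince yourself is to compare the formula against Python's own set union. This is only a minimal sketch: the helper name union_cardinality_by_formula is made up for illustration and is not part of the gist.

```python
def union_cardinality_by_formula(x: set, y: set) -> int:
    """Hypothetical helper: size of the union computed as on line #20."""
    return len(x) + len(y) - len(x & y)


x = {1, 2, 3, 4}
y = {2, 3, 5, 7}

# len(x) + len(y) is 8, but the shared elements 2 and 3 were counted twice.
# Subtracting the intersection size (2) leaves 6, the size of {1, 2, 3, 4, 5, 7}.
assert union_cardinality_by_formula(x, y) == len(x | y) == 6
```

The formula only needs the intersection, which the function has already computed for the numerator, so no second set has to be built.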