The following node-based semantic similarity measures are based on the information content of the lowest common subsumer of two words.
- Information content (IC) measures how specific a concept is. It is defined as the negative log probability of encountering the concept in a given corpus: IC(c) = -log P(c). Rare, specific concepts carry high information content; frequent, general concepts carry low information content.
- The lowest common subsumer is the most specific common ancestor of two words in a lexical taxonomy such as WordNet. For example, the words "cat" and "dog" share the common ancestors "animal" and "mammal". "Mammal" is their lowest common subsumer, because it is the more specific of the two: its distance to "cat" and "dog" is shorter. Note that if you conceive of WordNet as a tree with "entity" as the root node, "lowest" refers to depth rather than proximity to the root; the lowest common subsumer is the common ancestor farthest from the root.
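To make the idea concrete, here is a minimal sketch of finding the lowest common subsumer in a hand-made toy taxonomy (the hierarchy below is illustrative, not real WordNet data):

```python
# Toy taxonomy: each concept maps to its parent (hypothetical fragment).
PARENT = {
    "cat": "mammal",
    "dog": "mammal",
    "mammal": "animal",
    "animal": "entity",
}

def ancestors(word):
    """Return the chain of ancestors from word up to the root."""
    chain = []
    while word in PARENT:
        word = PARENT[word]
        chain.append(word)
    return chain

def lowest_common_subsumer(w1, w2):
    """Walk upward from w1 until we hit a node that also subsumes w2."""
    common = set(ancestors(w2)) | {w2}
    node = w1
    while node not in common:
        node = PARENT[node]
    return node

print(lowest_common_subsumer("cat", "dog"))  # mammal
```

Walking up from "cat", the first node that also appears on "dog"'s ancestor chain is "mammal", the deepest (farthest-from-root) common ancestor.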
These measures are dependent on the specific corpora used to generate the information content.
Resnik Similarity: Returns a score denoting how similar two word senses are, defined as the information content of their lowest common subsumer: sim(c1, c2) = IC(lcs(c1, c2)). Note that for any similarity measure that uses information content, the result depends on the corpus used to generate the information content and on the specifics of how the information content was computed.
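The Resnik measure can be sketched over the same toy taxonomy, using made-up corpus counts (the frequencies below are hypothetical, not drawn from any real corpus):

```python
import math

# Toy taxonomy (hypothetical fragment, not real WordNet data).
PARENT = {"cat": "mammal", "dog": "mammal",
          "mammal": "animal", "animal": "entity"}

# Made-up corpus counts: each concept's count includes all of its
# descendants, so counts grow monotonically toward the root.
FREQ = {"cat": 10, "dog": 15, "mammal": 40, "animal": 80, "entity": 100}
TOTAL = FREQ["entity"]  # the root subsumes every observation

def ic(concept):
    """Information content: -log of the concept's corpus probability."""
    return -math.log(FREQ[concept] / TOTAL)

def lcs(w1, w2):
    """Deepest common ancestor, walking up from w1."""
    common = set()
    node = w2
    while node in PARENT:
        node = PARENT[node]
        common.add(node)
    node = w1
    while node not in common:
        node = PARENT[node]
    return node

def resnik_similarity(w1, w2):
    """IC of the lowest common subsumer of the two words."""
    return ic(lcs(w1, w2))

print(resnik_similarity("cat", "dog"))  # -log(40/100) ≈ 0.916
```

Because "mammal" is fairly specific (low corpus probability under these counts), the score for "cat" and "dog" is higher than it would be for a pair whose lowest common subsumer is a general node like "animal". In NLTK, the corresponding call is `synset1.res_similarity(synset2, ic)`, where the IC dictionary is loaded from a corpus via `nltk.corpus.wordnet_ic`.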