- lemma
- same stem, part of speech, rough semantics
- eg, bank, sing
- a lemma can have senses
- wordform
- the inflected word as seen in the text
- eg, banks, sung
homonymous words share a form but have unrelated, distinct meanings
polysemous words have related meanings
eg, bank can mean a financial institution or the building belonging to that institution
eg, school, university, hospital all can mean the institution or the building
- relation between senses and not words
- relation between senses
- ...
- ...
- hierarchically organised lexical database
- online thesaurus + aspects of a dictionary
- organised around synsets (synonym sets): sets of near-synonyms, each with a gloss
- Given
- a word in context
- a fixed inventory of potential word senses
- Decide
- which sense of the word this is
Lexical Sample Task
- small pre-selected set of target words
- supervised machine learning
All-words Task
- every word in an entire text
- a training corpus is used to train a classifier
- what do we need?
- a tag set
- the training corpus
- set of features extracted from the corpus
- classifier
- vectors of feature/value pairs
- represented as an ordered list of values
- these vectors represent, e.g., the window of words around the target
- 2 types of features
- collocational
- features about words at specific positions around the target
- bag-of-words
- features about words that occur anywhere in the window
Example
An electric guitar and bass player stand off to one side, not really part of the scene.
Assume a window of ±2 around the target word (bass).
The window contains ['guitar', 'and', 'player', 'stand']
Collocational features will have positional information.
Bag-of-words features use a fixed vocabulary and a binary indicator for each vocabulary word.
Assume the vocab to be ['fishing', 'big', 'sound', 'player', 'fly', 'rod', 'pound', 'double', 'runs', 'playing', 'guitar', 'band']
The vector for the window ['guitar', 'and', 'player', 'stand']: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
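Both feature types from the example above can be sketched in Python. The window size, vocabulary, and sentence come from the example; the function names are illustrative:

```python
# Sketch of collocational vs bag-of-words feature extraction for WSD.
# Vocabulary and sentence follow the example above; names are illustrative.

def collocational_features(tokens, target_idx, window=2):
    """Words at specific positions relative to the target, keyed by offset."""
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        pos = target_idx + offset
        feats[f"w{offset:+d}"] = tokens[pos] if 0 <= pos < len(tokens) else None
    return feats

def bag_of_words_vector(tokens, target_idx, vocab, window=2):
    """Binary indicators for vocabulary words occurring anywhere in the window."""
    lo, hi = max(0, target_idx - window), min(len(tokens), target_idx + window + 1)
    context = set(tokens[lo:target_idx]) | set(tokens[target_idx + 1:hi])
    return [1 if w in context else 0 for w in vocab]

sentence = ("an electric guitar and bass player stand off to one side "
            "not really part of the scene").split()
vocab = ['fishing', 'big', 'sound', 'player', 'fly', 'rod',
         'pound', 'double', 'runs', 'playing', 'guitar', 'band']
target = sentence.index('bass')

print(collocational_features(sentence, target))
# positional: w-2='guitar', w-1='and', w+1='player', w+2='stand'
print(bag_of_words_vector(sentence, target, vocab))
# [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
```

Note that the collocational features keep positional information (w-2 vs w+1), while the bag-of-words vector discards it.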
- Input:
	- a word w and a set of features f
	- a fixed set of classes C = {c1, c2, ...}
- Output:
	- a predicted class c ∈ C
Types
- Naive Bayes
- classification based on Bayes Rule
- relies on simple representations of documents, like bag-of-words
- Logistic regression
- Neural Nets
- ...
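A minimal Naive Bayes sense classifier can be sketched as follows. The tiny labelled corpus for "bass" is invented for illustration, and add-one smoothing is one common choice:

```python
import math
from collections import Counter, defaultdict

# Minimal Naive Bayes WSD sketch: pick argmax_s P(s) * prod_w P(w | s).
# The tiny sense-labelled corpus for "bass" is invented for illustration.
train = [
    ("fish", "he caught a huge bass while fishing in the river".split()),
    ("fish", "the bass took the fly near the rod".split()),
    ("music", "the bass player joined the guitar in the band".split()),
    ("music", "she plays electric bass with a loud sound".split()),
]

senses = sorted({s for s, _ in train})
prior = Counter(s for s, _ in train)
word_counts = defaultdict(Counter)
for sense, words in train:
    word_counts[sense].update(words)
vocab = {w for _, ws in train for w in ws}

def classify(context):
    """argmax_s log P(s) + sum_w log P(w|s), with add-one smoothing."""
    best, best_score = None, -math.inf
    for s in senses:
        total = sum(word_counts[s].values())
        score = math.log(prior[s] / len(train))
        for w in context:
            score += math.log((word_counts[s][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = s, score
    return best

print(classify("guitar and bass player on stage".split()))   # music
print(classify("fishing for bass with a new rod".split()))   # fish
```

Logs are used instead of raw probabilities to avoid numerical underflow when many context words are multiplied together.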
the bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities
- given the two WordNet senses
- financial institution
- sloping land
- words like deposits and mortgage appear in the WordNet gloss and examples for the first sense
- needs some sense-labelled data (like SemCor)
- take all the sentences with the relevant word sense
- add these to the gloss + examples for each sense; call this the signature of the sense
- choose sense with most overlap between context and signature
- weigh each overlapping word by inverse document frequency
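The overlap-plus-IDF scoring above can be sketched in a few lines. The signatures here are invented stand-ins for real gloss/example/SemCor data:

```python
import math

# Corpus-Lesk sketch: score each sense by IDF-weighted overlap between the
# context and the sense's signature (gloss + examples + labelled sentences).
# The signatures below are invented stand-ins for real WordNet/SemCor data.
signatures = {
    "bank%financial": {"deposits", "money", "mortgage", "loans", "interest"},
    "bank%sloping_land": {"river", "slope", "water", "land", "earth"},
}

def idf(word, documents):
    """Inverse document frequency of a word over the sense signatures."""
    df = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / df) if df else 0.0

def corpus_lesk(context, signatures):
    docs = list(signatures.values())
    best, best_score = None, -1.0
    for sense, sig in signatures.items():
        score = sum(idf(w, docs) for w in set(context) & sig)
        if score > best_score:
            best, best_score = sense, score
    return best

ctx = ("the bank can guarantee deposits will eventually cover future tuition "
       "costs because it invests in adjustable-rate mortgage securities").split()
print(corpus_lesk(ctx, signatures))  # bank%financial
```

The IDF weighting down-weights words that occur in many signatures (here, any word shared by both signatures gets weight log(1) = 0), so only discriminative overlaps count.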
- WordNet as graph
- senses are nodes
- relations (hypernymy, etc) are edges
- add edges between word and unambiguous gloss words
- Insert target word or words in its sentential context into the graph, with directed edges to their senses
- Choose the most central sense
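A toy sketch of the graph idea, using simple degree centrality on an invented sense graph. Real systems build the graph from the full WordNet and typically use PageRank-style centrality:

```python
# Toy sketch of graph-based WSD: build a small undirected sense graph,
# insert the target's candidate senses, and pick the most central one.
# The graph below is invented; real systems run PageRank over WordNet.

def degree_centrality(graph):
    """Centrality as the number of neighbours of each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

# Nodes are senses; edges stand in for WordNet relations and gloss links.
graph = {
    "bank#1_financial": {"deposit#1", "money#1", "mortgage#1"},
    "bank#2_riverside": {"slope#1"},
    "deposit#1": {"bank#1_financial", "money#1"},
    "money#1": {"bank#1_financial", "deposit#1", "mortgage#1"},
    "mortgage#1": {"bank#1_financial", "money#1"},
    "slope#1": {"bank#2_riverside"},
}

candidates = ["bank#1_financial", "bank#2_riverside"]
centrality = degree_centrality(graph)
best = max(candidates, key=centrality.get)
print(best)  # bank#1_financial
```

In the financial-context sentence, the financial sense sits in a densely connected neighbourhood (deposits, mortgage), so it is more central than the riverside sense.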
- Supervised and dictionary-based approaches require large hand-built resources
- What if you don’t have so much training data?
- Solution:
- Generalize from a very small hand-labeled seed-set
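A Yarowsky-style bootstrapping loop can be sketched as below. The corpus and seed collocations are invented, and the rule-growing step applies the "one sense per collocation" heuristic:

```python
# Yarowsky-style bootstrapping sketch: start from a tiny seed set of
# sense-indicating collocations, confidently label sentences, and grow
# the rule set. Corpus and seeds are invented for illustration.

seeds = {"fish": {"caught"}, "music": {"guitar"}}

corpus = [
    "caught a big bass on the lake".split(),
    "bass guitar solo in the song".split(),
    "the lake bass weighed two pounds".split(),
    "played bass in the song last night".split(),
]

def bootstrap(corpus, seeds, rounds=3):
    rules = {s: set(words) for s, words in seeds.items()}
    labels = {}
    for _ in range(rounds):
        # Label any sentence containing a known collocation.
        for i, sent in enumerate(corpus):
            for sense, words in rules.items():
                if words & set(sent):
                    labels[i] = sense
        # Grow rules: adopt words seen with exactly one sense
        # ("one sense per collocation").
        seen = {}
        for i, sense in labels.items():
            for w in corpus[i]:
                seen.setdefault(w, set()).add(sense)
        for w, senses in seen.items():
            if len(senses) == 1:
                rules[next(iter(senses))].add(w)
    return labels

print(bootstrap(corpus, seeds))
# {0: 'fish', 1: 'music', 2: 'fish', 3: 'music'}
```

After the first round the seeds label two sentences; words like "lake" and "song" then become new rules, which label the remaining two sentences in the next round.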