@ikwattro
Created September 9, 2016 20:06
Custom POS Tagger in Python

Using a custom tagger with Python's NLTK.

Problem: most of the examples out there are buggy; you need to stick to nltk 3.0.1:

pip install nltk==3.0.1
When I checked org.neo4j.kernel.impl.TransactionEvents I saw that some methods were private.
[('When', 'WRB'), ('I', 'PRP'), ('checked', 'VBD'), ('org.neo4j.kernel.impl.TransactionEvents', 'CCN'), ('I', 'PRP'), ('saw', 'VBD'), ('that', 'IN'), ('some', 'DT'), ('methods', 'NNS'), ('were', 'VBD'), ('private', 'JJ'), ('.', '.')]
Do you think that `getEvents()` can be some kind of useful when using `bool`?
[('Do', 'NNP'), ('you', 'PRP'), ('think', 'VBP'), ('that', 'IN'), ('`getEvents', 'NNS'), ('(', 'VBP'), (')', ':'), ('`', '``'), ('can', 'MD'), ('be', 'VB'), ('some', 'DT'), ('kind', 'NN'), ('of', 'IN'), ('useful', 'JJ'), ('when', 'WRB'), ('using', 'VBG'), ('`bool`', 'NN'), ('?', '.')]
Process finished with exit code 0
from nltk.corpus import brown
import nltk.tag, nltk.data
document = '''
Scores of people were already lying dead or injured inside a crowded Orlando nightclub,
and the police had spent hours trying to connect with the gunman and end the situation without further violence.
But when Omar Mateen threatened to set off explosives, the police decided to act, and pushed their way through a
wall to end the bloody standoff.
'''
document2 = '''
When I checked org.neo4j.kernel.impl.TransactionEvents I saw that some methods were private.
Do you think that `getEvents()` can be some kind of useful when using `bool`?
'''
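# nltk.tag._POS_TAGGER is a private constant that exists in nltk 3.0.1 (hence the version pin above)
default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)
# Custom rule: any token containing at least two dots (e.g. a fully qualified class name) gets the tag 'CCN'
patterns = [
    (r'(.*\..*){2,}', 'CCN')
]
# Tokens that match no pattern fall back to the default tagger
regexp_tagger = nltk.RegexpTagger(patterns, backoff=default_tagger)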
sentences = nltk.sent_tokenize(document2)
for s in sentences:
    print(s)
    text = nltk.word_tokenize(s)
    print(regexp_tagger.tag(text))
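
To see which tokens the 'CCN' rule would pick up without installing NLTK, the pattern can be tried on its own with the standard `re` module. This is only a rough sketch: it assumes the tagger tests each pattern against a single token with `re.match` semantics, and the helper name `matches_ccn` and the sample tokens are made up for illustration.

```python
import re

# Same pattern as in the script above: a token containing at least two dots.
CCN_PATTERN = re.compile(r'(.*\..*){2,}')


def matches_ccn(token):
    """Return True if the token would hit the 'CCN' rule (hypothetical helper)."""
    # Assumption: the pattern is anchored at the start of the token, as re.match does.
    return CCN_PATTERN.match(token) is not None


# Hypothetical sample tokens, chosen to show hits and misses.
samples = [
    'org.neo4j.kernel.impl.TransactionEvents',
    'private',
    '.',
    '3.14',
    'e.g.',
    '...',
]

for token in samples:
    print(token, matches_ccn(token))
```

Note that the rule is broad: abbreviations such as `e.g.` and an ellipsis `...` also contain two dots, so they would be tagged 'CCN' as well. A stricter pattern may be worth considering if that matters for your text.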