from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

text = 'Sgt. Maj. A. Grinston found approx. 2.2 miles up a creek on Mt. Toohigh.'

# An untrained tokenizer knows no abbreviations, so it breaks after each one.
PunktSentenceTokenizer().tokenize(text)
#=> ['Sgt.', 'Maj.', 'A. Grinston found approx.', '2.2 miles up a creek on Mt.', 'Toohigh.']
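
# Register the abbreviations by hand: lowercase, without the trailing period.
punkt_param = PunktParameters()
punkt_param.abbrev_types = {'sgt', 'maj', 'mt', 'approx'}
# A PunktParameters object passed in place of training text is used as-is.
# No train() call is needed; train() only returns parameters and does not
# update the tokenizer.
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.tokenize(text)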
#=> ['Sgt. Maj. A. Grinston found approx. 2.2 miles up a creek on Mt. Toohigh.']