import spacy
from spacy.attrs import ORTH, LEMMA
text = 'Sgt. Maj. A. Grinston found approx. 2.2 miles up a creek on Mt. Toohigh.'
nlp = spacy.load('en_core_web_lg')
# Default behaviour: the periods after some abbreviations are read as sentence ends.
print([sent.text for sent in nlp(text).sents])
#=> ['Sgt.', 'Maj.', 'A. Grinston found approx.', '2.2 miles up a creek on Mt. Toohigh.']
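# Treat each abbreviation as a single token, period included.
# Note: spaCy v3 accepts only ORTH and NORM in tokenizer special cases; LEMMA works on v2.
nlp.tokenizer.add_special_case('Sgt.', [{ORTH: 'Sgt.', LEMMA: 'sergeant'}])
nlp.tokenizer.add_special_case('Maj.', [{ORTH: 'Maj.', LEMMA: 'major'}])
nlp.tokenizer.add_special_case('approx.', [{ORTH: 'approx.', LEMMA: 'approximately'}])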
# With the special cases registered, the text is segmented as one sentence.
print([sent.text for sent in nlp(text).sents])
#=> ['Sgt. Maj. A. Grinston found approx. 2.2 miles up a creek on Mt. Toohigh.']
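To check at the token level that the special cases took effect, without loading the large model, something like the following might help. It is only a sketch: `tokenize_with_abbreviations` is a hypothetical helper name, it assumes spaCy is installed, and it uses `spacy.blank('en')` (tokenizer only, no parser), so it shows tokens rather than sentences. It registers only `ORTH`, which should be accepted on both spaCy v2 and v3.

```python
import spacy
from spacy.attrs import ORTH


def tokenize_with_abbreviations(text, abbreviations):
    """Return token texts after registering each abbreviation as one token.

    Hypothetical helper for illustration; not part of spaCy.
    """
    # A blank English pipeline has the tokenizer but no trained components,
    # so no model download is needed.
    nlp = spacy.blank('en')
    for abbr in abbreviations:
        # The ORTH values of a special case must join back to the original string.
        nlp.tokenizer.add_special_case(abbr, [{ORTH: abbr}])
    return [token.text for token in nlp(text)]


if __name__ == '__main__':
    sample = 'Sgt. Maj. A. Grinston found approx. 2.2 miles up a creek on Mt. Toohigh.'
    print(tokenize_with_abbreviations(sample, ['Sgt.', 'Maj.', 'approx.']))
```

If `'Sgt.'`, `'Maj.'` and `'approx.'` each appear as one item in the printed list, the tokenizer is no longer splitting off their periods.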