# Gist by @mzdravkov. Last active May 20, 2022 20:34.
import re
text1 = """
I celebrate myself, and sing myself,
And what I assume you shall assume,
For every atom belonging to me as good belongs to you.
I loafe and invite my soul,
I lean and loafe at my ease observing a spear of summer grass.
My tongue, every atom of my blood, form’d from this soil, this air,
Born here of parents born here from parents the same, and their parents the same,
I, now thirty-seven years old in perfect health begin,
Hoping to cease not till death.
Creeds and schools in abeyance,
Retiring back a while sufficed at what they are, but never forgotten,
I harbor for good or bad, I permit to speak at every hazard,
Nature without check with original energy.
"""
text2 = """
You don’t know about me without you have read a book by the name of The Adventures of Tom Sawyer; but that ain’t no matter. That book was made by Mr. Mark Twain, and he told the truth, mainly. There was things which he stretched, but mainly he told the truth. That is nothing. I never seen anybody but lied one time or another, without it was Aunt Polly, or the widow, or maybe Mary. Aunt Polly—Tom’s Aunt Polly, she is—and Mary, and the Widow Douglas is all told about in that book, which is mostly a true book, with some stretchers, as I said before.
"""
contractions = {
    "ain’t": "is not",
    "aren’t": "are not",
    "can’t": "cannot",
    "could’ve": "could have",
    "couldn’t": "could not",
    "didn’t": "did not",
    "doesn’t": "does not",
    "don’t": "do not",
    "everybody’s": "everybody is",
    "everyone’s": "everyone is",
    "hadn’t": "had not",
    "had’ve": "had have",
    "hasn’t": "has not",
    "haven’t": "have not",
    "he’d": "he had",
    "he’ll": "he shall",
    "he’s": "he has",
    "here’s": "here is",
    "how’ll": "how will",
    "how’re": "how are",
    "how’s": "how is",
    "I’d": "I had",
    "I’m": "I am",
    "I’ve": "I have",
    "isn’t": "is not",
    "it’d": "it would",
    "it’ll": "it shall",
    "it’s": "it has",
    "let’s": "let us",
    "may’ve": "may have",
    "might’ve": "might have",
    "mustn’t": "must not",
    "must’ve": "must have",
    "o’clock": "of the clock",
    "ought’ve": "ought have",
    "oughtn’t": "ought not",
    "shan’t": "shall not",
    "she’d": "she had",
    "she’ll": "she shall",
    "she’s": "she has",
    "should’ve": "should have",
    "shouldn’t": "should not",
    "somebody’s": "somebody has",
    "someone’s": "someone has",
    "something’s": "something has",
    "that’ll": "that shall",
    "that’s": "that has",
    "that’d": "that would",
    "there’d": "there had",
    "there’ll": "there shall",
    "there’s": "there has",
    "they’d": "they had",
    "they’ll": "they shall",
    "they’re": "they are",
    "they’ve": "they have",
    "this’s": "this has",
    "wasn’t": "was not",
    "we’d": "we had",
    "we’ll": "we shall",
    "we’re": "we are",
    "we’ve": "we have",
    "weren’t": "were not",
    "what’d": "what did",
    "what’ll": "what shall",
    "what’re": "what are",
    "what’s": "what has",
    "what’ve": "what have",
    "when’s": "when has",
    "where’d": "where did",
    "where’ll": "where shall",
    "where’re": "where are",
    "where’s": "where has",
    "where’ve": "where have",
    "which’d": "which had",
    "which’ll": "which shall",
    "which’re": "which are",
    "which’s": "which has",
    "which’ve": "which have",
    # Listed before "who’d" so the longer form is matched first.
    "who’d’ve": "who would have",
    "who’d": "who would",
    "who’ll": "who shall",
    "who’re": "who are",
    "who’s": "who has",
    "who’ve": "who have",
    "why’d": "why did",
    "why’re": "why are",
    "why’s": "why has",
    "won’t": "will not",
    "would’ve": "would have",
    "wouldn’t": "would not",
    "y’all": "you all",
    "you’d": "you had",
    "you’ll": "you shall",
    "you’re": "you are",
    "you’ve": "you have",
    # Generic fallback for any remaining ’s. Kept last so the specific
    # entries above are tried first. It also expands possessives
    # ("Tom’s" becomes "Tom is").
    "’s": " is",
}
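
def tokenize(text):
    """Expand known contractions, then split the text into word tokens."""
    normalized_text = text
    # Replacements run in the dict's insertion order, so an earlier key can
    # shadow a later key that contains it.
    for contraction, expansion in contractions.items():
        normalized_text = normalized_text.replace(contraction, expansion)
    # Turn every non-word character into a newline, then drop the empty pieces.
    normalized_text = re.sub(r'\W', '\n', normalized_text)
    return [token for token in normalized_text.split('\n') if token.strip() != '']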

for text in (text1, text2):
    print("=" * 42)
    print("Text:")
    print(text)
    print("\n")
    tokens = tokenize(text)
    print("{} tokens found:".format(len(tokens)))
    for token in tokens:
        print(token)