@gaphex
Created May 9, 2019 15:04
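import nltk

# tokenizer that keeps runs of word characters (letters, digits, underscore)
regex_tokenizer = nltk.RegexpTokenizer(r"\w+")


def normalize_text(text):
    # lowercase text
    text = str(text).lower()
    # drop characters that cannot be encoded as UTF-8 (e.g. lone surrogates)
    text = text.encode("utf-8", "ignore").decode()
    # remove punctuation symbols
    text = " ".join(regex_tokenizer.tokenize(text))
    return text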


def count_lines(filename):
    # count lines by streaming the file rather than loading it into memory
    count = 0
    with open(filename) as fi:
        for _ in fi:
            count += 1
    return count
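
A minimal usage sketch of the two helpers above, written so it runs without NLTK installed. It assumes that `re.findall(r"\w+", ...)` is an acceptable stand-in for `nltk.RegexpTokenizer(r"\w+")`. The names `normalize_text_sketch`, `count_lines_sketch` and `demo`, the sample strings, and the explicit `encoding="utf-8"` are hypothetical and not part of the original gist.

```python
import os
import re
import tempfile

# Assumption: stdlib stand-in for nltk.RegexpTokenizer(r"\w+"), same pattern.
WORD_PATTERN = re.compile(r"\w+")


def normalize_text_sketch(text):
    """Lowercase, drop characters that cannot be encoded as UTF-8, strip punctuation."""
    text = str(text).lower()
    text = text.encode("utf-8", "ignore").decode()
    return " ".join(WORD_PATTERN.findall(text))


def count_lines_sketch(filename):
    """Count lines by iterating over the file, so it is never read in one piece."""
    # Assumption: the file is UTF-8; the original relies on the platform default.
    with open(filename, encoding="utf-8") as fi:
        return sum(1 for _ in fi)


def demo():
    """Write hypothetical sample lines to a temporary file, then normalize and count them."""
    samples = ["Hello, World!", "It's 9 a.m. -- time to go."]
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("\n".join(samples) + "\n")
        path = tmp.name
    try:
        normalized = [normalize_text_sketch(s) for s in samples]
        return normalized, count_lines_sketch(path)
    finally:
        os.remove(path)


print(demo())
```

Note that apostrophes and periods act as separators, so "It's" becomes the two tokens "it" and "s". If contractions matter for the downstream task, a different pattern would be needed.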