@leongkui
Last active July 4, 2023 01:15
Python Bigram
import re

# plain_text is assumed to be a string defined earlier
# replace every punctuation character in plain_text with a space and store the result in `cleaned`
# (\w also matches underscores, so those are kept)
cleaned = re.sub(r'[^\w\s]', ' ', plain_text)
# collapse each run of whitespace in `cleaned` into a single space
cleaned = re.sub(r'\s+', ' ', cleaned)
# split the cleaned text into a list of words
words = cleaned.split()
# create a list to store the bigrams
bigrams = []
# iterate through the words list, joining each word with the next one to form a bigram
for i in range(len(words) - 1):
    bigrams.append(words[i] + ' ' + words[i + 1])
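
The same steps can be wrapped in a reusable function. This is only a minimal sketch: the name `make_bigrams` and the sample sentence are assumptions for illustration and are not part of the original gist. It pairs each word with its successor using `zip` instead of an index loop.

```python
import re


def make_bigrams(text):
    """Return the list of space-joined word bigrams found in `text`.

    `make_bigrams` is a hypothetical helper name, not from the original gist.
    """
    # Replace punctuation with spaces; \w keeps letters, digits and underscores.
    cleaned = re.sub(r'[^\w\s]', ' ', text)
    # str.split() with no arguments already splits on any run of whitespace.
    words = cleaned.split()
    # zip pairs each word with the one that follows it.
    return [first + ' ' + second for first, second in zip(words, words[1:])]


if __name__ == '__main__':
    sample = "Hello, world! How are you?"  # illustrative input only
    print(make_bigrams(sample))
    # → ['Hello world', 'world How', 'How are', 'are you']
```

Text with fewer than two words yields an empty list, matching the behaviour of the loop above.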