@shreydesai
Created June 26, 2019 04:40
Preprocessing rules for Twitter data
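import spacy

# Assumption: `nlp` is a spaCy pipeline; the gist does not show how it is created.
nlp = spacy.load('en_core_web_sm')

def preprocess(raw):
    # Unescape HTML entities; '&amp;' goes last so '&amp;lt;' yields a literal '&lt;'.
    if '&gt;' in raw:
        raw = raw.replace('&gt;', '>')
    if '&lt;' in raw:
        raw = raw.replace('&lt;', '<')
    if '&amp;' in raw:
        raw = raw.replace('&amp;', '&')
    # Normalize curly quotes and apostrophes to their ASCII forms.
    if '”' in raw or '“' in raw:
        raw = raw.replace('“', '"')
        raw = raw.replace('”', '"')
    if '’' in raw:
        raw = raw.replace('’', "'")
    # Tokenize and drop whitespace-only tokens.
    text = [x.text.strip() for x in nlp(raw) if len(x.text.strip()) > 0]
    # Mask URLs and user mentions with placeholder tokens.
    for i in range(len(text)):
        if text[i].startswith('http'):
            text[i] = '<link>'
        elif text[i].startswith('@'):
            text[i] = '<user>'
    return ' '.join(text)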
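
To see the rules in action without installing a tokenizer model, here is a minimal sketch that applies the same replacements and masking but splits on whitespace. The names `preprocess_whitespace` and `REPLACEMENTS` and the sample tweet are hypothetical, and `str.split()` is a stand-in for the `nlp` tokenizer, so token boundaries around punctuation will likely differ from what the gist's function produces.

```python
# Hypothetical stand-in for preprocess(): same rules, but tokens come from
# str.split() rather than the `nlp` pipeline used in the gist.
REPLACEMENTS = [
    ('&gt;', '>'),
    ('&lt;', '<'),
    ('&amp;', '&'),    # applied last, so '&amp;lt;' becomes the literal '&lt;'
    ('\u201c', '"'),   # left curly double quote
    ('\u201d', '"'),   # right curly double quote
    ('\u2019', "'"),   # curly apostrophe
]

def preprocess_whitespace(raw):
    # Unescape entities and normalize quotes, in order.
    for old, new in REPLACEMENTS:
        raw = raw.replace(old, new)
    # Whitespace tokenization (assumption: not the gist's tokenizer).
    out = []
    for tok in raw.split():
        if tok.startswith('http'):
            out.append('<link>')
        elif tok.startswith('@'):
            out.append('<user>')
        else:
            out.append(tok)
    return ' '.join(out)

tweet = '@alice I &lt;3 this \u201cpaper\u201d &amp; code: https://example.com/x'
print(preprocess_whitespace(tweet))
# prints: <user> I <3 this "paper" & code: <link>
```

Keeping the replacements in an ordered list makes the ordering constraint on `&amp;` explicit and leaves room to add further rules.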