@renanreismartins
Last active November 18, 2015 00:12
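#!/usr/bin/env python
import csv
import re
import sys

# A token is a run of characters that are not whitespace, '+', or one of the
# listed punctuation marks. A plain raw string is used because the ur'' prefix
# is a syntax error on Python 3.
p = re.compile(r'[^\s+\,\.\!\?\:\;\"\(\)\<\>\#\$\=\-\/]+')

# Input is tab-separated text on stdin; the first row is a header.
reader = csv.reader(sys.stdin, delimiter='\t')
next(reader, None)  # skip headers
for line in reader:
    # Rows with fewer than five columns have no body field, so skip them.
    if len(line) >= 5:
        body = line[4]
        words = [word.lower() for word in p.findall(body)]
        for word in words:
            # Emit one tab-separated (word, 1) pair per occurrence.
            print('%s\t%s' % (word, 1))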
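To see what the tokenizing pattern keeps and drops, it can be tried on a single string outside the stdin loop. The following is a minimal sketch, not part of the original script: the `tokenize` helper, the `TOKEN` name, and the sample text are hypothetical, and the character class is written without the redundant backslashes, which should match the same characters as the original pattern.

```python
import re

# Assumed equivalent of the script's pattern: a token is a run of characters
# that are not whitespace, '+', or one of the listed punctuation marks.
TOKEN = re.compile(r'[^\s+,.!?:;"()<>#$=\-/]+')


def tokenize(body):
    """Return the lowercased tokens found in body (hypothetical helper)."""
    return [word.lower() for word in TOKEN.findall(body)]


# Made-up sample text, for illustration only.
sample = 'Hello, World! <p>C++ is fun</p>'
print(tokenize(sample))
# prints ['hello', 'world', 'p', 'c', 'is', 'fun', 'p']
```

Because `<`, `>` and `/` are excluded from tokens, tag names such as `p` are emitted as words, and `C++` is reduced to `c`. The apostrophe is not in the excluded set, so a word like `don't` stays a single token.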