Created
May 29, 2012 15:55
Pickling of words from Spanish tweets
import codecs
import pickle

vocab_path = "/Users/dkachaev/repos/hltcoe/tweets-es/data/oov.vocab"
pickle_path = "/Users/dkachaev/repos/hltcoe/tweets-es/data/tweets_es_vocabulary.pickle"

vocab = {}
# Each input line is expected to be "<frequency> <word>", UTF-8 encoded.
with codecs.open(vocab_path, "r", "utf-8") as vocab_file:
    for line in vocab_file:
        try:
            freq, word = line.strip().split(" ")
            # print(word, " - ", freq)
            vocab[word] = {"frequency": int(freq), "context": [""]}
            # Context - "" <- need text of original tweet where word occurred, or 3 tweets ["tweet1", "tweet2", "tweet3"]
        except ValueError:
            # Wrong number of fields, or a frequency that is not an integer.
            print("skipping line")

# Pickle data is binary, so the output file is opened in "wb" mode.
with open(pickle_path, "wb") as f:
    pickle.dump(vocab, f)
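
The context comment in the script leaves the "context" field as a placeholder. A minimal sketch of one way to fill it is shown below. It assumes the tweets are available as a list of strings and that words can be matched by whitespace tokenization; `build_vocab`, `MAX_CONTEXTS`, and the sample data are hypothetical names and values, not part of the original script.

```python
import pickle

MAX_CONTEXTS = 3  # assumption: the comment in the script mentions up to 3 tweets


def build_vocab(vocab_lines, tweets):
    """Parse "<frequency> <word>" lines and attach up to MAX_CONTEXTS tweets per word."""
    vocab = {}
    for line in vocab_lines:
        parts = line.strip().split(" ")
        if len(parts) != 2:
            continue  # skip malformed lines, as the script does
        freq, word = parts
        try:
            vocab[word] = {"frequency": int(freq), "context": []}
        except ValueError:
            continue  # frequency was not an integer

    # Assumption: plain whitespace tokenization is good enough to find a word in a tweet.
    for tweet in tweets:
        for token in set(tweet.split()):
            entry = vocab.get(token)
            if entry is not None and len(entry["context"]) < MAX_CONTEXTS:
                entry["context"].append(tweet)
    return vocab


# Made-up sample data for illustration only.
sample_lines = ["12 hola", "3 jajaja", "bad line here"]
sample_tweets = ["hola a todos", "jajaja que risa", "hola otra vez"]
vocab = build_vocab(sample_lines, sample_tweets)

# Round-trip in memory; a file opened in "wb" / "rb" mode works the same way
# with pickle.dump and pickle.load.
restored = pickle.loads(pickle.dumps(vocab))
print(restored["hola"])
```

Real tweets would likely need a tokenizer that handles punctuation, hashtags, and case, so the whitespace split here is only a stand-in.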