@gupul2k
Created December 10, 2012 22:59
Find Most Frequent 500 BoWs (Bag of Words)
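#!/usr/bin/python
# Script to generate the 500 most frequent BoWs (bag of words) from a corpus (i.e. a lexicon).
# Date: Nov 2 2012
# Author: Hota Sobhan
from string import punctuation
from operator import itemgetter

N = 500  # number of top words to report
words = {}  # maps each word to its occurrence count

# Raw string so the backslashes in the Windows path are kept literally.
with open(r"C:\Python27\Corpus.txt") as corpus:
    # Strip surrounding punctuation and lowercase every whitespace-separated token.
    words_gen = (word.strip(punctuation).lower()
                 for line in corpus
                 for word in line.split())
    for word in words_gen:
        if word:  # skip tokens that consisted only of punctuation
            words[word] = words.get(word, 0) + 1

# Sort by count, highest first, and keep the top N entries.
top_words = sorted(words.items(), key=itemgetter(1), reverse=True)[:N]
for word, frequency in top_words:
    print("%s %d" % (word, frequency))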
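
The counting and sorting steps in the script above could also be expressed with `collections.Counter` from the standard library. The following is only a minimal sketch, assuming Python 2.7 or later. The function name `most_frequent_words` and the `sample` lines are hypothetical, introduced here for illustration; they are not part of the original script, and the sketch reads from an in-memory list rather than from `Corpus.txt`.

```python
from collections import Counter
from string import punctuation


def most_frequent_words(lines, n=500):
    """Return the n most frequent (word, count) pairs from an iterable of lines.

    Hypothetical helper for illustration; not part of the original script.
    """
    counts = Counter()
    for line in lines:
        for token in line.split():
            # Same normalisation as the script: strip punctuation, lowercase.
            word = token.strip(punctuation).lower()
            if word:  # skip tokens that consisted only of punctuation
                counts[word] += 1
    # most_common(n) returns the pairs sorted by count, highest first.
    return counts.most_common(n)


# Illustrative input standing in for the lines of a corpus file.
sample = ["The cat sat.", "The dog sat -- quietly!", "the end"]
for word, frequency in most_frequent_words(sample, n=2):
    print("%s %d" % (word, frequency))
# prints:
# the 3
# sat 2
```

To run it against a real corpus, an open file object can be passed in place of `sample`, since the function only needs an iterable of lines.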