@rohitdholakia
Created December 30, 2011 06:46
Generating a dictionary from all mails - spam + non-spam
# This script reads every file in every directory under the folder taken from the
# OpenClassroom site and builds a word-frequency dictionary, which is then written to a file.
# Usage: python <script> <mail_folder> <output_file>
import os
import sys
from collections import defaultdict

# A defaultdict(int) starts every unseen word at 0, so frequencies can be updated directly.
dictionary = defaultdict(int)

# os.walk already descends into each subdirectory, so join each file name to its own root.
for root, dirnames, filenames in os.walk(sys.argv[1]):
    for f in filenames:
        # errors='replace' (Python 3) keeps one badly encoded mail from aborting the run
        with open(os.path.join(root, f), 'r', errors='replace') as data:
            for line in data:
                # split() with no argument splits on any whitespace and drops the trailing newline
                for w in line.split():
                    dictionary[w] += 1

# Write all entries at once, one "word count" pair per line.
with open(sys.argv[2], 'w') as fdict:
    for word, freq in dictionary.items():
        fdict.write(word + " " + str(freq) + "\n")
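
The output file holds one "word count" pair per line. A minimal sketch of reading it back and listing the most frequent words, assuming that format. The helper names `load_dictionary` and `top_words` are hypothetical and not part of the original script:

```python
from collections import Counter


def load_dictionary(path):
    """Read 'word count' lines back into a Counter (hypothetical helper)."""
    counts = Counter()
    with open(path, 'r', errors='replace') as fh:
        for line in fh:
            # The count is the last space-separated field on the line.
            parts = line.rsplit(" ", 1)
            if len(parts) != 2 or not parts[0]:
                continue  # skip lines that do not match the expected format
            word, freq = parts
            try:
                counts[word] += int(freq)
            except ValueError:
                continue  # skip lines whose count is not an integer
    return counts


def top_words(counts, n=10):
    """Return the n most frequent (word, count) pairs, highest count first."""
    return counts.most_common(n)
```

For example, `top_words(load_dictionary("dictionary.txt"), 20)` would list the 20 most frequent words, where `dictionary.txt` stands in for whatever output path was passed to the script.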