Rohit Dholakia rohitdholakia

rohitdholakia / WritingTweets.py
Created December 27, 2011 21:24
Python script to write all tweets into a single file
# Take only the tweet text from a file of (user, time, tweet) records and
# write it into another file.
import sys
import itertools

tweetsFile = open(sys.argv[1], 'r')  # input: three lines per record, tweet on the third
newFile = open(sys.argv[2], 'w')     # output: only tweets, no user or time data
for line1, line2, line3 in itertools.izip_longest(*[tweetsFile] * 3):
    fields = line3.split('\t')  # split on tab; we only need the right side, the tweet
    newFile.write(fields[1].rstrip('\n') + "\n")
newFile.close()
rohitdholakia / dictionary.py
Created December 30, 2011 06:46
Generating a dictionary from all mails - spam + non-spam
# This script reads all files in all directories of the folder taken from the
# openClassroom site and generates the dictionary, which we can then store in a file.
import os, sys
from collections import defaultdict

# We need a dictionary to store word occurrences. A defaultdict lets us update
# the frequencies without checking for missing keys; we write it all out at the end.
dictionary = defaultdict(int)
listWords = []
fdict = open(sys.argv[2], 'w')  # file to write all the dictionary entries into
for root, dirnames, filenames in os.walk(sys.argv[1]):
    for d in dirnames:  # for each directory
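The preview cuts off inside the directory walk. A self-contained sketch of the counting pass it describes, with an illustrative function name (`build_dictionary` is not part of the original gist):

```python
import os
from collections import defaultdict

def build_dictionary(corpus_dir):
    """Walk every file under corpus_dir and count word occurrences."""
    counts = defaultdict(int)
    for root, dirnames, filenames in os.walk(corpus_dir):
        for name in filenames:
            with open(os.path.join(root, name)) as mail:
                for line in mail:
                    for word in line.split():
                        counts[word] += 1
    return counts
```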
rohitdholakia / Training.py
Created December 30, 2011 06:49
This is for training the Bayesian model
# This is for training. Calculate all the probabilities and store them in a vector.
# Better to store them in a file for easier access.
from __future__ import division
import sys, os
from collections import defaultdict
'''
1. The corpus is already split 50/50 between spam and non-spam, so the class priors are both 0.5.
2. Now we need to calculate the probability of each word, in spam and non-spam separately.
   2.1 We can make two dictionaries, defaultdicts basically, for spam and non-spam.
   2.2 When the time comes to calculate probabilities, we just substitute the counts.
'''
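The plan laid out in the docstring can be sketched as follows. This is an illustrative version, not the gist's own code: `train` and the add-one smoothing choice are assumptions.

```python
from __future__ import division
import math

def train(spam_counts, ham_counts):
    """Return log P(word | class) for every word, with add-one smoothing."""
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    spam_logprob, ham_logprob = {}, {}
    for w in vocab:
        spam_logprob[w] = math.log((spam_counts.get(w, 0) + 1) / (spam_total + len(vocab)))
        ham_logprob[w] = math.log((ham_counts.get(w, 0) + 1) / (ham_total + len(vocab)))
    return spam_logprob, ham_logprob
```

Log probabilities avoid underflow when many per-word probabilities are combined at classification time.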
rohitdholakia / classification.py
Created December 30, 2011 06:54
Classification
'''Now, given a mail, split it on spaces, then add up the log probability of each word.
Add the log of the spam prior. Do the same thing for non-spam.
Whichever total is higher wins. Let's start.
'''
import sys, os

def makeDict(f):
    '''Read a "word probability" file back into a dictionary.'''
    temp = {}
    data = open(f, 'r')
    for line in data:
        prob = line.split(" ")
        temp[prob[0]] = float(prob[1])
    return temp
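The comparison rule the docstring describes can be sketched like this; `classify` and the hard-coded 0.5 priors are illustrative assumptions, not the gist's own code:

```python
import math

def classify(mail, spam_logprob, ham_logprob):
    """Return 'spam' or 'non-spam' for one mail, given per-word log probabilities."""
    spam_score = math.log(0.5)  # class priors are both 0.5 in this corpus
    ham_score = math.log(0.5)
    for word in mail.split(" "):
        spam_score += spam_logprob.get(word, 0.0)  # unseen words contribute nothing here
        ham_score += ham_logprob.get(word, 0.0)
    return 'spam' if spam_score > ham_score else 'non-spam'
```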
rohitdholakia / dict.txt
Created December 30, 2011 07:12
Snapshot of the dictionary file
junk 29
juno 14
june 124
jung 2
cogscus 21
expensive 24
leat 10
leader 8
locate 62
slur 2
rohitdholakia / Vector(s).txt
Created December 30, 2011 12:42
Snapshot of the vector file
Spam values :
latinoweb -12.1396550267
yellow -11.2233642948
four -7.942453079
woody -12.1396550267
payoff -11.7341899186
looking -11.7341899186
eligible -11.4465078461
electricity -11.7341899186
lord -10.8868920582
rohitdholakia / roles.xml
Created January 4, 2012 19:32
A sample of the XML file
<security-role>
<role-name>Employee</role-name>
</security-role>
<security-role>
<role-name>CEO</role-name>
</security-role>
<security-role>
<role-name>CFO</role-name>
</security-role>
rohitdholakia / web.xml
Created January 4, 2012 19:36
Complete role-based access XML file
<servlet>
<servlet-name>Servlet_1</servlet-name>
<servlet-class>com.Servlet_1</servlet-class>
</servlet>
<servlet>
<servlet-name>Servlet_2</servlet-name>
<servlet-class>com.Servlet_2</servlet-class>
</servlet>
<servlet>
rohitdholakia / SQLite.txt
Created January 9, 2012 07:59
SQL Queries to set up Twitter data processing
crazyabtliv@linux-hknk:~> sqlite3 twitter.db
SQLite version 3.7.9 2011-11-01 00:52:41
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .tables
sqlite> create table tweets (
...> username string,
...> tweet string);
sqlite> .tables
tweets
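The table created in the shell session above can also be populated from Python with the standard `sqlite3` module. A hedged sketch: `load_tweets` and the (username, tweet) row layout are assumptions for illustration.

```python
import sqlite3

def load_tweets(db_path, rows):
    """Create the tweets table (if needed) and bulk-insert (username, tweet) pairs."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("create table if not exists tweets (username string, tweet string)")
    cur.executemany("insert into tweets values (?, ?)", rows)  # parameterized insert
    conn.commit()
    n = cur.execute("select count(*) from tweets").fetchone()[0]
    conn.close()
    return n
```

Parameterized `?` placeholders let tweet text contain quotes or tabs without breaking the SQL.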
rohitdholakia / preprocessing02.py
Created January 9, 2012 09:10
Version 0.2 of Py script to preprocess tweets
'''The aim is to take all the tweets of a user and store them in a table. Do this for
every user, and then see what we can do with the data. We want to gather enough
information about each user to profile them better. So, let's get started.
'''
import re

# The pattern definitions are cut off in this preview; these are illustrative stand-ins.
regRT = re.compile(r'\bRT\b')     # retweet markers
regAt = re.compile(r'@\w+')       # @username mentions
regHttp = re.compile(r'http\S+')  # links

def regexSub(line):
    '''Strip retweet markers, mentions, and links from one tweet.'''
    line = re.sub(regRT, '', line)
    line = re.sub(regAt, '', line)
    line = line.lstrip(' ')
    line = re.sub(regHttp, '', line)
    return line

def userName(line):
    return line[19:]  # username starts at a fixed offset in the raw line