Rohit Dholakia rohitdholakia

rohitdholakia / WritingTweets.py
Created December 27, 2011 21:24
Python script to write all tweets into a single file
# Take only the tweet text from a file of (user, time, tweet) records and
# write it into another file.
import sys
import itertools

tweetsFile = open(sys.argv[1], 'r')  # input: three lines per record, tweet on the third
newFile = open(sys.argv[2], 'w')     # output: only tweets, no user or time data
for line1, line2, line3 in itertools.izip_longest(*[tweetsFile] * 3):
    fields = line3.split('\t')  # split on tab; we only need the right side, the tweet
    newFile.write(fields[1].rstrip('\n') + "\n")
newFile.close()
rohitdholakia / dictionary.py
Created December 30, 2011 06:46
Generating a dictionary from all mails - spam + non-spam
# This script reads all files in all directories of the folder taken from the
# openClassroom site and generates the dictionary, which we can then store in a file.
import os, sys
from collections import defaultdict

# We need a dictionary to store word occurrences. A defaultdict lets us update
# the frequencies without checking for missing keys; we write it all out at the end.
dictionary = defaultdict(int)
listWords = []
fdict = open(sys.argv[2], 'w')  # file to write all the dictionary entries into
for root, dirnames, filenames in os.walk(sys.argv[1]):
    for d in dirnames:  # for each directory
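The preview cuts off inside the directory walk. A self-contained sketch of the counting pass it describes, with an illustrative function name (`build_dictionary` is not part of the original gist):

```python
import os
from collections import defaultdict

def build_dictionary(corpus_dir):
    """Walk every file under corpus_dir and count word occurrences."""
    counts = defaultdict(int)
    for root, dirnames, filenames in os.walk(corpus_dir):
        for name in filenames:
            with open(os.path.join(root, name)) as mail:
                for line in mail:
                    for word in line.split():
                        counts[word] += 1
    return counts
```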
rohitdholakia / Training.py
Created December 30, 2011 06:49
This is for training the Bayesian model
# This is for training. Calculate all the probabilities and store them in a vector.
# Better to store them in a file for easier access.
from __future__ import division
import sys, os
from collections import defaultdict
'''
1. The corpus is already split 50/50 between spam and non-spam, so the class priors are both 0.5.
2. Now we need to calculate the probability of each word, in spam and non-spam separately.
   2.1 We can make two dictionaries, defaultdicts basically, for spam and non-spam.
   2.2 When the time comes to calculate probabilities, we just substitute the counts.
'''
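The plan laid out in the docstring can be sketched as follows. This is an illustrative version, not the gist's own code: `train` and the add-one smoothing choice are assumptions.

```python
from __future__ import division
import math

def train(spam_counts, ham_counts):
    """Return log P(word | class) for every word, with add-one smoothing."""
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    spam_logprob, ham_logprob = {}, {}
    for w in vocab:
        spam_logprob[w] = math.log((spam_counts.get(w, 0) + 1) / (spam_total + len(vocab)))
        ham_logprob[w] = math.log((ham_counts.get(w, 0) + 1) / (ham_total + len(vocab)))
    return spam_logprob, ham_logprob
```

Log probabilities avoid underflow when many per-word probabilities are combined at classification time.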
rohitdholakia / classification.py
Created December 30, 2011 06:54
Classification
'''Now, given a mail, split it on spaces, then add up the log probability of each word.
Add the log of the spam prior. Do the same thing for non-spam.
Whichever total is higher wins. Let's start.
'''
import sys, os

def makeDict(f):
    '''Read a "word probability" file back into a dictionary.'''
    temp = {}
    data = open(f, 'r')
    for line in data:
        prob = line.split(" ")
        temp[prob[0]] = float(prob[1])
    return temp
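The comparison rule the docstring describes can be sketched like this; `classify` and the hard-coded 0.5 priors are illustrative assumptions, not the gist's own code:

```python
import math

def classify(mail, spam_logprob, ham_logprob):
    """Return 'spam' or 'non-spam' for one mail, given per-word log probabilities."""
    spam_score = math.log(0.5)  # class priors are both 0.5 in this corpus
    ham_score = math.log(0.5)
    for word in mail.split(" "):
        spam_score += spam_logprob.get(word, 0.0)  # unseen words contribute nothing here
        ham_score += ham_logprob.get(word, 0.0)
    return 'spam' if spam_score > ham_score else 'non-spam'
```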
rohitdholakia / dict.txt
Created December 30, 2011 07:12
Snapshot of the dictionary file
junk 29
juno 14
june 124
jung 2
cogscus 21
expensive 24
leat 10
leader 8
locate 62
slur 2
rohitdholakia / Vector(s).txt
Created December 30, 2011 12:42
Snapshot of the vector file
Spam values :
latinoweb -12.1396550267
yellow -11.2233642948
four -7.942453079
woody -12.1396550267
payoff -11.7341899186
looking -11.7341899186
eligible -11.4465078461
electricity -11.7341899186
lord -10.8868920582
rohitdholakia / roles.xml
Created January 4, 2012 19:32
A sample of the XML file
<security-role>
<role-name>Employee</role-name>
</security-role>
<security-role>
<role-name>CEO</role-name>
</security-role>
<security-role>
<role-name>CFO</role-name>
</security-role>
rohitdholakia / web.xml
Created January 4, 2012 19:36
Complete role-based access XML file
<servlet>
<servlet-name>Servlet_1</servlet-name>
<servlet-class>com.Servlet_1</servlet-class>
</servlet>
<servlet>
<servlet-name>Servlet_2</servlet-name>
<servlet-class>com.Servlet_2</servlet-class>
</servlet>
<servlet>
rohitdholakia / SQLite.txt
Created January 9, 2012 07:59
SQL Queries to set up Twitter data processing
crazyabtliv@linux-hknk:~> sqlite3 twitter.db
SQLite version 3.7.9 2011-11-01 00:52:41
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .tables
sqlite> create table tweets (
...> username string,
...> tweet string);
sqlite> .tables
tweets
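The table created in the shell session above can also be populated from Python with the standard `sqlite3` module. A hedged sketch: `load_tweets` and the (username, tweet) row layout are assumptions for illustration.

```python
import sqlite3

def load_tweets(db_path, rows):
    """Create the tweets table (if needed) and bulk-insert (username, tweet) pairs."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("create table if not exists tweets (username string, tweet string)")
    cur.executemany("insert into tweets values (?, ?)", rows)  # parameterized insert
    conn.commit()
    n = cur.execute("select count(*) from tweets").fetchone()[0]
    conn.close()
    return n
```

Parameterized `?` placeholders let tweet text contain quotes or tabs without breaking the SQL.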
rohitdholakia / preprocessing02.py
Created January 9, 2012 09:10
Version 0.2 of Py script to preprocess tweets
'''The aim is to take all the tweets of a user and store them in a table. Do this for
every user, and then see what we can do with the data. We want to gather enough
information about each user to profile them better. So, let's get started.
'''
import re

# The pattern definitions are cut off in this preview; these are illustrative stand-ins.
regRT = re.compile(r'\bRT\b')     # retweet markers
regAt = re.compile(r'@\w+')       # @username mentions
regHttp = re.compile(r'http\S+')  # links

def regexSub(line):
    '''Strip retweet markers, mentions, and links from one tweet.'''
    line = re.sub(regRT, '', line)
    line = re.sub(regAt, '', line)
    line = line.lstrip(' ')
    line = re.sub(regHttp, '', line)
    return line

def userName(line):
    return line[19:]  # username starts at a fixed offset in the raw line