Shashank Gupta (shashankg7)

@bwhite
bwhite / rank_metrics.py
Created September 15, 2012 03:23
Ranking Metrics
"""Information Retrieval metrics
Useful Resources:
http://www.cs.utexas.edu/~mooney/ir-course/slides/Evaluation.ppt
http://www.nii.ac.jp/TechReports/05-014E.pdf
http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf
http://hal.archives-ouvertes.fr/docs/00/72/67/60/PDF/07-busa-fekete.pdf
Learning to Rank for Information Retrieval (Tie-Yan Liu)
"""
import numpy as np
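
The preview stops at the numpy import; the gist itself goes on to define the usual IR metrics (MRR, precision@k, MAP, DCG/NDCG). To give a flavor, here is a minimal editorial sketch of DCG@k and NDCG@k over a vector of graded relevance scores, reusing the numpy import above — not necessarily the gist's exact code:

def dcg_at_k(r, k):
    """Discounted cumulative gain at rank k for relevance scores r."""
    r = np.asarray(r, dtype=float)[:k]
    if r.size == 0:
        return 0.0
    # Gain at rank i is discounted by log2(i + 1), with ranks starting at 1.
    return np.sum(r / np.log2(np.arange(2, r.size + 2)))

def ndcg_at_k(r, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(r, reverse=True), k)
    return dcg_at_k(r, k) / ideal if ideal > 0 else 0.0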
@emaadmanzoor
emaadmanzoor / ExpandEdinburghFSDCorpus.md
Last active October 31, 2020 20:30
Expand the Edinburgh Twitter FSD corpus

Expand The Edinburgh Twitter FSD Corpus

The Python scripts attached here take care of the following tedious work, and should help one quickly get started with some real work on the corpus (a sketch of the throttle-and-resume loop follows the list):

  • Respect the Twitter API rate limits and throttle API hits.
  • Don't hit the API for already-expanded tweet IDs, so you can resume tweet expansion after stopping midway.
  • Parse the API response and dump it into the correct column in the sqlite3 database.
  • Gracefully handle exceptions while acquiring tweets from the API.
  • Wrap version 1.1 of the Twitter API.
  • Start from a specified tweet ID, assuming the input file is sorted in increasing order of tweet ID.
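
A minimal sketch of the throttle-and-resume loop described above (the endpoint is Twitter API v1.1's statuses/show; the database schema, request budget, and function name are hypothetical — the gist's attached scripts differ):

import sqlite3
import time
import requests

RATE_WINDOW = 15 * 60   # v1.1 rate-limit window in seconds
MAX_HITS = 180          # hypothetical per-window request budget

def expand(tweet_ids, bearer_token, db_path="tweets.db"):
    conn = sqlite3.connect(db_path)
    # Resume support: skip IDs whose JSON column is already filled in.
    done = {row[0] for row in conn.execute(
        "SELECT id FROM tweets WHERE json IS NOT NULL")}
    hits = 0
    for tid in tweet_ids:
        if tid in done:
            continue
        if hits >= MAX_HITS:          # throttle: sleep out the window
            time.sleep(RATE_WINDOW)
            hits = 0
        try:
            r = requests.get(
                "https://api.twitter.com/1.1/statuses/show.json",
                params={"id": tid},
                headers={"Authorization": "Bearer " + bearer_token})
            conn.execute("UPDATE tweets SET json = ? WHERE id = ?",
                         (r.text, tid))
            conn.commit()
        except requests.RequestException:
            continue                  # gracefully skip failed fetches
        hits += 1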
@yanofsky
yanofsky / LICENSE
Last active October 17, 2024 22:49
A script to download all of a user's tweets into a csv
This is free and unencumbered software released into the public domain.
Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.
In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and successors.
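
The script itself (the text above is the Unlicense) pages backwards through a user's timeline with tweepy and writes the rows to a CSV. A condensed sketch of that approach — credentials and field choices are placeholders, not the gist's exact code:

import csv
import tweepy

def dump_tweets(screen_name, consumer_key, consumer_secret,
                access_key, access_secret):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    # The API caps each page at 200 tweets; page backwards by asking
    # for tweets older than the oldest one fetched so far.
    tweets = api.user_timeline(screen_name=screen_name, count=200)
    page = tweets
    while page:
        oldest = tweets[-1].id - 1
        page = api.user_timeline(screen_name=screen_name, count=200,
                                 max_id=oldest)
        tweets.extend(page)

    with open("%s_tweets.csv" % screen_name, "w") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "text"])
        writer.writerows([t.id_str, t.created_at, t.text] for t in tweets)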
@waleking
waleking / SparkGibbsLDA.scala
Last active January 31, 2020 11:15
We implement Gibbs sampling for LDA on Spark. This version performs much better than the alpha version and can now handle 3,196,204 words, 100 topics, and 1,000 sampling iterations on a server in 161.7 minutes. To avoid the slow collect() step of the alpha version, we use cache() at lines 261 and 262. We also solve a pile o…
package topic
import spark.broadcast._
import spark.SparkContext
import spark.SparkContext._
import spark.RDD
import spark.storage.StorageLevel
import scala.util.Random
import scala.math.{ sqrt, log, pow, abs, exp, min, max }
import scala.collection.mutable.HashMap
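
Only the imports survive in this preview. As a concrete picture of the underlying technique (not the gist's distributed Spark code), here is a minimal single-machine collapsed Gibbs sampling sweep for LDA in plain Python/numpy; all names are illustrative:

import numpy as np

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, V):
    """One full sweep of collapsed Gibbs sampling for LDA.

    docs: list of word-id lists; z: per-token topic assignments;
    n_dk, n_kw, n_k: count matrices kept in sync with z; V: vocab size.
    """
    K = n_k.shape[0]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove the token's current assignment from the counts.
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # Full conditional p(z = k | everything else), up to a constant.
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = np.random.choice(K, p=p / p.sum())
            # Record the freshly sampled assignment.
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1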
@clemsos
clemsos / gensim_workflow.py
Last active February 22, 2022 11:09
How to calculate TF-IDF similarity matrix of a complete corpus with Gensim
#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
This script just shows the basic workflow for computing a TF-IDF similarity matrix with Gensim
OUTPUT :
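
The snippet is cut off at the OUTPUT note. The basic workflow it refers to looks roughly like this with Gensim's public API (raw_documents is a placeholder for whatever corpus you load):

from gensim import corpora, models, similarities

texts = [doc.lower().split() for doc in raw_documents]

dictionary = corpora.Dictionary(texts)                 # token -> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors

tfidf = models.TfidfModel(corpus)                      # fit IDF weights
index = similarities.SparseMatrixSimilarity(
    tfidf[corpus], num_features=len(dictionary))

sims = index[tfidf[corpus]]   # pairwise cosine-similarity matrix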
@macks22
macks22 / pmf-and-modified-bpmf-pymc.py
Last active May 13, 2021 13:37
Probabilistic Matrix Factorization (PMF) + Modified Bayesian PMF (mBPMF)
"""
Implementations of:
Probabilistic Matrix Factorization (PMF) [1],
Bayesian PMF (BPMF) [2],
Modified BPMF (mBPMF)
using `pymc3`. mBPMF is, to my knowledge, my own creation. It is an attempt
to circumvent the limitations of `pymc3` with regard to the Wishart distribution:
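
The preview ends at that colon. For context, PMF reduces to Gaussian priors over the latent factor matrices and a Gaussian likelihood over the observed ratings; a minimal pymc3 sketch follows, with dummy data, illustrative dimensions, and precisions that need not match the gist's choices:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

rng = np.random.default_rng(0)
R = rng.random((50, 40))                 # dummy ratings matrix
R[rng.random(R.shape) < 0.3] = np.nan    # hide 30% as unobserved
n, m, dim = R.shape[0], R.shape[1], 10

with pm.Model() as pmf:
    # Zero-mean spherical Gaussian priors on user and item factors.
    U = pm.Normal('U', mu=0.0, sd=1.0, shape=(n, dim))
    V = pm.Normal('V', mu=0.0, sd=1.0, shape=(m, dim))
    # Gaussian likelihood; masked entries are treated as missing.
    R_obs = pm.Normal('R_obs', mu=tt.dot(U, V.T), sd=0.5,
                      observed=np.ma.masked_invalid(R))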
@julienr
julienr / sklearn_classif_report_to_latex.py
Created October 26, 2015 16:04
Parse and convert scikit-learn classification_report to latex
"""
Code to parse sklearn classification_report
"""
##
import sys
import collections
##
def parse_classification_report(clfreport):
    """
    Parse a sklearn classification report into a dict keyed by class name
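    and containing (precision, recall, fscore, support) tuples.

    The preview is truncated here; the body below is an editorial
    sketch of the parsing step, not necessarily the gist's exact code.
    """
    lines = [line for line in clfreport.split('\n') if line.strip()]
    data = collections.OrderedDict()
    for line in lines[1:]:            # skip the header row
        tokens = line.strip().split()
        name = ' '.join(tokens[:-4])  # class names may contain spaces
        data[name] = tuple(float(tok) for tok in tokens[-4:])
    return data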
@bishboria
bishboria / springer-free-maths-books.md
Last active October 3, 2024 09:17
Springer made a bunch of books available for free; these were the direct links
@joshloyal
joshloyal / ngram_cnn.py
Created March 11, 2016 15:29
Convolutional Network for Sentence Classification (Keras)
from keras.models import Graph
from keras.layers import containers
from keras.layers.core import Dense, Dropout, Activation, Reshape, Flatten
from keras.layers.embeddings import Embedding
from keras.layers.convolutional import Convolution2D, MaxPooling2D
def ngram_cnn(n_vocab, max_length, embedding_size, ngram_filters=[2, 3, 4, 5], n_feature_maps=100, dropout=0.5, n_hidden=15):
"""A single-layer convolutional network using different n-gram filters.
Parameters
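    ----------
    n_vocab : int
        vocabulary size
    max_length : int
        maximum input sequence length
    embedding_size : int
        dimensionality of the word embeddings
    ngram_filters : list of int
        n-gram widths of the convolution filters
    n_feature_maps : int
        number of feature maps per n-gram width
    dropout : float
        dropout rate
    n_hidden : int
        size of the final hidden layer

    (Parameter names come from the signature above; the descriptions are
    editorial, since the gist's own docstring is truncated in this preview.)
    """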
@wjohnson
wjohnson / recsys-pyspark.py
Last active June 17, 2020 10:01
Using Pyspark's ALS Matrix Factorization Model for RecSys
#Get the data here http://grouplens.org/datasets/movielens/
movielens = sc.textFile("../in/ml-100k/u.data")
movielens.first() #u'196\t242\t3\t881250949'
movielens.count() #100000
#Clean up the data by splitting it
#Movielens readme says the data is split by tabs and
#is user product rating timestamp
clean_data = movielens.map(lambda x:x.split('\t'))
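
The preview stops after the split; a hedged sketch of how training would typically continue with MLlib's ALS API (rank, iteration count, and the example user ID are illustrative):

from pyspark.mllib.recommendation import ALS, Rating

#Convert the split fields into (user, product, rating) tuples
ratings = clean_data.map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))
#Train the ALS matrix factorization model
model = ALS.train(ratings, rank=10, iterations=10)
#Recommend 5 products for user 196
print(model.recommendProducts(196, 5))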