Shashank Gupta shashankg7

Expand The Edinburgh Twitter FSD Corpus

The Python scripts attached here take care of the following tedious work, and should help one quickly get started with some real work on the corpus:

Respect the Twitter API rate limits and throttle API hits.
Don't hit the API for already expanded tweet ID's, so you can resume tweet expansion after stopping midway.
Parse the API response and dump it into the correct column in the sqlite3 database.
Gracefully handle exceptions while acquiring tweets from the API.
Wrap version 1.1 of the Twitter API.
Start from a specified tweet ID, assuming the input file is sorted in increasing order of tweet ID.

	package topic

	import spark.broadcast._
	import spark.SparkContext
	import spark.SparkContext._
	import spark.RDD
	import spark.storage.StorageLevel
	import scala.util.Random
	import scala.math.{ sqrt, log, pow, abs, exp, min, max }
	import scala.collection.mutable.HashMap

	This is free and unencumbered software released into the public domain.

	Anyone is free to copy, modify, publish, use, compile, sell, or
	distribute this software, either in source code form or as a compiled
	binary, for any purpose, commercial or non-commercial, and by any
	means.

	In jurisdictions that recognize copyright laws, the author or authors
	of this software dedicate any and all copyright interest in the
	software to the public domain. We make this dedication for the benefit

	"""Information Retrieval metrics

	Useful Resources:
	http://www.cs.utexas.edu/~mooney/ir-course/slides/Evaluation.ppt
	http://www.nii.ac.jp/TechReports/05-014E.pdf
	http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf
	http://hal.archives-ouvertes.fr/docs/00/72/67/60/PDF/07-busa-fekete.pdf
	Learning to Rank for Information Retrieval (Tie-Yan Liu)
	"""
	import numpy as np