darkseed’s gists

darkseed / gist:9b2361a551c4bfeb635d

Last active August 29, 2015 14:20 — forked from debasishg/gist:8172796

Probabilistic Data Structures for Web Analytics and Data Mining : A great overview of the space of probabilistic data structures and how they are used in approximation algorithm implementation.
Models and Issues in Data Stream Systems
Philippe Flajolet’s contribution to streaming algorithms : A presentation by Jérémie Lumbroso that revisits some of the historical perspectives and how it all began with Flajolet.
Approximate Frequency Counts over Data Streams by Gurmeet Singh Manku & Rajeev Motwani : One of the early papers on the subject.
[Methods for Finding Frequent Items in Data Streams](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.9800&rep

darkseed / ipython\ pyspark.ipynb

Created May 1, 2015 09:08

darkseed / gist:838072bec58a9da9f526

Last active August 29, 2015 14:18 — forked from gwenshap/gist:505b3fa6e478282e03c9

	ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;

	DROP TABLE raw_log;

	CREATE EXTERNAL TABLE raw_log(
	IP STRING,
	timestamp STRING,
	URL STRING,
	referrer STRING,
	user_agent STRING)

darkseed / 00-Setup-IPython-PySpark.ipynb

Last active August 29, 2015 14:18 — forked from fperez/00-Setup-IPython-PySpark.ipynb

darkseed / StanfordNERExample.scala

Last active August 29, 2015 14:16 — forked from seralf/StanfordNERExample.scala

	package ner

	import edu.stanford.nlp.ie.crf.CRFClassifier
	import scala.collection.JavaConversions._
	import scala.collection.JavaConverters._
	import edu.stanford.nlp.ling.CoreAnnotations
	import java.util.ArrayList
	import java.util.HashMap
	import java.util.Map
	import scala.xml.XML

darkseed / KMeansJob.scala

Last active August 29, 2015 14:15 — forked from azymnis/KMeansJob.scala

	import com.twitter.algebird.{Aggregator, Semigroup}
	import com.twitter.scalding._

	import scala.util.Random

	/**
	* This job is a tutorial of sorts for scalding's Execution[T] abstraction.
	* It is a simple implementation of Lloyd's algorithm for k-means on 2D data.
	*
	* http://en.wikipedia.org/wiki/K-means_clustering

darkseed / ItemSimilarity.scala

Last active August 29, 2015 14:15 — forked from azymnis/ItemSimilarity.scala

	import com.twitter.scalding._
	import com.twitter.algebird.{ MinHasher, MinHasher32, MinHashSignature }

	/**
	* Computes similar items (with a string itemId), based on approximate
	* Jaccard similarity, using LSH.
	*
	* Assumes an input data TSV file of the following format:
	*
	* itemId userId

darkseed / BoomerangLogJob.scala

Last active August 29, 2015 14:14 — forked from piotrbelina/BoomerangLogJob.scala

	import cascading.tuple.{Fields, TupleEntry}
	import com.twitter.scalding._
	import java.net.URLDecoder
	import scala.util.matching.Regex

	class BoomerangLogJob(args: Args) extends Job(args) {
	  val input = TextLine(args("input"))
	  val output = TextLine(args("output"))
	  val trap = Tsv(args("trap"))

darkseed / USPopulation

Last active August 29, 2015 14:14 — forked from krishnanraman/USPopulation

	Goal: Process the 12 million plus records
	from: http://seer.cancer.gov/popdata/download.html
	using: a Scala API atop Cascading, aka SCALDING ( Inventors: Avi Bryant, Oscar Boykin, Argyris )
	to find:
	THE FASTEST GROWING COUNTY IN THE UNITED STATES over the 1969-2011 timeframe.
	-----------------------------------------------------------------------------
	RESULTS: Scroll to the very bottom.

	First, the scalding source...
	---

darkseed / KMeansJob.scala

Last active August 29, 2015 14:14 — forked from azymnis/KMeansJob.scala

	import com.twitter.algebird.{Aggregator, Semigroup}
	import com.twitter.scalding._

	import scala.util.Random

	/**
	* This job is a tutorial of sorts for scalding's Execution[T] abstraction.
	* It is a simple implementation of Lloyd's algorithm for k-means on 2D data.
	*
	* http://en.wikipedia.org/wiki/K-means_clustering

Tom Mulder darkseed