import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg._
import org.apache.spark.{SparkConf, SparkContext}
// To use the latest sparse SVD implementation, build your spark-assembly with this
// change included: https://github.com/apache/spark/pull/1378
// Input tsv with 3 fields: rowIndex(Long), columnIndex(Long), weight(Double); indices start at 0
// Assume the number of rows is larger than the number of columns, and the number of columns is
// smaller than Int.MaxValue
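
A hedged sketch of how the preview above might continue, assuming the tsv layout described in the comments; the input path and k are placeholders, not from the gist:

val conf = new SparkConf().setAppName("SparseSVD")
val sc = new SparkContext(conf)

// (rowIndex, (columnIndex, weight)) pairs parsed from the tsv
val entries = sc.textFile("hdfs:///path/to/matrix.tsv").map { line =>
  val Array(row, col, weight) = line.split("\t")
  (row.toLong, (col.toInt, weight.toDouble))
}

// the column count fits in an Int, per the assumption above
val numCols = entries.map(_._2._1).max + 1

// one sparse vector per row; indices must be in increasing order
val rows = entries.groupByKey().map { case (_, cells) =>
  val (indices, values) = cells.toSeq.sortBy(_._1).unzip
  Vectors.sparse(numCols, indices.toArray, values.toArray)
}

val mat = new RowMatrix(rows)
val svd = mat.computeSVD(20, computeU = true) // k = 20 is a placeholder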
/*
* Object in scala for calculating cosine similarity
* Reuben Sutton - 2012
* More information: http://en.wikipedia.org/wiki/Cosine_similarity
*/
object CosineSimilarity {
  /*
   * This method takes two equal-length arrays of integers and returns their
   * cosine similarity: dot(x, y) / (||x|| * ||y||). (The preview is truncated;
   * the body below is an illustrative completion.)
   */
  def cosineSimilarity(x: Array[Int], y: Array[Int]): Double = {
    val dot  = (x zip y).map { case (a, b) => a.toDouble * b }.sum
    val norm = (v: Array[Int]) => math.sqrt(v.map(a => a.toDouble * a).sum)
    dot / (norm(x) * norm(y))
  }
}
soumyasd / 0.setup.sh
Last active August 29, 2015 14:14 — forked from ceteri/0.setup.sh
# concatenate three part files to construct "minitweets"
cat rawtweets/part-0000[1-3] > minitweets
# change log4j properties to WARN to reduce noise during demo
mv conf/log4j.properties.template conf/log4j.properties
vim conf/log4j.properties # Change to WARN
# launch Spark shell REPL
./bin/spark-shell
import akka.actor._
import akka.util.ByteString
import spray.http.HttpEntity.Empty
import spray.http.MediaTypes._
import spray.http._
import spray.routing.{HttpService, RequestContext, SimpleRoutingApp}
object StreamingActor {
// helper methods
package redisbenchmark
import java.util.UUID
import java.util.concurrent.ThreadLocalRandom
import akka.actor.ActorSystem
import akka.stream.{MaterializerSettings, OverflowStrategy, FlowMaterializer}
import akka.stream.scaladsl.{TickSource, IterableSource, Source}
import akka.util.ByteString
import redis.RedisClient
soumyasd / notes.md
Last active August 29, 2015 14:11 — forked from gangstead/notes.md

Typesafe webinar notes: Spray & Akka HTTP

Presenter - Mathias Doenitz

Spray.io

  • embeddable HTTP stack built on Akka actors (see the sketch after these notes)
  • Just an HTTP integration layer, not for building full web apps
  • Server & client side
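
A minimal sketch of the "embeddable" point, using spray's SimpleRoutingApp; the object name, route, and port are illustrative:

import akka.actor.ActorSystem
import spray.routing.SimpleRoutingApp

object PingServer extends App with SimpleRoutingApp {
  implicit val system = ActorSystem("ping-server")

  // the whole HTTP stack runs inside a plain main(); no servlet container needed
  startServer(interface = "localhost", port = 8080) {
    path("ping") {
      get {
        complete("pong")
      }
    }
  }
}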
// data files can be downloaded at https://s3.amazonaws.com/hw-sandbox/tutorial1/infochimps_dataset_4778_download_16677-csv.zip
import java.io.Serializable
import java.util
import org.apache.spark.sql._
val sc = new SparkContext("spark://master:7077", "Spark SQL Intro")
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD
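
One hedged way the snippet might continue with the linked dataset; the file name, case class, and column positions below are assumptions, and registerTempTable assumes Spark 1.1+:

// hypothetical schema for one of the downloaded NYSE CSVs
case class Price(symbol: String, date: String, close: Double)

val prices = sc.textFile("NYSE_daily_prices_A.csv")
  .map(_.split(","))
  .map(r => Price(r(1), r(2), r(6).toDouble)) // column positions are assumptions;
                                              // a header row, if present, needs filtering

// createSchemaRDD converts the RDD of case classes implicitly
prices.registerTempTable("prices")
sqlContext.sql("SELECT symbol, MAX(close) FROM prices GROUP BY symbol")
  .collect()
  .foreach(println)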

Tuning Storm+Trident

Tuning a dataflow system is easy:

The First Rule of Dataflow Tuning:
* Ensure each stage is always ready to accept records, and
* Deliver each processed record promptly to its destination
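
In Storm, the usual lever for the first half of the rule is backpressure via max spout pending. An illustrative config (storm-core 0.9.x Java API called from Scala; the numbers are placeholders):

import backtype.storm.Config

val conf = new Config()
conf.setNumWorkers(4)        // spread stages across enough workers to keep accepting records
conf.setMaxSpoutPending(512) // cap un-acked tuples so the spout can't flood slower stages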
package topic
import spark.broadcast._
import spark.SparkContext
import spark.SparkContext._
import spark.RDD
import spark.storage.StorageLevel
import scala.util.Random
import scala.math.{ sqrt, log, pow, abs, exp, min, max }
import scala.collection.mutable.HashMap