import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg._
import org.apache.spark.{SparkConf, SparkContext}
// To use the latest sparse SVD implementation, build your spark-assembly with this
// change included: https://github.com/apache/spark/pull/1378
// Input tsv with 3 fields: rowIndex(Long), columnIndex(Long), weight(Double); indices start at 0
// Assume the number of rows is larger than the number of columns, and the number of columns is
// smaller than Int.MaxValue
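
A hedged sketch of how the preview above might continue, assuming the tsv layout described in the comments; the input path and k are placeholders, not from the gist:

val conf = new SparkConf().setAppName("SparseSVD")
val sc = new SparkContext(conf)

// (rowIndex, (columnIndex, weight)) pairs parsed from the tsv
val entries = sc.textFile("hdfs:///path/to/matrix.tsv").map { line =>
  val Array(row, col, weight) = line.split("\t")
  (row.toLong, (col.toInt, weight.toDouble))
}

// the column count fits in an Int, per the assumption above
val numCols = entries.map(_._2._1).max + 1

// one sparse vector per row; indices must be in increasing order
val rows = entries.groupByKey().map { case (_, cells) =>
  val (indices, values) = cells.toSeq.sortBy(_._1).unzip
  Vectors.sparse(numCols, indices.toArray, values.toArray)
}

val mat = new RowMatrix(rows)
val svd = mat.computeSVD(20, computeU = true) // k = 20 is a placeholder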
/*
* Object in scala for calculating cosine similarity
* Reuben Sutton - 2012
* More information: http://en.wikipedia.org/wiki/Cosine_similarity
*/
object CosineSimilarity {
  /*
   * This method takes two equal-length arrays of integers and returns their
   * cosine similarity: dot(x, y) / (||x|| * ||y||). (The preview is truncated;
   * the body below is an illustrative completion.)
   */
  def cosineSimilarity(x: Array[Int], y: Array[Int]): Double = {
    val dot  = (x zip y).map { case (a, b) => a.toDouble * b }.sum
    val norm = (v: Array[Int]) => math.sqrt(v.map(a => a.toDouble * a).sum)
    dot / (norm(x) * norm(y))
  }
}
soumyasd / 0.setup.sh
Last active August 29, 2015 14:14 — forked from ceteri/0.setup.sh
# concatenate three part files to construct "minitweets"
cat rawtweets/part-0000[1-3] > minitweets
# change log4j properties to WARN to reduce noise during demo
mv conf/log4j.properties.template conf/log4j.properties
vim conf/log4j.properties # Change to WARN
# launch Spark shell REPL
./bin/spark-shell
import akka.actor._
import akka.util.ByteString
import spray.http.HttpEntity.Empty
import spray.http.MediaTypes._
import spray.http._
import spray.routing.{HttpService, RequestContext, SimpleRoutingApp}
object StreamingActor {
// helper methods
package redisbenchmark
import java.util.UUID
import java.util.concurrent.ThreadLocalRandom
import akka.actor.ActorSystem
import akka.stream.{MaterializerSettings, OverflowStrategy, FlowMaterializer}
import akka.stream.scaladsl.{TickSource, IterableSource, Source}
import akka.util.ByteString
import redis.RedisClient
soumyasd / notes.md
Last active August 29, 2015 14:11 — forked from gangstead/notes.md

Typesafe webinar notes: Spray & Akka HTTP

Presenter - Mathias Doenitz

Spray.io

  • embeddable HTTP stack built on Akka actors (see the sketch after these notes)
  • Just an HTTP integration layer, not for building full web apps
  • Server & client side
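
A minimal sketch of the "embeddable" point, using spray's SimpleRoutingApp; the object name, route, and port are illustrative:

import akka.actor.ActorSystem
import spray.routing.SimpleRoutingApp

object PingServer extends App with SimpleRoutingApp {
  implicit val system = ActorSystem("ping-server")

  // the whole HTTP stack runs inside a plain main(); no servlet container needed
  startServer(interface = "localhost", port = 8080) {
    path("ping") {
      get {
        complete("pong")
      }
    }
  }
}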
// data files can be downloaded at https://s3.amazonaws.com/hw-sandbox/tutorial1/infochimps_dataset_4778_download_16677-csv.zip
import java.io.Serializable
import java.util
import org.apache.spark.sql._
val sc = new SparkContext("spark://master:7077", "Spark SQL Intro")
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD
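
One hedged way the snippet might continue with the linked dataset; the file name, case class, and column positions below are assumptions, and registerTempTable assumes Spark 1.1+:

// hypothetical schema for one of the downloaded NYSE CSVs
case class Price(symbol: String, date: String, close: Double)

val prices = sc.textFile("NYSE_daily_prices_A.csv")
  .map(_.split(","))
  .map(r => Price(r(1), r(2), r(6).toDouble)) // column positions are assumptions;
                                              // a header row, if present, needs filtering

// createSchemaRDD converts the RDD of case classes implicitly
prices.registerTempTable("prices")
sqlContext.sql("SELECT symbol, MAX(close) FROM prices GROUP BY symbol")
  .collect()
  .foreach(println)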

Tuning Storm+Trident

Tuning a dataflow system is easy:

The First Rule of Dataflow Tuning:
* Ensure each stage is always ready to accept records, and
* Deliver each processed record promptly to its destination
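
In Storm, the usual lever for the first half of the rule is backpressure via max spout pending. An illustrative config (storm-core 0.9.x Java API called from Scala; the numbers are placeholders):

import backtype.storm.Config

val conf = new Config()
conf.setNumWorkers(4)        // spread stages across enough workers to keep accepting records
conf.setMaxSpoutPending(512) // cap un-acked tuples so the spout can't flood slower stages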
package topic
import spark.broadcast._
import spark.SparkContext
import spark.SparkContext._
import spark.RDD
import spark.storage.StorageLevel
import scala.util.Random
import scala.math.{ sqrt, log, pow, abs, exp, min, max }
import scala.collection.mutable.HashMap