import sys
from pyspark.context import SparkContext
from numpy import array, random as np_random
from sklearn import linear_model as lm
import copy  # stdlib copy; importing it via sklearn.base is fragile

N = 10000  # Number of data points
D = 10     # Number of dimensions
ITERATIONS = 5
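
The imports above point at the gist's approach: data-parallel SGD, where a scikit-learn SGD model is fitted on each partition and the resulting weight vectors are averaged across partitions on each iteration. Below is a minimal sketch of the same pattern, with a hand-rolled gradient step standing in for SGDRegressor; every name and constant in it is illustrative, not the gist's code.

import spark.SparkContext
import spark.SparkContext._
import scala.util.Random

object ParallelSGDSketch {
  val N = 10000       // number of data points
  val D = 10          // number of dimensions
  val ITERATIONS = 5

  // One SGD pass over a partition's points, starting from weights w.
  def localSgd(points: Iterator[(Array[Double], Double)],
               w: Array[Double], lr: Double): Array[Double] = {
    val weights = w.clone()
    points.foreach { case (x, y) =>
      val err = (weights, x).zipped.map(_ * _).sum - y
      for (j <- weights.indices) weights(j) -= lr * err * x(j)
    }
    weights
  }

  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "ParallelSGDSketch")
    val rand = new Random(42)
    val trueW = Array.fill(D)(rand.nextGaussian())
    // Synthetic linear data, generated on the driver and partitioned.
    val data = sc.parallelize(Seq.fill(N) {
      val x = Array.fill(D)(rand.nextGaussian())
      (x, (trueW, x).zipped.map(_ * _).sum)
    }, 4).cache()

    var w = Array.fill(D)(0.0)
    for (i <- 1 to ITERATIONS) {
      // Fit one local model per partition, then average the weight vectors.
      val models = data.mapPartitions(it => Iterator(localSgd(it, w, 0.01))).collect()
      w = models.transpose.map(_.sum / models.length)
    }
    println(w.mkString(", "))
  }
}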
import spark.streaming.StreamingContext._
import spark.streaming.{Seconds, StreamingContext}
import spark.SparkContext._
import spark.storage.StorageLevel
import spark.streaming.examples.twitter.TwitterInputDStream
import com.twitter.algebird.HyperLogLog._
import com.twitter.algebird._

/**
 * Example of using the HyperLogLog monoid from Twitter's Algebird together with Spark Streaming's
 * TwitterInputDStream.
 */
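
The core of the combination is small: HyperLogLog sketches form a monoid, so per-partition sketches can be merged with `+`. A condensed sketch of the pattern follows; BIT_SIZE and the `users` stream of user IDs are assumptions, and the API shown (HyperLogLogMonoid, estimatedSize) matches the Algebird releases of that era.

val BIT_SIZE = 12
val hll = new HyperLogLogMonoid(BIT_SIZE)

// Build one small HLL per user id, then merge them with the monoid's `+`.
// Merging is associative and commutative, so it parallelises cleanly.
val approxUsers = users.mapPartitions { ids =>
  ids.map(id => hll(id))
}.reduce(_ + _)

// Each batch now carries a single merged sketch; estimatedSize approximates
// the number of distinct user ids seen in that batch.
approxUsers.foreach(rdd => rdd.foreach { h =>
  println("approx distinct users: " + h.estimatedSize)
})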
import spark.streaming.{Seconds, StreamingContext}
import spark.storage.StorageLevel
import spark.streaming.examples.twitter.TwitterInputDStream
import com.twitter.algebird._
import spark.streaming.StreamingContext._
import spark.SparkContext._

/**
 * Example of using the CountMinSketch monoid from Twitter's Algebird together with Spark Streaming's
 * TwitterInputDStream.
 */
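
Count-Min Sketch composes the same way HyperLogLog does: per-partition sketches merge monoidally, and the merged sketch answers approximate frequency queries such as heavy hitters. A condensed sketch of the pattern, where the error/confidence constants and the `users` stream are assumptions:

val DELTA = 1E-3    // confidence
val EPS = 0.01      // error bound
val SEED = 1
val PERC = 0.001    // heavy-hitter threshold
val cms = new CountMinSketchMonoid(EPS, DELTA, SEED, PERC)

// One sketch per user id, merged per batch with the monoid's `++`.
val approxTopUsers = users.mapPartitions { ids =>
  ids.map(id => cms.create(id))
}.reduce(_ ++ _)

// The merged sketch exposes approximate frequencies and heavy hitters.
approxTopUsers.foreach(rdd => rdd.foreach { sketch =>
  println("heavy hitters: " + sketch.heavyHitters.mkString(", "))
})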
import spark.SparkContext
import SparkContext._

/**
 * A port of [[http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/]]
 * to Spark.
 * Uses movie ratings data from the MovieLens 100k dataset found at [[http://www.grouplens.org/node/73]]
 */
object MovieSimilarities {
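
The heart of that port is a self-join on user: every pair of movies rated by the same user yields a co-rating, and each movie pair's co-ratings fold into a similarity score. A condensed sketch of that shape using cosine similarity; the MovieLens `u.data` fields are user, movie, rating, timestamp, and the code below is illustrative rather than the gist's own:

// (user, (movie, rating)) pairs parsed from the MovieLens data.
val ratings = sc.textFile("u.data").map { line =>
  val Array(user, movie, rating, _) = line.split("\t")
  (user.toInt, (movie.toInt, rating.toDouble))
}

// Self-join on user; keep each unordered movie pair once.
val coRatings = ratings.join(ratings)
  .filter { case (_, ((m1, _), (m2, _))) => m1 < m2 }
  .map { case (_, ((m1, r1), (m2, r2))) => ((m1, m2), (r1, r2)) }

// Cosine similarity between the co-rating vectors of each movie pair.
val similarities = coRatings.groupByKey().mapValues { rs =>
  val dot = rs.map { case (x, y) => x * y }.sum
  val nx = math.sqrt(rs.map { case (x, _) => x * x }.sum)
  val ny = math.sqrt(rs.map { case (_, y) => y * y }.sum)
  dot / (nx * ny)
}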
// An Example is an observation with an optional target value and features in the form of a vector of Doubles
case class Example(target: Option[Double] = None, features: Vector[Double])

// The base model API looks something like:
abstract class BaseModel(val modelSettings: Settings)
  extends Serializable
  with Logging {

  def fit(data: RDD[Example])
}
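
To make the shape concrete, here is a hypothetical subclass; `Settings`, the `Logging` stub, and `MeanModel` itself are all assumptions for illustration, not part of the gist:

// Minimal stand-ins for types the snippet references but does not show.
case class Settings(params: Map[String, String] = Map.empty)
trait Logging { def log(msg: String): Unit = println(msg) }

// A toy concrete model: fit memorises the mean target, predict returns it.
class MeanModel(settings: Settings) extends BaseModel(settings) {
  private var mean = 0.0

  def fit(data: RDD[Example]): Unit = {
    val targets = data.flatMap(_.target).cache()
    mean = targets.sum / targets.count()
    log("fitted mean = " + mean)
  }

  def predict(example: Example): Double = mean
}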
trait Configured {
  val config = ConfigFactory.load().getConfig("spark")
}

object ConfigUtils {
  def asOption[T](t: => T): Option[T] = {
    try {
      Option(t)
    } catch {
      case e: ConfigException.Missing => None
    }
  }
}
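
Usage then reads naturally: wrap any Typesafe Config getter, and a missing key becomes None instead of a thrown ConfigException.Missing. A small example, where the "master" key is illustrative:

import com.typesafe.config.{ConfigException, ConfigFactory}

object AppConfig extends Configured {
  // None if spark.master is absent from application.conf, Some(value) otherwise.
  val master: Option[String] = ConfigUtils.asOption(config.getString("master"))
  val masterOrDefault: String = master.getOrElse("local[2]")
}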
def newHadoopFileAsText[K, V, F <: NewInputFormat[K, V]](
    path: String,
    inputFormatClazz: String,
    keyClazz: String,
    valueClazz: String,
    delimiter: String): JavaRDD[String] = {
  implicit val kcm = ClassManifest.fromClass(Class.forName(keyClazz))
  implicit val vcm = ClassManifest.fromClass(Class.forName(valueClazz))
  implicit val fcm = ClassManifest.fromClass(Class.forName(inputFormatClazz))
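  // The snippet is truncated here; a hedged guess at how the body continues:
  // with the manifests in scope, the generic new-API Hadoop method becomes
  // callable, and each (key, value) record is flattened to a single
  // delimiter-joined string so the Python side only ever deals with text.
  val rdd = sc.newAPIHadoopFile[K, V, F](path)
  JavaRDD.fromRDD(rdd.map { case (k, v) => k.toString + delimiter + v.toString })
}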
// Python RDD creation functions //

// SequenceFile converted to Text and then to String
def sequenceFileAsText(path: String) = {
  implicit val kcm = ClassManifest.fromClass(classOf[Text])
  implicit val fcm = ClassManifest.fromClass(classOf[SequenceFileAsTextInputFormat])
  new JavaPairRDD(sc
    .newAPIHadoopFile[Text, Text, SequenceFileAsTextInputFormat](path)
    .map { case (k, v) => (k.toString, v.toString) })
}
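
SequenceFileAsTextInputFormat does the heavy lifting here: it calls toString on whatever Writable key and value types the file holds, so the pair RDD handed to Python is uniformly (String, String) regardless of the original types. Calling it is then a one-liner; the path below is illustrative:

// Every record arrives as a (String, String) pair, whatever the original
// Writable key/value classes were.
val pairs: JavaPairRDD[String, String] = sequenceFileAsText("hdfs:///data/ratings.seq")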
13/12/09 23:01:02 INFO spark.SparkContext: Job finished: runJob at PythonRDD.scala:288, took 0.175237 s
Count raw: 683; Count svml: 683
Raw sample: [u'2.000000 1:1000025.000000 2:5.000000 3:1.000000 4:1.000000 5:1.000000 6:2.000000 7:1.000000 8:3.000000 9:1.000000 10:1.000000', u'2.000000 1:1002945.000000 2:5.000000 3:4.000000 4:4.000000 5:5.000000 6:7.000000 7:10.000000 8:3.000000 9:2.000000 10:1.000000', u'2.000000 1:1015425.000000 2:3.000000 3:1.000000 4:1.000000 5:1.000000 6:2.000000 7:2.000000 8:3.000000 9:1.000000 10:1.000000', u'2.000000 1:1016277.000000 2:6.000000 3:8.000000 4:8.000000 5:1.000000 6:3.000000 7:4.000000 8:3.000000 9:7.000000 10:1.000000', u'2.000000 1:1017023.000000 2:4.000000 3:1.000000 4:1.000000 5:3.000000 6:2.000000 7:1.000000 8:3.000000 9:1.000000 10:1.000000', u'4.000000 1:1017122.000000 2:8.000000 3:10.000000 4:10.000000 5:8.000000 6:7.000000 7:10.000000 8:9.000000 9:7.000000 10:1.000000', u'2.000000 1:1018099.000000 2:1.000000 3:1.000000 4:1.000000 5:1.000000 6:2.000000 7 |
Elasticsearch: "1.1.1"
Spark: "0.9.1-hadoop1"
Shark: "0.9.1-hadoop1"
elasticsearch-hadoop-hive: "elasticsearch-hadoop-hive-2.0.0.RC1.jar"
elasticsearch-hadoop: 2.0.0.RC1

- Spark using ESInputFormat works fine. However, the type returned for the "date" field ("_ts") is Text. I convert that with toString, then toLong, to get the timestamp, which I can then use as I wish within Spark.
- Shark returns NULL for the timestamp field. There's nothing funny about the timestamps themselves: I can access them in Spark (as Text), and I can do date math on that field in Elasticsearch queries. Shark also returns NULLs on EC2, so it's not just my machine.
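
For reference, the Spark-side conversion described above looks roughly like this; EsInputFormat is the elasticsearch-hadoop 2.0 class name, and the index/type is illustrative:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{MapWritable, Text}
import org.elasticsearch.hadoop.mr.EsInputFormat

val conf = new Configuration()
conf.set("es.resource", "myindex/mytype")  // illustrative index/type

// (id, document) pairs straight from Elasticsearch.
val es = sc.newAPIHadoopRDD(conf, classOf[EsInputFormat[Text, MapWritable]],
  classOf[Text], classOf[MapWritable])

// "_ts" arrives as Text; toString then toLong recovers the epoch timestamp.
val timestamps = es.map { case (_, doc) =>
  doc.get(new Text("_ts")).toString.toLong
}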