Skip to content

Instantly share code, notes, and snippets.

View johnynek's full-sized avatar

P. Oscar Boykin johnynek

View GitHub Profile
@johnynek
johnynek / gist:8961994
Last active August 29, 2015 13:56
Some Questions with Sketch Monoids

Unifying Sketch Monoids

As I discussed in Algebra for Analytics, many sketch monoids, such as Bloom filters, HyperLogLog, and Count-min sketch, can be described as a hashing (projection) of items into a sparse space, then using two different commutative monoids to read and write respectively. Finally, the read monoids always have the property that (a + b) <= a, b and the write monoids has the property that (a + b) >= a, b.

##Some questions:

  1. Note how similar CMS and Bloom filters are. The difference: bloom hashes k times onto the same space, CMS hashes k times onto a k orthogonal subspaces. Why the difference? Imagine a fixed space bloom that hashes onto k orthogonal spaces, or an overlapping CMS that hashes onto k * m length space. How do the error asymptotics change?
  2. CMS has many query modes (dot product, etc...) can those generalize to other sketchs (HLL, Bloom)?
  3. What other sketch or non-sketch algorithms can be expressed in this dual mo
You can swap with only two registers if they point to elements of a group:
a = a + b
b = a - b
a = a - b
proof:
a1 = a + b
b1 = a1 - b = a
a2 = -b1 + a1 = (-a) + a + b
@johnynek
johnynek / gist:8290375
Created January 6, 2014 21:47
example of LAG type function in the scalding Fields API (similar for typed)
groupBy('source) {
_.sortBy('links)
.reverse
.mapStream[(String,Int), (String, Int, Int, Int)]
(('destination, 'links) -> ('destination, 'links, 'rank, 'gap)) { destLinks =>
destLinks.scanLeft(None: Option[(String, Int, Int, Int)]) {
(prevRowOut: Option[(String,Int,Int,Int)], thisRow: (String, Int)) =>
val (dest, links) = thisRow
prevRowOut match {
case None => Some((dest, links, 1, 0)) // rank 1, gap 0 -- not exactly what you wanted...
@johnynek
johnynek / catseq.scala
Created December 16, 2013 05:05
CatSeq: O(1) concatenation of sequences. What is this called?
import scala.collection.immutable.Seq
object CatSeq {
val empty: CatSeq[Nothing] = CatSeq[Nothing](0, Seq.empty, Seq.empty)
}
/** Data structure to do O(1) concatenation of lists
*/
case class CatSeq[+T](leftLen: Int, left: Seq[T], right: Seq[T]) extends Seq[T] {
override def isEmpty = left.isEmpty && right.isEmpty
override def head = if(left.isEmpty) right.head else left.head
override def tail = if(left.isEmpty) right.tail else CatSeq(leftLen - 1, left.tail, right)
@johnynek
johnynek / join.scala
Last active December 31, 2015 04:29
How to do a cross product with more data than can fit in RAM (or with only an iterator on the items), this is one way to do joins on Hadoop.
// You can run this file with: scala join.scala
import scala.collection.BufferedIterator
def sortIterator[T:Ordering](it: Iterator[T]): Iterator[T] =
// hadoop handles this, so we just do this for illustration
// here, on disk you want something like: http://en.wikipedia.org/wiki/External_sorting
it.toList.sorted.iterator
// An ordering that sorts Right first
@johnynek
johnynek / WeakRefMonad.scala
Created December 10, 2013 23:53
Weak Reference is a monad (h/t https://twitter.com/J_ )
import java.lang.ref.WeakReference
/** An option-like type that doesn't keep
* the pointed item in scope
*/
trait WeakRef[+T] {
// If the reference is still live, return it, else None
def toOption: Option[T]
// If either reference is dead, return Empty, else give the live ref
@johnynek
johnynek / gist:7668195
Created November 26, 2013 23:30
Idea for a metrics API for scalding and summingbird.
// basically the state monad on a Map[String, Long] where String is the name, long is the delta,
// and P is some platform specific type
trait Metric[P, T] {
def map[U](fn: T => U): Metric[P, U]
// monoid merge the metrics of the result with this:
def flatMap[U](fn: T => Metric[P, U]): Metric[P, U]
def increment(kv: (String, Long)): Metric[P, T] = for {
t <- this
_ <- Metric.increment(kv)
@johnynek
johnynek / mergeablemap.rs
Created November 25, 2013 06:55
MergeableMap in rust. The idea is to give a store that can "merge" arbitrary associative operations with no GC. That and fool around with rust.
// compiles with rustc 0.8
use std::hashmap::HashMap;
use std::container::MutableMap;
use std::util::swap;
use std::to_str::ToStr;
trait Semigroup {
fn plus(&self, that: ~Self) -> ~Self;
}
@johnynek
johnynek / pickling.md
Created October 7, 2013 16:26
Some of my thoughts on pickling.

Why Pickling does not solve most of my serialization problems

From what I can tell, scala pickling is two things:

  1. a good type that composes well: Pickler
  2. Macros for generating that type.

What are my pain points?

@johnynek
johnynek / gist:6846362
Created October 5, 2013 21:39
Sketch of a generic formula tree optimizer. Idea is to use for Map/Reduce in summingbird where there are only binary (merge) and unary (map, reduce, etc...) operators.
trait Family[F <: Family[F,V], V] { // might not need F-bound here, try to remove
type Unary <: (V => V)
type Binary <: ((V, V) => V)
}
sealed trait Formula[F <: Family[F,T], T] {
def eval: T
}
/** eval could be made lazy, but pattern matching on it could be useful */
case class Res[F <: Family[F,T], T](eval: T) extends Formula[F, T]