Toshiaki Toyama manboubird
@johnynek
johnynek / TypedDataCube.md
Last active August 29, 2015 14:04
How to do data cubing in typed scalding?

Suppose you have a key like (page, geo, day) and you want to build rollups (a data cube) so that you can query across all pages, all geos, or all days.

Here is how you do it:

def opts[T](t: T): Seq[Option[T]] = Seq(Some(t), None)

val p: TypedPipe[(String, String, Int)] = ...

p.sumByLocalKeys
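
The preview above is cut off, so the following is only a minimal sketch of the cubing idea, not the gist's actual code. It assumes plain Scala collections in place of `TypedPipe`, a count of 1 per event, and hypothetical names (`DataCubeSketch`, `CubeKey`, `cube`). Each key component is either kept (`Some`) or rolled up to "all" (`None`), so every event contributes to 2^3 = 8 cube cells.

```scala
object DataCubeSketch {
  // Each key component is either kept (Some) or rolled up to "all" (None).
  def opts[T](t: T): Seq[Option[T]] = Seq(Some(t), None)

  // Hypothetical alias for a cube cell: None means "all values" for that dimension.
  type CubeKey = (Option[String], Option[String], Option[Int])

  // Expand every (page, geo, day) event into its 8 rollup keys, then sum per key.
  // A Seq stands in for the TypedPipe here (assumption for a runnable example).
  def cube(events: Seq[(String, String, Int)]): Map[CubeKey, Long] = {
    val expanded: Seq[(CubeKey, Long)] = for {
      (page, geo, day) <- events
      p <- opts(page)
      g <- opts(geo)
      d <- opts(day)
    } yield ((p, g, d), 1L)
    expanded.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
  }
}
```

Looking up `(None, Some("US"), None)` in the result then gives the total for the US across all pages and days. In a distributed job the same expansion multiplies the record count by 8 before aggregation, which is presumably why a map-side pre-aggregation step such as `sumByLocalKeys` appears in the original.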
@philandstuff
philandstuff / euroclojure2014.org
Last active February 19, 2024 05:12
Euroclojure 2014

EuroClojure 2014, Krakow

Fergal Byrne, Clortex: Machine Intelligence based on Jeff Hawkins’ HTM Theory

  • @fergbyrne
  • HTM = Hierarchical Temporal Memory
  • Slides

big data

  • big data is like teenage sex
    • no one knows how to do it
    • everyone thinks everyone else is doing it
@mattb
mattb / gist:da8d779573a10300e512
Last active August 29, 2015 14:02
Calculating the median distance and time of NYC taxi rides in 2013
// transcribed from an Apache Spark 1.0 spark-shell session
// using data from http://chriswhong.com/open-data/foil_nyc_taxi/
// and the QTree algorithm for approximate quantiles over large datasets
// each of the distanceRange and minutesRange calculations below takes about 15 minutes on my four-core SSD-based Macbook Pro
import com.twitter.algebird._
import com.twitter.algebird.Operators._
implicit val qtSemigroupD = new QTreeSemigroup[Double](6)
val in = sc.textFile("trip_data") // a directory containing all the trip_data*.csv files downloaded from the above link
@gwenshap
gwenshap / gist:1e6894100f72e3f109f2
Last active December 20, 2015 06:45
Sessionization in Hive 0.12 (to AVRO table)
ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;
ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/avro-mapred.jar;
DROP TABLE raw_log;
CREATE EXTERNAL TABLE raw_log(
IP STRING,
timestamp STRING,
URL STRING,
referrer STRING,
@eladc
eladc / box-linux.sh
Last active July 13, 2023 11:43
Mount your Box.com Account Using davfs2
#!/bin/bash
## davfs2 installation and Box.com account configuration script for Linux
## Tested on Ubuntu, Fedora and OpenSuse
## Update 1.032615
## This script must be run as root
if [ ! $UID = 0 ]; then
echo "This script needs super user privileges to run"
echo "run it again using sudo or login as root"
exit 1
@johnynek
johnynek / gist:8961994
Last active August 29, 2015 13:56
Some Questions with Sketch Monoids

Unifying Sketch Monoids

As I discussed in Algebra for Analytics, many sketch monoids, such as Bloom filters, HyperLogLog, and Count-min sketch, can be described as a hashing (projection) of items into a sparse space, followed by two different commutative monoids, one used to read and one used to write. The read monoids always have the property that (a + b) <= a, b, and the write monoids have the property that (a + b) >= a, b.
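
To make the read/write split concrete, here is a minimal sketch using plain Scala collections rather than Algebird. The names (`SketchMonoids`, `slot`, `Bloom`, `CMS`) and the use of `MurmurHash3` with the hash index as seed are assumptions made for illustration. In the Bloom filter the write monoid is boolean OR and the read monoid is AND; in the Count-min sketch the write monoid is addition on non-negative counters and the read monoid is `min`.

```scala
import scala.util.hashing.MurmurHash3

object SketchMonoids {
  // Project an item into [0, width); the seed selects one of the k hash functions.
  def slot(item: String, seed: Int, width: Int): Int =
    java.lang.Math.floorMod(MurmurHash3.stringHash(item, seed), width)

  // Bloom filter: k hashes onto the same space.
  final case class Bloom(k: Int, bits: Vector[Boolean]) {
    // Write monoid: OR, so bits only turn on ((a + b) >= a, b).
    def add(item: String): Bloom =
      copy(bits = (0 until k).foldLeft(bits)((b, s) => b.updated(slot(item, s, b.size), true)))
    def merge(that: Bloom): Bloom =
      copy(bits = bits.zip(that.bits).map { case (a, b) => a || b })
    // Read monoid: AND over the k probed bits ((a + b) <= a, b).
    def mightContain(item: String): Boolean =
      (0 until k).forall(s => bits(slot(item, s, bits.size)))
  }
  object Bloom {
    def empty(k: Int, width: Int): Bloom = Bloom(k, Vector.fill(width)(false))
  }

  // Count-min sketch: one hash per row, each row a separate space.
  final case class CMS(rows: Vector[Vector[Long]]) {
    private def width: Int = rows.head.size
    // Write monoid: addition on non-negative counters.
    def add(item: String): CMS =
      CMS(rows.zipWithIndex.map { case (row, seed) =>
        val i = slot(item, seed, width)
        row.updated(i, row(i) + 1L)
      })
    def merge(that: CMS): CMS =
      CMS(rows.zip(that.rows).map { case (a, b) => a.zip(b).map { case (x, y) => x + y } })
    // Read monoid: min across rows, which never underestimates the true count.
    def estimate(item: String): Long =
      rows.zipWithIndex.map { case (row, seed) => row(slot(item, seed, width)) }.min
  }
  object CMS {
    def empty(depth: Int, width: Int): CMS = CMS(Vector.fill(depth)(Vector.fill(width)(0L)))
  }
}
```

Seen this way, the two structures differ mainly in whether the k hashes share one space (Bloom) or get a row each (CMS), which is the contrast the first question below asks about.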

## Some questions:

  1. Note how similar CMS and Bloom filters are. The difference: Bloom hashes k times onto the same space, while CMS hashes k times onto k orthogonal subspaces. Why the difference? Imagine a fixed-space Bloom that hashes onto k orthogonal spaces, or an overlapping CMS that hashes onto a space of length k * m. How do the error asymptotics change?
  2. CMS has many query modes (dot product, etc.). Can those generalize to other sketches (HLL, Bloom)?
  3. What other sketch or non-sketch algorithms can be expressed in this dual mo
@pkuczynski
pkuczynski / LICENSE
Last active March 14, 2025 14:12
Read YAML file from Bash script
MIT License
Copyright (c) 2014 Piotr Kuczynski
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWAR
@johnynek
johnynek / gist:8290375
Created January 6, 2014 21:47
example of LAG type function in the scalding Fields API (similar for typed)
groupBy('source) {
_.sortBy('links)
.reverse
.mapStream[(String,Int), (String, Int, Int, Int)]
(('destination, 'links) -> ('destination, 'links, 'rank, 'gap)) { destLinks =>
destLinks.scanLeft(None: Option[(String, Int, Int, Int)]) {
(prevRowOut: Option[(String,Int,Int,Int)], thisRow: (String, Int)) =>
val (dest, links) = thisRow
prevRowOut match {
case None => Some((dest, links, 1, 0)) // rank 1, gap 0 -- not exactly what you wanted...
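
The preview stops after the first `case`, so the remaining branch is not visible. As a rough sketch of the same `scanLeft` pattern on plain Scala collections, the following assumes that `rank` increments by one per row and that `gap` is the previous row's `links` minus the current row's `links`; both are guesses at the intent, and `LagSketch` and `rankWithGap` are hypothetical names.

```scala
object LagSketch {
  // Input rows: (destination, links). Output rows: (destination, links, rank, gap).
  def rankWithGap(rows: Seq[(String, Int)]): Seq[(String, Int, Int, Int)] = {
    // Sort descending by links, mirroring sortBy('links).reverse in the Fields API.
    val sorted = rows.sortBy { case (_, links) => -links }
    sorted
      .scanLeft(Option.empty[(String, Int, Int, Int)]) {
        // First row: rank 1, gap 0, as in the visible case of the original.
        case (None, (dest, links)) => Some((dest, links, 1, 0))
        // Later rows (assumed): carry the previous output forward, LAG-style.
        case (Some((_, prevLinks, prevRank, _)), (dest, links)) =>
          Some((dest, links, prevRank + 1, prevLinks - links))
      }
      .flatten // drop the initial None seed emitted by scanLeft
  }
}
```

The key point is that `scanLeft` threads the previous output row through as state, which is what gives each row access to its predecessor.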
@debasishg
debasishg / gist:8172796
Last active April 16, 2025 13:43
A collection of links for streaming algorithms and data structures

General Background and Overview

  1. Probabilistic Data Structures for Web Analytics and Data Mining : A great overview of the space of probabilistic data structures and how they are used in approximation algorithm implementation.
  2. Models and Issues in Data Stream Systems
  3. Philippe Flajolet’s contribution to streaming algorithms : A presentation by Jérémie Lumbroso that visits some of the historical perspectives and how it all began with Flajolet.
  4. Approximate Frequency Counts over Data Streams by Gurmeet Singh Manku & Rajeev Motwani : One of the early papers on the subject.
  5. [Methods for Finding Frequent Items in Data Streams](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.9800&rep=rep1&t
@azymnis
azymnis / ItemSimilarity.scala
Created December 13, 2013 05:17
Approximate item similarity using LSH in Scalding.
import com.twitter.scalding._
import com.twitter.algebird.{ MinHasher, MinHasher32, MinHashSignature }
/**
* Computes similar items (with a string itemId), based on approximate
* Jaccard similarity, using LSH.
*
* Assumes an input data TSV file of the following format:
*
* itemId userId