@debasishg
debasishg / gist:8172796
Last active March 23, 2026 03:02
A collection of links for streaming algorithms and data structures

General Background and Overview

  1. Probabilistic Data Structures for Web Analytics and Data Mining : A great overview of probabilistic data structures and how they are used to implement approximation algorithms.
  2. Models and Issues in Data Stream Systems
  3. Philippe Flajolet’s contribution to streaming algorithms : A presentation by Jérémie Lumbroso that revisits some of the historical perspectives and how it all began with Flajolet.
  4. Approximate Frequency Counts over Data Streams by Gurmeet Singh Manku & Rajeev Motwani : One of the early papers on the subject.
  5. [Methods for Finding Frequent Items in Data Streams](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.9800&rep=rep1&t
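
The frequent-items entries above (items 4 and 5) are about finding heavy hitters in a stream with bounded memory. As a rough illustration of the counter-based family, here is a minimal Python sketch of the Misra-Gries summary. It is not code from any of the linked papers; the function name `misra_gries` and the parameter `k` are chosen here for illustration only.

```python
def misra_gries(stream, k):
    """Summarise a stream with at most k - 1 counters (Misra-Gries).

    Any item occurring more than n / k times in a stream of length n is
    guaranteed to survive in the returned dict. The stored counts are
    underestimates, so a second pass is needed for exact frequencies.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    stream = ["a"] * 6 + ["b"] * 3 + ["c"]
    print(misra_gries(stream, k=3))
```

The summary trades exact counts for a fixed memory budget: each "decrement all" step discards one occurrence of up to `k` distinct items, which is why every count is off by at most `n / k`.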
@azymnis
azymnis / ItemSimilarity.scala
Created December 13, 2013 05:17
Approximate item similarity using LSH in Scalding.
import com.twitter.scalding._
import com.twitter.algebird.{ MinHasher, MinHasher32, MinHashSignature }
/**
* Computes similar items (with a string itemId), based on approximate
* Jaccard similarity, using LSH.
*
* Assumes an input data TSV file of the following format:
*
* itemId userId
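
The preview above is cut off before the job body, so the following is not a reconstruction of it. It is a small pure-Python sketch of the general technique the description names: MinHash signatures to approximate Jaccard similarity, plus banding to form LSH bucket keys. None of these names (`minhash_signature`, `estimate_jaccard`, `lsh_band_keys`) come from Algebird or Scalding; they are assumptions made for illustration.

```python
import hashlib


def _hash(seed, value):
    # Deterministic 64-bit hash of (seed, value); stands in for a hash family.
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Signature = minimum hash of the (non-empty) set under each seeded hash."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def lsh_band_keys(signature, bands):
    """Split a signature into bands; items sharing any band key become candidates."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]


if __name__ == "__main__":
    users_a = set(range(100))      # hypothetical user ids for item A
    users_b = set(range(50, 150))  # hypothetical user ids for item B
    sig_a, sig_b = minhash_signature(users_a), minhash_signature(users_b)
    print(estimate_jaccard(sig_a, sig_b))  # true Jaccard here is 50/150
    print(set(lsh_band_keys(sig_a, 16)) & set(lsh_band_keys(sig_b, 16)))
```

With more rows per band, fewer dissimilar pairs collide; with more bands, fewer similar pairs are missed. The 128 hashes and 16 bands used here are arbitrary.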
@fawda123
fawda123 / nnet_plot_update.r
Last active May 11, 2022 00:20
nnet_plot_update
plot.nnet<-function(mod.in,nid=T,all.out=T,all.in=T,bias=T,wts.only=F,rel.rsc=5,
circle.cex=5,node.labs=T,var.labs=T,x.lab=NULL,y.lab=NULL,
line.stag=NULL,struct=NULL,cex.val=1,alpha.val=1,
circle.col='lightblue',pos.col='black',neg.col='grey',
bord.col='lightblue', max.sp = F,...){
require(scales)
#sanity checks
if('mlp' %in% class(mod.in)) warning('Bias layer not applicable for rsnns object')
@kaja47
kaja47 / combinations.scala
Last active December 27, 2015 15:49
Fast array combinations
// generate all combinations of integers in range from 0 to `len`-1
// fast as fuck
def combIdxs(len: Int, k: Int): Iterator[Array[Int]] = {
  val arr = Array.range(0, k)
  arr(k-1) -= 1
  val end = k-1
  Iterator.continually {
    arr(end) += 1
    if (arr(end) >= len) {
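
The snippet above is truncated, so the rest of its carry logic is not shown. As a hedged illustration of the same idea (advance an index array in place to the next combination), here is a Python sketch of the standard lexicographic successor step. The name `comb_idxs` mirrors the gist, but the body is an assumption and not the author's code.

```python
def comb_idxs(n, k):
    """Yield every k-combination of range(n) in lexicographic order."""
    if k < 0 or k > n:
        return
    idx = list(range(k))
    while True:
        yield tuple(idx)
        # Find the rightmost position that has not reached its maximum value.
        i = k - 1
        while i >= 0 and idx[i] == n - k + i:
            i -= 1
        if i < 0:
            return  # every position is at its maximum: last combination emitted
        idx[i] += 1
        # Reset everything to the right to the smallest increasing run.
        for j in range(i + 1, k):
            idx[j] = idx[j - 1] + 1


if __name__ == "__main__":
    for combo in comb_idxs(4, 2):
        print(combo)
```

This version yields a fresh tuple each time. Reusing one mutable array, as the Scala snippet appears to do, avoids allocation but means callers must copy any combination they want to keep.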
@piotrbelina
piotrbelina / BoomerangLogJob.scala
Created August 3, 2013 10:12
Scalding apache log parser for boomerang.js
import cascading.tuple.{Fields, TupleEntry}
import com.twitter.scalding._
import java.net.URLDecoder
import scala.util.matching.Regex
class BoomerangLogJob(args: Args) extends Job(args) {
  val input = TextLine(args("input"))
  val output = TextLine(args("output"))
  val trap = Tsv(args("trap"))
@bmarcot
bmarcot / knapsack_problem.scala
Last active June 28, 2017 13:41
The Knapsack Problem, in Scala -- Keywords: dynamic programming, recursion, scala.
def knapsack_aux(x: (Int, Int), is: List[Int]): List[Int] = {
  for {
    w <- is.zip(is.take(x._1) ::: is.take(is.size - x._1).map(_ + x._2))
  } yield math.max(w._1, w._2)
}
def knapsack_rec(xs: List[(Int, Int)], is: List[Int]): List[List[Int]] = {
  xs match {
    case x :: xs => knapsack_aux(x, is) :: knapsack_rec(xs, knapsack_aux(x, is))
    case _ => Nil
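
The Scala preview above appears to build one row of the dynamic-programming table per item, with each row holding the best value for every capacity. For comparison, here is a minimal Python sketch of the same 0/1 knapsack recurrence using a single row updated in place. It assumes each item is a `(weight, value)` pair with a non-negative integer weight, which is how the snippet seems to use its tuples; the function name `knapsack` is chosen here.

```python
def knapsack(items, capacity):
    """Best total value of a 0/1 knapsack.

    items: iterable of (weight, value) pairs with non-negative integer weights.
    capacity: non-negative integer weight limit.
    """
    best = [0] * (capacity + 1)  # best[c] = best value using capacity c
    for weight, value in items:
        # Walk capacities downwards so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]


if __name__ == "__main__":
    items = [(1, 1), (3, 4), (4, 5), (5, 7)]
    print(knapsack(items, 7))
```

Iterating capacities from high to low is what keeps this the 0/1 variant; iterating upwards would let an item be reused, which solves the unbounded knapsack instead.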
@steipete
steipete / PSPDFUIKitMainThreadGuard.m
Last active October 30, 2025 15:53
This is a guard that tracks down UIKit access on threads other than main. This snippet is taken from the commercial iOS PDF framework http://pspdfkit.com, but relicensed under MIT. Works because a lot of calls internally call setNeedsDisplay or setNeedsLayout. Won't catch everything, but it's very lightweight and usually does the job. You might n…
// Taken from the commercial iOS PDF framework http://pspdfkit.com.
// Copyright (c) 2014 Peter Steinberger, PSPDFKit GmbH. All rights reserved.
// Licensed under MIT (http://opensource.org/licenses/MIT)
//
// You should only use this in debug builds. It doesn't use private API, but I wouldn't ship it.
// PLEASE DUPE rdar://27192338 (https://openradar.appspot.com/27192338) if you would like to see this in UIKit.
#import <objc/runtime.h>
#import <objc/message.h>
@Yangqing
Yangqing / mr_compute_gist.py
Created May 17, 2013 00:05
The mapreduce code to extract gist features from ImageNet images. To be used together with mincepie.
from mincepie import mapreducer, launcher
import gflags
import glob
import leargist
import numpy as np
import os
from PIL import Image
import uuid
# constant value
You can use cURL to upload packet captures to Packetloop. We created a simple script that shows how to log in, list and create capture points, upload captures, and check processing status.
## variables
PL_ENDPOINT=https://www.packetloop.com
PL_USERNAME=... # your packetloop email address
PL_PASSWORD=... # your packetloop password
## logging in
PL_TOKEN=$(curl -3 -s -b cookies.jar -c cookies.jar -X GET "$PL_ENDPOINT/init")
curl -3 -s -H "X-CSRF-Token: $PL_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" -b cookies.jar -c cookies.jar -X POST "$PL_ENDPOINT/users/sign_in.json?pretty=true" -d "{ \"user\": { \"email\": \"$PL_USERNAME\", \"password\": \"$PL_PASSWORD\" } }"
@codeinthehole
codeinthehole / run.py
Created November 21, 2012 13:46
Sample Celery chain usage for processing pipeline
from celery import chain
from django.core.management.base import BaseCommand
from . import tasks
class Command(BaseCommand):
    def handle(self, *args, **kwargs):