# for auc()
> library(pROC)
# for performance plots
> library(ROCR)
Loading required package: gplots
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
Attaching package: ‘gplots’
package com.mapr.bench;

import com.google.caliper.Benchmark;
import com.google.caliper.runner.Running;
import org.apache.commons.math3.util.FastMath;
import org.junit.Assert;
import org.junit.Test;

public class Distance {
    @Benchmark
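The listing breaks off at the @Benchmark annotation. For orientation, a Caliper benchmark method takes a repetition count and should return a value computed from the work so the JIT cannot optimize the loop away. A minimal sketch, assuming the point of the benchmark is to compare FastMath against java.lang.Math; the method names and bodies are illustrative, not the original code:

package com.mapr.bench;

import com.google.caliper.Benchmark;
import org.apache.commons.math3.util.FastMath;

public class Distance {
    private final double x = 3.1, y = 4.2;

    // Caliper passes the repetition count; accumulating and returning a value
    // keeps the JIT from eliminating the computation as dead code
    @Benchmark
    double fastMathDistance(int reps) {
        double sum = 0;
        for (int i = 0; i < reps; i++) {
            sum += FastMath.sqrt(x * x + y * y);
        }
        return sum;
    }

    @Benchmark
    double jdkMathDistance(int reps) {
        double sum = 0;
        for (int i = 0; i < reps; i++) {
            sum += Math.sqrt(x * x + y * y);
        }
        return sum;
    }
}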
public static class BigDecimalWritable implements Writable {
    private BigDecimal value;

    public BigDecimalWritable(BigDecimal value) {
        this.value = value;
    }

    public BigDecimal value() {
        return value;
    }
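The excerpt stops before the two methods that the Writable interface requires. One way to complete it, assuming the value is encoded as its unscaled integer bytes plus the scale (the encoding is my choice, not necessarily the original):

// additional imports needed: java.io.DataInput, java.io.DataOutput,
// java.io.IOException, java.math.BigInteger

// Hadoop instantiates Writables reflectively, so a no-argument
// constructor is needed as well
public BigDecimalWritable() {
}

@Override
public void write(DataOutput out) throws IOException {
    // a BigDecimal is fully determined by its unscaled value and its scale
    byte[] bytes = value.unscaledValue().toByteArray();
    out.writeInt(bytes.length);
    out.write(bytes);
    out.writeInt(value.scale());
}

@Override
public void readFields(DataInput in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    value = new BigDecimal(new BigInteger(bytes), in.readInt());
}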
@Test
public void testStats() {
    // the reference limits here were derived using a numerical simulation where I took
    // 10,000 samples from the distribution in question and computed the stats from that
    // sample to get min, 25%-ile, median and so on. I did this 1000 times to get 5% and
    // 95% confidence limits for those values.

    // symmetrical, well behaved
    System.out.printf("normal\n");
    check(normal(10000));
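The simulation described in the comment can be reconstructed roughly as follows; the class name and the exact set of statistics are my reading of the comment, not the original script:

import java.util.Arrays;
import java.util.Random;

public class ReferenceLimits {
    public static void main(String[] args) {
        Random gen = new Random();
        int trials = 1000, n = 10000;
        double[] quantiles = {0.0, 0.25, 0.5, 0.75, 1.0};
        double[][] stats = new double[quantiles.length][trials];

        for (int trial = 0; trial < trials; trial++) {
            // draw one sample of 10,000 values from the distribution under test
            double[] sample = new double[n];
            for (int i = 0; i < n; i++) {
                sample[i] = gen.nextGaussian();
            }
            Arrays.sort(sample);
            // record min, 25%-ile, median, 75%-ile and max for this trial
            for (int j = 0; j < quantiles.length; j++) {
                stats[j][trial] = sample[(int) Math.min(n - 1, quantiles[j] * n)];
            }
        }

        // the 5% and 95% points of each statistic's distribution over the
        // 1000 trials become the reference limits used in the test
        for (int j = 0; j < quantiles.length; j++) {
            Arrays.sort(stats[j]);
            System.out.printf("q=%.2f limits = [%.4f, %.4f]%n",
                quantiles[j], stats[j][(int) (0.05 * trials)], stats[j][(int) (0.95 * trials)]);
        }
    }
}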
Set<String> common = Sets.newHashSet(firstListOfEmails);
common.retainAll(secondListOfEmails);
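For a self-contained picture of this intersection idiom, here is the same two lines with imports and made-up stand-ins for the two address lists:

import java.util.List;
import java.util.Set;

import com.google.common.collect.Lists;
import com.google.common.collect.Sets;

public class CommonEmails {
    public static void main(String[] args) {
        // hypothetical inputs standing in for the two address lists
        List<String> firstListOfEmails = Lists.newArrayList("a@x.com", "b@x.com", "c@x.com");
        List<String> secondListOfEmails = Lists.newArrayList("b@x.com", "c@x.com", "d@x.com");

        // copy the first list into a set, then keep only what also appears in the second
        Set<String> common = Sets.newHashSet(firstListOfEmails);
        common.retainAll(secondListOfEmails);

        System.out.println(common);   // prints b@x.com and c@x.com in some order
    }
}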
public class HbaseLookup {
    static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(HbaseLookup.class);

    private HbaseLookup() {}

    @FunctionTemplate(name = "hLookup", scope = FunctionScope.SIMPLE, nulls = NullHandling.NULL_IF_NULL)
    public static class Lookup implements DrillSimpleFunc {
        @Param VarCharHolder table;   // the table to read from
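The excerpt ends at the first parameter. A Drill simple function generally continues with the remaining @Param and @Output holders and the setup()/eval() pair that the DrillSimpleFunc interface requires; the sketch below shows that shape. The extra parameter names, the injected buffer, and the eval() body are assumptions for illustration, not the original implementation:

@Param VarCharHolder row;        // hypothetical: the row key to look up
@Output VarCharHolder out;       // where the looked-up value goes
@Inject DrillBuf buffer;         // scratch buffer for the output bytes

public void setup() {
    // one-time initialization, e.g. setting up the HBase connection
}

public void eval() {
    // stand-in for the actual HBase read
    byte[] result = "value".getBytes();
    // copy the result into the output holder's buffer
    out.buffer = buffer.reallocIfNeeded(result.length);
    out.buffer.setBytes(0, result);
    out.start = 0;
    out.end = result.length;
}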
# picking the corners of the hypercube at random usually gives us a good selection
d = 0
while (d == 0) {
    centers = matrix(runif(10*10) > 0.5, ncol=10) + 0
    # but occasionally we get a duplicate row, which is easily detected
    # because it makes the determinant zero
    d = det(centers)
}
# start x out by selecting clusters
x = data.frame(n = ceiling(runif(10000, 1e-10, 10)))
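The same corner-picking loop can be sketched in Java, using commons-math3 (already imported elsewhere in these listings) for the determinant test; the LU-based determinant and the class name are my choices:

import java.util.Random;

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.LUDecomposition;

public class RandomCorners {
    public static void main(String[] args) {
        Random gen = new Random();
        double[][] centers = new double[10][10];
        double det = 0;
        // keep drawing random corners of the 10-dimensional unit hypercube
        // until the determinant shows the rows are linearly independent
        while (det == 0) {
            for (int i = 0; i < 10; i++) {
                for (int j = 0; j < 10; j++) {
                    centers[i][j] = gen.nextBoolean() ? 1 : 0;
                }
            }
            det = new LUDecomposition(new Array2DRowRealMatrix(centers)).getDeterminant();
        }
        System.out.printf("det = %.0f%n", det);
    }
}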
# Experiments with t-digest in R

# the standard t-digest bound: cluster size scales with q*(1-q), so clusters
# near the extreme quantiles stay small and those quantiles stay accurate
standard.size.bound = function(n, q) {
    4 * n * q * (1-q)
}

# a constant bound for comparison: the same limit at every quantile
constant.size.bound = function(n, q) {
    n
}
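To make the difference in shape concrete, the two bounds can be evaluated at a few quantiles; this quick Java sketch mirrors the R functions above (the sample size is arbitrary):

public class SizeBounds {
    static double standardSizeBound(int n, double q) {
        return 4 * n * q * (1 - q);
    }

    static double constantSizeBound(int n, double q) {
        return n;   // flat: q is ignored
    }

    public static void main(String[] args) {
        int n = 10000;
        // the standard bound pinches toward zero at the extreme quantiles,
        // while the constant bound stays flat
        for (double q : new double[]{0.001, 0.01, 0.25, 0.5}) {
            System.out.printf("q=%.3f standard=%.1f constant=%.1f%n",
                q, standardSizeBound(n, q), constantSizeBound(n, q));
        }
    }
}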
Log in to the cluster:

ted:downloads$ ssh se-node10.se.lab
Last login: Mon Mar 23 17:35:37 2015 from 10.250.0.220

Please check the cluster reservation calendar:

https://www.google.com/calendar/embed?src=maprtech.com_2d38343133383836382d313737%40resource.calendar.google.com

Poke around looking for my volume and such:

[tdunning@se-node10 ~]$ ls /mapr/se1/user/t
import fileinput
from string import join   # Python 2 only; this script predates Python 3
import json
import csv

### read the output from MAHOUT and collect into hash ###
with open('x', 'rb') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter='\t')
    old_id = ""
    indicators = []