
Ted Dunning tdunning

@tdunning
tdunning / shift-detection.r
Last active December 6, 2020 02:15
Sample code that shows how distributional changes in a single tail can be detected accurately using counts targeted at particular parts of a reference dataset
### Draws a figure illustrating change detection in the distribution of synthetic data.
### Each dot represents a single time period with 1000 samples. Before the change,
### the data is sampled from a unit normal distribution. After the change, 20 samples
### in each time period are taken from N(3,1). Comparing counts with a chi^2 test that
### is robust to small expected counts reliably detects this shift.
### log-likelihood ratio test for multinomial data
llr = function(k) {
  2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))
}
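
The preview stops before `H` is defined. The following is a minimal Java sketch of the same statistic, assuming `H` is the sum of p·log(p) over the normalized counts with 0·log(0) taken as 0. Under that assumption the expression equals the usual G statistic for a contingency table. The class name and the example counts in `main` are hypothetical and are not taken from the gist.

```java
final class LlrSketch {
    // Assumed definition of H: sum of p * log(p) over normalized counts,
    // skipping zero counts so that 0 * log(0) contributes 0.
    static double h(double[] counts) {
        double total = 0;
        for (double c : counts) {
            total += c;
        }
        double sum = 0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / total;
                sum += p * Math.log(p);
            }
        }
        return sum;
    }

    // Log-likelihood ratio statistic for a table of counts k[row][col].
    static double llr(double[][] k) {
        int rows = k.length;
        int cols = k[0].length;
        double[] cells = new double[rows * cols];
        double[] rowSums = new double[rows];
        double[] colSums = new double[cols];
        double total = 0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                cells[i * cols + j] = k[i][j];
                rowSums[i] += k[i][j];
                colSums[j] += k[i][j];
                total += k[i][j];
            }
        }
        return 2 * total * (h(cells) - h(rowSums) - h(colSums));
    }

    public static void main(String[] args) {
        // Hypothetical counts: rows are reference vs. current period,
        // columns are "below a tail cutoff" vs. "above it".
        double[][] unchanged = {{980, 20}, {980, 20}};
        double[][] shifted = {{980, 20}, {960, 40}};
        System.out.println("unchanged: " + llr(unchanged));
        System.out.println("shifted:   " + llr(shifted));
    }
}
```

Identical rows give a score near zero, and the score grows as the tail count in the second row moves away from the reference.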
tdunning / mcem.r
Last active December 7, 2020 22:59
Implementation of Monte Carlo EM algorithm for reconstructing a standard distribution from censored observations
### This is a demonstration of a Monte Carlo Expectation Maximization
### algorithm that can recover the mean and standard deviation of
### truncated normally distributed data. We get 10,000 samples from
### a unit normal distribution, but every sample below 0.5 is truncated
### to that value. Every sample above 2.5 is truncated to that value.
### These choices were made to get quick and visually appealing convergence
### but the algorithm still converges for any choice. The convergence
### could be very, very slow if there is little information in the samples
### and the final answer could have substantial uncertainty. For instance,
### if we truncated at 4 and 6, almost all samples would be piled up at
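
The description above can be sketched in Java as a simple stochastic variant of the idea. This is not the gist's implementation, which the preview does not show. The sketch assumes one imputed value per clamped observation per iteration, plain rejection sampling for the imputations, and the clamped sample's own mean and standard deviation as the starting guess. All class and method names are hypothetical.

```java
final class McemSketch {
    // Redraw from N(mu, sigma) until the value is at or below the bound.
    static double drawAtMost(java.util.Random rng, double mu, double sigma, double bound) {
        double z;
        do {
            z = mu + sigma * rng.nextGaussian();
        } while (z > bound);
        return z;
    }

    // Redraw from N(mu, sigma) until the value is at or above the bound.
    static double drawAtLeast(java.util.Random rng, double mu, double sigma, double bound) {
        double z;
        do {
            z = mu + sigma * rng.nextGaussian();
        } while (z < bound);
        return z;
    }

    // Returns {mean, standard deviation} of x.
    static double[] moments(double[] x) {
        double sum = 0;
        for (double v : x) {
            sum += v;
        }
        double mean = sum / x.length;
        double ss = 0;
        for (double v : x) {
            ss += (v - mean) * (v - mean);
        }
        return new double[] {mean, Math.sqrt(ss / x.length)};
    }

    // Estimates {mu, sigma} from data that was clamped to [lo, hi].
    static double[] fit(double[] clamped, double lo, double hi, int iterations, long seed) {
        java.util.Random rng = new java.util.Random(seed);
        double[] est = moments(clamped); // naive starting guess
        double[] filled = new double[clamped.length];
        for (int iter = 0; iter < iterations; iter++) {
            // E-step (Monte Carlo): impute each clamped value from the current fit.
            for (int i = 0; i < clamped.length; i++) {
                double x = clamped[i];
                if (x <= lo) {
                    filled[i] = drawAtMost(rng, est[0], est[1], lo);
                } else if (x >= hi) {
                    filled[i] = drawAtLeast(rng, est[0], est[1], hi);
                } else {
                    filled[i] = x;
                }
            }
            // M-step: refit the normal to the completed data.
            est = moments(filled);
        }
        return est;
    }

    public static void main(String[] args) {
        java.util.Random rng = new java.util.Random(1);
        double[] data = new double[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = Math.max(0.5, Math.min(2.5, rng.nextGaussian()));
        }
        double[] est = fit(data, 0.5, 2.5, 50, 2L);
        System.out.println("mu = " + est[0] + ", sigma = " + est[1]);
    }
}
```

Because each iteration uses fresh random imputations, the estimates keep jittering around their limit instead of settling exactly. Averaging the last few iterations would smooth that out. Rejection sampling is also slow when the clamped region is far in the tail of the current fit, so a dedicated truncated-normal sampler would be a better choice for extreme bounds.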
### This code builds a simple physical model of the range of an 85kWh Tesla Model S and
### compares it to real data. The data here is digitized from
### https://www.tesla.com/blog/model-s-efficiency-and-range
### The model here accounts for aerodynamic drag, viscous drag, constant
### friction and constant power drain
### First the digitized data
x = read.csv(text="v,range
10.22976354700292, 393.9005561997566
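
The four terms named in the description can be combined into a range formula. The sketch below is a rough Java illustration, not the gist's model. It assumes aerodynamic drag is a force proportional to v², "viscous drag" is a force proportional to v, and range is battery energy divided by energy used per unit distance. Every coefficient is a hypothetical placeholder in SI units and is not fitted to the digitized data.

```java
final class RangeModelSketch {
    // 85 kWh expressed in joules.
    static final double BATTERY_JOULES = 85 * 3.6e6;

    // Hypothetical placeholder coefficients, NOT fitted values.
    static final double AERO = 0.35;        // N per (m/s)^2, aerodynamic drag
    static final double VISCOUS = 1.0;      // N per (m/s), viscous drag
    static final double FRICTION = 150.0;   // N, constant friction
    static final double DRAIN_WATTS = 1500; // W, constant power drain

    // Energy used per meter travelled at constant speed v (m/s), in J/m.
    // The constant drain is a power, so it costs DRAIN_WATTS / v per meter.
    static double energyPerMeter(double v) {
        return AERO * v * v + VISCOUS * v + FRICTION + DRAIN_WATTS / v;
    }

    // Range in meters at constant speed v (m/s).
    static double rangeMeters(double v) {
        return BATTERY_JOULES / energyPerMeter(v);
    }

    public static void main(String[] args) {
        for (double v = 5; v <= 40; v += 5) {
            System.out.println(v + " m/s -> " + rangeMeters(v) / 1000 + " km");
        }
    }
}
```

The constant drain dominates at low speed and the v² term dominates at high speed, so a model of this shape has its best range at some intermediate speed.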
tdunning / viewpoints.r
Last active July 12, 2019 20:31
How different definitions of distance change our view of clustering
# you can run this script with the following R command:
# source('https://gist.githubusercontent.com/tdunning/badb88043d41d916a3148c669f2fb0cd/raw/8d3289fdbf2a7999bd5d9687002488b904e1d82f/viewpoints.r')
set.seed(1)
noise = matrix(nrow=2000, ncol=8, data=rnorm(4*8*500))
offsets = matrix(
    c(rep(-1,1000), rep(1,1000),
      rep(-1, 500), rep(1, 500), rep(-1, 500), rep(1, 500)),
    ncol=2)
xy = rbind(matrix(nrow=2000, ncol=2, data=rnorm(2*2000))) + offsets * 8
tdunning / Summarizer.java
Created April 12, 2019 23:18
Demonstrates the summarization of database fields using t-digest
package com.tdunning.tdigest.quality;
import com.google.common.collect.ImmutableList;
import com.google.common.io.Resources;
import com.tdunning.math.stats.MergingDigest;
import com.tdunning.math.stats.TDigest;
import org.junit.Test;
import java.io.File;
import java.io.IOException;
tdunning / MomentSketchOffsetTest.java
Created March 25, 2019 22:12
Test for moment sketches versus offset distribution
public class MomentSketchOffsetTest {
    @Test
    public void testOffsetUniform() throws Exception {
        MomentSketch ms = new MomentSketch(1e-10);
        ms.setSizeParam(7);
        ms.initialize();
        double[] data = TestDataSource.getUniform(20e1, 20e1 + 1, 1_000_000);
        ms.add(data);
tdunning / HighDynamicRangeQuantile.java
Last active August 25, 2017 06:50 — forked from oertl/HighDynamicRangeQuantile.java
Simpler and slightly faster version of Otmar Ertl's idea for improving FastHistogram / HdrHistogram
public class HighDynamicRangeQuantile {
    private final long[] counts;
    private double minimum = Double.POSITIVE_INFINITY;
    private double maximum = Double.NEGATIVE_INFINITY;
    private long underFlowCount = 0;
    private long overFlowCount = 0;
    private final double factor;
    private final double offset;
    private final double minExpectedQuantileValue;
tdunning / td-in-r.r
Last active November 26, 2018 20:05
A simplified implementation of a merging t-digest in R with some visualization of the results
### x is either a vector of numbers or a data frame with sums and weights. Digest is a data frame.
merge = function(x, digest, compression=100) {
  ## Force the digest to be a data.frame, possibly empty
  if (!is.data.frame(digest) && is.na(digest)) {
    digest = data.frame(sum=c(), weight=c())
  }
  ## and coerce the incoming data likewise ... a vector of points has a default weight of 1
  if (!is.data.frame(x)) {
    x = data.frame(sum=x, weight=1)
  }
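
The preview ends before the merge step itself. As a rough illustration of what a merging pass over `{sum, weight}` rows can look like, here is a Java sketch. It is not a port of the gist. It assumes the arcsine scale function from the t-digest paper and a greedy left-to-right pass that closes a centroid once adding the next row would span more than one unit of the scale function. The class and method names are hypothetical.

```java
final class TDigestMergeSketch {
    // Assumed scale function: k(q) = compression / (2 * pi) * asin(2q - 1).
    static double k(double q, double compression) {
        return compression / (2 * Math.PI) * Math.asin(2 * q - 1);
    }

    // Each row is {sum, weight}; a raw point x is the row {x, 1}.
    // Returns merged rows ordered by mean (sum / weight).
    static double[][] merge(double[][] rows, double compression) {
        if (rows.length == 0) {
            return new double[0][];
        }
        double[][] sorted = rows.clone();
        java.util.Arrays.sort(sorted, (a, b) -> Double.compare(a[0] / a[1], b[0] / b[1]));
        double total = 0;
        for (double[] r : sorted) {
            total += r[1];
        }

        java.util.List<double[]> out = new java.util.ArrayList<>();
        double done = 0; // weight already emitted in finished centroids
        double sum = sorted[0][0];
        double weight = sorted[0][1];
        double kLeft = k(0, compression);
        for (int i = 1; i < sorted.length; i++) {
            double qRight = Math.min(1.0, (done + weight + sorted[i][1]) / total);
            if (k(qRight, compression) - kLeft <= 1) {
                // Still within one unit of the scale function: absorb the row.
                sum += sorted[i][0];
                weight += sorted[i][1];
            } else {
                // Close the current centroid and start a new one.
                out.add(new double[] {sum, weight});
                done += weight;
                kLeft = k(done / total, compression);
                sum = sorted[i][0];
                weight = sorted[i][1];
            }
        }
        out.add(new double[] {sum, weight});
        return out.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        java.util.Random rng = new java.util.Random(1);
        double[][] points = new double[10_000][];
        for (int i = 0; i < points.length; i++) {
            points[i] = new double[] {rng.nextGaussian(), 1};
        }
        double[][] digest = merge(points, 100);
        System.out.println("centroids: " + digest.length);
    }
}
```

Under these assumptions, merging new data into an existing digest amounts to concatenating the two sets of rows and running the same pass again. Centroids near the tails stay small while those near the median absorb many points.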
tdunning / speed.c
Created July 14, 2016 00:07
test of effect of flushing on speed of disk I/O on OSX
#define _GNU_SOURCE
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <sys/file.h>
#include <unistd.h>
#include <errno.h>