gdbassett’s gists

gdbassett / gist:97e9df660269db098db8

Created July 29, 2014 14:16

	### Keybase proof

	I hereby claim:

	* I am gdbassett on github.
	* I am gdbassett (https://keybase.io/gdbassett) on keybase.
	* I have a public key whose fingerprint is 8F47 6E59 65B3 9C92 428C 5A8C C609 81ED D4FA 1957

	To claim this, I am signing this object:

gdbassett / robustTau.py

Created October 10, 2014 16:36

A robust measure of scale modeled on Maronna & Zamar's performance improvements on Rousseeuw and Croux's skewness and efficiency improvements on median absolute deviation. Blatantly transposed from the R robustbase package.

	from scipy import stats as scistats
	import numpy as np

	# Implementation of Tau from http://amstat.tandfonline.com/doi/abs/10.1198/004017002188618509#.VDgKhdR4rEh
	# blatantly transposed R robustbase library from http://r-forge.r-project.org/scm/?group_id=59, OGK.R
	def scaleTau2(x, c1 = 4.5, c2 = 3.0, consistency = True, mu_too = False, *xargs, **kargs):
	    ## NOTA BENE: This is NOT consistency corrected
	    x = np.asarray(x)
	    n = len(x)
	    medx = np.median(x)
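
The preview above stops after the median, so here is a minimal pure-Python sketch of an uncorrected tau scale estimate. It assumes the usual structure of the estimator (a MAD-based initial scale, a smoothly weighted location, then truncated squared residuals) and, like the gist, applies no consistency correction. The name `scale_tau2`, the requirement that `c1` be positive, and the zero-MAD check are assumptions added for illustration; they are not taken from the gist.

```python
import math
from statistics import median


def scale_tau2(x, c1=4.5, c2=3.0):
    """Uncorrected tau estimate of scale (illustrative sketch, c1 > 0 assumed)."""
    x = [float(v) for v in x]
    n = len(x)
    med = median(x)
    abs_dev = [abs(v - med) for v in x]
    sigma0 = median(abs_dev)  # unscaled median absolute deviation
    if sigma0 == 0:
        # Assumption of this sketch: refuse to divide by a zero initial scale.
        raise ValueError("MAD is zero; the tau scale is undefined for this sample")

    # Weighted location: weights fall smoothly to zero at c1 * sigma0 from the median.
    weights = [max(0.0, 1.0 - (d / (c1 * sigma0)) ** 2) ** 2 for d in abs_dev]
    mu = sum(v * w for v, w in zip(x, weights)) / sum(weights)

    # Truncated squared residuals: rho(u) = min(u^2, c2^2).
    rho = [min(((v - mu) / sigma0) ** 2, c2 ** 2) for v in x]
    return sigma0 * math.sqrt(sum(rho) / n)
```

Because each residual contributes at most `c2 ** 2` to the sum, a single extreme value can only move the estimate by a bounded amount, which is the property that makes the measure robust.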

gdbassett / bulk_netflow_import.py

Created November 20, 2014 02:51

A script to bulk import netflow records into a Neo4j graph database. Designed for efficiency; it can import roughly 1 million flows every 2 hours.

	#!/usr/bin/env python
	# -*- encoding: utf-8 -*-

	"""
	AUTHOR: Gabriel Bassett
	DATE: 11-19-2014
	DEPENDENCIES: py2neo
	Copyright 2014 Gabriel Bassett

gdbassett / canopy.py

Created December 12, 2014 21:59

Efficient Python implementation of canopy clustering. (A method for efficiently generating centroids and clusters, most commonly as input to a more robust clustering algorithm.)

	from sklearn.metrics.pairwise import pairwise_distances
	import numpy as np

	# X should be a numpy matrix, very likely a sparse matrix: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix
	# T1 > T2 for overlapping clusters
	# T1 = Distance to centroid point to not include in other clusters
	# T2 = Distance to centroid point to include in cluster
	# T1 < T2 will have points which reside in no clusters
	# T1 == T2 will cause all points to reside in mutually exclusive clusters
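
The preview does not show the function body, so the following is only a rough pure-Python sketch of greedy canopy clustering with two distance thresholds. It deliberately uses the explicit names `include_radius` and `exclude_radius` rather than guessing how the gist's T1 and T2 map onto them. The function name, the dictionary layout, and the lowest-index choice of centers are all assumptions made for illustration.

```python
import math


def canopy_sketch(points, include_radius, exclude_radius):
    """Greedy canopy clustering over a list of equal-length numeric tuples.

    include_radius: points closer than this to a center join its canopy.
    exclude_radius: points closer than this to a center stop being
                    candidates for future centers.
    """
    if include_radius <= 0 or exclude_radius <= 0:
        raise ValueError("both radii must be positive")

    candidates = list(range(len(points)))  # indices still eligible to be centers
    canopies = []
    while candidates:
        center = candidates[0]  # deterministic choice: lowest remaining index
        dists = [math.dist(points[center], p) for p in points]
        members = [i for i, d in enumerate(dists) if d < include_radius]
        canopies.append({"center": center, "points": members})
        # The center is at distance 0 from itself, so it is always removed
        # and the loop terminates.
        candidates = [i for i in candidates if dists[i] >= exclude_radius]
    return canopies
```

In this sketch, an inclusion radius larger than the exclusion radius lets canopies overlap, while an inclusion radius smaller than the exclusion radius can leave points in no canopy at all.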

gdbassett / text_cluster.py

Last active October 2, 2018 07:19

Basic script for text->vectorization->TF-IDF->canopies->kmeans->clusters. Initially tested on VCDB breach summaries.

	#!/usr/bin/env python
	# -*- encoding: utf-8 -*-
	# based on http://scikit-learn.org/stable/auto_examples/document_clustering.html

	from sklearn.feature_extraction.text import TfidfVectorizer
	from sklearn.cluster import KMeans, MiniBatchKMeans
	from sklearn.metrics.pairwise import pairwise_distances
	import numpy as np
	from time import time
	from collections import defaultdict

gdbassett / gist:6438b4036a501eba9f5e

Created January 28, 2015 14:11

Association Rules Console Output 1

	> df <- df[!names(df) %in% c('root.victim.region',
	+ 'root.victim.country',
	+ 'root.summary',
	+ 'root.summary=Source_Category',
	+ 'root.victim.industry',
	+ 'root.timeline.incident.year',
	+ 'root.plus.dbir_year',
	+ 'root.action.social.notes',
	+ 'root.victim.secondary.notes',
	+ 'root.action.hacking.notes',

gdbassett / linearKMeans.R

Last active February 27, 2016 16:45

A quick function that performs a k-means-like calculation, using a line in place of the point centroid. Used to try to classify multiple linear relationships in a dataset.

	#' @param df Dataframe with x and y columns. (Hopefully in the future this can be x)
	#' @param nlines The number of clusters.
	#' @param ab A data frame with 'slopes' and 'intercepts' columns and one row per initial line. The number of rows must match nlines.
	#' @param maxiter The maximum number of iterations to do
	#' @export
	#' @examples
	linearKMeans <- function(df, ab=NULL, nlines=0, maxiter=1000) {
	  # default number of lines
	  nlines_default <- 5
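
The idea of k-means with lines in place of point centroids can be sketched as an alternation between assigning points to their nearest line and refitting each line. The Python sketch below is an illustration only: the preview does not show how the R function measures distance or refits lines, so the vertical-residual distance, the ordinary least-squares refit, and the names `fit_line` and `linear_kmeans_sketch` are assumptions.

```python
def fit_line(points):
    """Ordinary least-squares (slope, intercept) for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:  # all x values equal: fall back to a horizontal line
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def linear_kmeans_sketch(points, lines, maxiter=1000):
    """points: list of (x, y); lines: initial (slope, intercept) pairs."""
    lines = list(lines)
    labels = None
    for _ in range(maxiter):
        # Assignment step: nearest line by absolute vertical residual (assumed).
        new_labels = [
            min(range(len(lines)),
                key=lambda k: abs(y - (lines[k][0] * x + lines[k][1])))
            for x, y in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the lines will not change
        labels = new_labels
        # Update step: refit each line to the points currently assigned to it.
        for k in range(len(lines)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if len(members) >= 2:  # keep the old line if too few points
                lines[k] = fit_line(members)
    return lines, labels
```

As with ordinary k-means, the result depends on the initial lines, which is presumably why the R function accepts starting slopes and intercepts through `ab`.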

gdbassett / test_GENERIC.Rmd

Created November 3, 2016 23:01

	---
	title: "Test"
	author: "Gabe"
	date: "November 03, 2016"
	output: html_document
	params:
	  df: data.frame()
	  a: ""
	  b: ""
	  c: "FALSE"

gdbassett / livesplit.R

Last active June 26, 2017 20:02

Basic R code to parse LiveSplit splits into a data frame

	speedrun <- XML::xmlParse("/livesplit.lss")
	speedrun <- XML::xmlToList(speedrun)

	chunk <- do.call(rbind, lapply(speedrun[['Segments']], function(segments) {

	  segments.df <- do.call(rbind, lapply(segments[['SegmentHistory']], function(segment) {
	    if ('RealTime' %in% names(segment))
	      data.frame(`attemptID` = segment$.attrs['id'], RealTime = segment$RealTime)
	  }))
	  segments.df$name <- rep(segments$Name, nrow(segments.df))
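
For comparison, the same traversal can be sketched in Python with the standard library. This assumes only the structure the R code relies on: a `Segments` element whose children each have a `Name` and a `SegmentHistory`, whose children in turn carry an `id` attribute and an optional `RealTime`. The sample XML, including its root element, child tag names, and time values, is hypothetical data for illustration, and `parse_splits` is an invented name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the structure the R code walks.
SAMPLE_LSS = """<Run>
  <Segments>
    <Segment>
      <Name>Level 1</Name>
      <SegmentHistory>
        <Time id="1"><RealTime>00:01:05.2000000</RealTime></Time>
        <Time id="2" />
        <Time id="3"><RealTime>00:00:58.9000000</RealTime></Time>
      </SegmentHistory>
    </Segment>
    <Segment>
      <Name>Level 2</Name>
      <SegmentHistory>
        <Time id="1"><RealTime>00:02:10.0000000</RealTime></Time>
      </SegmentHistory>
    </Segment>
  </Segments>
</Run>"""


def parse_splits(xml_text):
    """Return one row per recorded attempt that has a RealTime value."""
    root = ET.fromstring(xml_text)
    rows = []
    for segment in root.find("Segments"):
        name = segment.findtext("Name")
        history = segment.find("SegmentHistory")
        if history is None:
            continue
        for attempt in history:
            real_time = attempt.findtext("RealTime")
            if real_time is not None:  # skip attempts with no RealTime
                rows.append({"attemptID": attempt.get("id"),
                             "RealTime": real_time,
                             "name": name})
    return rows
```

The list of dictionaries plays the role of the row-bound data frame; it could be passed to a tabular library if one is available.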

gdbassett / bayesian_credible_intervals.R

Last active August 15, 2017 18:13

Bayesian credible intervals on VERIS data

	# pick an enumeration
	enum <- "action.*.variety"
	# establish filter criteria (easier than a complex standard-eval filter_ line)
	df <- vcdb %>%
	  dplyr::filter(plus.dbir_year == 2016, subset.2017dbir) %>%
	  dplyr::filter(attribute.confidentiality.data_disclosure.Yes) %>%
	  dplyr::filter(victim.industry2.92)

	# establish priors from previous year
	priors <- df %>%

Gabe gdbassett