Gabe gdbassett

  • Liberty Mutual
  • US
@gdbassett
gdbassett / gist:6438b4036a501eba9f5e
Created January 28, 2015 14:11
Association Rules Console Output 1
> df <- df[!names(df) %in% c('root.victim.region',
+ 'root.victim.country',
+ 'root.summary',
+ 'root.summary=Source_Category',
+ 'root.victim.industry',
+ 'root.timeline.incident.year',
+ 'root.plus.dbir_year',
+ 'root.action.social.notes',
+ 'root.victim.secondary.notes',
+ 'root.action.hacking.notes',
@gdbassett
gdbassett / text_cluster.py
Last active October 2, 2018 07:19
Basic script for text->vectorization->TF-IDF->canopies->kmeans->clusters. Initially tested on VCDB breach summaries.
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# based on http://scikit-learn.org/stable/auto_examples/document_clustering.html
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics.pairwise import pairwise_distances
import numpy as np
from time import time
from collections import defaultdict
@gdbassett
gdbassett / canopy.py
Created December 12, 2014 21:59
Efficient Python implementation of canopy clustering. (A method for efficiently generating centroids and clusters, most commonly used as input to a more robust clustering algorithm.)
from sklearn.metrics.pairwise import pairwise_distances
import numpy as np
# X should be a numpy matrix, very likely a sparse matrix: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix
# T1 = loose threshold: points within T1 of a center join that center's canopy
# T2 = tight threshold: points within T2 of a center are removed from the candidate pool
# T1 > T2 yields overlapping clusters
# T1 < T2 leaves some points in no cluster
# T1 == T2 causes all points to reside in mutually exclusive clusters
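A minimal sketch of the canopy pass those thresholds describe, assuming Euclidean distance, with points within T1 of a center joining its canopy and points within T2 removed from the candidate pool (the function shape is illustrative, not the gist's actual implementation):

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

def canopy(X, T1, T2):
    """Return canopies as {center_index: [member_indices]}."""
    X = np.asarray(X, dtype=float)
    pool = np.arange(X.shape[0])   # candidate points not yet removed
    canopies = {}
    while pool.size:
        center = pool[0]           # pick an arbitrary remaining point as center
        d = pairwise_distances(X[center:center + 1], X[pool]).ravel()
        canopies[center] = pool[d < T1].tolist()  # within T1: join this canopy
        pool = pool[d >= T2]                      # within T2: leave the pool
    return canopies
```

With T1 > T2, points in the (T2, T1] ring stay in the pool and can appear in several canopies, which is why canopies make cheap overlapping seeds for a downstream k-means.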
@gdbassett
gdbassett / bulk_netflow_import.py
Created November 20, 2014 02:51
A script to bulk import netflow records into a Neo4j graph database. Designed for efficiency; it can import roughly 1 million flows every 2 hours.
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
AUTHOR: Gabriel Bassett
DATE: 11-19-2014
DEPENDENCIES: py2neo
Copyright 2014 Gabriel Bassett
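The preview stops inside the module docstring. Throughput like the quoted ~1M flows per 2 hours generally comes from batching writes; a hypothetical sketch of that batching idea, using a parameterized UNWIND Cypher statement rather than the gist's py2neo calls (the query text, property names, and batch size are illustrative assumptions, not taken from the gist):

```python
# Hypothetical Cypher: one statement per batch, executed server-side per row.
FLOW_CYPHER = """
UNWIND $rows AS r
MERGE (s:IP {addr: r.src})
MERGE (d:IP {addr: r.dst})
CREATE (s)-[:FLOW {sport: r.sport, dport: r.dport, bytes: r.bytes}]->(d)
"""

def batch_params(flows, size=1000):
    """Yield UNWIND-ready parameter dicts in chunks of `size` flow records."""
    for i in range(0, len(flows), size):
        yield {"rows": flows[i:i + size]}
```

Each yielded dict would be passed as the parameters of one transaction, amortizing round-trip and transaction overhead across many flows.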
@gdbassett
gdbassett / robustTau.py
Created October 10, 2014 16:36
A robust measure of scale modeled on Maronna & Zamar's performance improvements on Rousseeuw and Croux's skewness and efficiency improvements to the median absolute deviation. Blatantly transposed from the R robustbase module.
from scipy import stats as scistats
import numpy as np
# Implementation of Tau from http://amstat.tandfonline.com/doi/abs/10.1198/004017002188618509#.VDgKhdR4rEh
# blatantly transposed from the R robustbase library, http://r-forge.r-project.org/scm/?group_id=59, OGK.R
def scaleTau2(x, c1 = 4.5, c2 = 3.0, consistency = True, mu_too = False, *xargs, **kargs):
## NOTA BENE: This is *NOT* consistency corrected
x = np.asarray(x)
n = len(x)
medx = np.median(x)
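The preview cuts off mid-function. A self-contained sketch of the full tau-scale estimator, following robustbase's OGK.R as the gist's comments describe (the weight clamping and consistency term are transcribed from that R source; treat the details as an approximation of the gist's remainder):

```python
from scipy import stats as scistats
import numpy as np

def scaleTau2(x, c1=4.5, c2=3.0, consistency=True, mu_too=False):
    # Tau-scale of Maronna & Zamar, transposed from robustbase's OGK.R.
    x = np.asarray(x, dtype=float)
    n = len(x)
    medx = np.median(x)
    xc = np.abs(x - medx)
    sigma0 = np.median(xc)                   # unnormalized MAD
    w = 1.0 - (xc / (c1 * sigma0)) ** 2
    w = ((np.abs(w) + w) / 2.0) ** 2         # clamp negatives to 0, then square
    mu = np.sum(x * w) / np.sum(w)           # weighted location estimate
    rho = ((x - mu) / sigma0) ** 2
    rho = np.minimum(rho, c2 ** 2)           # cap large squared residuals
    if consistency:
        # E[min(Z^2, b^2)] for Z ~ N(0,1), at b = c2 * qnorm(3/4),
        # making tau Fisher-consistent for the normal standard deviation
        b = c2 * scistats.norm.ppf(0.75)
        nEs2 = n * (2 * ((1 - b ** 2) * scistats.norm.cdf(b)
                         - b * scistats.norm.pdf(b) + b ** 2) - 1)
    else:
        nEs2 = n
    tau = sigma0 * np.sqrt(np.sum(rho) / nEs2)
    return (mu, tau) if mu_too else tau
```

On a large standard-normal sample the consistency-corrected tau converges to 1, which is a quick sanity check for the transcription.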
### Keybase proof
I hereby claim:
* I am gdbassett on github.
* I am gdbassett (https://keybase.io/gdbassett) on keybase.
* I have a public key whose fingerprint is 8F47 6E59 65B3 9C92 428C 5A8C C609 81ED D4FA 1957
To claim this, I am signing this object: