W.P. McNeill wpm
@wpm
wpm / multi_join.py
Last active May 31, 2023 10:57
Pandas multi-table join
import pandas
"""
Join an arbitrary number of data frames, using a multi-index label for each data frame.
For example, say you have three data frames, each of which lists the classroom and
number of students a teacher has in a given period.
         Classroom  Students
Teacher
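The preview above is cut off before the join itself, so the following is only a minimal sketch of one way to get the described result, assuming the frames share a `Teacher` index. The helper name `multi_join`, the period labels, and all of the sample data are hypothetical and not taken from the gist. It relies on `pandas.concat` with `axis=1` and `keys`, which may differ from the approach the gist actually uses.

```python
import pandas as pd


def multi_join(frames, labels):
    """Place the frames side by side, aligned on their index.

    Each frame's columns are grouped under the matching label, so the
    result has two-level (MultiIndex) columns. Hypothetical helper.
    """
    return pd.concat(frames, axis=1, keys=labels)


# Hypothetical sample data: one frame per period, indexed by teacher.
period1 = pd.DataFrame(
    {"Classroom": [101, 102], "Students": [25, 30]},
    index=pd.Index(["Jones", "Smith"], name="Teacher"),
)
period2 = pd.DataFrame(
    {"Classroom": [103, 101], "Students": [20, 28]},
    index=pd.Index(["Jones", "Lee"], name="Teacher"),
)

joined = multi_join([period1, period2], ["Period 1", "Period 2"])
print(joined)
```

Because `concat` performs an outer join on the index by default, a teacher missing from one period gets `NaN` in that period's columns rather than being dropped.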
wpm / poll.js
Last active November 14, 2019 09:59
Javascript Polling with Promises
var Promise = require('bluebird');
/**
* Periodically poll a signal function until either it returns true or a timeout is reached.
*
* @param signal function that returns true when the polled operation is complete
* @param interval time interval between polls in milliseconds
* @param timeout period of time before giving up on polling
* @returns true if the signal function returned true, false if the operation timed out
*/
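The function body is not shown in this preview. As a rough illustration of the contract described in the comment, here is a sketch of the same idea in Python with `asyncio` rather than JavaScript with bluebird. The name `poll`, the parameter names `interval_ms` and `timeout_ms`, and the assumption that the timeout is also given in milliseconds are mine, not the gist's.

```python
import asyncio
import time


async def poll(signal, interval_ms, timeout_ms):
    """Call signal() repeatedly until it returns True or the timeout passes.

    Returns True if signal() returned True, False on timeout.
    Assumes both interval_ms and timeout_ms are in milliseconds.
    """
    deadline = time.monotonic() + timeout_ms / 1000.0
    while True:
        if signal():
            return True
        if time.monotonic() >= deadline:
            return False
        # Yield to the event loop between polls instead of blocking.
        await asyncio.sleep(interval_ms / 1000.0)


def make_counter_signal(n):
    """Hypothetical signal that becomes true on its n-th call."""
    calls = {"count": 0}

    def signal():
        calls["count"] += 1
        return calls["count"] >= n

    return signal


completed = asyncio.run(poll(make_counter_signal(3), interval_ms=10, timeout_ms=1000))
timed_out = asyncio.run(poll(lambda: False, interval_ms=10, timeout_ms=50))
print(completed, timed_out)
```

Checking the signal before checking the deadline means the operation gets at least one poll even with a very short timeout.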
wpm / spark_parallel_boost.py
Last active December 3, 2018 02:56
A simple example of how to integrate the Spark parallel computing framework and the scikit-learn machine learning toolkit. This script randomly generates test and train data sets, trains an ensemble of decision trees using boosting, and applies the ensemble to the test set. The ensemble training is done in parallel.
from pyspark import SparkContext
import numpy as np
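# Note: the sklearn.cross_validation module was removed in scikit-learn 0.20, and Bootstrap was removed
# in an earlier release, so this import only works with an old scikit-learn version.
from sklearn.cross_validation import train_test_split, Bootstrap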
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
def run(sc):
wpm / ItemSet.java
Created September 13, 2011 18:35
ItemSet: a Hadoop ArrayWritable of Text
package wpmcn.structure;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import java.util.*;
/**