22 Aug 2016 - This is a post on my blog.
I recently released slots, a Python library that implements multi-armed bandit strategies. If that sounds like something that won't put you to sleep, then please pip install slots and read on.
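If you haven't run into bandit algorithms before, here's a bare-bones epsilon-greedy sketch in plain Python to show the explore/exploit idea. This is not the slots API, and the payout rates are made up.

import random

# Not the slots API: a minimal epsilon-greedy bandit with made-up payout
# rates, just to illustrate exploring versus exploiting.
true_payouts = [0.1, 0.3, 0.8]
counts = [0, 0, 0]
rewards = [0.0, 0.0, 0.0]
epsilon = 0.1

for _ in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(len(true_payouts))  # explore a random arm
    else:
        # exploit the arm with the best observed payout rate so far
        arm = max(range(len(true_payouts)),
                  key=lambda i: rewards[i] / counts[i] if counts[i] else 0.0)
    payout = 1.0 if random.random() < true_payouts[arm] else 0.0
    counts[arm] += 1
    rewards[arm] += payout

print(counts)  # most pulls should end up on the highest-payout arm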
29 Feb 2016 - This is a post on my blog.
October 2015 marked my first year of working as a data scientist on a remote team at Zoomer (we help restaurants deliver food, fast). I have always thought that remote work was an obvious choice given the penetration of the internet into all aspects of our lives. In this post I'm going to talk about some of the experiences I've had and lessons learned working on a remote data science team.
Zoomer's main offices are in Philadelphia and San Francisco, but, by the nature of our business, we are a fundamentally distributed company. While I live in San Francisco and work out of the SF office a few days a week, I spend most of the time working from my apartment, collaborating with my team remotely.
Remote data science is not fundamentally different from working on other remote teams, but it has its own set of challenges. Done right, remote data science...
17 Jan 2016 - This is a post on my blog.
Note: I wrote this short post a few years ago and recently found it floating around. It seems more obvious now (and probably trite), but I thought it was still worth posting on the old 'net.
# -*- coding: utf-8 -*-
import datetime

from numpy import asarray, ceil
import pandas
import rpy2.robjects as robjects


def stl(data, ns, np=None, nt=None, nl=None, isdeg=0, itdeg=1, ildeg=1,
        nsjump=None, ntjump=None, nljump=None, ni=2, no=0, fulloutput=False):
    """Wrap R's stl() (seasonal-trend decomposition by loess) via rpy2."""
17 Sep 2014 - This is a post on my blog.
MapReduce is a powerful algorithm for processing large sets of data in a distributed, parallel manner. It has proven very popular for many data processing tasks, particularly using the open source Hadoop implementation.
The most basic idea powering MapReduce is to break large data sets into smaller chunks, which are then processed separately (in parallel). The results of the chunk processing are then collected and combined into a final result.
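As a toy illustration of the pattern in plain Python (not Hadoop), here's a word count where each chunk is mapped to partial counts independently and the partial results are then reduced into one.

from collections import Counter
from functools import reduce

# Toy word count in the map/reduce style: process each chunk on its own
# (the map step), then merge the partial counts (the reduce step).
chunks = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

mapped = [Counter(chunk.split()) for chunk in chunks]  # map step
total = reduce(lambda a, b: a + b, mapped, Counter())  # reduce step

print(total.most_common(3))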
This is some matplotlib scratch code to make a pretty boxplot as seen here.
import matplotlib.pyplot as plt

# Data is external: ra, tar, tas, ta, gar, gas, pa
bp = plt.boxplot([ra, tar + tas, ta, gar + gas, pa],
                 widths=0.2, sym='', patch_artist=True)
plt.setp(bp['caps'], color='blue', alpha=1)
plt.setp(bp['whiskers'], color='blue', alpha=1)
For the Forest Cover Type Prediction competition on Kaggle, the goal is to predict the predominant type of trees in a given section of forest. The score is based on average classification accuracy for the 7 different tree cover classes.
To beat the all fir/spruce benchmark I obviously tried a random forest. Using the default settings of scikit-learn's RandomForestClassifier, I was able to beat the benchmark with an accuracy score of 0.72718 on the competition leaderboard. By using 100 estimators (versus the default of 10), I was able to raise that accuracy score up to 0.75455.
Using pandas, I loaded the train and test data sets into Python.
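Roughly, the pipeline looked like the sketch below; the file names and the Id/Cover_Type column names are assumed from the competition's data layout.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumed file and column names ('Id', 'Cover_Type') from the Kaggle data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

features = [c for c in train.columns if c not in ('Id', 'Cover_Type')]

# 100 trees instead of the old scikit-learn default of 10
clf = RandomForestClassifier(n_estimators=100)
clf.fit(train[features], train['Cover_Type'])

predictions = clf.predict(test[features])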
In the Bike Sharing Demand competition on Kaggle, the goal is to predict the demand for bike share bikes in Washington DC based on historical usage data. For this regression problem, the evaluation metric is RMSLE.
To beat the total mean count benchmark I tried two strategies, one very simple and the other slightly more sophisticated. The first was to use the per-month mean. The second was to use a rolling mean.
Using pandas, I loaded the train and test data sets into Python. I then downsampled by month using the mean and upsampled by hour, filling in each month with the appropriate mean value.
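The per-month mean approach looked roughly like the sketch below; the file names and the datetime/count column names are assumed from the competition's data layout.

import pandas as pd

# Assumed file and column names ('datetime', 'count') from the Kaggle data
train = pd.read_csv('train.csv', parse_dates=['datetime'], index_col='datetime')
test = pd.read_csv('test.csv', parse_dates=['datetime'], index_col='datetime')

# Downsample to per-month means, then upsample back to hourly so every
# hour in a month carries that month's mean count
monthly_mean = train['count'].resample('MS').mean()
hourly_mean = monthly_mean.resample('H').ffill()

# Use the per-month mean as the prediction for each test timestamp
predictions = hourly_mean.reindex(test.index, method='ffill')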