22 Aug 2016 - This is a post on my blog.
I recently released slots, a Python library that implements multi-armed bandit strategies. If that sounds like something that won't put you to sleep, then please pip install slots and read on.
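If you haven't run into bandit algorithms before, here's a bare-bones epsilon-greedy sketch in plain Python to show the explore/exploit idea. This is not the slots API, and the payout rates are made up.

import random

# Not the slots API: a minimal epsilon-greedy bandit with made-up payout
# rates, just to illustrate exploring versus exploiting.
true_payouts = [0.1, 0.3, 0.8]
counts = [0, 0, 0]
rewards = [0.0, 0.0, 0.0]
epsilon = 0.1

for _ in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(len(true_payouts))  # explore a random arm
    else:
        # exploit the arm with the best observed payout rate so far
        arm = max(range(len(true_payouts)),
                  key=lambda i: rewards[i] / counts[i] if counts[i] else 0.0)
    payout = 1.0 if random.random() < true_payouts[arm] else 0.0
    counts[arm] += 1
    rewards[arm] += payout

print(counts)  # most pulls should end up on the highest-payout arm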
29 Feb 2016 - This is a post on my blog.
October 2015 marked my first year of working as a data scientist on a remote team at Zoomer (we help restaurants deliver food, fast). I have always thought that remote work was an obvious choice given the penetration of the internet into all aspects of our lives. In this post I'm going to talk about some of the experiences I've had and lessons learned working on a remote data science team.
Zoomer's main offices are in Philadelphia and San Francisco, but, by the nature of our business, we are a fundamentally distributed company. While I live in San Francisco and work out of the SF office a few days a week, I spend most of the time working from my apartment, collaborating with my team remotely.
Remote data science is not fundamentally different from working on other remote teams, but it has its own set of challenges. Done right, remote data science...
17 Jan 2016 - This is a post on my blog.
Note: I wrote this short post a few years ago and recently found it floating around. It seems more obvious now (and probably trite), but I thought it was still worth posting on the old 'net.
# -*- coding: utf-8 -*-
import datetime

from numpy import asarray, ceil
import pandas
import rpy2.robjects as robjects


def stl(data, ns, np=None, nt=None, nl=None, isdeg=0, itdeg=1, ildeg=1,
        nsjump=None, ntjump=None, nljump=None, ni=2, no=0, fulloutput=False):
    """Wrap R's stl() (seasonal-trend decomposition by loess) via rpy2."""
17 Sep 2014 - This is a post on my blog.
MapReduce is a powerful algorithm for processing large sets of data in a distributed, parallel manner. It has proven very popular for many data processing tasks, particularly using the open source Hadoop implementation.
The most basic idea powering MapReduce is to break large data sets into smaller chunks, which are then processed separately (in parallel). The results of the chunk processing are then collected and combined into a final result.
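As a toy illustration of the pattern in plain Python (not Hadoop), here's a word count where each chunk is mapped to partial counts independently and the partial results are then reduced into one.

from collections import Counter
from functools import reduce

# Toy word count in the map/reduce style: process each chunk on its own
# (the map step), then merge the partial counts (the reduce step).
chunks = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

mapped = [Counter(chunk.split()) for chunk in chunks]  # map step
total = reduce(lambda a, b: a + b, mapped, Counter())  # reduce step

print(total.most_common(3))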
This is some matplotlib scratch code to make a pretty boxplot as seen here.
import matplotlib.pyplot as plt

# Data is external: ra, tar, tas, ta, gar, gas, pa
bp = plt.boxplot([ra, tar + tas, ta, gar + gas, pa],
                 widths=0.2, sym='', patch_artist=True)
plt.setp(bp['caps'], color='blue', alpha=1)
plt.setp(bp['whiskers'], color='blue', alpha=1)
For the Forest Cover Type Prediction competition on Kaggle, the goal is to predict the predominant type of trees in a given section of forest. The score is based on average classification accuracy for the 7 different tree cover classes.
To beat the all fir/spruce benchmark I obviously tried a random forest. Using the default settings of scikit-learn's RandomForestClassifier, I was able to beat the benchmark with an accuracy score of 0.72718 on the competition leaderboard. By using 100 estimators (versus the default of 10), I was able to raise that accuracy score up to 0.75455.
Using pandas, I loaded the train and test data sets into Python.
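Roughly, the pipeline looked like the sketch below; the file names and the Id/Cover_Type column names are assumed from the competition's data layout.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumed file and column names ('Id', 'Cover_Type') from the Kaggle data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

features = [c for c in train.columns if c not in ('Id', 'Cover_Type')]

# 100 trees instead of the old scikit-learn default of 10
clf = RandomForestClassifier(n_estimators=100)
clf.fit(train[features], train['Cover_Type'])

predictions = clf.predict(test[features])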
In the Bike Sharing Demand competition on Kaggle, the goal is to predict the demand for bike share bikes in Washington DC based on historical usage data. For this regression problem, the evaluation metric is RMSLE.
To beat the total mean count benchmark I tried two strategies, one very simple and the other slightly more sophisticated. The first was to use the per-month mean. The second was to use a rolling mean.
Using pandas, I loaded the train and test data sets into Python. I then downsampled by month using the mean and upsampled by hour, filling in each month with the appropriate mean value.
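The per-month mean approach looked roughly like the sketch below; the file names and the datetime/count column names are assumed from the competition's data layout.

import pandas as pd

# Assumed file and column names ('datetime', 'count') from the Kaggle data
train = pd.read_csv('train.csv', parse_dates=['datetime'], index_col='datetime')
test = pd.read_csv('test.csv', parse_dates=['datetime'], index_col='datetime')

# Downsample to per-month means, then upsample back to hourly so every
# hour in a month carries that month's mean count
monthly_mean = train['count'].resample('MS').mean()
hourly_mean = monthly_mean.resample('H').ffill()

# Use the per-month mean as the prediction for each test timestamp
predictions = hourly_mean.reindex(test.index, method='ffill')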