@roycoding
roycoding / Intro to Neural Networks.ipynb
Created November 16, 2016 21:51
Neural Network in Python 3
roycoding / slots.md
Last active July 1, 2022 15:29
slots - A multi-armed bandit library in Python

Multi-armed banditry in Python with slots

Roy Keyes

22 Aug 2016 - This is a post on my blog.

I recently released slots, a Python library that implements multi-armed bandit strategies. If that sounds like something that won't put you to sleep, then please pip install slots and read on.

Some one-armed bandits

Multi-armed bandits
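
To make the idea of a bandit strategy concrete, here is a minimal epsilon-greedy sketch in plain Python. It does not use the slots API; the function name, the payout probabilities, and the default parameters are all assumptions chosen for illustration.

```python
import random


def epsilon_greedy(payout_probs, trials=5000, epsilon=0.1, seed=0):
    """Play a simulated Bernoulli bandit with an epsilon-greedy strategy.

    payout_probs holds hypothetical win probabilities, one per arm.
    Returns (pulls per arm, estimated payout per arm).
    """
    rng = random.Random(seed)
    n_arms = len(payout_probs)
    pulls = [0] * n_arms
    wins = [0] * n_arms

    def estimate(arm):
        # Observed win rate so far; 0.0 for an arm that was never pulled.
        return wins[arm] / pulls[arm] if pulls[arm] else 0.0

    for _ in range(trials):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=estimate)  # exploit: best arm so far
        pulls[arm] += 1
        wins[arm] += rng.random() < payout_probs[arm]  # simulated payout

    return pulls, [estimate(arm) for arm in range(n_arms)]


pulls, estimates = epsilon_greedy([0.1, 0.5, 0.8])
```

With a small epsilon, most pulls should end up on the arm with the highest payout while the other arms still get sampled occasionally.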

roycoding / remote_ds.md
Last active June 19, 2017 18:53
A year of working remotely

A year of working remotely

Roy Keyes

29 Feb 2016 - This is a post on my blog.

October 2015 marked my first year of working as a data scientist on a remote team at Zoomer (we help restaurants deliver food, fast). I have always thought that remote work was an obvious choice given the penetration of the internet into all aspects of our lives. In this post I'm going to talk about some of the experiences I've had and lessons learned working on a remote data science team.

Zoomer's main offices are in Philadelphia and San Francisco, but, by the nature of our business, we are a fundamentally distributed company. While I live in San Francisco and work out of the SF office a few days a week, I spend most of the time working from my apartment, collaborating with my team remotely.

Remote data science is not fundamentally different from working on other remote teams, but it has its own set of challenges and difficulties. Done right, remote data s

roycoding / 4_core_attributes.md
Last active June 19, 2017 18:54
4 Core Attributes of a data scientist

4 Core attributes of a data scientist

Roy Keyes

17 Jan 2016 - This is a post on my blog.

Note: I wrote this short post a few years ago and recently found it floating around. It seems more obvious now (and probably trite), but I thought it was still worth posting on the old 'net.

  1. Thinking statistically. Things tend to come in distributions. Understanding this makes a huge difference in how (data) problems are approached. See #2.
  2. Strong grasp of the fundamentals. By having a strong base of math, stats, experimentation, and computer science, the job of the data scientist is much easier. For a specific problem, the fundamentals may not be enough, but they will get you very far.
roycoding / r_stl.py
Last active August 29, 2015 14:10 — forked from andreas-h/r_stl.py
# -*- coding: utf-8 -*-
import datetime
from numpy import asarray, ceil
import pandas
import rpy2.robjects as robjects
def stl(data, ns, np=None, nt=None, nl=None, isdeg=0, itdeg=1, ildeg=1,
        nsjump=None, ntjump=None, nljump=None, ni=2, no=0, fulloutput=False):
roycoding / art.ipynb
Last active August 29, 2015 14:06
IPython Notebook for computer generated art.
roycoding / mr_patterns.md
Last active May 2, 2018 17:29
MapReduce Patterns

MapReduce Patterns

Roy Keyes

17 Sep 2014 - This is a post on my blog.

MapReduce is a powerful programming model for processing large sets of data in a distributed, parallel manner. It has proven very popular for many data processing tasks, particularly using the open source Hadoop implementation.

MapReduce basics

The most basic idea powering MapReduce is to break large data sets into smaller chunks, which are then processed separately (in parallel). The results of the chunk processing are then collected.
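
As a rough, single-machine sketch of that idea, the word count below splits the input into chunks, maps each chunk to a partial result, and reduces the partial results into one total. The helper names and sample lines are assumptions for illustration; a real Hadoop job would distribute the chunks across machines.

```python
from collections import Counter
from functools import reduce


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_chunk(lines):
    """Map step: count words within one chunk, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial word counts."""
    return left + right


# Made-up input data.
lines = [
    "the quick brown fox",
    "jumps over the lazy dog",
    "the dog barks",
]

# Each chunk is processed on its own, so this step could run in parallel.
partials = [map_chunk(chunk) for chunk in chunked(lines, 2)]
totals = reduce(reduce_counts, partials, Counter())
print(totals["the"], totals["dog"])  # prints "3 2"
```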

MapReduce

roycoding / boxplots.md
Last active August 29, 2015 14:06
Salary box plots using matplotlib

This is some matplotlib scratch code to make a pretty boxplot as seen here.

import matplotlib.pyplot as plt

# Data is external: ra, tar, tas, ta, gar, gas, pa

bp=plt.boxplot([ra,tar+tas,ta,gar+gas,pa],widths=0.2,sym='',patch_artist=True)
plt.setp(bp['caps'],color='blue',alpha=1)
plt.setp(bp['whiskers'],color='blue',alpha=1)
roycoding / forest.md
Last active January 23, 2018 08:05
Beat the Benchmark: Forest Cover Type Prediction

Beating the Forest Cover Type Prediction benchmark

Day 4 of the Beat 5 Kaggle Benchmarks in 5 Days challenge

For the Forest Cover Type Prediction competition on Kaggle, the goal is to predict the predominant type of trees in a given section of forest. The score is based on average classification accuracy for the 7 different tree cover classes.

To beat the all fir/spruce benchmark I obviously tried a random forest. Using the default settings of scikit-learn's RandomForestClassifier, I was able to beat the benchmark with an accuracy score of 0.72718 on the competition leaderboard. By using 100 estimators (versus the default of 10), I was able to raise that accuracy score to 0.75455.
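
For context, a benchmark like "all fir/spruce" amounts to predicting the same class for every row and scoring it by classification accuracy. The plain-Python sketch below shows that comparison; the labels are made up for illustration and the helper names are assumptions, not the competition's scoring code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def constant_baseline(label, n_rows):
    """Benchmark-style prediction: the same label for every row."""
    return [label] * n_rows


# Made-up labels, not the Kaggle data.
y_test = ["spruce/fir", "spruce/fir", "lodgepole pine", "aspen"]

baseline_pred = constant_baseline("spruce/fir", len(y_test))
baseline_acc = accuracy(y_test, baseline_pred)
print(baseline_acc)  # prints 0.5
```

A model beats the benchmark when its accuracy on the same rows is higher than the constant prediction's.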

Random Forest Cover Types

Using pandas I loaded the train and test data sets into Python.

roycoding / beat-bike.md
Last active June 27, 2017 18:35
Beat the Benchmark: Bike Sharing Demand

Beating the Bike Sharing Demand benchmark

Day 3 of the Beat 5 Kaggle Benchmarks in 5 Days challenge

In the Bike Sharing Demand competition on Kaggle, the goal is to predict the demand for bike share bikes in Washington DC based on historical usage data. For this regression problem, the evaluation metric is RMSLE.
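
RMSLE (root mean squared logarithmic error) is the square root of the mean squared difference between log(prediction + 1) and log(actual + 1). A minimal sketch in plain Python follows; the function name and the sample values are assumptions for illustration.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error for non-negative values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2  # log1p(x) == log(x + 1)
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Made-up hourly counts.
perfect = rmsle([16, 40, 32], [16, 40, 32])
off = rmsle([16, 40, 32], [20, 30, 32])
```

Because the error is taken on a log scale, it penalizes relative differences rather than absolute ones, which suits counts that range from a handful to hundreds per hour.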

To beat the total mean count benchmark I tried two strategies, one very simple and another slightly more sophisticated. The first strategy was to use the per-month mean. The second was to use a rolling mean.

Per-month count means

Using pandas I loaded the train and test data sets into Python. I then downsampled by month using the mean and upsampled by hour, filling in each month with the appropriate mean value.
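
The per-month approach can be sketched without pandas as a group-by on (year, month) followed by a lookup for each hour to predict. This is a standard-library stand-in for the resampling described above, not the original code; the function names and sample counts are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime


def monthly_means(records):
    """Average the counts per (year, month) from (timestamp, count) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of counts, rows]
    for timestamp, count in records:
        key = (timestamp.year, timestamp.month)
        totals[key][0] += count
        totals[key][1] += 1
    return {key: total / n for key, (total, n) in totals.items()}


def predict(timestamps, means):
    """Fill each hour with the mean of the month it falls in."""
    return [means[(ts.year, ts.month)] for ts in timestamps]


# Made-up training rows: (hourly timestamp, rental count).
train = [
    (datetime(2011, 1, 1, 0), 16),
    (datetime(2011, 1, 1, 1), 40),
    (datetime(2011, 2, 1, 0), 10),
    (datetime(2011, 2, 1, 1), 30),
]
test_hours = [datetime(2011, 1, 20, 5), datetime(2011, 2, 25, 18)]

means = monthly_means(train)
predictions = predict(test_hours, means)
print(predictions)  # prints [28.0, 20.0]
```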