hwu dawu76

Useful Pandas Snippets

A personal diary of DataFrame munging over the years.

Data Types and Conversion

Convert Series datatype to numeric (will error if column has non-numeric values)
(h/t @makmanalp)
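A minimal sketch of that conversion, assuming `pd.to_numeric` is the call the note refers to (the source does not show the snippet itself). The two Series and their values are made up for illustration. Strict conversion raises on the non-numeric entry, while `errors="coerce"` replaces it with NaN.

```python
import pandas as pd

# Hypothetical example data: numbers stored as strings.
clean = pd.Series(["1", "2.5", "3"])
dirty = pd.Series(["1", "two", "3"])

# Strict conversion: succeeds only when every value parses as a number.
clean_numeric = pd.to_numeric(clean)

# The same call raises ValueError when a value cannot be parsed.
try:
    pd.to_numeric(dirty)
    strict_failed = False
except ValueError:
    strict_failed = True

# errors="coerce" turns unparseable values into NaN instead of raising.
coerced = pd.to_numeric(dirty, errors="coerce")
```

Coercing is handy for a first pass over messy data, but it hides bad values silently, so counting the NaNs with `coerced.isna().sum()` afterwards is a reasonable check.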

If you were to give recommendations to your "little brother/sister" on things that they need to do to become a data scientist, what would those things be?

I think the "Data Science Venn Diagram" (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram) is a great place to start. You need three things to be a good data scientist:

Statistical knowledge
Programming/hacking skills
Domain expertise

	library(tm)

	recipes = readLines('recipes combined.tsv')

	# Once the file is read into R, replace the tab (\t) characters
	# with spaces so that the text is more acceptable to the tm package

	recipes.new = gsub('\t', ' ', recipes)

	recipes.corpus = Corpus(VectorSource(recipes.new))
	recipes.dtm = DocumentTermMatrix(recipes.corpus)

	from matplotlib import use

	from pylab import *
	from scipy.stats import beta, norm, uniform
	from random import random
	from numpy import *
	import numpy as np
	import os

	# Input data

	#!/bin/bash
	#
	# DESCRIPTION:
	#
	# Set the bash prompt according to:
	# * the active virtualenv
	# * the branch of the current git/mercurial repository
	# * the return value of the previous command
	# * the fact you just came from Windows and are used to having newlines in
	# your prompts.

	library(dplyr)
	library(tidyr)
	library(magrittr)
	library(ggplot2)
	"http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NYNEWYOR.txt" %>%
	read.table() %>% data.frame %>% tbl_df -> data
	names(data) <- c("month", "day", "year", "temp")

	data %>%
	group_by(year, month) %>%

	/**
	* To get started:
	* git clone https://github.com/twitter/algebird
	* cd algebird
	* ./sbt algebird-core/console
	*/

	/**
	* Let's get some data. Here is Alice in Wonderland, line by line
	*/

	# using dplyr functions in non-interactive mode
	# examples

	library(plyr)
	library(dplyr)

	d1 = data_frame(x = seq(1,20),y = rep(1:10,2),z = rep(1:5,4))
	head(d1)

	#### single table verbs ####

	library(jsonlite)

	cp = fromJSON(txt = "Cell Phone Data.txt", simplifyDataFrame = TRUE)

	num.atts = c(4,9,11,12,13,14,15,16,18,22)

	cp[,num.atts] = sapply(cp[,num.atts], function (x) as.numeric(x))
	cp$aspect.ratio = cp$att_pixels_y / cp$att_pixels_x
	cp$isSmartPhone = ifelse(grepl("smart|iphone|blackberry", cp$name, ignore.case=TRUE) | cp$att_screen_size >= 4, "Yes", "No")

	#!/usr/bin/env python

	"""Sample Google Cloud Storage API client.

	Based on <https://cloud.google.com/storage/docs/json_api/v1/json-api-python-samples>,
	but removed parts that are not relevant to the Cloud Storage API.

	Assumes the use of a service account, whose secrets are stored in
	$HOME/google-api-secrets.json"""
