A personal diary of DataFrame munging over the years.
Convert Series datatype to numeric (will error if column has non-numeric values)
(h/t @makmanalp)
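A minimal sketch of that conversion, assuming pandas and its `pd.to_numeric` function. The sample values are hypothetical and not taken from the diary. It shows the strict default, which raises on a non-numeric value, next to the `errors="coerce"` option, which substitutes NaN instead.

```python
import pandas as pd

# Hypothetical sample data, not from the original diary.
clean = pd.Series(["1", "2", "3.5"])
dirty = pd.Series(["1", "two", "3.5"])

# Default errors="raise": conversion succeeds only if every value parses.
clean_numeric = pd.to_numeric(clean)

# A non-numeric value makes the strict conversion raise ValueError.
try:
    pd.to_numeric(dirty)
    strict_failed = False
except ValueError:
    strict_failed = True

# errors="coerce" turns unparseable values into NaN instead of raising.
coerced = pd.to_numeric(dirty, errors="coerce")
```

Coercing is handy for a first pass over messy data, but it hides bad values, so counting the resulting NaNs with `coerced.isna().sum()` is a reasonable follow-up check.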
recipes = readLines('recipes combined.tsv')
# Once I read it into R, I have to get rid of the \t
# characters so that it's more acceptable to the tm package
recipes.new = apply(as.matrix(recipes), 1, function (x) gsub('\t',' ', x))
recipes.corpus = Corpus(VectorSource(recipes.new))
recipes.dtm = DocumentTermMatrix(recipes.corpus)
from matplotlib import use
from pylab import *
from scipy.stats import beta, norm, uniform
from random import random
from numpy import *
import numpy as np
import os
# Input data
#!/bin/bash
#
# DESCRIPTION:
#
# Set the bash prompt according to:
# * the active virtualenv
# * the branch of the current git/mercurial repository
# * the return value of the previous command
# * the fact you just came from Windows and are used to having newlines in
# your prompts.
library(dplyr)
library(tidyr)
library(magrittr)
library(ggplot2)
"http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NYNEWYOR.txt" %>%
read.table() %>% data.frame %>% tbl_df -> data
names(data) <- c("month", "day", "year", "temp")
data %>%
group_by(year, month) %>%
/**
* To get started:
* git clone https://github.com/twitter/algebird
* cd algebird
* ./sbt algebird-core/console
*/
/**
* Let's get some data. Here is Alice in Wonderland, line by line
*/
If you were to give recommendations to your "little brother/sister" on things that they need to do to become a data scientist, what would those things be?
I think the "Data Science Venn Diagram" (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram) is a great place to start. You need three things to be a good data scientist:
# using dplyr functions in non-interactive mode
# examples
library(plyr)
library(dplyr)
d1 = data_frame(x = seq(1,20),y = rep(1:10,2),z = rep(1:5,4))
head(d1)
#### single table verbs ####
library(jsonlite)
cp = fromJSON(txt = "Cell Phone Data.txt", simplifyDataFrame = TRUE)
num.atts = c(4,9,11,12,13,14,15,16,18,22)
cp[,num.atts] = sapply(cp[,num.atts], function (x) as.numeric(x))
cp$aspect.ratio = cp$att_pixels_y / cp$att_pixels_x
cp$isSmartPhone = ifelse(grepl("smart|iphone|blackberry", cp$name, ignore.case=TRUE) | cp$att_screen_size >= 4, "Yes", "No")
#!/usr/bin/env python
"""Sample Google Cloud Storage API client.
Based on <https://cloud.google.com/storage/docs/json_api/v1/json-api-python-samples>,
but removed parts that are not relevant to the Cloud Storage API.
Assumes the use of a service account, whose secrets are stored in
$HOME/google-api-secrets.json"""