Vamshi Chollati vchollati

Public Data Sources

Legislative

United States on GitHub
- They have a great US module that has state abbreviations, names, etc.
- O'Reilly article about the project
The State Decoded
- definition parsing
Legislative Documents in XML at the United States House of Representatives
US Government Web Services and XML Data Sources

If you were to give recommendations to your "little brother/sister" on things that they need to do to become a data scientist, what would those things be?

I think the "Data Science Venn Diagram" (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram) is a great place to start. You need three things to be a good data scientist:

Statistical knowledge
Programming/hacking skills
Domain expertise

Statistical knowledge

Useful Linux commands

Check drive speed

Read

sudo hdparm -t /dev/sda

	library(mgcv)
	library(ggplot2)
	library(dplyr)
	library(XML)
	library(weatherData)

	us.airports.url <- 'http://www.world-airport-codes.com/us-top-40-airports.html'

	us.airports <- readHTMLTable(us.airports.url)[[1]] %>%
	filter(!is.na(IATA)) %>%

	"""
	The MIT License (MIT)

	Copyright (c) 2015 Alec Radford

	Permission is hereby granted, free of charge, to any person obtaining a copy
	of this software and associated documentation files (the "Software"), to deal
	in the Software without restriction, including without limitation the rights
	to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
	copies of the Software, and to permit persons to whom the Software is

	/**
	* To get started:
	* git clone https://github.com/twitter/algebird
	* cd algebird
	* ./sbt algebird-core/console
	*/

	/**
	* Let's get some data. Here is Alice in Wonderland, line by line
	*/

	# When you're sure of the format, it's much quicker to explicitly convert your dates than use `parse_dates`
	# Makes sense; was just surprised by the time difference.
	import pandas as pd
	from datetime import datetime
	# Parse one timestamp string in the known month/day/year hour:minute layout.
	def to_datetime(d):
	    return datetime.strptime(d, '%m/%d/%Y %H:%M')

	%time trips = pd.read_csv('data/divvy/Divvy_Trips_2013.csv', parse_dates=['starttime', 'stoptime'])
	# CPU times: user 1min 29s, sys: 331 ms, total: 1min 29s
	# Wall time: 1min 30s
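The snippet above is cut off before it shows the explicit conversion, so here is a minimal sketch of what that could look like. It is not the author's code: the two-row sample, its values, and the `trip_id` column are made up for illustration, and only the `starttime`/`stoptime` column names and the `'%m/%d/%Y %H:%M'` format come from the snippet. It shows two ways to apply a known format, either per cell through `read_csv`'s `converters` argument or per column with `pd.to_datetime(..., format=...)`.

```python
import io
from datetime import datetime

import pandas as pd

# Hypothetical sample in the '%m/%d/%Y %H:%M' layout used above.
# Column names mirror the snippet; the rows are invented for illustration.
SAMPLE_CSV = (
    "trip_id,starttime,stoptime\n"
    "1,06/27/2013 12:11,06/27/2013 12:16\n"
    "2,12/31/2013 23:58,01/01/2014 00:07\n"
)

DATE_FORMAT = "%m/%d/%Y %H:%M"


def to_datetime(value):
    """Parse a single timestamp string with the known, fixed format."""
    return datetime.strptime(value, DATE_FORMAT)


# Option 1: convert each cell while reading, via the converters argument.
trips_a = pd.read_csv(
    io.StringIO(SAMPLE_CSV),
    converters={"starttime": to_datetime, "stoptime": to_datetime},
)

# Option 2: read the columns as plain strings, then convert each column
# in one call, telling pandas the exact format so it does not have to guess.
trips_b = pd.read_csv(io.StringIO(SAMPLE_CSV))
for column in ("starttime", "stoptime"):
    trips_b[column] = pd.to_datetime(trips_b[column], format=DATE_FORMAT)

# With real datetime columns, trip durations are a simple subtraction.
durations = trips_b["stoptime"] - trips_b["starttime"]
print(durations)
```

Which option is faster on the full Divvy file is something to measure with `%time`, as the snippet does for `parse_dates`; this sketch only shows the mechanics.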

	import numpy as np
	import pandas as pd
	import datetime
	import urllib

	from bokeh.plotting import *
	from bokeh.models import HoverTool
	from collections import OrderedDict

	## Read in our data. We've aggregated it by date already, so we don't need to worry about paging