Mehmet Ali "Mali" Akmanalp makmanalp

Subject and Motivation

With the open data and reproducible research movements, it’s becoming more and more common for researchers and analysts to make datasets public. But just as putting your code on GitHub as is doesn’t make it a good open source project, putting your zipped CSV files on a website doesn’t make it a good open dataset. For example, it’s not uncommon to have to spend half the length of a project just cleaning a dataset.

This talk is about pitfalls commonly encountered when working with unfamiliar datasets, and how to help your audience avoid such pitfalls when you publish your own datasets. This is a “best practices” talk, but along with strategies for dealing with each issue, it will mention relevant Python libraries, tools, and techniques that might help tackle each problem.

Outline

When I run this:

    import numpy as np
    import pandas as pd

    def fill_parents(row):
        # Print every way of testing whether parent_id is missing.
        print(row.parent_id,
              type(row.parent_id),
              row.parent_id is np.nan,
              row.parent_id == np.nan,
              row.parent_id is None,
              row.parent_id == None,
              pd.isnull(row.parent_id))
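
The checks printed above can disagree for one and the same missing value. A minimal sketch of why, assuming `parent_id` is a float column; the `describe_missing` helper and the sample frame are hypothetical and not part of the original code:

```python
import numpy as np
import pandas as pd


def describe_missing(value):
    """Return how each missing-value check treats `value`."""
    return {
        # Identity check: depends on which object this is, not on its value.
        "is np.nan": value is np.nan,
        # Equality check: NaN never compares equal, not even to itself.
        "== np.nan": bool(value == np.nan),
        # None check: a NaN stored in a float column is not None.
        "is None": value is None,
        # pandas' own missing-value test.
        "pd.isnull": bool(pd.isnull(value)),
    }


# Hypothetical sample data: the second row has no parent.
df = pd.DataFrame({"id": [1, 2], "parent_id": [1.0, np.nan]})
checks = describe_missing(df["parent_id"].iloc[1])
print(checks)
```

In this sketch `pd.isnull` is the check that reports the value as missing; the equality test fails because NaN never compares equal to anything, and the value is not `None`.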

Basically, it comes down to three things:

Analysis paralysis sucks

You can think and read for years about how best to do something, but that is useless if you're doing nothing to improve your current state. Pick a small thing and go with it. Ignore the details of everyone's arguments about whether X or Y is better. Just pick something reasonable and go with it. Example: We could argue about the details of fermented foods. Just eating more fermented foods is not going to make someone healthier. Just use the general gist of the

https://github.com/vkaynig/IACS_ComputeFest_DeepLearning
https://news.ycombinator.com/item?id=10901980
https://www.quora.com/What-is-regularization-in-machine-learning
https://www.quora.com/Why-does-deep-learning-architectures-use-only-non-linear-activation-function-in-the-hidden-layers
http://www.cs.toronto.edu/~fritz/absps/reluICML.pdf
https://github.com/twitter/torch-autograd
http://www.kdnuggets.com/2015/12/deep-learning-outgrows-bag-words-recurrent-neural-networks.html
http://www.nvidia.com/object/jetson-tx1-dev-kit.html
http://petewarden.com/2015/05/23/why-are-eight-bits-enough-for-deep-neural-networks/
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

	How do modern websites work? I give you a cross-sectional tour of the web,
	from the second you hit the enter key to when you see Google. How does a
	website really work behind the scenes? What do databases, caches, servers,
	and content delivery networks do? What's JavaScript? What is frontend versus
	backend? How does a massive website like Facebook or Twitter work? What does
	"the cloud" mean? Why do companies care, other than the buzzword factor? What
	is "big data", really?

	March 31st, 1 to 2 pm
	Harvard Kennedy School, Rubinstein Building 4th floor, Perkins Room / R-415

	<html>
	<head>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
	<style type="text/css">
	.ef0,.f0 { color: #000000; } .eb0,.b0 { background-color: #000000; }
	.ef1,.f1 { color: #AA0000; } .eb1,.b1 { background-color: #AA0000; }
	.ef2,.f2 { color: #00AA00; } .eb2,.b2 { background-color: #00AA00; }
	.ef3,.f3 { color: #AA5500; } .eb3,.b3 { background-color: #AA5500; }
	.ef4,.f4 { color: #0000AA; } .eb4,.b4 { background-color: #0000AA; }
	.ef5,.f5 { color: #AA00AA; } .eb5,.b5 { background-color: #AA00AA; }

	- name: run git status
	  local_action: shell git status | grep "up-to-date with 'origin/master'"
	  ignore_errors: True
	  register: git_up_to_date
	- name: fail if git not up to date
	  local_action: fail msg="Please pull / push the latest changes to the playbooks repo before deploying."
	  when: git_up_to_date | failed

	- name: run git diff
	  local_action: command git status -s

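	import json

	import pandas as pd

	# Load the JSON document, then flatten each record's "links" list into rows,
	# carrying the parent "author" and "title" fields along as columns.
	with open("./out.json") as f:
	    data = json.load(f)
	table = pd.json_normalize(data, "links", ["author", "title"])
	print(table)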
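
To see what the flattening above produces without needing `out.json`, here is a small sketch on inline data. The records, the `url` field, and the author and title values are made up for illustration; only the `links`, `author`, and `title` keys come from the snippet above.

```python
import pandas as pd

# Hypothetical input: one record with two nested links.
data = [
    {
        "author": "Ada",
        "title": "Notes",
        "links": [{"url": "http://example.com/a"}, {"url": "http://example.com/b"}],
    }
]

# One output row per entry in "links"; "author" and "title" are repeated
# on each row as metadata columns.
table = pd.json_normalize(data, "links", ["author", "title"])
print(table)
```

Each nested link becomes its own row, so a record with two links yields two rows that share the same `author` and `title`.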

	# -*- coding: utf-8 -*-

	import re
	import unittest


	NON_CAPITAL = (
	"De Los",
	"De Las",
	"De La",