Mehmet Ali "Mali" Akmanalp makmanalp

@makmanalp
makmanalp / Outline.md
Last active August 29, 2015 14:17
Clean Data Talk (WIP)

Subject and Motivation

With the open data and reproducible research movements, it’s becoming more and more common for researchers and analysts to make datasets public. But just as putting your code on GitHub as-is doesn’t make it a good open source project, putting your zipped CSV files on a website doesn’t make it a good open dataset. For example, it’s not uncommon to spend half the length of a project just cleaning a dataset.

This talk is about pitfalls commonly encountered when working with unfamiliar datasets, and how to help your audience avoid them when you publish your own datasets. This is a “best practices” talk, but along with strategies for dealing with each issue, it will mention relevant Python libraries, tools, and techniques that might help tackle each problem.

Outline

makmanalp / Blurb
Last active August 29, 2015 14:18
How Websites Work talk
How do modern websites work? I give you a cross-sectional tour of the web -
from the second you hit the Enter key to when you see Google. How does a
website really work behind the scenes? What do databases, caches, servers,
content delivery networks do? What's JavaScript? What is frontend versus
backend? How does a massive website like Facebook or Twitter work? What does
"the cloud" mean? Why do companies care other than the buzzword factor? What is
"big data", really?
March 31st, 1 to 2 pm
Harvard Kennedy School, Rubinstein Building 4th floor, Perkins Room / R-415
makmanalp / gist:a53532d99c33e68a93a5
Created April 13, 2015 20:16
Word-diff of stata 13 vs 14
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<style type="text/css">
.ef0,.f0 { color: #000000; } .eb0,.b0 { background-color: #000000; }
.ef1,.f1 { color: #AA0000; } .eb1,.b1 { background-color: #AA0000; }
.ef2,.f2 { color: #00AA00; } .eb2,.b2 { background-color: #00AA00; }
.ef3,.f3 { color: #AA5500; } .eb3,.b3 { background-color: #AA5500; }
.ef4,.f4 { color: #0000AA; } .eb4,.b4 { background-color: #0000AA; }
.ef5,.f5 { color: #AA00AA; } .eb5,.b5 { background-color: #AA00AA; }
makmanalp / check_git_synced.yml
Created May 12, 2015 13:38
Ansible playbook to include in pre_tasks to complain on deploy if you haven't synced changes
- name: run git status
  local_action: shell git status | grep "up-to-date with 'origin/master'"
  ignore_errors: True
  register: git_up_to_date
- name: fail if git not up to date
  local_action: fail msg="Please pull / push the latest changes to the playbooks repo before deploying."
  when: git_up_to_date | failed
- name: run git diff
  local_action: command git status -s
makmanalp / foo.md
Last active August 29, 2015 14:27
Weird null comparison issue in pandas / python / numpy

When I run this:

    def fill_parents(row):
        print (row.parent_id,
               type(row.parent_id),
               row.parent_id is pd.np.nan,
               row.parent_id == pd.np.nan,
               row.parent_id is None,
               row.parent_id == None,
               pd.isnull(row.parent_id)
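
The snippet above is cut off, so the exact issue it reports is not visible here. As a minimal sketch of one common source of surprise with these comparisons, the example below uses a hypothetical one-column frame (only the name `parent_id` comes from the snippet) and shows why `pd.isnull` is the reliable check: NaN never compares equal to anything, and identity with `np.nan` depends on how the value happens to be boxed.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column name parent_id is taken from the snippet.
df = pd.DataFrame({"parent_id": [1.0, np.nan]})

# Pulling a scalar out of a float64 column yields a numpy.float64 object,
# which is a different object from the module-level np.nan float.
value = df["parent_id"].iloc[1]

results = {
    "is np.nan": value is np.nan,         # identity check: False here
    "== np.nan": bool(value == np.nan),   # NaN != NaN by IEEE 754: False
    "is None": value is None,             # NaN is not None: False
    "pd.isnull": bool(pd.isnull(value)),  # the intended test: True
}
print(results)
```

In other words, `is` and `==` are not dependable ways to detect missing values; `pd.isnull` (or `pd.isna`) handles `None`, `np.nan`, and NumPy float NaNs uniformly.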
makmanalp / healthy_made_easy.md
Last active August 29, 2015 14:27
Help, I'm an adult, and I don't know how to be healthy

Basically, it comes down to three things:

Analysis paralysis sucks

You can think and read for years about how best to do something, and it's useless if you're actually doing nothing to improve your current state. Pick a small thing and go with it. Ignore the details of everyone's arguments about whether X or Y is better. Just pick something reasonable and go with it. Example: We could argue about the details about fermented foods. Just eating more fermented foods is not going to make someone healthier. Just use the general gist of the

makmanalp / json_to_table.py
Created December 16, 2015 15:59
Converting structured / hierarchical JSON to flat table
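import json
# pandas >= 1.0 exposes this at the top level (formerly pandas.io.json.json_normalize)
from pandas import json_normalize
# Load the hierarchical JSON document
with open("./out.json") as f:
    data = json.load(f)
# One row per entry under "links", carrying the parent's "author" and "title" along
table = json_normalize(data, "links", ["author", "title"])
print(table)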
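
To make the flattening concrete, here is a small sketch with made-up input. The contents of `out.json` are not shown in the gist, so the records, URLs, and the `url` / `text` fields below are assumptions chosen only to match the `"links"`, `"author"`, and `"title"` keys used in the call.

```python
import pandas as pd

# Hypothetical input shaped like what the call expects: each record has
# scalar "author" / "title" fields and a nested list under "links".
data = [
    {
        "author": "alice",
        "title": "First post",
        "links": [
            {"url": "http://a.example/1", "text": "one"},
            {"url": "http://a.example/2", "text": "two"},
        ],
    },
    {
        "author": "bob",
        "title": "Second post",
        "links": [{"url": "http://b.example/1", "text": "three"}],
    },
]

# record_path picks the nested list to explode into rows;
# meta lists the parent fields to repeat on every resulting row.
table = pd.json_normalize(data, record_path="links", meta=["author", "title"])
print(table)
```

Each nested link becomes its own row, and the parent's `author` and `title` are repeated alongside it, which is what turns the hierarchy into a flat table.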
makmanalp / spanish_title_case_fix.py
Last active February 23, 2016 20:17
Spanish title case preposition fixer: de, del, de la, de las, de los
# -*- coding: utf-8 -*-
import re
import unittest
NON_CAPITAL = (
"De Los",
"De Las",
"De La",