Matt Button BRMatt

I want to spend a week (during Hacker School alumni reunion week) better understanding performance (probably of things in the Hadoop ecosystem) on a few different dataset sizes (8GB, 100GB, 1TB). I have $1000 of AWS credit that I can spend on this (yay!)

Some things I want:

Get a much better grasp on the performance of in-memory operations (put 8GB of data into memory and be done) vs running a distributed map/reduce job.
Understand what goes into the performance (how much time is spent copying data? Sending data over the network? On CPU?)
Learn something about the tradeoffs.

I'd love suggestions for experiments to run and setups to use. At work I've been using HDFS / Impala / Scalding, so my current thought is to spend time looking in depth at running a map/reduce with Scalding vs an Impala query vs running a non-distributed job in memory, because I already know about those things. But I'm open to other ideas!

ANNOUNCEMENT

I have moved this over to the Tech Interview Cheat Sheet Repo, where it has been expanded and even has code challenges you can run and practice against!


MEDIUM-TERM QUESTIONS

Define a high level mission statement

What are we focusing on, and what do we not plan on teaching?
Why we are not doing Rails and why you should go back to basics: blog post? Disclaimer? (e.g. wordpress/tools/startup-type helpers vs learning how to program from the ground up)

Coach inductions
Explicit paths/tracks through our training content
Development environment surgeries
Define roles such as course coordinator/tutorial coordinator/tutorial owner

	WITH table_scans as (
	    SELECT relid,
	        tables.idx_scan + tables.seq_scan as all_scans,
	        ( tables.n_tup_ins + tables.n_tup_upd + tables.n_tup_del ) as writes,
	        pg_relation_size(relid) as table_size
	    FROM pg_stat_user_tables as tables
	),
	all_writes as (
	    SELECT sum(writes) as total_writes
	    FROM table_scans

	This tool is used to compare microbenchmarks across two versions of code. It's
	paranoid about nulling out timing error, so the numbers should be meaningful.
	It runs the benchmarks many times, scaling the iterations up if the benchmark
	is extremely short, and it nulls out its own timing overhead while doing so. It
	reports results graphically with a text interface in the terminal.

	You first run it with --record, which generates a JSON dotfile with runtimes
	for each of your benchmarks. Then you change the code and run again with
	--compare, which re-runs and generates comparison plots between your recorded
	and current times. In the example output, I did a --record on the master

	(comment ; Fun with transducers, v2

	;; Still haven't found a brief + approachable overview of Clojure 1.7's new
	;; transducers in the particular way I would have preferred myself - so here goes:

	;;;; Definitions
	;; Looking at the `reduce` docstring, we can define a 'reducing-fn' as:
	(fn reducing-fn ([]) ([accumulation next-input])) -> new-accumulation
	;; (The `[]` arity is actually optional; it's only used when calling
	;; `reduce` w/o an init-accumulator).
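
For readers more at home outside Clojure, the two-argument shape of a reducing function defined above can be sketched in Python. This is an analogy under stated assumptions, not part of the original notes: `sum_step` and `my_reduce` are hypothetical names, and the zero-argument arity is not modelled because the example always passes an explicit initial accumulation.

```python
from functools import reduce


def sum_step(accumulation, next_input):
    """A reducing function: (accumulation, next input) -> new accumulation."""
    return accumulation + next_input


def my_reduce(reducing_fn, init, inputs):
    """Hand-rolled reduce: thread the accumulation through every input."""
    accumulation = init
    for next_input in inputs:
        accumulation = reducing_fn(accumulation, next_input)
    return accumulation


# The hand-rolled loop and the standard library agree on the result.
total = my_reduce(sum_step, 0, [1, 2, 3, 4])
same = reduce(sum_step, [1, 2, 3, 4], 0)
```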

	#!/bin/bash

	cmd-hello() {
	  declare desc="Displays a friendly hello"
	  declare firstname="$1" lastname="$2"
	  echo "Hello, $firstname $lastname."
	}

	cmd-help() {
	  declare desc="Shows help information for a command"

	* {
	  font-size: 12pt;
	  font-family: monospace;
	  font-weight: normal;
	  font-style: normal;
	  text-decoration: none;
	  color: black;
	  cursor: default;
	}
