Steve sanp

Currying vs Partials

Some concepts:

Functions have arity: the number of arguments they take [source]

Nullary: 0 args
Unary: 1 arg
Polyadic: many args
- Binary: 2 args
- Ternary: 3 args
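
To make arity concrete, here is a minimal Python sketch. The names `arity`, `add3`, `add_to_1`, and `curried_add3` are made up for illustration. It shows one common way to describe the difference the title points at: partial application fixes some arguments and returns a function of lower arity, while currying turns an n-ary function into a chain of unary ones.

```python
from functools import partial
from inspect import signature


def arity(fn):
    """Return the number of parameters a callable accepts."""
    return len(signature(fn).parameters)


def add3(a, b, c):
    """A ternary function: takes three arguments."""
    return a + b + c


# Partial application: fix the first argument now and get back a
# function that takes the remaining two arguments all at once.
add_to_1 = partial(add3, 1)


def curried_add3(a):
    """Hand-curried add3: a chain of unary functions."""
    def take_b(b):
        def take_c(c):
            return a + b + c
        return take_c
    return take_b
```

Usage: `add_to_1(2, 3)` and `curried_add3(1)(2)(3)` both compute the same sum as `add3(1, 2, 3)`; only the calling shape differs.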

SSH Tunneling

Problem

You want to query a DB and get a result set, but you don't have access to that DB directly from your localhost.

Bad solution A: Cry
Bad solution B: Ask someone who does have access to run your query for you
Bad solution C: ssh into a box that has access, then psql into the DB
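
Going by the section title, the alternative is an SSH tunnel: forward a local port through the box that does have access. A rough sketch of building that command in Python follows. The host names, the local port 5433, and the helper `tunnel_command` are all assumptions for illustration, not details from the original post; `-N` and `-L` are standard OpenSSH options.

```python
import shlex


def tunnel_command(bastion, db_host, db_port=5432, local_port=5433):
    """Build an ssh command forwarding local_port to db_host:db_port via bastion."""
    return [
        "ssh",
        "-N",  # do not run a remote command, only forward ports
        "-L", f"{local_port}:{db_host}:{db_port}",  # local port forward
        bastion,
    ]


# Hypothetical hosts, for illustration only.
cmd = tunnel_command("me@bastion.example.com", "db.internal.example.com")
print(shlex.join(cmd))
```

With a tunnel like this open, a client on your own machine can connect to `localhost` on the forwarded port as if the DB were local.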

Vacuuming

Postgres uses an MVCC (multiversion concurrency control) model (as opposed to table locking)
- When an update happens, Postgres writes a new version of each changed row instead of overwriting the old one
- Whenever you query data, you're seeing a snapshot of the data as it was at a certain point in time.
So: when you run an update that touches every row, it can roughly double the size of the table, because the old row versions stay on disk as dead rows until a vacuum makes their space reusable.
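
The growth described above can be illustrated with a toy model. This is a sketch of the idea only, not how Postgres stores rows: `ToyTable`, `RowVersion`, and their fields are invented for this example, and the `vacuum` here simply assumes no open transaction still needs the old versions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowVersion:
    value: str
    created_by: int                   # id of the transaction that wrote this version
    deleted_by: Optional[int] = None  # id of the transaction that superseded it


class ToyTable:
    def __init__(self, values):
        self.next_txid = 1
        self.versions: List[RowVersion] = [RowVersion(v, created_by=0) for v in values]

    def update_all(self, fn):
        """Apply fn to every live row by appending new versions (no overwrite)."""
        txid = self.next_txid
        self.next_txid += 1
        live = [v for v in self.versions if v.deleted_by is None]
        for old in live:
            old.deleted_by = txid  # old version becomes dead
            self.versions.append(RowVersion(fn(old.value), created_by=txid))

    def live_values(self):
        return [v.value for v in self.versions if v.deleted_by is None]

    def vacuum(self):
        """Drop dead versions (assumes nothing still needs to see them)."""
        self.versions = [v for v in self.versions if v.deleted_by is None]


table = ToyTable(["a", "b", "c"])
table.update_all(str.upper)
size_after_update = len(table.versions)  # three dead versions plus three live ones
table.vacuum()
size_after_vacuum = len(table.versions)  # only the live versions remain
```

One difference worth noting: a plain `VACUUM` in real Postgres generally marks dead space as reusable rather than shrinking the file on disk, so the toy model overstates how much space comes back.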

Joins using Where clause vs on clause

Hive and Postgres handle where vs on clauses differently. Postgres' query planner is smarter: for inner joins, a condition in the where clause and the same condition in the on clause will be handled the same. In Hive, the where clause is more efficient than the on clause.

Stats:

Hive:

On clause: In stage 1, pulls in ~400MM records; takes ~13 minutes to execute

Where clause: In stage 1, pulls in ~60MM records; takes ~5 minutes to execute
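
For reference, here is a small sketch of the two query shapes, assuming the comparison is about where a filter is written in an inner join. It uses SQLite from Python's standard library as a stand-in (not Hive or Postgres), and the `users` and `events` tables are made up. It only shows that both forms return the same rows; it says nothing about performance in either engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, country TEXT);
    CREATE TABLE events (user_id INTEGER, kind TEXT);
    INSERT INTO users VALUES (1, 'US'), (2, 'CA'), (3, 'US');
    INSERT INTO events VALUES (1, 'click'), (2, 'click'), (3, 'view'), (3, 'click');
""")

# Filter written as part of the join condition (on clause).
on_query = """
    SELECT u.id, e.kind
    FROM users u
    JOIN events e ON e.user_id = u.id AND u.country = 'US'
    ORDER BY u.id, e.kind
"""

# Same filter written in the where clause.
where_query = """
    SELECT u.id, e.kind
    FROM users u
    JOIN events e ON e.user_id = u.id
    WHERE u.country = 'US'
    ORDER BY u.id, e.kind
"""

on_rows = conn.execute(on_query).fetchall()
where_rows = conn.execute(where_query).fetchall()
```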

Second Price Auctions

Overview

Second price auctions (2PA) are a type of auction where the highest bidder wins but pays the second highest bid.
This is in contrast to first price auctions (FPA), where the highest bidder pays her own bid.
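
The two pricing rules can be sketched in a few lines of Python. The function `settle` and the bidder names are hypothetical, and the tie-breaking rule is an arbitrary choice made for this example.

```python
def settle(bids, second_price=True):
    """Return (winner, price) for a sealed-bid auction.

    bids: dict mapping bidder name to bid amount; needs at least two bids.
    Ties go to whichever tied bidder appears first in the dict.
    """
    if len(bids) < 2:
        raise ValueError("need at least two bids")
    # Rank bidders from highest bid to lowest (sorted is stable).
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, top_bid = ranked[0]
    runner_up_bid = ranked[1][1]
    # 2PA: winner pays the runner-up's bid. FPA: winner pays her own bid.
    price = runner_up_bid if second_price else top_bid
    return winner, price


bids = {"alice": 10, "bob": 7, "carol": 4}  # hypothetical bidders
```

With these bids, `settle(bids)` has alice winning and paying bob's bid of 7, while `settle(bids, second_price=False)` has her paying her own bid of 10.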

In this talk, I'm going to go over:

Why 2PA work better than FPA

	# Parse JSON data with this one weird trick!

	from pyspark import SparkContext
	from pyspark import SparkConf
	from pyspark.sql import SQLContext
	from pyspark.sql import Row

	# Set up basic spark session
	conf = (SparkConf()
	.setAppName('My App')