Functions have an arity: the number of arguments they take [source]
- Nullary: 0 args
- Unary: 1 arg
- Binary: 2 args
- Ternary: 3 args
- Polyadic: many args
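A minimal sketch of checking arity in Python, using the standard library's `inspect.signature`. The helper names `arity` and `is_variadic` and the sample functions are made up for illustration, and treating "polyadic" as "accepts `*args`" is only one reading of "many args".

```python
import inspect

# Parameter kinds that count as ordinary positional arguments.
_POSITIONAL = (
    inspect.Parameter.POSITIONAL_ONLY,
    inspect.Parameter.POSITIONAL_OR_KEYWORD,
)


def arity(func):
    """Count the positional parameters that func declares."""
    params = inspect.signature(func).parameters.values()
    return sum(1 for p in params if p.kind in _POSITIONAL)


def is_variadic(func):
    """True if func declares *args, i.e. takes an open-ended number of arguments."""
    params = inspect.signature(func).parameters.values()
    return any(p.kind is inspect.Parameter.VAR_POSITIONAL for p in params)


# Hypothetical sample functions, one per arity in the list above.
def nullary():
    return None


def unary(x):
    return x


def binary(x, y):
    return x + y


def ternary(x, y, z):
    return x + y + z


def polyadic(*args):
    return sum(args)


print([arity(f) for f in (nullary, unary, binary, ternary)])
print(is_variadic(polyadic))
```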
# Parse JSON data with this one weird trick!
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark.sql import Row
# Set up basic Spark session
conf = (SparkConf()
        .setAppName('My App'))
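Postgres uses an MVCC (multiversion concurrency control) model (as opposed to table locking).
So: an update does not overwrite a row in place. It writes a new version of the row and leaves the old version behind as a dead tuple until vacuum cleans it up. An update that touches every row can therefore roughly double the size of the table.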
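The effect can be sketched with a toy model in Python. This is not how Postgres stores data: `ToyMvccTable`, `RowVersion`, and their methods are invented for illustration, and real vacuuming is more involved (a plain VACUUM makes the space reusable rather than shrinking the file). It only shows why rewriting every row temporarily doubles the number of stored row versions.

```python
from dataclasses import dataclass


@dataclass
class RowVersion:
    key: int
    value: str
    live: bool = True  # False once a newer version supersedes this one


class ToyMvccTable:
    """Toy append-only table: updates add versions instead of overwriting."""

    def __init__(self):
        self.versions = []

    def insert(self, key, value):
        self.versions.append(RowVersion(key, value))

    def update_all(self, new_value):
        # Snapshot the live versions first, since we append while looping.
        current = [v for v in self.versions if v.live]
        for old in current:
            old.live = False  # old version becomes a dead tuple
            self.versions.append(RowVersion(old.key, new_value))

    def vacuum(self):
        # Drop dead versions (a simplification of what vacuum does).
        self.versions = [v for v in self.versions if v.live]

    def stored(self):
        return len(self.versions)

    def visible(self):
        return sum(1 for v in self.versions if v.live)


table = ToyMvccTable()
for key in range(1000):
    table.insert(key, "old")

table.update_all("new")
print(table.stored(), table.visible())  # 2000 1000

table.vacuum()
print(table.stored())  # 1000
```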
Hive and Postgres handle WHERE vs ON clauses differently. Postgres' query engine is smarter: for an inner join, a condition in the WHERE clause and the same condition in the ON clause are handled the same. In Hive, the WHERE clause is more efficient than the ON clause.
- ON clause: in stage 1, pulls in ~400MM records; takes ~13 minutes to execute
- WHERE clause: in stage 1, pulls in ~60MM records; takes ~5 minutes to execute
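For reference, a sketch of the two query shapes being compared, assuming the difference is where a filter predicate sits in an inner join. It runs against an in-memory SQLite database as a stand-in (not Hive or Postgres), and the `orders` and `customers` tables are hypothetical. It only shows that both forms return the same rows; it says nothing about the timings above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, region TEXT);
    CREATE TABLE customers (id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10, 'east'), (2, 11, 'west'), (3, 10, 'west');
    INSERT INTO customers VALUES (10, 'Ada'), (11, 'Grace');
""")

# Filter placed in the ON clause, next to the join condition.
filter_in_on = """
    SELECT o.id, c.name
    FROM orders o
    JOIN customers c
      ON o.customer_id = c.id AND o.region = 'west'
    ORDER BY o.id
"""

# Same filter placed in the WHERE clause, after the join.
filter_in_where = """
    SELECT o.id, c.name
    FROM orders o
    JOIN customers c
      ON o.customer_id = c.id
    WHERE o.region = 'west'
    ORDER BY o.id
"""

rows_on = conn.execute(filter_in_on).fetchall()
rows_where = conn.execute(filter_in_where).fetchall()
print(rows_on == rows_where)
```

The equivalence holds for inner joins only; with an outer join, moving a predicate between ON and WHERE changes the result.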