George Erickson GeorgeErickson

Reese kafka migration script
chef to shell converter for the VPC migration

Spark Tips & Tricks

Misc. Tips & Tricks

If values are integers in [0, 255], Parquet will automatically compress them to 1-byte unsigned integers, decreasing the size of the saved DataFrame by up to a factor of 8 relative to 8-byte (64-bit) integers.
Partition DataFrames to have evenly-distributed, ~128MB partition sizes (empirical finding). Always err on the higher side w.r.t. number of partitions.
Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. The flatMap op usually results in a DataFrame with a [much] larger number of rows, yet the number of partitions will remain the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the
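The partition-sizing tips can be sketched as a small calculation. This is a rough illustration in plain Python, not code from the original notes: the helper names `partitions_for` and `partitions_after_expansion` and the per-row byte estimate are assumptions made for the example, and the 128MB target is the empirical figure quoted in the tips.

```python
import math

# ~128MB per partition, the empirical target quoted in the tips.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def partitions_for(estimated_total_bytes, target_bytes=TARGET_PARTITION_BYTES):
    """Hypothetical helper: partition count keeping each partition <= target_bytes.

    Rounds up, so the result errs on the side of more partitions.
    """
    if estimated_total_bytes <= 0:
        return 1
    return max(1, math.ceil(estimated_total_bytes / target_bytes))


def partitions_after_expansion(row_count, bytes_per_row_after):
    """Hypothetical helper: size partitions for the data *after* a memory-expanding op.

    row_count is the number of rows produced by flatMap; bytes_per_row_after
    is a rough estimate of each row's size once the following op has run
    (for example, an index turned into a large vector).
    """
    return partitions_for(row_count * bytes_per_row_after)


if __name__ == "__main__":
    # Assumed numbers: 10 million rows, each becoming a vector of
    # 1000 doubles (about 8000 bytes).
    n = partitions_after_expansion(10_000_000, 8000)
    print(n)
```

In PySpark the resulting count could then be passed to `DataFrame.repartition(n)` on the flatMap output before the expensive step runs. The estimate is only as good as the assumed bytes-per-row figure, so rounding up further is reasonable.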

Working with SQL syntax trees in F#

Update 12/15/2016 - Added SQL generation

Welcome to my blog post for #FsAdvent 2016.

If you're using a relational database, as your application grows in size, at some point you may find yourself looking for an SQL parser. This can give you lots of leverage, for example allowing you to:

Do permission checks on queries before executing them
Rewrite incorrect or inefficient queries

	OS := $(shell uname -s)

	PROTO_VERSION := 3.0.2
	PROTO_ZIP_FILE := /tmp/protoc-$(PROTO_VERSION).zip
	PROTOC := /usr/local/bin/protoc-$(PROTO_VERSION)
	ifeq ($(OS),Darwin)
	PROTO_URL := https://github.com/google/protobuf/releases/download/v$(PROTO_VERSION)/protoc-$(PROTO_VERSION)-osx-x86_64.zip
	PROTO_CHECKSUM := 06f7401ffe5211340692b0a16dc53f3d8f9dc8ef3c1f74378110ee222e36436d
	else
	PROTO_URL := "https://s3.amazonaws.com/dd-public-oss-mirror/protoc-$(PROTO_VERSION)-linux-x86_64.zip"

	--- PSQL queries, which are also duplicated from https://github.com/anvk/AwesomePSQLList/blob/master/README.md
	--- some of them taken from https://www.slideshare.net/alexeylesovsky/deep-dive-into-postgresql-statistics-54594192

	-- I'm not an expert in PSQL. Just a developer who is trying to accumulate useful stat queries which could potentially explain problems in your Postgres DB.

	------------
	-- Basics --
	------------

	-- Get indexes of tables

	// ==UserScript==
	// @name gitlab pipeline job highlighter
	// @namespace http://tampermonkey.net/
	// @version 0.1
	// @description highlight personal jobs, customize for your url (match)
	// @author pmbauer
	// @match https://gitlab.ddbuild.io//pipelines
	// @grant none
	// ==/UserScript==

	// Copyright 2017 The Go Authors. All rights reserved.
	// Use of this source code is governed by a BSD-style
	// license that can be found in the LICENSE file.

	package main

	import (
	"image/color"
	"math"
	"os"

	package main

	import (
	"bufio"
	"bytes"
	"errors"
	"flag"
	"fmt"
	"log"
	"os"

	### x is either a vector of numbers or a data frame with sums and weights. Digest is a data frame.
	merge = function(x, digest, compression=100) {
	## Force the digest to be a data.frame, possibly empty
	if (!is.data.frame(digest) && is.na(digest)) {
	digest = data.frame(sum=c(), weight=c())
	}
	## and coerce the incoming data likewise ... a vector of points have default weighting of 1
	if (!is.data.frame(x)) {
	x = data.frame(sum=x, weight=1)
	}