Edwin M. Black-Milk
🐇 turtles all the way down…
@jennybc
jennybc / 2014-10-12_stop-working-directory-insanity.md
Last active May 1, 2025 19:00
Stop the working directory insanity

There are packages for this now!

2017-08-03: Since I wrote this in 2014, the universe, specifically Kirill Müller (https://github.com/krlmlr), has provided better solutions to this problem. I now recommend that you use one of these two packages:

  • rprojroot: This is the main package with functions to help you express paths in a way that will "just work" when developing interactively in an RStudio Project and when you render your file.
  • here: A lightweight wrapper around rprojroot that anticipates the most likely scenario: you want to write paths relative to the top-level directory, defined as an RStudio project or Git repo. TRY THIS FIRST.

I love these packages so much I wrote an ode to here.

I use these packages now instead of what I describe below. I'll leave this gist up for historical interest. 😆

@wrouesnel
wrouesnel / main.py
Last active May 23, 2021 12:52
Python argparse with config file fallback. This is great for scripting up daemon-like tools (note PyDev template syntax - replace as needed)
#!/usr/bin/env python
# encoding: utf-8
'''
${module}
'''
import sys
import os
from os import path
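
The gist preview shows only the module header; the pattern named in the description is usually a two-pass parse: a throwaway parser extracts --config first, then the real parser is seeded with defaults merged from the file. A minimal sketch in Python 3, not the gist's actual code; the --host/--port options and the [daemon] section are made-up names:

#!/usr/bin/env python3
# Sketch only: command-line values win, then config-file values,
# then the hard-coded defaults below.
import argparse
import configparser

# Pass 1: pull out --config and swallow everything else.
conf_parser = argparse.ArgumentParser(add_help=False)
conf_parser.add_argument("-c", "--config", help="path to an INI config file")
args, remaining_argv = conf_parser.parse_known_args()

defaults = {"host": "localhost", "port": 8080}
if args.config:
    config = configparser.ConfigParser()
    config.read(args.config)
    if config.has_section("daemon"):  # hypothetical section name
        defaults.update(dict(config["daemon"]))

# Pass 2: the real parser, with the merged defaults applied.
parser = argparse.ArgumentParser(parents=[conf_parser])
parser.set_defaults(**defaults)
parser.add_argument("--host")
parser.add_argument("--port", type=int)
print(parser.parse_args(remaining_argv))

Note that argparse re-applies type conversion to string defaults, so values read from the INI file still come out as the declared types.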
@zkamvar
zkamvar / Makevars
Created July 14, 2015 20:54
My R Makevars file to ensure that I build R packages with openmp
# This file's location is ~/.R/Makevars
# Default variables (no omp support):
# CXX=clang++
# CC=clang
# I followed the instructions at http://hpc.sourceforge.net/ to install gcc 4.9
CC=/usr/local/bin/gcc
CXX=/usr/local/bin/g++
FC=/usr/local/bin/gfortran
F77=/usr/local/bin/gfortran
@jmindek
jmindek / gist:62c50dd766556b7b16d6
Last active January 31, 2024 15:48
DISTINCT ON like functionality for Redshift

DISTINCT over a column list -> for the rows returned, keep only the unique combinations of the selected columns. Think of it as concatenating all the column values in each row of the projection and returning only the rows whose concatenated string is unique.

test_db=# SELECT DISTINCT parent_id, child_id, id FROM test.foo_table ORDER BY parent_id, child_id, id LIMIT 10;
 parent_id | child_id |              id
-----------+----------+------------------------------
   1000040 |      103 | 1000040|2645405726|0001|103
@snurhussein
snurhussein / PSPPConvert.md
Last active October 31, 2021 23:49
Using PSPP to convert sav files to csv

Converting SPSS files to csv with PSPP

  • Install and open PSPP
  • Use the File menu to open your file (it probably has a .sav extension)
  • Go to File > New > Syntax to open PSPP's command line window

Enter:

@thomasjungblut
thomasjungblut / gist.R
Last active November 14, 2019 04:57
XGBoost Validation and Early Stopping in R
library(xgboost)

train <- read.csv("train.csv")
bound <- floor(nrow(train) * 0.9)

train <- train[sample(nrow(train)), ]              # shuffle the rows
df.train <- train[1:bound, ]                       # first 90% for training
df.validation <- train[(bound + 1):nrow(train), ]  # last 10% for validation

train.y <- df.train$TARGET
validation.y <- df.validation$TARGET

# xgb.DMatrix wants a numeric matrix, not a data.frame,
# and the label column must be dropped from the features
dtrain <- xgb.DMatrix(data = data.matrix(df.train[, names(df.train) != "TARGET"]),
                      label = train.y)
@whophil
whophil / jupyter-style.ipynb
Last active November 27, 2021 09:40
Pretty style for Jupyter notebooks using Google web-fonts. Apply to all your notebooks using %run magic.
@aakansh9
aakansh9 / xgboost_extra.R
Last active January 27, 2020 20:55
Xgboost cross validation functions for time series data + gridsearch functions in R
# CV based on general traininds and testinds list
# useful for time-series based split
xgb.ts.cv <- function (params = list(), data, nrounds, nfold, label = NULL,
missing = NULL, prediction = FALSE, showsd = TRUE, metrics = list(),
obj = NULL, feval = NULL, stratified = TRUE, folds = NULL,
verbose = T, print.every.n = 1L, early.stop.round = NULL,
maximize = NULL, traininds, testinds, ...)
{
if (typeof(params) != "list") {
@ispmarin
ispmarin / confusion_matrix_spark.py
Created June 3, 2016 14:26
Confusion Matrix, precision and recall check for PySpark
# pairs of predictions and labels for the metrics check
rdd = sc.parallelize(
    [
        (0., 1.),
        (0., 0.),
        (0., 0.),
        (1., 1.),
        (1., 0.),
        (1., 0.),
        (1., 1.),
        (1., 1.),
    ]
)
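
Assuming the pairs are ordered (prediction, label), which is what MLlib's MulticlassMetrics expects, a minimal sketch of pulling the confusion matrix, precision, and recall out of such an RDD (this uses the standard pyspark.mllib API, not necessarily the gist's exact code):

from pyspark.mllib.evaluation import MulticlassMetrics

metrics = MulticlassMetrics(rdd)  # rdd holds (prediction, label) pairs

print(metrics.confusionMatrix().toArray())  # rows = actual class, columns = predicted
print(metrics.precision(1.0))               # precision for class 1.0
print(metrics.recall(1.0))                  # recall for class 1.0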
@eddies
eddies / setup-notes.md
Created July 29, 2016 08:00
Spark 2.0.0 and Hadoop 2.7 with s3a setup

Standalone Spark 2.0.0 with s3

Tested with:

  • Spark 2.0.0 pre-built for Hadoop 2.7
  • Mac OS X 10.11
  • Python 3.5.2

Goal

Use s3 within pyspark with minimal hassle.
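
A minimal smoke test of the result, assuming hadoop-aws and a matching aws-java-sdk jar are on the classpath; the credential values and the bucket/file are placeholders, and the fs.s3a.* keys are the standard Hadoop 2.7 settings:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3a-smoke-test")
         .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder
         .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")  # placeholder
         .getOrCreate())

# Read something small to confirm the s3a wiring works;
# the bucket and file are hypothetical.
df = spark.read.csv("s3a://some-bucket/some-file.csv", header=True)
df.show()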