
@camallen
camallen / convert_csv_to_geo_json.py
Created August 23, 2018 13:32
Convert csv RGB layer data to geojson points
import csv, json
from geojson import Feature, FeatureCollection, Point

def convertBox2MidPoint(lower_lat, lower_lon, upper_lat, upper_lon):
    # midpoint of the bounding box: half the span in each axis
    delta_lon = abs(lower_lon - upper_lon) / 2
    delta_lat = abs(lower_lat - upper_lat) / 2
    mid_lon = lower_lon + delta_lon
    mid_lat = lower_lat + delta_lat
    # geojson is (lon, lat) ordering
    return (mid_lon, mid_lat)
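A minimal usage sketch of the midpoint helper (the bounding-box coordinates below are hypothetical, not from the gist's data):

```python
def convertBox2MidPoint(lower_lat, lower_lon, upper_lat, upper_lon):
    # half the span in each axis, added to the lower corner
    delta_lon = abs(lower_lon - upper_lon) / 2
    delta_lat = abs(lower_lat - upper_lat) / 2
    # geojson is (lon, lat) ordering
    return (lower_lon + delta_lon, lower_lat + delta_lat)

# hypothetical box: lower corner (lat 10.0, lon 20.0), upper corner (lat 12.0, lon 24.0)
mid = convertBox2MidPoint(10.0, 20.0, 12.0, 24.0)
print(mid)  # (22.0, 11.0)
```

Note the return order: GeoJSON positions are (longitude, latitude), the reverse of the common lat/lon convention, so the tuple can be fed straight into a `Point`.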
@camallen
camallen / extract_gz_subject_location_data.rb
Created July 26, 2018 10:12
Extract Galaxy Zoo Subject location data
def _iterate_cursor(collection: nil, query: { }, opts: { }, message: '')
  opts.reverse_merge! timeout: false
  index = 0
  total = collection.find(query).count
  message = "#{ message } Galaxy Zoo Subjects"
  collection.find(query, opts) do |cursor|
    while cursor.has_next?
      index += 1
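The Ruby snippet above walks a Mongo cursor while counting progress against a precomputed total. The same pattern, sketched in Python with a plain list standing in for the cursor (this is an illustration of the iteration shape, not the gist's actual driver API):

```python
def iterate_with_progress(records, message="Extracting"):
    """Yield each record, reporting progress as 'index/total' as we go."""
    total = len(records)
    for index, record in enumerate(records, start=1):
        # the original reports progress against a count taken up front
        print(f"{message} Galaxy Zoo Subjects: {index}/{total}")
        yield record

processed = list(iterate_with_progress([{"zooniverse_id": "AGZ0001"},
                                        {"zooniverse_id": "AGZ0002"}]))
```

As in the Ruby version, the total is fetched once before iteration starts, so long-running extracts can report meaningful progress instead of a bare counter.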
@camallen
camallen / postgres_index_sizes.sql
Created July 5, 2018 09:14
PG Table & Index size queries
SELECT
  nspname AS schema_name,
  relname AS index_name,
  round(100 * pg_relation_size(indexrelid) / pg_relation_size(indrelid)) / 100 AS index_ratio,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  pg_size_pretty(pg_relation_size(indrelid)) AS table_size
FROM
  pg_index I
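`pg_size_pretty` renders raw byte counts as human-readable sizes. When post-processing the same numbers outside the database, a rough Python approximation of that formatting (PostgreSQL's exact rounding rules are not reproduced here) can be handy:

```python
def size_pretty(num_bytes):
    """Approximate pg_size_pretty: scale a byte count through kB/MB/GB/TB."""
    units = ["bytes", "kB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            # whole numbers for bytes, one decimal place for scaled units
            return f"{int(size)} {unit}" if unit == "bytes" else f"{size:.1f} {unit}"
        size /= 1024

print(size_pretty(512))    # 512 bytes
print(size_pretty(8192))   # 8.0 kB
```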
@camallen
camallen / public_stream_data_format.json
Created April 23, 2018 10:23
Zooniverse public stream example data format
{
  "classification_id": "103101552",
  "project_id": "825",
  "workflow_id": "2647",
  "user_id": "6",
  "subject_ids": [
    "15686058"
  ],
  "subject_urls": [
    {
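A consumer of the stream can parse each message and pull out the fields it needs. A sketch using only the fields shown in the sample above (the full message carries more keys, e.g. the truncated `subject_urls`):

```python
import json

# sample payload trimmed to the fields visible in the example data format
message = json.loads("""{
  "classification_id": "103101552",
  "project_id": "825",
  "workflow_id": "2647",
  "user_id": "6",
  "subject_ids": ["15686058"]
}""")

# ids arrive as strings in the stream; cast if you need numeric keys
subject_ids = [int(s) for s in message["subject_ids"]]
print(subject_ids)  # [15686058]
```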
[
  {
    "userName": "zooniverse",
    "repo": "panoptes"
  },
  {
    "userName": "zooniverse",
    "repo": "panoptes-front-end"
  },
  {
DIRS=(local_image_directory)
for dir_to_process in "${DIRS[@]}"; do
  echo "converting files in $dir_to_process"
  cd "$dir_to_process"
  # possibly speed up using GNU parallel? https://unix.stackexchange.com/questions/320877/how-to-use-convert-and-xargs-together
  OUT_PATH="../converted/${dir_to_process}"
  # these values come from another project, but I manually tested image conversion to settle on them:
  # resize to a max width of 2048 (to match other sites) at 80% quality, to get under 1 MB / 900 kB
  # run some manual tests to see what works for you, e.g.
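The resize/quality targets above translate into an ImageMagick `convert` invocation. A Python sketch that builds such a command per file (the directory names are illustrative; `-resize 2048x>` uses ImageMagick's `>` geometry flag, which only shrinks images wider than the limit):

```python
import shlex
from pathlib import Path

def build_convert_command(src, out_dir, max_width=2048, quality=80):
    """Build an ImageMagick convert command matching the script's targets."""
    out_path = Path(out_dir) / Path(src).name
    return ["convert", str(src),
            "-resize", f"{max_width}x>",   # shrink only; never upscale
            "-quality", str(quality),
            str(out_path)]

cmd = build_convert_command("local_image_directory/galaxy.jpg",
                            "converted/local_image_directory")
print(shlex.join(cmd))
```

Building the argument list in code (rather than interpolating a shell string) sidesteps quoting issues with filenames, and the list can be handed to `subprocess.run` directly.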
@camallen
camallen / project_classifications_csv_dump_export.rb
Last active June 18, 2019 10:25
Manual classification csv exports for a panoptes project
# Manual csv classifications dump
# ensure config/database.yml is configured to use the read-replica database, not the production db.
#
# run via rails runner from the panoptes cmd line:
# rails r project_classifications_csv_dump_export.rb
require 'csv'
PROJECT_ID = 1
@camallen
camallen / find_database_relation_sized.sql
Created November 21, 2017 13:27
List top 10 table sizes and report the index usage
SELECT relation,
       pg_size_pretty(total_size),
       pg_size_pretty(size),
       pg_size_pretty(total_size - size) AS index_size
FROM (
  SELECT relname AS "relation",
         pg_total_relation_size(C.oid) AS "total_size",
         pg_relation_size(C.oid) AS "size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
  ORDER BY pg_relation_size(C.oid) DESC
) AS derived
LIMIT 10;
@camallen
camallen / emr_install_pandas.sh
Created June 21, 2017 09:52
Install pandas EMR bootstrap
#!/bin/bash
echo " Installing pandas"
echo "*****************************************"
sudo pip install pandas
@camallen
camallen / readme.md
Created June 21, 2017 09:49 — forked from cosmincatalin/readme.md
AWS EMR bootstrap to install R packages from CRAN

This bootstrap is useful if you want to deploy SparkR applications that run arbitrary code on the EMR cluster's workers. The R code needs its dependencies already installed on each worker and will fail otherwise; this is the case when you use functions such as gapply or dapply.

How to use the bootstrap

  1. You will first have to download the gist to a file and then upload it to S3 in a bucket of your choice.
  2. Using the AWS EMR Console create a cluster and choose advanced options.
  3. In Step 3 you can configure your bootstraps. Choose to Configure and add a Custom action