### Tested with:
- Spark 2.0.0 pre-built for Hadoop 2.7
- Mac OS X 10.11
- Python 3.5.2
Use S3 within PySpark with minimal hassle.
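As a minimal sketch of what that can look like (the bucket name and credentials below are placeholders, and it assumes the `hadoop-aws` package matching your Hadoop build is on the classpath):

```python
# Minimal sketch: read a file from S3 via the s3a:// connector in PySpark.
# Assumes hadoop-aws (and its AWS SDK dependency) is available, e.g. started with:
#   pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3
# The bucket name and credentials are placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="s3-example")

hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

rdd = sc.textFile("s3a://my-bucket/some/key.txt")
print(rdd.take(5))
```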
The way I could do it was by using the Docker API, which I accessed through the docker-py package. The API exposes a labels dictionary for each container, and the keys `com.docker.compose.container-number`, `com.docker.compose.project` and `com.docker.compose.service` provided everything needed to build the hostname.
The code below is a simplified version of the code I am now using. You can find my more advanced code, with caching and other fancy stuff, on GitHub at luckydonald/pbft/dockerus.ServiceInfos (backup at gist.github.com).
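The original snippet is not reproduced here; as a rough stand-in, a minimal sketch of the approach could look like the following. It uses the current Docker SDK for Python, and the `<project>_<service>_<number>` hostname scheme is my assumption (matching docker-compose's default container names), not necessarily the author's exact code:

```python
# Sketch: list compose-managed containers and build their hostnames from the
# com.docker.compose.* labels. Assumes the Docker SDK for Python
# (pip install docker) and access to the local Docker socket. The
# "<project>_<service>_<number>" naming scheme is an assumption based on
# docker-compose's default container names.
import docker

client = docker.from_env()

for container in client.containers.list():
    labels = container.labels  # dict of all labels on the container
    project = labels.get("com.docker.compose.project")
    service = labels.get("com.docker.compose.service")
    number = labels.get("com.docker.compose.container-number")
    if project and service and number:
        hostname = "{}_{}_{}".format(project, service, number)
        print(hostname)
```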
```html
<!DOCTYPE html>
<html>
  <head>
    <title>AWS S3 File Upload</title>
    <script src="https://sdk.amazonaws.com/js/aws-sdk-2.1.12.min.js"></script>
  </head>
  <body>
    <input type="file" id="file-chooser" />
    <!-- The upload handler script is not included in this snippet. -->
  </body>
</html>
```
```text
PASSWORD1                       # Replace literal string 'PASSWORD1' with '***REMOVED***' (default)
PASSWORD2==>examplePass         # replace with 'examplePass' instead
PASSWORD3==>                    # replace with the empty string
regex:password=\w+==>password=  # Replace, using a regex
regex:\r(\n)==>$1               # Replace Windows newlines with Unix newlines
```
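These lines follow the format of BFG Repo-Cleaner's `--replace-text` expressions file; if that is the intended tool, the file would be applied to a mirror clone with something like `bfg --replace-text passwords.txt my-repo.git` (the command and file names here are illustrative).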
```solidity
pragma solidity ^0.4.7;

contract Factory {
    bytes32[] Names;
    address[] newContracts;

    function createContract (bytes32 name) {
        // "Contract" is assumed to be defined elsewhere; it is not shown in this snippet.
        address newContract = new Contract(name);
        newContracts.push(newContract);
    }
}
```
Consider repartitioning after a `flatMap`, especially if the following operation will result in high memory usage. The `flatMap` op usually results in a DataFrame with a [much] larger number of rows, yet the number of partitions will remain the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of `flatMap` to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the expected memory usage of the resulting rows.
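A rough PySpark sketch of that pattern is below; the input data, the expansion function, and the target partition count are illustrative assumptions rather than values from the original:

```python
# Sketch: repartition after flatMap so per-partition memory stays reasonable.
# The expansion function and the target partition count (2000) are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flatmap-repartition").getOrCreate()
sc = spark.sparkContext

# A small RDD of indices; each index expands into many values via flatMap,
# but the number of partitions stays the same as the input's.
indices = sc.parallelize(range(1000), numSlices=8)
expanded = indices.flatMap(lambda i: range(i))

# Spread the now much larger data over more partitions before any
# memory-heavy downstream operation (e.g. building large vectors per row).
expanded = expanded.repartition(2000)
print(expanded.getNumPartitions())
```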

```python
from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # canonical import path in Airflow 1.x
from datetime import datetime
import os
import sys

args = {
    'owner': 'airflow',
    'start_date': datetime(2017, 1, 27),
    'provide_context': True,
}
```
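The snippet stops at the default arguments; a typical continuation (the DAG id, schedule interval, and bash command here are illustrative assumptions) wires them into a DAG and a task:

```python
# Sketch of how the default args above are usually wired into a DAG with a
# single BashOperator task; dag_id, schedule_interval and bash_command are
# illustrative assumptions.
dag = DAG(
    dag_id='example_bash_dag',
    default_args=args,
    schedule_interval='@daily',
)

print_date = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
)
```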
ec2-54-152-134-146.compute-1.amazonaws.com.