Alexis Seigneurin (aseigneurin)

aseigneurin / register_schema.py
Last active October 18, 2022 08:26
Register an Avro schema against the Confluent Schema Registry
#!/usr/bin/python
# Usage: ./register_schema.py <schema_registry_url> <topic> <schema_file>
import os
import sys
import json
import requests
schema_registry_url = sys.argv[1]
topic = sys.argv[2]
schema_file = sys.argv[3]
# A minimal registration call (the '<topic>-value' subject name is an assumption):
payload = json.dumps({"schema": open(schema_file).read()})
headers = {"Content-Type": "application/vnd.schemaregistry.v1+json"}
response = requests.post("%s/subjects/%s-value/versions" % (schema_registry_url, topic), data=payload, headers=headers)
print("%s: %s" % (response.status_code, response.text))

<?xml version="1.0"?>
<!DOCTYPE module PUBLIC
"-//Puppy Crawl//DTD Check Configuration 1.3//EN"
"http://www.puppycrawl.com/dtds/configuration_1_3.dtd">
<!--
Checkstyle configuration that checks the Google coding conventions from:
- Google Java Style

aseigneurin / settings.yaml
Created March 6, 2017 12:50
leboncoin-ad-manager
region: Ile-de-France
departement: Paris
zipCode: 75011
city: Paris
name: Alexis S
email: [email protected]
phoneNumber: "0600000000"
hidePhoneNumber: false
password: xxxxxxxxx

aseigneurin / alexis.zsh-theme
Last active January 15, 2017 01:30
Oh-My-Zsh configuration
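# Prompt: bold red time, cyan user @ green host : bold yellow current directory, git branch/status, then $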
PROMPT=$'%{$fg_bold[red]%}%D{%K:%M:%S}%{$reset_color%} %{$fg[cyan]%}%n%{$fg[grey]%}@%{$fg[green]%}%M%{$fg[grey]%}:%{$fg_bold[yellow]%}%d%{$fg[grey]%}$(git_prompt_info) $ %{$reset_color%}'
ZSH_THEME_GIT_PROMPT_PREFIX=" %{$fg_bold[white]%}git:("
ZSH_THEME_GIT_PROMPT_SUFFIX="%{$fg[white]%})%{$reset_color%}"
ZSH_THEME_GIT_PROMPT_DIRTY="%{$fg[red]%}*"
ZSH_THEME_GIT_PROMPT_CLEAN=""

aseigneurin / Spark parquet.md
Created November 15, 2016 15:25
Spark - Parquet files

Spark - Parquet files

Basic file formats - such as CSV, JSON or other text formats - can be useful when exchanging data between applications. When it comes to storing intermediate data between steps of an application, Parquet can provide more advanced capabilities:

  • Support for complex types, as opposed to string-based types (CSV) or a limited type system (JSON only supports strings, basic numbers, booleans).
  • Columnar storage - more efficient when not all the columns are used or when filtering the data.
  • Partitioning - files are partitioned out of the box.
  • Compression - pages can be compressed with Snappy or Gzip (this preserves the partitioning).

The tests here are performed with Spark 2.0.1 on a cluster with 3 workers (c4.4xlarge: 16 vCPUs and 30 GB of memory each).
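
As an illustration of these capabilities, here is a minimal sketch, runnable in a spark-shell, that writes a small DataFrame to Parquet with partitioning and Snappy compression, then reads it back with a filter (the sample data, column names and output path are made up):

import org.apache.spark.sql.functions.sum
import spark.implicits._

// hypothetical sample data: one row per (country, city, amount)
val df = Seq(("FR", "Paris", 10.0), ("US", "New York", 20.0)).toDF("country", "city", "amount")

// write to Parquet: one directory per country value, pages compressed with Snappy
df.write
  .mode("overwrite")
  .partitionBy("country")
  .option("compression", "snappy")
  .parquet("/tmp/events.parquet")

// read back: only the 'amount' column and the 'country=FR' directories are scanned
val events = spark.read.parquet("/tmp/events.parquet")
events.filter($"country" === "FR").select(sum($"amount")).show()

On disk, the write produces country=FR/ and country=US/ directories, which is what makes the partition pruning in the read possible.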

aseigneurin / Spark file formats and storage.md
Last active December 17, 2018 10:09
Spark - File formats and storage options

Spark - File formats and storage options

In this document, I'm using a data file containing 40 million records. The file is a text file with one record per line.

The following Scala code is run in a spark-shell:

// read the file as an RDD of lines; count() triggers a full scan of the file
val filename = "<path to the file>"
val file = sc.textFile(filename)
file.count()

aseigneurin / Spark high availability.md
Created November 1, 2016 16:42
Spark - High availability

Spark - High availability

Components in play

As a reminder, here are the components in play to run an application:

  • The cluster:
    • Spark Master: coordinates the resources
    • Spark Workers: offer resources to run the applications
  • The application:
    • Driver: runs the main program and schedules the tasks
    • Executors: run the tasks on the workers
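
As a small, illustrative sketch of what this means for the application (the host names are hypothetical): when a standby Master is configured, the application can be given the list of Masters, so it registers with whichever one is the current leader and survives a failover.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical hosts: list every Master; the context registers with the elected leader
val conf = new SparkConf()
  .setAppName("ha-example")
  .setMaster("spark://master1:7077,master2:7077")
val sc = new SparkContext(conf)
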
#!/bin/bash -e
if [ ! -d data/wikipedia-pagecounts-hours ]; then
  mkdir -p data/wikipedia-pagecounts-hours
fi
cd data/wikipedia-pagecounts-hours
yyyy=2014
MM=06
dd=19

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.sample</groupId>
  <artifactId>Spark_Kafka_Streaming</artifactId>
  <packaging>jar</packaging>
  <version>0.0.1-SNAPSHOT</version>
  <properties>

import scala.collection.mutable.ArrayBuffer
import com.datastax.driver.core.ResultSetFuture

// Issue the queries asynchronously and keep the futures so the results
// can be collected later without blocking on each individual query
case class PendingResult(k1: Long, k2: String, futureResults: ResultSetFuture)

val pendingResults = ArrayBuffer.empty[PendingResult]
for (i <- 1 to iterations) {
  val k1 = ...
  val k2 = ...
  val futureResults = session.executeAsync(s"SELECT * FROM ${tableName} WHERE k1=${k1} AND k2='${k2}'")
  pendingResults += PendingResult(k1, k2, futureResults)
}
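
A possible way to drain these buffered futures afterwards (the processing step is hypothetical) is to block on each one and iterate over the rows it returns:

import scala.collection.JavaConverters._

// getUninterruptibly() blocks until the async SELECT issued above has completed
for (pending <- pendingResults) {
  val rows = pending.futureResults.getUninterruptibly().asScala
  for (row <- rows) {
    println(s"k1=${pending.k1}, k2=${pending.k2} -> $row")  // hypothetical processing
  }
}
pendingResults.clear()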