Alexis Seigneurin (aseigneurin)

aseigneurin / register_schema.py
Last active October 18, 2022 08:26
Register an Avro schema against the Confluent Schema Registry
#!/usr/bin/python
# Usage: ./register_schema.py <schema_registry_url> <topic> <schema_file>
import os
import sys
import json
import requests
schema_registry_url = sys.argv[1]
topic = sys.argv[2]
schema_file = sys.argv[3]
# A minimal registration call (the '<topic>-value' subject name is an assumption):
payload = json.dumps({"schema": open(schema_file).read()})
headers = {"Content-Type": "application/vnd.schemaregistry.v1+json"}
response = requests.post("%s/subjects/%s-value/versions" % (schema_registry_url, topic), data=payload, headers=headers)
print("%s: %s" % (response.status_code, response.text))

<?xml version="1.0"?>
<!DOCTYPE module PUBLIC
"-//Puppy Crawl//DTD Check Configuration 1.3//EN"
"http://www.puppycrawl.com/dtds/configuration_1_3.dtd">
<!--
Checkstyle configuration that checks the Google coding conventions from:
- Google Java Style

aseigneurin / settings.yaml
Created March 6, 2017 12:50
leboncoin-ad-manager
region: Ile-de-France
departement: Paris
zipCode: 75011
city: Paris
name: Alexis S
email: [email protected]
phoneNumber: "0600000000"
hidePhoneNumber: false
password: xxxxxxxxx

aseigneurin / alexis.zsh-theme
Last active January 15, 2017 01:30
Oh-My-Zsh configuration
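# Prompt: bold red time, cyan user @ green host : bold yellow current directory, git branch/status, then $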
PROMPT=$'%{$fg_bold[red]%}%D{%K:%M:%S}%{$reset_color%} %{$fg[cyan]%}%n%{$fg[grey]%}@%{$fg[green]%}%M%{$fg[grey]%}:%{$fg_bold[yellow]%}%d%{$fg[grey]%}$(git_prompt_info) $ %{$reset_color%}'
ZSH_THEME_GIT_PROMPT_PREFIX=" %{$fg_bold[white]%}git:("
ZSH_THEME_GIT_PROMPT_SUFFIX="%{$fg[white]%})%{$reset_color%}"
ZSH_THEME_GIT_PROMPT_DIRTY="%{$fg[red]%}*"
ZSH_THEME_GIT_PROMPT_CLEAN=""

aseigneurin / Spark parquet.md
Created November 15, 2016 15:25
Spark - Parquet files

Spark - Parquet files

Basic file formats - such as CSV, JSON or other text formats - can be useful when exchanging data between applications. When it comes to storing intermediate data between steps of an application, Parquet can provide more advanced capabilities:

  • Support for complex types, as opposed to string-based types (CSV) or a limited type system (JSON only supports strings, basic numbers, booleans).
  • Columnar storage - more efficient when not all the columns are used or when filtering the data.
  • Partitioning - files are partitioned out of the box.
  • Compression - pages can be compressed with Snappy or Gzip (this preserves the partitioning).

The tests here are performed with Spark 2.0.1 on a cluster with 3 workers (c4.4xlarge: 16 vCPUs and 30 GB of memory each).
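
As an illustration of these capabilities, here is a minimal sketch, runnable in a spark-shell, that writes a small DataFrame to Parquet with partitioning and Snappy compression, then reads it back with a filter (the sample data, column names and output path are made up):

import org.apache.spark.sql.functions.sum
import spark.implicits._

// hypothetical sample data: one row per (country, city, amount)
val df = Seq(("FR", "Paris", 10.0), ("US", "New York", 20.0)).toDF("country", "city", "amount")

// write to Parquet: one directory per country value, pages compressed with Snappy
df.write
  .mode("overwrite")
  .partitionBy("country")
  .option("compression", "snappy")
  .parquet("/tmp/events.parquet")

// read back: only the 'amount' column and the 'country=FR' directories are scanned
val events = spark.read.parquet("/tmp/events.parquet")
events.filter($"country" === "FR").select(sum($"amount")).show()

On disk, the write produces country=FR/ and country=US/ directories, which is what makes the partition pruning in the read possible.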

aseigneurin / Spark file formats and storage.md
Last active December 17, 2018 10:09
Spark - File formats and storage options

Spark - File formats and storage options

In this document, I'm using a data file containing 40 million records. The file is a text file with one record per line.

The following Scala code is run in a spark-shell:

// read the file as an RDD of lines; count() triggers a full scan of the file
val filename = "<path to the file>"
val file = sc.textFile(filename)
file.count()

aseigneurin / Spark high availability.md
Created November 1, 2016 16:42
Spark - High availability

Spark - High availability

Components in play

As a reminder, here are the components in play to run an application:

  • The cluster:
    • Spark Master: coordinates the resources
    • Spark Workers: offer resources to run the applications
  • The application:
    • Driver: runs the main program and schedules the tasks
    • Executors: run the tasks on the workers
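
As a small, illustrative sketch of what this means for the application (the host names are hypothetical): when a standby Master is configured, the application can be given the list of Masters, so it registers with whichever one is the current leader and survives a failover.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical hosts: list every Master; the context registers with the elected leader
val conf = new SparkConf()
  .setAppName("ha-example")
  .setMaster("spark://master1:7077,master2:7077")
val sc = new SparkContext(conf)
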
#!/bin/bash -e
if [ ! -d data/wikipedia-pagecounts-hours ]; then
  mkdir -p data/wikipedia-pagecounts-hours
fi
cd data/wikipedia-pagecounts-hours
yyyy=2014
MM=06
dd=19

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.sample</groupId>
  <artifactId>Spark_Kafka_Streaming</artifactId>
  <packaging>jar</packaging>
  <version>0.0.1-SNAPSHOT</version>
  <properties>

import scala.collection.mutable.ArrayBuffer
import com.datastax.driver.core.ResultSetFuture

// Issue the queries asynchronously and keep the futures so the results
// can be collected later without blocking on each individual query
case class PendingResult(k1: Long, k2: String, futureResults: ResultSetFuture)

val pendingResults = ArrayBuffer.empty[PendingResult]
for (i <- 1 to iterations) {
  val k1 = ...
  val k2 = ...
  val futureResults = session.executeAsync(s"SELECT * FROM ${tableName} WHERE k1=${k1} AND k2='${k2}'")
  pendingResults += PendingResult(k1, k2, futureResults)
}
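
A possible way to drain these buffered futures afterwards (the processing step is hypothetical) is to block on each one and iterate over the rows it returns:

import scala.collection.JavaConverters._

// getUninterruptibly() blocks until the async SELECT issued above has completed
for (pending <- pendingResults) {
  val rows = pending.futureResults.getUninterruptibly().asScala
  for (row <- rows) {
    println(s"k1=${pending.k1}, k2=${pending.k2} -> $row")  // hypothetical processing
  }
}
pendingResults.clear()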