@samklr
samklr / pb-avro-test_pom.xml
Created November 24, 2017 13:29 — forked from alexvictoor/pb-avro-test_pom.xml
Demo of Protobuf integration within Avro
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.avro.is.great</groupId>
<artifactId>protobuff-avro-demo</artifactId>
<packaging>jar</packaging>
<version>1.0-SNAPSHOT</version>
<name>Demo of protobuff integration with Avro</name>
<build>
<plugins>
samklr / HyperLogLogStoreUDAF.scala
Created June 9, 2017 20:02 — forked from MLnick/HyperLogLogStoreUDAF.scala
Experimenting with Spark SQL UDAF - HyperLogLog UDAF for distinct counts, that stores the actual HLL for each row to allow further aggregation
class HyperLogLogStoreUDAF extends UserDefinedAggregateFunction {
override def inputSchema = new StructType()
.add("stringInput", BinaryType)
override def update(buffer: MutableAggregationBuffer, input: Row) = {
// This input Row only has a single column storing the input value in String (or other Binary data).
// We only update the buffer when the input value is not null.
if (!input.isNullAt(0)) {
if (buffer.isNullAt(0)) {
samklr / gist:05410a4926e6773e707c07cfe59411ab
Created May 19, 2017 17:00 — forked from devsprint/gist:5363023
Write blob content to Cassandra using the DataStax java-driver
private static String WRITE_STATEMENT = "INSERT INTO avatars (id, image_type, avatar) VALUES (?,?,?);";
private final BoundStatement writeStatement = session.prepare(WRITE_STATEMENT)
.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM).bind();
try {
BoundStatement stmt = driver.getWriteStatement();
stmt.enableTracing();
stmt.setLong("id", accountId);
stmt.setString("image_type", image.getType());
stmt.setBytes("avatar", ByteBuffer.wrap(image.getBytes()));
samklr / kafka_cassandra_cluster.md
Last active May 18, 2017 16:45 — forked from ferhtaydn/ kafka_cassandra_cluster.md
Confluent Kafka Platform and Cassandra Multi Node Deployment Guide

A step-by-step guide to deploying a multi-node Confluent Kafka Platform and Cassandra cluster.

It is a multi-node deployment of https://github.com/ferhtaydn/sack

Assume that we have five Ubuntu 14.04 nodes. Their IPs are as follows:

  • 12.0.5.4
  • 12.0.5.5
  • 12.0.5.6
  • 12.0.1.170
samklr / SparkUtils.scala
Created April 13, 2017 18:10 — forked from ibuenros/SparkUtils.scala
Spark productionizing utilities developed by Ooyala, shown at Spark Summit 2014
//==================================================================
// SPARK INSTRUMENTATION
//==================================================================
import com.codahale.metrics.{MetricRegistry, Meter, Gauge}
import org.apache.spark.{SparkEnv, Accumulator}
import org.apache.spark.metrics.source.Source
import org.joda.time.DateTime
import scala.collection.mutable
samklr / spark_etl_resume.md
Created October 10, 2016 19:54 — forked from rampage644/spark_etl_resume.md
Spark ETL resume

Introduction

This document describes a sample process of implementing part of the existing Dim_Instance ETL.

I took only the Cloud Block Storage source to simplify and speed up the process. I also ignored the creation of extended tables (specific to this particular ETL process). Below are the code and final thoughts about the possible use of Spark as the primary ETL tool.

TL;DR

Implementation

The basic ETL implementation is really straightforward. The only real problem (and I mean a real problem) is finding a correct and comprehensive mapping document (a description of which source fields go where).

Advanced Functional Programming with Scala - Notes

Copyright © 2017 Fantasyland Institute of Learning. All rights reserved.

1. Mastering Functions

A function is a mapping from one set, called a domain, to another set, called the codomain. A function associates every element in the domain with exactly one element in the codomain. In Scala, both domain and codomain are types.

val square : Int => Int = x => x * x
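As a minimal sketch of the definition above, the snippet below treats Scala function values as mappings between types. The names `FunctionBasics`, `length`, and `squaredLength` are illustrative assumptions and do not come from the original notes; only `square` does.

```scala
object FunctionBasics {
  // Domain Int, codomain Int: every Int maps to exactly one Int.
  val square: Int => Int = x => x * x

  // Hypothetical helper with domain String and codomain Int.
  val length: String => Int = s => s.length

  // Composition chains the codomain of `length` into the domain of `square`,
  // giving a new function with domain String and codomain Int.
  val squaredLength: String => Int = length.andThen(square)

  def main(args: Array[String]): Unit = {
    println(square(4))            // prints 16
    println(squaredLength("abc")) // prints 9
  }
}
```

Composition only type-checks when the codomain of the first function matches the domain of the second, which is one practical reason to think of both as types.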
samklr / conf_core-site.xml
Created July 30, 2016 12:57 — forked from chicagobuss/conf_core-site.xml
How to get Spark 1.6.0 with Hadoop 2.6 working with S3
<configuration>
<property>
<name>fs.s3a.access.key</name>
<description>AWS access key ID. Omit for Role-based authentication.</description>
<value>YOUR_ACCESS_KEY</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<description>AWS secret key. Omit for Role-based authentication.</description>
samklr / big_query_examples.md
Created July 21, 2016 18:02 — forked from arfon/big_query_examples.md
BigQuery Examples for blog post

How many times shouldn't it happen...

-- https://news.ycombinator.com/item?id=11396045

SELECT count(*)
FROM (SELECT id, repo_name, path
        FROM [bigquery-public-data:github_repos.sample_files]
 ) AS F
samklr / ReflectionHelpersSnippet.scala
Created July 17, 2016 22:31 — forked from ConnorDoyle/ReflectionHelpersSnippet.scala
Generic reflective case class instantiation with Scala 2.10.x
package test
import scala.reflect.runtime.universe._
object ReflectionHelpers extends ReflectionHelpers
trait ReflectionHelpers {
protected val classLoaderMirror = runtimeMirror(getClass.getClassLoader)