Marek Wiewiórka (mwiewior)

mwiewior / slack.sh
Created March 4, 2018 18:14 — forked from andkirby/slack.sh
Shell/Bash script for sending Slack messages.
#!/usr/bin/env bash
####################################################################################
# Slack Bash console script for sending messages.
####################################################################################
# Installation
# $ curl -s https://gist.githubusercontent.com/andkirby/67a774513215d7ba06384186dd441d9e/raw --output /usr/bin/slack
# $ chmod +x /usr/bin/slack
####################################################################################
# USAGE
# Send a message to a Slack channel/user
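
The preview above is cut off before the script body. Purely as a hedged illustration of the same idea — posting a JSON payload to a Slack incoming webhook — here is a minimal Scala sketch; the webhook URL is a placeholder and this is not andkirby's script:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object SlackNotify {
  // Placeholder only: a real incoming-webhook URL issued by Slack goes here.
  val webhookUrl = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

  // POSTs a {"text": "..."} payload and returns the HTTP status code.
  def send(text: String): Int = {
    val payload = s"""{"text": "$text"}"""   // naive JSON, no escaping — sketch only
    val request = HttpRequest.newBuilder(URI.create(webhookUrl))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(payload))
      .build()
    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    response.statusCode()                    // Slack answers 200 with body "ok" on success
  }

  def main(args: Array[String]): Unit =
    println(send(args.headOption.getOrElse("hello from the sketch")))
}

Usage would be along the lines of SlackNotify.send("build finished"); the original bash script layers channel/user selection on top of the same kind of POST.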
mwiewior / extraStrategies.md
Created October 13, 2017 08:54 — forked from marmbrus/extraStrategies.md
Example of injecting custom planning strategies into Spark SQL.

First a disclaimer: This is an experimental API that exposes internals that are likely to change in between different Spark releases. As a result, most datasources should be written against the stable public API in org.apache.spark.sql.sources. We expose this mostly to get feedback on what optimizations we should add to the stable API in order to get the best performance out of data sources.

We'll start with a simple artificial data source that just returns ranges of consecutive integers.

/** A data source that returns ranges of consecutive integers in a column named `a`. */
case class SimpleRelation(
    start: Int, 
    end: Int)(
    @transient val sqlContext: SQLContext) 
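
The snippet is truncated here, but the registration hook the text refers to is sqlContext.experimental.extraStrategies. As a hedged sketch (not marmbrus's full example), a strategy is an object that maps a LogicalPlan to zero or more SparkPlans and is then appended to that sequence:

import org.apache.spark.sql.{SQLContext, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// A strategy that declines to plan anything: returning Nil hands the plan
// back to the built-in strategies. A real strategy would pattern-match on
// its own logical operators (e.g. a scan of SimpleRelation) and emit the
// corresponding physical operators.
object PassThroughStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}

object InstallStrategy {
  // The experimental hook referred to in the text above.
  def install(sqlContext: SQLContext): Unit =
    sqlContext.experimental.extraStrategies = Seq(PassThroughStrategy)
}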
mwiewior / gist:709a0f711425ff68cb534f9b34919fbe
Created September 11, 2017 15:34 — forked from mbedward/gist:6e3dbb232bafec0792ba
Scala macro to convert between a case class instance and a Map of constructor parameters. Developed by Jonathan Chow (see http://blog.echo.sh/post/65955606729/exploring-scala-macros-map-to-case-class-conversion for description and usage). This version simply updates Jonathan's code to Scala 2.11.2.
import scala.language.experimental.macros
import scala.reflect.macros.blackbox.Context
trait Mappable[T] {
  def toMap(t: T): Map[String, Any]
  def fromMap(map: Map[String, Any]): T
}
object Mappable {
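
The body of object Mappable (the macro materializer) is cut off above. To make the contract concrete, here is a small self-contained usage sketch with a hand-written instance standing in for what the macro would generate; Person is a hypothetical case class used only for illustration:

// Hypothetical case class, used only to illustrate the Mappable contract.
case class Person(name: String, age: Int)

object MappableUsage {
  // A hand-written instance standing in for what the (truncated) macro in
  // `object Mappable` is meant to materialize automatically for any case class.
  implicit val personMappable: Mappable[Person] = new Mappable[Person] {
    def toMap(p: Person): Map[String, Any] = Map("name" -> p.name, "age" -> p.age)
    def fromMap(m: Map[String, Any]): Person =
      Person(m("name").asInstanceOf[String], m("age").asInstanceOf[Int])
  }

  def main(args: Array[String]): Unit = {
    val mapper = implicitly[Mappable[Person]]
    val asMap  = mapper.toMap(Person("Ada", 36))    // Map(name -> Ada, age -> 36)
    println(s"$asMap -> ${mapper.fromMap(asMap)}")  // round-trips back to Person(Ada,36)
  }
}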
mwiewior / Schema2CaseClass.scala
Created September 11, 2017 15:21 — forked from yoyama/Schema2CaseClass.scala
Generate a case class from a Spark DataFrame/Dataset schema.
/**
 * Generate a case class from DataFrame.schema
 *
 * val df: DataFrame = ...
 *
 * val s2cc = new Schema2CaseClass
 * import s2cc.implicit._
 *
 * println(s2cc.schemaToCaseClass(df.schema, "MyClass"))
 *
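
The preview shows only the usage comment. As a rough sketch of the underlying idea — not yoyama's implementation — schemaToCaseClass can be written as a walk over the StructType fields, mapping Spark types to Scala types and wrapping nullable fields in Option:

import org.apache.spark.sql.types._

// Sketch only: emit Scala source for a case class matching a StructType.
// A handful of primitive types are handled; anything else falls back to
// Spark's simpleString name.
object SchemaToCaseClassSketch {
  private def scalaType(dt: DataType): String = dt match {
    case StringType  => "String"
    case IntegerType => "Int"
    case LongType    => "Long"
    case DoubleType  => "Double"
    case BooleanType => "Boolean"
    case other       => other.simpleString
  }

  def schemaToCaseClass(schema: StructType, className: String): String = {
    val fields = schema.fields.map { f =>
      val t   = scalaType(f.dataType)
      val typ = if (f.nullable) s"Option[$t]" else t   // nullable -> Option
      s"  ${f.name}: $typ"
    }
    fields.mkString(s"case class $className (\n", ",\n", "\n)")
  }
}

Calling SchemaToCaseClassSketch.schemaToCaseClass(df.schema, "MyClass") would then print source text such as case class MyClass (name: Option[String], age: Int) for a two-column example schema.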
mwiewior / HdfsSeekRead.java
Created August 30, 2017 09:56 — forked from t3rmin4t0r/HdfsSeekRead.java
HDFS seek benchmark
// import org.apache.commons.lang3.RandomUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import com.google.common.base.Stopwatch;
import java.io.IOException;
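
Only the imports of HdfsSeekRead.java survive in the preview. They suggest a loop that seeks to offsets in an HDFS file and times the reads; a minimal Scala sketch of that pattern (using System.nanoTime in place of Guava's Stopwatch, with a hypothetical 64 KB buffer and 100 iterations) could look like:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.util.Random

object HdfsSeekReadSketch {
  def main(args: Array[String]): Unit = {
    // args(0): a file on HDFS, assumed to be larger than the 64 KB read buffer.
    val path   = new Path(args(0))
    val fs     = path.getFileSystem(new Configuration())
    val len    = fs.getFileStatus(path).getLen
    val buf    = new Array[Byte](64 * 1024)
    val maxOff = math.max(1L, len - buf.length)

    val in    = fs.open(path)
    val start = System.nanoTime()
    for (_ <- 1 to 100) {
      in.seek((Random.nextDouble() * maxOff).toLong)  // jump to a random offset
      in.readFully(buf, 0, buf.length)                // timed read
    }
    val elapsedMs = (System.nanoTime() - start) / 1e6
    in.close()
    println(f"100 random seek+read(64KB) calls took $elapsedMs%.1f ms")
  }
}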
import sklearn
import numpy as np
import math
import pickle
import collections
class DGA:
    def __init__(self):
        self.model = { 'clf': pickle.loads(open('./dga_model_random_forest.model','rb').read())
                     , 'alexa_vc': pickle.loads(open('./dga_model_alexa_vectorizor.model','rb').read())
                     , 'alexa_counts': pickle.loads(open('./dga_model_alexa_counts.model','rb').read())

Phoenix/Spark demo

Option 1: prebuilt VM

There is a prebuilt CentOS 6.5 VM with the following components installed:

  • HDP 2.3.0.0-1754
  • Spark 1.3.1