Disclaimer: everything described in this document is my personal opinion and doesn't have to hold true for everyone.
This document describes how Airflow jobs (or workflows) get deployed onto a production system.
- HOME directory: `/home/airflow`
- DAG directory: `$HOME/airflow-git-dir/dags/` (a minimal DAG sketch follows this list)
- Config directory: `$HOME/airflow-git-dir/configs/`
- Unit test directory: `$HOME/airflow-git-dir/tests/`. Preferably discoverable by both `nose` and `py.test`.
- Credentials should be accessed via some library
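For illustration, here is a minimal sketch of a DAG file that could live under the DAG directory, assuming a 1.x-era Airflow; the DAG id, schedule, and command are hypothetical placeholders, not part of any real deployment.

```python
# A minimal sketch, assuming a 1.x-era Airflow. The DAG id, schedule,
# and bash command are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Files in $HOME/airflow-git-dir/dags/ are picked up by the scheduler.
dag = DAG(
    dag_id="sample_etl",
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
)

run_etl = BashOperator(
    task_id="run_etl",
    bash_command="echo 'ETL step goes here'",
    dag=dag,
)
```

Tests under the unit test directory then only need plain `test_*` modules and functions, which both `nose` and `py.test` discover by default.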
This document describes a sample process of implementing part of the existing Dim_Instance ETL. I took only the Cloud Block Storage source to simplify and speed up the process, and I also ignored the creation of extended tables (specific to this particular ETL process). Below are code and final thoughts about possible Spark usage as the primary ETL tool.
The basic ETL implementation is really straightforward. The only real problem (I mean, the only *real* problem) is finding a correct and comprehensive mapping document (a description of which source fields go where).
```scala
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{SaveMode, Row, SQLContext}
import com.databricks.spark.csv.CsvSchemaRDD
import org.apache.spark.sql.functions._
```
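To show the shape of such a job end to end, here is a minimal PySpark equivalent sketch (the snippet above is Scala), assuming Spark 1.x with the same spark-csv package; the input path and column names are hypothetical placeholders.

```python
# A minimal extract-transform-load sketch, assuming Spark 1.x with the
# spark-csv package. Input path and column names are hypothetical.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext(conf=SparkConf().setAppName("dim-instance-etl"))
sqlContext = SQLContext(sc)

# Extract: read the raw Cloud Block Storage export.
raw = (sqlContext.read
       .format("com.databricks.spark.csv")
       .option("header", "true")
       .load("/data/raw/cloud_block_storage.csv"))

# Transform: keep only the fields the mapping document calls for.
dim = raw.select(F.col("instance_id"), F.col("created_at"))

# Load: overwrite the target dimension table.
dim.write.mode("overwrite").parquet("/data/dim/dim_instance")
```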
```bash
# Add the Cloudera CDH5 repository and install the Impala packages.
yum-config-manager --add-repo http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo
yum install impala-server impala-catalog impala-state-store impala-shell

# Symlink the HBase jars into Impala's lib directory.
ln -sf /usr/lib/hbase/lib/hbase-client.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-common.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-protocol.jar /usr/lib/impala/lib
```
- I think I got `plan-fragment-executor-test` to run under OSv
- But it fails very quickly
- The problem is with `tcmallocstatic`. First, OSv doesn't support `sbrk`-based memory management, so one has to tune `tcmallocstatic` not to use `SbrkMemoryAllocator` at all (comment out `#undef HAVE_SBRK` in `config.h.in`). Second, it still fails with an invalid opcode exception.
- Haven't found a way to cut off the hardware layer; the virtio lead didn't help.
- OSv builds very tricky libraries; impossible to use them as-is on the host.
- The bottom-up approach seems reasonable for now.
Just collecting information about unikernels/KVM and friends. A little OSv source code digging with no actual results. Discussions.
## Git repo
Find the modified Impala [here](https://github.com/rampage644/impala-cut). First, have a look at [this](https://github.com/rampage644/impala-cut/blob/executor/README.md) *README* file.
## Task description
The original task was to prune impalad down to some sort of *executor* binary which only executes part of a query. Two approaches were suggested: top-down and bottom-up. I used the bottom-up approach.
My intention was to write a unit test that would actually test the behavior we need. So, look at `be/src/runtime/plan-fragment-executor-test.cc`. It contains all possible tests (that is, actual code snippets) to run part of a query with or without data. Writing it helped me a lot to understand the impalad codebase as it relates to query execution.