Sergei Turukin rampage644

rampage644 / strace-analysis.py
Created August 10, 2014 13:03
Python strace log analysis script
#!/usr/bin/env python
import sys
import re

def main():
    # Match strace output lines such as: read(3, "...", 4096) = 42
    regexp = re.compile(r'^(\S+)\((.*)\)\s+=\s+(\d+)$')
    # Only analyze file-descriptor-related syscalls
    whitelist = ['read', 'write', 'fstat', 'lseek', 'fcntl']
    opened_fd = {}
rampage644 / seccomp.md
Created August 13, 2014 12:42
Seccomp

Benchmarking


A simple experiment showed a seccomp-based syscall to be roughly 5 times slower than a vanilla one.

Calling the write syscall directly:

const unsigned count = UINT_MAX / 10000;
unsigned i = 0;
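A minimal self-contained sketch of such a benchmark (the loop body and the zero-length write are assumptions, not taken from the gist):

#include <limits.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Issue many raw write(2) calls so per-syscall overhead dominates the run time. */
int main(void) {
    const unsigned count = UINT_MAX / 10000;
    unsigned i = 0;
    for (i = 0; i < count; ++i)
        syscall(SYS_write, STDOUT_FILENO, "", 0); /* zero-length write: no I/O, just the syscall */
    return 0;
}

Timing this binary with and without a seccomp filter installed (e.g. via time(1)) reproduces the comparison.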
rampage644 / impala-pth
Created August 20, 2014 13:29
Impala Pth
# Gnu Pth as thread library for impalad
In short, it's impossible to use the Gnu Pth library with `impalad` "AS IS", i.e. without modification.
Gnu Pth:
* Gnu Pth can't fully replace `pthreads`: it lacks some functions and other entities.
* It doesn't provide versioned symbols.
Some `*.so` libraries (system/third-party) come precompiled and are linked against versioned symbols; be prepared to recompile them, replace them, or do whatever it takes. For example:
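One way to see which versioned symbols a prebuilt library requires (the library path below is hypothetical; `objdump -T` lists dynamic symbols together with their version tags):

# Show dynamic symbols with their required versions, e.g. pthread_create@GLIBC_2.2.5
objdump -T /path/to/libthirdparty.so | grep GLIBC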
rampage644 / impalad-summary
Created August 24, 2014 09:59
Impalad summary
## Git repo
Find the modified Impala [here](https://github.com/rampage644/impala-cut). First, have a look at [this](https://github.com/rampage644/impala-cut/blob/executor/README.md) *README* file.
## Task description
The original task was to prune impalad down to some sort of *executor* binary that executes only part of a query. Two approaches were suggested: top-down and bottom-up; I used the bottom-up approach.
My intention was to write a unit test that will actually test the behavior we need, so look at `be/src/runtime/plan-fragment-executor-test.cc`. It contains all possible tests (that is, actual code snippets) to run part of a query with or without data. Doing this helped me a lot in understanding the impalad codebase as it relates to query execution.
rampage644 / week-result.md
Last active August 29, 2015 14:05
Week results

Results

  • Haven't found how to cut off the hardware layer; the virtio lead didn't help.
  • OSv builds very tricky libraries; they're impossible to use as-is on the host.
  • The bottom-up approach seems reasonable for now.

01 Sep

Just collecting information about unikernels/KVM and friends. A little OSv source code digging with no actual results. Discussions.

rampage644 / osv.md
Last active August 29, 2015 14:07
OSv

OSv + Impala status

  1. I think I got plan-fragment-executor-test to run under OSv
  2. But it fails very quickly
  3. The problem is with tcmallocstatic. First, OSv doesn't support sbrk-based memory management; one has to tune tcmallocstatic not to use the SbrkMemoryAllocator at all (comment out `#undef HAVE_SBRK` in config.h.in, as sketched below). Second, even then it still fails with an invalid opcode exception.
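A minimal sketch of that edit, assuming config.h.in follows the usual autoconf layout:

/* Before: configure may turn this line into a real #define HAVE_SBRK 1 */
#undef HAVE_SBRK

/* After: commented out, so HAVE_SBRK is never defined and
   tcmalloc's SbrkMemoryAllocator stays disabled under OSv */
/* #undef HAVE_SBRK */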

Issues

tcmallocstatic

rampage644 / impala-hdp.md
Last active March 21, 2019 15:07
Impala + HDP

Downloads

HDP sandbox

Installation

# Add Cloudera's CDH5 yum repository and install the Impala packages
yum-config-manager --add-repo http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo
yum install impala-server impala-catalog impala-state-store impala-shell
# Make the HBase client jars visible to Impala
ln -sf /usr/lib/hbase/lib/hbase-client.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-common.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-protocol.jar /usr/lib/impala/lib
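A typical next step (an assumption, not shown in the gist) is to start the daemons in dependency order:

# statestore first, then catalog, then the impalad itself
service impala-state-store start
service impala-catalog start
service impala-server start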
rampage644 / impala-build.md
Created November 24, 2014 15:56
Impala build

Building Impala

  • Version: cdh5-2.0_5.2.0
  • OS: Archlinux 3.17.2-1-ARCH x86_64
  • gcc version 4.9.2

Berkeley DB version >= 5

rampage644 / dataframe.scala
Last active June 19, 2019 11:54
spark etl sample, attempt #1
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{SaveMode, Row, SQLContext}
import com.databricks.spark.csv.CsvSchemaRDD
import org.apache.spark.sql.functions._
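A minimal sketch of how these imports typically fit together in a Spark 1.x ETL job (file paths and column names are hypothetical, not taken from the gist):

// Spark 1.x-era setup matching the imports above
val conf = new SparkConf().setAppName("etl-sample")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// spark-csv reads a CSV file into a DataFrame
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("instances.csv")

// Stamp each row with a load date, then write the result out as Parquet
val loadDate = new SimpleDateFormat("yyyy-MM-dd").format(new Date())
df.withColumn("load_date", lit(loadDate))
  .write
  .mode(SaveMode.Overwrite)
  .parquet("dim_instance.parquet")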
rampage644 / spark_etl_resume.md
Created September 15, 2015 18:02
Spark ETL resume

Introduction

This document describes a sample process of implementing part of the existing Dim_Instance ETL.

I took only the Cloud Block Storage source to simplify and speed up the process. I also ignored the creation of extended tables (specific to this particular ETL process). Below are the code and final thoughts about possible Spark usage as a primary ETL tool.

TL;DR

Implementation

The basic ETL implementation is really straightforward. The only real problem (I mean, a genuinely hard one) is finding a correct and comprehensive mapping document (a description of which source fields go where).