Disclaimer: everything described in this document is my personal opinion and doesn't have to hold true for everyone.
This document describes how Airflow jobs (or workflows) get deployed onto a production system.
- HOME directory: `/home/airflow`
- DAG directory: `$HOME/airflow-git-dir/dags/` (a minimal DAG sketch follows this list)
- Config directory: `$HOME/airflow-git-dir/configs/`
- Unit test directory: `$HOME/airflow-git-dir/tests/`. Preferably discoverable by both `nose` and `py.test`.
- Credentials should be accessed via some library
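For illustration, here is a minimal sketch of a DAG file that could live under the DAG directory, assuming a 1.x-era Airflow; the DAG id, schedule, and command are hypothetical placeholders, not part of any real deployment.

```python
# A minimal sketch, assuming a 1.x-era Airflow. The DAG id, schedule,
# and bash command are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Files in $HOME/airflow-git-dir/dags/ are picked up by the scheduler.
dag = DAG(
    dag_id="sample_etl",
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
)

run_etl = BashOperator(
    task_id="run_etl",
    bash_command="echo 'ETL step goes here'",
    dag=dag,
)
```

Tests under the unit test directory then only need plain `test_*` modules and functions, which both `nose` and `py.test` discover by default.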
This document describes a sample process of implementing part of the existing Dim_Instance ETL. I took only the Cloud Block Storage source to simplify and speed up the process, and I also ignored the creation of extended tables (specific to this particular ETL process). Below are code and final thoughts about possible Spark usage as the primary ETL tool.
The basic ETL implementation is really straightforward. The only real problem (I mean, the only *real* problem) is finding a correct and comprehensive mapping document (a description of which source fields go where).
```scala
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{SaveMode, Row, SQLContext}
import com.databricks.spark.csv.CsvSchemaRDD
import org.apache.spark.sql.functions._
```
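To show the shape of such a job end to end, here is a minimal PySpark equivalent sketch (the snippet above is Scala), assuming Spark 1.x with the same spark-csv package; the input path and column names are hypothetical placeholders.

```python
# A minimal extract-transform-load sketch, assuming Spark 1.x with the
# spark-csv package. Input path and column names are hypothetical.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext(conf=SparkConf().setAppName("dim-instance-etl"))
sqlContext = SQLContext(sc)

# Extract: read the raw Cloud Block Storage export.
raw = (sqlContext.read
       .format("com.databricks.spark.csv")
       .option("header", "true")
       .load("/data/raw/cloud_block_storage.csv"))

# Transform: keep only the fields the mapping document calls for.
dim = raw.select(F.col("instance_id"), F.col("created_at"))

# Load: overwrite the target dimension table.
dim.write.mode("overwrite").parquet("/data/dim/dim_instance")
```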
```bash
# Add the Cloudera CDH5 repository and install the Impala packages.
yum-config-manager --add-repo http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo
yum install impala-server impala-catalog impala-state-store impala-shell

# Symlink the HBase jars into Impala's lib directory.
ln -sf /usr/lib/hbase/lib/hbase-client.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-common.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-protocol.jar /usr/lib/impala/lib
```
- I think I got `plan-fragment-executor-test` to run under OSv
- But it fails very quickly
- The problem is with `tcmallocstatic`. First, OSv doesn't support `sbrk`-based memory management, so one has to tune `tcmallocstatic` not to use `SbrkMemoryAllocator` at all (comment out `#undef HAVE_SBRK` in `config.h.in`). Second, it still fails with an invalid opcode exception.
- Haven't found a way to cut off the hardware layer; the virtio lead didn't help.
- OSv builds very tricky libraries; impossible to use them as-is on the host.
- The bottom-up approach seems reasonable for now.
Just collecting information about unikernels/KVM and friends. A little OSv source code digging with no actual results. Discussions.
## Git repo
Find the modified Impala [here](https://github.com/rampage644/impala-cut). First, have a look at [this](https://github.com/rampage644/impala-cut/blob/executor/README.md) *README* file.
## Task description
The original task was to prune impalad down to some sort of *executor* binary which only executes part of a query. Two approaches were suggested: top-down and bottom-up. I used the bottom-up approach.
My intention was to write a unit test that would actually test the behavior we need. So, look at `be/src/runtime/plan-fragment-executor-test.cc`. It contains all possible tests (that is, actual code snippets) to run part of a query with or without data. Writing it helped me a lot to understand the impalad codebase as it relates to query execution.