Nicolas Paris parisni

Introduction

This document describes a sample process of implementing part of the existing Dim_Instance ETL.

I took only the Cloud Block Storage source to simplify and speed up the process. I also ignored the creation of extended tables (specific to this particular ETL process). Below are the code and final thoughts about possible Spark usage as the primary ETL tool.

TL;DR

Implementation

The basic ETL implementation is really straightforward. The only real problem (and it is a real problem) is finding a correct and comprehensive mapping document (a description of which source fields go where).
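To make the role of the mapping document concrete, here is a minimal sketch in plain Python, not the actual Dim_Instance code. The field names in `MAPPING`, the sample row, and the helper `apply_mapping` are all hypothetical and chosen only for illustration; the real mapping would come from the mapping document itself.

```python
# Hypothetical mapping: target column -> source field.
# These names are invented for illustration, not taken from the real ETL.
MAPPING = {
    "instance_id": "id",
    "instance_name": "display_name",
    "size_gb": "size",
}


def apply_mapping(source_row, mapping):
    """Build one target row by copying each mapped source field."""
    missing = [src for src in mapping.values() if src not in source_row]
    if missing:
        # Fail loudly: a gap here usually means the mapping document is incomplete.
        raise KeyError(f"source row lacks mapped fields: {missing}")
    return {target: source_row[src] for target, src in mapping.items()}


# A made-up source record; unmapped fields such as "region" are dropped.
rows = [{"id": "vol-1", "display_name": "data", "size": 100, "region": "ORD"}]
dim_rows = [apply_mapping(row, MAPPING) for row in rows]
print(dim_rows)
```

Keeping the mapping as data rather than code means a correction to the mapping document becomes a one-line change. In Spark, a dictionary like this could plausibly drive a `select` with aliased columns, though that is an assumption about how one might structure the job, not a description of the original implementation.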

This experiment converts an SQL version of WordNet 3.0 into a graph, using the Python library graph-tool. In order to create a taxonomical structure, only noun synsets, hyponym links and hypernym links are considered.

The result of the conversion is saved as GraphML, then rendered as the following hairball:

Since the graph can be considered a tangled tree, i.e. a tree in which some nodes have multiple parents, two untangled versions (using longest and shortest paths) are also provided as GraphML. Only a few links are lost (about 2%), making the tree a good approximation of the noun taxonomy graph.
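The untangling idea can be sketched in plain Python, without graph-tool, under the assumption that the graph is a DAG given as a hypothetical `children` adjacency dictionary. This is an illustrative sketch, not the experiment's code: for every node it keeps a single parent, chosen on either a shortest or a longest path from the root. The function names and the toy graph are invented for the example.

```python
from collections import deque


def untangle_shortest(children, root):
    """Keep, for each node, the parent on a shortest path from the root (BFS)."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in parent:  # first visit is along a shortest path
                parent[child] = node
                queue.append(child)
    return parent


def untangle_longest(children, root):
    """Keep, for each node, the parent on a longest path from the root.

    Assumes the graph is acyclic. The recursion is fine for a toy graph;
    a very deep graph would need an iterative traversal.
    """
    order, seen = [], set()

    def visit(node):
        seen.add(node)
        for child in children.get(node, []):
            if child not in seen:
                visit(child)
        order.append(node)  # post-order

    visit(root)
    order.reverse()  # reversed post-order is a topological order

    depth, parent = {root: 0}, {root: None}
    for node in order:
        for child in children.get(node, []):
            if depth[node] + 1 > depth.get(child, -1):
                depth[child] = depth[node] + 1
                parent[child] = node
    return parent


# Toy tangled tree: "d" has two parents, "c" (deep) and "b" (shallow).
children = {"root": ["a", "b"], "a": ["c"], "c": ["d"], "b": ["d"]}
shortest = untangle_shortest(children, "root")
longest = untangle_longest(children, "root")
total_links = sum(len(kids) for kids in children.values())
lost_links = total_links - (len(shortest) - 1)  # a tree has n - 1 links
print(shortest["d"], longest["d"], lost_links)
```

In this toy graph the shortest-path version attaches `d` to `b` and the longest-path version attaches it to `c`; either way one of the five links is dropped, which is the kind of loss the 2% figure above refers to.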

	Latency Comparison Numbers (~2012)
	----------------------------------
	L1 cache reference                       0.5 ns
	Branch mispredict                        5   ns
	L2 cache reference                       7   ns            14x L1 cache
	Mutex lock/unlock                       25   ns
	Main memory reference                  100   ns            20x L2 cache, 200x L1 cache
	Compress 1K bytes with Zippy         3,000   ns     3 us
	Send 1K bytes over 1 Gbps network   10,000   ns    10 us
	Read 4K randomly from SSD*         150,000   ns   150 us   ~1GB/sec SSD