RamGhadiyaram

Introduction

This document describes a sample process of implementing part of the existing Dim_Instance ETL.

I took only the Cloud Block Storage source to simplify and speed up the process. I also ignored the creation of extended tables (specific to this particular ETL process). Below are the code and some final thoughts about possible Spark usage as the primary ETL tool.

TL;DR

Implementation

The basic ETL implementation is straightforward. The only real difficulty is finding a correct and comprehensive mapping document (a description of which source fields go where).
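To make the mapping problem concrete, one option is to treat the mapping document as data: a table of source field to target column that the ETL step simply applies. The sketch below is a minimal plain-Java illustration of that idea, not the actual Dim_Instance mapping. The class name `MappingSketch` and every field and column name in it (`volume_id`, `instance_key`, and so on) are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a mapping document expressed as data.
// All field and column names below are invented for illustration.
class MappingSketch {

    // Source field name -> target dimension column name.
    static final Map<String, String> MAPPING = new LinkedHashMap<>();
    static {
        MAPPING.put("volume_id", "instance_key");
        MAPPING.put("display_name", "instance_name");
        MAPPING.put("size_gb", "allocated_gb");
    }

    // Renames mapped fields and drops anything the mapping does not mention.
    static Map<String, Object> applyMapping(Map<String, Object> sourceRow) {
        Map<String, Object> targetRow = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : MAPPING.entrySet()) {
            // A source field that is absent becomes null in the target row.
            targetRow.put(rule.getValue(), sourceRow.get(rule.getKey()));
        }
        return targetRow;
    }

    public static void main(String[] args) {
        Map<String, Object> source = new LinkedHashMap<>();
        source.put("volume_id", "vol-001");
        source.put("display_name", "db-data");
        source.put("size_gb", 100);
        source.put("internal_flag", true); // not in the mapping, so dropped
        System.out.println(applyMapping(source));
    }
}
```

Keeping the mapping in one data structure means a gap or error in the mapping document shows up in a single place rather than being scattered through the transformation code.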

Installing Apache Superset on Windows 10

⚠️ WARN: This doc might be outdated. Use with caution. Only tested with Python v3.7

🙋‍♂️ INFO: If you have fixes/suggestions for this doc, please comment below.

🌟 STAR: Star this doc if you found it helpful.

	import scala.collection.mutable

	/**
	* Bounded priority queue trait that is intended to be mixed into instances of
	* scala.collection.mutable.PriorityQueue. By default PriorityQueue instances in
	* Scala are unbounded. This trait modifies the original PriorityQueue's
	* enqueue methods such that we only retain the top K elements.
	* The top K elements are defined by an implicit Ordering[A].
	* @author Ryan LeCompte
	*/
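The comment above describes keeping only the top K elements under an ordering. The trait itself is not shown here, so the following is only a rough Java sketch of the same idea built on `java.util.PriorityQueue`, not the author's Scala implementation. The class name `BoundedTopK` and its methods are hypothetical, and a `Comparator` stands in for the implicit `Ordering[A]`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a bounded "top K" queue.
class BoundedTopK<A> {
    private final int maxSize;
    private final Comparator<A> ordering;
    // Min-heap under `ordering`: the head is the smallest retained element.
    private final PriorityQueue<A> heap;

    BoundedTopK(int maxSize, Comparator<A> ordering) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        this.maxSize = maxSize;
        this.ordering = ordering;
        this.heap = new PriorityQueue<>(maxSize, ordering);
    }

    // Adds the element only if it belongs among the current top K.
    void enqueue(A element) {
        if (heap.size() < maxSize) {
            heap.add(element);
        } else if (ordering.compare(element, heap.peek()) > 0) {
            heap.poll();       // evict the smallest retained element
            heap.add(element); // keep the larger newcomer
        }
    }

    // Returns the retained elements, largest first.
    List<A> toSortedList() {
        List<A> result = new ArrayList<>(heap);
        result.sort(ordering.reversed());
        return result;
    }
}
```

Using a min-heap means the element to evict is always at the head, so each `enqueue` costs O(log K) and memory stays bounded by K regardless of how many elements are offered.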

	package org.shiftehfar.reza.benchmark;

	import java.io.File;
	import java.io.FileOutputStream;
	import java.io.IOException;
	import java.util.ArrayList;
	import java.util.List;

	import org.apache.hadoop.conf.Configuration;
	import org.apache.hadoop.fs.FSDataOutputStream;