Alexander Goida xtrmstep

Layer	Main guarantees	What is still not guaranteed	Mainly interesting for
Bronze	Data is captured; source fidelity is preserved as much as possible; lineage is possible; replay and recovery are possible; ingestion timing is visible	Clean semantics, stable meaning, deduplicated business entities, reporting-safe metrics	Dat

Consumer group	Bronze	Silver	Gold
Data Engineers	Inspect source input, debug ingestion, replay, trace origin	Build stable transformations and integration logic	Use as trusted downstream source
Analytics Engineers	Can inspect, but not ideal for modeling	Main layer for modeling and standardization	Serve curated models and metrics
Data Scientists	Use selectively for deep exploration or raw feature extraction	Good for exploration and feature preparation	Useful when stable business meaning matters
BI Developers	Usually not suitable

Test Name	Null Hypothesis	p-value Criteria	Limitations	Use Cases
Shapiro-Wilk Test	The data is normally distributed.	If `p > 0.05`,

Technique	Purpose
Z-Score	Centers data to mean = 0, std dev = 1; for Gaussian data or regression-based models.
Min-Max	Scales data to a specific range (e.g., [0, 1]); for bounded input in neural networks.
Log Transformation	Compresses large values and reduces skewness; for data with exponential growth patterns.
Robust Scaling	Rescales using median and IQR; for datasets with many outliers.

	#!/bin/bash
	set -e

	DIR=${1:-.}

	# find all files in the DIR and its subfolders, excluding .zip files
	HASH=$(
	find "$DIR" \( -type d \( -name bin -o -name obj \) -prune \) -o \
	-type f -not -name '*.zip' -print0 \|
	sort -z \|

	import json

	import fastavro
	from fastavro.schema import load_schema


	def json_to_avro(json_file_path, avro_file_path, schema_file_path, compression='deflate'):
	try:
	schema = load_schema(schema_file_path)
	except Exception as e:

	DO
	$do$
	declare
	r record;
	query_cmd text;
	begin
	for r in select table_name from information_schema.tables where table_schema = 'public' and table_name like 'prefix%'
	loop
	query_cmd := format('delete from %s where CONDITION', r.table_name);
	-- raise notice '%', query_cmd;

	// usage:
	// 'http://www.website.com/'.urlQueryParameter('id', 2) => http://www.website.com/?id=2
	// 'http://www.website.com/?type=1'.urlQueryParameter('id', 2) => http://www.website.com/?type=1&id=2

	String.prototype.isString = true;
	String.prototype.urlQueryParameter = function(key, value) {
	var uri = this;
	if (uri.isString) {
	var regEx = new RegExp("([?\|&])" + key + "=.*?(&\|$)", "i");
	var separator = uri.indexOf('?') !== -1 ? "&" : "?";

	function odump(object, depth, max) {
	depth = depth \|\| 0;
	max = max \|\| 2;
	if (depth > max) return false;
	var indent = "";
	for (var i = 0; i < depth; i++) indent += " ";
	var output = "";
	for (var key in object) {
	output += "n" + indent + key + ": ";
	switch (typeof object[key]) {

	files = [
	"file://path"
	]
	df = spark.read.json(files)
	catalyst_plan = df._jdf.queryExecution().logical()
	df_size_read = spark._jsparkSession.sessionState().executePlan(catalyst_plan).optimizedPlan().stats().sizeInBytes()