| Layer | Main guarantees | What is still not guaranteed | Mainly interesting for |
|---|---|---|---|
| Bronze | Data is captured; source fidelity is preserved as much as possible; lineage is possible; replay and recovery are possible; ingestion timing is visible | Clean semantics, stable meaning, deduplicated business entities, reporting-safe metrics | Dat |

| Consumer group | Bronze | Silver | Gold |
|---|---|---|---|
| Data Engineers | Inspect source input, debug ingestion, replay, trace origin | Build stable transformations and integration logic | Use as trusted downstream source |
| Analytics Engineers | Can inspect, but not ideal for modeling | Main layer for modeling and standardization | Serve curated models and metrics |
| Data Scientists | Use selectively for deep exploration or raw feature extraction | Good for exploration and feature preparation | Useful when stable business meaning matters |
| BI Developers | Usually not suitable |
```bash
#!/bin/bash
set -e

DIR=${1:-.}

# find all files in the DIR and its subfolders, excluding .zip files
HASH=$(
  find "$DIR" \( -type d \( -name bin -o -name obj \) -prune \) -o \
    -type f -not -name '*.zip' -print0 |
  sort -z
```
| Test Name | Null Hypothesis | p-value Criteria | Limitations | Use Cases |
|---|---|---|---|---|
| Shapiro-Wilk Test | The data is normally distributed. | If p > 0.05, |

| Technique | Purpose |
|---|---|
| Z-Score | Centers data to mean = 0, std dev = 1; for Gaussian data or regression-based models. |
| Min-Max | Scales data to a specific range (e.g., [0, 1]); for bounded input in neural networks. |
| Log Transformation | Compresses large values and reduces skewness; for data with exponential growth patterns. |
| Robust Scaling | Rescales using median and IQR; for datasets with many outliers. |
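
The four techniques in the table can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names and the sample data are hypothetical, and several choices are assumptions not stated in the table (population standard deviation for the Z-score, `log(1 + x)` for the log transformation, and the "inclusive" quantile method for the IQR). None of the functions guard against constant input, which would cause a division by zero.

```python
import math
import statistics


def z_score(values):
    """Center to mean 0 and scale to unit standard deviation (population std, an assumption)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max(values, low=0.0, high=1.0):
    """Map values linearly onto the range [low, high]."""
    smallest, largest = min(values), max(values)
    return [low + (v - smallest) * (high - low) / (largest - smallest) for v in values]


def log_transform(values):
    """Apply log(1 + x) to compress large values; assumes non-negative inputs."""
    return [math.log1p(v) for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# Hypothetical sample with one large outlier.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]

print(min_max(sample))
print(robust_scale(sample))
```

On a sample like this, the outlier squeezes the min-max output for the other points toward the bottom of the range, while robust scaling keeps them spread around zero because the median and IQR are barely affected by the extreme value.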
```python
import json

import fastavro
from fastavro.schema import load_schema


def json_to_avro(json_file_path, avro_file_path, schema_file_path, compression='deflate'):
    try:
        schema = load_schema(schema_file_path)
    except Exception as e:
```
```sql
DO
$do$
declare
    r record;
    query_cmd text;
begin
    for r in
        select table_name
        from information_schema.tables
        where table_schema = 'public' and table_name like 'prefix%'
    loop
        -- %I quotes the table name as an identifier, so mixed-case or unusual names are handled safely
        query_cmd := format('delete from %I where CONDITION', r.table_name);
        -- raise notice '%', query_cmd;
```
```javascript
// usage:
// 'http://www.website.com/'.urlQueryParameter('id', 2) => http://www.website.com/?id=2
// 'http://www.website.com/?type=1'.urlQueryParameter('id', 2) => http://www.website.com/?type=1&id=2
String.prototype.isString = true;
String.prototype.urlQueryParameter = function(key, value) {
  var uri = this;
  if (uri.isString) {
    // Match an existing "key=value" pair preceded by "?" or "&".
    // Inside a character class "|" is a literal pipe, so the class is just [?&].
    var regEx = new RegExp("([?&])" + key + "=.*?(&|$)", "i");
    var separator = uri.indexOf('?') !== -1 ? "&" : "?";
```
```javascript
function odump(object, depth, max) {
  depth = depth || 0;
  max = max || 2;
  if (depth > max) return false;

  // Build the indentation prefix for the current nesting depth.
  var indent = "";
  for (var i = 0; i < depth; i++) indent += " ";

  var output = "";
  for (var key in object) {
    output += "\n" + indent + key + ": ";
    switch (typeof object[key]) {
```
```python
files = [
    "file://path"
]
df = spark.read.json(files)
catalyst_plan = df._jdf.queryExecution().logical()
df_size_read = spark._jsparkSession.sessionState().executePlan(catalyst_plan).optimizedPlan().stats().sizeInBytes()
```