Created
May 16, 2023 20:55
-
-
Save rmoff/67bcca1fdaf8f1b887578eba1a318446 to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "id": "e8e13cd9", | |
| "metadata": {}, | |
| "source": [ | |
| "<img src=\"https://projectnessie.org/img/nessie.svg\" alt=\"lakeFS logo\" width=200/> " | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "78797e22", | |
| "metadata": {}, | |
| "source": [ | |
| "## Write-Audit-Publish (WAP) pattern with Nessie" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "0f89a51c", | |
| "metadata": {}, | |
| "source": [ | |
| "**New to Write-Audit-Publish? This [talk](https://www.youtube.com/watch?v=fXHdeBnpXrg&t=1001s) explains it well.**\n", | |
| "\n", | |
| "[@rmoff](https://twitter.com/rmoff/) " | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "7d979dfe", | |
| "metadata": {}, | |
| "source": [ | |
| "# Setup & Initialisation" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 3, | |
| "id": "45b87dc4", | |
| "metadata": { | |
| "tags": [] | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "Requirement already satisfied: pynessie==0.30.0 in /opt/conda/lib/python3.10/site-packages (0.30.0)\n", | |
| "Requirement already satisfied: findspark in /opt/conda/lib/python3.10/site-packages (2.0.1)\n", | |
| "Requirement already satisfied: attrs in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (22.2.0)\n", | |
| "Requirement already satisfied: python-dateutil in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (2.8.2)\n", | |
| "Requirement already satisfied: marshmallow in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (3.19.0)\n", | |
| "Requirement already satisfied: confuse==1.7.0 in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (1.7.0)\n", | |
| "Requirement already satisfied: Click<9.0.0,>6.0.0 in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (8.1.3)\n", | |
| "Requirement already satisfied: requests-aws4auth in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (1.2.3)\n", | |
| "Requirement already satisfied: botocore in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (1.29.131)\n", | |
| "Requirement already satisfied: requests in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (2.28.2)\n", | |
| "Requirement already satisfied: simplejson in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (3.19.1)\n", | |
| "Requirement already satisfied: marshmallow-oneofschema in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (3.0.1)\n", | |
| "Requirement already satisfied: desert in /opt/conda/lib/python3.10/site-packages (from pynessie==0.30.0) (2022.9.22)\n", | |
| "Requirement already satisfied: pyyaml in /opt/conda/lib/python3.10/site-packages (from confuse==1.7.0->pynessie==0.30.0) (6.0)\n", | |
| "Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/conda/lib/python3.10/site-packages (from botocore->pynessie==0.30.0) (1.0.1)\n", | |
| "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/conda/lib/python3.10/site-packages (from botocore->pynessie==0.30.0) (1.26.15)\n", | |
| "Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil->pynessie==0.30.0) (1.16.0)\n", | |
| "Requirement already satisfied: typing-inspect in /opt/conda/lib/python3.10/site-packages (from desert->pynessie==0.30.0) (0.8.0)\n", | |
| "Requirement already satisfied: packaging>=17.0 in /opt/conda/lib/python3.10/site-packages (from marshmallow->pynessie==0.30.0) (23.0)\n", | |
| "Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests->pynessie==0.30.0) (3.1.0)\n", | |
| "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests->pynessie==0.30.0) (2022.12.7)\n", | |
| "Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests->pynessie==0.30.0) (3.4)\n", | |
| "Requirement already satisfied: mypy-extensions>=0.3.0 in /opt/conda/lib/python3.10/site-packages (from typing-inspect->desert->pynessie==0.30.0) (1.0.0)\n", | |
| "Requirement already satisfied: typing-extensions>=3.7.4 in /opt/conda/lib/python3.10/site-packages (from typing-inspect->desert->pynessie==0.30.0) (4.5.0)\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "import sys\n", | |
| "!{sys.executable} -m pip install pynessie==0.30.0 findspark" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "4e4e906a", | |
| "metadata": {}, | |
| "source": [ | |
| "## Set up Spark " | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 4, | |
| "id": "fbbb1ab3", | |
| "metadata": { | |
| "tags": [] | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "Spark Running\n" | |
| ] | |
| }, | |
| { | |
| "data": { | |
| "text/html": [ | |
| "\n", | |
| " <div>\n", | |
| " <p><b>SparkSession - in-memory</b></p>\n", | |
| " \n", | |
| " <div>\n", | |
| " <p><b>SparkContext</b></p>\n", | |
| "\n", | |
| " <p><a href=\"http://e3ed1f564f29:4040\">Spark UI</a></p>\n", | |
| "\n", | |
| " <dl>\n", | |
| " <dt>Version</dt>\n", | |
| " <dd><code>v3.3.2</code></dd>\n", | |
| " <dt>Master</dt>\n", | |
| " <dd><code>local[*]</code></dd>\n", | |
| " <dt>AppName</dt>\n", | |
| " <dd><code>pyspark-shell</code></dd>\n", | |
| " </dl>\n", | |
| " </div>\n", | |
| " \n", | |
| " </div>\n", | |
| " " | |
| ], | |
| "text/plain": [ | |
| "<pyspark.sql.session.SparkSession at 0xffff76410a30>" | |
| ] | |
| }, | |
| "execution_count": 4, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "import os\n", | |
| "from pyspark.sql import *\n", | |
| "from pyspark import SparkConf\n", | |
| "import pynessie\n", | |
| "\n", | |
| "# import findspark\n", | |
| "# findspark.init()\n", | |
| "# pynessie_version = pynessie.__version__\n", | |
| "\n", | |
| "conf = SparkConf()\n", | |
| "# we need iceberg libraries and the nessie sql extensions\n", | |
| "conf.set(\n", | |
| " \"spark.jars.packages\",\"org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.0,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.58.1\"\n", | |
| ")\n", | |
| "# ensure python <-> java interactions are w/ pyarrow\n", | |
| "conf.set(\"spark.sql.execution.pyarrow.enabled\", \"true\")\n", | |
| "# create catalog named arctic as an iceberg catalog\n", | |
| "conf.set(\"spark.sql.catalog.rmoff\", \"org.apache.iceberg.spark.SparkCatalog\")\n", | |
| "\n", | |
| "# tell the catalog that its a Nessie catalog\n", | |
| "conf.set(\"spark.sql.catalog.rmoff.catalog-impl\", \"org.apache.iceberg.nessie.NessieCatalog\")\n", | |
| "\n", | |
| "# set the location for the catalog to store data. Spark writes to this directory\n", | |
| "conf.set(\"spark.sql.catalog.rmoff.warehouse\", \"file://\" + os.getcwd() + \"/spark_warehouse/iceberg\")\n", | |
| "\n", | |
| "# set the location of the Arctic/Nessie server.\n", | |
| "conf.set(\"spark.sql.catalog.rmoff.uri\", \"http://nessie:19120/api/v1\")\n", | |
| "\n", | |
| "# default branch for Arctic catalog to work on\n", | |
| "conf.set(\"spark.sql.catalog.rmoff.ref\", \"main\")\n", | |
| "\n", | |
| "\n", | |
| "# enable the extensions for both Nessie and Iceberg\n", | |
| "conf.set(\n", | |
| " \"spark.sql.extensions\",\n", | |
| " \"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions\",\n", | |
| ")\n", | |
| "\n", | |
| "# finally, start up the Spark server\n", | |
| "spark = SparkSession.builder.config(conf=conf).getOrCreate()\n", | |
| "print(\"Spark Running\")\n", | |
| "spark" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "eceeaa9b", | |
| "metadata": {}, | |
| "source": [ | |
| "## Load test data" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 5, | |
| "id": "ad496308", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "spark.read.option(\"inferSchema\",\"true\").option(\"multiline\",\"true\").json(\"/home/jovyan/data/nyc_film_permits.json\").createOrReplaceTempView(\"permits_src\")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 6, | |
| "id": "8827299b", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<table>\n", | |
| " <tr>\n", | |
| " <th>borough</th>\n", | |
| " <th>permit_cnt</th>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Queens</td>\n", | |
| " <td>168</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Brooklyn</td>\n", | |
| " <td>334</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Staten Island</td>\n", | |
| " <td>7</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Manhattan</td>\n", | |
| " <td>463</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Bronx</td>\n", | |
| " <td>28</td>\n", | |
| " </tr>\n", | |
| "</table>" | |
| ], | |
| "text/plain": [ | |
| "+---------------+------------+\n", | |
| "| borough | permit_cnt |\n", | |
| "+---------------+------------+\n", | |
| "| Queens | 168 |\n", | |
| "| Brooklyn | 334 |\n", | |
| "| Staten Island | 7 |\n", | |
| "| Manhattan | 463 |\n", | |
| "| Bronx | 28 |\n", | |
| "+---------------+------------+" | |
| ] | |
| }, | |
| "execution_count": 6, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "%%sql\n", | |
| "SELECT borough, count(*) permit_cnt\n", | |
| "FROM permits_src\n", | |
| "GROUP BY borough" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "3114eec3", | |
| "metadata": {}, | |
| "source": [ | |
| "## Write test data to Iceberg files " | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 7, | |
| "id": "3143e7e1", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<table>\n", | |
| " <tr>\n", | |
| " </tr>\n", | |
| "</table>" | |
| ], | |
| "text/plain": [ | |
| "++\n", | |
| "||\n", | |
| "++\n", | |
| "++" | |
| ] | |
| }, | |
| "execution_count": 7, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "%%sql \n", | |
| "\n", | |
| "CREATE TABLE rmoff.permits USING ICEBERG\n", | |
| "AS SELECT * FROM permits_src" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "a0bc6fd3", | |
| "metadata": {}, | |
| "source": [ | |
| "#### Inspect Iceberg metadata" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 8, | |
| "id": "df9f12ae", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<table>\n", | |
| " <tr>\n", | |
| " <th>content</th>\n", | |
| " <th>file_path</th>\n", | |
| " <th>file_format</th>\n", | |
| " <th>spec_id</th>\n", | |
| " <th>record_count</th>\n", | |
| " <th>file_size_in_bytes</th>\n", | |
| " <th>column_sizes</th>\n", | |
| " <th>value_counts</th>\n", | |
| " <th>null_value_counts</th>\n", | |
| " <th>nan_value_counts</th>\n", | |
| " <th>lower_bounds</th>\n", | |
| " <th>upper_bounds</th>\n", | |
| " <th>key_metadata</th>\n", | |
| " <th>split_offsets</th>\n", | |
| " <th>equality_ids</th>\n", | |
| " <th>sort_order_id</th>\n", | |
| " <th>readable_metrics</th>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>0</td>\n", | |
| " <td>file:/home/jovyan/work/spark_warehouse/iceberg/permits_30f4184f-81ed-45c3-b21c-2e93bc1a5d1e/data/00000-3-bdcc692d-fb2f-47ab-a7d5-b9e205989800-00001.parquet</td>\n", | |
| " <td>PARQUET</td>\n", | |
| " <td>0</td>\n", | |
| " <td>1000</td>\n", | |
| " <td>51115</td>\n", | |
| " <td>{1: 483, 2: 474, 3: 1183, 4: 119, 5: 2736, 6: 5023, 7: 142, 8: 2348, 9: 343, 10: 26704, 11: 1487, 12: 2462, 13: 745, 14: 2358}</td>\n", | |
| " <td>{1: 1000, 2: 1000, 3: 1000, 4: 1000, 5: 1000, 6: 1000, 7: 1000, 8: 1000, 9: 1000, 10: 1000, 11: 1000, 12: 1000, 13: 1000, 14: 1000}</td>\n", | |
| " <td>{1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0}</td>\n", | |
| " <td>{}</td>\n", | |
| " <td>{1: bytearray(b'Bronx'), 2: bytearray(b'Commercial'), 3: bytearray(b'0, 2, 3'), 4: bytearray(b'United States of'), 5: bytearray(b'2022-11-04T22:00'), 6: bytearray(b'2022-11-02T13:34'), 7: bytearray(b"Mayor\\'s Office o"), 8: bytearray(b'678909'), 9: bytearray(b'DCAS Prep/Shoot/'), 10: bytearray(b'1 AVENUE between'), 11: bytearray(b'0, 10'), 12: bytearray(b'2022-11-03T00:00'), 13: bytearray(b'Cable-episodic'), 14: bytearray(b'0, 10011')}</td>\n", | |
| " <td>{1: bytearray(b'Staten Island'), 2: bytearray(b'WEB'), 3: bytearray(b'9'), 4: bytearray(b'United States og'), 5: bytearray(b'2023-02-20T18:01'), 6: bytearray(b'2023-01-18T14:35'), 7: bytearray(b"Mayor\\'s Office p"), 8: bytearray(b'691875'), 9: bytearray(b'Theater Load in!'), 10: bytearray(b'WYTHE AVENUE beu'), 11: bytearray(b'94'), 12: bytearray(b'2023-01-20T13:01'), 13: bytearray(b'Variety'), 14: bytearray(b'11693, 11694')}</td>\n", | |
| " <td>None</td>\n", | |
| " <td>[4]</td>\n", | |
| " <td>None</td>\n", | |
| " <td>0</td>\n", | |
| " <td>Row(borough=Row(column_size=483, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='Bronx', upper_bound='Staten Island'), category=Row(column_size=474, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='Commercial', upper_bound='WEB'), communityboard_s=Row(column_size=1183, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='0, 2, 3', upper_bound='9'), country=Row(column_size=119, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='United States of', upper_bound='United States og'), enddatetime=Row(column_size=2736, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='2022-11-04T22:00', upper_bound='2023-02-20T18:01'), enteredon=Row(column_size=5023, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='2022-11-02T13:34', upper_bound='2023-01-18T14:35'), eventagency=Row(column_size=142, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound="Mayor's Office o", upper_bound="Mayor's Office p"), eventid=Row(column_size=2348, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='678909', upper_bound='691875'), eventtype=Row(column_size=343, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='DCAS Prep/Shoot/', upper_bound='Theater Load in!'), parkingheld=Row(column_size=26704, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='1 AVENUE between', upper_bound='WYTHE AVENUE beu'), policeprecinct_s=Row(column_size=1487, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='0, 10', upper_bound='94'), startdatetime=Row(column_size=2462, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='2022-11-03T00:00', upper_bound='2023-01-20T13:01'), subcategoryname=Row(column_size=745, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='Cable-episodic', upper_bound='Variety'), zipcode_s=Row(column_size=2358, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='0, 10011', upper_bound='11693, 11694'))</td>\n", | |
| " </tr>\n", | |
| "</table>" | |
| ], | |
| "text/plain": [ | |
| "+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+---------+--------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+---------------+--------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", | |
| "| content | file_path | file_format | spec_id | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_counts | nan_value_counts | lower_bounds | upper_bounds | key_metadata | split_offsets | equality_ids | sort_order_id | readable_metrics |\n", | |
| "+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+---------+--------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+---------------+--------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", | |
| "| 0 | file:/home/jovyan/work/spark_warehouse/iceberg/permits_30f4184f-81ed-45c3-b21c-2e93bc1a5d1e/data/00000-3-bdcc692d-fb2f-47ab-a7d5-b9e205989800-00001.parquet | PARQUET | 0 | 1000 | 51115 | {1: 483, 2: 474, 3: 1183, 4: 119, 5: 2736, 6: 5023, 7: 142, 8: 2348, 9: 343, 10: 26704, 11: 1487, 12: 2462, 13: 745, 14: 2358} | {1: 1000, 2: 1000, 3: 1000, 4: 1000, 5: 1000, 6: 1000, 7: 1000, 8: 1000, 9: 1000, 10: 1000, 11: 1000, 12: 1000, 13: 1000, 14: 1000} | {1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0} | {} | {1: bytearray(b'Bronx'), 2: bytearray(b'Commercial'), 3: bytearray(b'0, 2, 3'), 4: bytearray(b'United States of'), 5: bytearray(b'2022-11-04T22:00'), 6: bytearray(b'2022-11-02T13:34'), 7: bytearray(b\"Mayor\\'s Office o\"), 8: bytearray(b'678909'), 9: bytearray(b'DCAS Prep/Shoot/'), 10: bytearray(b'1 AVENUE between'), 11: bytearray(b'0, 10'), 12: bytearray(b'2022-11-03T00:00'), 13: bytearray(b'Cable-episodic'), 14: bytearray(b'0, 10011')} | {1: bytearray(b'Staten Island'), 2: bytearray(b'WEB'), 3: bytearray(b'9'), 4: bytearray(b'United States og'), 5: bytearray(b'2023-02-20T18:01'), 6: bytearray(b'2023-01-18T14:35'), 7: bytearray(b\"Mayor\\'s Office p\"), 8: bytearray(b'691875'), 9: bytearray(b'Theater Load in!'), 10: bytearray(b'WYTHE AVENUE beu'), 11: bytearray(b'94'), 12: bytearray(b'2023-01-20T13:01'), 13: bytearray(b'Variety'), 14: bytearray(b'11693, 11694')} | None | [4] | None | 0 | Row(borough=Row(column_size=483, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='Bronx', upper_bound='Staten Island'), category=Row(column_size=474, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='Commercial', upper_bound='WEB'), communityboard_s=Row(column_size=1183, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='0, 2, 3', upper_bound='9'), country=Row(column_size=119, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='United States of', upper_bound='United States og'), enddatetime=Row(column_size=2736, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='2022-11-04T22:00', upper_bound='2023-02-20T18:01'), enteredon=Row(column_size=5023, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='2022-11-02T13:34', upper_bound='2023-01-18T14:35'), eventagency=Row(column_size=142, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound=\"Mayor's Office o\", upper_bound=\"Mayor's Office p\"), eventid=Row(column_size=2348, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='678909', upper_bound='691875'), eventtype=Row(column_size=343, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='DCAS Prep/Shoot/', upper_bound='Theater Load in!'), parkingheld=Row(column_size=26704, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='1 AVENUE between', upper_bound='WYTHE AVENUE beu'), policeprecinct_s=Row(column_size=1487, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='0, 10', upper_bound='94'), startdatetime=Row(column_size=2462, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='2022-11-03T00:00', upper_bound='2023-01-20T13:01'), subcategoryname=Row(column_size=745, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='Cable-episodic', upper_bound='Variety'), zipcode_s=Row(column_size=2358, value_count=1000, null_value_count=0, nan_value_count=None, lower_bound='0, 10011', upper_bound='11693, 11694')) |\n", | |
| "+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+---------+--------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+---------------+--------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+" | |
| ] | |
| }, | |
| "execution_count": 8, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "%sql SELECT * FROM rmoff.permits.files" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 9, | |
| "id": "4a493930", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<table>\n", | |
| " <tr>\n", | |
| " <th>made_current_at</th>\n", | |
| " <th>snapshot_id</th>\n", | |
| " <th>parent_id</th>\n", | |
| " <th>is_current_ancestor</th>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>2023-05-16 20:53:32.769000</td>\n", | |
| " <td>776903278721003441</td>\n", | |
| " <td>None</td>\n", | |
| " <td>True</td>\n", | |
| " </tr>\n", | |
| "</table>" | |
| ], | |
| "text/plain": [ | |
| "+----------------------------+--------------------+-----------+---------------------+\n", | |
| "| made_current_at | snapshot_id | parent_id | is_current_ancestor |\n", | |
| "+----------------------------+--------------------+-----------+---------------------+\n", | |
| "| 2023-05-16 20:53:32.769000 | 776903278721003441 | None | True |\n", | |
| "+----------------------------+--------------------+-----------+---------------------+" | |
| ] | |
| }, | |
| "execution_count": 9, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "%sql SELECT * FROM rmoff.permits.history" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 10, | |
| "id": "13f312d0", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<table>\n", | |
| " <tr>\n", | |
| " <th>committed_at</th>\n", | |
| " <th>snapshot_id</th>\n", | |
| " <th>parent_id</th>\n", | |
| " <th>operation</th>\n", | |
| " <th>manifest_list</th>\n", | |
| " <th>summary</th>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>2023-05-16 20:53:32.769000</td>\n", | |
| " <td>776903278721003441</td>\n", | |
| " <td>None</td>\n", | |
| " <td>append</td>\n", | |
| " <td>file:/home/jovyan/work/spark_warehouse/iceberg/permits_30f4184f-81ed-45c3-b21c-2e93bc1a5d1e/metadata/snap-776903278721003441-1-fdaa71ad-87db-4ad5-8862-463e3246dd9e.avro</td>\n", | |
| " <td>{'spark.app.id': 'local-1684270382136', 'changed-partition-count': '1', 'added-data-files': '1', 'total-equality-deletes': '0', 'added-records': '1000', 'total-position-deletes': '0', 'added-files-size': '51115', 'total-delete-files': '0', 'total-files-size': '51115', 'total-records': '1000', 'total-data-files': '1'}</td>\n", | |
| " </tr>\n", | |
| "</table>" | |
| ], | |
| "text/plain": [ | |
| "+----------------------------+--------------------+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", | |
| "| committed_at | snapshot_id | parent_id | operation | manifest_list | summary |\n", | |
| "+----------------------------+--------------------+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", | |
| "| 2023-05-16 20:53:32.769000 | 776903278721003441 | None | append | file:/home/jovyan/work/spark_warehouse/iceberg/permits_30f4184f-81ed-45c3-b21c-2e93bc1a5d1e/metadata/snap-776903278721003441-1-fdaa71ad-87db-4ad5-8862-463e3246dd9e.avro | {'spark.app.id': 'local-1684270382136', 'changed-partition-count': '1', 'added-data-files': '1', 'total-equality-deletes': '0', 'added-records': '1000', 'total-position-deletes': '0', 'added-files-size': '51115', 'total-delete-files': '0', 'total-files-size': '51115', 'total-records': '1000', 'total-data-files': '1'} |\n", | |
| "+----------------------------+--------------------+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+" | |
| ] | |
| }, | |
| "execution_count": 10, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "%sql SELECT * FROM rmoff.permits.snapshots" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "dd3c5e94", | |
| "metadata": {}, | |
| "source": [ | |
| "# The Setup" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "c8ac1043", | |
| "metadata": {}, | |
| "source": [ | |
| "## Create Nessie branch " | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 11, | |
| "id": "6eb07626", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "branch='etl_job_42'" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 12, | |
| "id": "8351cf07", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<table>\n", | |
| " <tr>\n", | |
| " <th>refType</th>\n", | |
| " <th>name</th>\n", | |
| " <th>hash</th>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Branch</td>\n", | |
| " <td>etl_job_42</td>\n", | |
| " <td>d7b2235a73fc5d6ea59e6239236649d953c769ba845f4f6b535c2e5abba81b8d</td>\n", | |
| " </tr>\n", | |
| "</table>" | |
| ], | |
| "text/plain": [ | |
| "+---------+------------+------------------------------------------------------------------+\n", | |
| "| refType | name | hash |\n", | |
| "+---------+------------+------------------------------------------------------------------+\n", | |
| "| Branch | etl_job_42 | d7b2235a73fc5d6ea59e6239236649d953c769ba845f4f6b535c2e5abba81b8d |\n", | |
| "+---------+------------+------------------------------------------------------------------+" | |
| ] | |
| }, | |
| "execution_count": 12, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "%sql CREATE BRANCH {branch} IN rmoff FROM main" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "3beacc72", | |
| "metadata": {}, | |
| "source": [ | |
| "### Use the new branch for reading and writing" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "6f43a4af", | |
| "metadata": {}, | |
| "source": [ | |
| "#### Now change the `REFERENCE`" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 13, | |
| "id": "3d230c6c", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<table>\n", | |
| " <tr>\n", | |
| " <th>refType</th>\n", | |
| " <th>name</th>\n", | |
| " <th>hash</th>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Branch</td>\n", | |
| " <td>etl_job_42</td>\n", | |
| " <td>d7b2235a73fc5d6ea59e6239236649d953c769ba845f4f6b535c2e5abba81b8d</td>\n", | |
| " </tr>\n", | |
| "</table>" | |
| ], | |
| "text/plain": [ | |
| "+---------+------------+------------------------------------------------------------------+\n", | |
| "| refType | name | hash |\n", | |
| "+---------+------------+------------------------------------------------------------------+\n", | |
| "| Branch | etl_job_42 | d7b2235a73fc5d6ea59e6239236649d953c769ba845f4f6b535c2e5abba81b8d |\n", | |
| "+---------+------------+------------------------------------------------------------------+" | |
| ] | |
| }, | |
| "execution_count": 13, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "%sql USE REFERENCE {branch} IN rmoff" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "f5ab667f", | |
| "metadata": {}, | |
| "source": [ | |
| "### Show list of references in Nessie" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 14, | |
| "id": "a6ae4b95", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<table>\n", | |
| " <tr>\n", | |
| " <th>refType</th>\n", | |
| " <th>name</th>\n", | |
| " <th>hash</th>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Branch</td>\n", | |
| " <td>etl_job_42</td>\n", | |
| " <td>d7b2235a73fc5d6ea59e6239236649d953c769ba845f4f6b535c2e5abba81b8d</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Branch</td>\n", | |
| " <td>main</td>\n", | |
| " <td>d7b2235a73fc5d6ea59e6239236649d953c769ba845f4f6b535c2e5abba81b8d</td>\n", | |
| " </tr>\n", | |
| "</table>" | |
| ], | |
| "text/plain": [ | |
| "+---------+------------+------------------------------------------------------------------+\n", | |
| "| refType | name | hash |\n", | |
| "+---------+------------+------------------------------------------------------------------+\n", | |
| "| Branch | etl_job_42 | d7b2235a73fc5d6ea59e6239236649d953c769ba845f4f6b535c2e5abba81b8d |\n", | |
| "| Branch | main | d7b2235a73fc5d6ea59e6239236649d953c769ba845f4f6b535c2e5abba81b8d |\n", | |
| "+---------+------------+------------------------------------------------------------------+" | |
| ] | |
| }, | |
| "execution_count": 14, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "%sql LIST REFERENCES IN rmoff" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "886ac446", | |
| "metadata": {}, | |
| "source": [ | |
| "### Check that we still see the same data" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 15, | |
| "id": "bfab197f", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<table>\n", | |
| " <tr>\n", | |
| " <th>borough</th>\n", | |
| " <th>permit_cnt</th>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Queens</td>\n", | |
| " <td>168</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Bronx</td>\n", | |
| " <td>28</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Manhattan</td>\n", | |
| " <td>463</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Brooklyn</td>\n", | |
| " <td>334</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Staten Island</td>\n", | |
| " <td>7</td>\n", | |
| " </tr>\n", | |
| "</table>" | |
| ], | |
| "text/plain": [ | |
| "+---------------+------------+\n", | |
| "| borough | permit_cnt |\n", | |
| "+---------------+------------+\n", | |
| "| Queens | 168 |\n", | |
| "| Bronx | 28 |\n", | |
| "| Manhattan | 463 |\n", | |
| "| Brooklyn | 334 |\n", | |
| "| Staten Island | 7 |\n", | |
| "+---------------+------------+" | |
| ] | |
| }, | |
| "execution_count": 15, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "%sql SELECT borough, count(*) permit_cnt FROM rmoff.permits GROUP BY borough" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "2ddc4ff7", | |
| "metadata": {}, | |
| "source": [ | |
| "# Write" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "d88056be", | |
| "metadata": {}, | |
| "source": [ | |
| "Update the dataframe to remove rows matching predicate. " | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 16, | |
| "id": "82d21f90", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<table>\n", | |
| " <tr>\n", | |
| " </tr>\n", | |
| "</table>" | |
| ], | |
| "text/plain": [ | |
| "++\n", | |
| "||\n", | |
| "++\n", | |
| "++" | |
| ] | |
| }, | |
| "execution_count": 16, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "%sql DELETE FROM rmoff.permits WHERE borough='Manhattan'" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "3f6c3a04", | |
| "metadata": {}, | |
| "source": [ | |
| "## Inspecting the staged/unpublished data" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "9c09de33", | |
| "metadata": {}, | |
| "source": [ | |
| "### Staged/unpublished data" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "d73e5ba8", | |
| "metadata": {}, | |
| "source": [ | |
| "#### The changes are reflected in the table:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 17, | |
| "id": "cabb818c", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<table>\n", | |
| " <tr>\n", | |
| " <th>borough</th>\n", | |
| " <th>permit_cnt</th>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Queens</td>\n", | |
| " <td>168</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Bronx</td>\n", | |
| " <td>28</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Brooklyn</td>\n", | |
| " <td>334</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Staten Island</td>\n", | |
| " <td>7</td>\n", | |
| " </tr>\n", | |
| "</table>" | |
| ], | |
| "text/plain": [ | |
| "+---------------+------------+\n", | |
| "| borough | permit_cnt |\n", | |
| "+---------------+------------+\n", | |
| "| Queens | 168 |\n", | |
| "| Bronx | 28 |\n", | |
| "| Brooklyn | 334 |\n", | |
| "| Staten Island | 7 |\n", | |
| "+---------------+------------+" | |
| ] | |
| }, | |
| "execution_count": 17, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "%sql SELECT borough, count(*) permit_cnt FROM rmoff.permits GROUP BY borough" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "60e26a26", | |
| "metadata": {}, | |
| "source": [ | |
| "### Published data" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "9e5a4eb7", | |
| "metadata": {}, | |
| "source": [ | |
| "The data on the `main` branch remains unchanged. We can validate this by running a query against the data, specifying `main` as the branch using the `@<branch>` suffix:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 18, | |
| "id": "6fefa576", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<table>\n", | |
| " <tr>\n", | |
| " <th>borough</th>\n", | |
| " <th>permit_cnt</th>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Queens</td>\n", | |
| " <td>168</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Bronx</td>\n", | |
| " <td>28</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Manhattan</td>\n", | |
| " <td>463</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Brooklyn</td>\n", | |
| " <td>334</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Staten Island</td>\n", | |
| " <td>7</td>\n", | |
| " </tr>\n", | |
| "</table>" | |
| ], | |
| "text/plain": [ | |
| "+---------------+------------+\n", | |
| "| borough | permit_cnt |\n", | |
| "+---------------+------------+\n", | |
| "| Queens | 168 |\n", | |
| "| Bronx | 28 |\n", | |
| "| Manhattan | 463 |\n", | |
| "| Brooklyn | 334 |\n", | |
| "| Staten Island | 7 |\n", | |
| "+---------------+------------+" | |
| ] | |
| }, | |
| "execution_count": 18, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "%sql SELECT borough, count(*) permit_cnt FROM rmoff.`permits@main` GROUP BY borough" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "a58244fc", | |
| "metadata": {}, | |
| "source": [ | |
| "# Audit " | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "590d8ce7", | |
| "metadata": {}, | |
| "source": [ | |
| "At the moment the data is written to the audit branch (`etl_job_42`), but not published to `main`. \n", | |
| "\n", | |
| "How you audit the data is up to you. The nice thing about the data being staged is that you can do it within the same ETL job, or have another tool do it.\n", | |
| "\n", | |
| "Here's a very simple example of doing in Python. We're going to programatically check that only the four expected boroughs remain in the data.\n", | |
| "\n", | |
| "First, we define those that are expected:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 19, | |
| "id": "62d6b827", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "expected_boroughs = {\"Queens\", \"Brooklyn\", \"Bronx\", \"Staten Island\"}" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "3c05b9a5", | |
| "metadata": {}, | |
| "source": [ | |
| "Then we get a set of the actual boroughs in the staged data" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 20, | |
| "id": "77abc608", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "distinct_boroughs = spark.sql(\"SELECT DISTINCT borough FROM rmoff.permits\").toLocalIterator()\n", | |
| "boroughs = {row[0] for row in distinct_boroughs}" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "81156579", | |
| "metadata": {}, | |
| "source": [ | |
| "Now we do two checks:\n", | |
| "\n", | |
| "1. Compare the length of the expected vs actual set\n", | |
| "2. Check that the two sets when unioned are still the same length. This is necessary, since the first test isn't sufficient alone" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 21, | |
| "id": "c91668d4", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "Audit has passed 🙌🏻\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "if ( (len(boroughs) != len(expected_boroughs)) \\\n", | |
| " or (len(boroughs) != len(set.union(boroughs, expected_boroughs))) \\\n", | |
| " or (len(expected_boroughs) != len(set.union(boroughs, expected_boroughs)))):\n", | |
| " raise ValueError(f\"Audit failed, borough set does not match expected boroughs: {boroughs} != {expected_boroughs}\")\n", | |
| "else:\n", | |
| " print(f\"Audit has passed 🙌🏻\")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "d19526bb", | |
| "metadata": {}, | |
| "source": [ | |
| "# Publish" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "6dc07949", | |
| "metadata": {}, | |
| "source": [ | |
| "Publishing data in Nessie means merging the audit branch back into `main`, making it available to anyone working with the data in that branch." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 22, | |
| "id": "dd0375e9", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<table>\n", | |
| " <tr>\n", | |
| " <th>name</th>\n", | |
| " <th>hash</th>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>main</td>\n", | |
| " <td>48601a4f7a54d2d3d985e72da3f9760cc2b8db9e72c98a2004d9fd9671d30bc5</td>\n", | |
| " </tr>\n", | |
| "</table>" | |
| ], | |
| "text/plain": [ | |
| "+------+------------------------------------------------------------------+\n", | |
| "| name | hash |\n", | |
| "+------+------------------------------------------------------------------+\n", | |
| "| main | 48601a4f7a54d2d3d985e72da3f9760cc2b8db9e72c98a2004d9fd9671d30bc5 |\n", | |
| "+------+------------------------------------------------------------------+" | |
| ] | |
| }, | |
| "execution_count": 22, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "%sql MERGE BRANCH {branch} INTO main IN rmoff" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "d38524dd", | |
| "metadata": {}, | |
| "source": [ | |
| "## Inspecting the published data" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 25, | |
| "id": "0c3dbf77", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<table>\n", | |
| " <tr>\n", | |
| " <th>borough</th>\n", | |
| " <th>permit_cnt</th>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Queens</td>\n", | |
| " <td>168</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Bronx</td>\n", | |
| " <td>28</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Brooklyn</td>\n", | |
| " <td>334</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Staten Island</td>\n", | |
| " <td>7</td>\n", | |
| " </tr>\n", | |
| "</table>" | |
| ], | |
| "text/plain": [ | |
| "+---------------+------------+\n", | |
| "| borough | permit_cnt |\n", | |
| "+---------------+------------+\n", | |
| "| Queens | 168 |\n", | |
| "| Bronx | 28 |\n", | |
| "| Brooklyn | 334 |\n", | |
| "| Staten Island | 7 |\n", | |
| "+---------------+------------+" | |
| ] | |
| }, | |
| "execution_count": 25, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "%sql SELECT borough, count(*) permit_cnt FROM rmoff.`permits@main` GROUP BY borough" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "37af8102", | |
| "metadata": {}, | |
| "source": [ | |
| "You can also change the REFERENCE context back to `main` and query the table directly" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 23, | |
| "id": "6b4fef6a", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<table>\n", | |
| " <tr>\n", | |
| " <th>refType</th>\n", | |
| " <th>name</th>\n", | |
| " <th>hash</th>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Branch</td>\n", | |
| " <td>main</td>\n", | |
| " <td>48601a4f7a54d2d3d985e72da3f9760cc2b8db9e72c98a2004d9fd9671d30bc5</td>\n", | |
| " </tr>\n", | |
| "</table>" | |
| ], | |
| "text/plain": [ | |
| "+---------+------+------------------------------------------------------------------+\n", | |
| "| refType | name | hash |\n", | |
| "+---------+------+------------------------------------------------------------------+\n", | |
| "| Branch | main | 48601a4f7a54d2d3d985e72da3f9760cc2b8db9e72c98a2004d9fd9671d30bc5 |\n", | |
| "+---------+------+------------------------------------------------------------------+" | |
| ] | |
| }, | |
| "execution_count": 23, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "%sql USE REFERENCE main IN rmoff" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 24, | |
| "id": "a735db09", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/html": [ | |
| "<table>\n", | |
| " <tr>\n", | |
| " <th>borough</th>\n", | |
| " <th>permit_cnt</th>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Queens</td>\n", | |
| " <td>168</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Bronx</td>\n", | |
| " <td>28</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Brooklyn</td>\n", | |
| " <td>334</td>\n", | |
| " </tr>\n", | |
| " <tr>\n", | |
| " <td>Staten Island</td>\n", | |
| " <td>7</td>\n", | |
| " </tr>\n", | |
| "</table>" | |
| ], | |
| "text/plain": [ | |
| "+---------------+------------+\n", | |
| "| borough | permit_cnt |\n", | |
| "+---------------+------------+\n", | |
| "| Queens | 168 |\n", | |
| "| Bronx | 28 |\n", | |
| "| Brooklyn | 334 |\n", | |
| "| Staten Island | 7 |\n", | |
| "+---------------+------------+" | |
| ] | |
| }, | |
| "execution_count": 24, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "%sql SELECT borough, count(*) permit_cnt FROM rmoff.permits GROUP BY borough" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "7482b2db", | |
| "metadata": {}, | |
| "source": [ | |
| "# Where Next?" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "814d43a5", | |
| "metadata": {}, | |
| "source": [ | |
| "* For more information about write-audit-publish see [this talk from Michelle Winters](https://www.youtube.com/watch?v=fXHdeBnpXrg&t=1001s) and [this talk from Sam Redai](https://www.dremio.com/wp-content/uploads/2022/05/Sam-Redai-The-Write-Audit-Publish-Pattern-via-Apache-Iceberg.pdf)." | |
| ] | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "display_name": "Python 3 (ipykernel)", | |
| "language": "python", | |
| "name": "python3" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython3", | |
| "version": "3.10.10" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 5 | |
| } |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment