databricks

Responses to questions

See also the notebooks here: https://github.com/databricks-academy/data-engineering-with-databricks-english

Lakehouse (MANDATORY)

What is Lakehouse?

Lakehouse is a modern, open architecture that combines the best elements of data lakes and data warehouses.

Data warehouses:

  • structured data
  • schema enforcement and governance
  • transactions for integrity and consistent reads
  • responsive BI workloads

Data lakes:

  • open file formats like Parquet
  • compute separate from storage - multiple compute options
  • machine learning, batch processing, end-to-end streaming
  • semi-structured data like JSON and XML
  • unstructured data like audio, video and images

Lakehouse supports everything that data warehouses provide AND lowers the barrier so every business can use the same sophisticated tools & processes as the most advanced data-driven organisations.

  • supports diverse workloads: data science, machine learning, SQL and analytics
  • one platform is simpler and cheaper to manage
  • less data movement
  • open architecture and open-source tech (Delta Lake and MLflow) avoids vendor lock-in
  • future compatible

How does Delta Lake work?

Delta Lake adds metadata and versioning on top of Parquet files, with ACID transaction logs in the storage layer. It unifies batch and streaming and provides:

  • a view of history, with mechanisms to restore or vacuum
  • optimization commands
  • repair partitions
  • purge soft-deleted data (like dropped columns)
  • file compaction

Transaction isolation: Delta uses "optimistic concurrency control". The achievable concurrency level depends on the workload; with append-only writes, readers get snapshot isolation even while concurrent writes are in progress.

The default transaction isolation level is "write serializable", which ensures writes are serializable; "serializable" ensures both reads and writes are serializable. Serializable means there exists a serial sequence of executing the transactions one at a time that generates the same outcome as that seen in the table.
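
A minimal sketch of inspecting and raising the isolation level via a Delta table property; the table name `events` is hypothetical:

```python
# Assumes a Databricks notebook/cluster where `spark` is already defined.
# `events` is a hypothetical table name used for illustration.

# Check the current table properties, including any isolation level override.
spark.sql("SHOW TBLPROPERTIES events").show(truncate=False)

# Raise the default WriteSerializable level to full Serializable,
# so concurrent reads are also part of the serial ordering.
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')
""")
```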

  • data file sizes - tune with maxRecordsPerFile; more even file sizes mean less skew in parallel reads and quicker data scans, for better performance

  • optimized writes - partition data is collected on a node before writing, with a default target size of 128MB files, reducing the number of files written. Traditionally, writes go directly from nodes to partitions, which creates more files. There is some cost of shuffling, which can pay off in better write throughput, or at least subsequent reads are better. "Adaptive shuffle" means you can remove coalesce and repartition calls

  • auto compaction - compacts to 128MB files

  • optimize - compacts to 1GB files

  • auto optimize - on by default for merge, update and delete. minNumFiles defaults to 50; this can be good (data skipping, minimized file re-writes) or bad (over-partitioning)

  • Z-order indexes - best for data skipping to reduce data read. Colocates related data, enabling data skipping (partition pruning), less skew in parallel reads, and quicker scans. Uses the Z-order curve algorithm to map multidimensional data into a single dimension for easy traversal in a binary tree or skip list. A good sorting index across multiple columns when they are used in any order

What if you Z-order twice? Only new rows will be sorted. Z-ordering relies on having column stats, so the column must be within delta.dataSkippingNumIndexedCols (default 32). Z-order on a high-cardinality column that appears in predicates - used in where clauses or joins.

Auto optimize DOES NOT support Z-ordering; you need to OPTIMIZE and then Z-order periodically for large (10TB+) tables.

  • Vacuum file retention defaults to 7 days ... know the default values and when you would change them (see the sketch after this list)
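
A minimal sketch of a periodic maintenance job combining the commands above, assuming a hypothetical Delta table `events` with a `user_id` column used in predicates:

```python
# Assumes a Databricks environment where `spark` is defined.
# `events` and `user_id` are hypothetical names for illustration.

# Compact small files (target ~1GB) and Z-order by a high-cardinality
# column that appears in WHERE clauses or join keys.
spark.sql("OPTIMIZE events ZORDER BY (user_id)")

# Remove files no longer referenced by the transaction log and older
# than the retention window (default 7 days = 168 hours).
spark.sql("VACUUM events RETAIN 168 HOURS")
```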

How does Auto Loader work?

  • cloudFiles source
  • python & SQL within DLT
  • parquet, ORC, CSV, Avro, JSON, text, binaryFile
  • key-value checkpoints in rocksDB in checkpoint location
  • resume with exactly-once guarantees, fault tolerance all managed for you
  • dir list vs file notify mode
    • dir list is optimised by running in parallel; can do lexical/incremental listing on Azure/GCP
    • notification service + queue (S3 + SQS queue + SNS subscription) - better for scaling; more config and permissions required
  • trigger regular backfills to deal with cloud providers' lack of latency SLAs on file events (see the sketch after this list)
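
A minimal Auto Loader sketch using the cloudFiles source; the paths and table name are hypothetical:

```python
# Assumes a Databricks notebook where `spark` is defined.
stream = (
    spark.readStream
    .format("cloudFiles")                     # Auto Loader source
    .option("cloudFiles.format", "json")      # input file format
    .option("cloudFiles.schemaLocation",      # where inferred schema and
            "/mnt/raw/_schemas/events")       # evolution state are stored
    .load("/mnt/raw/events")                  # hypothetical landing path
)

(
    stream.writeStream
    .option("checkpointLocation", "/mnt/raw/_checkpoints/events")
    .trigger(availableNow=True)               # incremental, batch-style run
    .toTable("bronze_events")                 # exactly-once into a Delta table
)
```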

Delta Live Tables

  • extends Spark Structured Streaming - an abstraction layer that provides simplicity: expectations, static and live tables, workflows / multi-task jobs (see the sketch below)
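
A minimal DLT sketch in Python; `dlt` is the Delta Live Tables module available inside a DLT pipeline, and the path and table names are hypothetical:

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events ingested with Auto Loader")
def bronze_events():
    # Ingest incrementally with Auto Loader (hypothetical path).
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/events")
    )

@dlt.table(comment="Cleaned events")
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")  # expectation: drop bad rows
def silver_events():
    return dlt.read_stream("bronze_events").where(col("event_type").isNotNull())
```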

I have another partner proposing a DW-centric solution. Why Lakehouse?

Key themes: automated prediction and decisions; streaming & ML.

Data warehouses are good for their traditional business use case of BI and Dashboards. Those are useful outcomes but are much lower down the curve of maturity when compared to high value competitive advantage provided by automated prediction and decisions, which are enabled by modern data processing paradigms like realtime streaming and Machine Learning with live inferences. The Lakehouse supports everything the DW-centric solution supports as well as enabling the same sophisticated tooling and approaches used by the most advanced data-driven organisations.

Why would you migrate from Hadoop to Databricks?

When would you turn auto-optimise off?

For streaming workloads - run a scheduled OPTIMIZE instead of compacting continuously.

Depending on the nature of your work, you are also expected to have a well-rounded awareness of two out of the three topics below.

Unity Catalog & Governance

  • user management & metastore in the account layer, not the workspace layer
  • Single definition, standards compliant, built-in auditing
  • Metastores must be attached to workspaces

metastore > catalogs > schemas > external tables, managed tables, views (materialized?)
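
A minimal sketch of the three-level namespace and a grant; catalog, schema, table and group names are hypothetical:

```python
# Assumes a Unity Catalog-enabled workspace where `spark` is defined.
spark.sql("CREATE CATALOG IF NOT EXISTS main")
spark.sql("CREATE SCHEMA IF NOT EXISTS main.sales")

# Three-level namespace: catalog.schema.table
spark.sql("SELECT * FROM main.sales.orders LIMIT 10").show()

# Centralised governance: the grant lives in the metastore, not the workspace.
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")
```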

Delta sharing

How do you set up and provide access for a recipient? Choose a metastore, enable Delta Sharing, configure the recipient token lifetime, then create a share and a recipient and grant access (see the sketch below).
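
A minimal provider-side sketch using the Delta Sharing SQL commands; the share, recipient and table names are hypothetical:

```python
# Assumes Delta Sharing is enabled on the metastore and `spark` is defined.
spark.sql("CREATE SHARE IF NOT EXISTS sales_share")
spark.sql("ALTER SHARE sales_share ADD TABLE main.sales.orders")

# Creating a recipient generates an activation link / bearer token for them.
spark.sql("CREATE RECIPIENT IF NOT EXISTS partner_co")
spark.sql("GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_co")
```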

E2

Control plane -> data plane. One account with multiple workspaces. Secure clusters - no public IPs (except for PyPI libraries). Customer-managed KMS keys.

DBFS lives in the customer-side data plane. The web console, notebooks, jobs & queries, and cluster management live in the control plane. Logs stay in the control plane unless configured to write to a logging bucket in the data plane.

PrivateLink options for front-end & back-end connectivity

Building Data Pipelines on Databricks

What is the Bronze / Silver / Gold architecture?

  • Streaming
  • What are Auto Loaders? What is the value?
  • What are Delta Live Tables? What is the value?
  • What are Multi-task Jobs? What is the value?
  • How to orchestrate jobs from external orchestrators, e.g. Airflow?
  • How does the Delta Change Data Feed make pipeline building simpler and faster? (see the sketch after this list)
  • How to handle logging, testing and CI/CD?
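
A minimal Change Data Feed sketch, assuming a hypothetical Delta table `silver_events` and a hypothetical starting version:

```python
# Assumes `spark` is defined; names and versions are hypothetical.
# Enable the change feed on an existing Delta table.
spark.sql("""
    ALTER TABLE silver_events
    SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')
""")

# Read only the rows that changed since version 5, instead of re-scanning
# the whole table to find inserts, updates and deletes.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .table("silver_events")
)
changes.select("_change_type", "_commit_version").show()
```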

Analytics using Databricks SQL

Why is this feature important? Queries and Dashboards. Integrations with external tools like Power BI & Tableau. Serverless vs Pro vs Classic endpoints:

  • Classic runs on customer infra

  • Pro

  • Serverless

How to fine-tune performance and handle endpoint sizing?

  • 40 concurrent users

  • scaling is by adding clusters (each cluster can still be multiple nodes)

Machine Learning on Databricks

ML Runtimes, Feature Stores, AutoML, Model Serving (serverless). MLflow - use-cases and how it could benefit (see the sketch after this list):

  • Tracking - log params, code versions, metrics and artifacts
  • Projects - packaging reusable DS code, a dir or git repo, project version can be the commit, can be chained into workflows
  • Models - a dir with files and a descriptor file. Tracking remembers which project each model came from
  • Registry - model store, set of APIs for lifecycle. model lineage for experiments, model versions, stage transitions (e.g. to prod)
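
A minimal MLflow tracking sketch; the run name, param and metric values are illustrative:

```python
import mlflow

# Each run records params, metrics and artifacts under an experiment.
with mlflow.start_run(run_name="example-run"):
    mlflow.log_param("alpha", 0.5)    # hyperparameter for this run
    mlflow.log_metric("rmse", 0.73)   # evaluation result
    # mlflow.sklearn.log_model(model, "model")  # would also log the model artifact
```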

Feature store

A Delta table plus additional metadata:

  • must have a PK
  • tracks its lineage
  • online & offline store (publish to the online store for low-latency, realtime model inference)
  • supports streaming
  • a training set can be extracted from the feature store
  • doesn't work with Unity Catalog; must use the default Hive metastore
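
A minimal Feature Store sketch using the workspace client; the table, column names and sample data are hypothetical:

```python
from databricks.feature_store import FeatureStoreClient

# Assumes a Databricks ML runtime where `spark` is defined.
df = spark.createDataFrame([(1, 0.42)], ["user_id", "avg_basket_value"])

fs = FeatureStoreClient()

# Register a feature table backed by a Delta table; a primary key is mandatory.
fs.create_table(
    name="ml.user_features",          # database.table in the Hive metastore
    primary_keys=["user_id"],
    df=df,
    description="Per-user aggregate features",
)
```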

Delta Sharing - provider and recipient; a token & URL with table-level access; uses short-lived URLs

deep clone - disaster recovery & backup (see the sketch below)
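
A minimal deep clone sketch for backup/DR; the table names are hypothetical:

```python
# Deep clone copies both metadata and data files, so the clone is an
# independent table suitable for backup/DR. Assumes `spark` is defined.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.backup.orders_clone
    DEEP CLONE main.sales.orders
""")
```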

Study topics for Databricks

Lakehouse Pitch

Data Lakehouse

Medallion Architecture, ACID

https://docs.databricks.com/lakehouse/index.html

Delta Lake

Transaction Logs, ACID, Time Travel, Upserts, Change Data Feed

https://www.databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html https://docs.databricks.com/delta/index.html https://docs.databricks.com/_static/notebooks/delta/cdf-demo.html https://www.databricks.com/blog/2020/11/23/acid-transactions-on-data-lakes.html https://www.databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html https://www.databricks.com/blog/2021/06/09/how-to-simplify-cdc-with-delta-lakes-change-data-feed.html

Auto-Loader

Concepts, Schema Inference and Evolution in Auto Loader, File Notification and Directory Listing modes, Auto Loader Options, Common Data Loading Patterns

https://docs.databricks.com/ingestion/auto-loader/index.html https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/

Delta Live Tables

Delta Live Tables Concepts, Data Quality, Data Sources, Streaming, Change Data Capture, API, Event Logs, Workflow Orchestration

https://docs.databricks.com/workflows/delta-live-tables/index.html https://learn.microsoft.com/en-gb/azure/databricks/workflows/delta-live-tables/

Workflows

Orchestrate Jobs using Interactive and/or Job clusters, Delta Live Tables. Jobs APIs, Manage dependencies in Data Pipelines

https://www.databricks.com/product/workflows https://www.databricks.com/blog/2022/05/10/introducing-databricks-workflows.html https://docs.databricks.com/workflows/index.html https://learn.microsoft.com/en-gb/azure/databricks/workflows/

Clusters

Create Clusters, Configure Clusters, Cluster Policies, Autoscaling, Security Mode, Spark Configuration, Init Scripts, Cluster Logs, Monitor Performance, Pools, Best Practices, Logging, etc.

https://learn.microsoft.com/en-gb/azure/databricks/clusters/ https://docs.databricks.com/clusters/index.html

Streaming

Apache Spark Streaming Concepts, Structured Streaming & Unity Catalog, Pub/Sub Systems, foreachBatch, etc.

https://www.databricks.com/spark/getting-started-with-apache-spark/streaming https://docs.databricks.com/structured-streaming/index.html https://learn.microsoft.com/en-gb/azure/databricks/structured-streaming/ https://www.databricks.com/session/a-deep-dive-into-structured-streaming/ https://docs.databricks.com/administration-guide/account-settings/audit-logs.html https://learn.microsoft.com/en-us/azure/databricks/administration-guide/account-settings/azure-diagnostic-logs

Databricks SQL

SQL warehouse, Grant permissions, Serverless, security model and data access overview, Query Profile, Query Caching, Transfer Ownership, ODBC driver connection to SQL Warehouse, Databricks SQL API reference, PAT Token Auth, Encrypt queries

https://docs.databricks.com/sql/admin/sql-endpoints.html https://docs.databricks.com/sql/get-started/admin-onboarding.html#step-2-grant-permissions https://docs.databricks.com/sql/admin/serverless.html https://docs.databricks.com/sql/user/security/data-access-overview.html https://docs.databricks.com/sql/admin/query-profile.html https://docs.databricks.com/sql/admin/query-caching.html https://docs.databricks.com/sql/admin/transfer-ownership.html#transfer-ownership-of-databricks-sql-objects https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#odbc-driver https://docs.databricks.com/sql/api/index.html#databricks-sql-api-reference https://docs.databricks.com/sql/api/authentication.html https://docs.databricks.com/sql/user/security/encryption.html https://www.databricks.com/blog/2021/08/11/how-we-achieved-high-bandwidth-connectivity-with-bi-tools.html

Optimization and Performance

Dynamic File Pruning, Shuffle Merge, Adaptive Query Execution, Cost-Based Optimizer, Range & Skew Join Optimization, Z-Order, Bloom Filter Indexes, Isolation Levels, Clones

https://docs.databricks.com/optimizations/index.html https://www.cloudiqtech.com/partition-optimize-and-zorder-delta-tables-in-azure-databricks/ https://www.youtube.com/watch?v=CLDcdVDupMg https://docs.databricks.com/optimizations/clone.html https://www.databricks.com/blog/2020/09/15/easily-clone-your-delta-lake-for-testing-sharing-and-ml-reproducibility.html

Git Integration

Integrate Databricks with Source Repository https://docs.databricks.com/repos/index.html https://learn.microsoft.com/en-gb/azure/databricks/repos/

Security & Compliance

Access & Secret Management, Managing Keys, Credential Passthrough, Secure Cluster Connectivity, No Public IP, Firewall Rules, Encrypt Traffic, FedRAMP, HIPAA, PCI-DSS, Monitoring https://docs.databricks.com/security/index.html https://learn.microsoft.com/en-gb/azure/databricks/security/

Unity Catalog

Metastore, Catalog, Database/Schema, Table Access Control, Credential Passthrough, Storage Credentials, External Location, Managed & External Tables, Admin Roles of Unity Catalog, Cluster Access Modes, Data Lineage, Migrate to Unity Catalog, Deployment of Unity Catalog using Terraform

https://docs.databricks.com/data-governance/unity-catalog/index.html https://learn.microsoft.com/en-gb/azure/databricks/data-governance/unity-catalog/ https://www.databricks.com/dataaisummit/session/practitioners-guide-unity-catalog-technical-deep-dive https://docs.databricks.com/dbfs/unity-catalog.html https://learn.microsoft.com/en-us/azure/databricks/dbfs/unity-catalog https://docs.databricks.com/data-governance/unity-catalog/migrate.html https://learn.microsoft.com/en-gb/azure/databricks/data-governance/unity-catalog/migrate https://registry.terraform.io/providers/databricks/databricks/latest/docs/guides/unity-catalog-azure

Setup Unity Catalog for an Organization, SCIM provisioning

https://docs.databricks.com/data-governance/unity-catalog/get-started.html https://learn.microsoft.com/en-gb/azure/databricks/data-governance/unity-catalog/get-started https://learn.microsoft.com/en-us/azure/databricks/administration-guide/users-groups/scim/ https://docs.databricks.com/administration-guide/users-groups/scim/index.html

Programming

Python UDFs https://www.databricks.com/blog/2022/07/22/power-to-the-sql-people-introducing-python-udfs-in-databricks-sql.html
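
A minimal sketch of a SQL-callable Python UDF as introduced in that post; the function name and logic are illustrative:

```python
# Assumes a Databricks SQL warehouse / runtime with Python UDF support;
# `redact_email` is a hypothetical function name.
spark.sql("""
    CREATE OR REPLACE FUNCTION redact_email(email STRING)
    RETURNS STRING
    LANGUAGE PYTHON
    AS $$
      local, _, domain = email.partition('@')
      return local[:1] + '***@' + domain
    $$
""")
spark.sql("SELECT redact_email('jane.doe@example.com')").show()
```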

Machine Learning

ML Runtime

https://docs.databricks.com/runtime/mlruntime.html#:~:text=Introduction%20to%20Databricks%20Runtime%20for%20Machine%20Learning,-This%20tutorial%20is&text=It%20takes%20about%2010%20minutes,API%20and%20MLflow%20Model%20Registry https://learn.microsoft.com/en-us/azure/databricks/runtime/mlruntime https://www.youtube.com/watch?app=desktop&v=5v1qFWDCKd4&t=1256s

MLflow

https://docs.databricks.com/mlflow/index.html https://learn.microsoft.com/en-gb/azure/databricks/mlflow/ https://www.youtube.com/watch?v=5CpaimNhMzs https://mlflow.org/

Feature Store

https://docs.databricks.com/machine-learning/feature-store/index.html https://www.youtube.com/watch?v=avt1s0Q7hf8 https://www.youtube.com/watch?v=ia5ZxFDPPzo https://www.youtube.com/watch?v=l6xfFYZAyns

AutoML

https://docs.databricks.com/machine-learning/automl/index.html/ https://cprosenjit.medium.com/databricks-automl-model-serving-8b35181728e3 https://www.youtube.com/watch?v=m4QuLsrgb44 https://www.youtube.com/watch?v=XIFZetgx96g

Model Serving

https://docs.databricks.com/mlflow/model-serving.html https://docs.databricks.com/machine-learning/model-inference/index.html https://learn.microsoft.com/en-us/azure/databricks/machine-learning/manage-model-lifecycle/ https://cprosenjit.medium.com/databricks-automl-model-serving-8b35181728e3 https://www.youtube.com/watch?v=Q8ieZEiDTbE https://www.youtube.com/watch?v=dMU3nIl-jDA https://www.youtube.com/watch?v=JApPzAnbfPI
