Tracking issue: #10600
Some things to consider:
Although sources and sinks are inverse concepts, sources have a one-to-many relationship with downstream relations, while sinks have a one-to-one relationship with upstream relations. Relations have a zero-to-many relationship with downstream sinks, though, which gets in the way of implementing them as inverse dbt concepts (e.g. using pre- and post-hooks).
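The cardinality difference can be sketched in Materialize SQL (hypothetical object names, abbreviated era-appropriate syntax; not meant as exact DDL):

```sql
-- One source, many downstream relations (one-to-many):
CREATE SOURCE purchases
FROM KAFKA BROKER 'kafka:9092' TOPIC 'purchases'
FORMAT BYTES;

CREATE MATERIALIZED VIEW purchase_count AS
    SELECT count(*) FROM purchases;

CREATE MATERIALIZED VIEW purchase_sample AS
    SELECT data FROM purchases;

-- One sink, exactly one upstream relation (one-to-one);
-- a relation can also have zero or many downstream sinks:
CREATE SINK purchase_count_sink
FROM purchase_count
INTO KAFKA BROKER 'kafka:9092' TOPIC 'purchase_count'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';
```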
Something else to consider is that source and sink configuration might have different ownership than model development in the wild (e.g. data engineers vs. analytics engineers), so it'd be preferable not to tightly couple them.
From a developer perspective, this would require us to implement Materialize-specific versions of the following macros from `dbt-external-tables`:

- `create_external_table.sql`
- `get_external_build_plan.sql`
- `dropif.sql`
Our implementation wouldn't need to live in the `dbt-external-tables` package; we could simply override the macros within `dbt-materialize` (for reference, see Firebolt's implementation).
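As a sketch, the override would follow the package's adapter-dispatch naming convention, something like this (the macro body is illustrative only, not the actual implementation):

```sql
-- macros/create_external_table.sql (in dbt-materialize)
{% macro materialize__create_external_table(source_node) %}
    {%- set external = source_node.external -%}
    {# Illustrative: render a CREATE SOURCE statement from the
       source's `external` properties instead of the default
       CREATE EXTERNAL TABLE that other adapters emit #}
    create source {{ source(source_node.source_name, source_node.name) }}
    from kafka broker '{{ external.host }}' topic '{{ external.topic }}'
    ...
{% endmacro %}
```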
and then:

- Deprecate the custom `source` materialization (codebase + documentation)
- Adapt the dbt integration guide
- Adapt the dbt get started demo
- Consider adapting the MZ Hack Day demo
- Add a new section to the Materialize configurations page in the dbt documentation
From a user perspective, defining sources as external tables would have the following workflow:
1. Setting up dbt-external-tables

Add the `dbt-external-tables` package to `packages.yml`:

```yaml
packages:
  - package: dbt-labs/dbt_external_tables
    version: <version>
```

Modify `dbt_project.yml` to include:

```yaml
dispatch:
  - macro_namespace: dbt_external_tables
    search_order: ['dbt', 'dbt_external_tables']
```

Install the `dbt-external-tables` package dependency:

```bash
dbt deps
```
2. Defining a source

Define a table as `external` in `dbt_project.yml`:

```yaml
sources:
  - name: kafka_source
    loader: kafka
    tables:
      - name: sometable
        external:
          host: 'kafka:9092'
          topic: 'sometopic'
          ...
```

Run `stage_external_sources`, the entrypoint macro of the `dbt-external-tables` package:

```bash
dbt run-operation stage_external_sources
```
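Under the hood, the overridden macro would render DDL along these lines (a sketch; the exact Materialize syntax depends on the source type and format, and the schema registry URL is hypothetical):

```sql
CREATE SOURCE kafka_source.sometable
FROM KAFKA BROKER 'kafka:9092' TOPIC 'sometopic'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';
```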
The biggest downside is that this adds considerable overhead to what is many users' entry point into Materialize. It's not a straightforward workflow.
Option 2: `pre-hook` on models

From a developer perspective, this would require:

- Implement a `create_source` macro

This option sounds borked from the get-go, since it would tightly couple sources with models (when the relationship between them might not be one-to-one).
1. Defining a `pre-hook` in a(n entry?) model

```sql
{{
    config({
        "materialized": "materializedview",
        "pre-hook": [
            "{{ materialize.create_source(...
                host='kafka:9092',
                topic='sink_topic',
                ...) }}"
        ]
    })
}}
```
Option 1: `post-hook` on models

From a developer perspective, this would require:

- Implement a `create_sink` macro (similar to the `unload_table` macro in `dbt-redshift`)
- Consider (automatically) creating an `exposure` for lineage (see Option 2 👇)
and then:

- Deprecate the custom `sink` materialization (codebase + documentation)
- Adapt the dbt integration guide
- Consider adapting the MZ Hack Day demo
- Add a new section to the Materialize configurations page in the dbt documentation
From a user perspective, defining sinks as post-hooks would have the following workflow:
1. Defining a `post-hook` in the model to sink

```sql
{{
    config({
        "materialized": "materializedview",
        "post-hook": [
            "{{ materialize.create_sink(...
                this.materializedview,
                host='kafka:9092',
                topic='sink_topic',
                ...) }}"
        ]
    })
}}
```
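The `create_sink` macro would presumably render DDL along these lines (a sketch; object names are hypothetical, and the exact syntax depends on the sink type and format):

```sql
CREATE SINK mv_churn_prediction_sink
FROM mv_churn_prediction
INTO KAFKA BROKER 'kafka:9092' TOPIC 'sink_topic'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';
```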
Option 2: custom metadata on exposures
From a developer perspective, it's a bit unclear how this could be implemented, since `exposures` seem like...a purely metadata/documentation-based feature. According to Jeremy from dbt Labs, it might be possible to go this route using the `meta` configuration and some custom macros.

TBH, I'm not sure how this would work since `exposures` aren't compilable or executable, but maybe we can figure it out based on these two helpful threads:
It's also not possible to use a custom string as the `exposure` type (at least yet, see dbt #2835), so we'd have to go with one of the accepted values: `dashboard`, `notebook`, `analysis`, `ml` or `application`; this mainly dictates how exposures are surfaced in the dbt documentation, and having sinks listed under any of these options isn't ideal.
One of the benefits of using exposures would be having sinks as end nodes in the DAG. In contrast, with `post-hooks` we'd lose track of lineage information (AFAIU). Maybe there's a way to combine Option 1 and Option 2 (i.e. define a sink as a `post-hook` and automatically create an `exposure` for lineage), so we get the best of both worlds?
1. Defining an exposure

Define an `exposure` with a custom `meta` configuration in `dbt_project.yml`:

```yaml
exposures:
  - name: kafka_sink
    type: <exposure-type>
    description: >
      Some description.
    depends_on:
      - ref('mv_churn_prediction')
    meta: ...
    owner:
      email: [email protected]
```
In all cases, credentials should be handled as (secret) environment variables inherited from the environment users run dbt against.
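For example, using dbt's built-in `env_var` function in the source definition (the `sasl_password` property and `KAFKA_SASL_PASSWORD` variable are illustrative names, not a fixed schema):

```yaml
sources:
  - name: kafka_source
    tables:
      - name: sometable
        external:
          host: 'kafka:9092'
          topic: 'sometopic'
          # dbt resolves env_var at parse time, so the secret never
          # needs to be committed to the project
          sasl_password: "{{ env_var('KAFKA_SASL_PASSWORD') }}"
```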
I get where @dataders is coming from re: hackles being raised seeing DDL inside of a dbt model; the original sin here is mine, and I rightfully deserve the blame for it (along with a few other things I noticed as I was reading through the dbt-materialize source code with a much more mature eye after working on a few more adapters.)
If we get back for a second to the core of what makes dbt so magical for analytics engineers, it's the simple fact that you can seamlessly go from a `SELECT` statement that you can copy-paste from any other context to a database object that is integrated into a production pipeline. That magical feeling relies pretty heavily on the fact that almost all SQL databases provide both a `CREATE VIEW AS SELECT ...` and a `CREATE TABLE AS SELECT ...` construct that the dbt adapter can build on to make the magic happen, and obviously for Materialize, we rely on that same magic for constructing views and matviews, and everything works exactly the way we would like it to.

This breaks down, of course, for how we define sources and sinks; e.g. if we prune off the DDL prefix of a `CREATE SOURCE` statement, what we're left with is a stub that isn't actually an executable statement in Materialize in the same way that a `SELECT` statement would be. In my dream world (i.e., what I would do if I were, say, the TL for the developer ecosystem team at MZ), I would change that by modifying the MZ database itself to add a valid token that indicated that I wanted to view what the contents of a `SOURCE` definition would look like, but without actually creating the source -- the same way a `SELECT` statement lets me view the results of a query without materializing a database object. Strawman: call the statement `PREVIEW`, have its syntax conform to everything that you are allowed to do in a `CREATE SOURCE` statement after the initial `CREATE [MATERIALIZED] SOURCE {src_name} IF NOT EXISTS` DDL, and have it simply show you like 100-1000 lines max of output from what you would get if you had created such a `SOURCE` and then done a `SELECT` against it. Then have the `source` materialization in dbt-materialize simply prune off the `PREVIEW` (or alternatively, just ignore a `PREVIEW` token that is inside of a `CREATE SOURCE` DDL statement), add the DDL, and then be on your merry way. FWIW, I suspect that such a `PREVIEW` statement would be helpful to any and all MZ developers, who might like a shortcut way to do the equivalent of a `CREATE SOURCE {src_name}` followed by a `SELECT * FROM {src_name}` to get fast feedback that they had defined their `SOURCE` correctly.

The `SINK` side of things is tougher right now; like, I'm not sure if the same sort of `PREVIEW` token trick would work here (it's not as clear to me what exactly that would mean in that context, but the MZ folks may be able to think of a use case for something like this that I'm unaware of.) dbt exposures might be able to handle some of the sink construction workloads if we get to a good place in terms of the external nodes feature, but that work is still very early and could go any number of different ways, so I wouldn't want to bet on it. Still want to noodle on this one a bit and/or see if the MZ folks think there is a plausible equivalent for `SINK`s to the `PREVIEW` approach for `SOURCE`s (also, obviously, the `PREVIEW` idea might not be the right thing to do in MZ for any number of other reasons I'm unaware of.)
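To make the strawman concrete, a `PREVIEW` statement might read something like this (entirely hypothetical syntax, per the proposal above):

```sql
-- Everything after PREVIEW is what would normally follow
-- CREATE [MATERIALIZED] SOURCE {src_name} IF NOT EXISTS;
-- nothing is actually created, only a bounded sample is returned.
PREVIEW
FROM KAFKA BROKER 'kafka:9092' TOPIC 'sometopic'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';
```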