Tracking issue: #10600
Some things to consider:
Although sources and sinks are inverse concepts, sources have a one-to-many relationship with downstream relations, while sinks have a one-to-one relationship with upstream relations. Relations have a zero-to-many relationship with downstream sinks, though, which gets in the way of implementing them as inverse dbt concepts (e.g. using pre- and post-hooks).
Something else to consider is that source and sink configuration might have different ownership than model development in the wild (e.g. data engineers vs. analytics engineers), so it'd be preferable not to tightly couple them.
## Sources

### Option 1: `dbt-external-tables`

From a developer perspective, this would require us to implement Materialize-specific versions of the following macros from `dbt-external-tables`:

- `create_external_table.sql`
- `get_external_build_plan.sql`
- `dropif.sql`

Our implementation wouldn't need to live in the `dbt-external-tables` package; we could simply override the macros within `dbt-materialize` (for reference, see Firebolt's implementation). A rough sketch of what such an override could look like follows the checklist below.
and then:

- Deprecate the custom `source` materialization (codebase + documentation)
- Adapt the dbt integration guide
- Adapt the dbt get started demo
- Consider adapting the MZ Hack Day demo
- Add a new section to the Materialize configurations page in the dbt documentation
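For illustration, here's a minimal sketch of what a Materialize-specific override of `create_external_table` could look like inside `dbt-materialize`. The macro name follows dbt's adapter dispatch convention; the `external` properties and the Kafka/Avro DDL are assumptions and would need branching per source type:

```sql
-- macros/create_external_table.sql in dbt-materialize (hypothetical sketch).
-- dbt's dispatch mechanism picks up the materialize__-prefixed version for
-- the Materialize adapter (see the `dispatch` config in the setup step below).
{% macro materialize__create_external_table(source_node) %}
    {%- set external = source_node.external -%}
    -- Render Materialize CREATE SOURCE DDL instead of Hive-style external
    -- table DDL. This only handles a Kafka+Avro source; other source types
    -- would need their own branches and additional `external` properties
    -- (e.g. the assumed `schema_registry` below).
    create source if not exists {{ source(source_node.source_name, source_node.name) }}
    from kafka broker '{{ external.host }}' topic '{{ external.topic }}'
    format avro using confluent schema registry '{{ external.schema_registry }}'
{% endmacro %}
```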
From a user perspective, defining sources as external tables would have the following workflow:
1. Setting up `dbt-external-tables`

Add the `dbt-external-tables` package to `packages.yml`:

```yaml
packages:
  - package: dbt-labs/dbt_external_tables
    version: <version>
```

Modify `dbt_project.yml` to include:

```yaml
dispatch:
  - macro_namespace: dbt_external_tables
    search_order: ['dbt', 'dbt_external_tables']
```

Install the `dbt-external-tables` package dependency:

```bash
dbt deps
```
2. Defining a source
Define a table as `external` in a `.yml` sources file (e.g. `sources.yml`):

```yaml
sources:
  - name: kafka_source
    loader: kafka
    tables:
      - name: sometable
        external:
          host: 'kafka:9092'
          topic: 'sometopic'
          ...
```
Run `stage_external_sources`, the entrypoint macro of the `dbt-external-tables` package:

```bash
dbt run-operation stage_external_sources
```
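For the `kafka_source` example above, that run-operation would end up executing Materialize DDL roughly like the following (a sketch based on the hypothetical override above; the format, envelope, and schema registry URL are assumptions):

```sql
-- The qualified name comes from the dbt source definition (the schema
-- defaults to the source name when none is set explicitly).
CREATE SOURCE IF NOT EXISTS kafka_source.sometable
FROM KAFKA BROKER 'kafka:9092' TOPIC 'sometopic'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'
ENVELOPE UPSERT;
```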
The biggest downside is that this adds a bunch of overhead to what is the entry point for many users to Materialize: it's not a straightforward workflow.
### Option 2: `pre-hook` on models

From a developer perspective, this would require:

- Implement a `create_source` macro
This option sounds borked from the get-go, since it would tightly couple sources with models (when the relationship between them might not be one-to-one).
1. Defining a `pre-hook` in a(n entry?) model

```sql
{{
    config({
        "materialized": "materializedview",
        "pre-hook": [
            "{{ materialize.create_source(...
                host='kafka:9092',
                topic='source_topic',
                ...) }}"
        ]
    })
}}
```
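A hypothetical `create_source` macro backing that hook could be a thin wrapper around the same DDL (the name and arguments are made up, and the body would need branching per source type):

```sql
-- macros/create_source.sql (hypothetical sketch, Kafka only)
{% macro create_source(name, host, topic) %}
    -- IF NOT EXISTS keeps the pre-hook idempotent across repeated runs.
    create source if not exists {{ name }}
    from kafka broker '{{ host }}' topic '{{ topic }}'
    format bytes
{% endmacro %}
```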
## Sinks

### Option 1: `post-hook` on models

From a developer perspective, this would require:

- Implement a `create_sink` macro (similar to the `unload_table` macro in `dbt-redshift`)
- Consider (automatically) creating an `exposure` for lineage (see Option 2 👇)

and then:

- Deprecate the custom `sink` materialization (codebase + documentation)
- Adapt the dbt integration guide
- Consider adapting the MZ Hack Day demo
- Add a new section to the Materialize configurations page in the dbt documentation
From a user perspective, defining sinks as post-hooks would have the following workflow:
1. Defining a `post-hook` in the model to sink

```sql
{{
    config({
        "materialized": "materializedview",
        "post-hook": [
            "{{ materialize.create_sink(...
                this.materializedview,
                host='kafka:9092',
                topic='sink_topic',
                ...) }}"
        ]
    })
}}
```
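For reference, a minimal sketch of the hypothetical `materialize.create_sink` macro the post-hook would call (the name, arguments, and `CREATE SINK` options are assumptions, and may well change with the upcoming platform work):

```sql
-- macros/create_sink.sql (hypothetical sketch, Kafka+Avro only)
{% macro create_sink(relation, host, topic, schema_registry) %}
    -- Runs as a post-hook, after the model builds, so `relation` (the
    -- materialized view) already exists. Re-runs would need an IF NOT EXISTS
    -- guard or a preceding DROP SINK.
    create sink {{ relation }}_sink
    from {{ relation }}
    into kafka broker '{{ host }}' topic '{{ topic }}'
    format avro using confluent schema registry '{{ schema_registry }}'
{% endmacro %}
```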
### Option 2: custom metadata on `exposures`

From a developer perspective, it's a bit unclear how this could be implemented, since exposures seem like...a purely metadata/documentation-based feature. According to Jeremy from dbt Labs, it might be possible to go this route using the `meta` configuration and some custom macros.

TBH, I'm not sure how this would work since exposures aren't compilable or executable, but maybe we can figure it out based on these two helpful threads:

It's also not possible to use a custom string as the exposure `type` (at least yet, see dbt #2835), so we'd have to go with one of the accepted values: `dashboard`, `notebook`, `analysis`, `ml` or `application`. This mainly dictates how exposures are surfaced in the dbt documentation, and having sinks listed under any of these options isn't ideal.

One of the benefits of using exposures would be having sinks as end nodes in the DAG. In contrast, with post-hooks we'd lose track of lineage information (AFAIU). Maybe there's a way to combine Option 1 and Option 2 (i.e. define a sink as a post-hook and automatically create an exposure for lineage), so we get the best of both worlds?
1. Defining an exposure

Define an exposure with a custom `meta` configuration in a `.yml` properties file:

```yaml
exposures:
  - name: kafka_sink
    type: <exposure-type>
    description: >
      Some description.
    depends_on:
      - ref('mv_churn_prediction')
    meta: ...
    owner:
      email: [email protected]
```
For all cases, credentials should be handled as (secret) environment variables that are inherited from the development environment users are running dbt against.
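As a sketch of what that could look like with dbt's built-in `env_var` function (the variable names and macro arguments here are made up):

```sql
-- Hypothetical pre-hook pulling credentials from the environment, so secrets
-- never land in version control. env_var is resolved when the hook is rendered.
{{
    config({
        "materialized": "materializedview",
        "pre-hook": [
            "{{ materialize.create_source(
                    host=env_var('KAFKA_BROKER'),
                    topic='sometopic',
                    username=env_var('KAFKA_USERNAME'),
                    password=env_var('KAFKA_PASSWORD')) }}"
        ]
    })
}}
```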
Thank you for all of this awesome work Marta! Tackling the source piece of this (I haven't even gotten to the sink part yet) in this thread.

I was playing around with the dbt-get-started demo and got a source created just by defining it in the `source.yml` (committed to my open branch to show you).
Here is the corresponding DAG:

^^^ This is a little bit awkward because of the double entry for the source creation (the source materialization + the actual source node), but it does seem possible to make the green nodes and keep the source DDL around. Important to note that we'll never need to refresh a materialize source or check how fresh it is, which are the big positives of defining dbt sources.
I think our goal needs to be to enable version control on source creation while making things as simple as possible for the user to get up and running. Another thing to keep in mind is that the upcoming platform changes seem like they'll make creating sources even more 'set it and forget it' than they are now.
My thoughts on the two approaches for source creation you outlined:
The external table approach seems like we are shoving a square peg into a round hole. External tables in dbt were modeled off of hive external tables, which need a batch translation layer to keep them up to date. Materialize sources won't ever need to be refreshed, and we'd write a freshness check on a downstream view instead of the actual source anyway.
Using pre-hooks makes sense from a dbt perspective, but may be confusing for folks just getting started. If we do go this route, I think we should allow users to write vanilla SQL. The `CREATE SOURCE` piece of the SQL seems like the only thing consistent across Materialize source types, so I'm not convinced that we need to enable macros for all of the different types of sources we support. Also important to consider is that we don't yet know what the source and sink SQL will actually look like in platform - the secrets piece is still TBD AFAIK.

Note: We will probably(?) need a way for dbt users to create secrets to then use in source and sink statements.
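To illustrate the vanilla-SQL route, the hook could take raw DDL directly instead of going through a typed macro (a sketch using today's pre-platform syntax; the source options, names, and model body are made up):

```sql
-- models/some_model.sql (hypothetical): raw CREATE SOURCE in the hook, no macro.
{{ config(
    materialized='materializedview',
    pre_hook="CREATE SOURCE IF NOT EXISTS kafka_source FROM KAFKA BROKER 'kafka:9092' TOPIC 'sometopic' FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081' ENVELOPE UPSERT"
) }}

select * from kafka_source
```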