@morsapaes · Last active May 23, 2022 19:58

Evolving the dbt-materialize adapter

Tracking issue: #10600

Some things to consider:

Although sources and sinks are inverse concepts, sources have a one-to-many relationship with downstream relations, while sinks have a one-to-one relationship with upstream relations. Relations have a zero-to-many relationship with downstream sinks, though, which gets in the way of implementing them as inverse dbt concepts (e.g. using pre- and post-hooks).

Something else to consider is that source and sink configuration might have different ownership than model development in the wild (e.g. data engineers vs. analytics engineers), so it'd be preferable not to tightly couple them.

Sources

Option 1: dbt-external-tables

From a developer perspective, this would require us to implement Materialize-specific versions of the following macros from dbt-external-tables:

  • create_external_table.sql
  • get_external_build_plan.sql
  • dropif.sql

Our implementation wouldn't need to live in the dbt-external-tables package; we could simply override the macros within dbt-materialize (for reference, see Firebolt's implementation).
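
For illustration, the Materialize-specific override could look something like the sketch below. This is a rough sketch only: the macro name follows dbt-external-tables' dispatch convention, but the external properties (host, topic, schema_registry) and the Kafka/Avro options are placeholders, not the full CREATE SOURCE grammar.

-- macros/materialize__create_external_table.sql (hypothetical sketch)
{% macro materialize__create_external_table(source_node) %}
    {%- set external = source_node.external -%}
    {#- render Kafka/Avro source DDL from the YAML-defined external properties -#}
    create materialized source if not exists {{ source(source_node.source_name, source_node.name) }}
    from kafka broker '{{ external.host }}' topic '{{ external.topic }}'
    format avro using confluent schema registry '{{ external.schema_registry }}'
{% endmacro %}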


User workflow

From a user perspective, defining sources as external tables would have the following workflow:

1. Setting up dbt-external-tables

Add the dbt-external-tables package to packages.yml:

packages:
  - package: dbt-labs/dbt_external_tables
    version: <version>

Modify dbt_project.yml to include:

dispatch:
  - macro_namespace: dbt_external_tables
    search_order: ['dbt', 'dbt_external_tables']

Install the dbt-external-tables package dependency:

dbt deps

2. Defining a source

Define a table as external in dbt_project.yml:

sources:
  - name: kafka_source
    loader: kafka

    tables:
      - name: sometable
        external:
          host: 'kafka:9092'
          topic: 'sometopic'
          ...

Run stage_external_sources, the entrypoint macro of the dbt-external-tables package:

dbt run-operation stage_external_sources

The biggest downside is that this adds a bunch of overhead to what is the entry point for users into Materialize. It's not a straightforward workflow.

Option 2: pre-hook on models

From a developer perspective, this would require:

  • Implement a create_source macro

This option sounds borked from the get-go, since it would tightly couple sources with models (when the relationship between them might not be one-to-one).

User workflow

1. Defining a pre-hook in a(n entry?) model

{{
  config({
    "materialized":"materializedview",
    "pre-hook": [
      "{{ materialize.create_source(...
                                  host='kafka:9092',
                                  topic='sometopic',
                                  ...) }}"
    ]
  })
}}
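
For reference, a create_source macro along these lines could back that pre-hook. This is a minimal sketch under assumptions: the parameter names and the Kafka/Avro options are illustrative, not a full grammar.

-- macros/create_source.sql (hypothetical sketch)
-- When called from a pre-hook, the rendered DDL is executed before the model builds.
{% macro create_source(name, host, topic, schema_registry) %}
    create materialized source if not exists {{ name }}
    from kafka broker '{{ host }}' topic '{{ topic }}'
    format avro using confluent schema registry '{{ schema_registry }}'
{% endmacro %}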

Sinks

Option 1: post-hook on models

From a developer perspective, this would require:

  • Implement a create_sink macro (similar to the unload_table macro in dbt-redshift)
  • Consider (automatically) creating an exposure for lineage (see Option 2 👇)

and then:

  • Deprecate the custom sink materialization (codebase+documentation)
  • Adapt the dbt integration guide
  • Consider adapting the MZ Hack Day demo
  • Add a new section to the Materialize configurations page in the dbt documentation

User workflow

From a user perspective, defining sinks as post-hooks would have the following workflow:

1. Defining a post-hook in the model to sink

{{
  config({
    "materialized":"materializedview",
    "post-hook": [
      "{{ materialize.create_sink(...
                                  this.materializedview,
                                  host='kafka:9092',
                                  topic='sink_topic',
                                  ...) }}"
    ]
  })
}}
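
For reference, the create_sink macro called above could look roughly like the sketch below. The argument names and the Kafka/Avro options are assumptions, not the actual adapter API.

-- macros/create_sink.sql (hypothetical sketch)
-- When called from a post-hook, the rendered DDL runs right after the materialized view is created.
{% macro create_sink(name, from_relation, host, topic, schema_registry) %}
    create sink if not exists {{ name }}
    from {{ from_relation }}
    into kafka broker '{{ host }}' topic '{{ topic }}'
    format avro using confluent schema registry '{{ schema_registry }}'
{% endmacro %}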

Option 2: custom metadata on exposures

From a developer perspective, it's a bit unclear how this could be implemented since exposures seem like...a purely metadata/documentation-based feature. According to Jeremy from dbt Labs, it might be possible to go this route using the meta configuration and some custom macros.

TBH, I'm not sure how this would work since exposures aren't compilable or executable, but maybe we can figure it out based on these two helpful threads:

It's also not possible to use a custom string as the exposure type (at least yet, see dbt #2835), so we'd have to go with one of the accepted values: dashboard, notebook, analysis, ml or application; this mainly dictates how exposures are surfaced in the dbt documentation, and having sinks listed under any of these options isn't ideal.

One of the benefits of using exposures would be having sinks as end nodes in the DAG. In contrast, with post-hooks we'd lose track of lineage information (AFAIU). Maybe there's a way to combine Option 1 and Option 2 (i.e. define a sink as a post-hook and automatically create an exposure for lineage), so we get the best of both worlds?

User workflow

1. Defining an exposure

Define an exposure with a custom meta configuration in dbt_project.yml:

exposures:
  - name: kafka_sink
    type: <exposure-type>
    description: >
      Some description.
    
    depends_on:
      - ref('mv_churn_prediction')
      
    meta: ...
      
    owner:
      email: [email protected]
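
One way the "meta + custom macros" idea could play out is sketched below. This is speculative: it assumes the sink parameters (from_relation, host, topic, schema_registry) live under a materialize_sink key in each exposure's meta, and that a hypothetical create_sinks macro is invoked via dbt run-operation.

-- macros/create_sinks.sql (hypothetical sketch); invoked with: dbt run-operation create_sinks
{% macro create_sinks() %}
    {% for exposure in graph.exposures.values() %}
        {% if exposure.meta and exposure.meta.materialize_sink %}
            {% set sink = exposure.meta.materialize_sink %}
            {# build and execute the CREATE SINK DDL from the exposure's meta #}
            {% do run_query(
                "create sink if not exists " ~ exposure.name ~
                " from " ~ sink.from_relation ~
                " into kafka broker '" ~ sink.host ~ "' topic '" ~ sink.topic ~ "'" ~
                " format avro using confluent schema registry '" ~ sink.schema_registry ~ "'"
            ) %}
        {% endif %}
    {% endfor %}
{% endmacro %}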

Handling credentials

For all cases, credentials should be handled as (secret) environment variables that are inherited from the development environment users are running dbt against.
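
For example, dbt's built-in env_var function could be used in the hook so credentials never land in the project itself. The create_sink call and its password argument are the same hypothetical ones sketched above, and KAFKA_SASL_PASSWORD is an assumed variable name.

    "post-hook": [
      "{{ materialize.create_sink(..., password=env_var('KAFKA_SASL_PASSWORD'), ...) }}"
    ]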

@morsapaes (Author) commented May 11, 2022

Logging some interesting points from a conversation with a dbt-materialize user, for additional context:

Sources

They're using pre-hooks and a custom create_source macro to handle source creation. This is convenient to them as there is an internal practice of making staging views one-to-one with sources.

Sinks

They're handling sink creation using analyses — SQL statements that get compiled but not executed at runtime. This allows them to delay sink creation until the upstream views are fully hydrated (~15-20 minutes after creation), and make sure there aren't any unexpected race conditions. From an old thread:

User: Do you have any practical advice about whether it makes sense to wait until all the sources have loaded and the views are populated to create sinks, or whether creating them at the same time as the views should perform equally well? We were doing the former, but contemplating the latter as part of moving to dbt.

Frank: Probably best to chill until "the right data" are pretty well loaded, just to take out the randomness of making MZ do lots of different things.

In addition, they're doing Blue-Green deployments and can't have dual publishers. Creating the sinks manually at a later stage also allows them to wait until the Green deployment is ready to take over.
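
To illustrate the pattern, one of these sink analyses might look roughly like the sketch below. It's hypothetical: the file name, topic, and schema registry URL are placeholders, and the point is that dbt only compiles the file, while the resulting DDL is run separately once the upstream views are hydrated.

-- analyses/create_sink_churn_prediction.sql (hypothetical sketch)
-- `dbt compile` renders this file but never runs it; the compiled DDL under
-- target/compiled/ is executed separately (e.g. via psql) once the upstream views are hydrated.
create sink if not exists churn_prediction_sink
from {{ ref('mv_churn_prediction') }}
into kafka broker 'kafka:9092' topic 'churn_predictions'
format avro using confluent schema registry 'http://schema-registry:8081'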


Takeaway:

So far, the only thing I'm 💯 sold on is that we need to provide create_source and create_sink macros that encapsulate the different SQL grammar variations to 1) reduce boilerplate code and 2) avoid each user having to implement (and therefore maintain) their own.

@ahelium commented May 12, 2022

Thank you for all of this awesome work, Marta! Tackling the source piece of this (I haven't even gotten to the sink part yet) in this thread.

I was playing around with the dbt-get-started demo and got a source created just by defining it in the source.yml. (committed to my open branch to show you).

Here is the corresponding DAG:
[DAG screenshot]

^^^ This is a little awkward because of the double entry for source creation (the source materialization + the actual source node), but it does seem possible to create the green nodes and keep the source DDL around. It's important to note that we'll never need to refresh a Materialize source or check how fresh it is, which are the big positives of defining dbt sources.

I think our goal needs to be to enable version control on source creation while making things as simple as possible for the user to get up and running. Another thing to keep in mind is that the upcoming platform changes seem like they'll make creating sources even more 'set it and forget it' than they are now.

My thoughts on the two approaches for source creation you outlined:

The external table approach seems like shoving a square peg into a round hole. External tables in dbt were modeled off of Hive external tables, which need a batch translation layer to keep them up to date. Materialize sources won't ever need to be refreshed, and we'd write a freshness check on a downstream view instead of the actual source anyway.

Using pre-hooks makes sense from a dbt perspective, but may be confusing for folks just getting started. If we do go this route, I think we should allow users to write vanilla SQL. The CREATE SOURCE piece of the SQL seems like the only thing that's consistent across Materialize source types, so I'm not convinced we need to enable macros for all of the different types of sources we support. Also important to consider is that we don't yet know what the source and sink SQL will actually look like in platform; the secrets piece, in particular, is still TBD AFAIK.

Note: We will probably? need a way for dbt users to create secrets to then use in source and sink statements.

@morsapaes (Author) commented May 12, 2022

Another thing to keep in mind is that the upcoming platform changes seem like they'll make creating sources even more 'set it and forget it' than they are now.

The ongoing source work might not affect what we're doing here much (off the top of my head, I can think of CREATE CLUSTER, CREATE CONNECTOR and CREATE SECRET as new DDL statements, but all of that kind of sounds like things that should be handled...through configs, maybe?). We do need a follow-up epic for Cloud-related changes!

The more I think about it, and the more I talk to actual dbt-materialize users, the less hopeful I am that we can find an existing dbt abstraction that fits sources and sinks in an intuitive way. It all feels like "shoving a square peg into a round hole", like you put it. So we're aligned here. 👍

Ah, a relevant comment from Jeremy that I forgot to include:

For context: I am thinking, later this year, about ways we could add better support in dbt-core for configuring "other database-specific constructs" — while leaving it up to each adapter plugin (or each user) to define what those constructs actually are, the configs they can take, and the SQL needed to create/manage each one so the answer here could be: we don't have the right answer just yet, but I'd really like to!

It might be worth just waiting for dbt to be more flexible and see what we can do then, and for now just add a bunch of niceties that make life easier for users (like auto-generated templates for create_source and create_sink macros, or generic macros that somehow allow you to reuse core source configuration and just take whatever is customizable as arguments). Let's see what Josh thinks tomorrow, and also if Anders has any comments, too!

@dataders commented:

I'm still in favor of the dbt-external-tables paradigm. I have a great deal of empathy for the "square peg into a round hole" perspective. But I echo @jtcohen6 in saying that dbt-core should likely add some sharper corners to its hole (forgive my shitty metaphor).

We know that the peg doesn't fit well when we observe that dbt-materialize's materializations largely just inject DDL SQL with materialize__create_arbitrary_object(). In the "pure" dbt world, there is no DDL in a dbt project; that is all boilerplate abstracted away from the user. So my hackles do rise a bit when I see CREATE statements in a dbt model file. At least in dbt-external-tables, CREATE statements are welcome, though only CREATE EXTERNAL TABLE right now (the integration tests do also make use of CREATE EXTERNAL SOURCE/STAGE).

This makes me think macros and YAML definitions might be best for sources and sinks?

{% macro materialize__create_arbitrary_object(sql) -%}
    {{ sql }}
{%- endmacro %}

P.S. I'm still not sure what the workflow is for using dbt-materialize alongside another dbt project (isn't this the normal use case?). Maybe @morsapaes you and I can discuss this later.

@jwills commented May 16, 2022

I get where @dataders is coming from re: hackles being raised seeing DDL inside of a dbt model; the original sin here is mine, and I rightfully deserve the blame for it (along with a few other things I noticed as I was reading through the dbt-materialize source code with a much more mature eye after working on a few more adapters.)

If we get back for a second to the core of what makes dbt so magical for analytics engineers, it's the simple fact that you can seamlessly go from a SELECT statement that you can copy-paste from any other context to a database object that is integrated into a production pipeline. That magical feeling relies pretty heavily on the fact that almost all SQL databases provide both a CREATE VIEW AS SELECT ... and a CREATE TABLE AS SELECT ... construct that the dbt adapter can build on to make the magic happen, and obviously for Materialize, we rely on that same magic for constructing views and matviews, and everything works exactly the way we would like it to.

This breaks down, of course, for how we define sources and sinks; e.g. if we prune off the DDL prefix of a CREATE SOURCE statement, what we're left with is a stub that isn't actually an executable statement in Materialize in the same way that a SELECT statement would be. In my dream world (i.e., what I would do if I was, say, the TL for the developer ecosystem team at MZ), I would change that by modifying the MZ database itself to add a valid token that indicates I want to view what the contents of a SOURCE definition would look like, without actually creating the source, the same way a SELECT statement lets me view the results of a query without materializing a database object. Strawman: call the statement PREVIEW, have its syntax conform to everything that you are allowed to do in a CREATE SOURCE statement after the initial CREATE [MATERIALIZED] SOURCE {src_name} IF NOT EXISTS DDL, and have it simply show you like 100-1000 lines max of output from what you would get if you had created such a SOURCE and then done a SELECT against it. Then have the source materialization in dbt-materialize simply prune off the PREVIEW (or alternatively, just ignore a PREVIEW token that is inside of a CREATE SOURCE DDL statement), add the DDL, and then be on your merry way. FWIW, I suspect that such a PREVIEW statement would be helpful to any and all MZ developers, who might like a shortcut way to do the equivalent of a CREATE SOURCE {src_name} followed by a SELECT * FROM {src_name} to get fast feedback that they had defined their SOURCE correctly.
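
To make the strawman concrete, such a statement might look something like this (entirely hypothetical syntax; PREVIEW does not exist in Materialize):

-- Entirely hypothetical: inspect what a source definition would return, without creating it.
PREVIEW FROM KAFKA BROKER 'kafka:9092' TOPIC 'sometopic'
  FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';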

The SINK side of things is tougher right now; I'm not sure if the same sort of PREVIEW token trick would work here (it's not as clear to me what exactly that would mean in that context, but the MZ folks may be able to think of a use case for something like this that I'm unaware of). dbt exposures might be able to handle some of the sink construction workloads if we get to a good place in terms of the external nodes feature, but that work is still very early and could go any number of different ways, so I wouldn't want to bet on it. Still want to noodle on this one a bit and/or see if the MZ folks think there is a plausible equivalent for SINKs to the PREVIEW approach for SOURCEs (also, obviously, the PREVIEW idea might not be the right thing to do in MZ for any number of other reasons I'm unaware of).

@jwills commented May 16, 2022

Just read up in the thread a bit: doing the sinks as analyses is smart, esp. given that constraint re: the lag with which they should be created relative to the rest of the pipeline. /cc @morsapaes

@ahelium commented May 18, 2022

PREVIEW would be super neat! One way to accomplish previewing a non-materialized source is to use a TAIL command within a transaction to limit the number of rows returned. And now that TAIL can take internal queries, you can run any dev materialized view statement to peep your data:

materialize=> BEGIN;
BEGIN
materialize=> DECLARE c CURSOR for TAIL (SELECT convert_from(data, 'utf8') AS data FROM rp_flight_information);
DECLARE CURSOR
materialize=> FETCH 2 c;
mz_timestamp  | mz_diff |                                                                                                                                                                            data
---------------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1652882799999 |       1 | {"icao24": "345682", "callsign": "", "origin_country": "Spain", "time_position": null, "last_contact": 1652829778, "longitude": null, "latitude": null, "baro_altitude": null, "on_ground": false, "velocity": 260.21, "true_track": 100.71, "vertical_rate": 0, "sensors": null, "geo_altitude": null, "squawk": null, "spi": false, "position_source": 0}
 1652882799999 |       1 | {"icao24": "38a1db", "callsign": "", "origin_country": "France", "time_position": null, "last_contact": 1652821139, "longitude": null, "latitude": null, "baro_altitude": null, "on_ground": true, "velocity": 0, "true_track": 123.75, "vertical_rate": null, "sensors": null, "geo_altitude": null, "squawk": "7776", "spi": false, "position_source": 0}
(2 rows)

I wonder if we could codify that process somehow for users, to make things simpler.

@morsapaes (Author) commented:

Thanks for writing that down, @jwills! Your comment + having a chat with @dataders (+ some 🚿 time) made me look at things from more of a workflow separation perspective. The TL;DR for the refactor (as we follow the progress of things like external nodes and the short-term plans to revamp the programmatic interfaces of dbt itself) sounds like:

Both sources and sinks should be completely decoupled from SQL models, which also means the creation of these objects should not happen at any point during dbt run, but as a separate staging step (much like what happens in dbt-external-tables). These should be YAML-ified (since they're pure DDL statements) and created using something like dbt stage --sources/dbt stage --sinks.

Does this sound reasonable? This separation should also make it easier to integrate with CI/CD pipelines that can trigger the creation of the right objects at the right time.


+1 that something like PREVIEW would be useful, but I'm having trouble understanding how that could help in the context of the adapter. It'd be cool to wrap what @ahelium pointed out in a dry run-like command to preview the results of a (transformation) model, though!
