Tracking issue: #10600
Some things to consider:
Although sources and sinks are inverse concepts, sources have a one-to-many relationship with downstream relations, while sinks have a one-to-one relationship with upstream relations. Relations have a zero-to-many relationship with downstream sinks, though, which gets in the way of implementing them as inverse dbt concepts (e.g. using pre- and post-hooks).
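The cardinality difference can be sketched in Materialize SQL (hypothetical object names, abbreviated era-appropriate syntax; not meant as exact DDL):

```sql
-- One source, many downstream relations (one-to-many):
CREATE SOURCE purchases
FROM KAFKA BROKER 'kafka:9092' TOPIC 'purchases'
FORMAT BYTES;

CREATE MATERIALIZED VIEW purchase_count AS
    SELECT count(*) FROM purchases;

CREATE MATERIALIZED VIEW purchase_sample AS
    SELECT data FROM purchases;

-- One sink, exactly one upstream relation (one-to-one);
-- a relation can also have zero or many downstream sinks:
CREATE SINK purchase_count_sink
FROM purchase_count
INTO KAFKA BROKER 'kafka:9092' TOPIC 'purchase_count'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';
```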
Something else to consider is that source and sink configuration might have different ownership than model development in the wild (e.g. data engineers vs. analytics engineers), so it'd be preferable not to tightly couple them.
From a developer perspective, this would require us to implement Materialize-specific versions of the following macros from `dbt-external-tables`:

- `create_external_table.sql`
- `get_external_build_plan.sql`
- `dropif.sql`
Our implementation wouldn't need to live in the `dbt-external-tables` package; we could simply override the macros within `dbt-materialize` (for reference, see Firebolt's implementation).
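As a sketch, the override would follow the package's adapter-dispatch naming convention, something like this (the macro body is illustrative only, not the actual implementation):

```sql
-- macros/create_external_table.sql (in dbt-materialize)
{% macro materialize__create_external_table(source_node) %}
    {%- set external = source_node.external -%}
    {# Illustrative: render a CREATE SOURCE statement from the
       source's `external` properties instead of the default
       CREATE EXTERNAL TABLE that other adapters emit #}
    create source {{ source(source_node.source_name, source_node.name) }}
    from kafka broker '{{ external.host }}' topic '{{ external.topic }}'
    ...
{% endmacro %}
```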
and then:

- Deprecate the custom `source` materialization (codebase + documentation)
- Adapt the dbt integration guide
- Adapt the dbt get started demo
- Consider adapting the MZ Hack Day demo
- Add a new section to the Materialize configurations page in the dbt documentation
From a user perspective, defining sources as external tables would have the following workflow:
1. Setting up dbt-external-tables

Add the `dbt-external-tables` package to `packages.yml`:

```yaml
packages:
  - package: dbt-labs/dbt_external_tables
    version: <version>
```

Modify `dbt_project.yml` to include:

```yaml
dispatch:
  - macro_namespace: dbt_external_tables
    search_order: ['dbt', 'dbt_external_tables']
```

Install the `dbt-external-tables` package dependency:

```bash
dbt deps
```
2. Defining a source

Define a table as `external` in `dbt_project.yml`:

```yaml
sources:
  - name: kafka_source
    loader: kafka
    tables:
      - name: sometable
        external:
          host: 'kafka:9092'
          topic: 'sometopic'
          ...
```

Run `stage_external_sources`, the entrypoint macro of the `dbt-external-tables` package:

```bash
dbt run-operation stage_external_sources
```
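Under the hood, the overridden macro would render DDL along these lines (a sketch; the exact Materialize syntax depends on the source type and format, and the schema registry URL is hypothetical):

```sql
CREATE SOURCE kafka_source.sometable
FROM KAFKA BROKER 'kafka:9092' TOPIC 'sometopic'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';
```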
The biggest downside is that this adds considerable overhead to what is many users' entry point into Materialize. It's not a straightforward workflow.
Option 2: `pre-hook` on models

From a developer perspective, this would require:

- Implement a `create_source` macro

This option sounds borked from the get-go, since it would tightly couple sources with models (when the relationship between them might not be one-to-one).
1. Defining a `pre-hook` in a(n entry?) model

```sql
{{
    config({
        "materialized": "materializedview",
        "pre-hook": [
            "{{ materialize.create_source(...
                host='kafka:9092',
                topic='sink_topic',
                ...) }}"
        ]
    })
}}
```
Option 1: `post-hook` on models

From a developer perspective, this would require:

- Implement a `create_sink` macro (similar to the `unload_table` macro in `dbt-redshift`)
- Consider (automatically) creating an `exposure` for lineage (see Option 2 👇)
and then:

- Deprecate the custom `sink` materialization (codebase + documentation)
- Adapt the dbt integration guide
- Consider adapting the MZ Hack Day demo
- Add a new section to the Materialize configurations page in the dbt documentation
From a user perspective, defining sinks as post-hooks would have the following workflow:
1. Defining a `post-hook` in the model to sink

```sql
{{
    config({
        "materialized": "materializedview",
        "post-hook": [
            "{{ materialize.create_sink(...
                this.materializedview,
                host='kafka:9092',
                topic='sink_topic',
                ...) }}"
        ]
    })
}}
```
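The `create_sink` macro would presumably render DDL along these lines (a sketch; object names are hypothetical, and the exact syntax depends on the sink type and format):

```sql
CREATE SINK mv_churn_prediction_sink
FROM mv_churn_prediction
INTO KAFKA BROKER 'kafka:9092' TOPIC 'sink_topic'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';
```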
Option 2: custom metadata on exposures
From a developer perspective, it's a bit unclear how this could be implemented, since `exposures` seem like...a purely metadata/documentation-based feature. According to Jeremy from dbt Labs, it might be possible to go this route using the `meta` configuration and some custom macros.

TBH, I'm not sure how this would work since `exposures` aren't compilable or executable, but maybe we can figure it out based on these two helpful threads:
It's also not possible to use a custom string as the `exposure` type (at least yet, see dbt #2835), so we'd have to go with one of the accepted values: `dashboard`, `notebook`, `analysis`, `ml` or `application`; this mainly dictates how exposures are surfaced in the dbt documentation, and having sinks listed under any of these options isn't ideal.
One of the benefits of using exposures would be having sinks as end nodes in the DAG. In contrast, with `post-hooks` we'd lose track of lineage information (AFAIU). Maybe there's a way to combine Option 1 and Option 2 (i.e. define a sink as a `post-hook` and automatically create an `exposure` for lineage), so we get the best of both worlds?
1. Defining an exposure

Define an `exposure` with a custom `meta` configuration in `dbt_project.yml`:

```yaml
exposures:
  - name: kafka_sink
    type: <exposure-type>
    description: >
      Some description.
    depends_on:
      - ref('mv_churn_prediction')
    meta: ...
    owner:
      email: [email protected]
```
In all cases, credentials should be handled as (secret) environment variables inherited from the environment users run dbt against.
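For example, using dbt's built-in `env_var` function in the source definition (the `sasl_password` property and `KAFKA_SASL_PASSWORD` variable are illustrative names, not a fixed schema):

```yaml
sources:
  - name: kafka_source
    tables:
      - name: sometable
        external:
          host: 'kafka:9092'
          topic: 'sometopic'
          # dbt resolves env_var at parse time, so the secret never
          # needs to be committed to the project
          sasl_password: "{{ env_var('KAFKA_SASL_PASSWORD') }}"
```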
I get where @dataders is coming from re: hackles being raised seeing DDL inside of a dbt model; the original sin here is mine, and I rightfully deserve the blame for it (along with a few other things I noticed as I was reading through the dbt-materialize source code with a much more mature eye after working on a few more adapters.)
If we get back for a second to the core of what makes dbt so magical for analytics engineers, it's the simple fact that you can seamlessly go from a `SELECT` statement that you can copy-paste from any other context to a database object that is integrated into a production pipeline. That magical feeling relies pretty heavily on the fact that almost all SQL databases provide both a `CREATE VIEW AS SELECT ...` and a `CREATE TABLE AS SELECT ...` construct that the dbt adapter can build on to make the magic happen, and obviously for Materialize, we rely on that same magic for constructing views and matviews, and everything works exactly the way we would like it to.

This breaks down, of course, for how we define sources and sinks; e.g. if we prune off the DDL prefix of a `CREATE SOURCE` statement, what we're left with is a stub that isn't actually an executable statement in Materialize in the same way that a `SELECT` statement would be. In my dream world (i.e., what I would do if I were, say, the TL for the developer ecosystem team at MZ), I would change that by modifying the MZ database itself to add a valid token that indicated that I wanted to view what the contents of a `SOURCE` definition would look like, but without actually creating the source -- the same way a `SELECT` statement lets me view the results of a query without materializing a database object. Strawman: call the statement `PREVIEW`, have its syntax conform to everything that you are allowed to do in a `CREATE SOURCE` statement after the initial `CREATE [MATERIALIZED] SOURCE {src_name} IF NOT EXISTS` DDL, and have it simply show you like 100-1000 lines max of output from what you would get if you had created such a `SOURCE` and then done a `SELECT` against it. Then have the `source` materialization in dbt-materialize simply prune off the `PREVIEW` (or alternatively, just ignore a `PREVIEW` token that is inside of a `CREATE SOURCE` DDL statement), add the DDL, and then be on your merry way. FWIW, I suspect that such a `PREVIEW` statement would be helpful to any and all MZ developers, who might like a shortcut way to do the equivalent of a `CREATE SOURCE {src_name}` followed by a `SELECT * FROM {src_name}` to get fast feedback that they had defined their `SOURCE` correctly.

The `SINK` side of things is tougher right now; like, I'm not sure if the same sort of `PREVIEW` token trick would work here (it's not as clear to me what exactly that would mean in that context, but the MZ folks may be able to think of a use case for something like this that I'm unaware of.) dbt exposures might be able to handle some of the sink construction workloads if we get to a good place in terms of the external nodes feature, but that work is still very early and could go any number of different ways, so I wouldn't want to bet on it. Still want to noodle on this one a bit and/or see if the MZ folks think there is a plausible equivalent for `SINK`s to the `PREVIEW` approach for `SOURCE`s (also, obviously, the `PREVIEW` idea might not be the right thing to do in MZ for any number of other reasons I'm unaware of.)
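To make the strawman concrete, a `PREVIEW` statement might read something like this (entirely hypothetical syntax, per the proposal above):

```sql
-- Everything after PREVIEW is what would normally follow
-- CREATE [MATERIALIZED] SOURCE {src_name} IF NOT EXISTS;
-- nothing is actually created, only a bounded sample is returned.
PREVIEW
FROM KAFKA BROKER 'kafka:9092' TOPIC 'sometopic'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';
```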