How to do multi-threaded / parallel operations with dbt #dbt

How to do multi-threaded / parallel operations with dbt

-!  🚨                                          WARNING                                          🚨  !-
Just because you could does not mean you should!

Examples below are using PostgreSQL however the same idea apply across adapters (Snowflake,etc).

In order to run arbitrary SQL commands with dbt - we would typically use hooks or operations. Imagine for a second, we have to run an operation that performs some action multiple times:

insert into foo_1 ...;
insert into foo_2 ...;
insert into foo_3 ...;
insert into foo_4 ...;
insert into foo_5 ...;
insert into foo_6 ...;

If we were to run those SQL statements in a single run-operation - then this would be executed serially. Let's look at an example.

First let's create a function that simulates something doing something "slow":

create or replace function wait_then_return(t integer)
returns integer as $$
begin
  perform pg_sleep(t); -- sleep for t seconds.
  return t;            -- return t.
end;
$$ language plpgsql;

With that function created, when we call that function, the query would take t seconds to execute:

postgres=# select wait_then_return(10) as c;
 c  
----
 10
(1 row)

Time: 10012.288 ms (00:10.012)

postgres=# select wait_then_return(10) as c; select wait_then_return(10) as c; select wait_then_return(10) as c;
 c  
----
 10
(1 row)

Time: 10012.910 ms (00:10.013)
 c  
----
 10
(1 row)

Time: 10012.328 ms (00:10.012)
 c  
----
 10
(1 row)

Time: 10012.681 ms (00:10.013)

In a dbt project, things would look something like:

# dbt_project.yml
name: my_dbt_project
profile: all
config-version: 2
version: "1.0.0"

-- macros/op.sql
{% macro op() %}
    {% do log('Procedure starting: ' ~ modules.datetime.datetime.now(), true) %}
    {% set query %}
        select wait_then_return(10) as c;
        select wait_then_return(10) as c;
        select wait_then_return(10) as c;
        select wait_then_return(10) as c;
        select wait_then_return(10) as c;
        select wait_then_return(10) as c;
    {% endset %}
    {% do run_query(query) %}
    {% do log('Procedure finished: ' ~  modules.datetime.datetime.now(), true) %}
{% endmacro %}

$ dbt run-operation op
01:30:25  Running with dbt=1.6.9
01:30:25  Registered adapter: postgres=1.6.9
01:30:26  Found 0 sources, 0 exposures, 0 metrics, 353 macros, 0 groups, 0 semantic models
01:30:26  Procedure starting: 2023-11-24 14:30:26.350642
01:31:26  Procedure finished: 2023-11-24 14:31:26.425314

We can see it took 1 minute to execute those 6 queries serially. What if we want to run those queries in parallel? dbt has threads - however it is not possible to have multi-threaded run-operations, only the building of models can take advantage of multiple threads.

Since dbt can build multiple models at the same time, lets use that to our advantage.

Create a new materialization:

-- macros/m13n.sql
{% materialization m13n, default %}
    {%- set identifier = model['alias'] -%}
    {%- set target_relation = api.Relation.create(identifier=identifier, schema=schema, database=database, type='view') -%}
    {% call statement('main') -%}
        {{ sql }}
    {%- endcall %}
    {{ return({'relations': [target_relation]}) }}
{% endmaterialization %}

This materialization does not do anything except to run the templated SQL statement {{ sql }} which will come from the SQL text in a model sql file.

Create multiple models that all use our custom materialization:

# dbt_project.yml
name: my_dbt_project
profile: all
config-version: 2
version: "1.0.0"

models:
  my_dbt_project:
    +materialized: m13n
    
on-run-start: "{% do log('Procedure starting: ' ~ modules.datetime.datetime.now(), true) %}"
on-run-end: "{% do log('Procedure finished: ' ~  modules.datetime.datetime.now(), true) %}"

I've added some on-run-x hooks so we can see some timestamps being logged.

-- models/foo_1.sql
select wait_then_return(10) as c
-- models/foo_2.sql
select wait_then_return(10) as c
-- models/foo_3.sql
select wait_then_return(10) as c
-- models/foo_4.sql
select wait_then_return(10) as c
-- models/foo_5.sql
select wait_then_return(10) as c
-- models/foo_6.sql
select wait_then_return(10) as c

And now let's run:

$ dbt run --threads 1
01:43:48  Running with dbt=1.6.9
01:43:48  Registered adapter: postgres=1.6.9
01:43:49  Procedure starting: 2023-11-24 14:43:49.342662
01:43:49  Procedure finished: 2023-11-24 14:43:49.350480
01:43:49  Found 6 models, 2 operations, 0 sources, 0 exposures, 0 metrics, 354 macros, 0 groups, 0 semantic models
01:43:49  
01:43:49  
01:43:49  Running 1 on-run-start hook
01:43:49  Procedure starting: 2023-11-24 14:43:49.601148
01:43:49  1 of 1 START hook: my_dbt_project.on-run-start.0 ............................... [RUN]
01:43:49  1 of 1 OK hook: my_dbt_project.on-run-start.0 .................................. [OK in 0.00s]
01:43:49  
01:43:49  Concurrency: 1 threads (target='pg-local')
01:43:49  
01:43:49  1 of 6 START sql m13n model public.foo_1 ....................................... [RUN]
01:43:59  1 of 6 OK created sql m13n model public.foo_1 .................................. [SELECT 1 in 10.06s]
01:43:59  2 of 6 START sql m13n model public.foo_2 ....................................... [RUN]
01:44:09  2 of 6 OK created sql m13n model public.foo_2 .................................. [SELECT 1 in 10.06s]
01:44:09  3 of 6 START sql m13n model public.foo_3 ....................................... [RUN]
01:44:19  3 of 6 OK created sql m13n model public.foo_3 .................................. [SELECT 1 in 10.05s]
01:44:19  4 of 6 START sql m13n model public.foo_4 ....................................... [RUN]
01:44:29  4 of 6 OK created sql m13n model public.foo_4 .................................. [SELECT 1 in 10.06s]
01:44:29  5 of 6 START sql m13n model public.foo_5 ....................................... [RUN]
01:44:39  5 of 6 OK created sql m13n model public.foo_5 .................................. [SELECT 1 in 10.05s]
01:44:39  6 of 6 START sql m13n model public.foo_6 ....................................... [RUN]
01:44:49  6 of 6 OK created sql m13n model public.foo_6 .................................. [SELECT 1 in 10.05s]
01:44:50  
01:44:50  Running 1 on-run-end hook
01:44:50  Procedure finished: 2023-11-24 14:44:50.018001
01:44:50  1 of 1 START hook: my_dbt_project.on-run-end.0 ................................. [RUN]
01:44:50  1 of 1 OK hook: my_dbt_project.on-run-end.0 .................................... [OK in 0.00s]
01:44:50  
01:44:50  
01:44:50  Finished running 6 m13n models, 2 hooks in 0 hours 1 minutes and 0.60 seconds (60.60s).
01:44:50  
01:44:50  Completed successfully
01:44:50  
01:44:50  Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6

$ dbt run --threads 6
01:45:13  Running with dbt=1.6.9
01:45:13  Registered adapter: postgres=1.6.9
01:45:14  Procedure starting: 2023-11-24 14:45:14.079126
01:45:14  Procedure finished: 2023-11-24 14:45:14.087990
01:45:14  Found 6 models, 2 operations, 0 sources, 0 exposures, 0 metrics, 354 macros, 0 groups, 0 semantic models
01:45:14  
01:45:14  
01:45:14  Running 1 on-run-start hook
01:45:14  Procedure starting: 2023-11-24 14:45:14.325350
01:45:14  1 of 1 START hook: my_dbt_project.on-run-start.0 ............................... [RUN]
01:45:14  1 of 1 OK hook: my_dbt_project.on-run-start.0 .................................. [OK in 0.00s]
01:45:14  
01:45:14  Concurrency: 6 threads (target='pg-local')
01:45:14  
01:45:14  1 of 6 START sql m13n model public.foo_1 ....................................... [RUN]
01:45:14  2 of 6 START sql m13n model public.foo_2 ....................................... [RUN]
01:45:14  3 of 6 START sql m13n model public.foo_3 ....................................... [RUN]
01:45:14  4 of 6 START sql m13n model public.foo_4 ....................................... [RUN]
01:45:14  5 of 6 START sql m13n model public.foo_5 ....................................... [RUN]
01:45:14  6 of 6 START sql m13n model public.foo_6 ....................................... [RUN]
01:45:24  2 of 6 OK created sql m13n model public.foo_2 .................................. [SELECT 1 in 10.33s]
01:45:24  1 of 6 OK created sql m13n model public.foo_1 .................................. [SELECT 1 in 10.34s]
01:45:24  4 of 6 OK created sql m13n model public.foo_4 .................................. [SELECT 1 in 10.33s]
01:45:24  6 of 6 OK created sql m13n model public.foo_6 .................................. [SELECT 1 in 10.32s]
01:45:24  3 of 6 OK created sql m13n model public.foo_3 .................................. [SELECT 1 in 10.34s]
01:45:24  5 of 6 OK created sql m13n model public.foo_5 .................................. [SELECT 1 in 10.33s]
01:45:24  
01:45:24  Running 1 on-run-end hook
01:45:24  Procedure finished: 2023-11-24 14:45:24.739074
01:45:24  1 of 1 START hook: my_dbt_project.on-run-end.0 ................................. [RUN]
01:45:24  1 of 1 OK hook: my_dbt_project.on-run-end.0 .................................... [OK in 0.00s]
01:45:24  
01:45:24  
01:45:24  Finished running 6 m13n models, 2 hooks in 0 hours 0 minutes and 10.58 seconds (10.58s).
01:45:24  
01:45:24  Completed successfully
01:45:24  
01:45:24  Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6

$ dbt run --threads 3
01:45:38  Running with dbt=1.6.9
01:45:38  Registered adapter: postgres=1.6.9
01:45:39  Procedure starting: 2023-11-24 14:45:39.289995
01:45:39  Procedure finished: 2023-11-24 14:45:39.297111
01:45:39  Found 6 models, 2 operations, 0 sources, 0 exposures, 0 metrics, 354 macros, 0 groups, 0 semantic models
01:45:39  
01:45:39  
01:45:39  Running 1 on-run-start hook
01:45:39  Procedure starting: 2023-11-24 14:45:39.492161
01:45:39  1 of 1 START hook: my_dbt_project.on-run-start.0 ............................... [RUN]
01:45:39  1 of 1 OK hook: my_dbt_project.on-run-start.0 .................................. [OK in 0.00s]
01:45:39  
01:45:39  Concurrency: 3 threads (target='pg-local')
01:45:39  
01:45:39  1 of 6 START sql m13n model public.foo_1 ....................................... [RUN]
01:45:39  2 of 6 START sql m13n model public.foo_2 ....................................... [RUN]
01:45:39  3 of 6 START sql m13n model public.foo_3 ....................................... [RUN]
01:45:49  1 of 6 OK created sql m13n model public.foo_1 .................................. [SELECT 1 in 10.09s]
01:45:49  3 of 6 OK created sql m13n model public.foo_3 .................................. [SELECT 1 in 10.09s]
01:45:49  2 of 6 OK created sql m13n model public.foo_2 .................................. [SELECT 1 in 10.09s]
01:45:49  4 of 6 START sql m13n model public.foo_4 ....................................... [RUN]
01:45:49  5 of 6 START sql m13n model public.foo_5 ....................................... [RUN]
01:45:49  6 of 6 START sql m13n model public.foo_6 ....................................... [RUN]
01:45:59  4 of 6 OK created sql m13n model public.foo_4 .................................. [SELECT 1 in 10.10s]
01:45:59  5 of 6 OK created sql m13n model public.foo_5 .................................. [SELECT 1 in 10.10s]
01:45:59  6 of 6 OK created sql m13n model public.foo_6 .................................. [SELECT 1 in 10.10s]
01:45:59  
01:45:59  Running 1 on-run-end hook
01:45:59  Procedure finished: 2023-11-24 14:45:59.751741
01:45:59  1 of 1 START hook: my_dbt_project.on-run-end.0 ................................. [RUN]
01:45:59  1 of 1 OK hook: my_dbt_project.on-run-end.0 .................................... [OK in 0.00s]
01:45:59  
01:45:59  
01:45:59  Finished running 6 m13n models, 2 hooks in 0 hours 0 minutes and 20.39 seconds (20.39s).
01:45:59  
01:45:59  Completed successfully
01:45:59  
01:45:59  Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6

By controlling the number of threads - we control number of "models" dbt executes at any one time and when we set that to be equal to the number of models (6), the whole run took at most the lenght of time a single execution would have taken (~ 10 seconds) vs (~ 1 minute).

01:45:24  Finished running 6 m13n models, 2 hooks in 0 hours 0 minutes and 10.58 seconds (10.58s).

We can of course also achieve the same outcome by using normal table/view materializations and using pre/post-hooks but I didn't want to also have to then drop those models after they have been created since I want to keep my database clear of random tables/views.

One caveat with using this pattern is then you have random nodes showing up in your dbt DAG (even if we choose to hide those nodes via show: false):

@pcc-mtiner You've mentioned a tag, I assume you mean -- depends_on: {{ ref('<model>') }}? Do you have any thoughts about how to pass this dynamically in the case where you have a large number of inserts? I've tried all the suggestions here but no luck.

That's correct, I was talking about that feature. So far I haven't found way to have it be dynamic, so if I want to expand/reduce number of parallel splits it can be a little tedious.

Also it seems you can't delete the view as a post-hook because dbt expects the entity to be available:
Unhandled error while executing target/run/dbt_athena_iceberg_datalake/models/<model>.sql`
An error occurred (EntityNotFoundException) when calling the GetTableVersions operation: Table <model> not found.

I haven't run into this yet myself, running several views in parallel and then dropping the views worked for my case. I suspect it might have something to do with trying to use the model named by the dropped view in some downstream dependency. I think I have an idea that might accomodate. The main process would consist of three main steps:

Start with a 'run-initialization' view model that creates your destination table (or does any cleanup tasks), then drops view in post-hook. Destination table will be created by manual calling DDL using dbt run_query macro. This table will should have different name than the run-intialization model itself.
Run several parallel view model that all call macro to insert into the table created in step 1, with posthooks to drop view after. Each model should de dependent on the initialization model, to ensure table is ready to receive records.
Have a 'run-finalization' model. This model should have the same name as the destination table created by step 1, that way you can reference the destination table in downstream models. Since the table is already created in Step 1, you will want to set some type of materialization that won't try to recreate the table in the last step. Can either create custom materialization, or perhaps use 'incremental' type since this wont try to alter table if it already exists. Could then just have some simple script that won't actually insert any further records. -- depends_on: tags could also be consolidated in this one model if you want to add/remove parallel splits.

Potentially may need to do all the DROP VIEW posthooks in the run-finalization model, rather than in their individual models. Hopefully that helps get through your issue!

jeremyyeo/README.md