humorless/dbt-duckdb-tutorial.md

Last active May 10, 2025 07:33

Star (2) You must be signed in to star a gist
Fork (0) You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/humorless/e1ca70f85a7174d1706ce3f9e87b6764.js"></script>
Save humorless/e1ca70f85a7174d1706ce3f9e87b6764 to your computer and use it in GitHub Desktop.

Download ZIP

Modern data stack by REPLWARE

Raw

dbt-duckdb-tutorial.md

Modern data stack

Main features

REPLWARE 建議的資料分析技術棧 (modern data stack) 主要有下列特色：

ELT over ETL
SQL based analytics over non-SQL based analytics
Analytic Engineer as a new position
When the data is not exceeding 1T, your desktop/notebook is fast enough.

參考資料：

為何要用 SQL 取代 spreadsheet

當分析師在利用 spreadsheet 做分析工作時，每次向右邊拉出一個新的「欄」(column) ，就是在做一次的資料建模 (data modeling)。然而，由於上述的資料建模方式，每一次的建模只能表現一個維度 (dimension)，這樣子的方式，對於資料建模而言，表現的語彙是相當受限的。另一方面，如果是使用 SQL 的話，每一次的查詢 (query) 就是在做一次的資料建模 (data modeling)。由於，生成的資料是兩個維度的表格 (table)，表現能力自然會豐富許多。

需要的 technologies & tools

git
SQL formatter for Java
duckdb ;; data warehouse
dbt ;; Transform - T
metabase ;; Dashboard & Report
database schema graph 　　 ;; Database Schema Graph
Optional - nvim (用來編輯 yaml 與 sql)

課程規畫

lesson 1 - install software

config git
config nvim
安裝 pyenv (dbt 會需要 pip, python)
安裝 dbt
安裝 command-line duckdb

Optional

讀 Main features 那一段落的參考資料
https://www.startdataengineering.com/post/dbt-data-build-tool-tutorial/

lesson 2 - play in the example dbt project

玩一下 jaffle_shop 這個 example project 。

lesson 3 - start dbt from scratch

dbt init -> 生成 ~/.dbt/profiles.yml 與 project_name directory
duckdb 登入 database duckdb dbt.duckdb
登入 duckdb 之後，嘗試幾個指令：參考
- select * from duckdb_tables;
- select * from duckdb_views;
- select * from duckdb_schemas;
- decribe [TABLE] 顯示 TABLE 的定義
- .read [SQL_CMD_FILE_NAME]
- .exit

Example content of ~/.dbt/profiles.yml

duck:
  outputs:
    dev:
      type: duckdb
      path: /Users/laurencechen/analytics/duck/dbt.duckdb  
  target: dev

lesson 4 - import data

dimension data

dbt seed
dbt seed --full-refresh

facts data

source
useful SQL command: CREATE TABLE, DROP TABLE, CREATE SCHEMA

TODO

補充其它有效的 load data into duckdb 方式

lesson 5 - basic SQL & modeling through dbt (source & ref)

The concept of primary key
3 types of table relationships: 1-to-1, 1-to-m, m-to-m
寫 dbt 的 model: 使用 ref, source
利用 dbt docs 來看 DAG
如何利用 SQL views 來做建模

lesson 6 - metabase

與 dbt 整合
Basic skill: dimension and measure
Metabase specific high-level semantic: segments and metrics
Core func 1: Visualization
Core func 2: Dashboard
Optional: Automation

command to execute metabase

java -jar metabase.jar

lesson 7 - basic SQL & reporting

4 種 SQL 的基礎語法: select *, select columns, where, join explaining join
group by, having, order by, aggregation functions
1Keydata SQL tutorial
distinct 的常見 4 種用法: distinct, distinct on, is distinct from, distinct in aggregation fun
Query Process Steps
1. Getting Data (From, Join)
2. Row Filter (Where)
3. Grouping (Group by)
4. Aggregate function
5. Group Filter (Having)
6. Window Function
7. SELECT
8. Distinct
9. Union
10. Order by
11. Offset
12. Limit/Fetch/Top

lesson 8 - intermediate SQL

NULL, IS NULL, IS NOT NULL, COALESCE
UNION, UNION ALL
GROUPING SET, ROLLUP, CUBE
There are 3 places to put filtering expressions:
- case/end inside aggregation function
- where
- having
Pivot table created by SQL: using case end inside aggregation function with group by

lesson 9 - advanced SQL

Conceptual, Logical, And Physical Data Models
lateral join
window function: running sum, delta, rank
- over
  - running => with order by, the default window frame is range between unbounded preceding and current row
  - moving => without order by, the default window frame is rows between unbounded preceding and unbounded following
- aggregate functions, ranking functions, analytic functions
date spine & generate_series
ARRAY_AGG with GROUP BY

CREATE TABLE bar AS SELECT * FROM ( VALUES                                      
  (1, 2, 3),                                                                    
  (1, 2, 4),                                                                    
  (1, 2, 5),                                                                    
  (2, 2, 3),                                                                    
  (2, 2, 4),                                                                    
  (2, 3, 5)                                                                     
) AS t(x,y,z);                                                                  
                                                                                
select x, (array_agg(y))[3]  from bar group by x;                               
=>                                                                              
  x │ array_agg                                                                
──┼───────────                                                    
  2 │   {2,2,2} ->     2                                                       
  1 │   {2,2,3} ->     3

lesson 10 - test & snapshot

dbt test & dbt test --select
四種 tests: unique, not_null, accepted_values, relationships
dbt snapshot

lesson 11 - jinja & macro

jinja
- if
- set
- for
install plugin -> dbt deps (dbt-labs/dbt_utils, dbt-labs/codegen)
write your own macro
manage database UDF by dbt

lesson 12 - hook & operation & dbt project checklist

lesson 13 - Data Review

lesson 14 - EL & CDC

Singer
Meltano
Real Time Analytics & stream processing
- CDC: change data capture

Appendix - Analytic Engineering toolkit

SQL
Jinja/Macro
dbt test
Metabase - visualization & dashboard

Author

humorless commented Apr 11, 2023

Resources from metabase

(1) https://www.metabase.com/dashboards/marketing-teams
(2) https://www.metabase.com/dashboards/sales-teams
(3) https://www.metabase.com/dashboards/engineering-teams
(4) https://www.metabase.com/blog/sql-cheatsheet-simple-sql-commands-to-get-started-in-data-analytics

Author

humorless commented Aug 4, 2023 •

edited

Loading

Postgres Query Optimization & Index

(1) https://www.cybertec-postgresql.com/en/join-strategies-and-performance-in-postgresql/
(2) https://atlasgo.io/guides/postgres/included-columns
(3) https://explain.dalibo.com/
(4) https://www.citusdata.com/blog/2022/05/19/speeding-up-sort-performance-in-postgres-15/

Author

humorless commented Aug 12, 2023 •

edited

Loading

SQL tutorial

Author

humorless commented Aug 13, 2023

SQL windows function semantic summary

Author

humorless commented Aug 17, 2023 •

edited

Loading

SQL lateral join

What is it?

A LATERAL join is like a SQL foreach loop, in which the SQL server will iterate over each row in a result set and evaluate a subquery (correlated subquery) using that row as a parameter.

Syntax & Semantic

LATERAL [subquery | derived table]

Typical use cases

Top N per group
Define a local variable for select section to use
Explode JSON

Example 1: Referencing variable

input

SELECT
  *
FROM
  generate_series(1, 4) AS t1
  CROSS JOIN LATERAL generate_series(1, t1.t1) AS t2;

output

 t1 │ t2 
────┼────
  1 │  1
  2 │  1
  2 │  2
  3 │  1
  3 │  2
  3 │  3
  4 │  1
  4 │  2
  4 │  3
  4 │  4
(10 rows)

Example 2: explode JSON

CREATE TABLE mytab (
   id bigint PRIMARY KEY,
   data jsonb
);
 
INSERT INTO mytab VALUES
   (1, '{ "key": ["five", "six"] }'),
   (2, '{ "key": ["pick", "up", "sticks"] }');

SELECT mytab.id, j.elem
FROM mytab
   CROSS JOIN LATERAL
      jsonb_array_elements_text(
         mytab.data -> 'key'
      ) AS j(elem);
 
 id |  elem  
----+--------
  1 | five
  1 | six
  2 | pick
  2 | up
  2 | sticks
(5 rows)

Author

humorless commented Aug 17, 2023 •

edited

Loading

Using Postgres

Install and Config

Download Postgres
psql 設定檔
- .psqlrc: config file for psql
- .pgpass : 無密碼登入
My .psqlrc

\set PROMPT1 '%~%x%# '
\x auto
4. Format of `.pgpass`
`host:port:db_name:user_name:password`

--- \set ECHO_HIDDEN false
\set ON_ERROR_STOP on
\set ON_ERROR_ROLLBACK interactive
\set HISTFILE ~/.psql_history-:DBNAME

\set VERBOSITY verbose

\pset null '¤'
\pset linestyle 'unicode'

-- \pset unicode_border_linestyle single
-- \pset unicode_column_linestyle single
-- \pset unicode_header_linestyle double

set intervalstyle to 'postgres_verbose';

-- \setenv LESS '-iMFXSx4R'

Basic shell command

生成一個 postgres 資料庫 -> createdb crm_db
psql 登入 database crm_db psql -d crm_db or psql -h localhost -d crm_db -U $USERNAME

Commands after connecting to Postgres server

連線後改變 postgres schema: set search_path to dbt_develop;

Load data into Postgres

Using script to prepare the table

#!/usr/bin/env bash                                                                                             
PGOPTIONS="--search_path=dbt_develop"                                                                           
export PGOPTIONS                                                                                                
                                                                                                                
psql -d crm_db -c "DROP TABLE IF EXISTS raw_data_table_name;"                                                            
psql -d crm_db -c "CREATE TABLE raw_data_table_name(                                                                    
    id integer,                                                                            
    name text,
    created_at timestamp with time zone,
    updated_at timestamp with time zone                                            
);"

Using psql command to copy

psql -d crm_db -c "\copy raw_data_table_name FROM '/abs/path/to/the/source/file' DELIMITER ',' CSV HEADER"

Miscellaneous

Author

humorless commented Aug 17, 2023 •

edited

Loading

Configure Neovim for SQL/yaml

安裝 neovim 的插件管理 plug

在 terminal 先執行

sh -c 'curl -fLo "${XDG_DATA_HOME:-$HOME/.local/share}"/nvim/site/autoload/plug.vim --create-dirs \
       https://raw.githubusercontent.com/junegunn/vim-plug/master/plug.vim'

neovim config file

再把下列的 config 貼到 ~/.config/nvim/init.vim

" default yaml setup
autocmd FileType yaml setlocal ts=2 sts=2 sw=2 expandtab indentkeys-=0# indentkeys-=<:> foldmethod=indent nofoldenable

" Begin the plugin section
call plug#begin()
" Specify your required plugins here

Plug 'liuchengxu/vim-better-default'
call plug#end()

function! Sqlfmt()
 "`:call Sqlfmt()` to format the current sql file
 " The sqlfmt requires the installation of sqlparse
 " `pip install sqlparse`
 !sqlformat --reindent --keywords upper --identifiers lower % -o %
 :e
endfunction

function! Yamlfmt()
 "`:call Yamlfmt()` to format the current yaml file
 " pip install yamlfix
 !yamlfix %
 :e
endfunction

Author

humorless commented May 18, 2024

常用的 dbt package

dbt_utils
codegen
dbt_project_evaluator
elementary
dbt_artifacts

也許值得一試

audit_helper

Author

humorless commented May 10, 2025

Essential dbt checklist

humorless/dbt-duckdb-tutorial.md

Modern data stack

Main features

為何要用 SQL 取代 spreadsheet

需要的 technologies & tools

課程規畫

lesson 1 - install software

lesson 2 - play in the example dbt project

lesson 3 - start dbt from scratch

lesson 4 - import data

dimension data

facts data

TODO

lesson 5 - basic SQL & modeling through dbt (source & ref)

lesson 6 - metabase

lesson 7 - basic SQL & reporting

lesson 8 - intermediate SQL

lesson 9 - advanced SQL

lesson 10 - test & snapshot

lesson 11 - jinja & macro

lesson 12 - hook & operation & dbt project checklist

lesson 13 - Data Review

lesson 14 - EL & CDC

Appendix - Analytic Engineering toolkit

humorless commented Apr 11, 2023

Resources from metabase

Uh oh!

humorless commented Aug 4, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Postgres Query Optimization & Index

Uh oh!

humorless commented Aug 12, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

SQL tutorial

Uh oh!

humorless commented Aug 13, 2023

SQL windows function semantic summary

Uh oh!

humorless commented Aug 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

SQL lateral join

What is it?

Syntax & Semantic

Typical use cases

Example 1: Referencing variable

Example 2: explode JSON

Uh oh!

humorless commented Aug 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Using Postgres

Install and Config

Basic shell command

Commands after connecting to Postgres server

Load data into Postgres

Miscellaneous

Uh oh!

humorless commented Aug 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Configure Neovim for SQL/yaml

安裝 neovim 的插件管理 plug

neovim config file

Uh oh!

humorless commented May 18, 2024

常用的 dbt package

也許值得一試

Uh oh!

humorless commented May 10, 2025

Uh oh!

humorless commented Aug 4, 2023 •

edited

Loading

humorless commented Aug 12, 2023 •

edited

Loading

humorless commented Aug 17, 2023 •

edited

Loading

humorless commented Aug 17, 2023 •

edited

Loading

humorless commented Aug 17, 2023 •

edited

Loading