The modern data stack recommended by REPLWARE has the following main characteristics:
- ELT over ETL
- SQL based analytics over non-SQL based analytics
- Analytic Engineer as a new position
- When the data does not exceed 1 TB, your desktop/notebook is fast enough.
References:
- ELT方法的變革對數據驅動營運的重要性
- Holistics Analytics Setup Guidebook
- The Analytics Engineering Guide: build a data team
- Big Data is Dead
When an analyst works in a spreadsheet, every new column pulled out to the right is one act of data modeling. However, each such step can only express a single dimension, so as a modeling vocabulary it is quite limited. With SQL, on the other hand, every query is one act of data modeling, and since the result is a two-dimensional table, the expressive power is much richer.
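A minimal sketch of the contrast, using a hypothetical orders table: one SQL query can model a measure against two dimensions at once, where a spreadsheet column adds only one dimension at a time.

```sql
-- hypothetical table: region and order_date are dimensions, amount is a measure
create table orders (region varchar, order_date date, amount decimal(10, 2));

-- one query = one act of data modeling, covering two dimensions at once
select
    region,
    order_date,
    sum(amount) as total_amount
from orders
group by region, order_date;
```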
- git
- SQL formatter for Java
- duckdb ;; data warehouse
- dbt ;; Transform (the T in ELT)
- metabase ;; Dashboard & Report
- database schema graph ;; Database Schema Graph
- Optional - nvim (for editing yaml and sql)
- config git
- config nvim
- Install pyenv (dbt needs pip and python)
- Install dbt
- Install the command-line duckdb
Optional
- Read the references in the Main features section
- https://www.startdataengineering.com/post/dbt-data-build-tool-tutorial/
Try out the jaffle_shop example project.
- dbt init -> generates ~/.dbt/profiles.yml and the project_name directory
- duckdb -> connect to the database: duckdb dbt.duckdb
- After connecting to duckdb, try a few commands:

```sql
select * from duckdb_tables;
select * from duckdb_views;
select * from duckdb_schemas;
describe [TABLE]            -- show the definition of TABLE
.read [SQL_CMD_FILE_NAME]   -- run the SQL commands in a file
.exit                       -- leave the duckdb shell
```
Example content of ~/.dbt/profiles.yml:

```yaml
duck:
  outputs:
    dev:
      type: duckdb
      path: /Users/laurencechen/analytics/duck/dbt.duckdb
  target: dev
```
- dbt seed
- dbt seed --full-refresh
- source
- useful SQL commands: CREATE TABLE, DROP TABLE, CREATE SCHEMA
- other valid ways to load data into duckdb
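A sketch of loading external files into duckdb; the file names here are hypothetical, while read_csv_auto and read_parquet are duckdb table functions.

```sql
-- file names are hypothetical placeholders
create table raw_customers as select * from read_csv_auto('customers.csv');
create table raw_events    as select * from read_parquet('events.parquet');

-- INSERT INTO ... SELECT also works for an existing table:
insert into raw_customers select * from read_csv_auto('more_customers.csv');
```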
- The concept of primary key
- 3 types of table relationships: 1-to-1, 1-to-m, m-to-m
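The 1-to-m case can be sketched with a foreign key, and m-to-m with a junction table (table and column names are hypothetical):

```sql
create table customers (
    customer_id integer primary key,
    name        varchar
);

-- 1-to-m: many orders point at one customer
create table orders (
    order_id    integer primary key,
    customer_id integer references customers (customer_id),
    amount      decimal(10, 2)
);

-- m-to-m: a junction table pairs the two sides
create table order_items (
    order_id   integer,
    product_id integer,
    primary key (order_id, product_id)
);
```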
- Write dbt models: use ref and source
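A sketch of two hypothetical dbt models: source() reads raw data declared in a sources .yml file, while each ref() points at another model and adds an edge to the DAG that dbt uses to order the builds.

```sql
-- models/staging/stg_orders.sql (hypothetical model)
select order_id, customer_id, amount
from {{ source('raw', 'orders') }}

-- models/marts/customer_orders.sql (hypothetical model)
select customer_id, sum(amount) as total_amount
from {{ ref('stg_orders') }}
group by customer_id
```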
- Use dbt docs to view the DAG
- How to do modeling with SQL views
- Integration with dbt
- Basic skill: dimension and measure
- Metabase specific high-level semantic: segments and metrics
- Core func 1: Visualization
- Core func 2: Dashboard
- Optional: Automation
Command to run Metabase:

```shell
java -jar metabase.jar
```
- 4 basic SQL syntax forms: select *, select columns, where, join
- explaining join
- group by, having, order by, aggregation functions
- 1Keydata SQL tutorial
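The four basics above can be shown in one place; the tables are hypothetical:

```sql
create table customers (customer_id integer, name varchar);
create table orders (order_id integer, customer_id integer, amount decimal(10, 2));

select * from orders;                      -- select *

select o.order_id, c.name                  -- select columns
from orders o
join customers c                           -- join
  on o.customer_id = c.customer_id
where o.amount > 100;                      -- where
```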
- 4 common usages of distinct: distinct, distinct on, is distinct from, distinct in aggregation functions
- Query Process Steps
- Getting Data (From, Join)
- Row Filter (Where)
- Grouping (Group by)
- Aggregate function
- Group Filter (Having)
- Window Function
- SELECT
- Distinct
- Union
- Order by
- Offset
- Limit/Fetch/Top
- NULL, IS NULL, IS NOT NULL, COALESCE
- UNION, UNION ALL
- GROUPING SET, ROLLUP, CUBE
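The process steps above can be annotated on a single query; the table and values are hypothetical, and the numbers mark the logical evaluation order (which is why where cannot see the select alias but order by can):

```sql
create table orders (customer_id integer, status varchar, amount decimal(10, 2));

select customer_id,
       count(*) as n_orders        -- 4. aggregate / 6. SELECT defines the alias
from orders                        -- 1. FROM
where status = 'paid'              -- 2. WHERE filters rows (aliases not visible yet)
group by customer_id               -- 3. GROUP BY
having count(*) > 1                -- 5. HAVING filters groups
order by n_orders desc             -- 7. ORDER BY may use the alias
limit 10;                          -- 8. LIMIT runs last
```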
- There are 3 places to put filtering expressions: where, having, and case/end inside an aggregation function
- Pivot table created by SQL: using case/end inside an aggregation function with group by
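A minimal pivot sketch with hypothetical data: each case/end inside an aggregation function becomes one output column, one per status value.

```sql
create table orders (customer_id integer, status varchar, amount decimal(10, 2));
insert into orders values (1, 'paid', 100), (1, 'refund', 30), (2, 'paid', 50);

-- one row per customer, one column per status
select
    customer_id,
    sum(case when status = 'paid'   then amount end) as paid_amount,
    sum(case when status = 'refund' then amount end) as refund_amount
from orders
group by customer_id;
```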
- Conceptual, Logical, And Physical Data Models
- lateral join
- window function: running sum, delta, rank
- over
  - running => with order by, the default window frame is range between unbounded preceding and current row
  - moving => without order by, the default window frame is rows between unbounded preceding and unbounded following
- aggregate functions, ranking functions, analytic functions
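A sketch of the three cases on a hypothetical table, relying on the default frames described above:

```sql
create table sales (d date, amount integer);
insert into sales values
    (date '2024-01-01', 10),
    (date '2024-01-02', 20),
    (date '2024-01-03', 30);

-- running sum: order by present, so the frame is
-- range between unbounded preceding and current row
select d, sum(amount) over (order by d) as running_total from sales;

-- no order by: the frame is the whole partition, i.e. a per-row grand total
select d, sum(amount) over () as grand_total from sales;

-- delta against the previous row, and rank
select d,
       amount - lag(amount) over (order by d) as delta,
       rank() over (order by amount desc) as amount_rank
from sales;
```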
- date spine & generate_series
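A date-spine sketch, assuming duckdb's table-function form of generate_series; the date range is arbitrary:

```sql
select gs.generate_series as day
from generate_series(date '2024-01-01', date '2024-01-07', interval 1 day) as gs;
-- left join sparse fact rows onto this spine so days with no rows still appear
```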
- ARRAY_AGG with GROUP BY:
```sql
CREATE TABLE bar AS SELECT * FROM ( VALUES
    (1, 2, 3),
    (1, 2, 4),
    (1, 2, 5),
    (2, 2, 3),
    (2, 2, 4),
    (2, 3, 5)
) AS t(x,y,z);

select x, (array_agg(y))[3] from bar group by x;
```

=>

```
x │ array_agg
──┼───────────
1 │ {2,2,2} -> 2
2 │ {2,2,3} -> 3
```

(Without an order by inside array_agg, the element order is not guaranteed; here it happens to follow insertion order.)
- dbt test & dbt test --select
- 4 types of tests: unique, not_null, accepted_values, relationships
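All four test types can be declared in one schema file; the model and column names below are hypothetical:

```yaml
# models/schema.yml (hypothetical model and column names)
version: 2
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['paid', 'refund']
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```

Run them with dbt test, or narrow to one model with dbt test --select stg_orders.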
- dbt snapshot
- jinja
- if
- set
- for
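A sketch of a hypothetical macro exercising set, for, and if; it expands a list of statuses into pivot columns like the case/end pattern above:

```sql
-- macros/sum_by_status.sql (hypothetical macro)
{% macro sum_by_status(statuses) %}
    {% for s in statuses %}
    sum(case when status = '{{ s }}' then amount end) as {{ s }}_amount
    {%- if not loop.last %},{% endif %}
    {% endfor %}
{% endmacro %}

-- usage in a model:
-- {% set statuses = ['paid', 'refund'] %}
-- select customer_id, {{ sum_by_status(statuses) }}
-- from {{ ref('stg_orders') }} group by customer_id
```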
- install plugins -> dbt deps (dbt-labs/dbt_utils, dbt-labs/codegen)
- write your own macro
- manage database UDF by dbt
- SQL
- Jinja/Macro
- dbt test
- Metabase - visualization & dashboard
- SQL tutorial