The modern data stack recommended by REPLWARE has the following main characteristics:
- ELT over ETL
- SQL based analytics over non-SQL based analytics
- Analytic Engineer as a new position
- When the data does not exceed 1 TB, your desktop/notebook is fast enough.
References:
- ELT方法的變革對數據驅動營運的重要性
- Holistics Analytics Setup Guidebook
- The Analytics Engineering Guide: build a data team
- Big Data is Dead
When an analyst works in a spreadsheet, every new column pulled out to the right is one act of data modeling. However, each such step can only express a single dimension, so as a modeling vocabulary it is quite limited. With SQL, on the other hand, every query is one act of data modeling, and since the result is a two-dimensional table, the expressive power is much richer.
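A minimal sketch of the contrast, using a hypothetical orders table: one SQL query can model a measure against two dimensions at once, where a spreadsheet column adds only one dimension at a time.

```sql
-- hypothetical table: region and order_date are dimensions, amount is a measure
create table orders (region varchar, order_date date, amount decimal(10, 2));

-- one query = one act of data modeling, covering two dimensions at once
select
    region,
    order_date,
    sum(amount) as total_amount
from orders
group by region, order_date;
```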
- git
- SQL formatter for Java
- duckdb ;; data warehouse
- dbt ;; Transform (the T in ELT)
- metabase ;; Dashboard & Report
- database schema graph ;; Database Schema Graph
- Optional - nvim (for editing yaml and sql)
- config git
- config nvim
- Install pyenv (dbt needs pip and python)
- Install dbt
- Install the command-line duckdb
Optional
- Read the references in the Main features section
- https://www.startdataengineering.com/post/dbt-data-build-tool-tutorial/
Try out the jaffle_shop example project.
- dbt init -> generates ~/.dbt/profiles.yml and the project_name directory
- duckdb -> connect to the database: duckdb dbt.duckdb
- After connecting to duckdb, try a few commands:

```sql
select * from duckdb_tables;
select * from duckdb_views;
select * from duckdb_schemas;
describe [TABLE]            -- show the definition of TABLE
.read [SQL_CMD_FILE_NAME]   -- run the SQL commands in a file
.exit                       -- leave the duckdb shell
```
Example content of ~/.dbt/profiles.yml:

```yaml
duck:
  outputs:
    dev:
      type: duckdb
      path: /Users/laurencechen/analytics/duck/dbt.duckdb
  target: dev
```
- dbt seed
- dbt seed --full-refresh
- source
- useful SQL commands: CREATE TABLE, DROP TABLE, CREATE SCHEMA
- other valid ways to load data into duckdb
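A sketch of loading external files into duckdb; the file names here are hypothetical, while read_csv_auto and read_parquet are duckdb table functions.

```sql
-- file names are hypothetical placeholders
create table raw_customers as select * from read_csv_auto('customers.csv');
create table raw_events    as select * from read_parquet('events.parquet');

-- INSERT INTO ... SELECT also works for an existing table:
insert into raw_customers select * from read_csv_auto('more_customers.csv');
```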
- The concept of primary key
- 3 types of table relationships: 1-to-1, 1-to-m, m-to-m
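The 1-to-m case can be sketched with a foreign key, and m-to-m with a junction table (table and column names are hypothetical):

```sql
create table customers (
    customer_id integer primary key,
    name        varchar
);

-- 1-to-m: many orders point at one customer
create table orders (
    order_id    integer primary key,
    customer_id integer references customers (customer_id),
    amount      decimal(10, 2)
);

-- m-to-m: a junction table pairs the two sides
create table order_items (
    order_id   integer,
    product_id integer,
    primary key (order_id, product_id)
);
```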
- Write dbt models: use ref and source
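A sketch of two hypothetical dbt models: source() reads raw data declared in a sources .yml file, while each ref() points at another model and adds an edge to the DAG that dbt uses to order the builds.

```sql
-- models/staging/stg_orders.sql (hypothetical model)
select order_id, customer_id, amount
from {{ source('raw', 'orders') }}

-- models/marts/customer_orders.sql (hypothetical model)
select customer_id, sum(amount) as total_amount
from {{ ref('stg_orders') }}
group by customer_id
```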
- Use dbt docs to view the DAG
- How to do modeling with SQL views
- Integration with dbt
- Basic skill: dimension and measure
- Metabase specific high-level semantic: segments and metrics
- Core func 1: Visualization
- Core func 2: Dashboard
- Optional: Automation
Command to run Metabase:

```shell
java -jar metabase.jar
```
- 4 basic SQL syntax forms: select *, select columns, where, join
- explaining join
- group by, having, order by, aggregation functions
- 1Keydata SQL tutorial
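The four basics above can be shown in one place; the tables are hypothetical:

```sql
create table customers (customer_id integer, name varchar);
create table orders (order_id integer, customer_id integer, amount decimal(10, 2));

select * from orders;                      -- select *

select o.order_id, c.name                  -- select columns
from orders o
join customers c                           -- join
  on o.customer_id = c.customer_id
where o.amount > 100;                      -- where
```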
- 4 common usages of distinct: distinct, distinct on, is distinct from, distinct in aggregation functions
- Query Process Steps
- Getting Data (From, Join)
- Row Filter (Where)
- Grouping (Group by)
- Aggregate function
- Group Filter (Having)
- Window Function
- SELECT
- Distinct
- Union
- Order by
- Offset
- Limit/Fetch/Top
- NULL, IS NULL, IS NOT NULL, COALESCE
- UNION, UNION ALL
- GROUPING SET, ROLLUP, CUBE
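The process steps above can be annotated on a single query; the table and values are hypothetical, and the numbers mark the logical evaluation order (which is why where cannot see the select alias but order by can):

```sql
create table orders (customer_id integer, status varchar, amount decimal(10, 2));

select customer_id,
       count(*) as n_orders        -- 4. aggregate / 6. SELECT defines the alias
from orders                        -- 1. FROM
where status = 'paid'              -- 2. WHERE filters rows (aliases not visible yet)
group by customer_id               -- 3. GROUP BY
having count(*) > 1                -- 5. HAVING filters groups
order by n_orders desc             -- 7. ORDER BY may use the alias
limit 10;                          -- 8. LIMIT runs last
```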
- There are 3 places to put filtering expressions: where, having, and case/end inside an aggregation function
- Pivot table created by SQL: using case/end inside an aggregation function with group by
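A minimal pivot sketch with hypothetical data: each case/end inside an aggregation function becomes one output column, one per status value.

```sql
create table orders (customer_id integer, status varchar, amount decimal(10, 2));
insert into orders values (1, 'paid', 100), (1, 'refund', 30), (2, 'paid', 50);

-- one row per customer, one column per status
select
    customer_id,
    sum(case when status = 'paid'   then amount end) as paid_amount,
    sum(case when status = 'refund' then amount end) as refund_amount
from orders
group by customer_id;
```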
- Conceptual, Logical, And Physical Data Models
- lateral join
- window function: running sum, delta, rank
- over
  - running => with order by, the default window frame is range between unbounded preceding and current row
  - moving => without order by, the default window frame is rows between unbounded preceding and unbounded following
- aggregate functions, ranking functions, analytic functions
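A sketch of the three cases on a hypothetical table, relying on the default frames described above:

```sql
create table sales (d date, amount integer);
insert into sales values
    (date '2024-01-01', 10),
    (date '2024-01-02', 20),
    (date '2024-01-03', 30);

-- running sum: order by present, so the frame is
-- range between unbounded preceding and current row
select d, sum(amount) over (order by d) as running_total from sales;

-- no order by: the frame is the whole partition, i.e. a per-row grand total
select d, sum(amount) over () as grand_total from sales;

-- delta against the previous row, and rank
select d,
       amount - lag(amount) over (order by d) as delta,
       rank() over (order by amount desc) as amount_rank
from sales;
```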
- date spine & generate_series
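A date-spine sketch, assuming duckdb's table-function form of generate_series; the date range is arbitrary:

```sql
select gs.generate_series as day
from generate_series(date '2024-01-01', date '2024-01-07', interval 1 day) as gs;
-- left join sparse fact rows onto this spine so days with no rows still appear
```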
- ARRAY_AGG with GROUP BY:
```sql
CREATE TABLE bar AS SELECT * FROM ( VALUES
    (1, 2, 3),
    (1, 2, 4),
    (1, 2, 5),
    (2, 2, 3),
    (2, 2, 4),
    (2, 3, 5)
) AS t(x,y,z);

select x, (array_agg(y))[3] from bar group by x;
```

=>

```
x │ array_agg
──┼───────────
1 │ {2,2,2} -> 2
2 │ {2,2,3} -> 3
```

(Without an order by inside array_agg, the element order is not guaranteed; here it happens to follow insertion order.)
- dbt test & dbt test --select
- 4 types of tests: unique, not_null, accepted_values, relationships
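All four test types can be declared in one schema file; the model and column names below are hypothetical:

```yaml
# models/schema.yml (hypothetical model and column names)
version: 2
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['paid', 'refund']
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```

Run them with dbt test, or narrow to one model with dbt test --select stg_orders.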
- dbt snapshot
- jinja
- if
- set
- for
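A sketch of a hypothetical macro exercising set, for, and if; it expands a list of statuses into pivot columns like the case/end pattern above:

```sql
-- macros/sum_by_status.sql (hypothetical macro)
{% macro sum_by_status(statuses) %}
    {% for s in statuses %}
    sum(case when status = '{{ s }}' then amount end) as {{ s }}_amount
    {%- if not loop.last %},{% endif %}
    {% endfor %}
{% endmacro %}

-- usage in a model:
-- {% set statuses = ['paid', 'refund'] %}
-- select customer_id, {{ sum_by_status(statuses) }}
-- from {{ ref('stg_orders') }} group by customer_id
```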
- install plugins -> dbt deps (dbt-labs/dbt_utils, dbt-labs/codegen)
- write your own macro
- manage database UDF by dbt
- SQL
- Jinja/Macro
- dbt test
- Metabase - visualization & dashboard
- SQL tutorial