```sql
select * from `region-here`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
where job_id = "bquxjob_blah_blah"
```

```sql
select
  table_type, table_name
  --, count(*)
from `<project AKA catalog>.<dataset AKA schema>`.INFORMATION_SCHEMA.TABLES
where table_name like '%_raw'
--group by table_type
```

```sql
select * from `region-australia-southeast1`.INFORMATION_SCHEMA.TABLE_STORAGE_BY_PROJECT
```

BigQuery internals:
- Capacitor - proprietary columnar format
- Colossus - distributed storage
- Dremel - distributed processing engine
- Jupiter - petabit network
Query lifecycle:
- API request management -> Lexing & parsing SQL -> Query planning -> Query plan execution -> Finalize results
- Lexing & parsing SQL <-> Catalog resolution
- Query planning <-> Catalog resolution
- Query plan execution <-> Scheduling and dynamic planning
Summary of https://youtu.be/iz6lxi9BczA?feature=shared
- fewer columns: minimize columnstore bytes scanned; use `select * except(...)` to drop what you don't need (sketch below)
- ensure predicates align to partitions & clustering; i.e. partition and cluster on common join & where clause columns (example below)
  - clustering improves efficiency even more than partitioning
- consider late aggregation (shuffle as few times and as late as possible; example below)
  - Although, filtering early is often better
- Avoid group by (and therefore a shuffle) by nesting repeated values, e.g. use `array_length` instead of counts (example below)
- Sometimes repeating a where clause filter for the same column in different tables helps the query planner; don't rely on the join to do this (aka star-join pre-filter optimization works on BQ; example below)
- Use the largest table first in the joins, then decreasing size. This helps with broadcast joins
- Put the most selective filter (the one that eliminates the most data) first in the where clause
- Order by & limit happen last, using the final single slot, so avoid using them
  - Use the preview option instead of a limit query when exploring the data
- use execution details to look at the plan & graph; look for:
  - dominating stages
  - skew (consider rebalancing partitions & clustering)
  - cpu time (consider approximate functions)
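
A minimal sketch of the fewer-columns tip; the table and the wide `payload` / `raw_json` columns are hypothetical, not from the video:

```sql
-- Scan fewer columnstore bytes by excluding wide columns you don't need,
-- instead of select * reading every column.
select * except (payload, raw_json)
from `mydataset.events`;
```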
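A sketch of aligning predicates to partitioning & clustering, assuming a hypothetical `mydataset.orders` table partitioned on `order_date` and clustered on `customer_id`:

```sql
-- The order_date filter prunes partitions; the customer_id filter lets
-- BigQuery skip blocks using clustering metadata.
select order_id, total
from `mydataset.orders`
where order_date between '2024-01-01' and '2024-01-31'
  and customer_id = 'C-1234';
```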
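One reading of the late-aggregation tip, sketched with assumed tables: filter early, join the slimmed inputs, and group once at the end rather than aggregating in every subquery:

```sql
-- Filter first (cheap), join, then aggregate once in the final stage,
-- avoiding aggregate -> join -> re-aggregate shuffles.
select c.region, sum(o.total) as revenue
from `mydataset.orders` o
join `mydataset.customers` c using (customer_id)
where o.order_date >= '2024-01-01'
group by c.region;
```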
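A sketch of the nested-repeated-values tip, assuming `items` is a repeated (ARRAY) column on a hypothetical orders table:

```sql
-- No group by, so no shuffle: each row already nests its repeated items,
-- and counting them is a per-row array_length instead of a count + group by.
select order_id, array_length(items) as item_count
from `mydataset.orders`;
```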
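A sketch of the star-join pre-filter tip, with hypothetical fact and dimension tables: the same filter is written against both sides rather than relying on the join to propagate it:

```sql
-- Repeating the date_key filter on both tables lets each scan be pruned
-- up front, instead of one side being filtered only after the join.
select f.store_id, sum(f.amount) as total
from `mydataset.sales` f
join `mydataset.dates` d on f.date_key = d.date_key
where f.date_key between 20240101 and 20240131
  and d.date_key between 20240101 and 20240131
group by f.store_id;
```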
Partitions should be > 1 GB each; partition on a low-cardinality column. Types allowed for partitioning:
- Ingestion timestamp
- A date/time column, down to hourly granularity
- Integer (range partitioning)
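
A minimal DDL sketch of a partitioned table, with hypothetical names:

```sql
-- Daily partitions on a date column; use timestamp_trunc(ts, hour) for
-- hourly, or range_bucket(...) for integer-range partitioning.
create table `mydataset.orders_partitioned`
partition by order_date
as
select * from `mydataset.orders`;
```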
Cluster on up to 4 columns; clustering physically sorts the data (within partitions), and BigQuery automatically reclusters to maintain the sort order. Clustering can reduce data scanned, unless the underlying table changes frequently; it is probably better for performance than for cost.
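
A minimal DDL sketch combining partitioning and clustering, again with hypothetical names:

```sql
-- Partition on date, then physically sort within each partition by the
-- common join/filter columns; BigQuery reclusters automatically on writes.
create table `mydataset.orders_clustered`
partition by order_date
cluster by customer_id, store_id
as
select * from `mydataset.orders`;
```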