gcp

Gist by @davehowell, last active January 16, 2025.

Execution details

select * from `region-here`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
where job_id = "bquxjob_blah_blah"

list tables

select
  table_type, table_name
  --, count(*)
from `<project AKA catalog>.<dataset AKA schema>`.INFORMATION_SCHEMA.TABLES
where table_name like '%_raw'
--group by table_type

list tables with size

select * from `region-australia-southeast1`.INFORMATION_SCHEMA.TABLE_STORAGE_BY_PROJECT
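As a sketch, that same view can be sorted to surface the biggest tables first; the byte columns are assumed from the TABLE_STORAGE schema, and the region qualifier is a placeholder for wherever your data lives:

```sql
-- Largest tables by logical size in one region
SELECT
  table_schema AS dataset,
  table_name,
  ROUND(total_logical_bytes / POW(1024, 3), 2)  AS logical_gib,
  ROUND(total_physical_bytes / POW(1024, 3), 2) AS physical_gib
FROM `region-australia-southeast1`.INFORMATION_SCHEMA.TABLE_STORAGE_BY_PROJECT
ORDER BY total_logical_bytes DESC
LIMIT 20;
```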

apply clustering with table swap

https://cloud.google.com/bigquery/docs/manage-partition-cluster-recommendations#apply_cluster_recommendations
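A minimal sketch of the table-swap approach (all names are placeholders; note that partitioning must be re-declared on the new table or it is lost):

```sql
-- 1. Rebuild the table with the desired clustering
CREATE TABLE `my_project.my_dataset.my_table_new`
PARTITION BY DATE(event_ts)
CLUSTER BY customer_id
AS SELECT * FROM `my_project.my_dataset.my_table`;

-- 2. Swap: drop (or rename away) the old table, then rename the new one
DROP TABLE `my_project.my_dataset.my_table`;
ALTER TABLE `my_project.my_dataset.my_table_new` RENAME TO `my_table`;
```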

Architecture

  • Capacitor - proprietary columnar format
  • Colossus - distributed storage
  • Dremel - distributed processing engine
  • Jupiter - petabit network

Query lifecycle

API request management -> Lexing & parsing SQL -> Query planning (<-> Catalog resolution) -> Query plan execution (<-> Scheduling and dynamic planning) -> Finalize results

Query optimisation

Summary of https://youtu.be/iz6lxi9BczA?feature=shared

  • fewer columns: minimize columnstore bytes scanned, e.g. use `SELECT * EXCEPT (...)`

  • ensure predicates align to partitions & clustering; i.e. partition and cluster on common join & where-clause columns

    • clustering improves efficiency even more than partitioning
  • consider late aggregation (shuffle as few times and as late as possible)

    • although filtering early is often better
  • avoid group by (and therefore shuffle) by nesting repeated values, e.g. use array_length instead of counts

  • sometimes repeating a where-clause filter for the same column on different tables helps the query planner; don't rely on the join to do this (i.e. star-join pre-filter optimization works on BQ)

  • use the largest table first in the joins, then decreasing size; this helps with broadcast joins

  • put the most selective filter (the one that eliminates the most data) first in the where clause

  • order by & limit happen last, on the final single slot, so avoid using them

    • use the preview option instead of a limit query when exploring the data
  • use execution details to look at the plan & graph; look for

    • dominating stages

    • skew (consider rebalancing partitions & clustering)

    • cpu time (consider approximate functions)
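A couple of the tips above, sketched as BigQuery SQL (table and column names are hypothetical):

```sql
-- Fewer columns: skip wide payload columns rather than SELECT *
SELECT * EXCEPT (raw_payload, debug_blob)
FROM `my_project.my_dataset.events`
WHERE event_date = '2025-01-01';   -- predicate aligned to the partition column

-- Approximate functions: cheaper than an exact COUNT(DISTINCT ...)
SELECT
  event_date,
  APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM `my_project.my_dataset.events`
WHERE event_date >= '2025-01-01'
GROUP BY event_date;
```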

partitioning & clustering

Partitions should be > 1GB; partition on a low-cardinality column. Types allowed for partitioning:

  • Ingestion timestamp
  • A date/time column, down to 1-hour granularity
  • Integer range

Cluster on up to 4 columns; it physically sorts rows (within partitions). BigQuery automatically reclusters to maintain sort order.
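A minimal DDL sketch of a partitioned & clustered table (all names are placeholders):

```sql
CREATE TABLE `my_project.my_dataset.events`
(
  event_ts    TIMESTAMP,
  customer_id STRING,
  country     STRING
)
PARTITION BY DATE(event_ts)          -- daily partitions from the timestamp
CLUSTER BY customer_id, country;     -- up to 4 clustering columns; order matters
```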

Materialized view

could reduce data scanned, unless underlying table is frequently changing - this is probably better for performance than cost
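A sketch of a materialized view over a hypothetical events table, pre-aggregating a daily count so queries against it scan far less data:

```sql
CREATE MATERIALIZED VIEW `my_project.my_dataset.daily_events` AS
SELECT
  DATE(event_ts) AS event_date,
  COUNT(*) AS event_count
FROM `my_project.my_dataset.events`
GROUP BY event_date;
```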

check gcloud logins

gcloud auth list

To login, run:

    gcloud auth login <email-account>

get GKE temporary creds

    gcloud container clusters get-credentials --project=<project_id> --region=<region> <cluster-name>
## Compute Engine
MIG - managed instance groups (for autoscaling)
- lift n shift, where you don't want to refactor
## Kubernetes Engine GKE
- managed, cloud-native K8s
- supports complex configurations
- capable, but complex, e.g. complex manifests
- autopilot helps scale-on-demand
- Pods, Deployments, ReplicaSets, DaemonSets
## Cloud Functions
## App Engine
- GCP's first serverless offering (so it's kind of legacy)
- limited language runtime support
- particular structure and boilerplate required
## Cloud Run
- fully managed, for simpler container use cases
- narrow set of use cases, limited
- good for synchronous event-driven applications and microservices
- also for async, events from http, gRPC, Pub/Sub, Cloud Scheduler, Cloud Tasks
- Stateless, i.e. if you need db or backend then use GKE instead
## Google Cloud SQL
- Managed
- MySQL, Postgres, MSSQL
## Google Cloud Spanner
- Compute nodes + distributed storage using Colossus (distributed, replicated file system)
- multilang support: C#, C++, Golang, Java, Node.js, PHP, Python, Ruby
- handles burst traffic
- proprietary, write API is not just SQL
- Scalable, but expensive
- minimum of 3 nodes
## BigTable
- Good for timeseries, IoT
## BigQuery
- Built on... see https://cloud.google.com/blog/products/data-analytics/new-blog-series-bigquery-explained-overview
- Dremel (compute), sql -> exec trees. Leaves are "slots", branches are "mixers" (aggregations and calculations)
- Jupiter (petabit network for shuffling)
- Colossus (global columnar storage optimised with nested type compression), successor to GFS. Also used by Spanner. "Curator" (scalable metadata)
- internal format is "Capacitor" see https://cloud.google.com/blog/products/bigquery/inside-capacitor-bigquerys-next-generation-columnar-storage-format
- Borg (orchestrator, a precursor to Kubernetes), allocates hardware resources
## Anthos
- orchestration and management layer: scalable, managed k8s (GKE) & VMs
- also supports hybrid & multi-cloud (Google Distributed Cloud) - onprem / edge / cloud , includes hardware & software
## BeyondCorp
- zero trust security model
- not just network-based, includes MFA and managed device policy awareness
- enables security outside a VPN (really? ... if it's good enough for GOOG then I guess)
## Sensitive Data Protection SDP
- Cloud Data Loss Prevention and the DLP API are now part of SDP
- auto profiling on bigquery
- can feed to "Chronicle" and Security Command Center
- classify & de-identify / masking
## Storage
- blob - Cloud Storage (GCS) - global edge caching, cheap archiving
- block (persistent disk, or local SSD)
- filestore (share between apps, managed, scalable, predictable)
- this is mountable NFS, used for k8s or where shared proper filesystem required
- Firestore (mobile apps), the successor to Cloud Datastore
- artifact registry ( containers, OS & library packages)
- workspace storage ( google workspace essentials, i.e. g-suite stuff)
- data transfer ( services (DTS) or transfer appliance )
- Storage tiers: standard, nearline (access monthly), coldline (access quarterly), archive (access yearly)
## Security foundations blueprint
- defense in depth, at scale, by default
- BeyondProd (the service-to-service counterpart of BeyondCorp, which covers users & devices)
- shared fate relationship (not just shared responsibility)
## Networking
Edge Nodes / Edge point of presence
Zonal
Regional (at least 3 zones)
Multi-Regional
Global
## Connectivity
- Cloud interconnect (3 options):
- Dedicated interconnect - private IP address space connectivity, you manage the equipment at a colo G PoP
- Partner interconnect - private IP address space connectivity, ISP manages equipment at colo
- Cross-cloud - between clouds e.g. Azure <-> GCP
- Peering - for google workspace applications (AKA G-Suite: gmail, drive, docs, sheets, etc)
- Carrier peering - public IP connectivity, ISP manages the equipment at colo
- Direct Peering
- MPLS
- CE customer edge
- PE provider edge
- AC attachment circuit - physical or virtual circuit attaching a CE to a PE
- PW pseudowire, a bidirectional virtual connection between two PEs
## Identity, role, privilege
Members (Principals/Entity)
|- user account
|- service account
Roles
|- basic
|- predefined
|- custom
Members have Roles which provide privileges for resources within a project
CLIs - gcloud, gsutil, bq, kubectl, etc

Anonymous users from all over the world access a public health information website hosted in an on-premises EHR data center. The servers that host this website are older, and users are complaining about sluggish response times. There has also been a recent increase of distributed denial-of-service attacks toward the website. The attacks always come from the same IP address ranges. EHR management has identified the public health information website as an easy, low risk application to migrate to Google Cloud. You need to improve access latency and provide a security solution that will prevent the denial-of-service traffic from entering your Virtual Private Cloud (VPC) network. What should you do?

A. Deploy an external HTTP(S) load balancer, configure VPC firewall rules, and move the applications onto Compute Engine virtual machines.

B. Deploy an external HTTP(S) load balancer, configure Google Cloud Armor, and move the application onto Compute Engine virtual machines.

C. Containerize the application and move it into Google Kubernetes Engine (GKE). Create a GKE service to expose the pods within the cluster, and set up a GKE network policy.

D. Containerize the application and move it into Google Kubernetes Engine (GKE). Create an internal load balancer to expose the pods outside the cluster, and configure Identity-Aware Proxy (IAP) for access.

A is not correct because firewall rules do not block malicious traffic into a VPC but rather block it at the VM level.

B is correct because the external HTTP(S) load balancer will improve access latency and Cloud Armor can be configured to block the Distributed Denial-of-Service (DDoS) attack.

C is not correct because a GKE service does not expose a set of pods outside of a cluster and a GKE network policy only filters traffic between pods and services.

D is not correct because a GKE internal load balancer will not load balance external traffic and anonymous users need access to the website so IAP is not a fit.

Format bytes

Spreadsheet custom number formats only support up to 3 conditional sections:

    [<1000000]0.00," KB";[<1000000000]0.00,," MB";0.00,,," GB"

    [<100000]0.00," KB";[<100000000]0.00,," MB";0.00,,," GB"
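As a rough Python sketch of what those format strings do (each trailing comma in a section scales the value down by 1,000; the thresholds below follow the first format string above):

```python
def format_bytes(n: float) -> str:
    """Mimic the 3-section format:
    [<1000000]0.00," KB";[<1000000000]0.00,," MB";0.00,,," GB"
    """
    if n < 1_000_000:
        return f"{n / 1_000:.2f} KB"        # one comma: divide by 1,000
    if n < 1_000_000_000:
        return f"{n / 1_000_000:.2f} MB"    # two commas: divide by 1,000,000
    return f"{n / 1_000_000_000:.2f} GB"    # three commas: divide by 1e9
```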
