jamiekt / get-role-members.sh
Last active May 27, 2022 15:42
Useful gcloud IAM commands
gcloud projects get-iam-policy project-id \
  --filter="bindings.role:roles/workflows.editor" \
  --flatten="bindings[].members" \
  --format="table(bindings.members)"
#### output
# group:[email protected]
# user:[email protected]
# serviceAccount:[email protected]
jamiekt / gist:d09e5c45afc4bc50aba39a0fcc6c860f
Created May 10, 2021 21:22
terraform debug output when attempting to deploy bigquery table with a policy tag
2021/05/10 21:13:23 [WARN] Provider "registry.terraform.io/hashicorp/google" produced an invalid plan for google_bigquery_table.table, but we are tolerating it because it is using the legacy plugin SDK.
The following problems may be the cause of any confusing errors from downstream operations:
- .schema: planned value cty.StringVal("[{\"mode\":\"NULLABLE\",\"name\":\"col1\",\"policyTags\":\"projects/msm-groupdata-datalake-dev/locations/eu/taxonomies/7353867875344393839/policyTags/5061567500970922730\",\"type\":\"INTEGER\"}]") does not match config value cty.StringVal("[\n {\n \"name\": \"col1\",\n \"type\": \"INTEGER\",\n \"mode\": \"NULLABLE\",\n \"policyTags\": \"projects/msm-groupdata-datalake-dev/locations/eu/taxonomies/7353867875344393839/policyTags/5061567500970922730\"\n }\n]\n")
google_bigquery_table.table: Creating...
2021/05/10 21:13:23 [DEBUG] EvalApply: ProviderMeta config value set
2021/05/10 21:13:23 [DEBUG] google_bigquery_table.table: applying the pl
jamiekt / callBQStoredProc.yaml
Created April 15, 2021 16:35
Google Cloud Workflows subworkflow for calling BigQuery stored procedures
callBQStoredProc:
  params: [stored_proc_name, project_id, poll_wait_time_seconds: 5, bq_location: EU]
  steps:
    - init:
        assign:
          - job:
              jobReference:
                location: $${bq_location}
              configuration:
                query:
gcloud init  # choose the project that you will be deploying to
gcloud services enable dataproc.googleapis.com sqladmin.googleapis.com
export GCP_PROJECT=$(gcloud config get-value project)
gsutil mb gs://${GCP_PROJECT}-tf-state  # terraform state bucket used as the back-end for the Google provider
make init apply
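The Makefile itself isn't shown here; a plausible sketch of what `make init apply` wraps, assuming the Terraform configuration declares a backend "gcs" block pointing at the bucket created above:

# hypothetical equivalent of `make init apply`
terraform init -backend-config="bucket=${GCP_PROJECT}-tf-state"
terraform apply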
jamiekt / view2_again.tf
Created January 25, 2019 18:22
view2 with interpolation expression
resource "google_bigquery_table" "view2"{
dataset_id = "${google_bigquery_dataset.dataset1.dataset_id}"
table_id = "view2"
view {
query = "select * from `${google_bigquery_table.view1.project}.${google_bigquery_table.view1.dataset_id}.${google_bigquery_table.view1.table_id}`"
use_legacy_sql = false
}
}
jamiekt / view2.tf
Last active January 25, 2019 18:51
Using the depends_on attribute
resource "google_bigquery_table" "view2"{
dataset_id = "${google_bigquery_dataset.dataset1.dataset_id}"
table_id = "view2"
view {
query = "select * from `${var.project}.dataset1.view1`"
use_legacy_sql = false
depends_on = ["google_bigquery_table.view1"]
}
}
jamiekt / main.tf
Last active January 25, 2019 18:15
terraform configuration to deploy two views to bigquery
resource "google_bigquery_dataset" "dataset1"{
dataset_id = "dataset1"
}
resource "google_bigquery_table" "view1"{
dataset_id = "${google_bigquery_dataset.dataset1.dataset_id}"
table_id = "view1"
view {
query = "select 1 as col union all select 2 union all select 3"
jamiekt / Dockerfile
Created September 18, 2018 10:34
Dockerfile to install sklearn
FROM ubuntu:18.04
RUN apt-get update
RUN apt-get upgrade -y
RUN apt-get install -y python python-pip
RUN pip install scikit-learn  # scikit-learn is the actual PyPI package name
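A quick smoke test of the image; the tag name is illustrative:

docker build -t sklearn-ubuntu .
docker run --rm sklearn-ubuntu python -c "import sklearn; print(sklearn.__version__)"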
jamiekt / derive_products_tally_in_basket_distribution.py
Last active July 18, 2018 07:52
PySpark code to derive a histogram of number of products in a basket
from pyspark.sql.functions import col, countDistinct, lit
# df is a Spark DataFrame: DataFrame[basket: string, product: string, customer: string, store: string]
baskets_tally = df.groupBy().agg(countDistinct(col('basket'))).collect()[0][0]
df = df.groupBy(col('basket')).count().withColumnRenamed('count', 'tally_of_products_per_basket')
df = df.groupBy("tally_of_products_per_basket") \
.count() \
.withColumnRenamed('count', 'tally_of_baskets_containing_products_tally') \
.orderBy(col("tally_of_products_per_basket").asc())
df = df.withColumn(
    'fraction_of_baskets_containing_products_tally',
    col('tally_of_baskets_containing_products_tally') / lit(baskets_tally))

An example of how to count distinct values in a column using pyspark
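The code for that gist isn't shown in this listing; a minimal self-contained sketch of the technique, with a toy DataFrame standing in for the real data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('b1',), ('b1',), ('b2',)], ['basket'])
# count distinct values in the 'basket' column
print(df.agg(countDistinct('basket')).collect()[0][0])  # 2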