jamiekt / get-role-members.sh
Last active May 27, 2022 15:42
Useful gcloud IAM commands
gcloud projects get-iam-policy project-id \
  --filter="bindings.role:roles/workflows.editor" \
  --flatten="bindings[].members" \
  --format="table(bindings.members)"
#### output
# group:[email protected]
# user:[email protected]
# serviceAccount:[email protected]
jamiekt / gist:d09e5c45afc4bc50aba39a0fcc6c860f
Created May 10, 2021 21:22
terraform debug output when attempting to deploy bigquery table with a policy tag
2021/05/10 21:13:23 [WARN] Provider "registry.terraform.io/hashicorp/google" produced an invalid plan for google_bigquery_table.table, but we are tolerating it because it is using the legacy plugin SDK.
The following problems may be the cause of any confusing errors from downstream operations:
- .schema: planned value cty.StringVal("[{\"mode\":\"NULLABLE\",\"name\":\"col1\",\"policyTags\":\"projects/msm-groupdata-datalake-dev/locations/eu/taxonomies/7353867875344393839/policyTags/5061567500970922730\",\"type\":\"INTEGER\"}]") does not match config value cty.StringVal("[\n {\n \"name\": \"col1\",\n \"type\": \"INTEGER\",\n \"mode\": \"NULLABLE\",\n \"policyTags\": \"projects/msm-groupdata-datalake-dev/locations/eu/taxonomies/7353867875344393839/policyTags/5061567500970922730\"\n }\n]\n")
google_bigquery_table.table: Creating...
2021/05/10 21:13:23 [DEBUG] EvalApply: ProviderMeta config value set
2021/05/10 21:13:23 [DEBUG] google_bigquery_table.table: applying the pl
jamiekt / callBQStoredProc.yaml
Created April 15, 2021 16:35
Google Cloud Workflows subworkflow for calling BigQuery stored procedures
callBQStoredProc:
  params: [stored_proc_name, project_id, poll_wait_time_seconds: 5, bq_location: EU]
  steps:
    - init:
        assign:
          - job:
              jobReference:
                location: $${bq_location}
              configuration:
                query:
gcloud init  # choose the project that you will be deploying to
gcloud services enable dataproc.googleapis.com sqladmin.googleapis.com
export GCP_PROJECT=$(gcloud config get-value project)
gsutil mb gs://${GCP_PROJECT}-tf-state  # terraform state bucket used as the back-end for the Google provider
make init apply
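The Makefile itself isn't shown here; a plausible sketch of what `make init apply` wraps, assuming the Terraform configuration declares a backend "gcs" block pointing at the bucket created above:

# hypothetical equivalent of `make init apply`
terraform init -backend-config="bucket=${GCP_PROJECT}-tf-state"
terraform apply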
jamiekt / view2_again.tf
Created January 25, 2019 18:22
view2 with interpolation expression
resource "google_bigquery_table" "view2"{
dataset_id = "${google_bigquery_dataset.dataset1.dataset_id}"
table_id = "view2"
view {
query = "select * from `${google_bigquery_table.view1.project}.${google_bigquery_table.view1.dataset_id}.${google_bigquery_table.view1.table_id}`"
use_legacy_sql = false
}
}
jamiekt / view2.tf
Last active January 25, 2019 18:51
Using the depends_on attribute
resource "google_bigquery_table" "view2"{
dataset_id = "${google_bigquery_dataset.dataset1.dataset_id}"
table_id = "view2"
view {
query = "select * from `${var.project}.dataset1.view1`"
use_legacy_sql = false
depends_on = ["google_bigquery_table.view1"]
}
}
jamiekt / main.tf
Last active January 25, 2019 18:15
terraform configuration to deploy two views to bigquery
resource "google_bigquery_dataset" "dataset1"{
dataset_id = "dataset1"
}
resource "google_bigquery_table" "view1"{
dataset_id = "${google_bigquery_dataset.dataset1.dataset_id}"
table_id = "view1"
view {
query = "select 1 as col union all select 2 union all select 3"
jamiekt / Dockerfile
Created September 18, 2018 10:34
Dockerfile to install sklearn
FROM ubuntu:18.04
RUN apt-get update
RUN apt-get upgrade -y
RUN apt-get install -y python python-pip
RUN pip install scikit-learn  # scikit-learn is the actual PyPI package name
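A quick smoke test of the image; the tag name is illustrative:

docker build -t sklearn-ubuntu .
docker run --rm sklearn-ubuntu python -c "import sklearn; print(sklearn.__version__)"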
jamiekt / derive_products_tally_in_basket_distribution.py
Last active July 18, 2018 07:52
PySpark code to derive a histogram of number of products in a basket
from pyspark.sql.functions import col, countDistinct, lit
# df is a Spark DataFrame: DataFrame[basket: string, product: string, customer: string, store: string]
baskets_tally = df.groupBy().agg(countDistinct(col('basket'))).collect()[0][0]
df = df.groupBy(col('basket')).count().withColumnRenamed('count', 'tally_of_products_per_basket')
df = df.groupBy("tally_of_products_per_basket") \
.count() \
.withColumnRenamed('count', 'tally_of_baskets_containing_products_tally') \
.orderBy(col("tally_of_products_per_basket").asc())
df = df.withColumn(
    'fraction_of_baskets_containing_products_tally',
    col('tally_of_baskets_containing_products_tally') / lit(baskets_tally))

An example of how to count distinct values in a column using pyspark
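The code for that gist isn't shown in this listing; a minimal self-contained sketch of the technique, with a toy DataFrame standing in for the real data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('b1',), ('b1',), ('b2',)], ['basket'])
# count distinct values in the 'basket' column
print(df.agg(countDistinct('basket')).collect()[0][0])  # 2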