Skip to content

Instantly share code, notes, and snippets.

@slopp
Last active July 14, 2024 16:35
Show Gist options
  • Save slopp/d4091bd616bdc5e542457acd9b35c42b to your computer and use it in GitHub Desktop.
Save slopp/d4091bd616bdc5e542457acd9b35c42b to your computer and use it in GitHub Desktop.
BigQuery and Google Maps

Introduction

With BigQuery's new remote user defined functions (in preview) it is now possible to bring the power of Google Maps to your analytic data warehouse. Using Google Maps API endpoints in Cloud Functions called by BigQuery you can:

  • Geocode Addresses
  • Determine drive time distance between locations
  • Supplementing address or location data with Google Map's data such as elevation or place descriptions

By enriching location datasets in BigQuery you can accomplish advanced spatial analysis including:

  • Route and drive time optimization
  • Geographic clustering and cohort analysis

For example, the following query will calculate the drive time between a set of distribution centers and a list of users, allowing organizations to optimize distribution center deliveries.

WITH shipping AS (
    SELECT 
        CONCAT(users.first_name, " ", users.last_name) AS destination_name,
        CONCAT('{\"lat\": ', users.latitude, ', \"lng\": ', users.longitude, '}') AS destination,
        dc.name AS origin_name,
        CONCAT('{\"lat\": ', dc.latitude, ', \"lng\": ', dc.longitude, '}') AS origin
    FROM `looker-private-demo`.thelook.users
    CROSS JOIN `looker-private-demo`.thelook.distribution_centers AS dc
    WHERE users.state="Tennessee" 
)
SELECT 
    bq_udf_test.drivetime(shipping.origin, shipping.destination) AS best_drivetime,
    shipping.destination_name,
    shipping.origin_name
FROM shipping 
ORDER BY 1, 2

This example calculates 10K route times in approx 30 seconds!

Screen Shot 2022-03-10 at 10 37 03 AM

The supporting drivetime function is a very simple cloud function built around 2 Python functions, the functional psuedo-code is below:

# This function handles the formatting, accepting a BQ request, calling our get_drivetime function, and formatting the result
def drivetime_from_addresses(request):
    # initiate a googlemaps client using
    # a runtime environment variable that injects a cloud secret
    api_key = os.environ.get('MAPS_GEOCODE_API_KEY', 'No Maps API Key Available')
    gmaps = googlemaps.Client(key=api_key)

    # get the BQ request
    return_value = []
    request_json = request.get_json()
    rows = request_json['calls']
    # iterate through rows with two columns: origin, destination
    for row in rows:
      return_value.append(get_drivetime(row[0], row[1], gmaps))
    
    # format the result as BQ expects
    replies = [float(x) for x in return_value]
    return_json = json.dumps( { "replies" :  replies} )

    return(return_json)
    

# get drivetime in seconds for an origin, destination pair
def get_drivetime(origin, destination, gmaps):
    print(origin)
    print(destination)
    content = gmaps.distance_matrix([origin], [destination])

    drivetime_sec = content['rows'][0]['elements'][0]['duration']['value']
    return(json.dumps(drivetime_sec))

Step-by-Step Example

The following steps go into more detail, using an example that shows how to geocode addresses to retrieve their latitude and longitude.

  1. Create a BigQuery dataset including a table with address information, or use:
CREATE TABLE your_dataset.addresses AS
    SELECT 
        CONCAT(street, " ",
            SPLIT(client.address,'|')[OFFSET(0)], " ", 
            SPLIT(client.address,'|')[OFFSET(1)], " ",
            SPLIT(client.address,'|')[OFFSET(2)]
        ) AS address
    FROM `looker-private-demo.retail_banking.client` AS client
    LIMIT 100
  1. Enable the Google Maps API and create an API key of interest (e.g. the geocoding API). Add the API key as a Cloud Secret.

  2. Create and deploy a cloud function with the example main.py file and the requirements.txt file, enabling access to the Maps API key secret. Use lat_lon_from_addresses as the entrypoint.

Screen Shot 2022-02-24 at 2 03 44 PM

  1. Follow the steps to enable an external BigQuery connection, sourced from this document:
bq mk --connection --display_name='geocode fn conn' --connection_type=CLOUD_RESOURCE
      --project_id=my-project-id --location=US geocode_fn_conn
  1. Give the service account from step 4 the cloud invoker permission on your function from step 3

Screen Shot 2022-02-24 at 2 26 25 PM

  1. In BigQuery, register your cloud function as a user defined function:
CREATE FUNCTION `your-gcp-project.your_dataset`.geocode(call STRING) RETURNS STRING
REMOTE WITH CONNECTION `your-gcp-project.us.geocode_fn_conn`
OPTIONS (endpoint = 'https://your-cloud-function-endpoint-url/geocode')
  1. In BigQuery, use your new cloud function:
WITH result AS (
    SELECT 
        address, 
        `your-gcp-project.your_dataset`.geocode(address) AS location 
    FROM `your-gcp-project.your_dataset.addresses`
)
SELECT 
    address, 
    json_extract_scalar(location, '$.lat') AS lat, 
    json_extract_scalar(location, '$.lng') AS lng
FROM result 

Screen Shot 2022-02-24 at 2 38 34 PM

Screen Shot 2022-02-24 at 2 39 01 PM

This pattern is highly extensible to other data enrichment tasks. Update the Cloud Function to use any Maps API endpoint, or copy and deploy a second function to chain together tasks.

import requests
import urllib.parse
import json
import os
import googlemaps
# request_json = json.loads('{"requestId": "60f25f54-4b5c-4ecf-8abc-61a91fba693f", "caller": "//bigquery.googleapis.com/projects/lopp-looker-vertex/jobs/bquxjob_5eacdc93_17f2cffbbcc", "sessionUser": "[email protected]", "calls": [["1001 Soda Creek Rd Evergreen CO 80439"], ["300 W Labonte Dillon CO 80435"]]}')
# def print_request(request):
# request_json = request.get_json()
# calls = request_json['calls']
# print(request_json)
# for call in calls:
# print(call)
# reply = '200'
# return(json.dumps({'replies:': reply}))
# function takes an address as a string and a googlemaps client
# retuns the lat / lng for the address as a json string
# TODO error handling
# TODO handle addresses that map to more than one lat/lng
def get_latlon(address, gmaps):
# base_url = 'https://maps.googleapis.com/maps/api/geocode/json?address='
# url_address = base_url + urllib.parse.quote_plus(address)
# url_key = url_address + '&key=' + api_key
# resp = requests.get(url_key)
# content = json.loads(resp.content)
content = gmaps.geocode(address)
location = content[0]['geometry']['location']
return(json.dumps(location))
# functions accepts a BQ remote function request
# return a JSON array called replies with a json string with lat / long for each address
# TODO error handling
# TODO rate limiting API calls or BQ inbound request . . .
def lat_lon_from_addresses(request):
# initiate a googlemaps client using
# a runtime environment variable that injects a cloud secret
api_key = os.environ.get('MAPS_GEOCODE_API_KEY', 'No Maps API Key Available')
gmaps = googlemaps.Client(key=api_key)
return_value = []
request_json = request.get_json()
rows = request_json['calls']
for row in rows:
return_value.append(get_latlon(row[0], gmaps))
# format the result as BQ expects
replies = [str(x) for x in return_value]
return_json = json.dumps( { "replies" : replies} )
return(return_json)
# function takes two addresses as strings and a googlemaps client
# retuns the current drive time between the two addresses
# TODO error handling
# TODO handle addresses that map to more than one lat/lng
def get_drivetime(origin, destination, gmaps):
print(origin)
print(destination)
content = gmaps.distance_matrix([origin], [destination])
drivetime_sec = content['rows'][0]['elements'][0]['duration']['value']
return(json.dumps(drivetime_sec))
# functions accepts a BQ remote function request
# return a JSON array called replies with a json string with drivetime in secs for each origin/destination combo
# TODO error handling
# TODO rate limiting API calls or BQ inbound request . . .
def drivetime_from_addresses(request):
# initiate a googlemaps client using
# a runtime environment variable that injects a cloud secret
api_key = os.environ.get('MAPS_GEOCODE_API_KEY', 'No Maps API Key Available')
gmaps = googlemaps.Client(key=api_key)
return_value = []
request_json = request.get_json()
rows = request_json['calls']
for row in rows:
return_value.append(get_drivetime(row[0], row[1], gmaps))
# format the result as BQ expects
replies = [float(x) for x in return_value]
return_json = json.dumps( { "replies" : replies} )
return(return_json)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment