Skip to content

Instantly share code, notes, and snippets.

@alex-bezverkhniy
Last active May 9, 2025 02:45
Show Gist options
  • Save alex-bezverkhniy/a48cbad3b1fd992ca474cefb76883741 to your computer and use it in GitHub Desktop.
Save alex-bezverkhniy/a48cbad3b1fd992ca474cefb76883741 to your computer and use it in GitHub Desktop.
Create ML Models with BigQuery ML

๐Ÿง  BigQuery ML Cheat Sheet

๐Ÿ“Œ Basics

BigQuery ML allows you to create and execute machine learning models directly in BigQuery using SQL syntax.


๐Ÿ“Š Create a Model

CREATE OR REPLACE MODEL `project.dataset.model_name`
OPTIONS(
  model_type = 'linear_reg' | 'logistic_reg' | 'kmeans' | 'xgboost' | 'autoencoder' | ...,
  input_label_cols = ['label_column']
) AS
SELECT * FROM `project.dataset.table`

Common model_type values

Model Type Description
linear_reg Linear Regression
logistic_reg Binary/Multi-class Classification
kmeans Clustering
xgboost Boosted Trees for reg/class
tensorflow Import TF model from Cloud Storage
autoencoder Unsupervised Anomaly Detection
matrix_factorization Recommender Systems
arima_plus Time Series Forecasting
dnn_classifier Deep Neural Network Classifier
dnn_regressor Deep Neural Network Regressor

๐Ÿ” Evaluate Model Performance

SELECT *
FROM ML.EVALUATE(MODEL `project.dataset.model_name`, (
  SELECT * FROM `project.dataset.eval_data`
));

๐Ÿ”ฎ Predict with a Model

SELECT *
FROM ML.PREDICT(MODEL `project.dataset.model_name`, (
  SELECT * FROM `project.dataset.new_data`
));

๐Ÿ“ˆ Explain Predictions (Feature Importance)

SELECT *
FROM ML.EXPLAIN_PREDICT(MODEL `project.dataset.model_name`, (
  SELECT * FROM `project.dataset.new_data`
));

๐Ÿ“… Time Series Forecasting (ARIMA+)

CREATE OR REPLACE MODEL `project.dataset.model_arima`
OPTIONS(
  model_type='ARIMA_PLUS',
  time_series_timestamp_col='date',
  time_series_data_col='sales',
  horizon=12
) AS
SELECT date, sales FROM `project.dataset.sales_data`;
SELECT *
FROM ML.FORECAST(MODEL `project.dataset.model_arima`, STRUCT(12 AS horizon));

๐Ÿ” Hyperparameter Tuning Example

CREATE OR REPLACE MODEL `project.dataset.model_tuned`
OPTIONS(
  model_type='logistic_reg',
  auto_class_weights=TRUE,
  l1_reg=0.01,
  l2_reg=1.0,
  learn_rate=0.1,
  max_iterations=20
) AS
SELECT * FROM `project.dataset.table`;

๐Ÿค Model Info & Metadata

-- List all models in dataset
SELECT * FROM `project.dataset.INFORMATION_SCHEMA.MODELS`;

-- Get detailed model info
SELECT * FROM ML.MODEL_INFO(MODEL `project.dataset.model_name`);

๐Ÿงฎ Inspect Model Weights

SELECT * FROM ML.WEIGHTS(MODEL `project.dataset.model_name`);

๐Ÿ“ฅ Import / Export TensorFlow Models

-- Export TF model to GCS
EXPORT MODEL `project.dataset.tf_model`
OPTIONS(uri='gs://your-bucket/model-path/');

-- Import TF model from GCS
CREATE OR REPLACE MODEL `project.dataset.tf_model`
OPTIONS(
  model_type='tensorflow',
  model_path='gs://your-bucket/model-path/'
);

Getting Started with BigQuery ML

Task 1. Create a dataset

Click on right (three dots) menu and select "Create Dataset"

Task 2. Create a model

Go to query editor and run next query to create new Model:

#standardSQL
CREATE OR REPLACE MODEL `bqml_lab.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170631'
LIMIT 100000;

Here the visitor's device's operating system is used, whether said device is a mobile device, the visitor's country and the number of page views as the criteria for whether a transaction has been made.

In this case, bqml_lab is the name of the dataset and sample_model is the name of the model. The model type specified is binary logistic regression. In this case, label is what you're trying to fit to.

Task 3. Evaluate the model

To evaluate the model run next script:

#standardSQL
SELECT
  *
FROM
  ml.EVALUATE(MODEL `bqml_lab.sample_model`, (
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'));

If used with a linear regression model, the above query returns the following columns:

`mean_absolute_error`, `mean_squared_error`, `mean_squared_log_error`,
`median_absolute_error`, `r2_score`, `explained_variance`.

If used with a logistic regression model, the above query returns the following columns:

`precision`, `recall`
`accuracy`, `f1_score`
`log_loss`, `roc_auc`

Please consult the machine learning glossary or run a Google search to understand how each of these metrics are calculated and what they mean.

You'll realize the SELECT and FROM portions of the query are identical to that used during training. The WHERE portion reflects the change in time frame and the FROM portion shows that you're calling ml.EVALUATE.

Task 4. Use the model

Predict purchases per country

With this query you will try to predict the number of transactions made by visitors of each country, sort the results, and select the top 10 countries by purchases:

#standardSQL
SELECT
  country,
  SUM(predicted_label) as total_predicted_purchases
FROM
  ml.PREDICT(MODEL `bqml_lab.sample_model`, (
SELECT
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(totals.pageviews, 0) AS pageviews,
  IFNULL(geoNetwork.country, "") AS country
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY country
ORDER BY total_predicted_purchases DESC
LIMIT 10;

This query is very similar to the evaluation query demonstrated in the previous section. Instead of ml.EVALUATE, you're using ml.PREDICT and the BigQuery ML portion of the query is wrapped with standard SQL commands. For this lab you're interested in the country and the sum of purchases for each country, so that's why SELECT, GROUP BY and ORDER BY. LIMIT is used to ensure you only get the top 10 results.

Predict purchases per user

Here is another example. This time you will try to predict the number of transactions each visitor makes, sort the results, and select the top 10 visitors by transactions:

#standardSQL
SELECT
  fullVisitorId,
  SUM(predicted_label) as total_predicted_purchases
FROM
  ml.PREDICT(MODEL `bqml_lab.sample_model`, (
SELECT
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(totals.pageviews, 0) AS pageviews,
  IFNULL(geoNetwork.country, "") AS country,
  fullVisitorId
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY fullVisitorId
ORDER BY total_predicted_purchases DESC
LIMIT 10;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment