BigQuery ML (BQML) is a powerful feature within Google BigQuery that allows users to create, train, and execute machine learning models directly using SQL queries.
This approach aims to democratize machine learning by enabling SQL practitioners to build models with their existing tools, eliminating the need to move large amounts of data to separate ML frameworks or services.
BigQuery ML supports a variety of machine learning models and algorithms, catering to different analytical needs. Here's a breakdown of the types you can use:
These models can be built directly within BigQuery ML using SQL:
- Linear Regression: Used for predicting continuous numerical values.
- Binary Logistic Regression: For binary classification tasks (e.g., yes/no, 0/1).
- Multiclass Logistic Regression: For classification tasks with multiple categories.
- K-Means Clustering: An unsupervised learning algorithm for dividing data points into clusters.
- Matrix Factorization: Used for recommendation systems, based on explicit or implicit feedback.
- Time Series Forecasting (ARIMA_PLUS): For forecasting future values based on historical time-series data (a short sketch follows this list).
- Boosted Trees (XGBoost): An ensemble learning method that combines multiple decision trees for accurate predictions.
- Deep Neural Networks (DNNs): For complex classification and regression tasks on structured data.
- Principal Component Analysis (PCA): For dimensionality reduction.
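For example, a time-series forecasting model from this list can be created and queried with a couple of statements. The following is a minimal sketch; the `daily_sales` table and its `sale_date`/`total_sales` columns are hypothetical placeholders.

CREATE OR REPLACE MODEL `my_bqml_dataset.sales_forecast_model`
OPTIONS(
MODEL_TYPE='ARIMA_PLUS',
TIME_SERIES_TIMESTAMP_COL='sale_date',  -- column holding the time points
TIME_SERIES_DATA_COL='total_sales'      -- column holding the values to forecast
) AS
SELECT
sale_date,
total_sales
FROM
`my_bqml_dataset.daily_sales`;

-- Forecast the next 30 periods with 90% prediction intervals.
SELECT
*
FROM
ML.FORECAST(MODEL `my_bqml_dataset.sales_forecast_model`,
STRUCT(30 AS horizon, 0.9 AS confidence_level));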
BigQuery ML also integrates with Vertex AI, letting you use more advanced model types that are trained on Vertex AI infrastructure while remaining available for prediction directly in BigQuery. These include:
- DNN (Deep Neural Networks).
- Wide & Deep.
- Autoencoder.
- Random Forest.
- AutoML.
You can also import models trained in other frameworks into BigQuery ML and run predictions on them without moving the data (a short sketch follows this list). Supported formats include:
- TensorFlow models: SavedModels exported to Cloud Storage can be imported and used for prediction.
- ONNX format models: This covers models converted from frameworks such as scikit-learn and PyTorch.
- XGBoost models: Models trained with the XGBoost gradient boosting library.
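As a minimal sketch of the import path, a TensorFlow SavedModel exported to Cloud Storage (the bucket path below is a placeholder) can be registered and then queried like any other BigQuery ML model:

CREATE OR REPLACE MODEL `my_bqml_dataset.imported_tf_model`
OPTIONS(
MODEL_TYPE='TENSORFLOW',
MODEL_PATH='gs://my_bucket/exported_tf_model/*'  -- SavedModel files in Cloud Storage
);

SELECT
*
FROM
ML.PREDICT(MODEL `my_bqml_dataset.imported_tf_model`,
(SELECT * FROM `my_bqml_dataset.my_new_data_for_prediction`));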
BigQuery ML also offers features like hyperparameter tuning, model explainability functions, and feature preprocessing to enhance model performance.
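As a hedged sketch of hyperparameter tuning (the dataset and column names are the same placeholders used in the examples below), NUM_TRIALS together with HPARAM_RANGE lets BigQuery ML search the hyperparameter space for you:

CREATE OR REPLACE MODEL `my_bqml_dataset.my_tuned_classifier`
OPTIONS(
MODEL_TYPE='BOOSTED_TREE_CLASSIFIER',
INPUT_LABEL_COLS=['label'],
NUM_TRIALS=10,                        -- run 10 tuning trials
MAX_TREE_DEPTH=HPARAM_RANGE(3, 10),   -- search tree depth between 3 and 10
LEARN_RATE=HPARAM_RANGE(0.01, 0.5)    -- search learning rate between 0.01 and 0.5
) AS
SELECT
feature1,
feature2,
label
FROM
`my_bqml_dataset.my_training_data`;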
BigQuery ML models are stored within BigQuery datasets, similar to how tables and views are stored, making them easily accessible for prediction queries. This approach allows you to keep your data and ML models in the same environment, simplifying workflows and speeding up deployment.
To train a model, you use the CREATE MODEL statement. You specify the model type, input options, and the data to train on.
CREATE OR REPLACE MODEL `my_bqml_dataset.my_logistic_reg_model`
OPTIONS(
MODEL_TYPE='LOGISTIC_REG',
INPUT_LABEL_COLS=['label']
) AS
SELECT
feature1,
feature2,
label
FROM
`my_bqml_dataset.my_training_data`
WHERE
-- Optionally, filter your training data if needed
DATE(timestamp_column) BETWEEN '2024-01-01' AND '2024-09-30';
After training, it's crucial to evaluate your model's performance. You use the ML.EVALUATE function for this.
SELECT
*
FROM
ML.EVALUATE(MODEL `my_bqml_dataset.my_logistic_reg_model`,
(
SELECT
feature1,
feature2,
label
FROM
`my_bqml_dataset.my_evaluation_data`
WHERE
-- Use a separate evaluation dataset or a different time period
DATE(timestamp_column) BETWEEN '2024-10-01' AND '2024-10-31'
)
);
Once you're satisfied with your model's performance, you can use it to make predictions on new data using the ML.PREDICT function.
SELECT
*
FROM
ML.PREDICT(MODEL `my_bqml_dataset.my_logistic_reg_model`,
(
SELECT
feature1,
feature2
FROM
`my_bqml_dataset.my_new_data_for_prediction`
WHERE
-- Select the data you want to make predictions on
DATE(timestamp_column) = '2024-11-01'
)
);
These three steps cover the core workflow of training, evaluating, and consuming a machine learning model using BigQuery ML. Remember to replace the dataset, table, and column names with your actual values.
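Beyond plain predictions, the explainability functions mentioned earlier can attribute each prediction to its input features. A minimal sketch using ML.EXPLAIN_PREDICT on the same hypothetical model and table:

SELECT
*
FROM
ML.EXPLAIN_PREDICT(MODEL `my_bqml_dataset.my_logistic_reg_model`,
(
SELECT
feature1,
feature2
FROM
`my_bqml_dataset.my_new_data_for_prediction`
),
STRUCT(3 AS top_k_features));  -- return the top 3 feature attributions per row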
BigQuery ML, and BigQuery in general, offers a robust REST API that allows for seamless integration with your applications. This means you can programmatically interact with your BigQuery ML models, run queries, load data, and manage resources directly from your application using standard HTTP requests and JSON.
The REST API is the foundational interface, built on HTTP and JSON. It offers direct access to BigQuery resources and operations through standard HTTP methods (GET, POST, PUT, DELETE) against specific endpoints. While powerful, using the REST API directly requires manual handling of authentication, request formatting, response parsing, and error handling.
Google provides high-level client libraries for popular programming languages (Python, Java, Go, Node.js, C#, PHP, Ruby). These libraries wrap the underlying REST API and offer a more developer-friendly experience: they simplify common tasks, handle authentication (often automatically via Application Default Credentials), manage retries, and reduce boilerplate code. This is the recommended approach for most application development.
A machine learning pipeline automates the steps involved in going from raw data → trained model → deployed inference. It ensures repeatability, scalability, and easier management of data science workflows.
Typical stages:
- Data ingestion & preparation
- Feature engineering
- Model training
- Model evaluation
- Model deployment
- Prediction & monitoring
1️⃣ Data Ingestion
Collect raw text data (e.g., customer reviews, support tickets) and store it in BigQuery or import from Cloud Storage (CSV/Parquet).
CREATE OR REPLACE TABLE `my_project.my_bqml_dataset.reviews_raw` AS
SELECT * FROM EXTERNAL_QUERY("my_project.us.external_connection_id",
"SELECT * FROM my_dataset.reviews_source");
2️⃣ Data Cleaning & Feature Preparation
Normalize case, remove stopwords or special characters, etc. You can use BigQuery SQL functions for text preprocessing or remote Python functions.
CREATE OR REPLACE TABLE `my_project.my_bqml_dataset.reviews_cleaned` AS
SELECT
review_id,
LOWER(REGEXP_REPLACE(review_text, r'[^a-zA-Z0-9\s]', '')) AS clean_text,
rating >= 4 AS is_positive
FROM
`my_project.my_bqml_dataset.reviews_raw`;
3️⃣ Model Training
Train a logistic regression model using BigQuery ML.
CREATE OR REPLACE MODEL `my_project.my_bqml_dataset.logistic_reg_model`
OPTIONS(
MODEL_TYPE='LOGISTIC_REG',
INPUT_LABEL_COLS=['is_positive']
) AS
SELECT
clean_text,
is_positive
FROM
`my_project.my_bqml_dataset.reviews_cleaned`
WHERE
-- Hold out ~20% of rows for evaluation in the next step
MOD(ABS(FARM_FINGERPRINT(CAST(review_id AS STRING))), 10) < 8;
4️⃣ Model Evaluation
Evaluate the model's performance using BigQuery ML.
SELECT
*
FROM
ML.EVALUATE(MODEL `my_project.my_bqml_dataset.logistic_reg_model`,
(
SELECT
clean_text,
is_positive
FROM
`my_project.my_bqml_dataset.reviews_cleaned`
WHERE
-- Evaluate on the ~20% of rows held out from training
MOD(ABS(FARM_FINGERPRINT(CAST(review_id AS STRING))), 10) >= 8
)
);
5️⃣ Model Deployment
You can export the model and deploy it via Vertex AI for online predictions (a sketch follows the query below), or keep it in BigQuery ML and use ML.PREDICT() in batch.
SELECT
review_id,
predicted_is_positive,
predicted_is_positive_probs
FROM
ML.PREDICT(MODEL `my_project.my_bqml_dataset.logistic_reg_model`,
(SELECT review_id, clean_text
FROM `my_project.my_bqml_dataset.new_reviews`));
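For the Vertex AI route mentioned above, the trained model can first be exported to Cloud Storage and then registered and deployed as an online endpoint in Vertex AI. A minimal export sketch (the bucket path is a placeholder):

EXPORT MODEL `my_project.my_bqml_dataset.logistic_reg_model`
OPTIONS(URI = 'gs://my_bucket/exports/logistic_reg_model/');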
6️⃣ Monitoring & Automation
Automate everything with Cloud Composer (Airflow) or Vertex AI Pipelines. Track model performance over time (drift, accuracy drop, etc.) and retrain periodically.
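One lightweight monitoring pattern is a scheduled query that re-evaluates the model and appends the metrics to a history table; the sketch below assumes a hypothetical `model_metrics_history` table that was created once with this same schema:

INSERT INTO `my_project.my_bqml_dataset.model_metrics_history`
SELECT
CURRENT_TIMESTAMP() AS evaluated_at,
precision,
recall,
accuracy,
f1_score,
log_loss,
roc_auc
FROM
ML.EVALUATE(MODEL `my_project.my_bqml_dataset.logistic_reg_model`,
(SELECT clean_text, is_positive FROM `my_project.my_bqml_dataset.reviews_cleaned`));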
| Stage | Recommended Tool | Description |
|---|---|---|
| Data ingestion | BigQuery, Cloud Storage, Dataflow | Batch or streaming import |
| Feature engineering | BigQuery SQL, DataPrep, or Python in Vertex Workbench | Data cleaning & transformation |
| Model training | BigQuery ML, Vertex AI Training, or AutoML | Train ML models |
| Deployment | BigQuery ML (batch) or Vertex AI Endpoint | Serve models |
| Automation | Vertex AI Pipelines or Cloud Composer (Airflow) | Orchestrate and schedule |
| Monitoring | Vertex AI Model Monitoring, Cloud Logging, BigQuery dashboards | Track model performance |
While BigQuery ML itself doesn't offer a built-in MODEL_TYPE for Named Entity Recognition (NER), you can still perform NER by leveraging its integration with other Google Cloud AI services, particularly Vertex AI and the Cloud AI APIs. For example, the ML.GENERATE_TEXT function, which calls remote large language models (LLMs), can extract entities when given an appropriate prompt, even though it is not a dedicated NER model.
SELECT
*
FROM
ML.GENERATE_TEXT(
MODEL `your_project.your_dataset.your_llm_model`, -- This would be a remote model connected to an LLM
(
SELECT
CONCAT(
'Extract all named entities (persons, organizations, locations, dates) ',
'from the following text and return them as a JSON array of objects ',
'with "entity_type" and "entity_name" fields.\n\n',
text_column
) AS prompt  -- ML.GENERATE_TEXT expects the input column to be named "prompt"
FROM
`your_project.your_dataset.your_text_data`
),
STRUCT(
0.2 AS temperature,
256 AS max_output_tokens  -- leave room for the JSON output
)
);
Text classification is one of the common use cases for BigQuery ML, especially for tasks like sentiment analysis, spam detection, or categorizing customer feedback.
Note that BigQuery ML treats plain STRING columns as categorical features rather than tokenized text; for stronger results you can engineer text features in SQL (for example, with ML.NGRAMS inside a TRANSFORM clause) or use a remote LLM-based model as shown above.
CREATE OR REPLACE MODEL `my_bqml_dataset.my_text_classifier_model`
OPTIONS(
MODEL_TYPE='BOOSTED_TREE_CLASSIFIER',
INPUT_LABEL_COLS=['sentiment_label']
-- Optional: specify hyperparameters for better performance, e.g.
-- , NUM_PARALLEL_TREE=10
-- , L1_REG=0.1
-- , L2_REG=0.1
) AS
SELECT
text_content,
sentiment_label
FROM
`my_bqml_dataset.my_training_text_data`
WHERE
-- Use a portion of your data for training
MOD(ABS(FARM_FINGERPRINT(text_content)), 10) < 8; -- 80% for training
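The remaining ~20% of rows can then be used to check the classifier; a minimal sketch reusing the same fingerprint split:

SELECT
*
FROM
ML.EVALUATE(MODEL `my_bqml_dataset.my_text_classifier_model`,
(
SELECT
text_content,
sentiment_label
FROM
`my_bqml_dataset.my_training_text_data`
WHERE
MOD(ABS(FARM_FINGERPRINT(text_content)), 10) >= 8 -- held-out 20% for evaluation
)
);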