BigQuery ML (BQML) is a powerful feature within Google BigQuery that allows users to create, train, and execute machine learning models directly using SQL queries.
This approach aims to democratize machine learning by enabling SQL practitioners to build models with their existing tools, eliminating the need to move large amounts of data to separate ML frameworks or services.
BigQuery ML supports a variety of machine learning models and algorithms, catering to different analytical needs. Here's a breakdown of the types you can use:
These models can be built directly within BigQuery ML using SQL:
- Linear Regression: Used for predicting continuous numerical values.
- Binary Logistic Regression: For binary classification tasks (e.g., yes/no, 0/1).
- Multiclass Logistic Regression: For classification tasks with multiple categories.
- K-Means Clustering: An unsupervised learning algorithm for dividing data points into clusters.
- Matrix Factorization: Used for recommendation systems, based on explicit or implicit feedback.
- Time Series Forecasting (ARIMA_PLUS): For forecasting future values based on historical time-series data (a short sketch follows this list).
- Boosted Trees (XGBoost): An ensemble learning method that combines multiple decision trees for accurate predictions.
- Deep Neural Networks (DNNs): For complex classification and regression tasks on structured data.
- Principal Component Analysis (PCA): For dimensionality reduction.
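For example, a time-series forecasting model from this list can be created and queried with a couple of statements. The following is a minimal sketch; the `daily_sales` table and its `sale_date`/`total_sales` columns are hypothetical placeholders.

CREATE OR REPLACE MODEL `my_bqml_dataset.sales_forecast_model`
OPTIONS(
MODEL_TYPE='ARIMA_PLUS',
TIME_SERIES_TIMESTAMP_COL='sale_date',  -- column holding the time points
TIME_SERIES_DATA_COL='total_sales'      -- column holding the values to forecast
) AS
SELECT
sale_date,
total_sales
FROM
`my_bqml_dataset.daily_sales`;

-- Forecast the next 30 periods with 90% prediction intervals.
SELECT
*
FROM
ML.FORECAST(MODEL `my_bqml_dataset.sales_forecast_model`,
STRUCT(30 AS horizon, 0.9 AS confidence_level));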
BigQuery ML also integrates with Vertex AI, letting you use more advanced model types that are trained on Vertex AI infrastructure while remaining available for prediction directly in BigQuery. These include:
- DNN (Deep Neural Networks).
- Wide & Deep.
- Autoencoder.
- Random Forest.
- AutoML.
You can also import models trained in other frameworks into BigQuery ML and run predictions on them without moving the data (a short sketch follows this list). Supported formats include:
- TensorFlow models: SavedModels exported to Cloud Storage can be imported and used for prediction.
- ONNX format models: This covers models converted from frameworks such as scikit-learn and PyTorch.
- XGBoost models: Models trained with the XGBoost gradient boosting library.
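As a minimal sketch of the import path, a TensorFlow SavedModel exported to Cloud Storage (the bucket path below is a placeholder) can be registered and then queried like any other BigQuery ML model:

CREATE OR REPLACE MODEL `my_bqml_dataset.imported_tf_model`
OPTIONS(
MODEL_TYPE='TENSORFLOW',
MODEL_PATH='gs://my_bucket/exported_tf_model/*'  -- SavedModel files in Cloud Storage
);

SELECT
*
FROM
ML.PREDICT(MODEL `my_bqml_dataset.imported_tf_model`,
(SELECT * FROM `my_bqml_dataset.my_new_data_for_prediction`));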
BigQuery ML also offers features like hyperparameter tuning, model explainability functions, and feature preprocessing to enhance model performance.
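As a hedged sketch of hyperparameter tuning (the dataset and column names are the same placeholders used in the examples below), NUM_TRIALS together with HPARAM_RANGE lets BigQuery ML search the hyperparameter space for you:

CREATE OR REPLACE MODEL `my_bqml_dataset.my_tuned_classifier`
OPTIONS(
MODEL_TYPE='BOOSTED_TREE_CLASSIFIER',
INPUT_LABEL_COLS=['label'],
NUM_TRIALS=10,                        -- run 10 tuning trials
MAX_TREE_DEPTH=HPARAM_RANGE(3, 10),   -- search tree depth between 3 and 10
LEARN_RATE=HPARAM_RANGE(0.01, 0.5)    -- search learning rate between 0.01 and 0.5
) AS
SELECT
feature1,
feature2,
label
FROM
`my_bqml_dataset.my_training_data`;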
BigQuery ML models are stored within BigQuery datasets, similar to how tables and views are stored, making them easily accessible for prediction queries. This approach allows you to keep your data and ML models in the same environment, simplifying workflows and speeding up deployment.
To train a model, you use the CREATE MODEL statement. You specify the model type, input options, and the data to train on.
CREATE OR REPLACE MODEL `my_bqml_dataset.my_logistic_reg_model`
OPTIONS(
MODEL_TYPE='LOGISTIC_REG',
INPUT_LABEL_COLS=['label']
) AS
SELECT
feature1,
feature2,
label
FROM
`my_bqml_dataset.my_training_data`
WHERE
-- Optionally, filter your training data if needed
DATE(timestamp_column) BETWEEN '2024-01-01' AND '2024-09-30';
After training, it's crucial to evaluate your model's performance. You use the ML.EVALUATE function for this.
SELECT
*
FROM
ML.EVALUATE(MODEL `my_bqml_dataset.my_logistic_reg_model`,
(
SELECT
feature1,
feature2,
label
FROM
`my_bqml_dataset.my_evaluation_data`
WHERE
-- Use a separate evaluation dataset or a different time period
DATE(timestamp_column) BETWEEN '2024-10-01' AND '2024-10-31'
)
);
Once you're satisfied with your model's performance, you can use it to make predictions on new data using the ML.PREDICT function.
SELECT
*
FROM
ML.PREDICT(MODEL `my_bqml_dataset.my_logistic_reg_model`,
(
SELECT
feature1,
feature2
FROM
`my_bqml_dataset.my_new_data_for_prediction`
WHERE
-- Select the data you want to make predictions on
DATE(timestamp_column) = '2024-11-01'
)
);
These three steps cover the core workflow of training, evaluating, and consuming a machine learning model using BigQuery ML. Remember to replace the dataset, table, and column names with your actual values.
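Beyond plain predictions, the explainability functions mentioned earlier can attribute each prediction to its input features. A minimal sketch using ML.EXPLAIN_PREDICT on the same hypothetical model and table:

SELECT
*
FROM
ML.EXPLAIN_PREDICT(MODEL `my_bqml_dataset.my_logistic_reg_model`,
(
SELECT
feature1,
feature2
FROM
`my_bqml_dataset.my_new_data_for_prediction`
),
STRUCT(3 AS top_k_features));  -- return the top 3 feature attributions per row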
BigQuery ML, and BigQuery in general, offers a robust REST API that allows for seamless integration with your applications. This means you can programmatically interact with your BigQuery ML models, run queries, load data, and manage resources directly from your application using standard HTTP requests and JSON.
The REST API is the foundational interface, built on HTTP and JSON. It offers direct access to BigQuery resources and operations through standard HTTP methods (GET, POST, PUT, DELETE) against specific endpoints. While powerful, using the REST API directly requires manual handling of authentication, request formatting, response parsing, and error handling.
Google provides high-level client libraries for popular programming languages (Python, Java, Go, Node.js, C#, PHP, Ruby). These libraries wrap the underlying REST API and offer a more developer-friendly experience: they simplify common tasks, handle authentication (often automatically via Application Default Credentials), manage retries, and reduce boilerplate code. This is the recommended approach for most application development.
A machine learning pipeline automates the steps involved in going from raw data → trained model → deployed inference. It ensures repeatability, scalability, and easier management of data science workflows.
Typical stages:
- Data ingestion & preparation
- Feature engineering
- Model training
- Model evaluation
- Model deployment
- Prediction & monitoring
1️⃣ Data Ingestion
Collect raw text data (e.g., customer reviews, support tickets) and store it in BigQuery or import from Cloud Storage (CSV/Parquet).
CREATE OR REPLACE TABLE `my_project.my_bqml_dataset.reviews_raw` AS
SELECT * FROM EXTERNAL_QUERY("my_project.us.external_connection_id",
"SELECT * FROM my_dataset.reviews_source");
2️⃣ Data Cleaning & Feature Preparation
Normalize case, remove stopwords or special characters, etc. You can use BigQuery SQL functions for text preprocessing or remote Python functions.
CREATE OR REPLACE TABLE `my_project.my_bqml_dataset.reviews_cleaned` AS
SELECT
review_id,
LOWER(REGEXP_REPLACE(review_text, r'[^a-zA-Z0-9\s]', '')) AS clean_text,
rating >= 4 AS is_positive
FROM
`my_project.my_bqml_dataset.reviews_raw`;
3️⃣ Model Training
Train a logistic regression model using BigQuery ML.
CREATE OR REPLACE MODEL `my_project.my_bqml_dataset.logistic_reg_model`
OPTIONS(
MODEL_TYPE='LOGISTIC_REG',
INPUT_LABEL_COLS=['is_positive']
) AS
SELECT
clean_text,
is_positive
FROM
`my_project.my_bqml_dataset.reviews_cleaned`
WHERE
-- Hold out ~20% of rows for evaluation in the next step
MOD(ABS(FARM_FINGERPRINT(CAST(review_id AS STRING))), 10) < 8;
4️⃣ Model Evaluation
Evaluate the model's performance using BigQuery ML.
SELECT
*
FROM
ML.EVALUATE(MODEL `my_project.my_bqml_dataset.logistic_reg_model`,
(
SELECT
clean_text,
is_positive
FROM
`my_project.my_bqml_dataset.reviews_cleaned`
WHERE
-- Evaluate on the ~20% of rows held out from training
MOD(ABS(FARM_FINGERPRINT(CAST(review_id AS STRING))), 10) >= 8
)
);
5️⃣ Model Deployment
You can export the model and deploy it via Vertex AI for online predictions (a sketch follows the query below), or keep it in BigQuery ML and use ML.PREDICT() in batch.
SELECT
review_id,
predicted_is_positive,
predicted_is_positive_probs
FROM
ML.PREDICT(MODEL `my_project.my_bqml_dataset.logistic_reg_model`,
(SELECT review_id, clean_text
FROM `my_project.my_bqml_dataset.new_reviews`));
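For the Vertex AI route mentioned above, the trained model can first be exported to Cloud Storage and then registered and deployed as an online endpoint in Vertex AI. A minimal export sketch (the bucket path is a placeholder):

EXPORT MODEL `my_project.my_bqml_dataset.logistic_reg_model`
OPTIONS(URI = 'gs://my_bucket/exports/logistic_reg_model/');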
6️⃣ Monitoring & Automation
Automate everything with Cloud Composer (Airflow) or Vertex AI Pipelines. Track model performance over time (drift, accuracy drop, etc.) and retrain periodically.
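One lightweight monitoring pattern is a scheduled query that re-evaluates the model and appends the metrics to a history table; the sketch below assumes a hypothetical `model_metrics_history` table that was created once with this same schema:

INSERT INTO `my_project.my_bqml_dataset.model_metrics_history`
SELECT
CURRENT_TIMESTAMP() AS evaluated_at,
precision,
recall,
accuracy,
f1_score,
log_loss,
roc_auc
FROM
ML.EVALUATE(MODEL `my_project.my_bqml_dataset.logistic_reg_model`,
(SELECT clean_text, is_positive FROM `my_project.my_bqml_dataset.reviews_cleaned`));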
| Stage | Recommended Tool | Description |
|---|---|---|
| Data ingestion | BigQuery, Cloud Storage, Dataflow | Batch or streaming import |
| Feature engineering | BigQuery SQL, DataPrep, or Python in Vertex Workbench | Data cleaning & transformation |
| Model training | BigQuery ML, Vertex AI Training, or AutoML | Train ML models |
| Deployment | BigQuery ML (batch) or Vertex AI Endpoint | Serve models |
| Automation | Vertex AI Pipelines or Cloud Composer (Airflow) | Orchestrate and schedule |
| Monitoring | Vertex AI Model Monitoring, Cloud Logging, BigQuery dashboards | Track model performance |
While BigQuery ML itself doesn't offer a built-in MODEL_TYPE for Named Entity Recognition (NER), you can still perform NER by leveraging its integration with other Google Cloud AI services, particularly Vertex AI and the Cloud AI APIs. For example, the ML.GENERATE_TEXT function, which calls remote large language models (LLMs), can extract entities when given an appropriate prompt, even though it is not a dedicated NER model.
SELECT
*
FROM
ML.GENERATE_TEXT(
MODEL `your_project.your_dataset.your_llm_model`, -- This would be a remote model connected to an LLM
(
SELECT
CONCAT(
'Extract all named entities (persons, organizations, locations, dates) ',
'from the following text and return them as a JSON array of objects ',
'with "entity_type" and "entity_name" fields.\n\n',
text_column
) AS prompt  -- ML.GENERATE_TEXT expects the input column to be named "prompt"
FROM
`your_project.your_dataset.your_text_data`
),
STRUCT(
0.2 AS temperature,
256 AS max_output_tokens  -- leave room for the JSON output
)
);
Text classification is one of the common use cases for BigQuery ML, especially for tasks like sentiment analysis, spam detection, or categorizing customer feedback.
Note that BigQuery ML treats plain STRING columns as categorical features rather than tokenized text; for stronger results you can engineer text features in SQL (for example, with ML.NGRAMS inside a TRANSFORM clause) or use a remote LLM-based model as shown above.
CREATE OR REPLACE MODEL `my_bqml_dataset.my_text_classifier_model`
OPTIONS(
MODEL_TYPE='BOOSTED_TREE_CLASSIFIER',
INPUT_LABEL_COLS=['sentiment_label']
-- Optional: specify hyperparameters for better performance, e.g.
-- , NUM_PARALLEL_TREE=10
-- , L1_REG=0.1
-- , L2_REG=0.1
) AS
SELECT
text_content,
sentiment_label
FROM
`my_bqml_dataset.my_training_text_data`
WHERE
-- Use a portion of your data for training
MOD(ABS(FARM_FINGERPRINT(text_content)), 10) < 8; -- 80% for training
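The remaining ~20% of rows can then be used to check the classifier; a minimal sketch reusing the same fingerprint split:

SELECT
*
FROM
ML.EVALUATE(MODEL `my_bqml_dataset.my_text_classifier_model`,
(
SELECT
text_content,
sentiment_label
FROM
`my_bqml_dataset.my_training_text_data`
WHERE
MOD(ABS(FARM_FINGERPRINT(text_content)), 10) >= 8 -- held-out 20% for evaluation
)
);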