How to obtain and interpret explanations of predictions

BigQuery ML is an easy-to-use way to invoke machine learning models on structured data using just SQL. Although it started with only linear regression, more sophisticated models such as deep neural networks and AutoML Tables have since been added by connecting BigQuery ML to TensorFlow and Vertex AI as its backend. In other words, although we write SQL, what gets executed is TensorFlow.


Now, BigQuery ML is picking up more features from the Vertex AI backend. In an earlier post, I showed you hyperparameter tuning. In this post, I’ll show you explainability.

Explainability is a way to understand what a machine learning…
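As a taste of what the post covers, per-prediction feature attributions can be retrieved with ML.EXPLAIN_PREDICT. A minimal sketch — the model and table names below are illustrative, not from the article:

```sql
-- Top 3 feature attributions for each prediction.
-- ch09eu.bicycle_model_linear and ch09eu.bicycle_features are assumed names.
SELECT *
FROM ML.EXPLAIN_PREDICT(
  MODEL ch09eu.bicycle_model_linear,
  (SELECT start_station_name, dayofweek, hourofday
   FROM ch09eu.bicycle_features),
  STRUCT(3 AS top_k_features)
)
```

Each output row carries the prediction along with the features that contributed most to it.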


Create an ARIMA model, then detect anomalies

Let’s say we have a time series dataset of the number of bicycles rented in London every day:

SELECT
  EXTRACT(date FROM start_date) AS start_date,
  COUNT(*) AS num_trips
FROM `bigquery-public-data.london_bicycles.cycle_hire`
GROUP BY start_date

How can we find unusual days in terms of the number of bicycle rentals?

In BigQuery, all we need are two SQL statements.

First, create an ML model to predict the number of trips on future days using an ARIMA model:

CREATE OR REPLACE MODEL ch09eu.bicycle_daily_trips
OPTIONS(
  model_type='arima_plus',
  time_series_data_col='num_trips',
  time_series_timestamp_col='start_date',
  decompose_time_series=TRUE
)
AS (
  SELECT
    EXTRACT(date FROM start_date) AS start_date,
    COUNT(*) AS num_trips
  FROM `bigquery-public-data.london_bicycles.cycle_hire` …
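The second statement runs anomaly detection against the trained model. A sketch — the probability threshold here is an assumed value, not necessarily the one used in the full article:

```sql
-- Flag days whose actual num_trips falls outside the range
-- the ARIMA_PLUS model expects at the chosen probability.
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL ch09eu.bicycle_daily_trips,
  STRUCT(0.95 AS anomaly_prob_threshold)
)
WHERE is_anomaly
```

Because the model was created with DECOMPOSE_TIME_SERIES=TRUE, the output also includes the expected value and bounds for each timestamp.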


BigQuery ML can use Vertex AI to tune common model parameters

BigQuery ML allows you to quickly train ML models on data in BigQuery. For example, if you want to train a linear model to predict the duration of a bicycle rental, you can do that with:

CREATE OR REPLACE MODEL ch09eu.bicycle_model_linear
OPTIONS(
  model_type='linear_reg', input_label_cols=['duration']
)
AS
SELECT
  start_station_name,
  CAST(EXTRACT(DAYOFWEEK FROM start_date) AS STRING) AS dayofweek,
  CAST(EXTRACT(HOUR FROM start_date) AS STRING) AS hourofday,
  duration
FROM `bigquery-public-data.london_bicycles.cycle_hire`
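Once trained, the model can be invoked with ML.PREDICT. A minimal sketch — the input values below (station name, day, hour) are made up for illustration:

```sql
-- The predicted label appears as predicted_duration in the output.
SELECT predicted_duration
FROM ML.PREDICT(
  MODEL ch09eu.bicycle_model_linear,
  (SELECT
     'Hyde Park Corner, Hyde Park' AS start_station_name,  -- assumed station
     '2' AS dayofweek,
     '17' AS hourofday)
)
```

Note that the input columns must match the names and types of the features the model was trained on.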

BigQuery ML also supports feature engineering that will be automatically repeated during inference. A slightly more sophisticated deep neural network model might be:

CREATE OR REPLACE MODEL ch09eu.bicycle_model_dnn
TRANSFORM(
start_station_name…
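The TRANSFORM clause lets you declare the feature engineering once, so that it is automatically re-applied at prediction time. The excerpt above is truncated; a hypothetical completion of such a model (a sketch, not the original code) might look like:

```sql
CREATE OR REPLACE MODEL ch09eu.bicycle_model_dnn
TRANSFORM(
  start_station_name,
  CAST(EXTRACT(DAYOFWEEK FROM start_date) AS STRING) AS dayofweek,
  CAST(EXTRACT(HOUR FROM start_date) AS STRING) AS hourofday,
  duration
)
OPTIONS(
  model_type='dnn_regressor', input_label_cols=['duration']
)
AS
SELECT start_station_name, start_date, duration
FROM `bigquery-public-data.london_bicycles.cycle_hire`
```

At prediction time, callers then pass raw columns (start_station_name, start_date) and the extraction and casting happen automatically.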


This tip is short enough to be a Twitter thread. I’m preserving it here:


A neat trick that uses Stored Procedure with a BigQuery script

My colleague Polong Lin did a brilliant thing recently. He wanted to do a demo, and rather than ask us to follow a checklist of instructions, he simply had us make a single BigQuery call:

CALL 
`ai-analytics-solutions`.publicdata.setup_flights_demo(
'MY-PROJECT-NAME', 'demods')

This call provisioned everything in our BigQuery project that we needed for the demo.

Try it!

You will see that a dataset named demods gets created in your project, containing the tables and other objects needed for the demo.

How does this magic work?

The idea is to create a Stored Procedure that contains a script. The script…
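The shape of such a procedure can be sketched like this — the DDL and table names below are illustrative, not Polong's actual procedure:

```sql
CREATE OR REPLACE PROCEDURE publicdata.setup_flights_demo(
  project_name STRING, dataset_name STRING)
BEGIN
  -- Create the demo dataset in the caller's project...
  EXECUTE IMMEDIATE FORMAT(
    "CREATE SCHEMA IF NOT EXISTS `%s.%s`",
    project_name, dataset_name);
  -- ...then populate it, e.g. by copying a table
  -- (the source table here is a hypothetical name).
  EXECUTE IMMEDIATE FORMAT("""
    CREATE OR REPLACE TABLE `%s.%s.flights` AS
    SELECT * FROM `ai-analytics-solutions.publicdata.flights`
  """, project_name, dataset_name);
END;
```

Because the project and dataset names arrive as parameters, EXECUTE IMMEDIATE with FORMAT is what lets the script build the fully qualified names at run time.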


Why do we need it, how good is the code-free ML training, really, and what does all this mean for data science jobs?

The Google Cloud AI Platform team has been heads-down the past few months building a unified view of the machine learning landscape. This was launched today at Google I/O as Vertex AI. What is it? How good is it? What does it mean for data science jobs?

The idea is that there are a few key constructs in machine learning:

  • We create datasets by ingesting data, analyzing the data, and cleaning it up (ETL or ELT).
  • We then train a model (Model Training) — this includes experimentation, hypothesis testing, and hyperparameter tuning.
  • This model is versioned and rebuilt when…


Converting rows to columns

Sometimes, you might want to reformat a table result so you have a separate column for each unique value. This is called a pivot table — normally, pivoting is a display function supported by BI tools. However, it can occasionally be helpful to create the pivot table in SQL. Here’s how to use the PIVOT operator in Google BigQuery to create a pivot table.


Pivot changes rows to columns. For example, suppose we have a table of flights and flight delays like this:
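The PIVOT operator turns each distinct value of a column into its own output column. A sketch on hypothetical flight columns — the table and column names below are made up for illustration:

```sql
-- Average departure delay per airline, one column per year.
SELECT *
FROM (
  SELECT airline,
         EXTRACT(YEAR FROM departure_date) AS year,
         departure_delay
  FROM ch09eu.flights  -- hypothetical table
)
PIVOT(
  AVG(departure_delay) AS delay
  FOR year IN (2018, 2019, 2020)
)
```

The result has one row per airline and columns such as delay_2018, delay_2019, and delay_2020.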


Using the new geospatial capabilities in Data Studio

Let’s take the new geospatial capabilities in Data Studio for a spin by exploring the public dataset of NOAA Storm Prediction Center storm reports of hail. Please follow along with me — there’s a free BigQuery sandbox that you can use. Data Studio is a free product.

Go to the BigQuery web console at https://console.cloud.google.com/bigquery and type in the following query:

SELECT
  FORMAT_DATE('%B', event_begin_time) AS month,
  magnitude AS hail_inches,
  event_point
FROM `bigquery-public-data.noaa_historic_severe_storms.storms_2020`
WHERE event_type = 'hail'

The important thing to notice in the query above is that event_point is a GEOGRAPHY type. …
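Because event_point is a GEOGRAPHY, you can also filter spatially in SQL before handing the result to a visualization tool. For example, hail reports within 100 km of a point (the coordinates below are an assumed location, not from the article):

```sql
SELECT event_begin_time, magnitude, event_point
FROM `bigquery-public-data.noaa_historic_severe_storms.storms_2020`
WHERE event_type = 'hail'
  AND ST_DWITHIN(event_point,
                 ST_GEOGPOINT(-97.5, 35.5),  -- lon, lat: assumed point
                 100000)                     -- distance in meters
```

ST_GEOGPOINT takes longitude first, then latitude, and ST_DWITHIN measures distance in meters on the sphere.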


Just turn it on; no code changes needed

Copy of an article published at https://cloud.google.com/blog/topics/developers-practitioners/speeding-small-queries-bigquery-bi-engine on 2021-04-01.

A quick and easy way to speed up small queries in BigQuery (such as to populate interactive applications or dashboards) is to use BI Engine. The New York Times, for example, uses the SQL interface to BI Engine to speed up their Data Reporting Engine.

I’ll use three representative queries on tables between 100MB and 3GB — tables that are typically considered smallish by BigQuery standards. Because BigQuery is a columnar database, I’m reporting only the size of the columns included in the query.

The first involves processing about 400 MB data…


Using Cloud Dataflow and Google Cloud Public Datasets

Originally posted on Google Cloud Blog at https://cloud.google.com/blog/products/gcp/how-to-do-distributed-processing-of-landsat-data-in-python

One common data analysis task across the agricultural industry, as well as in academia and government (for drought studies, climate modeling, and so on), is to create a monthly vegetation index from Landsat images. Landsat imagery is now available as a public dataset on Google Cloud Platform (source of Landsat images: U.S. Geological Survey). One approach to creating such a monthly vegetation index is to write a data processing script that does the following:

  1. Find all the Landsat images that cover the location in question.
  2. Find the least-cloudy image for each month, making sure…

Lak Lakshmanan

Data Analytics & AI @ Google Cloud
