Anomaly detection in time series data using BigQuery ML

Create an ARIMA model, then detect anomalies

Let’s say we have a time series dataset of the number of bicycles rented in London every day:

SELECT 
EXTRACT(date from start_date) AS start_date,
COUNT(*) AS num_trips
FROM `bigquery-public-data.london_bicycles.cycle_hire`
GROUP BY start_date

How can we find unusual days in terms of the number of bicycle rentals?

In BigQuery, all we need are two SQL statements.

First, create an ML model to predict the number of trips on future days using an ARIMA model:

CREATE OR REPLACE MODEL ch09eu.bicycle_daily_trips
OPTIONS(
model_type='arima_plus',
TIME_SERIES_DATA_COL='num_trips',
TIME_SERIES_TIMESTAMP_COL='start_date',
DECOMPOSE_TIME_SERIES=TRUE
)
AS (
SELECT
EXTRACT(date from start_date) AS start_date,
COUNT(*) AS num_trips
FROM `bigquery-public-data.london_bicycles.cycle_hire`
GROUP BY start_date
);

One key thing is that we ask the model to keep track of forecast errors during the modeling using DECOMPOSE_TIME_SERIES. This is so that we can turn around and ask to see the anomalies.

Next, detect anomalies (unusual days):

SELECT * 
FROM ML.DETECT_ANOMALIES(
MODEL ch09eu.bicycle_daily_trips,
STRUCT (0.95 AS anomaly_prob_threshold))
ORDER BY anomaly_probability DESC
LIMIT 10

The result?

Okay, the second item on the list is Christmas. But why was July 9, 2015 so unusual?

Let’s ask Google.

Ah yes. The subway network was shutdown due to a strike. That’s why there was an unusually high number of bicycle rentals.

Enjoy!

Data Analytics & AI @ Google Cloud