Anomaly detection in time series data using BigQuery ML
Create an ARIMA model, then detect anomalies
Let’s say we have a time series dataset of the number of bicycles rented in London every day:
SELECT
EXTRACT(date from start_date) AS start_date,
COUNT(*) AS num_trips
FROM `bigquery-public-data.london_bicycles.cycle_hire`
GROUP BY start_date
How can we find unusual days in terms of the number of bicycle rentals?
In BigQuery, all we need are two SQL statements.
1. ARIMA+ Model
First, create an ML model to predict the number of trips on future days using an ARIMA model:
CREATE OR REPLACE MODEL ch09eu.bicycle_daily_trips
OPTIONS(
model_type='arima_plus',
TIME_SERIES_DATA_COL='num_trips',
TIME_SERIES_TIMESTAMP_COL='start_date',
DECOMPOSE_TIME_SERIES=TRUE
)
AS (
SELECT
EXTRACT(date from start_date) AS start_date,
COUNT(*) AS num_trips
FROM `bigquery-public-data.london_bicycles.cycle_hire`
GROUP BY start_date
);
One key thing is that we ask the model to keep track of forecast errors during the modeling using DECOMPOSE_TIME_SERIES. This is so that we can turn around and ask to see the anomalies.
2. Detecting anomalies
Next, detect anomalies (unusual days):
SELECT *
FROM ML.DETECT_ANOMALIES(
MODEL ch09eu.bicycle_daily_trips,
STRUCT (0.95 AS anomaly_prob_threshold))
ORDER BY anomaly_probability DESC
LIMIT 10
The result?
data:image/s3,"s3://crabby-images/0590f/0590fb1229e035f9b7d0ea1923f3e931395418b1" alt=""
Okay, the second item on the list is Christmas. But why was July 9, 2015 so unusual?
What happened in London on July 9, 2015?
Let’s ask Google.
data:image/s3,"s3://crabby-images/67610/676102bedcadf3b56139952200360096d54d50ae" alt=""
Ah yes. The subway network was shutdown due to a strike. That’s why there was an unusually high number of bicycle rentals.
Enjoy!