Speeding up small queries in BigQuery with BI Engine
Just turn it on; no code changes needed
copy of article published in https://cloud.google.com/blog/topics/developers-practitioners/speeding-small-queries-bigquery-bi-engine on 2021–04–01.
A quick and easy way to speed up small queries in BigQuery (such as to populate interactive applications or dashboards) is to use BI Engine. The New York Times, for example, uses the SQL interface to BI Engine to speed up their Data Reporting Engine.
The queries
I’ll use three representative queries on tables between 100MB and 3GB — tables that are typically considered smallish by BigQuery standards. Because BigQuery is a columnar database, I’m reporting only the size of the columns included in the query.
The first involves processing about 400 MB data to find the average cost of a Medicare claim:
COST_BY_STATE="""
SELECT
nppes_provider_state,
SUM(total_drug_cost)/SUM(total_claim_count) AS avg_cost
FROM `bigquery-public-data.medicare.part_d_prescriber_2014`
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
"""
The second processes a larger table (3GB) to find the average tip for a Chicago taxi based on payment type:
TIPS_BY_TYPE="""
SELECT payment_type, AVG(tips) AS avg_tip
FROM bigquery-public-data.chicago_taxi_trips.taxi_trips
GROUP BY 1
ORDER BY 2 DESC
"""
The third processes a smaller table (100 MB) to find the most polluted sites:
AIR_QUALITY="""
SELECT
site_num,
ANY_VALUE(state_name) AS state,
AVG(aqi) as air_quality_index,
FROM `bigquery-public-data.epa_historical_air_quality.pm10_daily_summary`
GROUP BY site_num
ORDER BY air_quality_index DESC
LIMIT 10
"""
Measuring query performance
To measure the query performance, I’ll run any given query 5 times, taking care to turn off the cache:
from google.cloud import bigquery
from timeit import default_timer as timer
from datetime import timedelta# Construct a BigQuery client object.
client = bigquery.Client()def run_query(query, n=5):
tot_slotmillis, tot_timeelapsed = 0, timedelta(0)
for iter in range(n):
query_job = client.query(query, bigquery.job.QueryJobConfig(use_query_cache=False))
df = query_job.result().to_dataframe() tot_slotmillis += query_job.slot_millis
tot_timeelapsed += (query_job.ended - query_job.started)
print("Job stat: slot_mills={} server_time={}".format(tot_slotmillis/n, tot_timeelapsed/n))
I print out the total resources consumed (slot-milliseconds) and the total time taken by the query on the server.
If you are on a flat-rate pricing (most enterprise customers are), the slot-milliseconds reflects how long your slots are getting used. So, this is a reflection of the cost of the query to you.
The time elapsed is the time spent in computing the results. Note that the server_time is the time taken to process the request (I don’t measure the network roundtrip time because it’s going to be the same whether or not you use BI Engine).
I first ran the three queries without BI Engine (see my notebook in GitHub).
Turning on BI Engine
I then went to the BigQuery web console and created a 10 GB BI Engine reservation (monthly cost: $300):
Why did I pick 10 GB?
One of the key ways that BI Engine speeds up queries on small tables is that it caches the tables in memory. It also does other speedups, but this is the main one. So, you want to provide it enough memory to comfortably hold the tables you will be querying. BI Engine will automatically manage the memory for you.
In my case, the Chicago taxicab query involves 3GB, so I used 10 GB. For the other two queries, 1 GB would have been enough. You can go up to 100 GB, something I’d recommend if you have many concurrent queries on small tables.
You can turn on BI Engine from a script using the Reservations API. Note, however, that it takes a few minutes for the memory to become available, so this is something you should consider doing for a few hours at least, not on a per-query basis.
Turning on BI Engine
Here’s how the 3 queries compared:
As you can see, I got cost improvements of 8x to 20x and timing improvements of about 2x.
Note that some BI Engine capacity is included at no extra cost when you purchase a flat rate reservation–for example, you get 50 GB of BI Engine included for free with a 1000-slot annual commitment. At $30-$300/month, additional BI Engine capacity is a very cost-effective way to make your BigQuery reservations go further, and get a speedup too.
Enjoy!
Next steps
- Try it out. My code is on GitHub.
- You can get the full speedup in your applications provided you are using the query() method in the BigQuery Client API [the insertJob() methods are also accelerated, but not as much]. So, check your code.
- To learn more about BigQuery, read my book.
Note: The SQL interface to BI Engine is in preview at the moment (March 2021). Contact your GCP sales rep to turn this on in your account. Thanks to Mosha Pasumansky and the BI Engine team for helpful discussions.