My colleague Polong Lin did a brilliant thing recently. He wanted to do a demo, and rather than ask us to follow a checklist of instructions, he simply had us make a single BigQuery call:
This call provisioned everything in our BigQuery project that we needed for the demo.
You will see that a dataset named demods will get created in your project, and the dataset will contain stuff that is needed for the demo:
How does this magic work?
The idea is to create a Stored Procedure that contains a script. The script…
The Google Cloud AI Platform team have been heads down the past few months building a unified view of the machine learning landscape. This was launched today at Google I/O as Vertex AI. What is it? How good is it? What does it mean for data science jobs?
The idea is that there are a few key constructs in machine learning:
Sometimes, you might want to reformat a table result so you have separate columns for each unique value. This is called a Pivot table — normally, it’s only a display function supported by BI tools. However, it can occasionally be helpful to create the pivot table in SQL. Here’s how to use the PIVOT operator in Google BigQuery to create a pivot table.
Pivot changes rows to columns. For example, suppose we have a table of flights and flight delays like this:
Let’s take the new geospatial capabilities in Data Studio for a spin by exploring the public dataset of NOAA Storm Prediction Center storm reports of hail. Please follow along with me — there’s a free BigQuery sandbox that you can use. Data Studio is a free product.
Go to the BigQuery web console at https://console.cloud.google.com/bigquery and type in the following query:
FORMAT_DATE('%B', event_begin_time) AS month,
magnitude AS hail_inches,
WHERE event_type = 'hail'
The important thing to notice in the query above is that event_point is a GEOGRAPHY type. …
copy of article published in https://cloud.google.com/blog/topics/developers-practitioners/speeding-small-queries-bigquery-bi-engine on 2021–04–01.
A quick and easy way to speed up small queries in BigQuery (such as to populate interactive applications or dashboards) is to use BI Engine. The New York Times, for example, uses the SQL interface to BI Engine to speed up their Data Reporting Engine.
I’ll use three representative queries on tables between 100MB and 3GB — tables that are typically considered smallish by BigQuery standards. Because BigQuery is a columnar database, I’m reporting only the size of the columns included in the query.
The first involves processing about 400 MB data…
Originally posted on Google Cloud Blog at https://cloud.google.com/blog/products/gcp/how-to-do-distributed-processing-of-landsat-data-in-python
One common data analysis task across the agricultural industry, as well in academia and government (for drought studies, climate modeling, and so on), is to create a monthly vegetation index from Landsat images, which is now available as a public dataset on Google Cloud Platform (source of Landsat images: U.S. Geological Survey). One approach to create such a monthly vegetation index is to write a data processing script that does the following:
Many BigQuery users ask for database triggers — a way to run some procedural code in response to events on a particular BigQuery table, model, or dataset. Maybe you want to run an ELT job whenever a new table partition is created, or maybe you want to retrain your ML model whenever new rows are inserted into the table.
In the general category of “Cloud gets easier”, this article will show how to quite simply and cleanly tie together BigQuery and Cloud Run. …
This was originally published in the Google Cloud Developers and Practitioners Blog: https://cloud.google.com/blog/topics/developers-practitioners/loading-complex-csv-files-bigquery-using-google-sheets
BigQuery offers the ability to quickly import a CSV file, both from the web user interface and from the command line:
bq load --source_format CSV --autodetect \
This works for your plain-vanilla CSV files, but can fail on complex CSV files. As an example of a file it fails on, let’s take a dataset of New York City Airbnb rentals data from Kaggle. This dataset has 16 columns, but one of the columns consists of pretty much free-form text. …
Data augmentation can help an image ML model learn to handle variations of the image that are not in the training dataset. For example, it is likely that photographs provided to an ML model (especially if these are photographs by amateur photographers) will vary quite considerably in terms of lighting. We can therefore increase the effective size of the training dataset and make the ML model more resilient if we augment the training dataset by randomly changing the brightness, contrast, saturation, etc. of the training images.
While Keras has several built-in data augmentation layers (like RandomFlip), it doesn’t currently support…
Increasingly, we see a move from building ETL pipelines (where much of the transformation is carried out in tools like Spark or Dataflow before the data is loaded into BigQuery) to ELT pipelines (where the transformation is carried out within BigQuery itself). The reasons are that (1) SQL is easier for business users to write (2) BigQuery scales better and is less expensive than alternative data processing technologies.
The problem with doing all the transformation code in SQL, though, is that it can become hard to maintain. How often have you come back to a project after a few months…
Data Analytics & AI @ Google Cloud