Use ASSERT to verify pre- and post-conditions

When you write a BigQuery script, you often want to make sure that the data matches your expectations. How many times have you written a script that worked when you deployed it, only for someone to change your input data upstream, and months to pass before you discovered that the script was silently creating erroneous or empty tables? Wouldn’t it be better if the script failed and you got a heads-up?

Build solid scripts using assertions! Image by Pawel Kozera from Pixabay

The script

Imagine that your script (you might be scheduling it, or it might be run on a trigger) creates a table of average duration of bicycle rides for some set of…
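The idea can be sketched with a BigQuery ASSERT statement placed before the table creation; the table and column names below are hypothetical stand-ins for your own:

```sql
-- Fail the whole script early if the upstream data no longer matches expectations
ASSERT (
  SELECT COUNT(*)
  FROM `my_project.my_dataset.bicycle_rides`   -- hypothetical input table
  WHERE duration_minutes IS NOT NULL
) > 0
AS 'Input table has no rides with a valid duration';
```

If the condition evaluates to false, the script stops with the given message instead of quietly writing an empty output table.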


How to use image embeddings for compression, search, interpolation and clustering

Embeddings in machine learning provide a way to create a concise, lower-dimensional representation of complex, unstructured data. Embeddings are commonly employed in natural language processing to represent words or sentences as numbers.

In an earlier article, I showed how to create a concise representation (50 numbers) of 1059x1799 HRRR images. In this article, I will show you that the embedding has some nice properties, and you can take advantage of these properties to implement use cases like compression, image search, interpolation, and clustering of large image datasets.

Compression

First of all, does the embedding capture the important information in the image? …


Designing and training an autoencoder on HRRR images in Keras

Autoencoder examples on the internet seem to be either toy examples (MNIST, 28x28 images) or ones that take advantage of transfer learning from ImageNet bottleneck layers. I will show you how to train an autoencoder from scratch, something you will do if you have enough data, and data that is completely unlike the photographs that ImageNet consists of.

In an earlier article, I showed how to take weather forecast images and create TensorFlow records out of them to make them ready for machine learning.

In this article, I will show how to do one machine learning task on these images: create concise representations of the radar reflectivity “analysis” fields from the High Resolution Rapid Refresh (HRRR) model using autoencoders. …


A step-by-step guide to extracting structured data from paper forms

Many business processes, especially ones that involve interacting with customers and partners, involve paper forms. As consumers, we are used to filling out forms to apply for insurance, make insurance claims, specify healthcare preferences, apply for employment, declare tax withholdings, and so on. Businesses on the other side of these transactions get a form that they need to parse, extract specific pieces of data from, and populate a database with.

The input form

Google Cloud Document AI comes with form-parsing capability. Let’s use it to parse a campaign disclosure form. These are forms that every US political campaign needs to file with the Federal Election Commission. …


Use Apache Beam to convert custom binary files at scale

If you have JPEG or PNG images, you can read them directly into TensorFlow using tf.io.decode_image. What if your data is in some industry-specific binary format?

Why HRRR to TensorFlow records?

The High Resolution Rapid Refresh (HRRR) model is a numerical weather model. Because weather models work best when countries all over the world pool their observations, the format for weather data is decided by the World Meteorological Organization and it is super-hard to change. So, the HRRR data is disseminated in a #@!$@&= binary format called GRIB.

Regardless of the industry you are in — manufacturing, electricity generation, pharmaceutical research, genomics, astronomy — you probably have some format like this: a format that no modern software framework supports. …


Numeric-calculations-as-a-service by leveraging ML infrastructure

In scientific computations and simulations, you often have a really expensive calculation that you want to distribute and run in parallel. Ideally, you’d be able to deploy that calculation function to a serverless framework that will

  • Run the calculation on a GPU
  • Expose the function as a REST API
  • Handle as many calculations as you throw at it, autoscaling as needed
  • Abstract away all the complexities of Kubernetes, web servers, cluster management, etc.

Extracting such a function into a GPU-hosted microservice can allow you to greatly speed up your simulation software without having to rewrite the whole thing in CUDA…


Translating the slot usage graph into workload management

I am working with a non-profit organization that uses BigQuery for their data analysis. They use on-demand pricing, but have hit the point at which flat rate pricing will start to make sense (typically $10k monthly, although if your usage is more repeatable, it can be as low as $2k monthly). How should they go about making the change? What is a good strategy to follow?

All this information is in the BigQuery documentation, but it is in several different places. So, I’m consolidating it here and linking out to some other great articles on the topic.

When does it make sense?

With on-demand pricing, you get 2000 slots per project. So, you should consider moving from on-demand pricing to flat-rate pricing only if any of these…
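One way to gauge whether flat-rate pricing makes sense is to look at your actual slot consumption. As a sketch, assuming your jobs run in the US multi-region (adjust the `region-us` qualifier to your own region):

```sql
-- Average slot usage per hour over the past 7 days
SELECT
  TIMESTAMP_TRUNC(period_start, HOUR) AS hour,
  SUM(period_slot_ms) / (1000 * 3600) AS avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
WHERE period_start > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY hour
ORDER BY hour;
```

Summing slot-milliseconds over each hour and dividing by the milliseconds in an hour gives the average number of slots in use, which you can compare against the 2000-slot on-demand allotment.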


But an old-fashioned one

Kamala Harris’ first name is a recognizably Indian name. It comes from the Sanskrit word for “lotus”, a pond flower, and one frequently associated with a meditating Buddha or a self-sustaining Vishnu.

Buddha on a lotus. Image by Wortflow from Pixabay

Babies named Kamala

The US Social Security Administration (SSA) maintains a list of baby names by state and year. It is available in BigQuery as a public dataset, which means that we can quite quickly query it:

SELECT
  year,
  SUM(number) AS num_babies
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE name = 'Kamala'
GROUP BY year
ORDER BY year ASC

Plotting this in Google Sheets (in the BigQuery console, select Explore Data and choose Explore with…


Use Document embeddings in BigQuery for document similarity and clustering tasks

BigQuery offers the ability to load a TensorFlow SavedModel and carry out predictions. This capability is a great way to add text-based similarity and clustering on top of your data warehouse.

Follow along by copy-pasting queries from my notebook in GitHub. You can try out the queries in the BigQuery console or in an AI Platform Jupyter notebook.
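The mechanics can be sketched in two statements; the dataset, model name, and Cloud Storage path below are hypothetical placeholders for wherever your SavedModel lives:

```sql
-- Load a TensorFlow SavedModel (e.g., a text embedding model) into BigQuery ML
CREATE OR REPLACE MODEL advdata.text_embed          -- hypothetical dataset.model
OPTIONS(model_type='tensorflow',
        model_path='gs://my-bucket/embed_model/*'); -- hypothetical GCS path

-- Run the loaded model; the input column name must match the model's input
SELECT output_0 AS embedding
FROM ML.PREDICT(MODEL advdata.text_embed,
                (SELECT 'hail of golf ball size' AS sentences));
```

Once the model is loaded, `ML.PREDICT` can be applied to any query result, which is what makes similarity and clustering on warehouse data so convenient.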

Text embeddings are useful for document similarity and clustering tasks. Image by kerttu from Pixabay

Storm reports data

As an example, I’ll use a dataset consisting of wind reports phoned into National Weather Service offices by “storm spotters”. This is a public dataset in BigQuery and it can be queried as follows:

SELECT
  EXTRACT(DAYOFYEAR FROM timestamp) AS julian_day,
  ST_GeogPoint(longitude, latitude) AS location,
  comments
FROM `bigquery-public-data.noaa_preliminary_severe_storms.wind_reports` …

Build a Data Mesh & Set up MLOps

Businesses realize that as more and more products and services become digitized, there is an opportunity to capture a lot of value by taking better advantage of data. In retail, it could be by avoiding deep discounting by stocking the right items at the right time. In financial services, it could be by identifying unusual activity and behavior faster than the competition. In media, it could be by increasing engagement by offering up more personalized recommendations.

Key Challenges

In my talk at Cloud Next OnAir, I explain that, in order to lead your company toward data-powered innovation, there are a few key challenges that you will have to…

About

Lak Lakshmanan

Data Analytics & AI @ Google Cloud
