Using Cloud Dataflow and Google Cloud Public Datasets

Originally posted on Google Cloud Blog at https://cloud.google.com/blog/products/gcp/how-to-do-distributed-processing-of-landsat-data-in-python

One common data analysis task across the agricultural industry, as well as in academia and government (for drought studies, climate modeling, and so on), is to create a monthly vegetation index from Landsat images, which are now available as a public dataset on Google Cloud Platform (source of Landsat images: U.S. Geological Survey). One approach to creating such a monthly vegetation index is to write a data processing script that does the following (a code sketch of step 2 follows the list):

  1. Find all the Landsat images that cover the location in question.
  2. Find the least-cloudy image for each month, making sure…
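To make step 2 concrete, here is a minimal sketch that queries the public index of Landsat scenes in BigQuery and picks the least-cloudy scene per month. The table and column names (cloud_storage_geo_index.landsat_index, cloud_cover, date_acquired) and the example WRS path/row are assumptions to verify against the current dataset, not code taken from the article.

from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT
  FORMAT_DATE('%Y-%m', CAST(date_acquired AS DATE)) AS month,
  ARRAY_AGG(STRUCT(scene_id, base_url, cloud_cover)
            ORDER BY cloud_cover ASC LIMIT 1)[OFFSET(0)] AS least_cloudy
FROM `bigquery-public-data.cloud_storage_geo_index.landsat_index`
WHERE spacecraft_id = 'LANDSAT_8'
  AND wrs_path = 44 AND wrs_row = 34   -- example path/row covering the area of interest
GROUP BY month
ORDER BY month
"""
for row in client.query(sql):
    print(row.month, row.least_cloudy)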

Using the new Eventarc Audit Log functionality on Google Cloud

Repost of https://cloud.google.com/blog/topics/developers-practitioners/how-trigger-cloud-run-actions-bigquery-events

Many BigQuery users ask for database triggers — a way to run some procedural code in response to events on a particular BigQuery table, model, or dataset. Maybe you want to run an ELT job whenever a new table partition is created, or maybe you want to retrain your ML model whenever new rows are inserted into the table.

In the general category of “Cloud gets easier”, this article shows how to tie together BigQuery and Cloud Run quite simply and cleanly. …
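As an illustration of the wiring (a sketch under assumptions, not the article's exact code), the Cloud Run service that Eventarc targets can inspect the BigQuery audit-log event it receives and react only to the table it cares about. The field path below follows the BigQueryAuditMetadata jobChange payload as delivered in binary content mode; the table name and the downstream action are placeholders.

import os
from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["POST"])
def handle_bigquery_event():
    event = request.get_json()
    payload = event.get("protoPayload", {})
    # The audit log identifies the job that just completed and the table it wrote to.
    job = payload.get("metadata", {}).get("jobChange", {}).get("job", {})
    dest = job.get("jobConfig", {}).get("queryConfig", {}).get("destinationTable", "")
    if dest.endswith("/tables/mytable"):   # only react to writes into the table we care about
        print(f"New data in {dest}: kicking off downstream processing")
        # e.g. start an ELT query or an ML retraining job here
    return ("", 204)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))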


Building an ELT pipeline using Google Sheets as an intermediary

This was originally published in the Google Cloud Developers and Practitioners Blog: https://cloud.google.com/blog/topics/developers-practitioners/loading-complex-csv-files-bigquery-using-google-sheets

BigQuery offers the ability to quickly import a CSV file, both from the web user interface and from the command line:

bq load --source_format CSV --autodetect \
mydataset.mytable ./myfile.csv

Limitations of autodetect and import

This works for plain-vanilla CSV files, but can fail on more complex ones. As an example of a file it fails on, let’s take a dataset of New York City Airbnb rentals from Kaggle. This dataset has 16 columns, but one of the columns consists of pretty much free-form text. …
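To show where the Sheets intermediary fits (a sketch under assumptions, not the article's full pipeline): once the messy CSV has been cleaned up in a Google Sheet, the sheet can be exposed to BigQuery as an external table using the Python client. The project, dataset, and sheet URL below are placeholders, and the client's credentials need the Drive scope to read the sheet.

from google.cloud import bigquery

client = bigquery.Client()
table = bigquery.Table("my-project.mydataset.airbnb_rentals")   # placeholder IDs

config = bigquery.ExternalConfig("GOOGLE_SHEETS")
config.source_uris = ["https://docs.google.com/spreadsheets/d/SHEET_ID"]  # placeholder URL
config.autodetect = True
config.options.skip_leading_rows = 1   # header row in the sheet
table.external_data_configuration = config

client.create_table(table, exists_ok=True)
# The sheet can now be queried with standard SQL, e.g. SELECT * FROM mydataset.airbnb_rentals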


Subclass Layer, and implement call() with TensorFlow functions

Data augmentation can help an image ML model learn to handle variations that are not present in the training dataset. For example, photographs provided to an ML model (especially ones taken by amateur photographers) are likely to vary considerably in lighting. We can therefore increase the effective size of the training dataset, and make the model more resilient, by augmenting the training images with random changes to brightness, contrast, saturation, and so on.

While Keras has several built-in data augmentation layers (like RandomFlip), it doesn’t currently support…
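A minimal sketch of the approach in the title: subclass tf.keras.layers.Layer and implement call() with TensorFlow image ops, so the augmentation runs in-graph and only during training. The layer name and parameter ranges below are illustrative, not the article's exact implementation.

import tensorflow as tf

class RandomColorDistortion(tf.keras.layers.Layer):
    def __init__(self, contrast_range=(0.5, 1.5), brightness_delta=0.2, **kwargs):
        super().__init__(**kwargs)
        self.contrast_range = contrast_range
        self.brightness_delta = brightness_delta

    def call(self, images, training=None):
        # Pass images through unchanged at inference time.
        if not training:
            return images
        # Use TensorFlow ops so the layer works inside tf.function / graph mode.
        images = tf.image.random_contrast(images, self.contrast_range[0], self.contrast_range[1])
        images = tf.image.random_brightness(images, self.brightness_delta)
        return tf.clip_by_value(images, 0.0, 1.0)

# Usage: insert the layer right after the model's Input, e.g. x = RandomColorDistortion()(inputs)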


Collaboratively transform, document, schedule datasets using SQL

Increasingly, we see a move from building ETL pipelines (where much of the transformation is carried out in tools like Spark or Dataflow before the data is loaded into BigQuery) to ELT pipelines (where the transformation is carried out within BigQuery itself). The reasons are that (1) SQL is easier for business users to write, and (2) BigQuery scales better and is less expensive than alternative data processing technologies.

The problem with doing all of the transformation in SQL, though, is that the code can become hard to maintain. How often have you come back to a project after a few months…


Use ASSERT to verify pre- and post-conditions

When you are writing a BigQuery script, you will often want to make sure that the data matches your expectations. How many times have you written a script that worked when you deployed it, only to have someone change the input data upstream, and months went by before you discovered that the script was silently creating erroneous or empty tables? Wouldn’t it be better if the script failed right away and you got a heads-up?

Build solid scripts using assertions! Image by Pawel Kozera from Pixabay

The script

Imagine that your script (you might be scheduling it, or it might be run on a trigger) creates a table of average duration of bicycle rides for some set of…
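As a sketch of the idea (the table names and query below are placeholders, not the article's script), an ASSERT statement at the top of the script makes the whole run fail loudly if the input does not match expectations, instead of silently producing an empty aggregate:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
-- Pre-condition: fail the script if last month's rides are missing upstream.
ASSERT (
  SELECT COUNT(*)
  FROM mydataset.bicycle_rides
  WHERE start_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH)
) > 0 AS 'No rides in the last month; the upstream load may have failed';

CREATE OR REPLACE TABLE mydataset.avg_ride_duration AS
SELECT start_station_name, AVG(duration_minutes) AS avg_duration
FROM mydataset.bicycle_rides
GROUP BY start_station_name;
"""
client.query(sql).result()   # raises an exception if the ASSERT fails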


How to use image embeddings for compression, search, interpolation and clustering

Embeddings in machine learning provide a way to create a concise, lower-dimensional representation of complex, unstructured data. Embeddings are commonly employed in natural language processing to represent words or sentences as numbers.

In an earlier article, I showed how to create a concise representation (50 numbers) of 1059x1799 HRRR images. In this article, I will show you that the embedding has some nice properties, and you can take advantage of these properties to implement use cases like compression, image search, interpolation, and clustering of large image datasets.
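To make one of those properties concrete, here is a minimal sketch of embedding-based image search: with the 50-number embeddings loaded into a NumPy array, the most similar images are simply the nearest neighbors in embedding space. The file name and array shape are assumptions for illustration.

import numpy as np

embeddings = np.load("hrrr_embeddings.npy")   # assumed file, shape (num_images, 50)

def most_similar(query_idx, k=5):
    # Image search: the k images whose embeddings are closest (Euclidean distance).
    dists = np.linalg.norm(embeddings - embeddings[query_idx], axis=1)
    return np.argsort(dists)[1:k + 1]   # skip index 0, which is the query image itself

print(most_similar(42))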

Compression

First of all, does the embedding capture the important information in the image…


Designing and training an autoencoder on HRRR images in Keras

Autoencoder examples on the internet seem to either use toy datasets (MNIST, 28x28 images) or take advantage of transfer learning from ImageNet bottleneck layers. I will show you how to train an autoencoder from scratch, something you will want to do if you have enough data, and data that is completely unlike the photographs that ImageNet consists of.

In an earlier article, I showed how to take weather forecast images and create TensorFlow records out of them to make them ready for machine learning.

In this article, I will show how to do one machine learning task on these images…
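For orientation, here is a minimal convolutional autoencoder sketch in Keras: an encoder that compresses the image down to a small dense embedding, and a decoder that reconstructs it. The input size and layer sizes below are placeholders; the HRRR images (1059x1799) and the architecture in the article are considerably larger.

import tensorflow as tf
from tensorflow.keras import layers

IMG_H, IMG_W = 256, 256   # placeholder input size
EMBED_DIM = 50            # size of the embedding

inputs = tf.keras.Input(shape=(IMG_H, IMG_W, 1))
x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(inputs)
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Flatten()(x)
embedding = layers.Dense(EMBED_DIM, name="embedding")(x)

x = layers.Dense((IMG_H // 4) * (IMG_W // 4) * 32, activation="relu")(embedding)
x = layers.Reshape((IMG_H // 4, IMG_W // 4, 32))(x)
x = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)
decoded = layers.Conv2DTranspose(1, 3, strides=2, padding="same")(x)

autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(train_images, train_images, ...)  -- trained to reconstruct its own input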


A step-by-step guide to extracting structured data from paper forms

Many business processes, especially ones that require interacting with customers and partners, involve paper forms. As consumers, we are used to filling out forms to apply for insurance, make insurance claims, specify healthcare preferences, apply for employment, set up tax withholdings, and so on. Businesses on the other side of these transactions get a form that they need to parse, extract specific pieces of data from, and use to populate a database.

The input form

Google Cloud Document AI comes with form-parsing capability. Let’s use it to parse a campaign disclosure form. These are forms that every US political campaign needs to file with the Federal Election Commission…
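As a hedged sketch of what the form-parsing call looks like with the Document AI Python client (the project, location, processor ID, and file path are placeholders; the article walks through the details):

from google.cloud import documentai_v1 as documentai

client = documentai.DocumentProcessorServiceClient(
    client_options={"api_endpoint": "us-documentai.googleapis.com"})  # match the processor's region
name = client.processor_path("my-project", "us", "my-form-parser-id")   # placeholders

with open("campaign_disclosure.pdf", "rb") as f:
    raw_document = documentai.RawDocument(content=f.read(), mime_type="application/pdf")

result = client.process_document(
    request=documentai.ProcessRequest(name=name, raw_document=raw_document))
document = result.document

def layout_text(layout, text):
    # A form field's key/value is stored as offsets into the full document text.
    return "".join(text[int(seg.start_index):int(seg.end_index)]
                   for seg in layout.text_anchor.text_segments)

for page in document.pages:
    for field in page.form_fields:
        key = layout_text(field.field_name, document.text).strip()
        value = layout_text(field.field_value, document.text).strip()
        print(f"{key}: {value}")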


Use Apache Beam to convert custom binary files at scale

If you have JPEG or PNG images, you can read them directly into TensorFlow using tf.io.decode_image. What if your data is in some industry-specific binary format?

Why HRRR to TensorFlow records?

The High Resolution Rapid Refresh (HRRR) model is a numerical weather model. Because weather models work best when countries all over the world pool their observations, the format for weather data is decided by the World Meteorological Organization and it is super-hard to change. So, the HRRR data is disseminated in a #@!$@&= binary format called GRIB.

Regardless of the industry you are in — manufacturing, electricity generation, pharmaceutical research, genomics, astronomy— you probably…
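A minimal sketch of the shape of such a pipeline (not the article's actual code): decode each GRIB file into a grid, turn the grid into a serialized tf.train.Example, and let Apache Beam write the TFRecord files. The bucket paths are placeholders, and decoding via xarray with the cfgrib engine is an assumption.

import apache_beam as beam
import numpy as np
import tensorflow as tf
import xarray as xr

def grib_to_example(filename):
    # Decode the GRIB file (assumed: xarray + cfgrib) and take the first variable as a 2D grid.
    ds = xr.open_dataset(filename, engine="cfgrib")
    data = ds[list(ds.data_vars)[0]].values.astype(np.float32)
    example = tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(float_list=tf.train.FloatList(value=data.ravel().tolist())),
        "height": tf.train.Feature(int64_list=tf.train.Int64List(value=[data.shape[0]])),
        "width": tf.train.Feature(int64_list=tf.train.Int64List(value=[data.shape[1]])),
    }))
    return example.SerializeToString()

with beam.Pipeline() as p:
    (p
     | "filenames" >> beam.Create(["gs://my-bucket/hrrr/sample.grib2"])   # placeholder paths
     | "to_tfexample" >> beam.Map(grib_to_example)
     | "write" >> beam.io.WriteToTFRecord("gs://my-bucket/hrrr_tfrecords/data"))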

Lak Lakshmanan

Data Analytics & AI @ Google Cloud
