Visualization and large-scale processing of historical weather radar (NEXRAD Level II) data
Use serverless tools: Apache Beam on Cloud Dataflow
Published on the Google Cloud Blog on June 25, 2017 and archived here
One of the highest-resolution sources of weather data in the United States comes from the NEXRAD network of 160 weather radars operated by the NOAA National Weather Service (NWS), the Federal Aviation Administration (FAA), and the U.S. Air Force (USAF). These 10 cm S-Band Doppler radars, located in the contiguous United States, Alaska, Hawaii, U.S. territories and at military base sites, detect atmospheric precipitation and winds.
The historical archive of that weather radar data is now available as a public dataset on Google Cloud Storage. For data collected before 2011, the archive contains reflectivity, mean radial velocity, and spectrum width radial volume scans at original resolution (e.g., the reflectivity is at 1 degree x 1 km resolution). Starting in 2011, as radars were upgraded to dual polarization, differential reflectivity, correlation coefficient, and differential phase were added. These data are available at super-resolution (i.e., some tilts are at 0.5 degree x 0.25 km resolution).
You can browse and download files using the Cloud Console, or interact programmatically with the Google Cloud Storage bucket gs://gcp-public-data-nexrad-l2.
Sample program to visualize volume scans
In order to take advantage of high sustained reads offered within Google’s data centers, the volume scan data in the radar archive is bundled up into suitably large tar files. Thus, for example,
gs://gcp-public-data-nexrad-l2/2012/07/23/KMUX/NWS_NEXRAD_NXL2DP_KMUX_20120723000000_20120723005959.tar
contains volume scans from KMUX whose collection started on July 23, 2012 in the hour beginning at 00:00:00 UTC (the two timestamps in the file name mark the start and end of that hour).
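Since the bucket follows a consistent YYYY/MM/DD/RADAR layout, you can assemble these paths programmatically. A minimal sketch in Python (the helper name is mine; note that the NWS_NEXRAD_NXL2DP prefix matches the dual-pol example above, but older bundles in the archive use different prefixes):

```python
def nexrad_tar_path(radar, year, month, day, hour):
    """Build the Cloud Storage path of an hourly Level II tar bundle.

    Follows the bucket's YYYY/MM/DD/RADAR layout. The NWS_NEXRAD_NXL2DP
    prefix here matches the dual-pol example above; pre-dual-pol bundles
    use a different prefix, so treat this as illustrative.
    """
    start = f"{year:04d}{month:02d}{day:02d}{hour:02d}0000"
    end = f"{year:04d}{month:02d}{day:02d}{hour:02d}5959"
    return (f"gs://gcp-public-data-nexrad-l2/"
            f"{year:04d}/{month:02d}/{day:02d}/{radar}/"
            f"NWS_NEXRAD_NXL2DP_{radar}_{start}_{end}.tar")

print(nexrad_tar_path("KMUX", 2012, 7, 23, 0))
# gs://gcp-public-data-nexrad-l2/2012/07/23/KMUX/NWS_NEXRAD_NXL2DP_KMUX_20120723000000_20120723005959.tar
```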
Because radar-visualization software typically does not read tar files and does not read from cloud storage buckets, you will first have to download the tar files to the machine on which you want to visualize them. Here, I want the data from the KIWA radar on July 23, 2013 for the hour starting at 23:00 UTC:
#!/bin/bash
RADAR=KIWA
YEAR=2013
MONTH=07
DAY=23
HOUR=23
gsutil cp gs://gcp-public-data-nexrad-l2/$YEAR/$MONTH/$DAY/$RADAR/*_${RADAR}_${YEAR}${MONTH}${DAY}${HOUR}0000_${YEAR}${MONTH}${DAY}${HOUR}5959.tar temp.tar
tar xvf temp.tar
rm temp.tar
The script above uses gsutil (which comes with the gcloud SDK) and the standard UNIX utility tar. Alternatively, you could use the Python Cloud Storage API to get the files you need. Running the above bash program yields all the volume scans in that hour:
KIWA20130723_230242_V06.gz
KIWA20130723_230726_V06.gz
etc.
KIWA20130723_235451_V06.gz
KIWA20130723_235936_V06.gz
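Each extracted file name encodes the radar ID and the UTC start time of the volume scan, which is handy when selecting scans for a particular time window. A small parser (my own helper, shown for illustration):

```python
import re
from datetime import datetime

def parse_scan_name(name):
    """Parse a volume scan file name like 'KIWA20130723_235451_V06.gz'
    into (radar_id, scan_start_time)."""
    m = re.match(r"([A-Z]{4})(\d{8})_(\d{6})_V\d+", name)
    if not m:
        raise ValueError(f"unrecognized volume scan name: {name}")
    radar, date, time = m.groups()
    return radar, datetime.strptime(date + time, "%Y%m%d%H%M%S")

print(parse_scan_name("KIWA20130723_235451_V06.gz"))
# ('KIWA', datetime.datetime(2013, 7, 23, 23, 54, 51))
```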
With the files in place, we can use the Py-ART library from the US Department of Energy ARM program to create diagrams:
import matplotlib.pyplot as plt
import pyart

def plot_data(infilename, outpng, maxrange):
    radar = pyart.io.read_nexrad_archive(infilename)
    display = pyart.graph.RadarDisplay(radar)
    fig = plt.figure(figsize=(10, 10))

    ax = fig.add_subplot(221)
    display.plot('velocity', 1, ax=ax, title='Doppler Velocity',
                 colorbar_label='',
                 axislabels=('', 'North South distance from radar (km)'))
    display.set_limits((-maxrange, maxrange), (-maxrange, maxrange), ax=ax)

    ax = fig.add_subplot(222)
    display.plot('reflectivity', 0, ax=ax,
                 title='Reflectivity lowest', colorbar_label='',
                 axislabels=('', ''))
    display.set_limits((-maxrange, maxrange), (-maxrange, maxrange), ax=ax)

    ax = fig.add_subplot(223)
    display.plot('reflectivity', 1, ax=ax,
                 title='Reflectivity second', colorbar_label='')
    display.set_limits((-maxrange, maxrange), (-maxrange, maxrange), ax=ax)

    ax = fig.add_subplot(224)
    display.plot('cross_correlation_ratio', 0, ax=ax,
                 title='Correlation Coefficient', colorbar_label='',
                 axislabels=('East West distance from radar (km)', ''))
    display.set_limits((-maxrange, maxrange), (-maxrange, maxrange), ax=ax)

    fig.savefig(outpng)  # save to the requested PNG file
    plt.show()
Using the above code on KIWA20130723_235451_V06.gz yields:
If you zoom in (change the maxrange to 30 km), you will notice speckle-like high reflectivity scatter near the radar.
The high-reflectivity (yellow and red) pixels in the figure above are not precipitation. Instead, the high reflectivity is caused by ground clutter.
Which radars in Arizona are most affected by such ground clutter? Answering this question calls for large-scale data analysis, and this is where cloud computing really shines.
Sample program to do large-scale analysis
While you can work with individual volume scans as shown above, one key benefit of having all the NEXRAD data immediately available on a public cloud is the ability to analyze long time periods of data at scale. Thanks to GCP’s “serverless” approach to infrastructure, it is possible to do data processing, data analysis, and machine learning without having to manage low-level resources.
Cloud Dataflow, GCP’s fully managed service for stream and batch processing, allows you to write a data processing and analysis pipeline that will be executed in a distributed manner. The pipeline will autoscale different steps to run on multiple machines in a fault-tolerant way.
For example, let’s say that you have an algorithm to identify cases of anomalous propagation where the ducting of a radar beam causes ground-clutter artifacts to appear in radar data. You wish to determine which range-gates are affected, and when. But you wish to do this over several months, several years, and several radars.
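The actual AP-detection algorithm used here is in the GitHub code. Purely to illustrate the kind of per-gate test such an algorithm might apply: dual-pol data lets you flag gates whose reflectivity is strong but whose correlation coefficient is low, since ground clutter and AP typically have a low correlation coefficient while precipitation is near 1. A sketch with illustrative thresholds of my own choosing, not the thresholds used in the pipeline:

```python
def flag_clutter_gates(reflectivity, correlation,
                       dbz_thresh=20.0, cc_thresh=0.8):
    """Flag gates that look like clutter/AP: strong echo (reflectivity
    above dbz_thresh dBZ) combined with a low correlation coefficient.
    Thresholds are illustrative, not the pipeline's actual values."""
    return [z > dbz_thresh and cc < cc_thresh
            for z, cc in zip(reflectivity, correlation)]

# One ray of four gates: only the second combines a strong echo with low CC.
mask = flag_clutter_gates([5.0, 35.0, 40.0, 25.0], [0.99, 0.55, 0.97, 0.96])
print(mask)  # [False, True, False, False]
```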
This is the sort of data analysis that might otherwise involve cobbling together monitoring, error checking, and file-handling code. Instead, we can simply code up the actual logic in Apache Beam (Cloud Dataflow SDK versions 2.0 and later are based on Beam) and submit the pipeline to Cloud Dataflow. Cloud Dataflow will then take care of infrastructure wrangling, scaled execution, and fault tolerance.
Pipeline details
The pipeline being executed in Cloud Dataflow is shown in the diagram above. Notice that we managed to carry out four days' worth of processing in less than an hour, thanks to doing the work in parallel on 15 workers with 8 CPUs each! The code to do massively scalable processing on the NEXRAD data is quite straightforward; the full code for the pipeline is on GitHub, but the key steps are shown below:
Beam supports two languages: Python and Java. I chose to write my pipeline in Java because, at present, the Java runtime is faster, but you could do the above processing in Python as well. (If you are interested in what a Beam Python pipeline looks like, see this example of distributed processing of Landsat scenes in Python.) Note the exception-handling code in the processTar step: Every so often, you will run into corrupt volume scan files in the archive and it is important to be able to skip them. Here I am skipping the entire hour of data, but you could skip just the corrupt volume scan by changing where the exception is caught.
The code uses the NetCDF-Java library to read the NEXRAD volume scan files. Because this Java library can read only individual volume scan files on disk (not tar files, and not from Cloud Storage), we download the tar files from Cloud Storage to the local disk of the machine doing the computation, untar the file, and delete the temporary files when we are done. The code to do these tasks (see GitHub) uses the Google Cloud Storage API. For example, downloading from Cloud Storage:
private static File downloadFromGcs(String bucketName, String blobName, File tmpDir) throws IOException {
  Storage storage = StorageOptions.getDefaultInstance().getService();
  File fileName = File.createTempFile("download", "bytes", tmpDir);
  try (ReadChannel reader = storage.reader(bucketName, blobName);
       FileOutputStream writer = new FileOutputStream(fileName)) {
    ByteBuffer bytes = ByteBuffer.allocate(64 * 1024);
    while (reader.read(bytes) > 0) {
      bytes.flip();
      writer.getChannel().write(bytes);
      bytes.clear();
    }
  }
  return fileName;
}
Besides scaling out processing, it is also possible to perform computations, grouping, etc. within the Cloud Dataflow pipeline. The sample (full code on GitHub) also demonstrates finding which radars are most affected by anomalous propagation/ground clutter:
The entire pipeline took about 4 hours to complete. An example AP detection looks like this:
KIWA,2013-07-23T23:54:53,43.736572,20250.0
This result indicates that for the KIWA radar in Phoenix, AZ, there was anomalous propagation at the range gate located at azimuth=43.7, range=20.25km on July 23, 2013 at 23:54:53 UTC. This is one of the high-reflectivity clutter points we noticed in the image in the previous section.
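Each detection record is a simple CSV row (radar, UTC timestamp, azimuth in degrees, range in meters), so downstream analysis can parse it directly. For example, with a small helper of my own:

```python
from datetime import datetime

def parse_detection(line):
    """Parse an AP-detection record like
    'KIWA,2013-07-23T23:54:53,43.736572,20250.0'
    into (radar, time, azimuth_deg, range_km)."""
    radar, ts, azimuth, range_m = line.split(",")
    return (radar,
            datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S"),
            float(azimuth),
            float(range_m) / 1000.0)  # meters -> km

print(parse_detection("KIWA,2013-07-23T23:54:53,43.736572,20250.0"))
# ('KIWA', datetime.datetime(2013, 7, 23, 23, 54, 53), 43.736572, 20.25)
```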
The total number of AP/ground-clutter range-gates by radar for June-August in 2012–2014 are:
KYUX,8542
KFSX,1484
KIWA,7906
The radars in Yuma and Phoenix (KYUX and KIWA) experience AP/ground clutter 5 to 6 times more often than the radar in Flagstaff (KFSX).
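That "5 to 6 times" figure follows directly from the counts above:

```python
# Per-radar AP/ground-clutter range-gate counts from the pipeline output.
counts = {"KYUX": 8542, "KFSX": 1484, "KIWA": 7906}

for radar in ("KYUX", "KIWA"):
    ratio = counts[radar] / counts["KFSX"]
    print(f"{radar}: {ratio:.1f}x KFSX")
# KYUX: 5.8x KFSX
# KIWA: 5.3x KFSX
```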
Next steps
The ability to do this kind of large-scale processing, without any human intervention and without having to move data around, is a key benefit of having your data in a cloud environment that supports serverless data processing. To explore more use cases for Cloud Dataflow, take advantage of our free trial!
Check out the full code for the pipeline on GitHub.