A simple approach would be to create a BigQuery table that holds metadata about the files themselves. Your Dataflow pipeline would then start by reading from that BigQuery table. I do this for satellite data; here’s the metadata table to give you an idea: https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=noaa_goes16&page=dataset
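
To make that concrete, here is a rough sketch of what a pipeline driven by a metadata table might look like with the Apache Beam Python SDK. The table name `my-project.my_dataset.file_metadata` and the columns `gcs_url` and `created` are hypothetical placeholders, not the schema of the public dataset linked above, so adjust them to whatever your own metadata table contains. The final step is only a stand-in for real processing.

```python
from datetime import date

# Hypothetical table name, for illustration only.
METADATA_TABLE = "my-project.my_dataset.file_metadata"


def build_metadata_query(table: str, start: date, end: date) -> str:
    """Build a Standard SQL query selecting file paths created in [start, end).

    Assumes hypothetical columns `gcs_url` (file path) and `created` (timestamp).
    """
    return (
        f"SELECT gcs_url FROM `{table}` "
        f"WHERE DATE(created) >= '{start.isoformat()}' "
        f"AND DATE(created) < '{end.isoformat()}'"
    )


def row_to_path(row: dict) -> str:
    """ReadFromBigQuery yields each row as a dict keyed by column name."""
    return row["gcs_url"]


def run(argv=None):
    # Imported here so the helpers above can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.io import fileio
    from apache_beam.options.pipeline_options import PipelineOptions

    query = build_metadata_query(METADATA_TABLE, date(2024, 1, 1), date(2024, 1, 2))

    # Reading from a query needs a GCS temp location, e.g. --temp_location.
    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (
            p
            # The pipeline starts from the metadata table, not from the files.
            | "ReadMetadata" >> beam.io.ReadFromBigQuery(
                query=query, use_standard_sql=True)
            | "ToPath" >> beam.Map(row_to_path)
            # Turn each path into a readable file handle.
            | "MatchFiles" >> fileio.MatchAll()
            | "OpenFiles" >> fileio.ReadMatches()
            # Placeholder step: emit (path, size in bytes) for each file.
            | "Process" >> beam.Map(lambda f: (f.metadata.path, len(f.read())))
        )
```

Calling `run()` with the usual Dataflow options (`--runner`, `--project`, `--region`, `--temp_location`) would launch it. The nice part of this design is that selecting which files to process becomes an ordinary SQL `WHERE` clause instead of a file-pattern match.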