Data Workflow Modernization
Drive transformational improvement in users’ workflows, not an incremental improvement in the tools you use
Many years ago, my grandmother asked me what my job was. I explained to her that I did weather research. “But what is it that you make?” she asked. We went around and around, and I don’t think I ever gave her a satisfactory answer. She would have understood it if I could have pointed to the weather map on TV and said I drew it. But how do you explain the research that goes into determining the things that go into a weather forecast?
Verbs not nouns
Nouns are so much easier to explain than verbs.
It is no wonder, therefore, that when we think about a modernization program, our mind gravitates to nouns, to the things we have. We think about the pain we are experiencing and resolve to modernize the tools we use. We decide that we want to upgrade our database and make it globally consistent and scalable, so we modernize to Spanner. We decide that we want to upgrade our streaming engine and make it more resilient and easier to run, so we modernize to Dataflow. We decide we are done tuning clusters and queries on our data warehouse, so we modernize to BigQuery.
These are all great moves. You should definitely upgrade to easier-to-use, easier-to-run, much more scalable, much more resilient tools whenever you can. However, if you do only like-for-like tool changes, you will end up with just incremental improvements. You will not get transformational change from such upgrades.
Indeed, modern society and most of the innovation we take for granted are the result of transformative changes in the verbs, in the way that we solve problems. These are of course made possible by the thousands of incremental changes in the tools we use to solve those problems. But the best way to benefit from improved tools is to forget the individual tools and approach the job that is to be done from first principles.
Dog’s breakfast of dependencies
So what does this have to do with data engineering?
When we embark on a data modernization program, we usually focus on the nouns. We focus on the Oracle databases and Teradata instances we need to migrate. We find the corresponding cloud product and embark on a migration journey.
Of course, you can’t just migrate the data. If all you had to do was transfer the data and schema from Teradata to BigQuery, you’d be done in a few hours; there is an automated transfer service for exactly that.
The problem is that you have to migrate the dependencies too. What dependencies?
ETL pipelines that used to populate the Teradata instance now have to insert records into BigQuery instead. Those ETL pipelines have to be moved to the cloud too. And all the data sources and libraries that these pipelines depend on.
Your simple project is now a dog’s breakfast of dependencies that have to be ordered and moved.
To avoid this trap, when you embark on a data modernization program, force yourself to think verbs not nouns. You are not modernizing your database or your data warehouse. You are modernizing your data workflow. The data workflow is the job to be done. Approach it from first principles.
What does it mean to modernize a data workflow? Think about the overall task that the end-user wants to do. Perhaps they want to identify high-value customers. Perhaps they want to run a marketing campaign. Perhaps they want to identify fraud. Now, think about this workflow as a whole and how to implement it as cheaply and simply as possible.
Avoid the sunk-cost fallacy. Just because you have an ETL pipeline that knows how to ingest transactions doesn’t mean that you have to use that ETL pipeline. Throwing it away completely, or salvaging small scraps from it, is an option.
Instead, start from first principles. The way to identify high-value customers is to compute total purchases for each user from a historical record of transactions. Figure out how to make such a workflow happen with your modern set of tools.
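The first-principles version of that computation is tiny. Here is a minimal sketch in plain Python, with hypothetical record fields (`customer_id`, `amount`) standing in for whatever your transaction schema actually holds:

```python
from collections import defaultdict

# Hypothetical transaction records; the field names are illustrative.
transactions = [
    {"customer_id": "c1", "amount": 120.0},
    {"customer_id": "c2", "amount": 35.5},
    {"customer_id": "c1", "amount": 80.0},
]

def lifetime_value(txns):
    """Total purchases per customer: the core of the workflow."""
    totals = defaultdict(float)
    for txn in txns:
        totals[txn["customer_id"]] += txn["amount"]
    return dict(totals)

def high_value_customers(txns, threshold):
    """Customers whose lifetime value meets or exceeds a threshold."""
    return {c for c, total in lifetime_value(txns).items() if total >= threshold}

print(lifetime_value(transactions))        # {'c1': 200.0, 'c2': 35.5}
print(high_value_customers(transactions, 100.0))  # {'c1'}
```

Everything else in the modernization, such as ingest, scaling, and freshness, is about getting this one aggregation to run automatically over live data.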
Automation is the name of the game when it comes to modern workflows:
- Automated data ingest. Do not write bespoke ELT pipelines. Use off-the-shelf EL tools to land the data in a data warehouse. It is so much easier to transform on the fly, and capture common transformations in materialized views, than it is to write ETL pipelines for each possible downstream task. In our example, we’d want to use Datastream to replicate transactions as they happen into BigQuery.
- Streaming by default. Land your data in a system that combines batch and streaming storage so that all SQL queries reflect the latest data (subject to some latency). Same with any analytics — look for data processing tools that handle both streaming and batch using the same framework. In our example, the lifetime value calculation can be a SQL query. For reuse purposes, make it a materialized view. That way, all the computations are automatic and the data is always up to date.
- Automatic scaling. Any system that expects you to pre-specify the number of machines, warehouse sizes, and so on is a system that will require you to focus on the system rather than the job to be done. You want scaling to be automatic, so that you can focus on the workflow rather than on the tool.
- Query rewrites, fused stages, etc. You want to be able to focus on the job to be done and decompose it into understandable steps. You don’t want to have to tune queries, rewrite queries, fuse transforms, etc. Let the modern optimizers built into the data stack (Dataflow, BigQuery) take care of these things.
- Evaluation. You don’t want to write bespoke data processing pipelines for evaluating ML model performance. You simply want to be able to specify sampling rates and evaluation queries and be notified about feature drift, data drift, and model drift. All these capabilities should be built into deployed endpoints. (This is the case with Vertex AI).
- Retraining. What should you do if you encounter model drift? Nine times out of ten, the answer is to retrain the model. So, this should also be automatic. Modern ML pipelines will provide a callable hook that you can tie directly to your automated evaluation pipeline. Automate retraining too.
- Continuous training. Model drift is not the only reason you might need to retrain. You want to retrain when you have a lot more data. Maybe when new data lands in a storage bucket. Or when you have a code check-in. Again, this can be automated.
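To make the “streaming by default” bullet concrete: the lifetime-value calculation really is just one SQL aggregation. The sketch below runs it with Python’s built-in sqlite3 so it is self-contained; on BigQuery you would wrap the same SELECT in a `CREATE MATERIALIZED VIEW` statement so the result stays fresh automatically as Datastream lands new rows. Table and column names are illustrative:

```python
import sqlite3

# In-memory stand-in for the warehouse table that Datastream would populate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("c1", 120.0), ("c2", 35.5), ("c1", 80.0)],
)

# On BigQuery, this SELECT would be the body of a materialized view.
LIFETIME_VALUE_SQL = """
    SELECT customer_id, SUM(amount) AS lifetime_value
    FROM transactions
    GROUP BY customer_id
    ORDER BY lifetime_value DESC
"""
rows = conn.execute(LIFETIME_VALUE_SQL).fetchall()
print(rows)  # [('c1', 200.0), ('c2', 35.5)]
```

Note that there is no pipeline code here at all: the transformation is declared once, and keeping it up to date is the platform’s job, not yours.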
As you can see, once you get to a fully automated data workflow, you are looking at a pretty templatized setup that consists of a connector, data warehouse, and ML pipelines. All of these are serverless on GCP, so you are basically looking at just configuration, not cluster management.
Of course, you will be writing a few specific pieces of code:
- Data preparation in SQL
- ML models in a framework such as TensorFlow
- Evaluation query for continuous evaluation
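An evaluation query for continuous evaluation can be as small as a drift check. Here is a hedged sketch in plain Python; the mean-shift metric and the threshold are purely illustrative (production monitoring typically uses distribution distances, and this is not a Vertex AI API):

```python
from statistics import mean

def feature_drift(training_values, serving_values, threshold=0.25):
    """Flag drift when the serving-time mean of a feature shifts by more
    than `threshold` relative to its training-time mean. Illustrative only:
    real monitoring compares full distributions, not just means."""
    baseline = mean(training_values)
    current = mean(serving_values)
    shift = abs(current - baseline) / abs(baseline)
    return shift > threshold

# A drifted serving distribution should trigger automated retraining.
if feature_drift([10, 12, 11, 9], [18, 20, 19, 21]):
    print("drift detected: trigger the retraining hook")
```

The point of the automation argument above is that you write only this check (or its SQL equivalent); sampling the traffic, running it on a schedule, and invoking the retraining hook are the platform’s responsibility.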
The fact that each workflow can be boiled down to such a simple setup explains why an integrated data and AI platform is so important. You won’t get this simplicity if you have to cobble together a bunch of tools yourself.
Transform the workflow itself
You can make the workflow itself much more efficient by automating it with the modern data stack. But before you do that, you should ask yourself a key question: “does this workflow need to be precomputed by a data engineer at all?”
Because that’s what you do whenever you build an ELT or ETL pipeline — you are precomputing. It’s an optimization, nothing more.
In many cases, if you can make the workflow something that is self-serve and ad-hoc, you don’t have to build it with data engineering resources. Because so much is automated, you can provide the ability to run any aggregation (not just lifetime value) on the full historical record of transactions. Move the lifetime value calculation into a declarative semantic layer to which your users can add their own calculations. This is what a tool like Looker will allow you to do.
Once you do that, you get the benefits of consistent KPIs across the org, and users who are empowered to build a library of common measures. The ability to create new metrics now lies with the business teams where this capability belongs in the first place.
In summary, focus on modernizing workflows as a whole, not on migrating each stage of your workflow one by one to cloud-native tools. The automation built into cloud-native tooling will then provide transformative impact. Other than small snippets of code, you can leave the rest of your infrastructure behind. And wherever possible, provide ad hoc, interactive capability to end users.