No, you don’t need MLOps
Keep It Simple: the complexity of full MLOps is rarely needed
MLOps started from a straightforward problem statement — that the technical debt associated with ML models becomes intolerable if the models are not adjusted over time to account for changes in the environment. Since that 2015 observation, ML models and frameworks have been built that make it relatively easy to avoid the most glaring potholes in the way of the ML practitioner. However, in the past year or so, the MLOps buzzword has taken on a life of its own. At this point, most things sold as MLOps are overkill and unnecessary for most teams.
In this article, I take a back-to-first-principles approach to the MLOps field, talk about the original problems, how they can be elegantly addressed in today’s ML frameworks, and why any additional complexity beyond these keep-it-simple (KISS) solutions is unnecessary.
MLOps is the concatenation of machine learning (ML) and operations (Ops) and is analogous to DevOps, which is the concatenation of software development and operations. DevOps came about because lots of software was developed by software engineers and then handed over to systems administrators who didn’t code and couldn’t make changes. This meant that every single change that needed to be made to the production system had to be escalated back to the software development team. The solution was DevOps: have people in production who understand the code and can make changes to it. To make this possible, teams needed the ability to make changes quickly, track dependencies, and so on.
By analogy, the purpose of MLOps is for operations people to administer ML models. As long as your ML system is simple enough that an ops person can (1) find out when it isn’t working properly, (2) make small changes to it, and (3) redeploy the model, you are achieving the goal of MLOps. The technology solutions associated with MLOps are optional because the ML frameworks themselves have come a long way.
Long way since what?
An observation and a viral image by D. Sculley et al
D. Sculley and co-authors from Google wrote a NIPS paper in 2015 titled “Hidden Technical Debt in Machine Learning Systems”. There, they observed:
Developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.
Included in that paper is a viral image that you will find in pretty much every introduction to MLOps:
They observed that the machine learning model is a small part of the overall ML system, and that changes anywhere in this complex system will necessitate changes to the ML model itself.
The paper resonated with me because it described precisely how I was building and productionizing ML models in the 2010s. For example, one of the models I built applied quality control to weather radar data to remove biological artifacts from it:
This involved a complex orchestration of data collection, feature extraction, data verification, and a series of steps that then needed to be precisely repeated in production to avoid training-serving skew:
So, the goal of MLOps, as articulated in the Google Cloud MLOps whitepaper, does resonate with me:
MLOps provides a set of standardized processes and technology capabilities for building, deploying, and operationalizing ML systems rapidly and reliably.
So, what’s the issue?
MLOps tries to automate too much
At a high level, the “standardized process” in MLOps is all about automation. Having recognized that there is a lot of complexity surrounding the ML model, the solution that MLOps practitioners have landed on is to make the surrounding processes (e.g. data validation, feature creation, etc.) systematic so that they can be automated. Once they are fully automated, the argument goes, you don’t need to worry about complexity.
Again, from the Google Cloud MLOps white paper:
[MLOps] advocates formalizing and (when beneficial) automating critical steps of ML system construction
with the following steps described as being necessary to automate:
The problem is that automation takes on a life of its own, and after a while, you find yourself spending more time on automation than on the thing you are supposedly making time for.
It is necessary now to go back to first principles. Is all this automation necessary if your goal is to enable ops people to administer ML models?
Here’s the thing. While the systems people were out there automating the ML ecosystem, ML capabilities and ML frameworks have not stood still. Advances in NLP and computer vision as well as frameworks like Keras and PyTorch have greatly simplified the ML workflow. Also, the general awareness of ML has gone up. Therefore:
- You don’t need to automate something that is now automatic. For example, data preprocessing layers in Keras turn what used to be a really complex text processing workflow into a single function call. So does the ability to directly embed text sentences using models like BERT. Much of the text preprocessing workflow that MLOps people were trying to automate is no longer necessary.
- A manual step is okay if your workflow is streamlined and the situation is rare enough. You don’t need to automate absolutely everything. Thus, for example, CI/CD is unnecessary if launching a job is a matter of running a single command. You don’t need to build complex automated drift detection if someone can easily launch a retraining job if they see the model drifting.
- You can teach ops people new tricks. They can monitor GPU usage and correlate changes in GPU utilization with potential changes in the data being processed by the model.
Key challenges that cause technical debt in ML models
Given this, let’s go back to first principles and consider that MLOps came about as a way to address the build up of technical debt in ML models. Let’s look at why technical debt builds up. The key challenges that cause technical debt to build up in ML models are:
- Training-Serving Skew, where differences crop up between training and serving stacks. Take the case of an NLP model. Suppose you remove all stop words (“the”, “and”, etc.) before you train the model. The model has never seen these words, so you’d better remove the stop words before feeding any text to the model in production. If you don’t do that, you have no idea what the model will do. Precisely replicating the training steps when serving is a key challenge.
- Clients are interested in a capability that involves multiple steps, only one of which is the ML model. Suppose the client wants to find out what interest rate you can offer them. Your code invokes an ML model which provides back a credit score for the client. Based on the credit score, you choose a bank and obtain that bank’s current offer. The point is that the client capability requires three steps: (a) computing the credit score which is ML (b) choosing the bank which is based on rules and (c) getting the current offer which is an API call. Every time any of these systems change, you have to do an integration test.
- The incoming data changes characteristics. Suppose a new law is passed that raises the minimum wage. The distribution of income that your credit score model sees will be different in production than in training. How does this affect the ability of your model to assess applicants?
- The correct answer changes over time. Perhaps you have a model that recommends items in your product catalog. As new items get added to the catalog and some items are retired, the correct answer from the recommendation model will have to change.
- Model fairness issues arise. Suppose a new college is built in a town. The number of young people with no credit history from that area will increase. You might have, as one of your features, the ratio of applicant income to average income at the location. How does this affect the ability of your model to fairly assess applicants from that area?
- Parts of the system aren’t changed when new code is checked in. Suppose your credit score model uses a library to determine the lat-long location given an applicant’s street address. If that library is changed to be more coarse in order to protect applicants’ privacy, but the model isn’t retrained, you will have a problem.
- Need to be able to troubleshoot old results. A regulator or court may ask you to explain why you made a certain decision 2 years ago. Will you be able to resurrect the entire ML model and associated ecosystem to play back and troubleshoot the results?
The challenges can be solved through automation
These key challenges can be solved through automation. Specifically, one way to precisely align feature creation between training and serving is to use a feature store so that you can get the value of a feature as of a certain time and version. Similarly, a way to ensure consistency across all the steps in a capability is to code it up as a set of containerized steps, called an ML pipeline.
These options are summarized below:
If the incoming data can change characteristics, sample the requests and store them in a data warehouse. Once you know the correct answer (whether it is through manual labeling, or because the predicted outcome has come to pass), run an evaluation query that joins the request, predicted answer, and true answer and computes the evaluation metric.
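A minimal sketch of such an evaluation query, using Python’s built-in sqlite3 module as a stand-in for the data warehouse. The table names (requests, predictions, labels), the columns, and the sample rows are all hypothetical, and accuracy is used as the metric purely for illustration:

```python
import sqlite3


def evaluate_accuracy(conn):
    """Join sampled requests with predictions and true answers; return accuracy."""
    query = """
        SELECT AVG(CASE WHEN p.predicted = l.actual THEN 1.0 ELSE 0.0 END)
        FROM requests r
        JOIN predictions p ON p.request_id = r.request_id
        JOIN labels l ON l.request_id = r.request_id
    """
    (accuracy,) = conn.execute(query).fetchone()
    return accuracy


# Hypothetical sample data. Request 4 has no true answer yet,
# so the inner join leaves it out of the metric.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE requests (request_id INTEGER PRIMARY KEY, income REAL);
    CREATE TABLE predictions (request_id INTEGER, predicted TEXT);
    CREATE TABLE labels (request_id INTEGER, actual TEXT);
    INSERT INTO requests VALUES (1, 42000), (2, 58000), (3, 31000), (4, 75000);
    INSERT INTO predictions VALUES
        (1, 'approve'), (2, 'approve'), (3, 'deny'), (4, 'approve');
    INSERT INTO labels VALUES (1, 'approve'), (2, 'deny'), (3, 'deny');
""")

accuracy = evaluate_accuracy(conn)
print(f"accuracy on labeled requests: {accuracy:.3f}")
```

Because the query only needs the three tables to exist, the same statement could in principle be run on a schedule rather than continuously.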
If drift is detected, or the correct answer has changed, launch a training job automatically. This is called continuous training.
If model fairness issues arise, automation is supposed to help. But I haven’t heard a good explanation of how.
In order to be able to troubleshoot old results, make sure to store all datasets and models in a registry so that you can retrieve them later. Of course, you are already storing your code in a version-control system.
But is automation the only way to solve these problems? Or have simpler methods come about?
Simpler solutions exist
Take the concept of training-serving skew. Do you need to go all the way to a Feature Store to solve this problem?
Nope. There are classes of ML problems (such as computer vision) for which feature engineering is now unnecessary. My radar quality control application would now be done with an off-the-shelf image segmentation model such as UNet and readily deployed. Advances in data engineering mean that a streaming pipeline capability such as Apache Beam can efficiently invoke an ML model even if it is deployed as a web service.
Even if you do need custom preprocessing, as the Machine Learning Design Patterns book suggests, a simpler solution to the problem of training-serving skew is to use the Transform pattern. In Keras, this involves using preprocessing layers (for built-in capability) and Lambda layers (for custom capability).
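The idea behind the Transform pattern can be sketched without any framework: the preprocessing travels with the model, so the serving side passes raw text and cannot skip or reimplement a step. The class below is a hypothetical, pure-Python stand-in (a toy keyword scorer, not a Keras model), and the stop-word list and training sentences are made up for illustration:

```python
STOP_WORDS = frozenset({"the", "and", "a", "of", "is"})


def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


class ModelWithTransform:
    """Toy model that owns its preprocessing, so training and serving share it."""

    def __init__(self, transform):
        self.transform = transform
        self.weights = {}

    def fit(self, texts, labels):
        # Each token's weight is the sum of the labels (+1 or -1) it appears with.
        for text, label in zip(texts, labels):
            for tok in self.transform(text):
                self.weights[tok] = self.weights.get(tok, 0) + label
        return self

    def predict(self, raw_text):
        # Callers pass raw text; the same transform used in fit() runs here.
        return sum(self.weights.get(tok, 0) for tok in self.transform(raw_text))


model = ModelWithTransform(remove_stop_words).fit(
    ["the service is great", "the delay is terrible"], [1, -1]
)
print(model.predict("The service and the food"))
```

Since the transform is an attribute of the model, whatever mechanism exports the model exports the preprocessing with it, which is the property the preprocessing and Lambda layers provide inside a Keras model.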
The set of simpler solutions is summarized below:
If clients are interested in a capability that involves multiple steps, export your ML model and then deploy it as a web service with no state (this is called a Stateless Serving Function, and is supported directly by Keras for the export and SageMaker/Vertex AI for the deploy). This way, the steps can all change independently because they are loosely coupled through REST APIs.
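As a rough sketch, the interest-rate capability described earlier can be written as three independent steps. Each step is a plain function here, standing in for what would be a separate REST call; the weights, thresholds, bank names, and rates are all hypothetical:

```python
# Hypothetical exported model parameters, loaded once and never mutated.
EXPORTED_WEIGHTS = {"income": 0.004, "years_employed": 12.0, "bias": 400.0}


def credit_score(applicant):
    """Stateless serving function: depends only on the request and fixed weights."""
    score = EXPORTED_WEIGHTS["bias"]
    score += EXPORTED_WEIGHTS["income"] * applicant["income"]
    score += EXPORTED_WEIGHTS["years_employed"] * applicant["years_employed"]
    return score


def choose_bank(score):
    """Rule-based step; it can change without touching the model."""
    return "bank_a" if score >= 650 else "bank_b"


def current_offer(bank):
    """Stand-in for an external API call that returns the bank's current rate."""
    return {"bank_a": 5.1, "bank_b": 7.4}[bank]


def interest_rate(applicant):
    """The client-facing capability composes the three steps."""
    return current_offer(choose_bank(credit_score(applicant)))


applicant = {"income": 50000, "years_employed": 5}
print(credit_score(applicant), interest_rate(applicant))
```

Because `credit_score` keeps no state between calls, it can be replaced by a retrained model behind the same interface without the other two steps noticing.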
While continuous evaluation is fancy, there are few systems where you need to detect changes in the incoming data distribution immediately. Besides, even if you detect changes in the incoming data distribution, you will often wait a while to make sure the change isn’t just a blip. So, you can often get away with doing the evaluation once a week or once a month. Scheduled evaluation is a lot simpler to orchestrate than continuous evaluation. Keep It Simple.
While continuous training sounds fancy, automated drift detection is extremely hard to get right. You are better off writing a bunch of rules that will trigger retraining. For example: every Sunday at 2am, whenever a new product catalog is released, etc.
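A minimal sketch of such rules, assuming the check runs at least once an hour from an ordinary scheduler. The function name, the catalog version strings, and the exact rule set are hypothetical:

```python
from datetime import datetime


def should_retrain(now, last_trained_catalog, current_catalog):
    """Return the list of rules that fire; an empty list means no retraining."""
    reasons = []
    # Rule 1: scheduled retraining every Sunday during the 2am hour.
    if now.weekday() == 6 and now.hour == 2:
        reasons.append("weekly schedule")
    # Rule 2: a new product catalog was released since the last training run.
    if current_catalog != last_trained_catalog:
        reasons.append("new catalog")
    return reasons


# 2024-01-07 was a Sunday; 2024-01-08 was a Monday.
print(should_retrain(datetime(2024, 1, 7, 2, 0), "v12", "v12"))
print(should_retrain(datetime(2024, 1, 8, 9, 30), "v12", "v13"))
```

Returning the reasons rather than a bare boolean makes it easy for an ops person to see why a retraining job was launched.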
To check for model fairness, you will need to do sliced evaluations and interpret the results. This is necessarily a manual process.
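The computation behind a sliced evaluation is small; a sketch is below, with accuracy as the metric and a hypothetical `age_band` field and made-up records as the slicing attribute and data:

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Compute accuracy separately for each value of slice_key."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[slice_key]
        total[group] += 1
        correct[group] += int(rec["predicted"] == rec["actual"])
    return {group: correct[group] / total[group] for group in total}


records = [
    {"age_band": "18-25", "predicted": "deny", "actual": "approve"},
    {"age_band": "18-25", "predicted": "approve", "actual": "approve"},
    {"age_band": "26-40", "predicted": "approve", "actual": "approve"},
    {"age_band": "26-40", "predicted": "deny", "actual": "deny"},
]
print(accuracy_by_slice(records, "age_band"))
```

The code only produces the per-slice numbers. Deciding whether a gap between slices is acceptable, and what to do about it, is the part that stays with a person.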
Instead of launching a complete rebuild, retrain, and redeploy whenever any small bit of code changes, you are often better off doing a from-scratch build periodically. Perhaps make code releases one of the systematic triggers.
To be able to troubleshoot old results, you do need to store versions of everything. However, you could do this with standard technologies such as git, data warehouses, and API versioning. You don’t need a special purpose tool for this. Keep It Simple.
So, is MLOps just all hype then?
It’s not all hype
There are situations where you need the complexity. A few months ago, I wrote an article titled “Do you really need a feature store” where I summarized:
In most cases, feature stores add unnecessary complexity. There are, however, a few instances in which a feature store will be invaluable. Use a feature store if you need to inject features server-side, especially if the method of computing these features will keep improving.
That same sort of advice applies for all the technical debt challenges.
Understand the situations where complexity is warranted:
- Use a feature store if you need to inject features server-side. Otherwise, the Transform pattern is simpler and requires no extra tooling.
- Use ML pipelines if two or more of your steps are ML models. In that case, you will need to retrain the downstream model based on the output of the upstream model. ML pipelines make tracking such experiments easier.
- Do continuous evaluation if you have a robust mitigation plan available, for example, if you have a way to quickly change over to a shadow model that doesn’t use this piece of data.
- Do continuous training if you are building a truly adaptive system.
- Do automated sliced evaluations and other checks if your use case is such that you will have to completely turn off the ML service if there is any doubt about its fairness.
- Implement CI/CD for ML models if you have a large monorepo (that will take too long to build from scratch) and you have frequent model changes.
- Use special-purpose registries for models and data if you have hundreds of models in production. Until then, go with simpler, general-purpose solutions.
Realize, however, that these situations that require complexity are getting rarer because the simple solutions have been becoming more powerful.
Is this all just my opinion?
Not just me
Fortunately, real-world systems do tend to keep it simple. This, for example, is the operational machine learning system used by Pandera:
Notice the use of general-purpose tools (SQL for data preparation, run through Dataflow during serving). Even the “feature store” is just BigQuery. I love that they are simply paying lip service to the hype while doing the practical thing.
Todd Underwood also seems to have reached this conclusion in 2021. In the preamble to a talk on making sure data semantics is consistent between training and serving, he notes:
Paraphrasing Todd a bit:
The majority of problems “solved” by MLOps solutions are already well addressed by standard data processing approaches.
You don’t need special ML processing architecture.
What modern data processing approaches? Data Warehouses (BigQuery, Snowflake), Data Lakes (GCS/S3, Parquet/Iceberg/Delta Lake), and stream processing (Apache Beam).
MLOps, these days, is mostly just a sales pitch by tools vendors. While the principles of MLOps remain valid, you don’t need what’s being sold as MLOps.
- It is not 2015 anymore.
- Modern data processing approaches (BigQuery, GCS, Apache Beam) and ML frameworks (Keras, PyTorch) provide simple solutions to the technical debt challenges raised in D. Sculley et al.
- Avoid unnecessary complexity
- Keep It Simple
- You don’t need what’s being sold as MLOps