Snowflake or Databricks? BigQuery or Dataproc? Redshift or EMR?
How to choose the product to build your data platform upon
An organization that seeks to get better at how it uses data and AI will need to get its data architecture into shape. Assuming that you are going to build your data platform on the cloud, you will find yourself trying to choose between Snowflake, Databricks, BigQuery, Dataproc, Redshift, EMR, Synapse, etc. All of these are reasonable choices. Each vendor claims to be the best, and POCs and benchmarks are fraught with danger. In such situations, my suggestion is to go with a few guiding principles to choose the platform that is right for you.
My biases
Warning: This article is highly opinionated. So, let me get my biases out of the way:
- I used to work at Google Cloud and continue to have a soft spot for it — I believe that, in the data and AI space, it offers the most intuitive, innovative, and well-integrated ecosystem. I also wrote the book on BigQuery and Data Science on GCP.
- I have lots of friends who now work at Snowflake, and I believe that it is the data platform that is easiest to operate for business users.
- Many friends and a couple of mentors are now at Databricks, and I believe that it is the data platform that hits the best balance between flexibility and ease-of-use for programmers.
- I’m biased against Redshift (too much tuning), EMR (plain vanilla), and Synapse (lots of missing capabilities).
- DuckDB is amazing, and MotherDuck (which provides a high-performance, cost-effective, converged data warehouse/data lake in the cloud) is the future. The founders of MotherDuck are friends of mine, but it’s not yet ready for primetime.
Although these biases come from experience, things like “innovative”, “easiest”, “balance”, “not ready”, etc. are necessarily subjective. Keep my biases in mind as you read the advice below.
Choosing the architecture
Instead of listening to vendor pitches and trying to decide between Snowflake and Databricks based on partial truths and flawed POCs, I suggest starting from a few guiding principles. There are four potential architectures for your data platform, and you should first decide which one is right for you. Once you choose the architecture that fits your business best, choosing the product to build your data platform around becomes much easier.
Choose the architecture based on the skillset of your users and the shape of your workloads:
- A data warehouse is a SQL-first data architecture that works off data stored in a warehouse-specific format. Choose this if the majority of your users are analysts rather than programmers, by which I mean that they can either write SQL or use a dashboard tool (like Tableau, PowerBI, or Looker) that will generate SQL. It can be used for structured and semi-structured data (i.e., tabular and JSON data), but cannot be used for images/video/documents other than to store metadata about them. The main advantage of data warehouses is that they permit self-service, ad-hoc querying by business users.
- A data lake is a code-first data architecture. The data is stored on cloud storage (S3, GCS, ABS) and then read by job-specific or persistent compute clusters. The most common programming language/framework is Python/Spark, although Java and Scala are also often used. Choose this if the majority of your users are programmers who do a lot of data wrangling in code, such as writing ETL pipelines and training ML models. It supports unstructured data such as images and video. For structured and semi-structured data, it tends to be less efficient than a data warehouse. The main advantage of data lakes is that they permit flexible data processing by programmers.
- A lakehouse is a hybrid architecture that allows you to serve both business users and programmers. It supports both interactive queries and flexible data processing. It can be built in one of two ways. You can have a data warehouse run SQL on data held in cloud storage in formats such as Parquet. Alternatively, you can use a framework such as SparkSQL to run SQL on data lake storage (a minimal sketch of this second approach appears after this list). Both of these are compromises. A data warehouse operating on cloud storage loses many of the optimizations that make data warehouses interactive and suitable for ad-hoc querying. A data lake running SQL loses the schema-free data processing capability that makes data lakes so flexible. So, choose the form of hybrid architecture based on which type of user and workload you want to support very well, and which one you are willing to support in a compromised way. (Yes, I’m aware of the benchmark that supposedly shows data lake SQL outperforming data warehouses, and of case studies that supposedly prove data warehouse Spark implementations are completely flexible. Suffice it to say that they are flawed.)
- A data mesh is a decentralized data platform that allows each division in your company to manage its own data while still permitting data sharing without data movement. If all the divisions can agree on the same data warehouse, it is possible to build a data mesh on that data warehouse. Otherwise, you will have to build it around a data lake. In the latter case, because each business unit is likely to have a different mix of user skills (SQL and Python) and workload needs (interactive and flexible data processing), each business unit will choose a different lakehouse implementation.
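To make the second lakehouse pattern concrete, here is a minimal sketch of running SQL over Parquet files in cloud storage with SparkSQL. The bucket path, table, and column names are made up for illustration; it assumes a Spark environment with access to that bucket.

```python
# Minimal lakehouse sketch: SQL over Parquet files sitting in cloud storage.
# Bucket path, table, and column names are illustrative, not real.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# Register files in the data lake bucket as a SQL-queryable view.
orders = spark.read.parquet("s3://example-bucket/lake/orders/")
orders.createOrReplaceTempView("orders")

# Business users (or a BI tool) can now run SQL against the lake ...
daily_revenue = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
""")

# ... while programmers keep the full DataFrame API for flexible processing.
daily_revenue.write.mode("overwrite").parquet("s3://example-bucket/marts/daily_revenue/")
```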
At this point, you have decided on the data architecture. And in the process of selecting it (see above), you have also decided whether you need a SQL-first data warehouse technology (Snowflake, BigQuery, Redshift, Synapse) or a Python-first data lake one (Databricks, Dataproc, EMR). The next decision is to choose the specific SQL-first or Python-first product.
Choosing the product: small orgs
The smaller your organization, the more sense it makes for you to favor simplicity. How small is small? If your cloud data platform will cost less than $300k/year, you are small. If your data platform will cost more than $1m/year, you are large. In between, it depends — you should decide whether you are small or large depending on your engineering sophistication.
If you are a small org, I strongly suggest that you keep all your assets in a single cloud provider. Operating two clouds means you will need to support cross-cloud data governance, negotiate multiple cloud contracts, etc. The cost of this will greatly outweigh whatever extra efficiency the second cloud can bring you. So, your choice of product depends on which cloud you are on:
- On AWS, if cost matters, stick to the native cloud product — Redshift or EMR. Using the native cloud product is cheaper than using Snowflake or Databricks. These vendors have extremely insistent salespeople who will hide the ball a lot, so first-principles thinking can help here. When you use Snowflake on AWS, you are paying the profit margin of both AWS (~60%) and Snowflake (~80%); see the back-of-the-envelope sketch after this list. As long as the native cloud product is not actively bad, you will save a lot of money using it instead of an add-on technology. Snowflake queries would have to use roughly half the resources that Redshift does to counteract this. Snowflake is better than Redshift, but not that much better.
- On AWS, if ease-of-use matters, use Snowflake/Databricks. Quite often, the ease of use can outweigh the increased operating cost. Operating Redshift requires one full-time engineer to get the IAM and tuning right. Given this, if your query volumes are low enough, Snowflake might be a bargain. Price it out with Snowflake’s Enterprise-tier pricing. If you will need more than one engineer to tune Redshift, your query volumes are high enough that you should consider yourself a large organization (see below).
- On Google Cloud, there is no reason to choose anything other than the native cloud product. Stick to BigQuery or Dataproc.
- On Azure, use Snowflake or Databricks. Synapse or HDInsight will run into cost/reliability issues. There is no free lunch — factor the increased data platform cost as the price you pay for taking advantage of Azure credits.
- Try not to be path dependent. If you see that GCP or Snowflake or Databricks is a better fit for you, move to it. Yes, it would have been better to start out on the best product, but that’s not a reason to keep sacrificing money and productivity. The vendor of the destination product will probably even pay for your move. So, move already!
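Here is the back-of-the-envelope arithmetic behind the “two profit margins” point above. It is a rough sketch: the ~60% and ~80% figures are treated as markups on the underlying compute cost, which is one simple way to read them, not an exact pricing model.

```python
# Rough sketch of the "paying two profit margins" argument for Snowflake on AWS.
# The markups below are assumptions taken from the text, not published pricing.
raw_compute = 1.00                  # normalized cost of the underlying hardware
aws_markup = 0.60                   # assumed AWS markup
snowflake_markup = 0.80             # assumed Snowflake markup on top of AWS

redshift_price = raw_compute * (1 + aws_markup)              # 1.60
snowflake_price = redshift_price * (1 + snowflake_markup)    # 2.88

# Snowflake would need to use roughly half the resources per query
# to deliver the same workload at the same total cost.
print(round(snowflake_price / redshift_price, 2))            # ~1.8
```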
Choosing the product: large orgs
If your organization has a large analytics/ML footprint, you have to decide whether to decentralize or standardize on a data platform. The more agile your organization, and the more likely M&A activity is in your business, the more sense it makes to decentralize the platform decision. Also, check whether different business units in your company will have a different mix of user skills (SQL vs. Python) and workload needs (interactive vs. flexible data processing). In such cases, let each business unit choose the data platform implementation that makes sense for it. On the other hand, if you are more of a top-down organization, choosing a single platform gives you higher cloud discounts and greater consistency, but the price you pay is less agility in the business units.
- If you are going to standardize, standardize on native cloud products. Using native cloud products is going to be less expensive than Snowflake or Databricks (one profit margin, not two). Once your query volumes become high enough, you can even justify the cost of an optimization team and come out ahead. This is only true for AWS and GCP.
- If you are going to decentralize, some business units and functions might prioritize ease of use, flexibility, etc. over cost. Specifically, if you are on AWS or Azure, business analyst teams might choose Snowflake, and traditional data science (not machine learning) teams might choose Databricks. On GCP, again, there is not much reason to look beyond the native products.
- It’s perfectly reasonable to choose a different cloud for your analytics and data science work than for your applications. The egress cost friction that public cloud providers impose and the cost of operating a second cloud can be more than made up by the increased ROI of a better data platform. If you decide to standardize on a single cloud data platform, the tight integration between Google Cloud data products makes it a strong contender to be the one you pick. Governance and fine-grained security are unified across all your data, whether it lives in a data lake or in a data warehouse, and the connections between Spark, SQL, real-time, and transactional data are seamless. This sort of tight integration will greatly speed up your development and keep costs under control.
- Your board may have directed you to use multiple clouds to avoid overdependence on a single vendor. Splitting clouds by function (rather than spreading applications across clouds) might be a way to meet that goal and get a better data platform in the bargain.
Notice that lock-in is not a consideration. You are going to get locked in by your data platform — whether it is to a cloud platform (AWS, Azure, GCP) or to a vendor (Snowflake, Databricks). Fortunately, Spark is open source, and the SQL dialects of the various data warehouses can be converted from one product to another using tools like SQLGlot. Therefore, if you do need to migrate from one to the other, it is not that hard. Several consulting companies will do it for you in a matter of weeks.
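For instance, here is a minimal sketch of a dialect conversion with the open-source SQLGlot library (the query and dialect names are illustrative):

```python
# pip install sqlglot
import sqlglot

# A Snowflake-flavored query, to be translated into BigQuery's SQL dialect.
snowflake_sql = "SELECT DATEADD(day, 7, order_date) AS due_date FROM orders"

# transpile() returns a list of translated statements; print the first one.
print(sqlglot.transpile(snowflake_sql, read="snowflake", write="bigquery")[0])
```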
Summary
- Choose the data architecture based primarily on the skillset of your users and secondarily on whether you need interactivity or flexibility more. The choice of data architecture also narrows down your choice to a SQL-first product or a Python-first product.
- If you are a small organization, use the native cloud product on GCP and Snowflake/Databricks on Azure. On AWS, choose between Redshift/EMR and Snowflake/Databricks depending on whether cost or ease-of-use is more important.
- If you are a large organization, decide whether to centralize or decentralize. If centralizing, consider using GCP as your native cloud data & ML platform. Else, go with the native cloud products on AWS. There is no good answer for large organizations on Azure. If decentralizing, each business unit makes its decision based on cost or ease-of-use, similar to a small organization.
This article is primarily about choosing a data platform. The answer for the ML platform depends on who in your company does ML — you’d choose between no-code, low-code, and code-first platforms depending on the answer to that question. You also want to ensure that your data and ML capabilities are colocated.