Struggling to harness data sprawl, CIOs across industries face a tough question: where to store all of their enterprise’s data to deliver robust data analytics.
There have traditionally been two storage solutions for data: data warehouses and data lakes.
Data warehouses mainly store transformed, structured data from operational and transactional systems, and are used for fast complex queries across this historical data.
Data lakes act as a catch-all repository, storing all kinds of data, including semi-structured and unstructured data. They power advanced analytics such as streaming analytics for live data processing and machine learning.
Historically, data warehouses were expensive to roll out because you needed to pay for both storage space and computing resources, not to mention the skills to maintain them. As the cost of storage has declined, data warehouses have become cheaper. Some believe data lakes (traditionally the more cost-efficient alternative) are now dead. Others argue data lakes are still trendy. Meanwhile, still others are talking about a new, hybrid data storage solution - data lakehouses.
What’s the deal with each of them? Let’s take a close look.
This blog explores the key differences between data warehouses, data lakes, and data lakehouses, along with popular tech stacks and use cases. It also provides tips for choosing the right solution for your company, though the choice is rarely straightforward.
Data warehouses are designed to store structured, curated data, organizing datasets in tables and columns. This data is easily available to users for traditional business intelligence, dashboards, and reporting.
A three-tier architecture is the most commonly used approach to designing data warehouses. It comprises:

- the bottom tier: the database server where data is loaded and stored
- the middle tier: an OLAP server that processes and analyzes the data
- the top tier: the front-end client layer used for querying, reporting, and BI
There are three other vital components of a data warehouse worth mentioning: data marts, operational data stores, and metadata.
Data marts belong to the bottom tier. They store subsets of the data warehouse data, serving individual business lines.
Operational data stores act as a repository providing a snapshot of the organization’s most current data for operational reporting based on simple queries. They may be used as an interim layer between the data sources and the data warehouse.
There is also metadata — data describing the data warehouse data — which is stored in special-purpose repositories, also at the bottom layer.
Data warehouses have been around for a few decades.
Traditionally, data warehouses were hosted on premises, meaning companies had to purchase all the hardware and deploy software locally, whether paid or open-source systems. They also needed a whole IT team to maintain the data warehouse. On the bright side, traditional data warehouses delivered (and still deliver) fast time-to-insight with no latency issues, total control of data with complete privacy, and minimized security risk.
With cloud ubiquity, many organizations now choose to migrate to cloud data warehouse solutions, where all data is stored and analyzed in the cloud using an integrated query engine.
There is a variety of established cloud data warehouse solutions on the market. Each provider offers its own set of warehouse capabilities and pricing models. For example, Amazon Redshift is organized as a traditional data warehouse, and Snowflake is similar. Microsoft offers Azure Synapse Analytics (formerly Azure SQL Data Warehouse), while Google BigQuery is built on a serverless architecture, offering in essence software-as-a-service (SaaS) rather than infrastructure- or platform-as-a-service like, for instance, Amazon Redshift.
Among well-known on-premises data warehouse solutions are IBM Db2, Oracle Autonomous Database, IBM Netezza, Teradata Vantage, SAP HANA, and Exasol. They are also available in the cloud.
Cloud-based data warehouses are typically cheaper because there is no need to buy or roll out physical servers. Users pay only for the storage space and computing power they actually use. Cloud solutions are also much easier to scale and to integrate with other services.
Serving highly specific business needs with top data quality and fast insights, data warehouses are here to stay.
Data warehouses deliver high-speed, high-performance analytics on petabytes of historical data.
They are fundamentally designed for BI-type queries. A data warehouse might give an answer about, for instance, sales in a particular time period, grouped by region or division, and year-on-year movements in sales.
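To make the BI-type query above concrete, here is a minimal sketch of such an aggregate: sales summed by region and year, with year-on-year movement computed afterwards. It runs against an in-memory SQLite table; the table and column names (`sales`, `region`, `year`, `amount`) and the figures are invented for illustration, not taken from any real warehouse.

```python
# Illustrative BI-style warehouse query against an in-memory SQLite table.
# Table name, columns, and numbers are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", 2022, 120.0), ("EMEA", 2023, 150.0),
     ("APAC", 2022, 80.0), ("APAC", 2023, 95.0)],
)

# Sales grouped by region and year - the classic data warehouse question.
rows = conn.execute(
    "SELECT region, year, SUM(amount) FROM sales "
    "GROUP BY region, year ORDER BY region, year"
).fetchall()

totals = {(region, year): total for region, year, total in rows}
# Year-on-year movement for one region, computed from the grouped totals.
yoy_emea = totals[("EMEA", 2023)] / totals[("EMEA", 2022)] - 1
```

In a real warehouse the same `GROUP BY` shape would run over billions of historical rows; the engine differs, but the query pattern is the same.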
Key use cases for data warehouses are:
Having structured business data in one easily accessible location outside operational databases is important to any data-mature company.
However, traditional data warehouses don’t support big data technologies. They are also updated in batches, with records from all sources processed periodically in one go, which means the data can become stale by the time it’s rolled up for analytics.
The data lake seems to resolve these constraints. With a tradeoff. Let’s explore.
Data lakes mostly collect unrefined, raw data in its original form. Another key difference between the data lake and the data warehouse is that data lakes store this data without arranging it into the logical relationships called schemas. Counterintuitively, this is part of what enables more sophisticated analytics.
Data lakes pull in (i) transactional data from business applications such as ERP, CRM, or SCM, (ii) documents in .csv and .txt formats, (iii) semi-structured data in formats such as XML, JSON, and Avro, (iv) logs from devices and IoT sensors, and (v) images, audio, binary, and PDF files.
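The schema-on-read idea above can be sketched in a few lines: raw JSON events are kept exactly as they arrived, and a schema is imposed only at analysis time. The field names (`user_id`, `temp_c`) and records are invented for the example.

```python
# Schema-on-read sketch: raw records are stored as-is; a schema is
# applied only when the data is read. Field names are hypothetical.
import json

# The "lake": raw events kept in their original form, no enforced schema.
raw_records = [
    '{"user_id": 1, "temp_c": 21.5, "device": "sensor-a"}',
    '{"user_id": 2, "temp_c": "n/a"}',  # messy record, still stored
    '{"user_id": 3, "temp_c": 19.0, "extra": true}',
]

def read_with_schema(lines):
    """Apply a schema at read time, skipping records that don't conform."""
    out = []
    for line in lines:
        rec = json.loads(line)
        try:
            out.append({"user_id": int(rec["user_id"]),
                        "temp_c": float(rec["temp_c"])})
        except (KeyError, TypeError, ValueError):
            # Schema-on-read: bad records surface only at analysis time.
            continue
    return out

clean = read_with_schema(raw_records)
```

Note the tradeoff this illustrates: nothing is rejected at ingest (the warehouse would have refused the `"n/a"` record at write time), so the cost of messy data is deferred to every reader.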
Data lakes use a flat architecture for data storage. Their key components are:
Data lakes don’t inherently contain analytics capabilities. Without them, they just store raw data that is of little use on its own. So organizations build data warehouses or leverage other tools on top of data lakes to put the data to use.
To make sure a data lake doesn’t turn into a data swamp, it is important to have an efficient data management strategy that builds data governance and metadata management into the data lake design. In an ideal world, data sitting in a data lake should be cataloged, indexed, validated, and easily available to data users. This is rarely the case, though, and many data lake projects fail. This can be avoided: regardless of a data team’s maturity, it is critical to install at least essential controls to enforce data validation and quality.
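As a minimal sketch of such essential controls, the snippet below applies a few simple validation rules as records land, quarantining anything that fails instead of letting it silently pollute the lake. The rule set and record shape (`id`, `timestamp`, `value`) are hypothetical, standing in for whatever your pipeline actually ingests.

```python
# Minimal ingest-time validation sketch; rules and fields are hypothetical.
def validate(record):
    """Return a list of rule violations for one record (empty = valid)."""
    errors = []
    if not record.get("id"):
        errors.append("missing id")
    if "timestamp" not in record:
        errors.append("missing timestamp")
    if not isinstance(record.get("value"), (int, float)):
        errors.append("value is not numeric")
    return errors

def ingest(records):
    """Route each record into the lake proper or a quarantine area."""
    accepted, quarantined = [], []
    for rec in records:
        (quarantined if validate(rec) else accepted).append(rec)
    return accepted, quarantined

good, bad = ingest([
    {"id": "a1", "timestamp": "2024-01-01T00:00:00Z", "value": 3.2},
    {"id": "", "timestamp": "2024-01-01T00:01:00Z", "value": "oops"},
])
```

In production, the same idea is usually expressed through a framework (data quality rules in a catalog or pipeline tool) rather than hand-rolled functions, but the principle of validate-then-quarantine is the same.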
The rise of big data in the early 2000s brought both grand opportunities and grand challenges for organizations. Businesses needed new technology to analyze these massive, messy, and ridiculously fast-growing datasets and capture business impact from big data.
In 2008, Apache Hadoop arrived as an innovative open-source technology for collecting and processing unstructured data at massive scale, paving the way for big data analytics and data lakes. Shortly after, Apache Spark emerged. It was easier to use, and it added capabilities for building and training ML models, querying structured data using SQL, and processing real-time data.
Today, data lakes are predominantly cloud-hosted repositories. All top cloud providers, such as AWS, Azure, and Google, offer cloud-based data lakes with cost-effective object storage services. Their platforms come with various data management services that automate deployment.
In one scenario, for instance, a data lake might consist of a data storage system like the Hadoop Distributed File System (HDFS) or Amazon S3 integrated with a cloud data warehouse solution like Amazon Redshift. These components would be decoupled from other services in the ecosystem, which might include Amazon EMR for data processing, AWS Glue for the data catalog and transformation functionality, the Amazon Athena query service, or Amazon Elasticsearch Service, used to build a metadata repository and index data.
Local data lakes are still common because of the usual cloud concerns, such as security, privacy, and latency. There are also on-premises storage vendors that offer products for data lakes, but their data lake offerings are not well-defined.
Unlike data warehouses, data lakes don’t have many years of real-world deployments behind them. There is still much criticism describing the data lake concept as blurry and ill-defined. Critics also argue that few people in any organization have the skills (or enthusiasm for that matter) to run exploratory workloads against raw data. The idea that data lakes should be used as a central repository for all enterprises’ data needs to be approached with caution, they say.
There has also been a provocative talk that data lake days are numbered. The following reasons are cited:
Such criticism is an inherent part of any younger technology. However, data lakes do have clear use cases, such as streaming analytics. And for now, they don’t threaten data warehouses.
For a while, data lakes seemed to triumph over data warehouses, offering wider analytics capabilities, cost-effectiveness, and flexibility in the types of data stored. However, as data warehouse technologies have matured, many agree there is no obvious winner now.
It is generally advisable to maintain them both or… go for a hybrid architecture. Read on.
The main idea behind data lakes is to give businesses access to all available data from all sources as quickly as possible.
Data lakes do not just give a picture of what happened yesterday. Storing massive amounts of data, data lakes are designed to enable organizations to learn more about both the present (using streaming analytics) and the future (using big data solutions, including predictive analytics and machine learning).
Key use cases for data lakes are:
With strong data engineering skills to move raw data into an analytics environment, data lakes can be extremely relevant. They allow teams to experiment with data to understand how it can be useful. This might involve building models to dig through data and trying out different schemas to view the data in new ways. Data lakes also allow wrangling streaming data that pours in from web logs and IoT sensors and is not suited to a traditional data warehouse approach.
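The "try out different schemas" point above can be sketched as follows: the same raw log lines are viewed through two different ad-hoc schemas without ever rewriting the stored data. The log format and paths are invented for illustration.

```python
# Lake-style experimentation sketch: two ad-hoc "views" over the same
# raw log lines. The log format here is hypothetical.
raw_logs = [
    "2024-05-01 GET /home 200",
    "2024-05-01 GET /pricing 404",
    "2024-05-02 POST /signup 201",
]

def view_by_status(lines):
    """One experimental schema: request counts per HTTP status code."""
    counts = {}
    for line in lines:
        status = line.rsplit(" ", 1)[1]
        counts[status] = counts.get(status, 0) + 1
    return counts

def view_by_day(lines):
    """Another schema over the same raw data: request counts per day."""
    counts = {}
    for line in lines:
        day = line.split(" ", 1)[0]
        counts[day] = counts.get(day, 0) + 1
    return counts
```

Because the raw lines stay untouched, a third or fourth view costs nothing but another read function - which is exactly the flexibility a warehouse's fixed schema trades away.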
In short, data lakes enable organizations to unearth patterns, anticipate changes, or find potential business opportunities around new products or current processes.
Used for different business needs, data lakes and data warehouses are often implemented in tandem. Before we move to the next data storage concept, let’s quickly recap the key differences between the data warehouse and the data lake.
| Characteristics | Data Warehouse | Data Lake |
|---|---|---|
| Types of data | Structured data only | All data, including structured, semi-structured, and unstructured |
| Schema | Pre-defined, fixed schema for data ingest (schema-on-write) | Schema written only at the time of analysis (schema-on-read) |
| Data quality | Highly curated, reliable data | Raw, low-quality data; data swamp risk |
| Query | Fast query results upon deployment | Poorly optimized query performance; a lot of data preparation is needed |
| Users | Business professionals | Business analysts, data scientists, data engineers, and data architects |
| Analytics | Reporting, BI, dashboards | Advanced analytics (exploratory analysis, data discovery, streaming analytics, operational analytics, ML, big data) |
| Ease of use | The fixed schema makes data easy to locate, access, and query | Time and effort are required to organize and prepare data for use; extensive coding is involved |
| Scalability | Scaling might be difficult because of tightly coupled storage and compute | Scaling is easy and cost-effective because storage and compute are separated |
Marketing aside, the key idea behind a data lakehouse is to bring computing power to the data lake.
Architecturally, the data lakehouse usually consists of:
Touted as a solution marrying the best of both worlds, the data lakehouse addresses both:
The data lakehouse is a new advancement in the data analytics scene.
The concept was first used in 2017 in relation to the Snowflake platform. In 2019, AWS used the data lakehouse term to describe its Amazon Redshift Spectrum service that allows users of its data warehouse service Amazon Redshift to search through data stored in Amazon S3. In 2020, the data lakehouse term came into widespread usage, with Databricks adopting it for its Delta Lake platform.
The data lakehouse might have a bright future ahead as companies across industries are adopting AI to improve service operations, offer innovative products and services, or drive marketing success. Structured data from operational systems delivered by data warehouses is ill-suited for smart analytics, while data lakes are just not designed for robust governance practices, security, or ACID compliance.
| Characteristic | Data Lake | Data Lakehouse |
|---|---|---|
| Types of data | All data, including structured, semi-structured, and unstructured | All data, including structured, semi-structured, and unstructured |
| Schema | No pre-defined schema | Schema enforcement |
| Data quality | Raw, low-quality data; data swamp risk | Raw and curated data; high quality enabled by built-in data governance |
| Query | Poorly optimized query performance | Highly optimized query performance |
| Users | Business analysts, data scientists, data engineers, and data architects | Business professionals and data teams |
| Ease of use | Difficult | Simple, with data lakehouses providing interfaces similar to traditional data warehouses along with built-in AI support |
| Analytics | Advanced analytics | Suitable for all types of analytics workloads, both advanced analytics and BI |
| Scaling | Easy | Easy |
Whether you want to build a data storage solution from scratch or modernize your legacy system to support ML or improve performance, the right answer won’t come easy. There’s still a lot of confusion around key differences, benefits, and costs, with vendors’ offerings and pricing models rapidly evolving. Besides, it’s always a difficult project, even with stakeholders’ buy-in.
However, there are some key considerations when choosing the data warehouse vs. data lake vs. data lakehouse.
The primary question you should answer is WHY. A good point to remember is that the key differences between data warehouses, lakes, and lakehouses do not lie in technology; they lie in serving different business needs.
So why do you need a data storage solution in the first place? Is it for regular reporting, business intelligence, real-time analytics, data science, or other sophisticated analysis? Is data consistency or timeliness more important for your business needs?
Spend some time developing use cases. Your analytics needs should be well defined. You should deeply understand your users and skillsets too.
A few rules of thumb are:
Consider also your budget and time constraints. Data lakes are surely faster to build than data warehouses, and probably cheaper. You might want to implement your initiative incrementally and add capabilities as you scale up.
If you want to modernize your legacy data storage system, then again, you should ask WHY you need this. Is it too slow? Does it prevent you from running queries on bigger datasets? Is some data missing? Do you want to run a different type of analytics? Your organization has spent a lot of money on the legacy system, so you definitely need a strong business case to ditch it. Tie it to ROI too.
Data storage architectures are still maturing. It is impossible to say for sure how they will evolve. However, no matter which path you’ll take, it is useful to recognize common pitfalls and make the most of the technology that is already here.
We hope this article has cleared up some of the confusion around data warehouses vs. data lakes vs. data lakehouses. If you still have questions, or need top tech skills or advice to build your data storage solution, drop ITRex a line. They will help you.