As I write this on June 29, 2022, the “Data + AI Summit” is on its last day in San Francisco. I’d been thinking about writing on this topic for nearly a year now, but the announcements coming out of this summit finally pushed me to do it.
In the long long ago, in the before time, we had databases, where compute, storage, security, indexing, and all that good stuff lived in one place. As computing needs advanced at a crazy rate, software was developed to address those needs. Now we were dumping structured, semi-structured, and unstructured data (audio, video, images) into storage like AWS S3, and we needed a way to understand the structure of what was there. This gave rise to catalogs such as the Hive Metastore or AWS Glue that would describe the data. That was a start, but we also needed to be able to treat the data in those files like a database and insert/delete/update records, and that gave rise to the Hive table format. It was quite primitive compared to the options available today, and if I recall correctly, it didn’t work with the metastore yet, so users had to know the layout of the files.
Databricks built and released Delta Lake as a table format in April 2019, but the open-source version was pretty limited and Databricks was really the only company doing any serious work on it. That’s what made their announcement about open-sourcing a bunch of stuff so interesting. I read it as a bit of a panic move, as Iceberg is getting so much vendor adoption and so many vendor contributors. Iceberg is also the newer kid on the block, graduating as a top-level Apache project in May 2020. Hudi started life at Uber in 2016, was open-sourced in 2017, and is used at some dang big companies other than Uber, like the Robinhood trading platform, Amazon, ByteDance (creators of TikTok), and many others.
The way these table formats handle upserts, deletes, etc., generally comes down to one of two methods: Copy-on-Write (CoW), where the affected data files are rewritten with the changes applied, and Merge-on-Read (MoR), where changes are appended to a log and merged with the base files at query time.
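These two approaches can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not any real table format’s implementation: a dict stands in for a Parquet data file and a list stands in for the change log.

```python
class CopyOnWriteTable:
    """Toy CoW table: every upsert rewrites the (pretend) data file."""

    def __init__(self, rows):
        self.file = dict(rows)          # stand-in for an immutable Parquet file

    def upsert(self, key, value):
        new_file = dict(self.file)      # rewrite the whole file with the change
        new_file[key] = value
        self.file = new_file            # the old version could be kept for time travel

    def read(self):
        return dict(self.file)          # reads are cheap: just the base file


class MergeOnReadTable:
    """Toy MoR table: upserts append to a log that readers must merge."""

    def __init__(self, rows):
        self.base = dict(rows)          # stand-in for the base file
        self.log = []                   # change log of pending upserts

    def upsert(self, key, value):
        self.log.append((key, value))   # writes are cheap: append only

    def read(self):
        merged = dict(self.base)        # reads pay the merge cost
        for key, value in self.log:
            merged[key] = value
        return merged
```

The trade-off shows up directly: CoW pays the rewrite cost at write time so reads stay cheap, while MoR appends cheaply and pays a merge cost on every read until a compaction folds the log back into the base files.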
Keep in mind that these changes to the files are coming in as change logs, which means the latest versions need to be resolved for queries, or for time travel. Time travel is a cool feature of this configuration, but I’m not going to address it in this article. MoR tends to be faster for writes while CoW favors reads, but this is a pretty detailed topic that you should research in depth. Let’s do a high-level comparison of the three table formats to get you started. For the below grid, I’m borrowing some research from a published comparison.
Feature Overview | Delta Lake | Hudi | Iceberg
---|---|---|---
ACID Transactions | Yes | Yes | Yes
Partition Evolution | No | No | Yes
Schema Evolution | Partial | Partial | Yes
Time Travel | Yes | Yes | Yes
File Formats Supported | Parquet | ORC, Parquet | Avro, ORC, Parquet

Schema Evolution | Delta Lake | Hudi | Iceberg
---|---|---|---
Add Column | Yes | Yes | Yes
Drop Column | No | Yes w/Spark | Yes
Rename Column | Yes | Yes w/Spark | Yes
Update Column | Yes | Yes w/Spark | Yes
Reorder Column | Yes | Yes w/Spark | Yes
Change partitioning w/out rewriting table | No | No | Yes
Use transforms of columns to specify partitions | Partial | No | Yes
Require understanding of table partitioning | Yes | No | Yes
File Pruning | Yes | Yes | Yes

The original grid also compared per-engine Read Support and Write Support, but the engine names for those rows were lost in formatting, so they are omitted here.
Part of the reason this ecosystem evolved was performance, but part of it had to do with how cloud providers charge for their platforms. Storage is much cheaper than compute, so if you can query your raw storage without loading it into a conventional database, you’ll reduce your costs. By NOT putting it in a database of some sort, we’ve had to develop a variety of file formats like Parquet, catalog systems to describe those files so they present as a schema, query engines, security plug-ins, table formats to deal with transactions, indexing, and more. The appeal of systems like Redshift, Snowflake, Yugabyte, CockroachDB, and others is that all those things tend to be built in, just like the databases of old. Yes, there is a lot of flexibility in this scenario, as you can use the bits and pieces that best suit your situation. But imagine if Amazon were to suddenly change their pricing with, say, Graviton 5 (I’m just throwing out an idea here) because they got compute so cheap that they made it cheaper than storage and really dropped egress fees. You could see a massive collapse of a segment of the tech sector. Kind of a scary thought.
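To make that storage-versus-compute trade-off concrete, here is a back-of-the-envelope sketch. All of the prices here are round numbers I’m assuming purely for illustration, not actual AWS or data-warehouse list prices.

```python
# Illustrative monthly prices (assumed, not real list prices).
S3_PER_GB_MONTH = 0.023        # object-storage price, $/GB-month
WAREHOUSE_PER_GB_MONTH = 0.25  # managed-warehouse storage, $/GB-month
COMPUTE_PER_QUERY_HOUR = 2.00  # on-demand query-engine cost, $/hour

def monthly_cost(data_gb, query_hours, storage_rate):
    """Storage cost plus compute cost for one month."""
    return data_gb * storage_rate + query_hours * COMPUTE_PER_QUERY_HOUR

data_gb = 10_000  # a 10 TB data lake

# Querying raw files sitting in cheap object storage:
lake = monthly_cost(data_gb, query_hours=100, storage_rate=S3_PER_GB_MONTH)

# Loading the same data into a warehouse with pricier storage:
warehouse = monthly_cost(data_gb, query_hours=100, storage_rate=WAREHOUSE_PER_GB_MONTH)

print(f"lake: ${lake:,.0f}/mo, warehouse: ${warehouse:,.0f}/mo")
```

With these assumed numbers the storage line dominates, which is the whole argument for querying object storage directly; flip the relative prices and the argument flips with them.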
DataBeans published a benchmark comparing the formats.
The tests performed were ones that would show Delta in a better light with default configurations. It all seems a little too convenient, as if it were more a paid placement than an organic comparison, but I could be wrong (this article is not paid for; I wrote it on my own time and dime). While Delta and Hudi have broad consumer adoption, the former because it is part of the Databricks product and the latter because it was first to market, I think Iceberg is likely to be the eventual winner in this space based on the commercial support I’m seeing. But you never know how the market can change from some unexpected innovation.
As to what is best for your environment, that’s going to be up to your specific needs; I’m simply talking about tech adoption. The nice thing about open source is that it isn’t company-dependent, so you can keep using the tech regardless of whether the commercial company that sold you support continues to exist.