
What The Heck is DeltaStream?

by Shawn GordonDecember 12th, 2024

Too Long; Didn't Read

DeltaStream is a managed service that lets you use Flink without having to deal with Flink.


When we last asked, “What the heck is…?”, we were looking at WarpStream, the Apache Kafka®-protocol-compatible data streaming platform. With streaming data, you need to do something with it: land it somewhere like Apache Iceberg, Snowflake, or Databricks, or query the in-flight data to enrich and/or filter it before landing it. There are some options out there for the latter, but the biggest one is undoubtedly the open-source project Apache Flink. Looking at Flink led me to DeltaStream, which is our subject today. What the heck is DeltaStream? How does it work with Flink, and how is it a companion to Kafka?

Flink was initially accepted as an Apache project in December 2014, so it has been around for a while. The growth of stream processing has accelerated interest and use in the last few years. Flink is a somewhat challenging system to stand up and run internally, requiring dedicated engineering talent. Even Amazon's Managed Service for Apache Flink (MSF), while easier, is still fairly complex. Using Java, for example, requires you to write your query, build a JAR, zip it, upload it to S3, set your permissions, and then execute it.


I’m going to borrow from the Apache Flink web page here: “Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments and perform computations at in-memory speed and at any scale.” I don’t want to replicate more of what is on their website, so if you aren’t sure about Flink, give it a look.

Using DeltaStream

At its most basic, DeltaStream gives you the power of Flink without having to deal with Flink. At all. For my test, I used their Quick Start Guide for the Web UI; note that they also have a CLI, which I did not test.


When you sign up, you get a free 14-day trial. A sample Kafka cluster called “trial_store” is spun up that you can experiment with instead of connecting your own data. The demo contains a number of topics to play with. The icons under ACTIONS let you delete a topic or view its details and contents.


DeltaStream Topic Browser


Here are the details of the pageviews topic for reference, which we’ll use later.

DeltaStream Topic inspector


Okay, we’ve got a Kafka cluster going and topics in it. Now, let’s do something interesting. As mentioned in the intro, the most interesting thing I can do is enrich and/or filter data while it is in flight before it lands at its ultimate destination, like a database/warehouse/lake. For those purposes, we go to the Workspace.

DeltaStream Workspace description


This part took a little getting used to. A database and a schema in DeltaStream are just organizational folders: you can create any number of databases, and within a database, any number of schemas. The schemas hold the definitions of your DeltaStream objects, known as STREAM, CHANGELOG, MATERIALIZED VIEW, and TABLE. A Table corresponds to a database table in something like PostgreSQL, and a Materialized View is a way to persist a query's result set without landing it anywhere specific. I’m not going to do anything with either of those in this blog; I’m going to focus on Stream and Changelog.
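To give a flavor of that organizational layer, the DDL looks roughly like this. This is a sketch using names from my walkthrough; check DeltaStream's documentation for the exact schema-creation syntax:

```sql
-- Create an organizational folder for objects (name from this walkthrough);
-- schemas are created similarly within a database.
CREATE DATABASE TestDB;
```

Objects you define afterward then land under a path like TestDB.public.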


In the following screenshot, I’m creating a Stream over the pageviews topic in the Kafka cluster. I think of it as writing a table definition for the topic: we assign names to the fields and tell DeltaStream which topic to use and what the data format is. We don’t have to fully qualify the trial_store cluster, as it is set as the default in the combo box at the top. Once that command executes, the stream shows up under TestDB.public. I can then query it with something like SELECT * FROM PAGEVIEWS, and I’ll start seeing data in the result pane at the bottom.


DeltaStream Stream creation example
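For reference, the statement in that screenshot is a CREATE STREAM. Here's a sketch, with field names taken from the pageviews topic shown earlier; the exact WITH properties may differ in your environment:

```sql
-- Declare a stream over the existing pageviews topic, assigning
-- names and types to the fields and naming the value format.
CREATE STREAM pageviews (
  viewtime BIGINT,
  userid   VARCHAR,
  pageid   VARCHAR
) WITH (
  'topic' = 'pageviews',
  'value.format' = 'JSON'
);
```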


Next, I declare a Changelog backed by the users topic and keyed by userid. A changelog is similar to a stream but lets you interpret the events in a topic as UPSERTs. Events require a primary key; DeltaStream interprets each event as an insert or an update for the given primary key. In this case, the changelog reflects per-user details, such as gender and interests.

DeltaStream Changelog example
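The changelog declaration looks much like the stream's, plus a primary key. A sketch, with illustrative field names (the actual columns depend on the demo topic's payload):

```sql
-- Declare a changelog over the users topic; the PRIMARY KEY tells
-- DeltaStream to treat each event as an upsert for that userid.
CREATE CHANGELOG users_log (
  registertime BIGINT,
  userid       VARCHAR,
  regionid     VARCHAR,
  gender       VARCHAR,
  interests    ARRAY<VARCHAR>,
  PRIMARY KEY (userid)
) WITH (
  'topic' = 'users',
  'value.format' = 'JSON'
);
```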


Here is where we start to have some fun. I will create a new stream that enriches the pageviews stream with data from the users_log changelog, using userid as the join key. This gives me a new topic in the cluster containing data from two different topics. From here, I can filter it on something like regionid and write the results of that query to a final destination, such as a database, warehouse, or lake. This lets me enrich and filter data in flight before landing it, improving latency and reducing compute and storage costs.

DeltaStream Join example
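Putting the pieces together, the join from that screenshot can be sketched as a CREATE STREAM ... AS SELECT. The filter value 'Region_1' is a hypothetical example; the stream and column names come from the earlier definitions:

```sql
-- Join the pageviews stream to the users_log changelog on userid,
-- filter in flight on regionid, and emit the enriched rows as a
-- new stream (backed by a new topic in the cluster).
CREATE STREAM pageviews_enriched AS
SELECT p.userid, p.pageid, u.regionid, u.gender
FROM pageviews p
JOIN users_log u ON u.userid = p.userid
WHERE u.regionid = 'Region_1';
```

From there, the enriched, filtered stream can be written to a sink connection such as a database, warehouse, or lake.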

Summary

So, what the heck is DeltaStream? It’s a really simple way to use Apache Flink without knowing anything about it or directly doing anything with it. You saw from my example how straightforward it was to connect to Apache Kafka and then read, join, and filter the data. Other currently supported connections are Kinesis, PostgreSQL, Snowflake, and Databricks, and I’m told that ClickHouse and Iceberg will soon be available.


Ultimately, DeltaStream gives you the power of Apache Flink without having to deal with Apache Flink, and you can do it using SQL instead of Java. If you are dealing with streaming data or looking to implement it, this is definitely a very clever and convenient solution.


Check out my other What the Heck is… articles at the links below:



