I started seeing chatter about Apache SeaTunnel in early 2023 and was low-key keeping an eye on it. The project started in 2017 as Waterdrop and grew out of contributions from the creator of Apache DolphinScheduler, which supports SeaTunnel as a task plugin.
I had some initial issues getting my head around what SeaTunnel is and why I should care about it. That means I will keep this relatively high level to at least answer those questions. With that, let’s jump in.
They describe it as “a high-performance, distributed, massive data integration tool that provides an all-in-one solution for heterogeneous data integration and data synchronization.” It comprises three main components:
Source: Many source connectors are available; as of the current version, 2.3.3, a list is available in the SeaTunnel docs.
Transform: A transform connector comes into play if the schemas of your source and sink differ, essentially mapping your data from one to the other.
Sink: The sink is the other side of the source, but now you are writing instead of reading. A complete list of sink connectors as of version 2.3.3 is available in the SeaTunnel docs.
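To make the three roles concrete, here is a toy sketch in plain Python. It is not SeaTunnel code and calls none of its APIs; the names `fake_source`, `keep_fields`, and `list_sink` are made up for illustration, and the in-memory list stands in for a real destination.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def fake_source(row_num: int) -> Iterator[Row]:
    """Source: produce rows (generated here; a real source reads from a system)."""
    for i in range(row_num):
        yield {"name": f"user{i}", "age": 20 + i, "card": 1000 + i}


def keep_fields(rows: Iterable[Row], fields: List[str]) -> Iterator[Row]:
    """Transform: reshape each row so it matches what the sink expects."""
    for row in rows:
        yield {field: row[field] for field in fields}


def list_sink(rows: Iterable[Row], target: List[Row]) -> int:
    """Sink: write rows to the destination (an in-memory list in this sketch)."""
    count = 0
    for row in rows:
        target.append(row)
        count += 1
    return count


# Wire the three stages together: read, map, write.
destination: List[Row] = []
written = list_sink(keep_fields(fake_source(3), ["name", "card"]), destination)
print(written, destination[0])
```

The point is only the shape: rows flow from a reader, through an optional mapping step, into a writer. In SeaTunnel each of those steps is a connector you configure rather than a function you write.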
With these components, SeaTunnel can solve common problems found with data integration and synchronization. It provides high-performance data synchronization for both real-time and batch data. The docs claim, in a rather rough translation, that it can “synchronize hundreds of billions of data per day in real-time.” I’m not sure exactly what is being counted there, but it's probably pretty fast, considering that companies like Alibaba use it.
I was impressed with the connector API feature in the system. Over 100 pre-built connectors exist, but you can create your own if you need to. The connectors are not tied to a specific execution engine; they can run on Flink, Spark, or the native SeaTunnel engine. The plug-in architecture for the connectors reminds me a bit of the
Data can be synchronized in batch or real-time, providing various synchronization options. A nifty feature is how it works with JDBC, which supports multi-table or whole database synchronization. This addresses the need for CDC multi-table synchronization scenarios.
The runtime process of SeaTunnel is shown in the diagram below:
The SeaTunnel runtime flow breaks down as follows:
Keep in mind that SeaTunnel is an EL(T) integration platform; as such, it can only do basic data transformations itself:
A SeaTunnel job is described in a config file with four possible sections: env, source, transform, and sink. The transform section can be left out if no transformation is performed. A config file can be written in HOCON or JSON format. Borrowing from the SeaTunnel docs, here is a simple example in HOCON format:
env {
  job.mode = "BATCH"
}
source {
  FakeSource {
    result_table_name = "fake"
    row.num = 100
    schema = {
      fields {
        name = "string"
        age = "int"
        card = "int"
      }
    }
  }
}
transform {
  Filter {
    source_table_name = "fake"
    result_table_name = "fake1"
    fields = [name, card]
  }
}
sink {
  Clickhouse {
    host = "clickhouse:8123"
    database = "default"
    table = "seatunnel_console"
    fields = ["name", "card"]
    username = "default"
    password = ""
    source_table_name = "fake1"
  }
}
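The part of this config that took me a moment is how the sections are wired together: each plugin names its output with result_table_name, and the next plugin picks it up with source_table_name. Here is a small Python sketch of that idea, assuming a simplified dict version of the job above. The dict layout and the `find_unresolved_inputs` helper are hypothetical, written only for this illustration; they are not how SeaTunnel parses or validates a job.

```python
from typing import Dict, List

# The example job, reduced to the table names each plugin reads and writes.
# This layout is an assumption for the sketch, not SeaTunnel's own model.
job: Dict[str, List[Dict[str, str]]] = {
    "source": [{"plugin": "FakeSource", "result_table_name": "fake"}],
    "transform": [
        {
            "plugin": "Filter",
            "source_table_name": "fake",
            "result_table_name": "fake1",
        }
    ],
    "sink": [{"plugin": "Clickhouse", "source_table_name": "fake1"}],
}


def find_unresolved_inputs(job: Dict[str, List[Dict[str, str]]]) -> List[str]:
    """Return a message for every plugin that reads a table nobody produced."""
    known_tables = set()
    problems = []
    # Walk the sections in the order the data flows through them.
    for section in ("source", "transform", "sink"):
        for plugin in job.get(section, []):
            wanted = plugin.get("source_table_name")
            if wanted is not None and wanted not in known_tables:
                problems.append(
                    f"{section}.{plugin['plugin']} reads unknown table '{wanted}'"
                )
            produced = plugin.get("result_table_name")
            if produced is not None:
                known_tables.add(produced)
    return problems


print(find_unresolved_inputs(job))
```

For the job above the list comes back empty. Change the sink's source_table_name to a name no earlier plugin produces and the helper reports it, which is the kind of typo I would expect to bite in a larger config.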
While the format is very easy to read and understand, I could see it getting pretty gnarly with large tables. I’ll comment here that, like many open-source projects, the docs are rather lacking, but the project seems to have a pretty active
It’s a Java system. The docs call for Java 8 or 11 and say that other versions newer than Java 8 should theoretically work as well. If you already have Java installed, you just need to get the plugins you want from their site (or write your own) and set them up in their config file. After that, you create the config file that will manage the job, as described above. It’s all pretty straightforward as long as you have the credentials to access your source and destination data repositories. The console will give feedback on what is happening.
SeaTunnel is certainly not for everyone; as I currently see it, it comes into play when you are dealing with lots of data across various sources and destinations. I can certainly see situations where it would simplify things, so I’ll keep this project in my bag of tricks. The SeaTunnel folks have this good quick-start guide available
You can read the other “What the heck” articles at these links: