Republished from TDengine blog. I am a Director of Solutions Engineering at TDengine.
Time-series databases (TSDB) are aimed at processing time-stamped data points in an efficient way. The data model is the most important part of any data management system. Different TSDBs adopt different data models. To help developers deploy the right TSDB for their application, this blog compares data models for a few popular TSDBs on GitHub. Particularly we look at InfluxDB, TDengine, TimeScale, Prometheus, OpenTSDB, and QuestDB.
A time-series dataset contains a sequence of time-stamped metrics. The metrics are always generated by a single device, sensor, or data collection agent. Let’s call it a data collection point, DCP. Each DCP has static attributes, like serial number, model, color, hostname, app name, etc. In TSDB, these attributes are treated as labels or tags. These tags/labels actually are dimensional data and can be used to filter, group, or match the time series during analysis.
To make it easier to understand, let’s use connected vehicles as an example in this blog. Each vehicle reports its GPS position(x, y, z) to a server periodically. Each vehicle has its own unique VIN (vehicle identification number), brand, and model.
We will look at the data model from the following dimensions:
While TimeScale, TDengine, and QuestDB adopt a relational data model, OpenTSDB, Prometheus, and InfluxDB adopt a Tag Set data model. Prometheus shares the same tag set data model as OpenTSDB with some minor enhancements and so in this blog, we will not address OpenTSDB specifically.
In a tag set data model, every time series is uniquely identified by a metric name (measurement in InfluxDB nomenclature) and a set of labels (tags in InfluxDB nomenclature ). In our connected vehicles scenario, each time series is identified by:
GPS Coordinates + VIN + Brand + Model
For Prometheus/OpenTSDB, a time-series for connected vehicles could be written like this:
gps_x { vin=”vin1”, brand=”tesla”, model=”s3”}
gps_y { vin=”vin1”, brand=”tesla”, model=”s3”}
gps_z { vin=”vin1”, brand=”tesla”, model=”s3”}
In the case of InfluxDB, the time-series could be written as:
gps, vin=vin1, brand=tesla, model=s3
The Tag Set model resembles NoSQL, where schema and indexes are not required to be defined. Indexes on tags are created automatically, but indexes on metrics are not created. In fact, indexes on metrics are not allowed to be created which is a shortcoming of the tag set model. On the other handwriting to the database is very simple and it is quite easy to prototype simplistic applications. This makes it very attractive for developers.
To aggregate multiple time-series, a label/tag set filter has to be applied.
If we add one more tag or change the tag value, it is treated as a different time-series. While typically tags may not be changed, as IoT grows and devices become more complex and capable of capturing more data, we can anticipate that tags/labels will in fact be added probably quite frequently. In our simple connected vehicles scenario, if we add color as one more dimension for analysis, the Tag Set model treats it as a new time-series even though it is actually not. Additionally, in the Tag Set model, tags are always treated as strings, and no other data types are supported. This makes it impossible to filter within a range for example.
Another drawback of the Tag Set data model is that the query language is always proprietary. The early version of InfluxDB’s query language is SQL-like, but the new version is powered by its proprietary Flux query language. Prometheus also has its own query language, PromQL, and OpenTSDB adopts a set of its own APIs for querying. Proprietary query languages impose an unnecessary learning cost which is better spent on robust development especially in a fast-moving IoT world.
In applications that use a relational data model, the schema is always defined first. In the case of time-series data, there is always a timestamp column. In addition, each metric has a dedicated column and a data type. There are advantages to this. For e.g. the design of the schema is useful in validating the user’s input. Requiring schema design at the beginning of the project forces developers to be thoughtful about the data model, which reduces the effort for analysis and change management later on and reduces risk.
One of the best advantages of the relational data model is the query language – SQL. SQL is both an ISO and an ANSI standard. TimeScale, QuestDB, and TDengine all support SQL with some extensions. Since SQL is supported, many analytical tools, BI tools, and graphing tools can be used without any effort and queries can be reused with little or no change.
Defining a schema is viewed as a hurdle for developers used to NoSQL models who may want to adopt a relational data model.
To overcome this problem, both TDengine and QuestDB support schemaless data ingestion with built-in support for automatic table creation. TDengine supports both InfluxDB’s Line Protocol and OpenTSDB’s telnet and JSON format protocols. TDengine has out-of-box support for Telegraf, collectd, StatsD and other agents which can write data directly into TDengine.
A relational data model, of course, defines a schema for collected metrics. But TDengine, TimeScale and QuestDB adopt different ways to handle labels/tags associated with each data collection point.
For TimeScale, we have to create a table to store collected metrics and then convert this table into a HyperTable. In addition, we also need to create another table to store the tags. If we want to execute a basic query such as aggregation over a set of time-series, we need to perform a JOIN operation between the HyperTable and the tags table. Given our connected vehicle scenario, the following SQL statements shall be executed:
create table vehicles(vin SERIAL PRIMARY KEY, brand VARCHAR(20), model VARCHAR(20));
create table vehicle_data(time TIMESTAMPTZ NOT NULL, vin SERIAL, x DOUBLE PRECISION, y DOUBLE PRECISION, z DOUBLE PRECISION, FOREIGN KEY (vin) REFERENCES vehicles (vin));
If we want to retrieve data for a specific vehicle the SQL is:
select * from vehicle_data where vin=”vin1”
If you want to query all the data for Tesla Model S3, the SQL is:
select * from vehicle_data join vehicles on vehicle_data.vin= vehicles.vin where brand=“tesla” and model=”s3”
TDengine’s developers have designed a storage engine that is highly optimized for time-series data. To enhance the overall storage and query performance, TDengine requires a table for each time-series or data collection point. Furthermore, TDengine introduces a novel concept called the SuperTable (STable). An STable is used to describe a category of devices which generate the same metrics. An STable’s schema includes two parts – one part for metrics, and another part for tags.
For our connected vehicle scenario, the STable is created as below:
create table vehicle (ts timestamp, double x, double y, double z) tags (brand nchar(16), model nchar(16))
For each time-series or data collection point, a table needs to be created. In TDengine, this table is created by using a STable as a template with tag value assigned. For example, for a vehicle vin1, the table is created as follows:
create table vin1 using vehicle tags (“tesla”, “s3);
In other words, compared to a regular table, the table for each time-series is attached to a set of tags. The table name is unique in a database and can be set to a VIN in our connected vehicle scenario. To query the data on a specific vehicle, the SQL statement will be:
select * from vin1
To run a query over a set of vehicles, it queries on STable instead of the table, the SQL statement will is:
select * from vehicle where brand = “tesla” and model = “s3”;
By not requiring unnecessary joins the query in TDengine is much simpler when compared to queries in TimeScale.
QuestDB puts tags and metrics together in the table definition. For our connected vehicle scenario, the created table looks like this:
create vehicle(ts timestamp, vin symbol, brand symbol, model symbol, x double, y double, z double)
If you want to run query on a set of vehicles, the SQL will be like:
select * from vehicle where brand = “tesla” and model = “s3”;
It’s similar to TDengine; simple and straightforward. If you want to query data on a specific vehicle, the SQL will be like:
select * from vehicle where vin = "vin1";
Putting metrics and tags in a single table is a natural way for data modeling, but it raises a challenge for storage efficiency and query performance.
Prometheus and OpenTSDB adopt a single-column model. Each metric is treated as a different time series. In our connected vehicle case, GPS position (x, y and z) requires 3-time series.
While the single-column concept is very easy to understand and quite flexible, it consumes significantly more resources. In our connected vehicle example, the GPS position x, y and z each requires a timestamp which doubles the data size in both RAM and on disk. Reading the GPS position requires reading from 3-time series which leads to much longer latency.
InfluxDB, TDengine, TimeScale and QuestDB all support multi-column. Typically, a time series will contain multiple metrics which share the same timestamp. Obviously, multi-column support saves resources since the timestamp is stored only once. In addition, it also reduces the number of time-series or cardinality, thus reducing metadata size and reducing latency.
QuestDB stores each column as a separate file and thus supports a huge number of columns in a table. TimeScale can currently only support 1600 columns while TDengine can currently only support 4096 columns. In the next big release of TDengine, TDengine will support as many columns as are needed without any performance degradation.
In lots of scenarios, metrics are typically tied together. For e.g. the x, y and z coordinates of a GPS position are always tied together. In a smart power meter, current and voltage are always connected and it does not make sense to separate them. In DevOps, the inbound and outbound traffic volume are always measured at the same time and CPU and memory usage are always sampled at the same time. We should not need to separate them into two time-series.
In some scenarios, collected metrics may change a lot and it may be hard to define a schema at the beginning. While single column is a good choice for flexibility, for the majority of scenarios such as GPS position, smart power meters, financial data and many others, the metrics are defined very clearly. In these cases multi-column is a better choice to save system resources and to reduce latency.
For flexibility and performance, multi-column and single column model can be mixed together as is done in TDengine.
Prometheus and OpenTSDB only support numeric data type while TDengine, TimeScale, InfluxDB and QuestDB supports numeric and many other data types.
In some use cases, such as IT infrastructure monitoring, a numeric data type is good enough. In most other use cases, however, such as IoT and connected vehicles, a numeric data type is not enough. It requires Boolean, integer, and string data types among others.
An advantage of using different data types for different metrics is that different compression algorithms can be used to enhance the compression ratio.
For developers who are more used to NoSQL models, the Tag Set model is quite easy to get started with since it does not require a schema to be defined. But this early simplicity can lead to complexity later in various ways, such as making tags more difficult. As we have seen, the query language is always proprietary and not standards-based like SQL. This leads to unnecessary learning costs and a lack of reusability. If the application requires more analytical features or more third-party tools, the relational data model is a better choice.
We have also seen that the multi-column model can save system resources and reduce latency while the single-column model is more flexible to accommodate schema changes. The best practice for a growing IoT world is to combine multi-column and single-column together in the platform.
As for data types, based on the scenario, it is usually quite easy to arrive at an answer to the question of whether support for more data types is better.
First published here