Hi! I am an experienced DevOps engineer and I decided to participate in the contest by Hackernoon and Aptible.
I want to share the story of migrating our logging and tracing services from Elastic Stack to Grafana Stack and what came of it. Before the migration, my team used fairly classic schemes:
This is a common variant among projects. It suited us for the first year and a half of the project's life. But as time passed, microservices proliferated like mushrooms after rain, and the volume of client requests grew. We had to expand the resources of the logging and tracing systems more and more often; more and more storage and computing power was required. On top of that, the X-Pack license was pushing the price even higher. When problems with licenses and access to Elastic's products began to appear, it became clear that we couldn't go on living like this.
While searching for the best solution, we tried different combinations of components, wrote Kubernetes operators, and ran into our fair share of trouble along the way. In the end, the schemes took the following form:
This is how we managed to unite the 3 most important aspects of monitoring: metrics, logs, and traces in one Grafana workspace and get several benefits from it. The main ones are:
I hope this article will be helpful both for those who are just choosing a logging/tracing system and for those facing similar difficulties.
It was 2022, and as mentioned above, we were using a fairly standard centralized logging scheme.
About 1 TB of logs accumulated per day. The cluster consisted of about 10 Elasticsearch data nodes, and an X-Pack license was purchased (mainly for domain authorization and alerting). Applications were mostly deployed in a Kubernetes cluster. Fluent Bit was used to ship their logs, with Kafka as a buffer and a Logstash pool per namespace behind it.
In the process of operating the system, we encountered various problems. Some of them were solved quite easily, for some only a workaround was suitable, and some could not be solved at all. The second and third groups of problems together prompted us to search for another solution. Let me first list the 4 most significant of these problems.
Strangely enough, the first component that started having problems was Fluent Bit. From time to time, it stopped sending logs of individual Kubernetes pods. Analyzing debug logs, tuning buffers, and updating the version did not produce the desired effect. We took Vector as a replacement. As it turned out later, it had similar problems too, but they were fixed in version 0.21.0.
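For illustration, here is a minimal sketch of a Vector configuration that collects pod logs and forwards them to Kafka, roughly matching the pipeline described above; the broker addresses and topic name are placeholders:

```yaml
# vector.yaml — minimal sketch: collect Kubernetes pod logs and forward them to Kafka
sources:
  k8s_logs:
    type: kubernetes_logs                            # reads container logs from the node

sinks:
  kafka_out:
    type: kafka
    inputs: ["k8s_logs"]
    bootstrap_servers: "kafka-0:9092,kafka-1:9092"   # placeholder broker list
    topic: "app-logs"                                # placeholder topic
    encoding:
      codec: json                                    # ship messages as structured JSON
```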
The next annoyance was the workaround we were forced to use when enabling the DLQ on Logstash. Logstash does not know how to rotate the logs that get into this queue. As an almost official workaround, it was suggested to simply restart the instance once the queue reached a threshold size. This did not affect the system negatively, since Kafka was used as the input and the service terminated gracefully. But it was a pain to watch the constantly growing pod restart counters, and the restarts sometimes masked other problems.
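For reference, the DLQ behaviour is controlled by a handful of settings in logstash.yml; the values below are examples, not our exact production numbers:

```yaml
# logstash.yml — dead letter queue settings (example values)
dead_letter_queue.enable: true
dead_letter_queue.max_bytes: 1024mb                                  # once full, new failed events are dropped
path.dead_letter_queue: /usr/share/logstash/data/dead_letter_queue   # directory watched by the restart workaround
```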
Another problem was the inconvenient way of describing alerting rules. You can click them together in the Kibana web interface, but describing them as code, as in Prometheus, is more convenient, and the syntax Elastic offers is rather non-obvious and hard to stomach.
But the main problems were related to the ever-growing amount of resources consumed and, as a result, the cost of the system. The greediest components turned out to be Logstash and Elasticsearch, since the JVM is notoriously fond of memory.
Jaeger was used for centralized trace collection. It sent data via Kafka to a separate Elasticsearch cluster. The traces were not getting any smaller: we had to scale the system to accommodate hundreds of gigabytes of traces daily.
The scheme looked like this:
Another common inconvenience was the use of different web interfaces for different aspects of monitoring:
Of course, Grafana allows both Elasticsearch and Jaeger to be connected as data sources, but the need to manipulate different data query syntaxes remains.
Our search for new logging and tracing solutions began with these assumptions. We did not conduct any comparative analysis of different systems: most of the serious products on the market today require licensing. So we decided to take the open-source project Grafana Loki, which had already been successfully adopted by a number of companies.
So, why Loki was chosen:
The distributed Helm chart was chosen as the deployment method, and object storage was chosen for storing chunks. The more lightweight Vector replaced Logstash. Out of the box, the system is quite functional at low volumes (several hundred messages per second): logs are saved almost in real time, and searching fresh data works almost as fast as in Kibana. The resulting scheme looks as follows.
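Alongside this scheme, here is a rough sketch of the storage-related part of a Loki configuration, assuming an S3-compatible object storage and the boltdb-shipper index store; the bucket, endpoint, credentials, and dates are placeholders:

```yaml
# Loki config fragment — chunks in object storage, index via boltdb-shipper (all names are placeholders)
schema_config:
  configs:
    - from: 2022-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v12
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /var/loki/index
    cache_location: /var/loki/boltdb-cache
    shared_store: s3
  aws:
    endpoint: s3.example.com
    bucketnames: loki-chunks
    access_key_id: <access-key>
    secret_access_key: <secret-key>
    s3forcepathstyle: true
```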
As the load increases, the write path starts to suffer: both the ingester and the distributor start dropping logs, returning timeouts, and so on. And when requesting data missing from the ingester's cache, the response time starts to approach a minute, or the request simply fails with a timeout. Just in case, here is a diagram of what a Loki installation via the distributed chart looks like.
In case of write problems, it is worth paying attention to these parameters (see the sketch after this list):
limits_config.ingestion_burst_size_mb
limits_config.ingestion_rate_mb
They are responsible for ingestion bandwidth: when traffic approaches these thresholds, messages are discarded.
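A sketch of how these limits might look in limits_config; the numbers are purely illustrative, not a recommendation:

```yaml
limits_config:
  ingestion_rate_strategy: global   # the rate limit is shared across distributors
  ingestion_rate_mb: 16             # illustrative value: MB/s per tenant
  ingestion_burst_size_mb: 32       # illustrative value: allowed short burst
```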
To increase search speed, you should use Memcached or Redis caching. You can also play with the following parameters (a configuration sketch follows below):
limits_config.split_queries_by_interval
frontend_worker.parallelism
Loki can use caching of 4 types of data:
It is worth using at least the first three.
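Here is a rough sketch of what enabling Memcached-based result and chunk caching, together with the query-splitting parameters mentioned above, might look like; the host names and values are placeholders, and the exact location of some keys differs between Loki versions:

```yaml
# Loki config fragment — caching and query parallelism (placeholder hosts, illustrative values)
query_range:
  cache_results: true
  results_cache:
    cache:
      memcached_client:
        host: memcached-results.loki.svc
        service: memcache

chunk_store_config:
  chunk_cache_config:
    memcached_client:
      host: memcached-chunks.loki.svc
      service: memcache

limits_config:
  split_queries_by_interval: 30m    # smaller intervals produce more parallel subqueries

frontend_worker:
  parallelism: 8                    # illustrative value
```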
After tuning component settings, scaling, and enabling the cache, logging delays disappeared, and the search began to work in an acceptable time (within 10 seconds).
As for alerting rules, they are written similarly to Prometheus rules:

```yaml
- alert: low_log_rate_common
  expr: sum(count_over_time({namespace="common"}[15m])) < 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Count is less than 50 from {{ $labels.namespace }}. Current VALUE = {{ $value }}
```
The LogQL query language is used to access messages; its syntax is similar to PromQL. In terms of visualization, everything looks much like it does in Kibana.
A very convenient feature is setting up links from log messages to the corresponding traces. In this case, clicking the link opens the trace in the right half of the workspace.
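This log-to-trace linking is configured through derived fields on the Loki datasource. A sketch of Grafana datasource provisioning, assuming the trace ID is printed in the log line and that the tracing datasource (Tempo, described below) has the UID tempo; the regex, URL, and UID are placeholders:

```yaml
# Grafana datasource provisioning — link Loki log lines to traces (placeholder names, UID, and regex)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki-query-frontend:3100
    jsonData:
      derivedFields:
        - name: TraceID
          datasourceUid: tempo              # UID of the Tempo datasource
          matcherRegex: 'trace_id=(\w+)'    # depends on the actual log format
          url: '$${__value.raw}'            # "$$" escapes "$" in provisioning files
```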
Tempo is a younger product (introduced in 2020) with an architecture very similar to Loki's. The distributed Helm chart was chosen as the deployment method here as well.
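Assuming traces also land in object storage, mirroring the Loki setup, the trace storage section of a Tempo configuration could look roughly like this; the bucket, endpoint, and credentials are placeholders:

```yaml
# Tempo config fragment — trace storage in an S3-compatible bucket (placeholders; object storage assumed, as with Loki)
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.example.com
      access_key: <access-key>
      secret_key: <secret-key>
    wal:
      path: /var/tempo/wal
```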
Distributors can connect to Kafka directly, but in this form we could not achieve read speeds commensurate with write speeds. Using the Grafana Agent solved this problem.
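A rough sketch of the Grafana Agent traces pipeline for this kind of setup, assuming the agent's static mode with the OpenTelemetry Kafka receiver reading Jaeger-encoded spans from the existing topic and forwarding them to the Tempo distributor over OTLP/gRPC; the brokers, topic, and endpoint are placeholders:

```yaml
# grafana-agent config fragment — read spans from Kafka and push them to Tempo (placeholder names)
traces:
  configs:
    - name: default
      receivers:
        kafka:
          protocol_version: 2.0.0
          brokers: ["kafka-0:9092", "kafka-1:9092"]
          topic: jaeger-spans                            # placeholder topic written by the Jaeger collector
          encoding: jaeger_proto                         # must match the encoding of the spans in Kafka
      remote_write:
        - endpoint: tempo-distributor.tracing.svc:4317   # placeholder OTLP/gRPC endpoint
          insecure: true
```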
The following points deserve attention during configuration:
In addition to trace visualization, you can also get a connected graph showing all the components involved in processing a request.
From the initial installation of Loki in the dev environment to its rollout to production took about 2 months. Tempo took about the same amount of time.
As a result, we were able to:
It is worth mentioning what we have not managed to achieve yet:
In general, I can say that the switch to Grafana Stack was a success, but the process of improving our logging and tracing setup is not over yet.