LIMITED TIME OFFER! I'm ready for my next adventure as a DevRel advocate/Technical Evangelist/IT Talespinner. If that sounds like something you need, drop me a line by email or on LinkedIn.
My blog on pricing from the other day caught the attention of the folks over at MetricFire, and we struck up a conversation about some of the ideas, ideals, and challenges swirling around monitoring, observability, and their place in the broader IT landscape.
At one point, JJ, the lead engineer, asked, "You blogged about gearing up to get a certification in OpenTelemetry. What is it about OTel that has you so excited?"
I gave a quick answer, but JJ's question got me thinking, and I wanted to put some of those ideas down here.
OTel Is the Best Thing Since…
Let me start by answering JJ's question directly: I find OpenTelemetry exciting because it's the biggest change in the way monitoring and observability are done since traces (which came out around 2000 but weren't widely used until 2010-ish).
And traces were the biggest change since… ever. Let me explain.
See this picture? This was what it was like to use monitoring to understand your environment back when I started almost 30 years ago. What we wanted was to know what was happening in that boat. But that was never an option.
We could scrape metrics together from network and OS commands, and we could build some scripts and DB queries that gave us a little more insight. We could collect and (with a lot of work) aggregate log messages to spot trends across multiple systems. All of that gave us an idea of how the infrastructure was running and let us infer what might be happening topside. But we never really knew.
Tracing changed all that. All of a sudden we could get hard data (and get it in real-time) about what users were doing, and what was happening in the application when they did it.
It was a complete sea change (pun intended) for how we worked and what we monitored. Even so, tracing didn't remove the need for metrics and logs. Hence the famous (or infamous) "three pillars" of observability.
Recently, I started working through the book "Learning OpenTelemetry," and one of the comments that struck me was that these aren't really "pillars" at all: they don't combine to hold up a unified whole. Authors Ted Young and Austin Parker reframed the combination of metrics, logs, and traces as "the three browser tabs of observability," because many tools put the effort back on the user to flip between screens and put it all together by sight.
On the other hand, OTel outputs can present all three streams of data as a single "braid."
From Learning OpenTelemetry, by Ted Young and Austin Parker, Copyright © 2024. Published by O'Reilly Media, Inc. Used with permission.
It should be noted that despite OTel's ability to combine and correlate this information, the authors of the book point out later that many tools still lack the ability to present it that way.
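The correlation itself is already there in the data, though. Any log record emitted inside an active span can carry that span's trace ID, and a backend that understands OTel data can stitch the log line and the trace back into one story. Here's a minimal, hand-rolled sketch using the Python SDK (the span name and log message are made up, and in practice the OTel logging integrations can inject the IDs for you):

```python
# Minimal sketch: a log line carrying the trace ID of the active span,
# which is what lets a backend "braid" logs and traces together.
# pip install opentelemetry-sdk
import logging

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

logging.basicConfig(level=logging.INFO)
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("braid-demo")

with tracer.start_as_current_span("checkout") as span:
    ctx = span.get_span_context()
    # The same 128-bit trace ID ends up on the span AND on this log record.
    logging.info("payment declined trace_id=%032x", ctx.trace_id)
```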
Despite it being a work in progress (but what, in the world of IT, isn't?), I still feel that OTel has already proven its potential to change the face of monitoring and observability.
OTel Is the Esperanto of Monitoring
Almost every vendor will jump at the chance to get you to send all your data to them. They insist that theirs is the One True Observability Tool.
In fact, let's get this out in the open: there simply isn't a singular "best" monitoring tool out there, any more than there's a singular "best" programming language, car model, or pizza style.* There isn't a single tool that will cover 100% of your needs in every single use case.
And even with the larger tools, the use cases that aren't part of their absolute sweet spot are going to cost you (in hours or dollars) to get right.
So, you're going to have multiple tools. It goes without saying (or at least it should) that you're not going to ship a full copy of all your data to multiple vendors. Therefore, a big part of your work as a monitoring engineer (or team of engineers) is to map your telemetry to your use cases, and those use cases to the tools you'll employ for them.
That's not actually a hard problem. Sure, it's complex, but once you have the mapping, making it happen is relatively easy. But, as I like to say, it's not the cost to buy the puppy that's the problem, it's the cost to keep feeding it.
Because the tools you have today are going to change down the road. That's when things get CRAZY hard. You have to hope things are documented well enough to understand all those telemetry-to-use-case mappings.
(Narrator: they will not, in fact, have it documented well enough)
Then you also have to hope your instrumentation is documented and understood well enough to know how to decouple from tool X and instrument for tool Y while maintaining the same capabilities.
(Narrator: this is not how it will go down.)
But OTel solves both the "buying the puppy" and the "feeding the puppy" problems. My friend Matt Macdonald-Wallace (Solutions Architect at Grafana) put it like this:
OTEL does solve a lot of the problems around "Oh great! Now we're trapped with vendor X and it's going to cost us millions to refactor all this code" as opposed to "Oh, we're switching vendors? Cool, let me just update my endpoint…"
Not only that, but OTel's ability to create pipelines (for those not up to speed on the concept, that's the ability to identify, filter, sample, and transform a stream of data before sending it to a specific destination) means you can selectively send the same data stream to multiple locations. Your security team can get their raw, unfiltered syslog while it's still on-premises, while some of the data (traces, logs, and/or metrics) goes on to one or more vendors.
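Here's a rough sketch of what that might look like in a Collector config. The endpoints are placeholders, and the syslog receiver ships in the collector-contrib distribution, so exactly which components you have depends on your build:

```yaml
receivers:
  syslog:
    tcp:
      listen_address: "0.0.0.0:54526"
    protocol: rfc5424
  otlp:
    protocols:
      grpc:

processors:
  batch:

exporters:
  # Raw, unfiltered logs stay on-prem with the security team.
  otlp/siem:
    endpoint: siem.internal.example.com:4317
  # Everything else heads out to the observability vendor of the moment.
  otlp/vendor:
    endpoint: ingest.vendor.example.com:4317

service:
  pipelines:
    logs/security:
      receivers: [syslog]
      exporters: [otlp/siem]
    logs/vendor:
      receivers: [syslog]
      processors: [batch]
      exporters: [otlp/vendor]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/vendor]
```

Switching vendors really does come down to changing an endpoint (or adding a second exporter and running both for a while).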
Which is why I say: OTel is the Esperanto of observability.
OTel's Secret Sauce Isn't OTLP
…it's standardization.
Before I explain why the real benefit of OTel is not OTLP, I should take a second to explain what OTLP is:
If you look up "What is OTLP (the OpenTelemetry Protocol)?" you'll probably find some variation of "…a set of standards, rules, and/or conventions that specify how OTel elements send data from the thing that created it to a destination." This is technically true, but also not very helpful.
Functionally, OTLP is the magic box that takes metrics, logs, or traces and sends them where they need to go. It's not as low-level as, say, TCP, but in terms of how it changes a monitoring engineer's day, it may as well be. We don't use OTLP so much as we indicate it should be used.
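To make that concrete, here's roughly what "indicating" looks like with the Python SDK. The endpoint below is a placeholder pointing at a local Collector; nothing in the snippet touches the wire format itself, because the exporter handles OTLP for us:

```python
# Minimal sketch: point the SDK at an OTLP endpoint and let it do the rest.
# pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("otlp-demo")
with tracer.start_as_current_span("checkout"):
    pass  # shipped over OTLP when the batch processor flushes
```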
Just to be clear, OTLP is amazingly cool and important. It's just not (in my opinion) AS important as some other aspects.
No, there are (at least) two things that, in my opinion, make OTel such an evolutionary shift in monitoring:
Collectors
First, it standardizes the model of a three-tier architecture with a collector (not an agent) in the middle. For us old-timers in the monitoring space, the idea of a collector is nothing new. In the bygone era of everything-on-prem, you couldn't get away with a thousand (or even a hundred) agents all talking to some remote destination. The shift to cloud architecture made that feasible, but it's still not the best idea.
Having a single system (or a small number of load-balanced ones) take all the data from multiple targets, process it (filtering, sampling, combining, etc.), and only then send it forward is not just A Good Idea™. It can have a direct impact on your bottom line, because only the data you WANT (and in the form you want it) goes out the egress port that racks up such a big part of your monthly bill.
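Continuing the rough Collector sketch from earlier, that trimming happens in the processors. The probabilistic_sampler and filter processors below also come from collector-contrib (exact fields can vary by version), and the health-check route is just an illustration:

```yaml
processors:
  batch:
  # Keep roughly 10% of traces instead of shipping every single span.
  probabilistic_sampler:
    sampling_percentage: 10
  # Drop noisy health-check spans before they ever leave the building.
  filter/healthchecks:
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, filter/healthchecks, batch]
      exporters: [otlp/vendor]
```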
Semantics
Look, I'll be the first to tell you that I'm not the world's best developer. So, the issue of semantic terminology doesn't usually keep me up at night. What DOES keep me up is the inability to get at a piece of data that I know should be there, but isn't.
What I mean is that it's fairly common for the same data point (say, bandwidth) to be referred to by a completely different name, in a completely different location, on devices from two different vendors. And maybe that doesn't seem so weird.
But how about the same data point being named differently on two different types of devices from the same vendor? Still not weird?
How about the same data point being named differently on the same device type from the same vendor, but across two different models? Getting weird, right? (Not to mention annoying.)
But the real kicker is when the same data point is named differently on two different parts of the same DEVICE.
Once you've run down that particular rabbit hole, you have a whole different appreciation for semantic naming. If I'm looking for CPU or bandwidth or latency or whatever, I would really, REALLY like for it to be called the same thing and found in the same semantic location.
OTel does this, and does it as a core aspect of the platform. I'm not the only one to have noticed it, either.
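Before I get to who else has noticed, here's a rough sketch of what it looks like in practice with the Python SDK. The service and values are made up; the attribute names come from the published semantic conventions:

```python
# Minimal sketch: the same facts get the same names, no matter who produced them.
# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("semconv-demo")

# Whether this span comes from an auto-instrumented web framework, a proxy,
# or hand-rolled code, the conventions put the method in "http.request.method"
# and the status code in "http.response.status_code" -- not "HTTPMethod",
# "req_verb", or whatever a given vendor felt like calling it that day.
with tracer.start_as_current_span("GET /checkout") as span:
    span.set_attribute("http.request.method", "GET")
    span.set_attribute("http.response.status_code", 200)
    span.set_attribute("server.address", "shop.example.com")
```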
Several years ago, during a meeting between the maintainers of Prometheus and OpenTelemetry, an unnamed Prometheus maintainer quipped, "You know, I'm not sure about the rest of this, but these semantic conventions are the most valuable thing I've seen in a while." It may sound a bit silly, but it's also true.
From Learning OpenTelemetry, by Ted Young and Austin Parker, Copyright © 2024. Published by O'Reilly Media, Inc. Used with permission.
Summarizing the Data
I'll admit that OpenTelemetry is still very (VERY) shiny for me.
But I'll also admit that the more I dig into it, the more I find to like. Hopefully, this blog has given you some reasons to check out OTel, too.
* OK, I lied: 1) Perl, 2) the 1967 Ford Mustang 390 GT/A, and 3) deep dish from Tel Aviv Kosher Pizza in Chicago.