I frequently engage with customers who are in the midst of breaking their monolithic applications into smaller microservices. Many teams also see this migration as an opportunity to make their applications more observable. As a result, customers ask which metrics they should monitor for a typical cloud native application.
Previously, when a customer asked me how to instrument a service, I pointed them to the well-known USE and RED methods. But I felt the response wasn’t thorough. A list of specific metrics to monitor can be helpful for teams building cloud native applications. This post is an attempt to provide a list of metrics to collect in a typical application. Not all the metrics listed below apply to every application type. For example, batch-like workloads rarely serve traffic, so they don’t need to keep a count of requests served.
The goal of this document is to help developers come up with the golden signals for their applications.
Golden Signals is a term first used in the Google SRE book. The Golden Signals are four metrics (latency, traffic, errors, and saturation) that give you a very good idea of the real health and performance of your application as seen by the actors interacting with that service, whether they are end users or another service in your microservice architecture.
Cloud best practices recommend building systems that are observable. While the word observability (or “O11y” as it is popularly known) doesn’t have an official definition, it is the measure of a system’s ability to expose its internal state. The three pillars of observability are logs, metrics, and traces.
Modern systems are designed to produce logs, emit metrics, and provide traces to help developers and operators understand their internal state.
Emitting metrics by exposing them on an externally accessible HTTP endpoint is gaining wider adoption thanks to developers adopting Prometheus for monitoring. In this model, Prometheus pulls metrics by scraping the application’s `/metrics` endpoint.
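As a minimal sketch of this pull model, assuming the Python `prometheus_client` library, an application can expose its own `/metrics` endpoint in a few lines (the port and metric name here are illustrative, not part of the original post):

```python
# Minimal sketch using the Python prometheus_client library.
import random
import time

from prometheus_client import Counter, start_http_server

# A counter the application increments as it does work.
REQUESTS = Counter("myapp_requests_total", "Total requests served")

if __name__ == "__main__":
    # Serve metrics at http://localhost:8000/metrics for Prometheus to scrape.
    start_http_server(8000)
    while True:
        REQUESTS.inc()  # simulate handling a request
        time.sleep(random.random())
```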
Similarly, when you run Node Exporter, it publishes metrics at `http://localhost:9100/metrics`.
Observability tools aggregate and analyze data from different sources to help you detect issues and identify bottlenecks. The goal is to use these system signals to improve the system’s reliability and prevent downtime.
AIOps products like Amazon DevOps Guru can also detect anomalies using your application's logs, metrics, and traces (and other sources) and give you early signals to prevent a potential disruption.
For an application to function as designed, the application and its underlying system have to be healthy. Host metrics inform the operator about the host’s and infrastructure’s resource usage: CPU, memory, I/O, and so on. If you use Prometheus, Node Exporter collects this information automatically for you.
Host metrics rarely differ from one workload to another. Whether we run a process on an EC2 instance or a Raspberry Pi, we’re interested in the same metrics.
Unlike host metrics, application metrics are unique to each microservice. Application metrics should give the operator the information needed to assess the service’s health and performance.
Several application performance monitoring (APM) companies, such as New Relic and Datadog, offer products that aggregate application metrics using SDKs or agents. However, what they will not collect are the business-specific metrics that only your application cares about.
To create a list of relevant metrics for an application, its architects must determine a signal for each of its key functions. The hallmark of a microservice is that it is designed to handle one key task, so it shouldn’t have many key functions. Start by whiteboarding the functions implemented in the code and creating a list of metrics that would help you gauge their performance (or, at the least, their availability).
Most measurements you take will fall into one of these categories:
Counter
As the name suggests, this value is incremented when a function runs. Example: total requests served.
Histogram
A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets.
Gauge
This type of metric tracks a value that increases or decreases over time. Example: number of threads.
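Using the Python `prometheus_client` library again, here’s a hedged sketch of one metric of each type (the names and bucket boundaries are illustrative):

```python
from prometheus_client import Counter, Gauge, Histogram

# Counter: only ever goes up.
REQUESTS_TOTAL = Counter("requests_total", "Total requests served")

# Histogram: samples observations into configurable buckets.
REQUEST_DURATION = Histogram(
    "request_duration_seconds",
    "Time spent handling a request",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)

# Gauge: a value that can go up or down.
ACTIVE_THREADS = Gauge("active_threads", "Number of active worker threads")

REQUESTS_TOTAL.inc()
with REQUEST_DURATION.time():  # records elapsed time when the block exits
    pass  # handle the request here
ACTIVE_THREADS.set(8)
```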
With that background, we can now talk about the common custom metrics developers use.
These are the obvious metrics to track for any application that serves traffic. Network metrics tell you how much load is placed on the system. Over time, these data points assist you when devising the scaling strategy for the system.
Things you should include are measures of request volume and latency, as in the sketch below.
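A minimal sketch, assuming Prometheus and illustrative metric names, of what traffic instrumentation might look like in a request handler:

```python
from prometheus_client import Counter, Histogram

HTTP_REQUESTS = Counter(
    "http_requests_total", "HTTP requests received", ["method", "status"]
)
REQUEST_BYTES = Histogram("request_size_bytes", "Size of request payloads")

def handle(method: str, body: bytes) -> None:
    REQUEST_BYTES.observe(len(body))
    # ... application logic ...
    HTTP_REQUESTS.labels(method=method, status="200").inc()
```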
It is a best practice to monitor a system’s saturation, which is a measure of the system’s resource consumption. Every resource has a breaking point beyond which additional stress causes performance degradation. Scalable and reliable systems are designed to never breach that breaking point.
However, simply collecting overall resource saturation at the application level is insufficient. You also need to look deeper, at the thread pool or resource pool level.
Consider collecting saturation metrics for each thread pool and resource pool your application uses.
Common servers and frameworks, like Tomcat and Flask, support exporting pre-defined metrics. For example, JMX already exposes many of these metrics; see the AWS CloudWatch documentation for details.
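If a pool isn’t covered by pre-defined metrics, you can export its saturation yourself. A sketch with a hypothetical connection pool (the class and metric names below are assumptions for illustration):

```python
from prometheus_client import Gauge

# Hypothetical connection pool; substitute your pool's real API.
class ConnectionPool:
    def __init__(self, size: int) -> None:
        self.size = size
        self.in_use = 0

POOL = ConnectionPool(size=20)

# Saturation as a fraction: 1.0 means the pool is exhausted.
POOL_SATURATION = Gauge("db_pool_saturation", "Fraction of pool connections in use")
POOL_SATURATION.set_function(lambda: POOL.in_use / POOL.size)
```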
Internet-facing web servers serve more than their intended audience: bots and scripts also flood them with requests. These automated requests can overload the system if unauthenticated requests are handled improperly (for example, attempting to process an unauthenticated request instead of redirecting it to the authentication service).
User-related metrics to collect include the volume of authenticated versus unauthenticated requests.
Some of these metrics may also come from your load balancer or ingress.
If your application follows the microservices approach, the code fulfills one function; at least, that’s the idea. What are the key performance indicators for your app’s function? Define them and track those metrics.
Should future releases cause a performance regression, you’ll be able to detect it. Tracking these business metrics will help you spot trends easily and avoid a cascading failure.
Common performance indicators vary from service to service.
If you still need help identifying key metrics, ask yourself this question: in what ways can my application negatively affect the business even when it appears to be healthy?
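For instance, an order service might define business metrics like the ones in this sketch (all names here are hypothetical):

```python
from prometheus_client import Counter, Histogram

# Hypothetical business metrics for an order service.
ORDERS_PLACED = Counter("orders_placed_total", "Orders successfully placed")
ORDERS_FAILED = Counter("orders_failed_total", "Orders that failed to complete")
CHECKOUT_DURATION = Histogram("checkout_duration_seconds", "End-to-end checkout time")

def place_order(cart) -> None:
    with CHECKOUT_DURATION.time():
        try:
            # ... charge payment, reserve inventory ...
            ORDERS_PLACED.inc()
        except Exception:
            ORDERS_FAILED.inc()
            raise
```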
Along with monitoring your database instances using database monitoring tools, consider collecting database connection health metrics in your application. This is especially helpful if your application uses a shared database. If your application encounters database connection errors but the database remains operational for other applications, you know the problem is on the application side, not the database side.
Consider recording database-related metrics such as the number of `SQLException`s thrown.
Wherever you’re persisting data, you need to ensure that you’re not going to exceed your quotas and run out of space. Besides monitoring on-disk and in-memory data volumes, don’t forget to monitor the data your application stores in databases and caches.
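In a Python service, the analog of counting `SQLException`s might look like this sketch (the helper and metric names are assumptions, written against a generic DB-API connection):

```python
from prometheus_client import Counter, Histogram

DB_ERRORS = Counter("db_errors_total", "Database errors raised by queries")
QUERY_DURATION = Histogram("db_query_duration_seconds", "Query latency")

def run_query(conn, sql: str, params=()):
    with QUERY_DURATION.time():
        try:
            cur = conn.cursor()
            cur.execute(sql, params)
            return cur.fetchall()
        except Exception:
            DB_ERRORS.inc()  # analog of counting SQLExceptions in Java
            raise
```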
Speaking of caches, it is a best practice to monitor metrics such as cache hits and misses.
Also, consider using an external cache such as Redis or Memcached.
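Here’s a hedged sketch of hit/miss accounting around a cache lookup; the cache client and the fallback loader are placeholders:

```python
from prometheus_client import Counter

CACHE_HITS = Counter("cache_hits_total", "Cache lookups that found a value")
CACHE_MISSES = Counter("cache_misses_total", "Cache lookups that missed")

def get_with_cache(cache, db, key):
    value = cache.get(key)  # e.g., a Redis or Memcached client
    if value is not None:
        CACHE_HITS.inc()
        return value
    CACHE_MISSES.inc()
    value = db.load(key)  # hypothetical fallback to the database
    cache.set(key, value)
    return value
```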
Keeping track of how downstream services perform is also useful in understanding issues. Along with using timeouts, retries (preferably with exponential backoff), and circuit breakers, consider monitoring per-dependency metrics, such as call counts, errors, and latency, for every external service your service depends on.
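A sketch that combines a timeout, simple exponential backoff, and per-dependency metrics; the URL, retry policy, and metric names are all illustrative:

```python
import time
import urllib.request

from prometheus_client import Counter, Histogram

DOWNSTREAM_CALLS = Counter(
    "downstream_calls_total", "Calls to a dependency", ["service", "outcome"]
)
DOWNSTREAM_LATENCY = Histogram(
    "downstream_latency_seconds", "Dependency latency", ["service"]
)

def call_with_retry(service: str, url: str, attempts: int = 3, timeout: float = 2.0):
    for attempt in range(attempts):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                body = resp.read()
            DOWNSTREAM_LATENCY.labels(service=service).observe(time.monotonic() - start)
            DOWNSTREAM_CALLS.labels(service=service, outcome="success").inc()
            return body
        except Exception:
            DOWNSTREAM_CALLS.labels(service=service, outcome="error").inc()
            time.sleep(2 ** attempt * 0.1)  # exponential backoff
    raise RuntimeError(f"{service} unavailable after {attempts} attempts")
```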
The frequency at which you publish and collect metrics depends on your business requirements. For a retailer, knowing traffic patterns by the hour and day is useful for scaling capacity. Similarly, a travel company’s traffic patterns are influenced by holiday schedules.
Amazon EC2 provides instance metrics at a 1-minute interval (with detailed monitoring enabled), which is a good start for critical metrics.
Remember that there’s a cost attached to exposing, collecting, and analyzing metrics. Collecting unnecessary information in metrics can put a strain on the system and slow down troubleshooting.
Consider giving the operator control over the metrics your code generates. This way, you can turn on specific metrics whenever needed.
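One lightweight approach (a sketch, not an established convention) is to gate optional, high-cardinality metrics behind an environment variable:

```python
import os

from prometheus_client import Histogram

# Operators opt in to expensive, high-cardinality metrics at deploy time.
DETAILED_METRICS = os.getenv("ENABLE_DETAILED_METRICS", "false").lower() == "true"

PER_QUERY_LATENCY = (
    Histogram("per_query_latency_seconds", "Latency per query type", ["query_type"])
    if DETAILED_METRICS
    else None
)

def observe_query(query_type: str, seconds: float) -> None:
    if PER_QUERY_LATENCY is not None:
        PER_QUERY_LATENCY.labels(query_type=query_type).observe(seconds)
```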
Which metrics to collect is a question that only those most familiar with the code can answer. This post provides a list of metrics to get you started.
Are there any metrics that I have overlooked? Let me know at @realz.
https://giedrius.blog/2019/05/11/push-vs-pull-in-monitoring-systems/
https://www.splunk.com/en_us/blog/learn/sre-metrics-four-golden-signals-of-monitoring.html
https://www.oreilly.com/library/view/learning-modern-linux/9781098108939/
https://www.oreilly.com/library/view/release-it/9781680500264/