This One Prometheus Feature Could Save You Hours of Debugging

In the world of observability, we often find ourselves working with two fundamental types of telemetry data: metrics and traces. While both are crucial for understanding system behavior, they traditionally existed in separate domains. Exemplars emerge as the crucial bridge between these two worlds, offering a powerful way to correlate aggregate metrics with individual traces. Let's dive deep into understanding exemplars and their practical applications.

What Are Exemplars?

An exemplar is a representative sample that links a metric to a specific trace. Think of it as a bookmark that says "here's an actual request that represents this metric." When you're looking at a metric—say, the latency of your API requests—an exemplar provides a direct link to a trace that occurred during that measurement period.

For example, consider a histogram metric tracking RPC durations. With exemplars, one of its data points might look like this in the exported metrics:

rpc_durations_histogram_bucket{le="1.0"} 69 # {traceId="9c78ed"} 0.9654 1.713867692e+09

In the above snippet, the histogram bucket le="1.0" has a count of 69, and after the # we see an exemplar: a trace ID 9c78ed with a recorded value 0.9654 at timestamp 1.713867692e+09. This means one of the RPC calls that fell into this bucket (<=1.0s latency) is represented by trace 9c78ed which had a latency of ~0.9654s. The exemplar “bridges” the metric and a specific trace, without adding a new metric time-series label (exemplars attach to metric values, not to the label set). In other words, exemplars do not increase the cardinality of your metrics – they are stored as metadata alongside the metric values, typically one exemplar per time series at a time.
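To make the format concrete, here is a minimal Python sketch that splits a line of this shape into its sample and exemplar parts. The `parse_exemplar_line` helper and its regular expressions are illustrative assumptions, not part of any Prometheus client library, and they only handle the simple case shown above (a real OpenMetrics parser also deals with escaping, spaces inside label values, and other edge cases).

```python
import re
from typing import NamedTuple, Optional


class Exemplar(NamedTuple):
    labels: dict
    value: float
    timestamp: Optional[float]


# Hypothetical pattern for:
#   <series> <sample value> # {<exemplar labels>} <exemplar value> [<timestamp>]
LINE_RE = re.compile(
    r'^(?P<series>\S+)\s+(?P<sample>\S+)\s+#\s+'
    r'\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)(?:\s+(?P<ts>\S+))?\s*$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_exemplar_line(line: str):
    """Return (series, sample_value, Exemplar), or None if the line has no exemplar."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    labels = dict(LABEL_RE.findall(match.group('labels')))
    ts = match.group('ts')
    exemplar = Exemplar(
        labels,
        float(match.group('value')),
        float(ts) if ts is not None else None,
    )
    return match.group('series'), float(match.group('sample')), exemplar


line = 'rpc_durations_histogram_bucket{le="1.0"} 69 # {traceId="9c78ed"} 0.9654 1.713867692e+09'
series, sample, exemplar = parse_exemplar_line(line)
print(series, sample, exemplar.labels['traceId'], exemplar.value)
```

The point of the sketch is that the exemplar travels next to the sample value: the series name and its `le` label are unchanged, which is why no new time series is created.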

Why Are Exemplars Valuable?

Exemplars solve several critical challenges in observability:

1. Bridging the Aggregation Gap

Metrics are great for understanding system behavior at scale, but they lose the context of individual requests. Exemplars restore this context by connecting aggregate data points to specific traces. This connection is invaluable when investigating performance issues or understanding unusual patterns in your metrics.

2. Performance Investigation

When troubleshooting performance issues, exemplars allow you to quickly move from identifying a problem in your metrics (like a latency spike) to investigating the root cause through a detailed trace. This significantly reduces the time needed to identify and resolve issues.

3. Anomaly Understanding

For unusual events or outliers in your metrics, exemplars provide immediate context by linking to traces that represent these anomalies. This helps you understand what makes these cases different from normal behavior.

Implementing Exemplars in Your Code

Let's look at how to implement exemplars in both Go and Python, two popular languages for backend services.

Go Implementation

Here's how you can instrument your Go code to expose exemplars:

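package main

import (
    "context"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "go.opentelemetry.io/otel/trace"
)

var (
    // Define a histogram; exemplars are attached per observation.
    // Note: exemplars are only exposed in the OpenMetrics format, so the
    // metrics handler needs promhttp.HandlerOpts{EnableOpenMetrics: true}.
    requestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"handler"},
    )
)

func measureRequest(ctx context.Context, handlerName string, duration float64) {
    observer := requestDuration.WithLabelValues(handlerName)

    // Get the current span context from the request context
    spanCtx := trace.SpanContextFromContext(ctx)

    // Without a sampled trace there is nothing to link to,
    // so record a plain observation instead
    if !spanCtx.IsSampled() {
        observer.Observe(duration)
        return
    }

    // WithLabelValues returns a prometheus.Observer; histograms also implement
    // prometheus.ExemplarObserver, which is reached through a type assertion
    observer.(prometheus.ExemplarObserver).ObserveWithExemplar(duration, prometheus.Labels{
        "traceID": spanCtx.TraceID().String(),
    })
}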

Python Implementation

Here's how to implement exemplars in Python using the OpenTelemetry and Prometheus libraries:


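from typing import Optional

from opentelemetry import trace
from prometheus_client import Histogram

# Define a histogram; exemplars are attached per observation.
# Note: exemplars are only rendered in the OpenMetrics exposition format.
request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['handler']
)

def measure_request(handler_name: str, duration: float, trace_id: Optional[str] = None):
    histogram = request_duration.labels(handler=handler_name)

    if trace_id is None:
        # Fall back to the current span. Its trace_id is an int, so format it
        # as the usual 32-character hex string (exemplar values must be strings)
        span_context = trace.get_current_span().get_span_context()
        if span_context.is_valid:
            trace_id = trace.format_trace_id(span_context.trace_id)

    if trace_id is None:
        # No trace to link to: record a plain observation
        histogram.observe(duration)
        return

    # Record the observation with exemplar
    histogram.observe(duration, exemplar={'traceID': trace_id})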

Configuring Systems to Support Exemplars

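Prometheus itself only stores exemplars when it is started with the --enable-feature=exemplar-storage flag. If you use remote storage or a global metrics system (e.g., Thanos or Mimir), exemplars are supported there as well: Prometheus can forward exemplar data to those backends through its remote write interface.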

Prometheus Remote Write Configuration

To send exemplars over remote write, set send_exemplars: true in your Prometheus remote_write configuration. This setting is available in Prometheus 2.27+ and includes exemplar data in remote write payloads (which use the Protocol Buffers format). For example:

remote_write:
  - url: https://mimir.example.com/api/v1/push
    send_exemplars: true

This instructs Prometheus to forward exemplars to the remote endpoint.

Best Practices for Using Exemplars

When implementing exemplars in your observability strategy, consider these best practices:

1. Selective Recording

Don't create exemplars for every single measurement. Instead, focus on recording exemplars for significant events or samples that represent important patterns in your metrics. This helps maintain system performance while still providing valuable insights.
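One possible way to be selective is sketched below: attach an exemplar only when the trace was actually sampled (otherwise the link leads nowhere), and then only for slow requests or at most once per interval. The `ExemplarPolicy` class and both thresholds are hypothetical names and values chosen for illustration; they are not part of any Prometheus or OpenTelemetry API.

```python
import time
from typing import Optional

# Hypothetical tuning knobs for this sketch; pick values that fit your service.
SLOW_THRESHOLD_SECONDS = 0.5   # always consider requests at least this slow
MIN_INTERVAL_SECONDS = 1.0     # otherwise attach at most one exemplar per interval


class ExemplarPolicy:
    """Decides whether an observation should carry an exemplar."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_attached: Optional[float] = None

    def should_attach(self, duration: float, trace_sampled: bool) -> bool:
        # An exemplar pointing at an unsampled trace cannot be followed, so skip it.
        if not trace_sampled:
            return False
        now = self._clock()
        is_slow = duration >= SLOW_THRESHOLD_SECONDS
        interval_elapsed = (
            self._last_attached is None
            or now - self._last_attached >= MIN_INTERVAL_SECONDS
        )
        if is_slow or interval_elapsed:
            self._last_attached = now
            return True
        return False


policy = ExemplarPolicy()
print(policy.should_attach(0.02, trace_sampled=False))  # unsampled trace, no exemplar
```

In an instrumented handler, the result of `should_attach` would decide between a plain observation and an observation with an exemplar. The sketch is not thread-safe; a real service would guard `_last_attached` with a lock or keep one policy per worker.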

2. Meaningful Context

When creating exemplars, include relevant context that will help during investigation. The trace ID is essential, but you might also want to include other identifiers that could help correlate the data with other systems or logging information.
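Extra context has a budget: the OpenMetrics specification limits the combined length of an exemplar's label names and values to 128 UTF-8 characters. The sketch below builds a label set that always keeps the trace ID and adds optional identifiers only while they fit. The `build_exemplar_labels` helper and the `spanID` and `request_id` label names are assumptions made for this example.

```python
from typing import Dict, Optional

# OpenMetrics caps the combined length of exemplar label names and values
# at 128 UTF-8 characters.
MAX_EXEMPLAR_LABEL_CHARS = 128


def build_exemplar_labels(
    trace_id: str,
    span_id: Optional[str] = None,
    extra: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Build an exemplar label set, dropping optional labels that do not fit."""
    labels = {'traceID': trace_id}
    used = len('traceID') + len(trace_id)
    if used > MAX_EXEMPLAR_LABEL_CHARS:
        raise ValueError('trace ID alone exceeds the exemplar label budget')

    optional: Dict[str, str] = {}
    if span_id is not None:
        optional['spanID'] = span_id
    optional.update(extra or {})

    for name, value in optional.items():
        cost = len(name) + len(value)
        if used + cost > MAX_EXEMPLAR_LABEL_CHARS:
            continue  # skip labels that would push the set over the limit
        labels[name] = value
        used += cost
    return labels


labels = build_exemplar_labels(
    '4bf92f3577b34da6a3ce929d0e0e4736',
    span_id='00f067aa0ba902b7',
    extra={'request_id': 'r' * 100},  # too long to fit, so it is dropped
)
print(sorted(labels))  # ['spanID', 'traceID']
```

Dropping an oversized optional label rather than failing keeps the trace link intact, which is the part of the exemplar that matters most during an investigation.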

3. Storage Consideration

Remember that exemplars add storage overhead to your metrics. Work with your metrics storage system (like Prometheus, Thanos, or Mimir) to understand how exemplars affect storage requirements and retention policies.
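Prometheus, for instance, keeps exemplars in a fixed-size in-memory circular buffer, so the overhead is bounded and the oldest exemplars are overwritten once the buffer is full. The toy model below illustrates that behavior; the `ExemplarBuffer` class is a simplified illustration under that assumption and does not reproduce Prometheus's actual data structures.

```python
from collections import deque


class ExemplarBuffer:
    """Toy fixed-capacity exemplar store: once full, the oldest entry is evicted."""

    def __init__(self, max_exemplars: int):
        self._entries = deque(maxlen=max_exemplars)

    def add(self, series: str, trace_id: str, value: float, timestamp: float) -> None:
        self._entries.append((series, trace_id, value, timestamp))

    def for_series(self, series: str):
        return [entry for entry in self._entries if entry[0] == series]

    def __len__(self) -> int:
        return len(self._entries)


SERIES = 'http_request_duration_seconds_bucket{le="0.5"}'

buffer = ExemplarBuffer(max_exemplars=3)
for i in range(5):
    buffer.add(SERIES, f'trace-{i}', 0.1 * i, float(i))

print(len(buffer))                                        # 3
print([entry[1] for entry in buffer.for_series(SERIES)])  # ['trace-2', 'trace-3', 'trace-4']
```

The practical consequence is that exemplar retention depends on buffer size and ingestion rate rather than on the metric retention period, so older exemplars may be gone even though the metric samples are still queryable.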

Limitations of Exemplars

While exemplars are powerful, they do have some limitations worth noting:

By definition, an exemplar is just one example. That is a significant pitfall when a complex issue can only be understood by analyzing patterns across many traces.
It's up to the developer to decide which traceID is most "relevant" to publish as an exemplar, which can be challenging in complex systems.
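As one example of a relevance rule, the sketch below keeps the slowest traced request seen in each histogram bucket during a reporting window, on the assumption that the worst case in a bucket is the most useful one to inspect. The `SlowestPerBucket` class and the bucket bounds are hypothetical; other rules (most recent, error responses only, random sampling) may suit a given system better.

```python
import bisect
from typing import Dict, List, Optional, Tuple

# Hypothetical bucket upper bounds in seconds; the final +Inf bucket is implicit.
BUCKETS: List[float] = [0.1, 0.5, 1.0]


class SlowestPerBucket:
    """Tracks, for each bucket, the slowest traced observation seen so far."""

    def __init__(self, buckets: List[float]):
        self._buckets = sorted(buckets)
        self._best: Dict[int, Tuple[float, str]] = {}

    def _index(self, duration: float) -> int:
        # bisect_left returns the first bucket whose upper bound is >= duration;
        # an index equal to len(buckets) stands for the +Inf bucket.
        return bisect.bisect_left(self._buckets, duration)

    def offer(self, duration: float, trace_id: str) -> bool:
        """Return True if this observation becomes its bucket's candidate."""
        index = self._index(duration)
        current = self._best.get(index)
        if current is None or duration > current[0]:
            self._best[index] = (duration, trace_id)
            return True
        return False

    def candidate(self, upper_bound: float) -> Optional[Tuple[float, str]]:
        """Return (duration, trace_id) for the bucket with this upper bound."""
        return self._best.get(self._index(upper_bound))

    def reset(self) -> None:
        """Call once per reporting window so stale traces do not linger."""
        self._best.clear()


selector = SlowestPerBucket(BUCKETS)
selector.offer(0.3, 'trace-a')
selector.offer(0.2, 'trace-b')   # faster than trace-a, so it is ignored
print(selector.candidate(0.5))   # (0.3, 'trace-a')
```

Whatever rule is chosen, it still yields a single trace per bucket, so the first limitation above continues to apply.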

Conclusion

Whether you're using Go, Python, or another language, the investment in implementing exemplars will pay dividends in your observability journey. As the technology continues to mature and gain wider adoption across the observability ecosystem, exemplars will become an increasingly essential tool for modern software engineering teams. Remember that exemplars are just one part of a comprehensive observability strategy. They work best when combined with well-structured logging, comprehensive metrics, and detailed tracing implementation. Together, these tools provide the visibility needed to maintain and improve modern distributed systems.
