Getting value from Distributed Tracing

Background

In today’s application landscape, where microservices architectures are increasingly common, ensuring fast and reliable user experiences is crucial. Latency, the time it takes for an application to respond to a request, directly impacts user satisfaction and business outcomes. Traditional monitoring tools often struggle to pinpoint performance bottlenecks in complex distributed systems. Distributed tracing emerges as a powerful solution, offering a holistic view of requests as they flow through various microservices.

This article explores the benefits of distributed tracing, dives into how to read and analyze traces, and provides practical guidance on optimizing application performance using distributed tracing techniques.

Why does latency matter?

Simply put, latency is the amount of time users have to wait for an application to complete a unit of work, such as an initial page load, saving a form, or moving to the next screen. Lower latency translates to a faster, more responsive user experience. Below are some statistics on the business impact of application latency (Source: Chapter 8 of Distributed Tracing in Practice: Instrumenting, Analyzing, and Debugging Microservices; O’Reilly Media).

Google:

  • Increasing the time to load a page of search results by half a second reduced the total number of searches by 20%.
  • Conversely, if Google made a page faster, it would see a roughly proportional increase in usage.

Akamai:

  • An extra 100 milliseconds of latency could reduce e-commerce conversion by as much as 7%.

Pinterest:

  • A 40% reduction in visitor wait time resulted in a 15% increase in sign-up conversion.

Why distributed tracing?

Distributed tracing helps overcome the limitations of traditional monitoring tools when it comes to latency optimization.

Traditional logs and metrics:

  • Lack request context
  • Require manual correlation
  • Logs easily become too verbose
  • Metrics with many dimensions run into cardinality issues

Distributed tracing:

  • Shows the whole request
  • Is event-centric
  • Carries contextual metadata
  • Shows relationships between different services

Distributed tracing system

Any distributed tracing system consists of the following parts:

  • Application instrumentation
  • Data collection
  • Data storage
  • Visualization and data mining
Distributed tracing system architecture

Why OpenTelemetry?

Many companies are switching their distributed tracing stack to use OpenTelemetry because of its numerous benefits:

  • Vendor- and tool-agnostic
  • Single set of APIs and conventions
  • Helps with instrumentation (a minimal example follows the figure below)
  • Collects telemetry and supports sampling and transformation
  • Integrate OpenTelemetry once and use it with any storage and data visualization tools
OpenTelemetry was the 2nd fastest-growing project in the CNCF landscape
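
To make this concrete, here is a minimal sketch of what instrumenting a service with the OpenTelemetry SDK can look like in Python. The service name, collector endpoint, and span/attribute names are placeholders I made up for illustration; any OTLP-compatible backend can receive this data.

```python
# pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Describe the service and send spans to an OTLP-compatible collector.
# "checkout-service" and the endpoint are placeholders for your own setup.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Manually created spans; context propagation keeps parent/child relationships.
with tracer.start_as_current_span("handle_checkout") as span:
    span.set_attribute("cart.items", 3)
    with tracer.start_as_current_span("charge_payment"):
        pass  # call the payment provider here
```

Once this is wired up, the same instrumentation can be pointed at any storage and visualization backend by reconfiguring the exporter or the collector, without touching application code.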

Methodology for maximizing ROI: A Five-Step Process

Follow this five-step process when working with distributed tracing:

Avoid common mistakes

  • Not collecting traces into a single backend
  • Not enforcing standards
  • Multiple teams working without coordination
  • Not building internally used blueprints, libraries, documentation, and best practices

Getting value

Below, I will guide you through how to read distributed traces, find patterns, and optimize application performance.

Before you begin

Enabling distributed tracing for your application

Stay tuned for more of my articles.

Reading distributed traces

Distributed traces are visualized as Gantt-chart-like timelines (in some sources this visualization is called a flame graph or a waterfall view). Example:

Trace example

The trace view shows:

  • how long a request took as a whole
  • what components of our system were used to serve the request
  • causality and order of events
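
To illustrate how that timeline is derived, here is a small sketch that renders a toy trace as an ASCII waterfall. The span names and timings are invented for illustration; in practice your tracing backend draws this view for you from the exported spans.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    name: str
    start_ms: float          # offset from the start of the trace
    end_ms: float
    parent: Optional[str] = None

# Invented spans from a single hypothetical trace, as they might be
# exported from a tracing backend.
SPANS = [
    Span("GET /checkout",       0, 420),
    Span("auth-service.verify", 5,  55, parent="GET /checkout"),
    Span("cart-service.load",  60, 180, parent="GET /checkout"),
    Span("db.query carts",     70, 170, parent="cart-service.load"),
    Span("payment.charge",    185, 400, parent="GET /checkout"),
]

def depth(span, by_name):
    """How many ancestors a span has (used for indentation)."""
    d = 0
    while span.parent is not None:
        span = by_name[span.parent]
        d += 1
    return d

def print_waterfall(spans, width=40):
    """Render spans as an indented ASCII waterfall, like a trace viewer does."""
    by_name = {s.name: s for s in spans}
    total = max(s.end_ms for s in spans)
    for s in sorted(spans, key=lambda s: s.start_ms):
        label = "  " * depth(s, by_name) + s.name
        left = int(s.start_ms / total * width)
        bar = max(1, int((s.end_ms - s.start_ms) / total * width))
        print(f"{label:<26} |{' ' * left}{'#' * bar}  {s.end_ms - s.start_ms:.0f} ms")

print_waterfall(SPANS)
```

The indentation encodes parent/child relationships (causality), while the horizontal position and length of each bar encode order and duration.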

Framework and mindset

Optimizing an application is an iterative process consisting of the following stages:

  1. Collecting data (instrumenting your application)
  2. Making an assumption
  3. Finding statistically significant examples of requests
  4. Review of traces
  5. Improving data collection (adding more instrumentation)
  6. Go to step 1

Finding statistically significant requests to optimize (step 3)

Use histograms instead of latency time-series charts.

Time-series charts show the average performance of your application.

Time-series chart

It is not uncommon for requests to the same endpoint of a service to have different performance profiles. Depending on the calling service or other dimensions (payload size, payload arguments, user-related data, etc.), performance profiles can represent different behaviors in the distributed system. We should not measure performance with a single number (even a high-percentile one); we should look at the distribution of that number and do multimodal analysis. In the example below, we can identify that the latency distribution consists of 3 groups:

Request histograms
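
As a rough illustration of this kind of multimodal analysis, the sketch below builds latency histograms from synthetic data and then slices them by a caller dimension. The numbers, bucket width, and the "caller" attribute are assumptions for illustration; real latencies would come from span durations and attributes in your tracing backend.

```python
import random
from collections import Counter

# Synthetic per-request latencies (ms) for one endpoint, tagged with the
# calling service. These are generated here only so the example is runnable.
random.seed(7)
REQUESTS = (
    [("web",   random.gauss( 40,  8)) for _ in range(500)]    # e.g. cache hits
    + [("web",   random.gauss(220, 25)) for _ in range(150)]  # e.g. cache misses
    + [("batch", random.gauss(900, 80)) for _ in range(50)]   # e.g. bulk callers
)

def print_histogram(latencies, bucket_ms=50):
    """Count requests per fixed-width latency bucket and print an ASCII bar chart."""
    counts = Counter(int(l // bucket_ms) * bucket_ms for l in latencies)
    for bucket in sorted(counts):
        print(f"{bucket:>5}-{bucket + bucket_ms:<5} ms | {'#' * (counts[bucket] // 5)} {counts[bucket]}")

# A single average hides the separate modes; the histogram exposes them.
all_latencies = [latency for _, latency in REQUESTS]
print(f"mean = {sum(all_latencies) / len(all_latencies):.0f} ms")
print_histogram(all_latencies)

# Slicing by a dimension (here: the calling service) helps explain each mode.
for caller in ("web", "batch"):
    print(f"\ncaller = {caller}")
    print_histogram([latency for c, latency in REQUESTS if c == caller])
```

Each mode in the combined histogram corresponds to a distinct behavior; slicing by a dimension tells you which group of requests is worth tracing and optimizing first.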

Review traces (step 4)

Critical path

The critical path is the combination of segments for which a differential increase in segment time would result in the same increase in end-to-end latency.

Critical path view

Benefits of the critical path view:

  • we can ignore all spans that are off the critical path (optimizing them will not reduce latency)
  • look for the longest span to find the most impactful parts to optimize
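
Many tracing tools compute the critical path view for you. If yours does not, the following is a simplified sketch of one way to approximate it from a span tree: walk backwards from the end of each span and descend into whichever child was still running at that point. The span structure, names, and timings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Span:
    name: str
    start: float   # ms offsets within the trace
    end: float
    children: List["Span"] = field(default_factory=list)

def critical_path(span, path=None):
    """Approximate the critical path by walking backwards from the span's end
    and descending into the child that was still running at the cursor."""
    if path is None:
        path = []
    cursor = span.end
    for child in sorted(span.children, key=lambda c: c.end, reverse=True):
        if child.end <= cursor:
            # Time between this child's end and the cursor is attributed to the span itself.
            path.append((span.name, child.end, cursor))
            critical_path(child, path)
            cursor = child.start
    path.append((span.name, span.start, cursor))
    return path

# Hypothetical trace: a checkout request with three dependencies.
root = Span("GET /checkout", 0, 400, [
    Span("auth.verify", 5, 50),
    Span("cart.load", 55, 180, [Span("db.query", 60, 175)]),
    Span("payment.charge", 185, 395),
])

for name, start, end in sorted(critical_path(root), key=lambda seg: seg[1]):
    if end > start:
        print(f"{name:<16} {start:>5.0f} -> {end:>5.0f}  ({end - start:.0f} ms on the critical path)")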

Find error markers

  1. Longest span. Optimizing the longest span on the critical path is likely to produce the largest benefit to the overall request latency.

  2. Missing spans. Missing spans often indicate missing instrumentation.

Missing span
  3. Avoid sequential execution or “staircase”. The “staircase” pattern is often an indication of suboptimal execution. There are, of course, situations when the algorithm genuinely requires sequential execution of tasks, but more often it is simply a masked bug and the logic can be rewritten to run calls in parallel (a simple detection heuristic is sketched after this list).
Staircase
  4. Finish at exactly the same time. Spans finishing at the same time are not necessarily an issue, but they should always be investigated further. Potential negative causes:
  • when the system supports timeouts with cancellation, the top-level span waits for other tasks to finish; if they do not complete within the allowed timeout, it cancels them and aborts the whole request;
  • there is resource contention and all the requests are waiting on the same lock, such as a long-running database transaction from another request that locked the table.
Finish at exactly the same time
  5. Detecting architectural problems. Use architecture diagram views to identify:
  • circular communication and interdependencies (two or more services depending on each other)
  • redundant communication
  6. Failed spans and retry patterns
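
To show how some of these patterns can be spotted programmatically, here is a rough sketch of heuristics for the “staircase” and “finish at exactly the same time” patterns over sibling spans. The span records, field names, and thresholds are assumptions for illustration, not a standard API; real detection would run over spans pulled from your tracing backend.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    name: str
    start_ms: float
    end_ms: float
    parent: Optional[str] = None

# Hypothetical sibling spans under one parent span.
SPANS = [
    Span("fetch user 1",  0,  40, parent="load-dashboard"),
    Span("fetch user 2", 41,  80, parent="load-dashboard"),
    Span("fetch user 3", 82, 121, parent="load-dashboard"),
]

def find_staircases(siblings, gap_ms=5.0):
    """Flag runs of sibling spans where each starts right after the previous
    one ends -- a hint that work is sequential and might be parallelizable."""
    ordered = sorted(siblings, key=lambda s: s.start_ms)
    runs, run = [], [ordered[0]]
    for prev, cur in zip(ordered, ordered[1:]):
        if 0 <= cur.start_ms - prev.end_ms <= gap_ms:
            run.append(cur)
        else:
            if len(run) >= 3:
                runs.append(run)
            run = [cur]
    if len(run) >= 3:
        runs.append(run)
    return runs

def find_same_end(siblings, tolerance_ms=1.0):
    """Flag groups of siblings that finish at (almost) the same time --
    worth checking for timeouts/cancellations or lock contention."""
    groups = {}
    for s in siblings:
        groups.setdefault(round(s.end_ms / tolerance_ms), []).append(s)
    return [g for g in groups.values() if len(g) >= 2]

for run in find_staircases(SPANS):
    print("possible staircase:", [s.name for s in run])
for group in find_same_end(SPANS):
    print("finish together:", [s.name for s in group])
```

Heuristics like these are only a starting point: they surface candidate traces, and the actual judgment still happens by reading the trace view.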

Conclusion

Distributed tracing empowers developers and operations teams to gain deep insights into application behavior.

By identifying performance bottlenecks, error patterns, and architectural inefficiencies, distributed tracing enables proactive performance optimization and troubleshooting. By following the steps outlined in this article, you can leverage distributed tracing to unlock significant performance improvements and help your business.

Ilya Mochalov
DevSecOps

Helping improve software delivery, operations, and security