Mastering Distributed Tracing
Distributed tracing, also known as end-to-end or workflow-centric tracing, is a family of techniques that aim to capture the detailed execution of causally-related activities, performed by the components of a distributed system. Unlike the traditional code profilers or host-level tracing tools, such as dtrace [1], end-to-end tracing is primarily focused on profiling the individual executions cooperatively performed by many different processes, usually running on many different hosts, which is typical of modern, cloud-native, microservices-based applications.
In the previous chapter, we saw a tracing system in action from the end user's perspective. In this chapter, we will discuss the basic ideas underlying distributed tracing; the various approaches to implementing end-to-end tracing that have been proposed in industry and academia; and the impact and trade-offs of the architectural decisions taken by different tracing systems on their capabilities and the types of problems they can address.
Consider the following, vastly simplified architectural diagram of a hypothetical e-commerce website. Each node in the diagram represents numerous instances of the respective microservices, handling many concurrent requests. To help with understanding the behavior of this distributed system and its performance or user-visible latency, end-to-end tracing records information about all the work performed by the system on behalf of a given client or request initiator. We will refer to this work as execution or request throughout this book.
The data is collected by means of instrumentation trace points. For example, when the client is making a request to the web server, the client's code can be instrumented with two trace points: one for sending the request and another for receiving the response. The data collected for a given execution is collectively referred to as a trace. One simple way to visualize a trace is via a Gantt chart, as shown on the right in Figure 3.1:
Figure 3.1: Left: a simplified architecture of a hypothetical e-commerce website and inter-process communications involved in executing a single request from a client. Right: a visualization of a single request execution as a Gantt chart.
The basic concept of distributed tracing appears to be very straightforward: instrument the components of the system with trace points, correlate the data they collect for each execution, and reconstruct and visualize the workflow of the request.
Of course, things are rarely as simple as they appear. There are multiple design decisions taken by the existing tracing systems, affecting how these systems perform, how difficult they are to integrate into existing distributed applications, and even what kinds of problems they can or cannot help to solve.
The ability to collect and correlate profiling data for a given execution or request initiator, and identify causally-related activities, is arguably the most distinctive feature of distributed tracing, setting it apart from all other profiling and observability tools. Different classes of solutions have been proposed in the industry and academia to address the correlation problem. Here, we will discuss the three most common approaches: black-box inference, domain-specific schemas, and metadata propagation.
Techniques that do not require modifying the monitored system are known as black-box monitoring. Several tracing infrastructures have been proposed that use statistical analysis or machine learning (for example, the Mystery Machine [2]) to infer causality and request correlation by consuming only the records of the events occurring in the programs, most often by reading their logs. These techniques are attractive because they do not require modifications to the traced applications, but they have difficulties attributing causality in the general case of highly concurrent and asynchronous executions, such as those observed in event-driven systems. Their reliance on "big data" processing also makes them more expensive and introduces higher latency compared to the other methods.
Magpie [3] proposed a technique that relied on manually-written, application-specific event schemas that allowed it to extract causality relationships from the event logs of production systems. Similar to the black-box approach, this technique does not require the applications to be instrumented explicitly; however, it is less general, as each application requires its own schemas.
This approach is not particularly suitable for modern distributed systems that consist of hundreds of microservices because it would be difficult to scale the manual creation of event schemas. The schema-based technique requires all events to be collected before the causality inference can be applied, so it is less scalable than other methods that allow sampling.
What if the instrumentation trace points could annotate the data they produce with a global identifier – let's call it an execution identifier – that is unique for each traced request? Then the tracing infrastructure receiving the annotated profiling data could easily reconstruct the full execution of the request, by grouping the records by the execution identifier. So, how do the trace points know which request is being executed when they are invoked, especially trace points in different components of a distributed application? The global execution identifier needs to be passed along the execution flow. This is achieved via a process known as metadata propagation or distributed context propagation.
Figure 3.2: Propagating the execution identifier as request metadata. The first service in the architecture (client) creates a unique execution identifier (Request ID) and passes it to the next service via metadata/context. The remaining services keep passing it along in the same way.
Metadata propagation in a distributed system consists of two parts: in-process and inter-process propagation. In-process propagation is responsible for making the metadata available to trace points inside a given program. It needs to be able to carry the context between the inbound and outbound network calls, dealing with possible thread switches or asynchronous behavior, which are common in modern applications. Inter-process propagation is responsible for transferring metadata over network calls when components of a distributed system communicate to each other during the execution of a given request.
Inter-process propagation is typically done by decorating communication frameworks with special tracing middleware that encodes metadata in the network messages, for example, in HTTP headers, Kafka records headers, and so on.
Figure 3.3: Metadata propagation in a single service. (1) The Handler that processes the inbound request is wrapped into instrumentation that extracts metadata from the request and stores it in a Context object in memory. (2) Some in-process propagation mechanism, for example, based on thread-local variables. (3) Instrumentation wraps an RPC client and injects metadata into outbound (downstream) requests.
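To make this concrete, here is a minimal sketch of the flow in Figure 3.3, written in Python. The header name x-request-id, the use of a contextvars variable for in-process propagation, and the helper names are all illustrative assumptions; real tracing libraries hide these details behind framework middleware.

```python
import contextvars
import uuid

# In-process propagation: the current execution identifier lives in a
# context variable that survives across function calls and async tasks.
_request_id = contextvars.ContextVar("request_id", default=None)

TRACE_HEADER = "x-request-id"  # hypothetical header name


def extract(headers: dict) -> None:
    """Inbound instrumentation: pull the execution identifier out of the
    request headers (or start a new one) and store it in the context."""
    _request_id.set(headers.get(TRACE_HEADER) or uuid.uuid4().hex)


def inject(headers: dict) -> dict:
    """Outbound instrumentation: copy the current execution identifier
    into the headers of a downstream request."""
    headers[TRACE_HEADER] = _request_id.get()
    return headers


def trace_point(event: str) -> None:
    """Any trace point in the process annotates its records with the
    execution identifier taken from the in-process context."""
    print({"execution_id": _request_id.get(), "event": event})


def handle_request(inbound_headers: dict) -> None:
    extract(inbound_headers)        # (1) extract metadata into the context
    trace_point("server receive")   # (2) trace points read the context
    outbound = inject({})           # (3) inject metadata into a downstream call
    trace_point("client send to downstream: %s" % outbound)


handle_request({"x-request-id": "abc123"})
```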
The key disadvantage of metadata propagation-based tracing is the expectation of a white-box system whose components can be modified accordingly. However, it is more scalable and provides much higher accuracy of the data compared to black-box techniques, since all trace points explicitly annotate the data with execution identifiers. In many programming languages, it is even possible to inject trace points automatically, without changes to the application itself, through a technique known as agent-based instrumentation (we will discuss this in more detail in Chapter 6, Tracing Standards and Ecosystem). Distributed tracing based on metadata propagation is by far the most popular approach and is used by virtually all industrial-grade tracing systems today, both commercial and open source. Throughout the rest of this book, we will focus exclusively on this type of tracing system. In Chapter 6, Tracing Standards and Ecosystem, we will see how new industry initiatives, such as the OpenTracing project [11], aim to reduce the cost of white-box instrumentation and make distributed tracing a standard practice in the development of modern cloud-native applications.
An astute reader may have noticed that the notion of propagating metadata alongside request execution is not limited to passing the execution identifier for tracing purposes. Metadata propagation can be thought of as a prerequisite for distributed tracing, or distributed tracing can be thought of as an application built on top of distributed context propagation. In Chapter 10, Distributed Context Propagation, we will discuss a variety of other possible applications.
The following diagram shows a typical organization of distributed tracing systems, built around metadata propagation. The microservices or components of a distributed application are instrumented with trace points that observe the execution of a request. The trace points record causality and profiling information about the request and pass it to the tracing system through calls to a Tracing API, which may depend on the specific tracing backend or be vendor neutral, like the OpenTracing API [11] that we will discuss in Chapter 4, Instrumentation Basics with OpenTracing.
Figure 3.4: Anatomy of distributed tracing
Special trace points at the edges of the microservice, which we can call inject and extract trace points, are also responsible for encoding and decoding metadata for passing it across process boundaries. In certain cases, the inject/extract trace points are used even between libraries and components, for example, when Python code is making a call to an extension written in C, which may not have direct access to the metadata represented in a Python data structure.
The Tracing API is implemented by a concrete tracing library that reports the collected data to the tracing backend, usually with some in-memory batching to reduce the communications overhead. Reporting is always done asynchronously in the background, off the critical path of the business requests. The tracing backend receives the tracing data, normalizes it to a common trace model representation, and puts it in a persistent trace storage. Because tracing data for a single request usually arrives from many different hosts, the trace storage is often organized to store individual pieces incrementally, indexed by the execution identifier. This allows for later reconstruction of the whole trace for the purpose of visualization, or additional processing through aggregations and data mining.
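The following is a minimal sketch of such an asynchronous, batching reporter. The send_batch function is a hypothetical stand-in for the network call to the tracing backend; the point is only that trace points hand records to an in-memory queue and a background thread ships them in batches, off the critical path.

```python
import queue
import threading
import time


def send_batch(batch):
    # hypothetical stand-in for shipping data to the tracing backend
    print("sending %d records to the tracing backend" % len(batch))


class BatchingReporter:
    def __init__(self, max_batch=100, flush_interval=1.0):
        self._queue = queue.Queue()
        self._max_batch = max_batch
        self._flush_interval = flush_interval
        threading.Thread(target=self._run, daemon=True).start()

    def report(self, record):
        # Called by trace points; never blocks the business request.
        self._queue.put_nowait(record)

    def _run(self):
        batch = []
        while True:
            try:
                batch.append(self._queue.get(timeout=self._flush_interval))
            except queue.Empty:
                pass
            # flush when the batch is full or there is nothing more to drain
            if batch and (len(batch) >= self._max_batch or self._queue.empty()):
                send_batch(batch)
                batch = []


reporter = BatchingReporter()
for i in range(5):
    reporter.report({"event": "span finished", "id": i})
time.sleep(2)  # give the background thread a chance to flush
```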
Sampling affects which records produced by the trace points are captured by the tracing infrastructure. It is used to control the volume of data the tracing backend needs to store, as well as the performance overhead and impact on the applications from executing tracing instrumentation. We discuss sampling in detail in Chapter 8, All About Sampling.
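As a simple illustration, head-based probabilistic sampling can be sketched as follows; the sampling rate and field names are assumptions. The key point is that the decision is made once at the start of the trace and then propagated with the metadata, so every service honors the same decision.

```python
import random

SAMPLING_RATE = 0.01  # keep roughly 1% of traces (illustrative value)


def start_trace():
    # the sampling decision becomes part of the propagated metadata
    return {"sampled": random.random() < SAMPLING_RATE}


def record_trace_point(metadata, record):
    if metadata["sampled"]:
        print("report:", record)  # only sampled executions reach the backend


ctx = start_trace()
record_trace_point(ctx, {"event": "server receive"})
```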
If we only pass the execution identifier as request metadata and tag tracing records with it, it is sufficient to reassemble that data into a single collection, but it is not sufficient to reconstruct the execution graph of causally-related activities. Tracing systems need to capture causality information that allows assembling the data captured by the trace points in the correct sequence. Unfortunately, knowing which activities are truly causally-related is very difficult, even with very invasive instrumentation. Most tracing systems elect to preserve Lamport's happens-before relation [4], denoted as → and formally defined as the least strict partial order on events such that: if events a and b occur in the same process and a occurs before b, then a → b; if a is the sending of a message by one process and b is the receipt of the same message by another process, then a → b; and if a → b and b → c, then a → c.
The happens-before relation can be too indiscriminate if applied liberally: "may have influenced" is not the same as "has influenced". Tracing infrastructures rely on additional domain knowledge about the systems being traced, and about the execution environment, to avoid capturing irrelevant causality. By threading the metadata along the individual executions, they establish the relationships between items with the same or related metadata (that is, metadata containing different trace point IDs but the same execution ID). The metadata can be static or dynamic throughout the execution.
Tracing infrastructures that use static metadata, such as a single unique execution identifier, throughout the life cycle of a request, must capture additional clues via trace points, in order to establish the happens-before relationships between the events. For example, if part of an execution is performed on a single thread, then using the local timestamps allows correct ordering of the events. Alternatively, in a client-server communication, the tracing system may infer that the sending of a network message by the client happens before the server receiving that message. Similar to black-box inference systems, this approach cannot always identify causality between events when additional clues are lost or not available from the instrumentation. It can, however, guarantee that all events for a given execution will be correctly identified.
Most of today's industrial-grade tracing infrastructures use dynamic metadata, which can be fixed-width or variable-width. For example, X-Trace [5], Dapper [6], and many similar tracing systems use fixed-width dynamic metadata, where, in addition to the execution identifier, they record a unique ID (for example, a random 64-bit value) of the event captured by the trace point. When the next trace point is executed, it stores the inbound event ID as part of its tracing data, and replaces it with its own ID.
In the following diagram, we see five trace points causally linked to a single execution. The metadata propagated after each trace point is a three-part tuple (execution ID, event ID, and parent ID). Each trace point stores the parent event ID from inbound metadata as part of its captured trace record. The fork at trace point b and join at trace point e illustrate how causal relationships forming a directed acyclic graph can be captured using this scheme.
Figure 3.5: Establishing causal relationships using dynamic, fixed-width metadata
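A minimal sketch of this fixed-width scheme is shown below, with illustrative names. Each trace point records the inbound event ID as its parent, generates its own event ID, and propagates the same fixed set of fields; the join at trace point e is modeled by recording a second parent ID in its trace record.

```python
import uuid


def new_id():
    return uuid.uuid4().hex[:8]


def trace_point(name, metadata, extra_parents=()):
    record = {
        "execution_id": metadata["execution_id"],
        "event": name,
        "event_id": new_id(),
        # the inbound event becomes this trace point's parent (None for the root)
        "parent_ids": [metadata["event_id"], *extra_parents],
    }
    print(record)  # in a real system this record is reported to the backend
    # the outbound metadata stays fixed-width
    return {"execution_id": metadata["execution_id"],
            "event_id": record["event_id"],
            "parent_id": record["parent_ids"][0]}


meta_a = trace_point("a", {"execution_id": new_id(), "event_id": None})
meta_b = trace_point("b", meta_a)
meta_c = trace_point("c", meta_b)          # fork: both c and d descend from b
meta_d = trace_point("d", meta_b)
trace_point("e", meta_c,                   # join: e records d as a second parent
            extra_parents=[meta_d["event_id"]])
```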
Using fixed-width dynamic metadata, the tracing infrastructure can explicitly record happens-before relationships between trace events, which gives it an edge over the static metadata approach. However, it is also somewhat brittle: if some of the trace records are lost, the tracing infrastructure may no longer be able to arrange the remaining events in causal order.
Some tracing systems use a variation of the fixed-width approach by introducing the notion of a trace segment, represented by another unique ID that is constant within a single process and only changes when metadata is sent over the network to another process. This reduces the brittleness slightly by making the system more tolerant to the loss of trace records within a single process, in particular when the tracing infrastructure proactively reduces the volume of trace data to control the overhead by keeping only the trace points at the edges of a process and discarding all the internal ones.
When using end-to-end tracing on distributed systems, where profiling data loss is a constant factor, some tracing infrastructures, for example, Azure Application Insights, use variable-width dynamic metadata, which grows as the execution travels further down the call graph from the request origin.
The following diagram illustrates this approach, where each next event ID is generated by appending a sequence number to the previous event ID. When a fork happens at event 1, two distinct sequence numbers are used to represent parallel events 1.1 and 1.2. The benefit of this scheme is higher tolerance to data loss; for example, if the record for event 1.2 is lost, it is still possible to infer the happens-before relationship 1 → 1.2.1.
Figure 3.6: Establishing causal relationships using dynamic, variable-width metadata
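The following sketch illustrates the variable-width scheme with hierarchical event IDs; the class and function names are hypothetical. Because each child ID is derived by appending a sequence number to its parent's ID, the happens-before relation can be inferred from ID prefixes alone, even when intermediate records are lost.

```python
import itertools


class EventContext:
    def __init__(self, event_id="1"):
        self.event_id = event_id
        self._seq = itertools.count(1)

    def child(self):
        # a fork simply hands out the next sequence number: 1.1, 1.2, ...
        return EventContext("%s.%d" % (self.event_id, next(self._seq)))


def happens_before(a, b):
    # event a happens before event b if a's ID is a proper prefix of b's ID
    return b.startswith(a + ".")


root = EventContext("1")
left, right = root.child(), root.child()   # parallel branches 1.1 and 1.2
grandchild = right.child()                 # 1.2.1

print(happens_before(root.event_id, grandchild.event_id))  # True, even if 1.2 is lost
print(happens_before(left.event_id, grandchild.event_id))  # False: concurrent branches
```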
Sambasivan and others [10] argue that another critical architectural decision that significantly affects the types of problems an end-to-end tracing infrastructure is able to address is the question of how it attributes latent work. For example, a request may write data to a memory buffer that is flushed to the disk at a later time, after the originating request has been completed. Such buffers are commonly implemented for performance reasons and, at the time of writing, the buffer may contain data produced by many different requests. The question is: who is responsible for the use of resources and the time spent by the system on writing the buffer out?
The work can be attributed to the last request that made the buffer full and caused the write (trigger-preserving attribution), or it can be attributed proportionally to all requests that produced the data into the buffer before the flush (submitter-preserving attribution). Trigger-preserving attribution is easier to implement because it does not require access to the instrumentation data about the earlier executions that affected the latent work.
However, it disproportionately penalizes the last request, especially if the tracing infrastructure is used for monitoring and attributing resource consumption. The submitter-preserving attribution is fair in that regard but requires that the profiling data for all previous executions is available when the latent work happens. This can be quite expensive and does not work well with some forms of sampling usually applied by the tracing infrastructure (we will discuss sampling in Chapter 8, All About Sampling).
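As a small worked example, suppose a 20 ms buffer flush was triggered by the last of three requests that wrote into the buffer. The following sketch, with made-up numbers, contrasts the two attribution policies:

```python
# How many bytes each request contributed to the buffer before the flush
buffer_contents = {"req-1": 4000, "req-2": 2500, "req-3": 1500}
flush_cost_ms = 20.0
last_request = "req-3"  # the request whose write triggered the flush

# Trigger-preserving: the whole cost goes to the request that caused the flush
trigger_attribution = {last_request: flush_cost_ms}

# Submitter-preserving: the cost is split proportionally to each request's share
total = sum(buffer_contents.values())
submitter_attribution = {
    req: flush_cost_ms * size / total for req, size in buffer_contents.items()
}

print(trigger_attribution)    # {'req-3': 20.0}
print(submitter_attribution)  # {'req-1': 10.0, 'req-2': 6.25, 'req-3': 3.75}
```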
In Figure 3.4, we saw a component called "Collection/Normalization." The purpose of this component is to receive tracing data from the trace points in the applications and convert it to some normalized trace model, before saving it in the trace storage. Aside from the usual architectural advantages of having a façade on top of the trace storage, the normalization is especially important when we are faced with the diversity of instrumentations. It is quite common for many production environments to be using numerous versions of instrumentation libraries, from very recent ones to some that are several years old. It is also common for those versions to capture trace data in very different formats and models, both physical and conceptual. The normalization layer acts as an equalizer and translates all those varieties into a single logical trace model, which can later be uniformly processed by the trace visualization and analysis tools. In this section, we will focus on two of the most popular conceptual trace models: event model and span model.
So far, we have discussed tracing instrumentation taking the form of trace points that record events when the request execution passes through them. An event represents a single point in time in the end-to-end execution. Assuming that we also record the happens-before relationships between these events, we intuitively arrive at the model of a trace as a directed acyclic graph, with nodes representing the events and edges representing the causality.
Some tracing systems (for example, X-Trace [5]) use such an event model as the final form of the traces they surface to the user. The diagram in Figure 3.7 illustrates an event graph observed from the execution of an RPC request/response by a client-server application. It includes events collected at different layers of the stack, from application-level events (for example, "client send" and "server receive") to events in the TCP/IP stack.
The graph contains multiple forks used to model request execution at different layers, and multiple joins where these logical parallel executions converge to higher-level layers. Many developers find the event model difficult to work with because it is too low level and obscures useful higher-level primitives. For example, it is natural for the developer of the client application to think of the RPC request as a single operation that has start (client send) and end (client receive) events. However, in the event graph, these two nodes are far apart.
Figure 3.7: Trace representation of an RPC request between client and server in the event model, with trace events recorded at application and TCP/IP layers
The next diagram (Figure 3.8) shows an even more extreme example, where a fairly simple workflow becomes hard to decipher when represented as an event graph. A frontend Spring application running on Tomcat is calling another application called remotesrv, which is running on JBoss. The remotesrv application is making two calls to a PostgreSQL database.
It is easy to notice that aside from the "info" events shown in boxes with rounded corners, all other records come in pairs of entry and exit events. The info events are interesting in that they look almost like noise: they most likely contain useful information if we had to troubleshoot this particular workflow, but they do not add much to our understanding of the shape of the workflow itself. We can think of them as info logs, only captured via trace points. We also see an example of fork and join, because the info event from tomcat-jbossclient happens in parallel with the execution happening in the remotesrv application.
Figure 3.8: Event model-based graph of an RPC request between a Spring application running on Tomcat and a remotesrv application running on JBoss, and talking to a PostgreSQL database. The boxes with rounded corners represent simple point-in-time "info" events.
Having observed that, as in the preceding example, most execution graphs include well-defined pairs of entry/exit events representing certain operations performed by the application, Sigelman and others [6] proposed a simplified trace model, which made the trace graphs much easier to understand. In Dapper [6], which was designed for Google's RPC-heavy architecture, the traces are represented as trees, where tree nodes are basic units of work referred to as spans. The edges in the tree, as usual, indicate causal relationships between a span and its parent span. Each span is a simple log of timestamped records, including its start and end time, a human-readable operation name, and zero or more intermediary application-specific annotations in the form of (timestamp, description) pairs, which are equivalent to the info events in the previous example.
Figure 3.9: Using the span model to represent the same RPC execution as in Figure 3.8. Left: the resulting trace as a tree of spans. Right: the same trace shown as a Gantt chart. The info events are no longer included as separate nodes in the graph; instead they are modeled as timestamped annotations in the spans, shown as pills in the Gantt chart.
Each span is assigned a unique ID (for example, a random 64-bit value), which is propagated via metadata along with the execution ID. When a new span is started, it records the ID of the previous span as its parent ID, thus capturing the causality. In the preceding example, the remote server represents its main operation in the span with ID=6. When it makes a call to the database, it starts another span with ID=7 and parent ID=6.
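The span model is simple enough to sketch in a few lines of Python. The Span class below is purely illustrative and not tied to any particular tracing library; it only shows how recording the parent span ID captures the causality, mirroring the ID=6/ID=7 example above.

```python
import time
import uuid


class Span:
    def __init__(self, operation, execution_id, parent_id=None):
        self.span_id = uuid.uuid4().hex[:8]
        self.parent_id = parent_id
        self.execution_id = execution_id
        self.operation = operation
        self.start = time.time()
        self.annotations = []  # (timestamp, description) pairs, like "info" events

    def annotate(self, description):
        self.annotations.append((time.time(), description))

    def child(self, operation):
        # causality is captured simply by recording the parent span ID
        return Span(operation, self.execution_id, parent_id=self.span_id)

    def finish(self):
        self.end = time.time()


server_span = Span("remotesrv: handle request", execution_id="exec-42")
server_span.annotate("validated request")
db_span = server_span.child("postgresql: query")  # parent ID = server span ID
db_span.finish()
server_span.finish()
```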
Dapper originally advocated for the model of multi-server spans, where a client application that makes an RPC call creates a new span ID, and passes it as part of the call, and the server that receives the RPC logs its events using the same span ID. Unlike the preceding figure, the multi-server span model resulted in fewer spans in the tree because each RPC call is represented by only one span, even though two services are involved in doing the work as part of that RPC. This multi-server span model was used by other tracing systems, such as Zipkin [7] (where spans were often called shared spans). It was later discovered that this model unnecessarily complicates the post-collection trace processing and analysis, so newer tracing systems like Jaeger [8] opted for a single-host span model, in which an RPC call is represented by two separate spans: one on the client and another on the server, with the client span being the parent.
The tree-like span model is easy to understand for the programmers, whether they are instrumenting their applications or retrieving the traces from the tracing system for analysis. Because each span has only one parent, the causality is represented with a simple call-stack-type view of the computation that is easy to implement and to reason about.
Effectively, traces in this model look like distributed stack traces, a concept very intuitive to all developers. This makes the span model for traces the most popular in the industry, supported by the majority of tracing infrastructures. Even tracing systems that collect instrumentations in the form of single point-in-time events (for example, Canopy [9]) go to the extra effort to convert trace events into something very similar to the span model. Canopy authors claim that "events are an inappropriate abstraction to expose to engineers adding instrumentation to systems," and propose another representation they call modeled trace, which describes the requests in terms of execution units, blocks, points, and edges.
The original span model introduced in Dapper was only able to represent executions as trees. It struggled to represent other execution models, such as queues, asynchronous executions, and multi-parent causality (forks and joins). Canopy works around that by allowing instrumentation to record edges for non-obvious causal relationships between points. The OpenTracing API, on the other hand, sticks with the classic, simpler span model but allows spans to contain multiple "references" to other spans, in order to support joins and asynchronous execution.
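For illustration, here is roughly how such references look in the OpenTracing Python API (we cover the API properly in Chapter 4, Instrumentation Basics with OpenTracing). This sketch assumes the opentracing package; with no concrete tracer registered, the global tracer is a no-op, so it only demonstrates the data model, not actual reporting.

```python
import opentracing

# the global tracer is a no-op unless a real tracer implementation is registered
tracer = opentracing.global_tracer()

# A producer puts a message on a queue and finishes its span.
producer = tracer.start_span("enqueue-message")
producer_ctx = producer.context
producer.finish()

# The consumer runs later and possibly elsewhere; a follows_from reference
# records the causal link without claiming the producer "called" it.
consumer = tracer.start_span(
    "process-message", references=[opentracing.follows_from(producer_ctx)]
)

# A join: a span may reference more than one predecessor.
left = tracer.start_span("branch-a")
right = tracer.start_span("branch-b")
join = tracer.start_span(
    "merge-results",
    references=[opentracing.child_of(left.context),
                opentracing.follows_from(right.context)],
)
for span in (left, right, join, consumer):
    span.finish()
```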
Anyone working with distributed systems programming knows that there is no such thing as accurate time. Each computer has a hardware clock built in, but those clocks tend to drift, and even using synchronization protocols like NTP can only get the servers maybe within a millisecond of each other. Yet we have seen that end-to-end tracing instrumentation captures the timestamp with most tracing events. How can we trust those timestamps?
Clearly, we cannot trust the timestamps to be actually correct, but this is not what we often look for when we analyze distributed traces. It is more important that timestamps in the trace are correctly aligned relative to each other. When the timestamps are from the same process, such as the start of the server span and the extra info annotations in the following diagram, we can assume that their relative positions are correct. The timestamps from different processes on the same host are generally incomparable because, even though they are not subject to hardware clock skew, their accuracy depends on many other factors, such as which programming language is used for a given process and which time libraries it uses and how. The timestamps from different servers are definitely incomparable due to hardware clock drifts, but we can do something about that.
Figure 3.10: Clock skew adjustment. When we know the causality relationships between the events, such as "client-send must happen before server-receive", we can consistently adjust the timestamps for one of the two services, to make sure that the causality constraints are satisfied. The annotations within the span do not need to be adjusted, since we can assume their timestamps to be accurate relative to the beginning and end timestamps of the span.
Consider the client and server spans in the top diagram in Figure 3.10. Let's assume that we know from instrumentation that this was a blocking RPC request, that is, the server could not have received the request before the client sent it, and the client could not have received the response before the server finished the execution (this reasoning only works if the client span is longer than the server span, which is not always the case). These basic causality rules allow us to detect that the server span is misaligned on the timeline based on its reported timestamps, as we can see in the example. However, we don't know by how much it is misaligned.
We can adjust the timestamps for all events originating from the server process by shifting them to the left until the span's start and end events fall within the time range of the larger client span, as shown at the bottom of the diagram. After this adjustment, we end up with two unknowns: the gap t1 between the start of the client span and the start of the server span, and the gap t2 between the end of the server span and the end of the client span. If there are no more occurrences of client and server interaction in the given trace, and no additional causality information, we can make an arbitrary decision on how to set these variables, for example, by positioning the server span exactly in the middle of the client span:

t1 = t2 = (duration of client span - duration of server span) / 2

The values of t1 and t2
calculated this way provide us with an estimate of the time spent by RPC in network communication. We are making an arbitrary assumption that both request and response took roughly the same time to be transmitted over the network. In other cases, we may have additional causality information from the trace, for example the server may have called a database and then another node in the trace graph called the same database server. That gives us two sets of constraints on the possible clock skew adjustment of the database spans. For example, from the first parent we want to adjust the database span by -2.5 ms and from the second parent by -5.5 ms. Since it's the same database server, we only need one adjustment to its clock skew, and we can try to find the one that works for both calling nodes (maybe it's -3.5 ms), even though the child spans may not be exactly in the middle of the parent spans, as we have arbitrarily done in the preceding formula.
In general, we can walk the trace and aggregate a large number of such constraints. Then we can solve them as a set of linear equations to obtain a full set of clock skew adjustments that we can apply to the trace to align the spans.
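As a small worked example with made-up timestamps, the midpoint heuristic for a single client/server pair can be computed as follows:

```python
# Timestamps in milliseconds: the server span is reported on a skewed clock.
client_start, client_end = 100.0, 160.0     # measured on the client's clock
server_start, server_end = 130.0, 175.0     # measured on the server's clock

client_duration = client_end - client_start
server_duration = server_end - server_start

# Place the server span exactly in the middle of the client span:
# t1 = t2 = (client duration - server duration) / 2
t = (client_duration - server_duration) / 2.0
skew_adjustment = (client_start + t) - server_start

# Shift every timestamp reported by the server process by the same amount.
adjusted_server_start = server_start + skew_adjustment
adjusted_server_end = server_end + skew_adjustment

print(skew_adjustment)                              # -22.5 ms in this example
print(adjusted_server_start, adjusted_server_end)   # 107.5, 152.5
```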
In the end, the clock skew adjustment process is always heuristic, since we typically don't have other reliable signals to calculate it precisely. There are scenarios when this heuristic technique goes wrong and the resulting trace views make little sense to the users. Therefore, the tracing systems are advised to provide both adjusted and unadjusted views of the traces, as well as to clearly indicate when the adjustments are applied.
Once the trace records are collected and normalized by the tracing infrastructure, they can be used for analysis, using visualizations or data mining algorithms. We will cover some of the data mining techniques in Chapter 12, Gathering Insights with Data Mining.
Tracing system implementers are always looking for new creative visualizations of the data, and end users often build their own views based on specific features they are looking for. Some of the most popular and easy-to-implement views include Gantt charts, service graphs, and request flow graphs.
We have seen examples of Gantt charts in this chapter. Gantt charts are mostly used to visualize individual traces. The x axis shows relative time, usually from the beginning of the request, and the y axis represents different layers and components of the architecture participating in the execution of the request. Gantt charts are good for analyzing the latency of the requests, as they easily show which spans in the trace take the longest time, and combined with critical path analysis can zoom in on problematic areas. The overall shape of the chart can reveal other performance problems at a glance, like the lack of parallelism among sub-requests or unexpected synchronization/blocking.
Service graphs are constructed from a large corpus of traces. Fan-outs from a node indicate calls to other components. This visualization can be used for analysis of service dependencies in large microservices-based applications. The edges can be decorated with additional information, such as the frequency of calls between two given components in the corpus of traces.
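Building such a graph from span data is straightforward; the following sketch, using hypothetical span records, counts caller-to-callee edges across a small corpus:

```python
from collections import Counter

# Each span carries its service name and its parent span ID (illustrative data).
spans = [
    {"span_id": "1", "parent_id": None, "service": "frontend"},
    {"span_id": "2", "parent_id": "1", "service": "checkout"},
    {"span_id": "3", "parent_id": "2", "service": "payments"},
    {"span_id": "4", "parent_id": "1", "service": "catalog"},
    {"span_id": "5", "parent_id": "1", "service": "catalog"},
]

by_id = {s["span_id"]: s for s in spans}
edges = Counter()
for span in spans:
    parent = by_id.get(span["parent_id"])
    # count only cross-service edges; intra-service spans do not add dependencies
    if parent and parent["service"] != span["service"]:
        edges[(parent["service"], span["service"])] += 1

for (caller, callee), count in edges.items():
    print(f"{caller} -> {callee}: {count} call(s)")
```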
Request flow graphs represent the execution of individual requests, as we have seen in the examples in the section on the event model. When using the event model, the fan-outs in the flow graph represent parallel execution and fan-ins are joins in the execution. With the span model, the flow graphs can be shown differently; for example, fan-outs can simply represent the calls to other components similar to the service graph, rather than implying concurrency.
This chapter introduced the fundamental principles underlying most open source, commercial, and academic distributed tracing systems, and the anatomy of a typical implementation. Metadata propagation is the most popular and most frequently implemented approach to correlating tracing records with a particular execution and capturing causal relationships. The event model and the span model are two competing trace representations, trading expressiveness for ease of use.
We briefly mentioned a few visualization techniques; more examples of visualizations and data mining use cases will be discussed in subsequent chapters.
In the next chapter, we will go through an exercise to instrument a simple "Hello, World!" application for distributed tracing, using the OpenTracing API.