Uber Research Introduces ‘CRISP’: A Tool To Extract Critical Paths From Numerous Jaeger Traces

Source: https://eng.uber.com/crisp-critical-path-analysis-for-microservice-architectures/

Microservices architecture consists of a collection of small, self-contained services. Each service should implement a particular business capability and be self-contained within a bounded context. A bounded context is a natural division inside a firm that provides an explicit boundary within which a domain model can exist. 

Uber’s backend is the epitome of such microservice structures. Uber has numerous microstructures interacting with each other through remote procedure calls(RPC). A remote procedure call (RPC) is when a computer program causes a procedure (subroutine) to operate in a different address space without the programmer explicitly specifying the details for the remote interaction.

However, the life of a request call creates much complexity in terms of the interactions. As a result of the nested and asynchronous nature of these interactions, a need arises to invoke numerous other downstream options. Hence, determining which underlying service contributes to the entire end-to-end latency encountered by a top-level request is extremely difficult. Other factors such as time-to-live error rates also fall within the scope of interest.

Uber introduces a tool: CRISP, to pinpoint and quantify underlying services that affect the overall latency in a microservice architecture. The name CRISP is derived from the words critical and span. It is essential to figure out the critical path in a complex service dependency graph, and CRISP does precisely this. In a microservice dependency graph, the critical path is the longest chain of dependent tasks. To lower a request’s end-to-end latency, the critical path length must be reduced. As a result, prioritizing the services that are on the critical path benefits a latency optimization endeavor.

CRISP generates the critical route via a request’s graph of dependencies using JaegerTM’s RPC tracing tool. CRISP has knobs that allow you to explore the details using different percentile values.

Critical Path:

RPC generally results in a direct acyclic graph (DAG). The computational DAG comprises nodes and edges, with nodes representing serial execution units and edges representing task interdependence. The total effort is the sum of all node weights, and the critical path is the DAG’s longest weighted path. In parallel execution, the crucial path is the longest path through the DAG.

Jaeger™ Tracing Tool:

For tracing interactions across microservices, Uber heavily relies on Jaeger, an open-source utility software. A Jaeger trace is a JSON blob made up of several “spans.” A span is a unit of serial labor like a weighted node in the computation of DAG. Except for the root span, every span has a reference to its parent span, which defines the caller-callee relationship between spans. The connection comes in two flavors:

  • Child-of: represents a request-response relationship where the parent waits for the child to finish
  • Follows-from: represents a “fire-and-forget” mode of operation where the parent does not wait for the child to complete.

Although it is possible to collect a reliable dependency graph of RPC calls for a given request, the amount of data would be restricted if all submissions were traced. To tackle this issue, a sampling strategy is employed to collect only a fraction of requests. When a request is flagged for Jaeger monitoring at the system’s entry point, all of the request’s following downstream calls are gathered into a single trace.

Unfortunately, in a complicated microservice context, the visual evaluation of the Jaeger traces is complex. This necessitates software development that can evaluate and summarise traces and turn them into actionable tasks for engineers.

The start and finish of service are represented by “spans,” and each span can contain child spans for the RPC calls that the service makes to downstream services. The “spawn” and “sync” synchronization events are represented by the beginning and end of these child spans. Speeding up the components on the critical path is strictly necessary to speed up the service. 

Replaying the trace in reverse order from the end to the beginning, looking for synchronization events, is how the critical route is computed from an individual trace.

Critical Path Algorithm

Starting with the top-level span, the CP algorithm ranks all children according to their completion time. The CP algorithm is responsible for all of the components on the critical path. The procedure starts with the latest finishing child (LFC) of the top-level Jaeger “span.” It recursively invokes CP on LFC, which estimates the critical route for a specific Jaeger span. When the CP algorithm exits recursion, it crawls backward and chooses another child who returned before LFC began.

There are two views on the critical path time on every operation, let’s say S:

  • Inclusive Time: This metric is the sum of the wall clock time spent on the S and the wallclock time spent on the transitive closure of all other S calls. This is also solely the execution time of S.
  • Exclusive Time: This metric represents the amount of time spent in S as measured by the wall clock. If T appears on the critical route, this metric ignores the time spent in S’s children’s operation.

Dealing with Clock Skew:

To insert start and end times for spans, Jaeger spans use the local clock of the host servicing its RPC. Although all times are in UTC, clocks may have arbitrary skews. As a result, a parent span may begin after its child span, or a child span may conclude after the end of its parent span. The parent-child bond is the only reality in Jaeger’s traces.

The clock skew could lead into one of the below issues:

  • Serial Child Invocation Appearing as Overlapping Children:

Consider operation A, a simple wrapper that launches three underlying operations in that order: B, C, and D. Before executing the next call, parent A waits for the previous one to return. However, operation B slightly overlaps with operation C in the Jaeger trace because of the temporal skew. Operation C also overlaps with D. Only D and B would be added to the critical path by the CP method, while C would be left out. The extra hours would be assigned to A by the algorithm. To fix this issue of “misattribution,” provide a short overlap between the overlapping children.

  • Child span outside parent’s span time range: 

Consider that child B finishes after parent A in the recorded time. In such cases, B is trimmed to A’s completion time. This recursive process changes C’s end time to that of B’s now-adjusted end time.

Aggregating multiple path times:

Each trace generates a unique critical path. The purpose is to provide essential guidance on which services contribute to end-to-end latency (and how much). The data acquired from several traces in each critical path is summed up into an aggregate. The aggregation process essentially adds up the time contributions of each operation on each important channel to create a global pool of operations. The critical path’s casualty/time dimension is crushed during the aggregation phase.

The results are presented in multiple ways. It is an HTML file that includes: 

  • Flame Graphs
  • Heat Map
  • Textual Summary

Hardware Selection:

Uber’s data center is composed of a wide range of hardware CPU SKUs. Services can be installed on a variety of hardware. As a result, a span executed on behalf of a request may operate on additional hardware on different requests. The critical path for one of Uber’s vital services was collected using CRISP and identified that a downstream operation, “Op-M,” was on the critical path. 

It is concluded that a faster CPU clock has a substantial influence on lowering tail and total latency. Since tail latency is heavily employed in capacity allocation, this discrepancy considerably impacts overall capacity allocation. Critical pathways must now be classified as CPU SKU sensitive or insensitive. This classification enables data center-wide microservice schedulers to prioritize SKU-sensitive services on the critical path onto their SKUs, where they perform better.

Uber uses CRISP to collect and visualize performance statistics for a variety of microservices regularly. CRISP has been efficient enough to analyze and visualize complicated microservice interactions using heatmaps and flame graphs to provide short critical route summary information. CRISP insights are assisting us in identifying and prioritizing various performance issues.

Github: https://github.com/uber-research/CRISP

Reference: https://eng.uber.com/crisp-critical-path-analysis-for-microservice-architectures/