OpenTracing
emerging industry standard for distributed tracing
Table of Contents

Introduction
OpenTracing basics
OpenTracing API
Context propagation
Distributed tracers
    Zipkin
        Span ingestion
        Storage
    Jaeger
        Span ingestion
        Storage
    Zipkin vs Jaeger
    Other tracers
Supported instrumentation
1 Introduction
As organizations are embracing the cloud-native movement and thus migrating their
applications from monolithic to microservice architectures, the need for general visibility
and observability into software behavior becomes an essential requirement. Because the
monolithic code base is segregated into multiple independent services running inside their
own processes, which in addition can scale to various instances, such a trivial task as
diagnosing the latency of an HTTP request issued from the client can end up being a
serious deal. To fulfill the request, it has to propagate through load balancers, routers,
gateways, cross machine’s boundaries to communicate with other microservices, send
asynchronous messages to message brokers, etc. Along this pipeline, there could be a
possible bottleneck, contention or communication issue in any of the aforementioned
components. Debugging through such a complex workflow wouldn’t be feasible if not
relying on some kind of tracing/instrumentation mechanism. That’s why distributed
tracers like Zipkin, Jaeger or AppDash were born (most of them are inspired on Google’s
Dapper large-scale distributed tracing platform). All of the aforementioned tracers help
engineers and operation teams to understand and reason about system behavior as
complexity of the infrastructure grows exponentially. Tracers expose the source of truth
for the interactions originated within the system. Every transaction (if properly
instrumented) might reflect performance anomalies in an early phase when new services
are being introduced by (probably) independent teams with polyglot software stacks and
continuous deployments.
However, each of the tracers stick with its proprietary API and other peculiarities that
makes it costly for developers to switch between different tracer implementations. Since
implanting instrumentation points requires code modification, OSS services, application
frameworks and other platforms would have hard time if tying to a single tracer vendor.
OpenTracing aims to offer a consistent, unified and tracer-agnostic instrumentation API for a wide range of frameworks, platforms and programming languages. It abstracts away the differences among numerous tracer implementations, so shifting from an existing tracer to a new one would only require configuration changes specific to that new tracer. It is worth summarizing the benefits of distributed tracing:

- out-of-the-box infrastructure overview: how the interactions between services are carried out and what their dependencies are
- efficient and fast detection of latency issues
- intelligent error reporting: spans transport error messages and stack traces, and we can take advantage of that insight to identify root cause factors or cascading failures
- trace data can be forwarded to log processing platforms for query and analysis
2 OpenTracing basics
In a distributed system, a trace encapsulates the transaction's state as it propagates through the system. During the journey of the transaction, it can create one or multiple spans. A span represents a single unit of work inside a transaction, for example, an RPC client/server call, sending a query to the database server, or publishing a message to the message bus. In terms of the OpenTracing data model, a trace can also be seen as a collection of spans structured around a directed acyclic graph (DAG). The edges indicate the causal relationships (references) between spans. A span is identified by its unique ID, and optionally may include a parent identifier. If the parent identifier is omitted, the span is called a root span. A span also comprises a human-readable operation name and start and end timestamps. All spans are grouped under the same trace identifier.
The diagram above depicts the transit of a hypothetical RPC request. The client makes an HTTP request to the server, which results in generating one parent span. In order to satisfy the client's request, the server sends a query to the storage engine. That operation produces one more span. The response from the database engine to the server, and from the server to the client, creates two additional spans.
Spans may contain tags that represent contextual metadata relevant to a specific request.
Tags consist of an unbounded sequence of key-value pairs, where keys are strings and values can be strings, numbers, booleans or date data types. Tags allow for context enrichment that may be useful for monitoring or debugging system behavior.
While not mandatory, it's highly recommended to follow the OpenTracing semantic conventions when naming tags. For instance, we should assign the component tag to the framework, module or library which generates the span(s), use peer.hostname and peer.port to describe target hosts, etc. Another reason for tag standardization is making the tracer aware of the existence of certain tags, which can add intelligence or instruct the tracer to put special emphasis on them.
As illustrated in Figure 2, the spans are annotated with tags that obey the OpenTracing semantic conventions. Furthermore, the spans are rendered in a different type of chart. This waterfall-like visualization adds the dimension of time and thus makes it easier to spot the duration of each span.
Besides tags, OpenTracing has a notion of log events. They represent timestamped textual (although not limited to textual content) annotations that may be recorded along the duration of a span. Events could express any occurrence of interest to the active span, like timer expiration, cache miss events, build or deployment starting events, etc.
Baggage items allow for cross-span propagation, i.e., they let us associate metadata that also propagates to future children of the root span. In other words, the local data is transported along the full path as the request travels downstream through the system. However, this powerful feature should be used carefully because it can easily saturate network links if the propagated items are injected into many descendant spans.
At the time of writing, OpenTracing supports two types of relationships:

- ChildOf – expresses a causal reference between two spans. Continuing with our RPC scenario, the server-side span would be the ChildOf the initiator (request) span.
- FollowsFrom – used when the parent span isn't linked to the outcome of the child span. This relationship is usually used to model asynchronous executions, like emitting messages to the message bus.
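To make the two reference types concrete, here is a minimal, self-contained sketch in plain Java. The `Span` record and `RefType` enum are illustrative assumptions, not the real OpenTracing types; they only model how spans point back to their causal parents.

```java
import java.util.List;

public class SpanReferences {
    enum RefType { CHILD_OF, FOLLOWS_FROM }

    // A simplified span: id, operation name, and a causal reference to a parent span.
    record Span(String id, String operation, String parentId, RefType ref) {}

    public static void main(String[] args) {
        // Synchronous RPC: the server span is a ChildOf the client (request) span,
        // because the client waits on its outcome.
        Span client = new Span("1", "http-request", null, null); // root span, no parent
        Span server = new Span("2", "handle-request", "1", RefType.CHILD_OF);

        // Asynchronous work: the consumer span FollowsFrom the producer span,
        // because the producer does not depend on the consumer's outcome.
        Span producer = new Span("3", "publish-message", "1", RefType.CHILD_OF);
        Span consumer = new Span("4", "consume-message", "3", RefType.FOLLOWS_FROM);

        for (Span s : List.of(client, server, producer, consumer)) {
            System.out.println(s.operation() + " -> parent=" + s.parentId() + " ref=" + s.ref());
        }
    }
}
```

The four spans above form the DAG described earlier: edges (references) point from child spans back to the spans that caused them.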
3 OpenTracing API
The OpenTracing API is modeled around three fundamental types:

- Tracer – knows how to create a new span as well as inject/extract span contexts across process boundaries. All OpenTracing-compatible tracers must provide a client with an implementation of the Tracer interface.
- Span – the tracer's buildSpan method yields a newly created span. We can invoke a number of operations after the span has been started, like aggregating tags, changing the span's operation name, binding references to other spans, adding baggage items, etc.
- SpanContext – the consumers of the API only interact with this type when injecting/extracting the span context from the transport protocol.
Let's see some code. Although we'll focus on Java, the API semantics are identical (or at least they should be) for any other programming language (OpenTracing has API specs for Go, Python, JavaScript, Java, C#, Objective-C, C++, Ruby and PHP).
Figure 4 represents the role of OpenTracing API instrumentation within the tracing landscape. It's important for the tracer clients to be compatible with the OpenTracing specification. For instance, we could be biased towards the Zipkin tracing system. The instrumentation points in our applications are created via the OpenTracing API even though we're using Zipkin clients for span reporting. After evaluating other tracers, we could figure out that Jaeger better fits our needs. In that case, switching from Zipkin to Jaeger would be a matter of registering the corresponding instance of the tracer, while the instrumentation points would remain the same, i.e., we wouldn't have to adapt any code.
Because the Jaeger tracer is compatible with Zipkin span formats, we could use the same Zipkin client to submit span requests to Jaeger.
Before being able to create a span, we have to register the tracer. This step is tied to the particular tracer implementation, but it basically consists of indicating the tracer's endpoint and the component which sends the instrumentation data to the tracer. In the case of the Jaeger tracer, we would have the following code snippet:
import com.uber.jaeger.Configuration;
import io.opentracing.util.GlobalTracer;

Configuration config = new Configuration(component,
        new Configuration.SamplerConfiguration("const", 1),
        new Configuration.ReporterConfiguration(true, host, port, 1000, 10000));
GlobalTracer.register(config.getTracer());
To start a new span, use the buildSpan method within a try block, which automatically finishes the span and handles any exceptions:
io.opentracing.Tracer tracer = GlobalTracer.get();
try (ActiveSpan span = tracer.buildSpan("create-octi")
        .setTag("http.url", "/api/octi")
        .setTag("http.method", "POST")
        .setTag("peer.hostname", "apps.sematext.com")
        .startActive()) {
    // HTTP request code here
}
Tracer registration and span management can be simplified with the Sematext opentracing-common library.
TracerInitializer tracerInitializer = new TracerInitializer(Tracers.ZIPKIN);
tracerInitializer.setup("localhost", 9411, "log-service");

SpanOperations spanOps = new SpanTemplate(tracerInitializer.getTracer());
try (ActiveSpan span = spanOps.startActive("create-octi")) {
    // add tags and make the HTTP request
}
The best practice is to create the instances of TracerInitializer and SpanTemplate via a dependency injection container, so those references can be reused from any place within the application.
4 Context propagation
One of the most compelling and powerful features attributed to tracing systems is distributed context propagation. Context propagation composes the causal chain and dissects the transaction from inception to finalization – it illuminates the request's path to its final destination.
From a technical point of view, context propagation is the ability of the system or application to extract the propagated span context from a variety of carriers, like HTTP headers, AMQP message headers or Thrift fields, and then join the trace from that point. Context propagation is very efficient since it only involves propagating identifiers and baggage items. All other metadata, like tags and logs, isn't propagated but is transmitted asynchronously to the tracer system. It's the responsibility of the tracer to assemble and construct the full trace from distinct spans that might be injected in-band or out-of-band.
Figure 5 illustrates the flow of context propagation. A request hits the first service (probably triggered by user interaction from a mobile/web application). At this point no active span is scheduled, so service 1 will start a new span and populate the tags to contextualize the request. This is the parent of the subsequent spans. Let's suppose the context is injected and carried to service 2 (which lives on another machine) via HTTP headers. Service 2 attempts to extract the span context from the headers. If the context is decoded successfully, another child span will be generated under the same trace identifier. As we already pointed out, only identifiers are propagated – tags contributed by individual spans are sent out of band. Continuing along the path, service 3 deserializes the span context, might add tags, baggage items, etc. Then it injects the context, crosses process boundaries to collaborate with the next service, and so on until the causal chain is completed.
OpenTracing standardizes context propagation across process boundaries via the Inject/Extract pattern.
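The pattern can be sketched end-to-end in plain Java. The `SpanContext` record, the `inject`/`extract` helpers and the header names below are illustrative assumptions, not the real OpenTracing types; they only show how identifiers travel in-band through a header-like carrier:

```java
import java.util.HashMap;
import java.util.Map;

public class ContextPropagation {
    // Simplified span context: only identifiers travel in-band.
    record SpanContext(String traceId, String spanId) {}

    // Inject: serialize the span context into an HTTP-header-like carrier.
    static void inject(SpanContext ctx, Map<String, String> headers) {
        headers.put("x-trace-id", ctx.traceId());
        headers.put("x-span-id", ctx.spanId());
    }

    // Extract: rebuild the span context on the receiving side to join the trace.
    static SpanContext extract(Map<String, String> headers) {
        return new SpanContext(headers.get("x-trace-id"), headers.get("x-span-id"));
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        inject(new SpanContext("176dde0179621d08", "1"), headers); // service 1 (client side)
        SpanContext remote = extract(headers);                     // service 2 (server side)
        // A child span started here would reuse the propagated trace identifier.
        System.out.println("joined trace " + remote.traceId());
    }
}
```

In the real API, the carrier-specific encoding lives behind the Tracer's inject/extract methods, so application code never touches header names directly.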
5 Distributed tracers
OpenTracing hides the differences between distributed tracer implementations, so in order to instrument an application via the OpenTracing standard, it's necessary to have an OpenTracing-compatible tracer correctly deployed and listening for incoming span requests. The following section is a breakdown of some prominent distributed tracers.
5.1 Zipkin
Zipkin is a distributed tracing system implemented in Java, with an OpenTracing-compatible API. It's responsible for span ingestion and storage, providing a number of collectors (HTTP, Kafka, Scribe) as well as storage engines (in-memory, MySQL, Cassandra, Elasticsearch). The UI is a self-contained web application (although it can be served separately) and is used to explore the traces and their associated spans.
Spans may be sent to collectors out-of-band, i.e., the data is reported asynchronously to Zipkin once the span is completed and trace/span identifiers don't have to propagate downstream, or in-band, if context propagation is required and headers are used to transport the identifiers.
The component that is responsible for transporting spans is called a reporter. Every instrumented application contains a reporter. It records timing metrics, associates metadata and routes the spans to the collector.
To get started with Zipkin, download and run zipkin-server as a standalone jar (note that JRE 8 is required to bootstrap the Zipkin server):
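The exact commands were shown as a screenshot; a sketch, assuming Zipkin's standard quickstart script and jar name:

```shell
# download the latest zipkin-server standalone jar via the quickstart script
curl -sSL https://zipkin.io/quickstart.sh | bash -s
# start the server (requires JRE 8); the UI listens on port 9411 by default
java -jar zipkin.jar
```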
After server startup is done, you should see output like in the image below.
Alternatively, run the containerized Zipkin server from Docker image:
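A sketch of the Docker invocation, assuming the official openzipkin/zipkin image:

```shell
# pull (if needed) and run the latest Zipkin image, detached,
# with the UI/collector port published on the host
docker run -d -p 9411:9411 openzipkin/zipkin
```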
The command above will:
- fetch the latest zipkin image from the remote Docker repository
- expose the port 9411 on the host machine so you can browse the UI on
http://localhost:9411
- run the container in detached mode
Run docker ps to make sure the container is in the running state.
5.1.1 Span ingestion
The collectors are responsible for forwarding span requests to the storage layer. The HTTP collector is the default ingress point for the span stream.
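For illustration, a list of finished spans can be POSTed directly to the HTTP collector; the JSON body below is a minimal assumed example:

```shell
# report a finished span to the default HTTP collector endpoint
curl -X POST http://localhost:9411/api/v1/spans \
  -H 'Content-Type: application/json' \
  -d '[{"traceId":"176dde0179621d08","id":"176dde0179621d08","name":"create-app","timestamp":1502364354905221,"duration":1988}]'
```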
Other than the HTTP collector, Zipkin also offers Kafka and Scribe collectors for span ingestion.
5.1.2 Storage
As mentioned above, Zipkin supports in-memory, MySQL, Cassandra and Elasticsearch storage engines. The in-memory store comes in handy for dev environments and for POC scenarios where persistence is not required. The MySQL storage type is discouraged for production environments due to known performance issues. For production workloads, Cassandra or Elasticsearch are more suitable options.
To enable the Elasticsearch storage, export the STORAGE_TYPE and ES_HOSTS environment variables.
NOTE: if X-Pack is enabled (the default option in the official Elastic Docker image), you'll need to provide the credentials for the REST API endpoint via the ES_USERNAME and ES_PASSWORD environment variables.
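A sketch of the startup, assuming the standard Zipkin server variable names and a local Elasticsearch node:

```shell
# point Zipkin at an Elasticsearch cluster instead of the in-memory store
export STORAGE_TYPE=elasticsearch
export ES_HOSTS=http://localhost:9200
# with X-Pack security enabled, also export ES_USERNAME and ES_PASSWORD
java -jar zipkin.jar
```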
The following mapping is used to describe the structure of the spans:
"mappings" : {
"servicespan" : {
"_all" : {
"enabled" : false
},
"properties" : {
"serviceName" : {
"type" : "keyword",
"ignore_above" : 256
},
"spanName" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"_default_" : {
"_all" : {
"enabled" : false
}
},
"span" : {
"_all" : {
"enabled" : false
},
"properties" : {
"annotations" : {
"type" : "nested",
"dynamic" : "false",
"properties" : {
"endpoint" : {
"dynamic" : "false",
"properties" : {
"serviceName" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"value" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"binaryAnnotations" : {
"type" : "nested",
"dynamic" : "false",
"properties" : {
"endpoint" : {
"dynamic" : "false",
"properties" : {
"serviceName" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"key" : {
"type" : "keyword",
"ignore_above" : 256
},
"value" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"duration" : {
"type" : "long"
},
"id" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"name" : {
"type" : "keyword",
"ignore_above" : 256
},
"timestamp" : {
"type" : "long"
},
"timestamp_millis" : {
"type" : "date",
"format" : "epoch_millis"
},
"traceId" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"dependencylink" : {
"enabled" : false,
"_all" : {
"enabled" : false
}
  }
}
These are some of the most relevant document fields:

- traceId – a unique identifier for the trace
- id – the span identifier
- name – the name of the operation associated with the span
- duration – the duration of the span in microseconds
- timestamp – the span start time expressed in epoch microseconds
- binaryAnnotations – an array of tags associated with the span
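Since spans land in Zipkin's daily indexes (zipkin-YYYY-MM-DD by default), they can be inspected straight from Elasticsearch; the trace id below is an assumed example:

```shell
# look up all spans belonging to one trace across Zipkin's daily indexes
curl -s 'http://localhost:9200/zipkin-*/_search?q=traceId:176dde0179621d08&pretty'
```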
Here is an example of an indexed document related to the instrumentation of SQL statements:
{
"timestamp_millis" : 1502364354905,
"traceId" : "176dde0179621d08",
"id" : "176dde0179621d08",
"name" : "create-app",
"timestamp" : 1502364354905221,
"duration" : 1988,
"binaryAnnotations" : [
{
"key" : "db.instance",
"value" : "apps",
"endpoint" : {
"serviceName" : "opentracing-jdbc",
"ipv4" : "192.168.1.23"
}
},
{
"key" : "db.statement",
"value" : "INSERT INTO apps (name) VALUES (slack)",
"endpoint" : {
"serviceName" : "opentracing-jdbc",
"ipv4" : "192.168.1.23"
}
},
{
"key" : "db.type",
"value" : "sql",
"endpoint" : {
"serviceName" : "opentracing-jdbc",
"ipv4" : "192.168.1.23"
}
}
]
}
5.2 Jaeger
Despite not being as mature as Zipkin, Jaeger is another distributed tracing system that is seeing massive adoption. The backend is implemented in Go and it has support for in-memory, Cassandra and Elasticsearch span stores.
Jaeger's architecture is built with scalability and parallelism in mind. The client emits the traces to the agent, which listens for inbound spans and routes them to the collector. The responsibility of the collector is to validate, transform and store the spans in the persistent storage. To access tracing data from the storage, the query service exposes a REST API endpoint and a React-based UI.
Jaeger can be installed from sources using the Go toolchain (Go 1.7 and the glide and yarn package managers are necessary to run the build process):
The command above will build and run all components (agent, collector, query) together with the in-memory storage enabled.
If that’s too much pain, we can fetch the official Docker image and spawn a container:
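A sketch of the all-in-one container invocation; the image name and port list follow Jaeger's getting-started conventions and may differ between releases:

```shell
# run all Jaeger components (agent, collector, query, in-memory store) in one container
docker run -d \
  -p 5775:5775/udp -p 6831:6831/udp -p 6832:6832/udp \
  -p 16686:16686 -p 14268:14268 \
  jaegertracing/all-in-one:latest
```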
Additionally, you can run each of the components in a separate container by pulling the corresponding image. To build the images manually and orchestrate the execution of the containers, use this docker-compose deployment descriptor.
To explore the traces, navigate to http://localhost:16686.