Jaeger tracing for observability beginners [Quick Guide]

Jaeger is a distributed tracing tool that helps engineers track down issues across 10s, 100s and even 1000s of services and their dependencies.

In technical terms, Jaeger collects “tracing data” from distributed services and feeds dashboards (its own Jaeger UI, or tools like Grafana) that highlight errors and downtime or slow-load risk.

This makes it an essential component of a strong observability practice. Observability depends on effective tooling and practices around logging, monitoring and tracing.

In this quick guide, we’ll explore a few tactics to run Jaeger well on Kubernetes. But before we do that, let’s uncover its origin story…

Jaeger was created in 2015 by an engineer at Uber, Yuri Shkuro, who wanted to help engineers work out where issues were popping up. As Uber's architecture grew, this became a critical need.

Above: a glimpse of the microservices that drive the Uber app. A large number of these services get triggered every time you request an Uber ride. (Source: YouTube, Jaeger Intro – Yuri Shkuro)

The Uber app may seem simple at first glance, but it ran and still runs as a complex network of microservices. Many of these services depend on other services as well as their own sub-services.

A weakness anywhere in the service chain can put the whole user request at risk. In business terms, Uber stands to lose ride fares at scale if even one of the component services fails or slows down.

“In deep distributed systems, finding what is broken and where is often more difficult than why.”

— Yuri Shkuro, Founder & Maintainer, CNCF Jaeger

Jaeger helps engineers find out what services are experiencing issues and where. That way, they can fix small issues before they snowball into serious problems or even crises.

Do your observability needs justify using Jaeger?

You might be wondering whether you even need Jaeger.

After all, your use case might not be as complex as Uber’s. Yuri created Jaeger to handle a complex web of services and millions of requests per day.

Simply put, tracing is not an absolute must-have for simpler architectures. But it is handy for finding bottlenecks if your application runs off more than a handful of services, say more than 10.

Also, imagine this situation. Your application suddenly gets a traffic spike and a large volume of requests are not completing. How will you find the culprit fast enough to fix the issue?

Feel the need for Jaeger’s capabilities? Let’s explore how it works.

How Jaeger tracing works

Step 1

Jaeger Agent gathers “span data” from the UDP packets that instrumented microservices emit (each service's Jaeger client decides which requests to sample)

Step 2

The Agent batches the span data (service name, start time, duration) and forwards it to the Collector

Step 3

The Collector writes the data to storage, which feeds two consumers: analytics and the visual dashboard (the Jaeger UI)

Et voilà!

Above: this is what tracing data can look like in the Jaeger UI (Source: YouTube, Jaeger Intro – Yuri Shkuro)
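
To make steps 1–3 concrete, here is a minimal sketch of the service side of that pipeline, using the Python jaeger-client library (since superseded by OpenTelemetry, but the flow is the same). The service name, operation names and tags below are made up for illustration; the tracer samples spans and ships them over UDP to the Agent on port 6831.

  from jaeger_client import Config

  # Initialise a tracer that reports spans to a local Jaeger Agent over UDP.
  config = Config(
      config={
          "sampler": {"type": "const", "param": 1},  # trace everything (dev only)
          "local_agent": {"reporting_host": "localhost", "reporting_port": 6831},
      },
      service_name="ride-service",  # hypothetical service name
  )
  tracer = config.initialize_tracer()

  # Each unit of work becomes a span; nested spans form the trace you see in the UI.
  with tracer.start_span("request-ride") as span:
      span.set_tag("city", "amsterdam")
      with tracer.start_span("find-driver", child_of=span) as child:
          child.set_tag("search_radius_km", 5)

  tracer.close()  # flush buffered spans before the process exits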

Now let’s explore how to install Jaeger on a Kubernetes cluster

2 ways to install Jaeger on K8s

I will assume that you know how Kubernetes clusters are structured in terms of containers, nodes, pods, sidecars etc.

Jaeger Agent can run on a Kubernetes cluster in two distinct ways: as a DaemonSet or as a sidecar. Let's compare the two.

Set up Jaeger as a DaemonSet

Mechanism: Jaeger Agent runs as one pod per node and receives span data from all other pods on that node

Useful for: single-tenant or non-production clusters

Benefits: lower memory overhead, simpler setup

Risk: security risk if deployed on a multi-tenant cluster

LEARN BY DOING: simple Jaeger setup tutorial via DigitalOcean
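
In the DaemonSet layout, each service pod needs to find the Agent running on its own node. A common (but not Jaeger-mandated) pattern is to inject the node's IP into an environment variable via the Kubernetes downward API (fieldRef: status.hostIP) and point the tracer at it, roughly like this:

  import os
  from jaeger_client import Config

  # JAEGER_AGENT_HOST is assumed to be set in the pod spec via the downward
  # API so that it resolves to the node-local DaemonSet Agent.
  agent_host = os.environ.get("JAEGER_AGENT_HOST", "localhost")

  config = Config(
      config={
          "local_agent": {"reporting_host": agent_host, "reporting_port": 6831},
      },
      service_name="ride-service",
  )
  tracer = config.initialize_tracer()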

Set up Jaeger as a sidecar

Mechanism: Jaeger Agent runs as a container alongside the service container in every pod

Useful for: multi-tenant clusters, public cloud clusters

Benefits: granular control, higher security potential

Risk: more DevOps supervision required

LEARN BY DOING: deploy Jaeger as a sidecar via Jaeger's GitHub
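
The sidecar layout is simpler from the service's point of view: the Agent container shares the pod's network namespace, so the tracer just reports to localhost (the Jaeger Operator can even inject the sidecar for you). A minimal sketch, again with the Python jaeger-client:

  from jaeger_client import Config

  # The Agent sidecar lives in the same pod, so localhost is always correct.
  config = Config(
      config={
          "local_agent": {"reporting_host": "localhost", "reporting_port": 6831},
      },
      service_name="ride-service",
  )
  tracer = config.initialize_tracer()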

Remember earlier when I mentioned that Jaeger samples the span data that services transmit over UDP?

Well, there are 2 sampling methods, head-based sampling and tail-based sampling. Each has its own benefits and downsides.

Let’s explore:

Head-based sampling

Also known as: upfront sampling

Mechanism: sampling decision is made prior to request completion

Useful for: high-throughput use cases, looking at aggregated data

Benefits: cheaper sampling method – lower network and storage overhead

Risk: potential to miss outlier requests due to less than 100% sampling

Work required: easy setup, supported by Jaeger SDKs

Config notes: sample on a probabilistic coin flip or up to a fixed rate of traces per second
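
Head-based sampling is configured on the tracer itself. As a rough sketch with the Python jaeger-client, the "probabilistic" sampler is the coin flip, and the "ratelimiting" sampler caps traces per second instead:

  from jaeger_client import Config

  # Coin-flip sampler: keep roughly 1 in 1,000 traces, decided up front.
  config = Config(
      config={"sampler": {"type": "probabilistic", "param": 0.001}},
      service_name="ride-service",
  )
  tracer = config.initialize_tracer()

  # Alternative: cap sampling at 2 traces per second.
  # "sampler": {"type": "ratelimiting", "param": 2}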

Tail-based sampling

Also known as: response sampling

Mechanism: sampling decision is made after the request has been completed

Useful for: catching anomalies in latency, failed requests

Benefits: more intelligent approach to looking at request data

Risk: every trace must be buffered temporarily – more infra overhead, and all spans of a trace must be routed to a single node

Work required: extra work – connect to a tool that supports tail-based sampling like Lightstep

Config notes: sampling based on latency criteria and tags
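
Jaeger's own SDKs don't make the tail-based decision for you (hence the extra tooling), but the idea is simple enough to sketch: buffer every span of a trace until the request finishes, then keep or drop the whole trace based on what actually happened. The function below is purely illustrative, not a Jaeger API:

  # Illustrative only: decide whether to keep a finished trace.
  # Each span is assumed to be a dict with start/end timestamps and an error flag.
  def keep_trace(spans, latency_threshold_ms=500):
      total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
      has_error = any(s.get("error") for s in spans)
      # Keep anything that failed or was unusually slow; drop the boring rest.
      return has_error or total_ms > latency_threshold_ms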

Now that you’ve picked your sampling method, you will also need to consider that Jaeger’s collector has finite data capacity.

Prevent Jaeger’s collector from getting clogged

Jaeger’s collector holds data temporarily before writing it to a database. This database is then queried by the visual UI (as seen above).

But a problem can arise: the collector can get clogged if the database can’t write fast enough during high traffic situations.

Problem:

  • Collector’s temporary storage model becomes problematic when traffic spikes
  • Some data gets dropped so the collector can stay afloat amid the flood of incoming request data
  • Your tracing may look patchy in places because of the gaps in the sampled data
  • Risk of missing failed or problematic requests if they happen to be in the data that gets dropped

Solution:

  • Consider an asynchronous span ingestion technique to solve this problem
  • This means adding a few components between your collector and database:
    • Apache Kafka – real-time data streaming at scale
    • Apache Flink – processes Kafka data asynchronously
    • 2 Jaeger components – jaeger-ingester and jaeger-indexer – push Flink output to storage

Once these mid-step systems are in place, the collector is far less likely to get overloaded and forced to drop data.
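
To get a feel for the shape of the pipeline, here is a rough sketch of asynchronous ingestion using the kafka-python package. It is not the actual jaeger-collector or jaeger-ingester code, just the decoupling idea: the collector side only enqueues spans onto a topic, while a separate consumer drains it and writes to storage at whatever rate the database can absorb (the broker address and topic name are made up):

  import json
  from kafka import KafkaProducer, KafkaConsumer

  producer = KafkaProducer(
      bootstrap_servers="localhost:9092",
      value_serializer=lambda span: json.dumps(span).encode("utf-8"),
  )

  def collect(span):
      # Collector side: enqueue and return immediately, so a traffic spike
      # buffers in Kafka instead of overflowing the collector's memory.
      producer.send("jaeger-spans", span)

  def ingest(write_to_storage):
      # Ingester side: consume at whatever pace the database can keep up with.
      consumer = KafkaConsumer(
          "jaeger-spans",
          bootstrap_servers="localhost:9092",
          value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
      )
      for message in consumer:
          write_to_storage(message.value)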

How to implement:

Remember when you first heard Kubernetes terms like nodes, pods, sidecars, multitenant? Confusion.

Same story with Apache Kafka and Flink. Lots of new jargon to learn that is beyond our high-level scope here.

But these links – access them in order – might help you get started with your implementation:

YouTube – Jaeger straight-to-DB vs async write method

YouTube – Apache Kafka videos by Confluent

Practical overview (with example) of Apache Flink

Wrapping up

This concludes our quick rundown of Jaeger and the promise it holds for distributed tracing of microservices.