Jaeger is a tracing tool that allows engineers to track issues across tens, hundreds, or even thousands of services and their dependencies.
In technical terms, Jaeger collects “tracing data” from distributed services and surfaces it in dashboards (its own UI, or tools such as Grafana) that highlight errors and the risk of downtime or slow loads.
This makes it an essential component of a strong observability practice. Observability depends on effective tooling and practices around logging, monitoring and tracing.
In this quick guide, we’ll explore a few tactics to run Jaeger well on Kubernetes. But before we do that, let’s uncover its origin story…
Jaeger was created in 2015 by an engineer at Uber, Yuri Shkuro, who wanted to help engineers work out where issues were popping up. This emerged as a critical need at Uber over time.
The Uber app may seem simple at first glance, but it ran and still runs as a complex network of microservices. Many of these services depend on other services as well as their own sub-services.
A weak link in the service chain can cause the whole user request to fail. In business terms, Uber risks losing ride fares at scale if one or more of the component services fail or slow down.
“In deep distributed systems, finding what is broken and where is often more difficult than why” – Yuri Shkuro, Founder & Maintainer, CNCF Jaeger
Jaeger helps engineers find out what services are experiencing issues and where. That way, they can fix small issues before they snowball into serious problems or even crises.
Do your observability needs justify using Jaeger?
You might be wondering whether you even need Jaeger.
After all, your use case might not be as complex as Uber’s. Yuri created Jaeger to handle a complex web of services and millions of requests per day.
Simply put, tracing is not an absolute must-have for simpler architectures. But it is handy for finding bottlenecks if your application runs off more than a handful of services, say more than 10.
Also, imagine this situation. Your application suddenly gets a traffic spike and a large volume of requests are not completing. How will you find the culprit fast enough to fix the issue?
Feel the need for Jaeger’s capabilities? Let’s explore how it works.
How Jaeger tracing works
Jaeger Agent gathers “span data” that instrumented microservices send to it over UDP, for a sampled subset of requests
Data (service name, start time, duration) gets sent on to the Collector
Collector sends data to 2 places: Analytics and Visual Dashboard
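To make the span data in the steps above concrete, here is a minimal Python sketch. It is not the Jaeger client or its wire format: real clients send Thrift-encoded batches, while the `Span` class, the `timed_span` helper, and the JSON encoding here are assumptions made for readability.

```python
import json
import socket
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass


@dataclass
class Span:
    """Illustrative span record holding the three fields named above."""
    service_name: str
    start_time: float      # seconds since the epoch
    duration: float = 0.0  # seconds, filled in when the operation ends


@contextmanager
def timed_span(service_name):
    """Time the wrapped block and record the result on a Span."""
    span = Span(service_name=service_name, start_time=time.time())
    started = time.perf_counter()
    try:
        yield span
    finally:
        span.duration = time.perf_counter() - started


def encode_span(span):
    """Serialize a span to bytes (JSON is an assumption, not Jaeger's format)."""
    return json.dumps(asdict(span)).encode("utf-8")


def send_span(span, address):
    """Fire-and-forget one span over UDP to a listener at (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_span(span), address)


with timed_span("checkout") as span:
    time.sleep(0.01)  # stand-in for real work

payload = encode_span(span)
print(payload)
```

`send_span` is defined but not called here, since it needs a listener that understands this toy format. UDP suits this reporting path because the service never waits for an acknowledgement.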
Now let’s explore how to install Jaeger on a Kubernetes cluster.
2 ways to install Jaeger on K8s
I will assume that you know how Kubernetes clusters are structured in terms of containers, nodes, pods, sidecars etc.
Jaeger Agent can run on a Kubernetes cluster in two distinct ways: as a DaemonSet or as a sidecar. Let’s compare both of them.
Set up Jaeger as a DaemonSet
Mechanism: Jaeger Agent runs as a pod and collects data from all other pods within the same node
Useful for: single tenant or non-production clusters
Benefits: lower memory overhead, simpler setup
Risk: security risk if deployed on multi-tenant cluster
LEARN BY DOING: simple Jaeger setup tutorial via DigitalOcean
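As a rough illustration of the DaemonSet approach, the sketch below builds a manifest as a Python dictionary and prints it as JSON. The resource name `jaeger-agent`, the untagged image, and the collector address `jaeger-collector:14250` are assumptions for this example; adjust them to match your cluster.

```python
import json

AGENT_PORT = 6831  # UDP port the Jaeger Agent listens on for spans

# One agent pod per node; hostPort exposes the agent on the node's own IP.
daemonset = {
    "apiVersion": "apps/v1",
    "kind": "DaemonSet",
    "metadata": {"name": "jaeger-agent", "labels": {"app": "jaeger-agent"}},
    "spec": {
        "selector": {"matchLabels": {"app": "jaeger-agent"}},
        "template": {
            "metadata": {"labels": {"app": "jaeger-agent"}},
            "spec": {
                "containers": [
                    {
                        "name": "jaeger-agent",
                        "image": "jaegertracing/jaeger-agent",  # pin a tag in practice
                        # Assumed collector Service name and gRPC port.
                        "args": ["--reporter.grpc.host-port=jaeger-collector:14250"],
                        "ports": [
                            {
                                "containerPort": AGENT_PORT,
                                "hostPort": AGENT_PORT,
                                "protocol": "UDP",
                            }
                        ],
                    }
                ]
            },
        },
    },
}

manifest_json = json.dumps(daemonset, indent=2)
print(manifest_json)
```

Saving the printed JSON to a file and running `kubectl apply -f` on it would create the DaemonSet, since kubectl accepts JSON as well as YAML. Application pods would then send spans to their node's IP, which Kubernetes can expose through the downward API field `status.hostIP`.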
Set up Jaeger as a sidecar
Mechanism: Jaeger Agent runs as a container alongside service container within every pod
Useful for: multi-tenant clusters, public cloud clusters
Benefits: granular control, higher security potential
Risk: more DevOps supervision required
LEARN BY DOING: deploy Jaeger as a sidecar via Jaeger’s GitHub
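The sidecar mechanism can be sketched the same way: the agent becomes one more container in every pod spec. The helper `with_agent_sidecar`, the `example/checkout:1.0` image, and the collector address below are hypothetical names used only for illustration.

```python
import copy
import json


def with_agent_sidecar(pod_spec, collector="jaeger-collector:14250"):
    """Return a copy of pod_spec with a Jaeger Agent container added once."""
    patched = copy.deepcopy(pod_spec)
    names = [c["name"] for c in patched["containers"]]
    if "jaeger-agent" not in names:
        patched["containers"].append(
            {
                "name": "jaeger-agent",
                "image": "jaegertracing/jaeger-agent",  # pin a tag in practice
                "args": [f"--reporter.grpc.host-port={collector}"],
                # No hostPort: only containers in this pod can reach the agent.
                "ports": [{"containerPort": 6831, "protocol": "UDP"}],
            }
        )
    return patched


# Hypothetical application pod with a single service container.
pod_spec = {"containers": [{"name": "checkout", "image": "example/checkout:1.0"}]}
patched = with_agent_sidecar(pod_spec)
print(json.dumps(patched, indent=2))
```

Because containers in a pod share a network namespace, the service container would report spans to `localhost:6831` rather than to a node-wide agent. That isolation is what makes this layout attractive on multi-tenant clusters, at the cost of one extra container per pod.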
Remember earlier when I mentioned that Jaeger samples the span data transmitted by services?
Well, there are 2 sampling methods: head-based sampling and tail-based sampling. Each has its own benefits and downsides.
Head-based sampling
Also known as: upfront sampling
Mechanism: sampling decision is made prior to request completion
Useful for: high-throughput use cases, looking at aggregated data
Benefits: cheaper sampling method – lower network and storage overhead
Risk: potential to miss outlier requests due to less than 100% sampling
Work required: easy setup, supported by Jaeger SDKs
Config notes: the sampling decision is either a weighted coin flip (probabilistic) or capped at a set rate of traces per second (rate limiting)
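The two head-based strategies in the config notes can be sketched in a few lines of Python. This is an illustration of the idea, not the Jaeger SDK's implementation: the class names and the injectable `rng` and `clock` parameters are assumptions chosen to make the example testable.

```python
import random
import time


class ProbabilisticSampler:
    """Coin flip: keep each new trace with a fixed probability."""

    def __init__(self, rate, rng=None):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.rate = rate
        self._rng = rng or random.Random()

    def should_sample(self):
        return self._rng.random() < self.rate


class RateLimitingSampler:
    """Token bucket: keep at most max_per_second traces on average."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.max_per_second = float(max_per_second)
        self._capacity = max(1.0, self.max_per_second)
        self._tokens = self._capacity
        self._clock = clock
        self._last = clock()

    def should_sample(self):
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        refill = (now - self._last) * self.max_per_second
        self._tokens = min(self._capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


# The decision is made when the request starts, before its outcome is known.
coin = ProbabilisticSampler(rate=0.1, rng=random.Random(42))
kept = sum(coin.should_sample() for _ in range(10_000))
print(f"probabilistic sampler kept {kept} of 10000 traces")

fake_now = [0.0]  # controllable clock for a deterministic demo
limiter = RateLimitingSampler(max_per_second=2, clock=lambda: fake_now[0])
burst = [limiter.should_sample() for _ in range(5)]
print(burst)  # [True, True, False, False, False]
```

Neither sampler sees latency or errors, which is why an outlier request can slip through unsampled.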
Tail-based sampling
Also known as: response sampling
Mechanism: sampling decision is made after the request has been completed
Useful for: catching anomalies in latency, failed requests
Benefits: more intelligent approach to looking at request data
Risk: every trace must sit in temporary storage until the decision is made – more infra overhead, single node only
Work required: extra work – connect to a tool that supports tail-based sampling like Lightstep
Config notes: sampling based on latency criteria and tags
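For contrast, here is a minimal sketch of a tail-based decision using the latency and tag criteria mentioned above. The `FinishedSpan` and `TailSampler` names, the 500 ms threshold, and the `error` tag are assumptions for illustration, not the API of Lightstep or any other tool.

```python
from dataclasses import dataclass, field


@dataclass
class FinishedSpan:
    trace_id: str
    service_name: str
    duration_ms: float
    tags: dict = field(default_factory=dict)


class TailSampler:
    """Buffer every span, then decide per trace once the request is done."""

    def __init__(self, latency_threshold_ms):
        self.latency_threshold_ms = latency_threshold_ms
        self._buffer = {}  # trace_id -> spans held until the decision

    def record(self, span):
        self._buffer.setdefault(span.trace_id, []).append(span)

    def finish_trace(self, trace_id):
        """Return the trace's spans if it is slow or failed, else an empty list."""
        spans = self._buffer.pop(trace_id, [])
        slow = any(s.duration_ms >= self.latency_threshold_ms for s in spans)
        failed = any(s.tags.get("error") for s in spans)
        return spans if (slow or failed) else []


sampler = TailSampler(latency_threshold_ms=500)
sampler.record(FinishedSpan("t1", "cart", 40))                         # fast, healthy
sampler.record(FinishedSpan("t2", "cart", 35))
sampler.record(FinishedSpan("t2", "payments", 900))                    # slow
sampler.record(FinishedSpan("t3", "payments", 20, {"error": True}))    # failed

kept = {tid: sampler.finish_trace(tid) for tid in ("t1", "t2", "t3")}
kept_ids = [tid for tid, spans in kept.items() if spans]
print(kept_ids)  # ['t2', 't3']
```

The `_buffer` dictionary is the temporary storage listed under the risks: every span is held in memory until its trace completes, including the ones that end up discarded.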
Now that you’ve picked your sampling method, you will also need to consider that Jaeger’s collector has finite data capacity.
Prevent Jaeger’s collector from getting clogged
Jaeger’s collector holds data temporarily before writing it to a database. The visual UI then queries this database.
But a problem can arise: the collector can get clogged if the database can’t write fast enough during high traffic situations.
- The collector’s temporary storage model becomes problematic when traffic spikes
- Some data gets dropped so the collector can stay afloat in the flood of incoming request data
- Your tracing may look patchy in areas because of the gaps in sampling data
- You risk missing failed or problematic requests if they were among the samples that got dropped
- Consider an asynchronous span ingestion technique to solve this problem
- This means adding a few components between your collector and database:
  - Apache Kafka – real-time data streaming at scale
  - Apache Flink – processes Kafka data asynchronously
  - Two Jaeger components – jaeger-ingester and jaeger-indexer – push Flink output to storage
Once these mid-step systems are in place, the collector will be less likely to get overloaded and tempted to dump data.
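The difference between the two models can be shown with a toy simulation. This is a sketch under simplifying assumptions: a Python list stands in for a Kafka topic, there is no Flink step, and `DirectCollector` and `LogBufferedPipeline` are hypothetical names rather than Jaeger components.

```python
import queue


class DirectCollector:
    """Small bounded in-memory queue in front of storage; drops when full."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def receive(self, span):
        try:
            self._queue.put_nowait(span)
        except queue.Full:
            self.dropped += 1  # the gap that makes traces look patchy

    def queued(self):
        return self._queue.qsize()


class LogBufferedPipeline:
    """Append-only log between collector and storage; the ingester reads at its own pace."""

    def __init__(self):
        self._log = []     # stand-in for a Kafka topic
        self._offset = 0   # position of the ingester in the log
        self.stored = []   # stand-in for the database

    def receive(self, span):
        self._log.append(span)

    def ingest(self, batch_size):
        """Write the next batch to storage and return how many spans were written."""
        batch = self._log[self._offset:self._offset + batch_size]
        self.stored.extend(batch)
        self._offset += len(batch)
        return len(batch)


# A spike of 100 spans arrives before storage can write anything.
direct = DirectCollector(capacity=20)
pipeline = LogBufferedPipeline()
for i in range(100):
    direct.receive({"span": i})
    pipeline.receive({"span": i})

ticks = 0
while pipeline.ingest(batch_size=10):  # storage absorbs 10 spans per tick
    ticks += 1

print(f"direct: queued={direct.queued()} dropped={direct.dropped}")
print(f"buffered: stored={len(pipeline.stored)} after {ticks} ticks")
```

The buffered pipeline trades latency for completeness: spans reach storage later during a spike, but none are dropped as long as the log itself has room.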
How to implement:
Remember when you first heard Kubernetes terms like nodes, pods, sidecars, multitenant? Confusion.
Same story with Apache Kafka and Flink. Lots of new jargon to learn that is beyond our high-level scope here.
But these links – access them in order – might help you get started with your implementation:
This concludes our quick rundown of Jaeger and the promise it holds for distributed tracing of microservices.