Grafana Labs

How Jaeger helped Grafana Labs improve query performance and root out tough bugs

Challenge

Grafana Labs’ hosted metrics platform incorporates Metrictank, a Graphite-compatible metrics service, and Cortex, the CNCF sandbox project for multitenant, horizontally scalable Prometheus-as-a-Service. But as the company started adding scale, query performance issues became noticeable. Without a way to visualize the path of requests end-to-end, the team attempted to solve the problem by guessing the cause of the slowness and rolling out a “fix”—“many times shooting in the dark, only to have our assumptions invalidated after a lot of experimentation,” says Software Engineer Goutham Veeramachaneni.

Solution

The team already had Jaeger distributed tracing installed for Metrictank. “We doubled down on it with Cortex to improve the query performance,” says VP of Product Tom Wilkie.

Impact

Cortex query performance has improved by as much as 10x. Jaeger has also helped the team track down bugs: “It visualizes data in a way that allows us to see where the performance issues are pretty much immediately,” Wilkie says.

Challenges: Optimization, Troubleshooting
Industry: Software
Location: United States
Cloud Type: Public
Product Type: Installer
Published: December 11, 2019

By the numbers

Query performance: improved by up to 10x
Queries per second: tens of thousands
Confidence in operating the system: grew by an order of magnitude

The company behind the popular open source Grafana project, Grafana Labs offers customers a hosted metrics platform called Grafana Cloud, which incorporates Metrictank, a Graphite-compatible metrics service, and Cortex, the CNCF sandbox project for multitenant, horizontally scalable Prometheus-as-a-Service.

Grafana Labs engineers also run Metrictank and Cortex to troubleshoot their own technical issues. But as the company scaled (Cortex and Metrictank each process tens of thousands of requests per second), query performance issues became noticeable, and the added latency degraded the user experience for Grafana Cloud customers.

Without a way to visualize the path of requests end-to-end, the team attempted to solve the problem by guessing the cause of the slowness and rolling out a “fix”—“many times shooting in the dark, only to have our assumptions invalidated after a lot of experimentation,” says Software Engineer Goutham Veeramachaneni.

The Metrictank team had already been using Jaeger distributed tracing to understand requests better and to see all logs in one place. “What we were looking for originally was a better way to do logging,” says Principal Software Engineer Dieter Plaetinck. “When I found out about how Jaeger adds this context to our logging—all the log lines related to one particular request are basically banded together—that really appealed to me.”

With that experience using Jaeger, “we doubled down on it with Cortex to improve the query performance,” says VP of Product Tom Wilkie. Jaeger allowed the team to drill down to specific requests and quickly find the queries that were causing latency. The results were stellar: query performance improved by as much as 10x.

As it turned out, Jaeger has also helped the Grafana Labs team with bug-hunting: “It visualizes data in a way that allows us to see where the performance issues are pretty much immediately,” Wilkie says.

For one particularly difficult Cortex issue, tracing turned out to be an invaluable debugging tool. “It was quite a big bug, but it was super intermittent and really hard to reproduce,” says Wilkie. “Customers were reporting the bug, but we could never come up with a way of reliably triggering it—which meant it was really hard for us to know what was causing it. And it was really hard for us to know when we had fixed it.”

Senior Software Engineer Joe Elliott eventually did some data analysis that helped identify queries that hit the bug. “Then we used Jaeger to figure out what was different about those queries, because Jaeger brings all the data together for a single query,” says Wilkie. “That’s what makes it so useful.”

With Jaeger in place, “the confidence in operating our system grew by an order of magnitude,” says Veeramachaneni. “It’s easier to visualize where the problems are, and it just made me more confident at tackling things because I’m able to see exactly what’s going wrong.”

“The way Jaeger visualizes requests and helps us very intuitively diagnose performance issues makes it really useful.”
— TOM WILKIE, VP OF PRODUCT AT GRAFANA LABS

For companies considering adopting Jaeger, Veeramachaneni says, “I would say start with an extremely low number of traces and keep on increasing it. Monitor Jaeger using the Prometheus monitoring mixin that we contributed to the community. It includes recording rules, dashboards, and alerts to give you a lot of visibility. And always instrument your code before you put it into production.”
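The advice to start with a very low number of traces and raise it gradually can be illustrated with a minimal sketch of probabilistic, head-based sampling. This is not Jaeger's client API; the function names `should_sample` and `sampled_fraction`, the trace ID format, and the example rates are all hypothetical. The decision is keyed on a hash of the trace ID so that every service seeing the same ID would make the same choice, and so that raising the rate only adds traces to the sampled set.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Keep roughly `rate` of all traces, decided deterministically per trace ID.

    Hypothetical helper for illustration; not part of any Jaeger client.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    # Map the trace ID to a stable number in [0, 1).
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate


def sampled_fraction(trace_ids, rate: float) -> float:
    """Fraction of the given trace IDs that would be kept at `rate`."""
    ids = list(trace_ids)
    return sum(should_sample(t, rate) for t in ids) / len(ids)


if __name__ == "__main__":
    # Made-up trace IDs; ramp the rate up step by step and watch the volume.
    ids = [f"trace-{i}" for i in range(20_000)]
    for rate in (0.001, 0.01, 0.1):
        print(f"rate={rate}: kept {sampled_fraction(ids, rate):.4f} of traces")
```

In practice the rate would be a configuration value that is increased while watching the tracing backend's own load, in line with the recommendation to monitor Jaeger itself.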

When monitoring Jaeger, Plaetinck adds, “if there’s one takeaway in terms of checking your stack, it’s just look for anything—spans, traces—being dropped anywhere.”

Wilkie adds that getting the instrumentation correct is the hardest part. “Distributed tracing is only really useful if you get everything right, which has been the challenge at Grafana Labs,” he says. “So you have to make it clear who owns the tracing; if it’s no one’s responsibility, it could be broken when you actually need it. And if you drop some data in the middle of a trace, then suddenly the whole trace becomes useless because you can’t see the association between the top and the bottom of it.”
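Wilkie's point about a dropped span breaking the link between the top and the bottom of a trace can be made concrete with a small sketch. The `Span` type and the `find_orphans` and `is_complete` helpers below are hypothetical and simplified (real spans carry far more data); the sketch only assumes that each span records the ID of its parent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    """Simplified, hypothetical span: just enough to show parent links."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str


def find_orphans(spans: List[Span]) -> List[Span]:
    """Return spans whose parent is missing from the collected trace."""
    known = {s.span_id for s in spans}
    return [s for s in spans
            if s.parent_id is not None and s.parent_id not in known]


def is_complete(spans: List[Span]) -> bool:
    """A trace is treated as complete if it has one root and no orphans."""
    roots = [s for s in spans if s.parent_id is None]
    return len(roots) == 1 and not find_orphans(spans)


if __name__ == "__main__":
    trace = [
        Span("a", None, "query-frontend"),
        Span("b", "a", "querier"),
        Span("c", "b", "store-read"),
    ]
    print("full trace complete:", is_complete(trace))

    # Drop the middle span: the bottom can no longer be tied to the top.
    broken = [trace[0], trace[2]]
    print("orphans after drop:", [s.name for s in find_orphans(broken)])
```

A check along these lines, run over a sample of stored traces, is one possible way to notice that spans are being dropped before the traces are needed for debugging.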

Jaeger is currently being used in a relatively small part of Grafana Labs’ system, but the team is figuring out how best to expand its reach. “We want to get to a larger percentage of traffic and understand some of the different configurations you can use to store and process data,” says Elliott.

“With Jaeger, it’s easier to visualize where the problems are, and it just made me more confident at tackling things because I’m able to see exactly what’s going wrong.”
— GOUTHAM VEERAMACHANENI, SOFTWARE ENGINEER AT GRAFANA LABS

At the same time, “if you trace every request into your system, then the Jaeger install quickly becomes bigger than your system itself,” says Wilkie. “And if you mess up and accidentally send it too much data, it becomes useless instantly, which is something we’re working on.”

Finding the right balance of utility and cost is an immediate goal. “The way Jaeger visualizes requests and the way it helps us very intuitively diagnose performance issues makes it really useful, which means we want to use it for everything,” says Wilkie. “But it’s just too expensive to use for everything right now. So we’re trying to get a decent handle on when is it economical to trace something and what’s valuable to store and monitor.”

The other big focus for Grafana Labs is to build Jaeger UI into the open source Grafana project for a more seamless experience. The team is also exploring how another CNCF project, OpenTelemetry, can be used with Jaeger for tail-based sampling.
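For readers unfamiliar with tail-based sampling, the idea is to decide whether to keep a trace only after it has finished, so that slow or failed requests can be kept while routine ones are discarded. The sketch below illustrates the concept only; it is not the OpenTelemetry or Jaeger implementation. The `FinishedSpan` and `TailSampler` names, the latency threshold, and the assumption that the root span is the last span of a trace to finish are all hypothetical simplifications.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FinishedSpan:
    """Simplified, hypothetical record of a completed span."""
    trace_id: str
    duration_ms: float
    is_root: bool = False
    error: bool = False


class TailSampler:
    """Buffer spans per trace and decide once the root span has finished."""

    def __init__(self, slow_ms: float) -> None:
        self.slow_ms = slow_ms
        self._buffer: Dict[str, List[FinishedSpan]] = defaultdict(list)
        self.kept: List[List[FinishedSpan]] = []

    def record(self, span: FinishedSpan) -> None:
        self._buffer[span.trace_id].append(span)
        # Assumption: the root span finishes last, so the trace is complete.
        if span.is_root:
            self._decide(span.trace_id)

    def _decide(self, trace_id: str) -> None:
        spans = self._buffer.pop(trace_id)
        root = next(s for s in spans if s.is_root)
        # Keep the whole trace if it was slow or any span reported an error.
        if root.duration_ms >= self.slow_ms or any(s.error for s in spans):
            self.kept.append(spans)


if __name__ == "__main__":
    sampler = TailSampler(slow_ms=500)
    sampler.record(FinishedSpan("t1", 12))
    sampler.record(FinishedSpan("t1", 40, is_root=True))   # fast: discarded
    sampler.record(FinishedSpan("t2", 700))
    sampler.record(FinishedSpan("t2", 950, is_root=True))  # slow: kept
    print("kept traces:", [spans[0].trace_id for spans in sampler.kept])
```

The trade-off against head-based sampling is that every span must be buffered until the decision is made, which costs memory in exchange for keeping the traces most likely to be useful.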

These developments would add to an already growing set of features that Veeramachaneni, for one, has been excited to see coming from the Jaeger project. Over the past few years, he says, “Jaeger has become more and more mature, and I’m quite happy at the pace it’s going. I love the project and the community.”