Observability should not slow you down

Posted on May 30, 2019

Originally published on Medium by Travis Jeppson, Sr. Director of Engineering, Nav Inc

In any application, the lack of observability is the same as riding a bike with a blindfold over your eyes. The only inevitable outcome is crashing, and crashing always comes with a cost. This cost tends to be the only focus we have when we look at observability, but it isn’t the only cost. The other cost of observability usually isn’t addressed until it becomes more painful than the cost of crashing: the cost of maintenance and adaptability.

I’ve listened to and watched many conference talks about this subject, and I’ve had my fair share of conversations with vendors as well. Maintenance and adaptability aren’t generally mentioned. These topics have only come up when I’m talking to other companies about the platform they adopted and how they actually integrated observability into real-life situations, and in my own experience doing the same. The reason these topics come up only after some practical application is that we’ve all hit the proverbial wall.

We’ve all run into problems, or incompatibilities, or vendor lock-in that feels almost impossible to get rid of. Our observability begins to dwindle, the blindfold starts falling down over our eyes, and we’re again heading toward an inevitable fate. What can be done? Revisit the entire scenario? Wait for a major crash and create an ROI statement to show we have to re-invest in major parts of our applications? This can’t possibly be the only way to deal with this problem. It runs counter to the way we build software. Observability is supposed to empower speed and agility, not hold them back.

There is another way, and it starts by determining the key elements on which you won’t make concessions. During the last iteration of trying to get this right at Nav, we had a lot of discussions around our previous attempts. The first attempt was a solution that we initially thought had unlimited integrations; it turned out it didn’t have the one we needed: Kubernetes. We also couldn’t produce custom metrics from our applications, so that solution had to go. We weren’t about to wait for a vendor to tell us an integration was ready; we were ready to move. For our second attempt, we went with a solution that was customizable end to end, so we could spend our time developing our telemetry data and deciding how to interpret it. This, unfortunately, forced us into a maintenance nightmare. On the third iteration, we decided to settle somewhere in the middle. We sat down, defined our “no compromise” priorities, and started finding solutions that fit. Here’s how we saw the priorities for Nav.

1. Customization! We needed adaptability, no waiting for integrations

First and foremost, the solution needed to allow for custom metrics and treat them as first-class citizens. This needed to be true for our infrastructure metrics as well as anything coming from our applications. Adaptability was key in our decision: if the solution we chose was adaptable, then we should be free to adjust any component of our infrastructure without having to check whether our observability would be affected.

2. No vendor-specific code in our applications, not even libraries

This may seem a little harsh at first, but the fact of the matter is that we didn’t want to have a dependency on a vendor. We use a wide variety of languages at Nav: Ruby, Elixir, Go, Python, JavaScript, Java, and the list goes on. It was almost impossible to find a vendor solution that would work with all of those languages. We decided the solution needed to be language-agnostic, which meant we couldn’t have any vendor code or libraries in our applications. The other side of this is that we didn’t want to be locked into the solution, since we had previously run into issues with exactly that problem.

3. HELP! The maintenance cannot be overwhelming

This meant that at some point we would probably need a vendor to help us out. We didn’t want keeping our observability platform at a ridiculously high uptime to be our concern; we wanted to worry about the uptime of our application instead. We also didn’t want to worry about the infrastructure of the observability platform; we wanted to worry about our own. Catch my drift? We also wanted some guidance about what to pay attention to. We wanted a simple way to build dashboards, and the ability to let pretty much every engineer build their own dashboards around their own metrics.

Now the Rest: Our Second Tier of Priorities

Now we get into the “like to have” priorities. The following were more of a wish list; the top three were the dealbreakers for whatever solution we came up with. Fortunately, as will be illustrated later, we didn’t need to compromise on any of our priorities.

4. Alerting needed to be easy to do, and integrate with our on-call solution

With our end-to-end customized solution (attempt #2 in observability), alerting was ridiculously tedious. Each alert was a JSON document with so many defining parts that we never really had any good alerts set up. We also caused a lot of on-call burnout due to the large number of false positives. We didn’t want to repeat this.

5. We didn’t want to pay the same price for our non-production environments as we do for production

It is a giant pet peeve of mine that anyone should be required to pay the same price for observability just because the environments are the same size. Why must this be? I don’t care nearly as much if a development environment goes down for 5 minutes, but I definitely care if production is down for 5 minutes.

The Final Decision: Nav’s Tri-Product Solution

With these priorities in hand, we set out to create a solution that worked. To cut a long story short, there turned out to be no perfect solution: no single product could give us our top three priorities … on its own. It turns out we needed multiple pieces to work seamlessly together.

Prometheus

Prometheus is an open source metric aggregation service. The fantastic thing about Prometheus is that it is built around a standard, which its authors also created. This standard is called the exposition format. You provide a text-based HTTP endpoint, and Prometheus comes by, “scrapes” the data off of this endpoint, and feeds it into a time series database. This … is … simply … amazing! Our developers were able to write this endpoint in their own code bases and publish any kind of custom metric they desired.
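To make the idea of a scrapeable endpoint concrete, here is a minimal sketch in Python using only the standard library. It is not Nav’s code: the metric name, label, value, and the helper names `render_exposition`, `MetricsHandler`, and `serve` are all made up for illustration, and a real service would more likely use an existing client library than hand-roll the text format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def escape_label_value(value):
    """Escape backslashes, double quotes, and newlines in a label value."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_exposition(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered metrics at /metrics for a scraper to collect."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Hypothetical custom metric; a real app would read live counters here.
        body = render_exposition(
            "jobs_processed_total",
            "Jobs processed since start.",
            "counter",
            [({"queue": "billing"}, 42)],
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrape requests on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and then requesting `/metrics` would return three lines of text: a `# HELP` line, a `# TYPE` line, and the sample `jobs_processed_total{queue="billing"} 42`. Because the contract is just text over HTTP, nothing vendor-specific has to live in the application.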

StatsD

StatsD is a solution originally written by Etsy. It gave us a way to push metrics from software that isn’t associated with a web server, such as short-lived jobs or event-driven computations.
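As a rough sketch of what “pushing” looks like, the snippet below formats a metric in the plain-text StatsD line protocol (`name:value|type`) and sends it as a single UDP datagram. The metric names, the helper names `format_statsd` and `send_statsd`, and the assumption of an agent listening on localhost port 8125 (the conventional StatsD port) are illustrative only.

```python
import socket


def format_statsd(name, value, metric_type):
    """Build a StatsD line: 'c' is a counter, 'g' a gauge, 'ms' a timer."""
    return f"{name}:{value}|{metric_type}"


def send_statsd(line, host="127.0.0.1", port=8125):
    """Fire-and-forget: send one metric line over UDP to a StatsD listener."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


def run_nightly_export():
    """Hypothetical short-lived job that reports its own metrics and exits."""
    rows_exported = 1250  # placeholder for real work
    send_statsd(format_statsd("nightly_export.rows", rows_exported, "c"))
    send_statsd(format_statsd("nightly_export.duration", 320, "ms"))
```

Since the job pushes its numbers before it exits, nothing needs to scrape it, which is what makes this model a good fit for work that may not live long enough to be scraped.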

Between StatsD and Prometheus, we were able to publish custom metrics from virtually anywhere. The other great thing is that, with both of these solutions being open source, there was already a thriving community building out assistive components for the two tools.

The final piece of the puzzle for us was where the vendor came into play. With our priorities set, we found a vendor that integrated seamlessly with Prometheus metrics. They would even scrape the metrics for us, so we didn’t need to run Prometheus ourselves; we just used its standards. They also ingested our StatsD metrics without a hitch.

SignalFx

SignalFx was the vendor we ended up selecting; it is what worked for us and our priorities. The key to vendor selection is that the solution fulfills your needs from a managed-service and ease-of-use viewpoint. That being said, I’ll illustrate how SignalFx fulfilled this for us.

The trailing part of our third priority was that we wanted some guidance on what to pay attention to. SignalFx had some very useful dashboards out of the gate that used our Prometheus metrics to pinpoint some of our key infrastructure components, like Kubernetes and AWS.

They also have a very robust alerting system, where setting up an alert was as simple as identifying the “signal” we wanted to pay attention to and adding a particular constraint to it. These constraints could be anything from a static threshold, to outliers, to historical anomalies. This was significantly simpler than our second attempt, and it was built around custom metrics! Win, win!
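To illustrate the difference between two of those kinds of constraint, here is a small, generic sketch. It is not how SignalFx implements its detectors; the function names, the three-standard-deviation rule, and the latency numbers are assumptions chosen only to show why a historical check can catch what a static threshold misses.

```python
from statistics import mean, stdev


def breaches_static_threshold(value, limit):
    """Static constraint: fire whenever the signal rises above a fixed limit."""
    return value > limit


def is_historical_anomaly(history, value, num_stdevs=3.0):
    """Historical constraint: fire when the value strays too far from its past.

    Flags the value if it is more than num_stdevs sample standard deviations
    away from the mean of the history.
    """
    if len(history) < 2:
        return False  # not enough history to judge
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        return value != center
    return abs(value - center) > num_stdevs * spread


# Hypothetical request latencies in milliseconds.
recent_latency_ms = [100, 102, 98, 101, 99]
latest = 120

static_alert = breaches_static_threshold(latest, limit=500)
history_alert = is_historical_anomaly(recent_latency_ms, latest)
```

With these made-up numbers, `static_alert` is false because 120 ms is well under the 500 ms limit, while `history_alert` is true because 120 ms is far outside what the signal normally does.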

Finally, SignalFx charges per metric you send them. The great thing about this is that our non-prod environments are pretty quiet, and we dialed their resolution down to a minute or two, so the metrics that are constantly being generated, like CPU or memory, didn’t cost an arm and a leg. This fulfilled our final priority and allowed us to save a significant amount of money over other vendor solutions.
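A quick back-of-the-envelope calculation shows why resolution matters, assuming cost scales with the volume of data sent. The series count and the 10-second production interval below are invented for illustration; they are not Nav’s figures or SignalFx’s pricing.

```python
SECONDS_PER_DAY = 86_400


def datapoints_per_day(series_count, interval_seconds):
    """Datapoints sent per day when each series reports once per interval."""
    return series_count * (SECONDS_PER_DAY // interval_seconds)


# Hypothetical environment: 1,000 always-on series such as CPU and memory.
production = datapoints_per_day(series_count=1000, interval_seconds=10)
non_production = datapoints_per_day(series_count=1000, interval_seconds=60)
reduction_factor = production / non_production
```

Under these assumptions, the same 1,000 series send 8,640,000 datapoints a day at a 10-second resolution but only 1,440,000 at a one-minute resolution, a sixfold reduction for an environment where a few minutes of blindness is tolerable.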

The takeaway from all of this is that the observability platform we use, if built around standardized systems, doesn’t have to be painful. In fact, it can be just the opposite. We have been able to accelerate our development, and thanks to the maintainability and adaptability of our observability platform, we haven’t had any surprises.

For more on Nav’s cloud native journey, check out the case study and video.
