Why we selected Thanos for long term metrics storage

Posted on April 18, 2022


Guest post originally published on the Elastisys blog

Metrics answer 3 questions: Are your users happy? Is your application happy? Are your servers happy? Application developers create dashboards based on metrics for situational awareness or to identify long-term trends. Who wouldn’t want to measure their growth and compare the daily active users from today with the value one year ago?

Prometheus – one of the first Graduated CNCF projects – is the go-to solution for collecting application and platform metrics, as well as short-term storage. If your application is already containerized, Prometheus and Kubernetes are BFF. For long-term metrics storage, Prometheus integrates with 28 solutions. Which one should you choose?

In this post, we tell the story of how we selected among the plethora of projects for long-term metrics storage. This blog post is the result of 3 all-hands meetings and over 200 person-hours of hands-on work, where our engineers took each candidate project and plugged it into Elastisys Compliant Kubernetes. YMMV, but we hope our story will save you time or at least inform a similar evaluation process of your own.

Evaluation Criteria

1. Long-term “Health”

Governance (prefer community-driven, CNCF)
Open-source (prefer Apache 2.0)
Security patching history / policy
Popularity / adoption (GitHub contributors / commits / stars, who uses it?)

2. MUST HAVEs

Remote_read or “direct” PromQL, remote_write
Compression, aggregation
Deduplication

3. Easy to operate

Highly available
Disaster recovery
Backups / long-term archival
Security patching
Upgrades
Capacity needed

Criteria 1: Long-term Health

Our architectural decision process favors caring for the future more than the present. Features can be added and removed, but changing project ownership and aligning interests is a lot harder. Many remember the Istio governance discussion that brought this risk higher up the agenda. Hence, long-term health is a higher priority than the exact features a project has today.

Only community-driven governance can truly ensure that the project does not depend on the interests or balance sheet of any single company, be it small or large. Furthermore, community-driven open-source benefits business continuity, putting a smile on your CISO’s face.

Finally, we took our crystal ball out and asked if the candidate projects would be around next year. Migration costs are not something we want to pay too often. Unfortunately, crystal balls are yet to be invented, so in the meantime we resorted to vanity metrics: number of GitHub contributors, commits, stars and forks, as well as which other cool kids are using a project.

Criteria 2: MUST HAVE Features

We cannot live without the following. Prometheus remote_read integration or PromQL support: This allows both application developers and platform engineers to reuse – and contribute to – popular dashboards, such as those published on Grafana Labs’s website or provided with the Prometheus Operator. I guess we’re not big fans of reinventing the wheel.

When it comes to long-term storage, size does matter. Not only because … well storage costs … but also because it makes off-site replication and querying faster. Two complementary techniques can achieve this. First, compression – needed by some projects, but not all – stores metrics in a more compact (perhaps slower to query) format. Compression – as we understand and use the term here – implies no loss of information. Second, aggregation implies loss of information by reducing the resolution of the data. This can happen either in “time” or in “space”. In time, the time resolution is decreased, e.g., values are stored with a time resolution of 15 minutes as opposed to 15 seconds. In space, labels are removed, e.g., you can retrieve the average CPU usage for all Pods of an application, but you no longer have access to the time-series of individual Pods.
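To make the two flavors of aggregation concrete, here is a minimal Python sketch. The sample data, the label names (app, pod) and both helper functions are made up for illustration. None of the candidate projects is claimed to work this way, and real implementations are considerably more sophisticated than a plain average.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw samples: (timestamp in seconds, labels, CPU usage).
samples = [
    (0, {"app": "shop", "pod": "shop-a"}, 0.25),
    (15, {"app": "shop", "pod": "shop-a"}, 0.75),
    (0, {"app": "shop", "pod": "shop-b"}, 0.5),
    (15, {"app": "shop", "pod": "shop-b"}, 1.0),
    (900, {"app": "shop", "pod": "shop-a"}, 0.5),
    (900, {"app": "shop", "pod": "shop-b"}, 1.5),
]


def aggregate_in_time(samples, window_seconds):
    """Average each series within fixed windows, lowering the time resolution."""
    buckets = defaultdict(list)
    for ts, labels, value in samples:
        window_start = ts - ts % window_seconds
        series = tuple(sorted(labels.items()))
        buckets[(window_start, series)].append(value)
    return {key: mean(values) for key, values in sorted(buckets.items())}


def aggregate_in_space(samples, drop_label):
    """Average across series at each timestamp after removing one label."""
    buckets = defaultdict(list)
    for ts, labels, value in samples:
        remaining = tuple(
            sorted((k, v) for k, v in labels.items() if k != drop_label)
        )
        buckets[(ts, remaining)].append(value)
    return {key: mean(values) for key, values in sorted(buckets.items())}


in_time = aggregate_in_time(samples, window_seconds=900)
in_space = aggregate_in_space(samples, drop_label="pod")
```

With a 900-second window, the six raw samples shrink to four values (one per Pod per window). Dropping the pod label leaves three values (one per timestamp for the whole application), and the per-Pod series can no longer be recovered from them.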

Finally, deduplication: Common wisdom says that your monitoring stack needs to be one order of magnitude more resilient than your monitored system. As you guessed, that implies running several Prometheuses (or Promethesis?), so that a Node failing at 2am can be dealt with during business hours. But something somewhere needs to deduplicate redundantly collected metrics, unless your Product Manager asked you to count each user twice. Deduplication ensures that metrics are presented only once, although they are collected and stored twice.
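Here is a naive Python sketch of the idea, assuming each Prometheus instance tags its samples with a label that identifies the replica. The label name "replica", the sample data and the deduplicate function are all hypothetical. The sketch also assumes both replicas scrape at exactly the same timestamps, which real replicas rarely do, so treat it as an illustration of the concept and not of how any of the candidate projects does it.

```python
# Hypothetical samples from two Prometheus replicas scraping the same target.
# Replica "a" missed the scrape at t=30, e.g., because its Node went down.
samples = [
    (0, {"job": "api", "replica": "a"}, 10),
    (15, {"job": "api", "replica": "a"}, 12),
    (0, {"job": "api", "replica": "b"}, 10),
    (15, {"job": "api", "replica": "b"}, 12),
    (30, {"job": "api", "replica": "b"}, 15),
]


def deduplicate(samples, replica_label="replica"):
    """Ignore the replica label and keep one value per series and timestamp."""
    merged = {}
    for ts, labels, value in samples:
        series = tuple(
            sorted((k, v) for k, v in labels.items() if k != replica_label)
        )
        # The first replica to report a given (series, timestamp) wins.
        merged.setdefault((series, ts), value)
    return [
        (ts, dict(series), value)
        for (series, ts), value in sorted(merged.items())
    ]


deduplicated = deduplicate(samples)
```

The five stored samples are presented as three, and the gap left by replica "a" at t=30 is filled by replica "b".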

Criteria 3: Easy to Operate

Features are important, but what happens on day 2?

As part of our evaluation, we also wanted to get a “feel” for how the new project would support our data security practices. Would applying security patches feel like “todo; done” or more like launching the James Webb Space Telescope? What about setting up high availability?

What about disaster recovery? Do we even need to perform disaster recovery or can the project store all critical data in append-only S3-compatible object storage? In other words, “disaster recovery” – mind you, it would be quite disastrous to lose your provider’s object storage – would be as simple as running rclone to copy the data back from the off-site replica.

Finally, how capacity hungry is the project? The purpose of the platform is to serve the application. So, while metrics storage deserves all the TLC in the world, it should preferably not need more CPUs and memory than the application itself … please.

Comparison of the top contenders

After performing a rough filtering of all 28 Prometheus integrations, we moved ahead and did a more thorough analysis of 6 candidates, including plugging them into Elastisys Compliant Kubernetes.

Let’s see how the top six (InfluxDB, TimescaleDB, M3DB, VictoriaMetrics, Thanos, and Cortex) compare.

Honorable mention: InfluxDB

InfluxDB is a purpose-built time-series database owned and developed by InfluxData. It comes in two versions, open source (MIT licensed) and enterprise. Amongst others, the enterprise version brings high availability and horizontal scalability (clustering). InfluxDB stores data on disk, i.e., PersistentVolumes in Kubernetes parlance. InfluxDB 1 is deprecated, and users are recommended to switch to InfluxDB 2 as soon as possible.

Overall, InfluxDB is an amazing project and version 1 has served us well for years. Thank you InfluxDB for bringing us this far!

Deselected because: We had to say farewell to InfluxDB for a few reasons. First, it’s not community-driven. Second, the open-source version lacks MUST HAVEs, like high availability and deduplication. Third, in our environments, it proved rather resource hungry. In some cases, we had to reduce retention to 3 days to stay within our 16 GB RAM budget.

We considered InfluxDB 2. However, there are no immediate plans to add support for remote_read, so we had to pass.

Bronze medal: TimescaleDB

TimescaleDB is a time-series database owned and built by Timescale. It’s implemented as an extension of PostgreSQL. Using TimescaleDB for metrics storage implies that you can leverage existing in-house knowledge around operating PostgreSQL and reuse your access control, high availability and disaster recovery procedures.

TimescaleDB comes in two versions: open source (Apache 2.0) and Timescale License (TSL). The latter is similar to the controversial Server-Side Public License (SSPL) adopted by MongoDB in 2018 and Elasticsearch in 2021. The TSL version adds compression and aggregation.

Deselected because: Unfortunately, the project is not community-driven. Its open-source version lacks compression. And you definitely need compression! TimescaleDB initially stores each value – together with its timestamp and labels – as one database row, which is extremely space consuming. Compression merges related values into a single row, to obtain something that resembles a bit more the super-efficient TSDB file format, but in a PostgreSQL database.

I’m getting a bit philosophical here, but TimescaleDB really made me wonder: Is a relational database really the right nest for metrics? Metrics are pretty much append-only, so all the hard work that PostgreSQL is doing for ensuring transactionality is rather wasted. Our DBA kind of freaked out at the amount of WAL generated and sent to the secondary. You should have totally seen his face. So the Noes have it. Unlock!

Silver medal: M3DB and VictoriaMetrics

M3DB and VictoriaMetrics are two time-series databases backed by Uber+Chronosphere and VictoriaMetrics, respectively. They are both open source Apache 2.0-licensed. They both come with their own Kubernetes operator to simplify operations. Feature-wise, they ticked off all the boxes. They both store metrics on disk, are highly available and seem to be able to handle tons of metrics.

Overall, we were pleasantly surprised by how well they both integrated with Grafana and Prometheus, how easy it was to set up high availability and how well they could handle tons of metrics.

Deselected because: We deselected them mainly because they are not community-driven. This is a selection and not a grading, so we really needed to find a deselection reason. Apart from that, kudos to the teams behind them!

Gold medal: Thanos and Cortex

Thanos and Cortex are both CNCF Incubating projects, so the community-driven open-source box is checked. They tick all our MUST HAVEs, are easy to use and can handle tons of metrics. Prometheus and Grafana love them, and so did our platform engineers.

Looking at the big picture, they have a similar design. For the write path, metrics are first stored on disk in TSDB format, then TSDB segments are regularly uploaded to S3-compatible object storage. The read path looks fairly symmetrical, with the disk acting as a sort of cache. Makes a lot of sense for append-only data like metrics!
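The shape of that design can be sketched in a few lines of Python. This is a toy model under loud assumptions: plain dicts stand in for the local disk and the S3-compatible object storage, ToyMetricsStore and all of its methods are invented for illustration, and the block length is an arbitrary parameter. It does not reflect the actual code or APIs of Thanos or Cortex.

```python
class ToyMetricsStore:
    """Toy model: dicts stand in for local disk and object storage."""

    def __init__(self, block_seconds):
        self.block_seconds = block_seconds
        self.local_disk = {}    # block start -> list of (timestamp, value)
        self.object_store = {}  # block start -> list of (timestamp, value)

    def _block_start(self, ts):
        return ts - ts % self.block_seconds

    def write(self, ts, value):
        """Write path, step 1: append the sample to a block on local disk."""
        block = self.local_disk.setdefault(self._block_start(ts), [])
        block.append((ts, value))

    def upload_completed_blocks(self, now):
        """Write path, step 2: upload blocks older than the current one."""
        current = self._block_start(now)
        for start, block in self.local_disk.items():
            if start < current and start not in self.object_store:
                self.object_store[start] = list(block)

    def evict_local(self, before):
        """Free local disk, but only for blocks that were already uploaded."""
        uploaded = [
            s for s in self.local_disk
            if s < before and s in self.object_store
        ]
        for start in uploaded:
            del self.local_disk[start]

    def read(self, ts):
        """Read path: use local disk if possible, else fetch and cache."""
        start = self._block_start(ts)
        if start not in self.local_disk and start in self.object_store:
            self.local_disk[start] = list(self.object_store[start])
        return [v for t, v in self.local_disk.get(start, []) if t == ts]


store = ToyMetricsStore(block_seconds=100)
store.write(10, 1.0)
store.write(150, 2.0)
store.upload_completed_blocks(now=160)  # only the block starting at 0 is done
store.evict_local(before=100)           # old block now lives only in the bucket
old_value = store.read(10)              # fetched from the bucket, then cached
new_value = store.read(150)             # served straight from local disk
```

In this toy model, the local disk only ever holds the newest block plus copies of whatever was read recently, which is what makes it behave like a cache.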

Disaster recovery? No need to worry! All important data is in object storage. Feel free to rclone it to another location and another instance of Thanos or Cortex will happily read it. We already use S3-compatible object storage for long-term logs and backups, so reusing a very widely available infrastructure service further simplifies operations and fosters portability across clouds.

All in all, both projects are amazing and fairly similar. They seem to have co-evolved. I bet that if a new feature lands in Thanos, Cortex will get it shortly too, and vice versa.

Okay, so now what? While we do want a resilient monitoring stack, we are not so fanatical as to run both Thanos and Cortex. In the end, there can be only one.

The overall winner is…

… Thanos!

Indeed, Thanos is our default metrics storage backend in Elastisys Compliant Kubernetes v0.19 and our only metrics storage backend starting with Elastisys Compliant Kubernetes v0.20.

Big Disclaimer

While we hope this selection is useful in many contexts, it is by no means to be taken as absolute. Each use-case is unique and brings unique requirements. Do not skip due diligence. You have been warned.

No production metrics were harmed in the making of this blog post.
