Guest post originally published on Medium’s blog by Dotan Horovits

TL;DR Key updates:

We just celebrated Prometheus' 10th birthday last month. Prometheus was the second project to join the Cloud Native Computing Foundation after Kubernetes, in 2016, and has quickly become the de facto way to monitor Kubernetes workloads. The plug-and-play experience of simply deploying the Prometheus server and seeing metrics flow in, tagged with Kubernetes labels, was a compelling offer. Today Prometheus is a graduated CNCF project, a mature project that is widely used to monitor Kubernetes and other frameworks. But it is by no means standing still.

Prometheus is a highly active project, featuring releases every six weeks. I was impressed by the volume of updates shared at PrometheusDay and KubeCon in Detroit two months ago, and then a few weeks later at PromCon and at the Open Source Monitoring Conference in Germany.

This drove me to devote the 2022 closing episode of OpenObservability Talks to the updates in Prometheus, where I hosted Julien Pivotto, one of the prominent maintainers of Prometheus. Julien is also the co-founder of O11y, a company that provides support for open source observability tools such as Prometheus, Thanos and Grafana. Julien is by far the authoritative source for all things Prometheus. So let's dive right in.

Agent Mode for fetch-and-forward to external backend

The Prometheus project offers many pieces of the monitoring solution puzzle, including service discovery, metric scraping capabilities, and the time series database for storing the metrics.

As the Prometheus ecosystem grew, people started using different backends to store the metrics, sometimes referred to as long-term storage solutions. This ecosystem includes open source projects such as Thanos, Cortex and Mimir, as well as various vendor solutions, all aligned with the Prometheus protocol (disclaimer: my company Logz.io offers such a managed service).

With the growing ecosystem of backend solutions, there's been growing demand for using Prometheus only for fetching the metrics, as a scrape-and-forward agent. This drove the release of the Prometheus Agent Mode last year.

The Agent Mode is a lightweight mode for Prometheus that takes away the heavyweight database, enabling the use of Prometheus for scraping metrics and forwarding them to external backends. You can read more about the Agent Mode in this post.

Prometheus architecture. Source: Prometheus.io
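To make this concrete, here is a minimal sketch of an agent-mode setup: a scrape job plus a remote write destination. The node_exporter target and the remote write URL are hypothetical placeholders, and the exact flags may vary by version:

```yaml
# prometheus-agent.yml -- illustrative sketch; run with something like:
#   prometheus --enable-feature=agent --config.file=prometheus-agent.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]   # hypothetical node_exporter target

# In agent mode there is no local querying or rule evaluation;
# everything scraped is forwarded to the remote backend.
remote_write:
  - url: "https://metrics.example.com/api/v1/write"   # hypothetical backend endpoint
```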

New Service Discovery mechanisms

Cloud native systems are highly distributed applications, spanning many nodes, services and instances. They are also very dynamic, with instances constantly spinning up and being torn down. How does Prometheus keep track of all the components from which it needs to collect metrics at any given point in time?

Prometheus uses a Service Discovery (SD) mechanism to connect to the “source of truth” of the underlying environment, whether public cloud, private cloud, Kubernetes or other, to dynamically learn about the different components in the system (Targets in Prometheus terms) from which it needs to scrape metrics, and to stay up to date on any changes in these targets.

Prometheus has Service Discovery mechanisms for AWS EC2, Azure, GCE, Docker, Kubernetes, Nomad, OpenStack and many others. This rich integration ecosystem is in fact one of Prometheus’ big selling points. SDs are being added at an impressive pace, with 10 SDs having been added in the past two years alone. Check out the docs on the different SDs and how to use them.
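As a quick illustration of what an SD integration looks like in practice, here is a minimal sketch of a Kubernetes SD scrape job. The prometheus.io/scrape annotation convention used in the relabeling step is a common community pattern rather than anything built into Prometheus:

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod            # discover every pod known to the Kubernetes API server
    relabel_configs:
      # keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```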

In addition to the file-based discovery, a new HTTP Service Discovery was recently added, which enables custom integrations with other sources. While file-based discovery required writing a file on the Prometheus server and running a sidecar alongside Prometheus to keep that file up to date, the HTTP SD opens the option to avoid sidecars altogether by simply querying an HTTP endpoint. Removing the need for a sidecar makes the monitoring architecture more robust and reduces the barrier to entry in using Prometheus.
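A rough sketch of an HTTP SD scrape job follows; the discovery endpoint URL is a made-up placeholder, and the endpoint is expected to return the list of target groups as JSON:

```yaml
scrape_configs:
  - job_name: "custom-targets"
    http_sd_configs:
      - url: "http://sd.example.com/targets"   # hypothetical discovery endpoint
        refresh_interval: 60s
# The endpoint should respond with JSON target groups, e.g.:
# [{"targets": ["10.0.0.1:9100"], "labels": {"env": "prod"}}]
```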

The rich SD ecosystem is powerful, but it can bloat the binary. Indeed, Prometheus’ config package used to depend on all the service discoveries. A recent plugin system now enables users to include or exclude SDs at build time according to their needs. This flexibility also enables users to compose their own plugin, in case it is a custom or proprietary one that cannot be shared upstream, or if timeline constraints cannot tolerate waiting for it to be integrated upstream.
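As a rough, illustrative sketch of how the build-time selection works: the Prometheus source tree carries a plugins.yml file listing the SD packages compiled into the binary, and removing entries (then rebuilding) excludes them. The file below is trimmed for illustration and is not the full upstream list:

```yaml
# plugins.yml (in the Prometheus source tree) -- trimmed, illustrative sketch
# Remove or comment out an entry and rebuild (e.g. `make build`) to exclude
# that service discovery from the compiled binary.
- github.com/prometheus/prometheus/discovery/kubernetes
- github.com/prometheus/prometheus/discovery/http
# - github.com/prometheus/prometheus/discovery/marathon   # excluded in this sketch
```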

Native Histograms and Exemplars support in the database

Not many know that Prometheus drew inspiration from Facebook’s Gorilla paper for the design of its time series database. The database is a mature and stable core component of Prometheus, but there are some exciting innovations there as well.

An important update that came at KubeCon NA was the introduction of native high-resolution histograms. Histograms are a metric type that samples the distribution of metric values by counting events in configurable “buckets” or “bins” (you can read more about metric types here). Previously you had to define the buckets of interest upfront (say, one second, two seconds, four seconds, eight seconds), which may not have been what you actually needed later on, when investigating an actual incident. Native histograms remove this prerequisite and enable users to calculate percentiles more easily and with greater accuracy. Note that native histograms are still an experimental feature.
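To illustrate the difference on the query side, here is a sketch of two recording rules computing a 99th percentile latency, one from a classic bucketed histogram and one from a native histogram. The metric name http_request_duration_seconds is hypothetical, and the native variant assumes the --enable-feature=native-histograms flag plus an instrumented client:

```yaml
groups:
  - name: latency
    rules:
      # Classic histogram: the quantile is estimated from the pre-defined
      # <metric>_bucket series, so accuracy depends on the upfront bucket layout.
      - record: job:http_request_duration_seconds:p99_classic
        expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
      # Native histogram: the quantile is computed from the histogram series itself,
      # with no `le` label and no upfront bucket choice.
      - record: job:http_request_duration_seconds:p99_native
        expr: histogram_quantile(0.99, sum by (job) (rate(http_request_duration_seconds[5m])))
```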

Another important addition to Prometheus is Exemplars, which enable metric-trace correlation. When a certain performance metric spikes, we often want to investigate it further by looking at a specific trace of that occurrence, to see where the latency is coming from. With the right exemplar attached to the metric at the right time and with the right value, that becomes straightforward to do.
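A minimal sketch of wiring this up on the Prometheus side, assuming your client libraries already attach trace IDs as exemplars: exemplar storage sits behind a feature flag, and remote write can be told to forward exemplars to the backend. The URL is a placeholder:

```yaml
# prometheus.yml fragment -- illustrative; exemplar storage requires starting
# Prometheus with --enable-feature=exemplar-storage
remote_write:
  - url: "https://metrics.example.com/api/v1/write"   # hypothetical backend
    send_exemplars: true    # forward exemplars (e.g. trace IDs) alongside samples
```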

PromQL query language enhancements

Prometheus introduced PromQL, a very powerful query language based on labeling. The labeling model proved to be highly effective and flexible for high cardinality metrics and ad-hoc slicing and dicing by different dimensions. This was especially clear in comparison to the hierarchical model employed by Graphite at the time.

New use cases for Prometheus in the scientific community drove the introduction of new functions to PromQL, such as the sign function and trigonometric functions. PromQL’s offset modifier lets you shift a data selection back by a given amount of time, and PromQL now also supports negative offsets, which shift the selection forward.
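Here is a small, illustrative set of recording rules exercising some of these newer constructs; the metric names are hypothetical:

```yaml
groups:
  - name: promql-examples
    rules:
      # sgn() returns -1, 0 or 1 depending on the sign of the sample value
      - record: sensor:temperature_delta:sign
        expr: sgn(sensor_temperature_celsius - 20)
      # trigonometric functions such as sin(), plus the rad()/deg() helpers
      - record: antenna:angle:sine
        expr: sin(rad(antenna_angle_degrees))
      # a negative offset shifts the selection forward relative to the evaluation time
      - record: traffic:requests:rate_one_hour_ahead
        expr: rate(http_requests_total[5m] offset -1h)
```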

PromLens query builder tool is now part of Prometheus

PromQL is indeed a powerful query language, but building complex queries can get tricky. This is where a good UI can help. PromLens is a query builder, a tool for building, understanding, and debugging complex PromQL queries.

Julius Volz, one of the creators of Prometheus, together with Chronosphere, has just recently open-sourced PromLens and contributed it to the Prometheus project. A great addition to the Prometheus stack. Some parts of PromLens will probably become directly available inside the Prometheus UI.

Prometheus UI improvements

The Prometheus UI hasn’t been the strongest part of the stack. But the community has been investing in uplifting it, and not just with the upcoming PromLens capabilities. The entire UI was migrated to a React-based UI to improve performance. Given the high scale at which Prometheus is used, which can reach thousands of targets, much effort was put into smarter loading, to avoid loading everything at once and getting the UI stuck.

The UI now offers query auto-completion suggestions based on the metrics in your Prometheus server and on the PromQL syntax. And let’s not forget the cool factor of the dark mode.

Alertmanager now supports Telegram, Discord and WebEx

Alertmanager enables dispatching alerts on your metrics to various destinations. It offers a rich suite of integrations with notification applications and protocols such as email, Slack, PagerDuty, OpsGenie and more. Recently, integrations with Telegram, Discord and Cisco Webex were added to Alertmanager. Alertmanager not only takes care of routing notifications to the correct receiver, but it also takes care of deduplication, grouping, silencing notifications (you don’t need to get alerts overnight if there’s an on-call team, right?) and more. There have also been improvements in memory consumption, and more. For the latest list of receivers, check out the docs.
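For illustration, here is a sketch of what the new receivers look like in an Alertmanager configuration; the tokens, IDs and URLs are placeholders, and the exact field set may differ by version:

```yaml
# alertmanager.yml fragment -- illustrative receiver definitions
receivers:
  - name: "team-chat"
    telegram_configs:
      - bot_token: "<telegram-bot-token>"
        chat_id: -1234567890          # target Telegram chat
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/<id>/<token>"
    webex_configs:
      - room_id: "<webex-room-id>"
        # Webex authentication is typically supplied via http_config (bot access token)
```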

New Long Term Support for Prometheus

Not every company can keep up with the project’s releases every six weeks. Larger organizations and enterprises move slower in upgrading their software, and may still be running on older versions for months or even a year. These enterprises can do without the latest features, but they still need basic fixes. A new Long Term Support (LTS) policy for Prometheus, announced last month at PromCon, will provide bug fixes, security fixes and documentation fixes for a year to selected releases, starting with v2.37. This is a highly important enabler to make Prometheus (and other open source projects) enterprise grade. You can read more about LTS here.

More updates

Are these all the updates about Prometheus? Not at all! You can check out the talks from last month’s PromCon, as well as the project’s Slack channels, Discourse forum, and of course GitHub.

And what about the Prometheus ecosystem? In my fireside chat with Julien, we also discussed the popular Grafana project and the new Perses project that aims to provide a CNCF-native alternative, as well as CNCF’s Thanos and Cortex, Mimir which forked off of Cortex and out of the CNCF, and more.

Want to learn more? Check out the OpenObservability Talks latest episode: What’s new in the Prometheus ecosystem? on Apple Podcasts, Spotify, or other apps.