All Posts By

Luc Perkins

Cortex: a multi-tenant, horizontally scalable Prometheus-as-a-Service


Prometheus is one of the standard-bearing open-source solutions for monitoring and observability. From its humble origins at SoundCloud in 2012, Prometheus quickly garnered widespread adoption and later became one of the first CNCF projects and just the second to graduate (after Kubernetes). It’s used by tons of forward-thinking companies in production, including heavyweights like DigitalOcean, Fastly, and Weaveworks, and has its own dedicated yearly conference, PromCon.

Prometheus: powerful but intentionally limited

Prometheus has succeeded in part because the core Prometheus server and its various complements, such as Alertmanager, Grafana, and the exporter ecosystem, form a compelling end-to-end solution to a crucial but difficult problem. Prometheus does not, however, provide some of the capabilities that you’d expect from a full-fledged “as-a-Service” platform, such as multi-tenancy, authentication and authorization, and built-in long-term storage.

Cortex, which joined the CNCF in September as a sandbox project, is an open-source Prometheus-as-a-Service platform that seeks to fill those gaps and to thereby provide a complete, secure, multi-tenant Prometheus experience. I’ll say a lot about Cortex further down; first, let’s take a brief excursion into the more familiar world of Prometheus to get our bearings.

Why Prometheus?

As a CNCF developer advocate, I’ve had the opportunity to become closely acquainted both with the Prometheus community and with Prometheus as a tool (mostly working on docs and the Prometheus Playground). Its great success is really no surprise to me for a variety of reasons:

  • Prometheus instances are easy to deploy and manage. I’m especially fond of near-instant configuration reloading and of the fact that all Prometheus components are available as static binaries.
  • Prometheus offers a simple and easily adoptable metrics exposition format that makes it easy to write your own metrics exporters. This format is even being turned into an open standard via the OpenMetrics project (which also recently joined the CNCF sandbox).
  • Prometheus offers a simple but powerful label-based querying language, PromQL, for working with time series data. I find PromQL to be highly intuitive.
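
To give a feel for how approachable the exposition format is, here is a minimal sketch of a renderer for the text format. The metric name, labels, and values are made up for illustration, and the helper names (escapeLabelValue, renderMetric) are hypothetical; a real exporter would normally lean on an official client library rather than hand-rolling this.

```javascript
// Render samples in the Prometheus text exposition format.
// All metric names, labels, and values here are invented for illustration.
function escapeLabelValue(value) {
  // Backslash, double quote, and newline must be escaped inside label values.
  return String(value)
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

function renderMetric({ name, help, type, samples }) {
  // Each metric family starts with HELP and TYPE comment lines.
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} ${type}`];
  for (const { labels = {}, value } of samples) {
    const pairs = Object.entries(labels)
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(",");
    const labelPart = pairs ? `{${pairs}}` : "";
    lines.push(`${name}${labelPart} ${value}`);
  }
  return lines.join("\n") + "\n";
}

const exposition = renderMetric({
  name: "http_requests_total",
  help: "Total HTTP requests handled.",
  type: "counter",
  samples: [
    { labels: { method: "get", code: "200" }, value: 1027 },
    { labels: { method: "post", code: "400" }, value: 3 },
  ],
});

console.log(exposition);
```

Each sample becomes one line of the form name{label="value"} number, which is what a Prometheus server scrapes from an exporter's metrics endpoint.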

Why Prometheus-as-a-Service?

Early on, Prometheus’ core engineers made the wise decision to keep Prometheus lean and composable. From the get-go, Prometheus was designed to do a small set of things very well and to work seamlessly in conjunction with other, optional components (rather than overburdening Prometheus with an ever-growing array of hard-coded features and integrations). Here are some things that Prometheus was not meant to provide:

  • Long-term storage — Individual Prometheus instances provide durable storage of time series data, but they do not act as a distributed data storage system with features like cross-node replication and automatic repair. This means that durability guarantees are limited to those of a single machine. Fortunately, Prometheus offers a remote write API that can be used to pipe time series data to other systems.
  • A global view of data — As described in the bullet point above, Prometheus instances act as isolated data storage units. Prometheus instances can be federated, but that adds a lot of complexity to a Prometheus setup, and again, Prometheus simply wasn’t built as a distributed database. This means that there’s no simple path to achieving a single, consistent, “global” view of your time series data.
  • Multi-tenancy — Prometheus by itself has no built-in concept of a tenant. This means that it can’t provide any sort of fine-grained control over things like tenant-specific data access and resource usage quotas.

Why Cortex?

As a Prometheus-as-a-Service platform, Cortex fills in all of these crucial gaps with aplomb and thus provides a complete out-of-the-box solution for even the most demanding monitoring and observability use cases.

  • It supports four long-term storage systems out of the box: AWS DynamoDB, AWS S3, Apache Cassandra, and Google Cloud Bigtable.
  • It offers a global view of Prometheus time series data that includes data in long-term storage, greatly expanding the usefulness of PromQL for analytical purposes.
  • It has multi-tenancy built into its very core. All Prometheus metrics that pass through Cortex are associated with a specific tenant.

The architecture of Cortex

Cortex has a fundamentally service-based design, with its essential functions split up into single-purpose components that can be independently scaled:

  • Distributor — Handles time series data written to Cortex by Prometheus instances using Prometheus’ remote write API. Incoming data is automatically replicated and sharded, and sent to multiple Cortex ingesters in parallel.
  • Ingester — Receives time series data from distributor nodes and then writes that data to long-term storage backends, compressing data into Prometheus chunks for efficiency.
  • Ruler — Executes rules and generates alerts, sending them to Alertmanager (Cortex installations include Alertmanager).
  • Querier — Handles PromQL queries from clients (including Grafana dashboards), abstracting over both ephemeral time series data and samples in long-term storage.
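
The distributor's job of sharding and replicating incoming series can be pictured with a toy hash ring. This is a sketch of the general technique only, not Cortex's actual implementation: the ingester names, the token count, the key format, and the function names (fnv1a32, buildRing, pickIngesters) are all assumptions made for illustration.

```javascript
// Toy hash ring: a sketch of the general technique, NOT Cortex's own code.
// 32-bit FNV-1a over the string's UTF-16 code units (plain ASCII in practice).
function fnv1a32(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Every ingester owns several tokens scattered around the ring.
function buildRing(ingesters, tokensPerIngester) {
  const ring = [];
  for (const ingester of ingesters) {
    for (let i = 0; i < tokensPerIngester; i++) {
      ring.push({ token: fnv1a32(`${ingester}#${i}`), ingester });
    }
  }
  return ring.sort((a, b) => a.token - b.token);
}

// Hash the tenant and series, then walk the ring clockwise until
// replicationFactor distinct ingesters have been collected.
function pickIngesters(ring, tenantId, seriesKey, replicationFactor) {
  const target = fnv1a32(`${tenantId}/${seriesKey}`);
  let start = ring.findIndex((entry) => entry.token >= target);
  if (start === -1) start = 0; // wrap around past the largest token
  const chosen = [];
  for (let step = 0; step < ring.length && chosen.length < replicationFactor; step++) {
    const { ingester } = ring[(start + step) % ring.length];
    if (!chosen.includes(ingester)) chosen.push(ingester);
  }
  return chosen;
}

const ring = buildRing(["ingester-1", "ingester-2", "ingester-3", "ingester-4"], 16);
const replicas = pickIngesters(ring, "team-a", 'http_requests_total{method="get"}', 3);
console.log(replicas); // three distinct ingesters for this series
```

Because the choice depends only on the hash of the tenant and series, any distributor holding the same ring sends a given series to the same set of ingesters, which is what lets distributors scale out independently.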

Each of these components can be managed independently, which is key to Cortex’s scalability and operations story. You can see a basic diagram of Cortex and the systems it interacts with below:

As the diagram shows, Cortex “completes” the Prometheus Monitoring System. To adapt it to existing Prometheus installations, you just need to re-configure your Prometheus instances to remote write to your Cortex cluster and Cortex handles the rest.


Multi-tenancy

Single-tenant systems tend to be fine for small use cases and non-production environments, but for large organizations with a plethora of teams, use cases, and environments, those systems become untenable (no pun intended). To meet the exacting requirements of such large organizations, Cortex provides multi-tenancy not as an add-on or a plugin but rather as a first-class capability.

Multi-tenancy is woven into the very fabric of Cortex. All time series data that arrives in Cortex from Prometheus instances is marked as belonging to a specific tenant in the request metadata. From there, that data can only be queried by the same tenant. Alerting is multi-tenant as well, with each tenant able to configure its own alerts using Alertmanager configuration.

In essence, each tenant has its own “view” of the system, its own Prometheus-centric world at its disposal. And if you do use Cortex in a single-tenant fashion, you can expand out to an indefinitely large pool of tenants at any time.
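
The isolation described above can be sketched with a toy in-memory store. Cortex identifies the tenant through the X-Scope-OrgID HTTP header (a detail not spelled out in this post); everything else here, including the TenantScopedStore class and its methods, is a hypothetical model rather than Cortex code.

```javascript
// Toy model of tenant isolation; class and method names are hypothetical.
// Header keys are expected in lowercase, as Node.js delivers them.
const TENANT_HEADER = "x-scope-orgid";

class TenantScopedStore {
  constructor() {
    this.samplesByTenant = new Map(); // tenant ID -> array of samples
  }

  tenantFrom(headers) {
    const tenantId = headers[TENANT_HEADER];
    if (!tenantId) throw new Error("request carries no tenant ID");
    return tenantId;
  }

  write(headers, sample) {
    const tenantId = this.tenantFrom(headers);
    if (!this.samplesByTenant.has(tenantId)) this.samplesByTenant.set(tenantId, []);
    this.samplesByTenant.get(tenantId).push(sample);
  }

  // A query only ever sees samples written under the caller's tenant ID.
  query(headers, metricName) {
    const samples = this.samplesByTenant.get(this.tenantFrom(headers)) || [];
    return samples.filter((sample) => sample.name === metricName);
  }
}

const store = new TenantScopedStore();
store.write({ "x-scope-orgid": "team-a" }, { name: "up", value: 1 });
store.write({ "x-scope-orgid": "team-b" }, { name: "up", value: 0 });

const teamAView = store.query({ "x-scope-orgid": "team-a" }, "up");
const teamCView = store.query({ "x-scope-orgid": "team-c" }, "up");
console.log(teamAView, teamCView);
```

Both tenants write a metric with the same name, yet each query returns only the caller's own samples, and a tenant that has written nothing sees an empty result.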

Use cases

Several years into its development, users of Cortex have tended to cluster into two broad categories:

  1. Service providers building hosted, managed platforms offering a monitoring and observability component. If you were building a Platform-as-a-Service offering like Heroku or Google App Engine, for example, Cortex would enable you to provide each application running on the platform with the full spectrum of capabilities provided by Prometheus and to treat each application (or perhaps each account or customer) as a separate tenant of the system. Weave Cloud and Grafana Labs are examples of comprehensive cloud platforms that use Cortex to enable customers to use Prometheus to the fullest.
  2. Enterprises with many internal customers running their own apps, services, and “stacks.” EA and StorageOS are examples of large enterprises that have benefited from Cortex.

Cortex, the Prometheus ecosystem, and the CNCF

Cortex has some highly compelling technological bona fides, but under the current industry Zeitgeist I think it’s important to point out its open source bona fides as well:

  • Cortex is under the Apache 2.0 license and backed by the CNCF.
  • It’s tightly coupled only with other Apache 2.0 CNCF projects, with no strong interlinkages with closed-source, proprietary, or vendor-specific technologies.
  • Project collaborators include Prometheus core maintainers like Goutham Veeramachaneni and Tom Wilkie, as well as engineers from companies like Weaveworks, Grafana Labs, Platform9, and others heavily invested in the monitoring and observability space.
  • Cortex is already running in production powering Weave Cloud and Grafana Cloud, two cloud offerings (and core contributors) whose success is crucially dependent on the future trajectory of Cortex.

With the addition of Cortex to the CNCF sandbox, there are now three Prometheus-related projects under the CNCF umbrella (including Prometheus itself and OpenMetrics). We know that monitoring and observability are essential components of the cloud native paradigm, and we’re happy to see continued convergence around some of the core primitives that have organically emerged from the Prometheus community. The Cortex project is energetically carrying this work forward, and I’m excited to see the Prometheus-as-a-Service offshoot of the Prometheus ecosystem take shape.



gRPC-Web is going GA


On behalf of the Cloud Native Computing Foundation, I’m excited to announce the GA release of gRPC-Web, a JavaScript client library that enables web apps to communicate directly with backend gRPC services, without requiring an HTTP server to act as an intermediary. This means that you can now easily build truly end-to-end gRPC application architectures by defining your client- and server-side data types and service interfaces using .proto files. gRPC-Web thus provides a compelling new alternative to the entire REST paradigm of web development.

The basics

gRPC-Web enables you to define the service “contract” between client web applications and backend gRPC servers using .proto definitions and auto-generate client JavaScript (you can choose between Closure compiler JavaScript or the more widely used CommonJS). What you get to leave out of the development process: creating custom JSON serialization and deserialization logic, wrangling HTTP status codes (which can vary across REST APIs), content type negotiation, etc.
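
For a sense of what gets left out, here is a rough sketch of the hand-written glue a REST client tends to accumulate. The response shape and the todo payload fields are hypothetical, chosen only to illustrate the status-code, content-type, and deserialization chores listed above.

```javascript
// Hand-written glue a REST client typically needs. The response object shape
// and the payload fields (content, finished) are invented for illustration.
function parseTodoResponse(response) {
  // Status handling: what 404 means is a per-API convention.
  if (response.status === 404) return null;
  if (response.status < 200 || response.status >= 300) {
    throw new Error(`unexpected HTTP status ${response.status}`);
  }
  // Content type check before attempting to parse the body.
  const contentType = response.headers["content-type"] || "";
  if (!contentType.startsWith("application/json")) {
    throw new Error(`unexpected content type: ${contentType}`);
  }
  // Manual deserialization: validate and copy each field by hand.
  const payload = JSON.parse(response.body);
  if (typeof payload.content !== "string" || typeof payload.finished !== "boolean") {
    throw new Error("malformed todo payload");
  }
  return { content: payload.content, finished: payload.finished };
}

const parsedTodo = parseTodoResponse({
  status: 200,
  headers: { "content-type": "application/json; charset=utf-8" },
  body: '{"content":"Write docs","finished":false}',
});
console.log(parsedTodo);
```

With a .proto contract, the equivalent checks live in generated code: field types come from the message definition and errors arrive as gRPC status codes rather than ad hoc HTTP conventions.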

From a broader architectural perspective, what gRPC-Web makes possible is end-to-end gRPC. The diagram below illustrates this:

In the gRPC-Web universe on the left, a client application speaks Protocol Buffers to a gRPC backend server that speaks Protocol Buffers to other gRPC backend services. In the REST universe on the right, the web app speaks HTTP to a backend REST API server that then speaks Protocol Buffers to backend services.

To be clear, there’s nothing wrong with the REST application on the right per se. Tons of highly successful applications have been built using REST API servers that communicate with backend services using non-HTTP protocols. But imagine those applications’ development processes united around a single protocol and a single set of .proto interfaces (and thus a single set of service contracts) and you can almost feel the countless hours saved and headaches avoided. The benefits of gRPC-Web aren’t “merely” technological; they’re organizational as well. That bright orange line isn’t just a different protocol—it’s an independent source of work and cognitive load that you can now easily turn bright green.

Advantages of using gRPC-Web

gRPC-Web will offer an ever-broader feature set over time. But I can see it offering a handful of big wins from the get-go:

  • End-to-end gRPC — As mentioned above, with gRPC-Web you can officially cut the REST component out of your stack and replace it with pure gRPC, enabling you to craft your entire RPC pipeline using Protocol Buffers. Imagine a scenario in which a client request goes to an HTTP server, which then interacts with 5 backend gRPC services. There’s a good chance that you’ll spend as much time building the HTTP interaction layer as you will building the entire rest of the pipeline.
  • Tighter coordination between frontend and backend teams — Think back to the diagram up above. With the entire RPC pipeline defined using Protocol Buffers, you no longer need to have your “microservices teams” alongside your “client team.” The client-backend interaction is just one more gRPC layer amongst others. I honestly have yet to fully grasp the implications for end-to-end testing, service mesh integration, continuous integration/delivery, and more.
  • Easily generate client libraries — With gRPC-Web, the server that interacts with the “outside” world, i.e. the membrane connecting your backend stack to the internet, is now a gRPC server instead of an HTTP server, which means that all of your service’s client libraries can be gRPC libraries. Need client libraries for Ruby, Python, Java, and 4 other languages? You no longer need to write HTTP clients for all of them.

A gRPC-Web example

The previous section illustrated some of the high-level advantages of gRPC-Web for large-scale applications. Now let’s get closer to the metal with an example: a simple TODO app. In gRPC-Web you can start with a simple todos.proto definition like this:

You can use that .proto definition to generate CommonJS client-side code using the protoc command-line tool:

Now fetching a list of TODOs from a backend gRPC server can be as simple as this:

Again, no HTTP codes or methods, no JSON parsing, no header negotiation. You declare data types and a service interface and gRPC-Web abstracts away all the “hard wiring” boilerplate, leaving you with a clean and human-friendly API (essentially the same API as the current gRPC API for Node.js, just transferred to the client).
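
To make the shape of that client API concrete, here is a minimal, self-contained sketch. The classes below are hand-written stand-ins for the code protoc would generate from a todos.proto file, and every message, field, and method name (ListTodosRequest, getTodosList, listTodos, and so on) is an assumption for illustration. The one detail taken from real generated gRPC-Web clients is the unary call shape: request, metadata, then a callback receiving an error and a response.

```javascript
// Stand-ins for generated message classes; all names are hypothetical.
class Todo {
  constructor(content, finished) {
    this.content = content;
    this.finished = finished;
  }
  getContent() { return this.content; }
  getFinished() { return this.finished; }
}

class ListTodosRequest {}

class ListTodosResponse {
  constructor(todos) { this.todos = todos; }
  getTodosList() { return this.todos; }
}

// Fake in-memory client. A real generated client would send the request to a
// gRPC-Web proxy and invoke the callback asynchronously; this one answers
// immediately so the example runs without a server.
class FakeTodoServiceClient {
  constructor(todos) { this.todos = todos; }
  listTodos(request, metadata, callback) {
    callback(null, new ListTodosResponse(this.todos));
  }
}

const client = new FakeTodoServiceClient([
  new Todo("Write docs", false),
  new Todo("Ship release", true),
]);

let fetchedTodos = [];
client.listTodos(new ListTodosRequest(), {}, (err, response) => {
  if (err) {
    // With gRPC, failures surface as a status code plus a message.
    console.error(`gRPC error ${err.code}: ${err.message}`);
    return;
  }
  fetchedTodos = response.getTodosList().map((todo) => ({
    content: todo.getContent(),
    finished: todo.getFinished(),
  }));
  console.log(fetchedTodos);
});
```

The calling code deals only in typed messages and getters; there is no URL building, status code branching, or JSON parsing on the application side.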

On the backend, the gRPC server can be written in any language that supports gRPC, including Go, Java, C++, Ruby, Node.js, and many others (see language-specific docs in the official gRPC docs). The last piece of the puzzle is the service proxy. From the get-go, gRPC-Web will support Envoy as the default service proxy, which has a built-in envoy.grpc_web filter that you can apply with just a few lines of copy-and-pastable configuration. I’ll be saying more about this on the Envoy blog soon.

Next steps

Going GA means that the core building blocks are firmly in place and ready for usage in production web applications. But there’s still much more to come for gRPC-Web. Check out the official roadmap to see what the core team envisions for the near future.

If you’re interested in contributing to gRPC-Web, there are a few things that the core team would love community help with:

  • Front-end framework integration — Commonly used front-end frameworks like React, Angular, and Vue don’t yet offer official support for gRPC-Web. But we would love to see these frameworks support it, as each of them would greatly benefit from gRPC.
  • Language-specific proxy support — As of the GA release, Envoy is the default proxy for gRPC-Web, offering support via a special module. But we’d also love to see development of in-process proxies for specific languages. In-process proxies obviate the need for special proxies—such as Envoy and nginx—and would make using gRPC-Web even easier.

We’d also love to get feature requests from the community. Currently the best way to make feature requests is to fill out the gRPC-Web roadmap features survey. Just list features you’d like to see and also let us know if you’d like to contribute to those features’ development in the “I’d like to contribute to” section. The gRPC-Web engineers will be sure to take that information to heart over the course of the project’s development.