Guest post originally published on InfraCloud’s blog by Sayan Das, DevOps Engineer at InfraCloud Technologies

Hey there! If you are reading this blog post, then I guess you are already aware of Prometheus and how it helps us monitor distributed systems like Kubernetes. And if you are familiar with Prometheus, then chances are that you have come across Thanos, a popular open source project that helps enterprises achieve a highly available Prometheus setup with long-term storage capabilities. One of the common challenges of distributed monitoring is implementing multi-tenancy, and Thanos Receiver is the Thanos component designed to address exactly that. Receiver was part of Thanos for a long time as an experimental feature; recently, it went GA.

Motivation

We tried this component with one of our clients, and it worked well. However, due to the lack of documentation, the setup wasn’t as smooth as we would have liked it to be. The purpose of this blog post is to lay out a simple guide for those looking to create a multi-tenant monitoring setup using Prometheus and Thanos Receiver. We will use Thanos Receiver to achieve a simple multi-tenant monitoring setup in which the Prometheus on the tenant side can be a nearly stateless component.

A few words on Thanos Receiver

Receiver is a Thanos component that can accept remote write requests from any Prometheus instance and store the data in its local TSDB; optionally, it can upload those TSDB blocks to an object storage like S3 or GCS at regular intervals. Receiver does this by implementing the Prometheus Remote Write API. It builds on top of the existing Prometheus TSDB and retains its usefulness while extending its functionality with long-term storage, horizontal scalability, and downsampling. It exposes the StoreAPI so that Thanos Queriers can query received metrics in real time.

Multi-tenancy

Thanos Receiver supports multi-tenancy. It accepts Prometheus remote write requests and writes them into a local instance of the Prometheus TSDB. The value of the HTTP header (THANOS-TENANT) of the incoming request determines the ID of the tenant Prometheus. To prevent data leaking at the database level, each tenant gets an individual TSDB instance, meaning a single Thanos Receiver may manage multiple TSDB instances. A request returns successfully once the data is committed to the tenant’s TSDB. Thanos Receiver also supports multi-tenancy by exposing labels, similar to Prometheus external labels.
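To make this concrete, here is a minimal sketch of how a tenant’s Prometheus could tag its remote-write requests with that header. Note that the headers field in the remote_write config requires a reasonably recent Prometheus; in this demo we instead inject the header with an nginx proxy in front of the receiver, and the endpoint below is a placeholder:

# prometheus.yml (sketch): remote-write to a Thanos receiver, tagging the
# tenant via the THANOS-TENANT header. The URL below is hypothetical.
remote_write:
  - url: http://thanos-receive.example.local:19291/api/v1/receive
    headers:
      THANOS-TENANT: tenant-a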

Hashring configuration file

If we want features like load balancing and data replication, we can run multiple instances of Thanos Receiver as part of a single hashring. The receiver instances within the same hashring become aware of their peers through a hashring configuration file. The following is an example of a hashring configuration file:

[
   {
       "hashring": "tenant-a",
       "endpoints": ["tenant-a-1.metrics.local:19291/api/v1/receive", "tenant-a-2.metrics.local:19291/api/v1/receive"],
       "tenants": ["tenant-a"]
   },
   {
       "hashring": "tenants-b-c",
       "endpoints": ["tenant-b-c-1.metrics.local:19291/api/v1/receive", "tenant-b-c-2.metrics.local:19291/api/v1/receive"],
       "tenants": ["tenant-b", "tenant-c"]
   },
   {
       "hashring": "soft-tenants",
       "endpoints": ["http://soft-tenants-1.metrics.local:19291/api/v1/receive"]
   }
]
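In the example above, the first two hashrings pin explicitly listed (“hard”) tenants to specific receiver endpoints, while the last entry omits the tenants list and therefore acts as a catch-all for “soft” tenants whose tenant IDs do not match any other hashring.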

Architecture

In this blog post, we will implement the following architecture, using Thanos v0.14 throughout.

A simple multi-tenant monitoring model with Prometheus and Thanos Receiver

Brief overview of the above architecture:

The above architecture admittedly misses a few features that one would expect from a multi-tenant architecture, e.g. tenant isolation, authentication, etc. This blog post only focuses on how we can use the Thanos Receiver to store time series from multiple Prometheus instances to achieve multi-tenancy. The idea behind this setup is also to show how we can make the Prometheus on the tenant side nearly stateless yet maintain data resiliency.

We will improve this architecture in upcoming posts. So, stay tuned.

Prerequisites

To follow along with this demo, you will need docker, kind, kubectl, helm, and (optionally) jq installed locally.

Cluster setup

Clone the repo:

 git clone https://github.com/dsayan154/thanos-receiver-demo.git 

Set up a local kind cluster

  1. cd local-cluster/
  2. Create the cluster with calico, ingress and extra-port mappings: ./create-cluster.sh cluster-1 kind-calico-cluster-1.yaml
  3. Deploy the nginx ingress controller: kubectl apply -f nginx-ingress-controller.yaml
  4. cd -
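For reference, a kind config like kind-calico-cluster-1.yaml plausibly disables the default CNI (so Calico can be installed instead) and maps the ingress ports to the host. A minimal sketch; the actual file in the repo is authoritative:

# kind cluster config (sketch)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true   # Calico is installed separately
nodes:
  - role: control-plane
    extraPortMappings:
      - containerPort: 80    # for the nginx ingress controller
        hostPort: 80
      - containerPort: 443
        hostPort: 443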

Install minio as object storage

  1. kubectl create ns minio
  2. helm repo add bitnami https://charts.bitnami.com/bitnami
  3. helm upgrade --install --namespace minio my-minio bitnami/minio --set ingress.enabled=true --set accessKey.password=minio --set secretKey.password=minio123 --debug
  4. Add the following line to /etc/hosts: 127.0.0.1 minio.local
  5. Log in to http://minio.local/ with the credentials minio:minio123
  6. Create a bucket named thanos from the UI (or from the CLI, as shown below)
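Alternatively, if you have the MinIO client (mc) installed, the bucket can be created from the command line. A sketch, assuming the hostname and credentials above:

# point mc at the local minio and create the bucket
mc alias set local http://minio.local minio minio123
mc mb local/thanos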

Install Thanos Components

Create shared components

kubectl create ns thanos 

## Create a file thanos-s3.yaml containing the minio object storage config:
cat << EOF > thanos-s3.yaml
type: S3
config:
  bucket: "thanos"
  endpoint: "my-minio.minio.svc.cluster.local:9000"
  access_key: "minio"
  secret_key: "minio123"
  insecure: true
EOF

## Create a secret from the file created above, to be used with the thanos components, e.g. store and receiver
kubectl -n thanos create secret generic thanos-objectstorage --from-file=thanos-s3.yaml
kubectl -n thanos label secrets thanos-objectstorage part-of=thanos

## Go to the manifests directory
cd manifests/

Install Thanos Receive Controller

Deploy the thanos-receive-controller, which generates and updates the hashring configmap based on the receiver statefulsets; the receiver pods pick up hashring changes from this configmap. Its manifests are in the manifests/ directory of the cloned repo.

Install Thanos Receiver

  1. Create the thanos-receiver statefulsets and headless services for the soft and hard tenants (we are not using persistent volumes, just for this demo):
     kubectl apply -f thanos-receive-default.yaml
     kubectl apply -f thanos-receive-hashring-0.yaml
     The receiver pods are configured to store 15d of data with a replication factor of 2 (a sketch of the receiver flags follows this list).
  2. Create a service in front of the thanos receiver statefulset for the soft tenants:
     kubectl apply -f thanos-receive-service.yaml
     The pods of the thanos-receive-default statefulset load-balance the incoming requests to the other receiver pods, based on the hashring config maintained by the thanos-receive-controller.
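For context, the retention, replication, and hashring behavior mentioned above map to thanos receive flags roughly like this sketch (assumed values; the actual manifests in the repo are authoritative):

thanos receive \
  --tsdb.path=/var/thanos/receive \
  --tsdb.retention=15d \
  --receive.replication-factor=2 \
  --receive.hashrings-file=/etc/thanos/hashrings.json \
  --receive.local-endpoint=127.0.0.1:10901 \
  --objstore.config-file=/etc/thanos/thanos-s3.yaml \
  --label 'receive_replica="0"'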

Install Thanos Store

  1. Create the thanos store statefulset:
     kubectl apply -f thanos-store-shard-0.yaml
     We have configured it such that the thanos querier fans out queries to the store only for data older than 2w, while data from the last 15d is served by the receiver pods. The 1d overlap between the two time windows is intentional, for data resiliency (a sketch of the relevant store flags follows this list).
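The “older than 2w” cutoff can be expressed with the store gateway’s time-based partitioning flags. A sketch, with assumed paths:

thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=/etc/thanos/thanos-s3.yaml \
  --max-time=-2w   # only advertise and serve blocks older than 2 weeks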

Install Thanos Querier

  1. Create a thanos querier deployment and expose it through a service and ingress:
     kubectl apply -f thanos-query.yaml
     We configure the thanos querier to connect to the receiver(s) and store(s) for fanning out queries (a sketch of the querier flags follows this list).
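The fan-out targets are passed to the querier as --store flags (gRPC endpoints). A sketch using DNS service discovery, with service names assumed from the manifests above:

thanos query \
  --http-address=0.0.0.0:9090 \
  --store=dnssrv+_grpc._tcp.thanos-receive-default.thanos.svc.cluster.local \
  --store=dnssrv+_grpc._tcp.thanos-receive-hashring-0.thanos.svc.cluster.local \
  --store=dnssrv+_grpc._tcp.thanos-store-shard-0.thanos.svc.cluster.local \
  --query.replica-label=receive_replica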

Install Prometheus(es)

Create shared resources

kubectl create ns sre 
kubectl create ns tenant-a
kubectl create ns tenant-b

Install Prometheus Operator and Prometheus

We install the prometheus-operator and a default prometheus to monitor the cluster. The remoteWrite URL in the command below points this cluster Prometheus at the thanos-receive service created earlier.

helm upgrade --namespace sre --debug --install cluster-monitor stable/prometheus-operator \
  --set prometheus.ingress.enabled=true \
  --set prometheus.ingress.hosts[0]="cluster.prometheus.local" \
  --set prometheus.prometheusSpec.remoteWrite[0].url="http://thanos-receive.thanos.svc.cluster.local:19291/api/v1/receive" \
  --set alertmanager.ingress.enabled=true \
  --set alertmanager.ingress.hosts[0]="cluster.alertmanager.local" \
  --set grafana.ingress.enabled=true \
  --set grafana.ingress.hosts[0]="grafana.local"

Install Prometheus and ServiceMonitor for tenant-a

In tenant-a namespace:

kubectl apply -f nginx-proxy-a.yaml
kubectl apply -f prometheus-tenant-a.yaml
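The nginx-proxy-a.yaml deploys a small proxy in front of the receiver endpoint; tenant-a’s Prometheus remote-writes to it, and the proxy injects the tenant header. The actual manifest is in the repo; conceptually, its nginx config does something like this sketch (hypothetical values), and nginx-proxy-b.yaml does the same for tenant-b with THANOS-TENANT tenant-b:

# nginx.conf (sketch): inject the tenant header into remote-write requests
server {
    listen 8080;
    location / {
        proxy_set_header THANOS-TENANT tenant-a;
        proxy_pass http://thanos-receive.thanos.svc.cluster.local:19291;
    }
}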

Install Prometheus and ServiceMonitor for tenant-b

In tenant-b namespace:

kubectl apply -f nginx-proxy-b.yaml
kubectl apply -f prometheus-tenant-b.yaml

Add some extra localhost aliases

Add the following lines to /etc/hosts:

127.0.0.1 minio.local
127.0.0.1 query.local
127.0.0.1 cluster.prometheus.local
127.0.0.1 tenant-a.prometheus.local
127.0.0.1 tenant-b.prometheus.local

The above entries allow you to locally access minio, the thanos querier, the cluster monitoring prometheus, tenant-a’s prometheus, and tenant-b’s prometheus. We are also exposing Alertmanager and Grafana, but we don’t require those in this demo.

Test the setup

Access the thanos querier at http://query.local/graph and, from the UI, execute the query count(up) by (tenant_id). We should see the following output:

Query output in the Thanos Querier UI

Alternatively, if you have `jq` installed, you can run the following command:

curl -sG http://query.local/api/v1/query --data-urlencode 'query=count(up) by (tenant_id)' | jq -r '.data.result[]|"\(.metric) \(.value[1])"'
{"tenant_id":"a"} 1
{"tenant_id":"b"} 1
{"tenant_id":"cluster"} 17

Either of the above outputs shows that the cluster, a, and b Prometheus tenants have 17, 1, and 1 scrape targets up and running, respectively. All of this data is being stored in the thanos receiver in real time by each Prometheus’ remote-write queue. This model creates an opportunity for the tenant-side Prometheus to be nearly stateless yet maintain data resiliency.

In our next post, we will improve this architecture to enforce tenant isolation on the thanos-querier.

If you encounter any issues while going through this article, you can comment here.
