Monitoring ADCs the Cloud Native Way With Prometheus and Grafana

Posted on September 28, 2020 by Dave Blakey

Guest post by Dave Blakey, CTO of Snapt

Cloud Native computing has fundamentally shifted the paradigm for how applications are built and run. Built around concepts of ephemeral compute and immutable infrastructure based on Containers, Cloud Native computing focuses on treating each bit of functionality as a small application to be delivered as a service. In that sense, Cloud Native is the polar opposite of so-called monolithic applications that cannot be decomposed into services easily and require massive infrastructure footprints.

ADCs originally evolved from the monolithic times, when monitoring infrastructure was mostly about monitoring a set of static boxes and IP addresses. Cloud Native has turned the concept of ADCs on its head; there is no point in using ADCs for Cloud Native infrastructure unless they can also be Cloud Native – living in small containers, immutable, ephemeral and easy to spin up and scale in seconds, not hours or days. Shifting to a paradigm of Cloud Native for ADCs means shifting to other Cloud Native tools and systems. Two of the most important are Prometheus (for monitoring) and Grafana (dashboarding).

This article runs through what a Cloud Native ADC is and how to think about monitoring ADCs through a Cloud Native lens, using Prometheus and Grafana. At a high level, the basic design principles call for monitoring and observability that:

Easily maps to Cloud Native microservice architectures such as service meshes
Works with highly ephemeral infrastructure and container orchestration (e.g. Kubernetes)
Can be used as part of a unified monitoring and visualization tool kit for all levels of infrastructure (physical or virtual servers, networks, services)
Can be set up quickly and customized and shared among teams and communities

Evolving from Monolithic ADCs to Cloud Native ADCs

Application Delivery Controllers (ADCs) were built to serve the old monolithic world. In their first incarnation – and one that is still common – ADCs were large hardware systems that lived in data centers. Monitoring their performance was important but less complicated; they were either up or down, and no organization had more than a handful of them whose latency and response times needed tracking. They had static IP addresses and locations. It was more like monitoring a building than monitoring a piece of software. Standing up a new ADC might take a week.

When computing finally began to virtualize with the likes of VMware and Xen, makers of ADCs began to virtualize legacy ADC software by retrofitting it to reside inside large VMs. These VMs tended to be among the largest deployed. Stand-up time came down from a week to an hour or two, but these ADC deployments were too large and too expensive to run as more than a few instances, even for infrastructure covering thousands of servers.

Cloud Native has changed everything about monitoring ADCs. While legacy ADCs can be deployed to accelerate traffic and load balance for Cloud Native applications, this ADC form factor cannot take advantage of the best capabilities of Cloud Native – agility, observability, scalability, security and resilience.

Big Iron ADCs and Big VM ADCs are by definition not agile: they cannot be quickly and easily deployed or scaled in response to bursts of traffic. You cannot stand one up for an hour when, for example, a key network link goes down and you need to reroute global traffic. These monolithic ADCs cannot provide granular observability into microservice behaviors. They provide the same perimeter and external protections as before, but they handle the security needs of ephemeral networks and infrastructure poorly; they do not play well with cloud APIs that are tied to machine images, or with the constantly shifting locations and sizes of computing elements such as service meshes, internal load balancers, databases and application servers.

In contrast, Cloud Native ADCs are designed from the ground up to embrace Cloud Native principles. They can stand up in a matter of minutes and scale in near real-time. They are built on Containers and are immutable with a lightweight footprint that functions more like a Kubernetes sidecar than an actual piece of infrastructure. Their power is in their structure as a grouping of resilient and agile nodes, which matches the form factor of microservice-based applications. For all these reasons, a Cloud Native monitoring and observability paradigm more closely fits Cloud Native ADCs than do legacy monitoring and observability methodologies.

ADC vs Load Balancer

Many Cloud Native teams may wonder what the difference is between an ADC and a load balancer (like Amazon’s Elastic Load Balancer). A load balancer balances traffic (primarily HTTP) and requests across multiple servers. Load balancers can work at Layer 7 (application), Layer 4 (transport) or both. This can include traffic shaping, global load balancing (across numerous load balancers), and internal load balancing (internal East-West traffic, as is common in service meshes like Istio). An ADC has the basic capabilities of a load balancer but adds web acceleration capabilities and a Web Application Firewall (WAF). It often incorporates DDoS protection. ADCs generally do not work at Layer 4. ADCs also have out-of-the-box telemetry that reports on application performance and lets you quickly visualize the topology of your application with a view towards performance.

Cloud Native ADCs + Prometheus + Grafana

Prometheus is a time-series database designed to monitor Cloud Native applications. It has become the most widely used monitoring software for keeping tabs on Cloud Native applications broadly and Kubernetes infrastructure in particular. With an active developer community, Prometheus is more lightweight than earlier generations of monitoring software. It uses a “pull” model, scraping metrics over HTTP from exporters that sit alongside the infrastructure, in contrast with older-style push monitoring that relies on agents which can bog down performance. Prometheus also has a mature set of tooling and features including robust service discovery and alerting. Key features include:

a multi-dimensional data model with time series data identified by metric name and key/value pairs
PromQL, a flexible query language to leverage this dimensionality
no reliance on distributed storage; single server nodes are autonomous
time series collection happens via a pull model over HTTP
pushing time series is supported via an intermediary gateway
targets are discovered via service discovery or static configuration
multiple modes of graphing and dashboarding support
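
To make the data model and pull model more concrete, here is a minimal sketch of how a single sample (a metric name plus key/value labels) can be rendered as one line of the Prometheus text exposition format, which is what a scrape over HTTP returns. The metric name `adc_http_requests_total` and the labels `node` and `code` are hypothetical, chosen only for illustration; a real exporter would normally use an official client library rather than hand-rolled formatting.

```python
def _escape(value):
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name, labels, value):
    """Render one sample as a line of Prometheus text exposition format."""
    if labels:
        # Sort labels so the output is deterministic.
        pairs = ",".join(
            '{}="{}"'.format(key, _escape(val))
            for key, val in sorted(labels.items())
        )
        return "{}{{{}}} {}".format(name, pairs, value)
    return "{} {}".format(name, value)


# Hypothetical ADC metric: requests served by one node, split by status code.
line = format_sample(
    "adc_http_requests_total", {"node": "adc-1", "code": "200"}, 1027
)
print(line)
```

Each distinct combination of metric name and label values is its own time series, which is what lets PromQL slice the same metric by node, status code or any other label.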

Originally designed by the Cloud Native development team at SoundCloud, Prometheus is platform and scale agnostic. It works across the entire range of computing, from monitoring the CPU and disk on your local laptop to monitoring some of the largest infrastructure setups in the world. It also runs on all major public and private clouds. It can be used to monitor East-West or North-South traffic. Prometheus exporters are included in most major Cloud Native software packages (including Snapt Nova Community Edition and above).

Grafana is an open source analytics and interactive visualization software platform that has become the default choice of the Cloud Native development community. Grafana provides ready-made charts, graphs, and alerting capabilities. It has a large community that shares thousands of dashboards customized for numerous monitoring permutations. The dashboards can be imported and run instantly. Grafana has pre-baked support for hundreds of data source types including hardware, software, and networking. Prometheus does have its own basic charting capability but, for the most part, Prometheus users rely on Grafana as the visualization layer.

Anyone thinking of deploying Cloud Native ADCs would do well to start with a monitoring and observability stack based on Prometheus and Grafana; the combination of large active communities, solid documentation and ease of use is compelling.

How To Think About Monitoring and Observability of Cloud Native ADCs + Prometheus and Grafana

Assume now that you are ready to connect your Cloud Native ADC to a Prometheus instance and pipe the outputs into a Grafana dashboard. First, consider your design requirements.

East-West + North-South

With legacy ADCs, you were primarily looking at North-South traffic (from external to internal and vice versa) and not East-West traffic (back and forth across internal services). Cloud Native applications treat internal and external requests the same and blur the distinction between the two. The same set of microservices responds to both East-West and North-South requests, so any ADC monitoring scenario needs to factor in this reality.

Multi/Hybrid-Cloud

Most organizations embracing Cloud Native are operating with at least two infrastructure scenarios and sometimes more. It’s common for companies to use one or two major public clouds for different applications or pieces of applications. At the same time, organizations that are newer to Cloud Native most likely have legacy data centers and private-VM infrastructure (such as databases) that coexist with Cloud Native applications. Thus, any ADC monitoring must be able to easily pull in both sets of telemetry, normalize the signals and graph them in a single place.

Flexible Observability At Scale

Cloud Native ADC deployments are not monolithic and may include hundreds of individual ADC nodes. This means that it’s critically important for monitoring teams and SREs to be able to see both the global and local status of their applications in the same monitoring environment. That requirement elevates the importance of query simplicity, dashboard flexibility and strong automatic discovery of new services or changes to existing services running on ephemeral infrastructure with moving IP addresses.
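
As one illustration of discovery on ephemeral infrastructure, Prometheus can read scrape targets from JSON files (file-based service discovery), where each entry holds a list of `targets` and a set of `labels`. The sketch below, under the assumption that you already have an inventory of running ADC nodes, groups nodes by region into that structure. The host names, port, region names and the `role` label are hypothetical.

```python
import json


def build_file_sd(nodes):
    """Group ADC node addresses by region into file_sd-style target groups."""
    groups = {}
    for node in nodes:
        address = "{}:{}".format(node["host"], node["port"])
        groups.setdefault(node["region"], []).append(address)
    return [
        {"targets": sorted(targets), "labels": {"region": region, "role": "adc"}}
        for region, targets in sorted(groups.items())
    ]


# Hypothetical inventory, e.g. produced by an orchestrator or a cloud API.
nodes = [
    {"host": "adc-2.example.internal", "port": 9100, "region": "eu-west"},
    {"host": "adc-1.example.internal", "port": 9100, "region": "eu-west"},
    {"host": "adc-7.example.internal", "port": 9100, "region": "us-east"},
]

target_groups = build_file_sd(nodes)
print(json.dumps(target_groups, indent=2))
```

Regenerating such a file whenever nodes come and go keeps the target list current without restarting Prometheus, and the `region` label gives dashboards a ready-made way to switch between global and local views.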

Practical Considerations And Recommendations

The basic conventions of monitoring remain the same in Cloud Native. Key parameters to monitor are:

Quality of traffic (response times, timeouts, etc.)
Quantity of traffic (bytes, requests, etc.)
Failure modalities (HTTP error codes)

Quality tells you what customers (internal and external) of your services are experiencing. Quantity tells you what your infrastructure is experiencing in gross terms and helps you map to thresholds. Failure modalities help you understand what’s happening and may alert you to tail-risk problems that can be experienced by a significant percentage of users without triggering thresholds (e.g. if 0.5% of your users are experiencing severe latency, this may be within thresholds but is not acceptable).
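
The tail-risk point can be shown with a small worked example. The sketch below summarizes a batch of request records into quantity (request and byte counts), quality (mean latency and the share of slow requests) and failures (share of 5xx responses). The record fields, the 1000 ms "slow" cutoff and the sample data are assumptions made up for illustration.

```python
def summarize(requests, slow_ms=1000.0):
    """Summarize request records into quantity, quality and failure figures."""
    total = len(requests)
    if total == 0:
        return {"requests": 0, "bytes": 0, "mean_latency_ms": 0.0,
                "slow_ratio": 0.0, "error_ratio": 0.0}
    slow = sum(1 for r in requests if r["latency_ms"] >= slow_ms)
    errors = sum(1 for r in requests if r["status"] >= 500)
    return {
        "requests": total,                                   # quantity
        "bytes": sum(r["bytes"] for r in requests),          # quantity
        "mean_latency_ms": sum(r["latency_ms"] for r in requests) / total,
        "slow_ratio": slow / total,                          # quality (tail)
        "error_ratio": errors / total,                       # failures
    }


# Hypothetical data: 199 fast requests and one very slow, failed request.
sample = [{"latency_ms": 50.0, "bytes": 1200, "status": 200}] * 199
sample.append({"latency_ms": 5000.0, "bytes": 1200, "status": 504})

report = summarize(sample)
print(report)
```

Here the mean latency stays under 75 ms, which would pass most average-based thresholds, while 0.5% of requests took five seconds. Tracking the slow and error ratios separately is what surfaces that tail.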

Because of the flexibility and ease of set-up with Grafana and Prometheus, extra dashboards are effectively free: you can either set a fresh one up in 30 minutes or import one from colleagues or the Grafana community dashboard catalog. This means you probably want to design several monitoring environments focused on different aspects:

A traditional North-South view to see what your users are seeing
An East-West view to understand what is happening with your specific service traffic
A network view to understand transport issues (assuming you’re running a network load balancer or service mesh)
A view of key services that spans North-South and East-West to create a unified view
A global dashboard
A local dashboard that is drilled down to regions and other geographically bounded instantiations of your applications

Because monitoring so many moving parts is challenging for normal humans, as part of your setup you should think about the following practices:

Color coding dashboard metrics to simplify troubleshooting: With Grafana, it is easy to define thresholds on any metric and assign a color to each threshold band
Define acceptable response and capacity metrics: In Prometheus, it is simple to create a list of key thresholds for your most important metrics. Set these metrics in advance and create a document or spreadsheet listing what those metric thresholds are in order to avoid any confusion and to help in setting key resilience and scaling features like autoscaling, retries, and timeouts
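
The two practices above can share one source of truth. As a rough sketch, the threshold list you document can be kept as data and used to derive the color a dashboard panel should show. The metric names, threshold values and colors here are hypothetical, and the lookup only loosely mirrors how threshold steps behave in a dashboarding tool.

```python
# Hypothetical documented thresholds: (lower bound, color), in ascending order.
THRESHOLDS = {
    "p99_latency_ms": [(0, "green"), (250, "yellow"), (1000, "red")],
    "error_ratio": [(0.0, "green"), (0.001, "yellow"), (0.01, "red")],
}


def color_for(metric, value, thresholds=THRESHOLDS):
    """Return the color of the highest threshold step the value has reached."""
    color = None
    for lower_bound, name in thresholds[metric]:
        if value >= lower_bound:
            color = name
    return color


print(color_for("p99_latency_ms", 180))
print(color_for("error_ratio", 0.02))
```

Keeping the thresholds in one reviewed structure makes it easier to keep dashboards, alert rules and autoscaling settings consistent with each other.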

We hope that this guide helps you think through the basics of setting up a Cloud Native ADC infrastructure with proper monitoring and observability capabilities. Curiously, ADCs are something that many developers coming from Cloud Native are unfamiliar with. Conversely, architects and developers from legacy monolith worlds are not as familiar with Cloud Native design principles. With Cloud Native ADCs, Prometheus and Grafana, you can effectively come at this problem from either direction – starting as a monolithic application expert or as a Cloud Native application expert – and arrive at the same happy place.
