By Colin Sullivan

Developing and deploying applications that communicate in distributed systems, especially in cloud computing, is complex. Messaging has evolved to address the general needs of distributed applications but hasn’t gone far enough. We need a messaging system that takes the next steps to address cloud, edge, and IoT needs. These include ever-increasing scalability requirements of millions, if not billions, of endpoints; a new emphasis on the resiliency of the system as a whole over individual components; end-to-end security; and the ability to run a zero-trust system. In this post we’ll discuss the steps NATS is taking to address these needs, leading toward a securely connected world.

Let’s break down the challenges into scalability, resiliency at scale, and security.

Scalability

To support millions, or even billions, of endpoints spanning the globe, most architectures would take a federated approach, with many layers filtering up to a control layer and a required central authority for configuration and security. Instead, NATS is taking a distributed and decentralized approach.

We are introducing a resilient, self-healing way to connect NATS clusters together – NATS superclusters (think clusters of clusters) that optimize data flow. If one NATS server cluster can handle millions of clients, imagine hundreds of them connected to support truly global connectivity with no need for a central authority or hub – no single point of failure.

Traditional messaging systems require precise knowledge of a server cluster topology. This was acceptable in the past, when nothing required otherwise. However, it severely hinders scalability in a hybrid cloud environment, and beyond to edge and IoT. NATS addresses this with the cloud-native feature of auto-discovery. NATS servers share topology by automatically exchanging cluster information with each other. When scaling up by adding a NATS server to an existing cluster, topology information is automatically shared, and boom – each server has complete knowledge of the cluster in real time with zero configuration changes. And it gets better – clients also receive this topology information. Clients supported by the NATS maintainers use it to keep an up-to-date list of available servers, so they can connect to new servers and avoid servers that are no longer available – again with no configuration changes. View this feature in the context of scaling up (or down), rolling upgrades, or cruising past instance failures, and it is an operator’s dream.
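To make the client side of auto-discovery concrete, here is a minimal sketch in Go of how a client might keep its server list in step with what the cluster advertises. It is not the code of any NATS client: ServerPool, Update, and the nats://a:4222-style URLs are hypothetical names chosen for illustration, and the sketch simply replaces the whole list on each update, which is a simplification.

```go
package main

import (
	"fmt"
	"sort"
)

// ServerPool is a hypothetical client-side set of known server URLs.
type ServerPool struct {
	urls map[string]bool
}

// NewServerPool seeds the pool with the URLs the client was configured with.
func NewServerPool(seed ...string) *ServerPool {
	p := &ServerPool{urls: make(map[string]bool)}
	for _, u := range seed {
		p.urls[u] = true
	}
	return p
}

// Update replaces the pool with the topology a server advertised and
// reports which URLs appeared and which disappeared.
func (p *ServerPool) Update(advertised []string) (added, removed []string) {
	next := make(map[string]bool, len(advertised))
	for _, u := range advertised {
		if next[u] {
			continue // ignore duplicates in the advertisement
		}
		next[u] = true
		if !p.urls[u] {
			added = append(added, u)
		}
	}
	for u := range p.urls {
		if !next[u] {
			removed = append(removed, u)
		}
	}
	p.urls = next
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

// URLs returns the current pool in sorted order.
func (p *ServerPool) URLs() []string {
	out := make([]string, 0, len(p.urls))
	for u := range p.urls {
		out = append(out, u)
	}
	sort.Strings(out)
	return out
}

func main() {
	pool := NewServerPool("nats://a:4222", "nats://b:4222")
	// Assumed scenario: server "a" left the cluster and server "c" joined.
	added, removed := pool.Update([]string{"nats://b:4222", "nats://c:4222"})
	fmt.Println("added:", added, "removed:", removed, "pool:", pool.URLs())
}
```

The point of the sketch is that the application never edits a configuration file: the list it reconnects against is derived entirely from what the servers report.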

Resiliency at Scale

In messaging, priorities have changed with the move from traditional on-premises computing to cloud computing: the needs of the system as a whole must be prioritized over individual components in the system. Messaging systems evolved to suit predictable and well-defined systems. We knew exactly what our hardware resources were and where they were, so we could accurately identify, predict, and shore up weak points of the system. While cloud vendors have done a great job, you cannot count on the same levels of predictability unless you are willing to pay for dedicated hardware.

Today we have machine instances disappearing (often intentionally), spurious network issues, or even unpredictable availability of CPU cycles on multi-tenant instances. Add to this the decomposition of applications into microservices and we have many more moving parts in a distributed system – and many more places to fail.

Unhealthy servers or clients that can’t keep up are known to NATS as slow consumers. The NATS messaging system will not assist slow consumers. Instead, they are disconnected, protecting the system as a whole. Server clusters self-heal, and clients might be restarted or simply reconnect elsewhere. If you are using NATS Streaming, missed messages are redelivered.
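The slow consumer policy can be sketched in a few lines of Go. This is an illustration of the idea only, not how the NATS server is implemented: Client, Deliver, Publish, and the per-client pending limit are hypothetical, and a bounded channel stands in for the server's outbound buffer.

```go
package main

import "fmt"

// Client is a hypothetical subscriber with a bounded buffer of pending messages.
type Client struct {
	Name    string
	pending chan string
	closed  bool
}

// NewClient creates a client that tolerates at most maxPending undelivered messages.
func NewClient(name string, maxPending int) *Client {
	if maxPending < 0 {
		maxPending = 0
	}
	return &Client{Name: name, pending: make(chan string, maxPending)}
}

// Closed reports whether the client has been disconnected.
func (c *Client) Closed() bool { return c.closed }

// Deliver queues msg without blocking. If the buffer is full the client is
// treated as a slow consumer and disconnected; Deliver then returns false.
func (c *Client) Deliver(msg string) bool {
	if c.closed {
		return false
	}
	select {
	case c.pending <- msg:
		return true
	default:
		c.closed = true
		close(c.pending)
		return false
	}
}

// Publish fans msg out to every connected client and returns the names of
// clients that were dropped during this call.
func Publish(clients []*Client, msg string) (dropped []string) {
	for _, c := range clients {
		if c.Closed() {
			continue
		}
		if !c.Deliver(msg) {
			dropped = append(dropped, c.Name)
		}
	}
	return dropped
}

func main() {
	clients := []*Client{NewClient("fast", 8), NewClient("slow", 2)}
	for i := 1; i <= 3; i++ {
		dropped := Publish(clients, fmt.Sprintf("msg-%d", i))
		fmt.Printf("publish %d dropped: %v\n", i, dropped)
	}
}
```

The design choice the sketch highlights is that the publisher never blocks on a client that has fallen behind; the cost of slowness is borne by the slow client alone.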

This works so well that we have users who deploy on spot instances knowing the instances will terminate at some point, counting on the holistically resilient behavior of NATS to reduce their operational costs.

Security

The NATS team aims to enable the creation of a securely connected world where data transfer is protected end to end and metadata is anonymous. NATS is taking a novel approach to solving this.

Concerning identities, we’ll continue to support user/password authentication, but our preference will be a new approach using NKeys – a form of Ed25519 key made extremely simple.

NKeys are fast and resistant to side-channel attacks. Public NKeys may be registered with the NATS server. During connect, the endpoint (client application) signs a nonce with its private key and returns the signature to the server, where the identity can be verified. The NATS messaging system will never have access to or store private keys. Operators or organizations manage their own private keys. In this new model, authorization to publish and receive data is tied to an NKey identity.
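The challenge and response flow described above can be sketched with Go's standard crypto/ed25519 package. This is a sketch under stated assumptions rather than the NATS protocol itself: it uses raw Ed25519 keys instead of the NKeys encoding, and Server, Register, NewNonce, Verify, NewIdentity, and SignNonce are hypothetical names. Note that the server side only ever holds public keys.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// Server is a hypothetical verifier that stores public keys only.
type Server struct {
	registered map[string]ed25519.PublicKey
}

// NewServer returns a server with no registered identities.
func NewServer() *Server {
	return &Server{registered: make(map[string]ed25519.PublicKey)}
}

// Register associates a user name with a public key.
func (s *Server) Register(user string, pub ed25519.PublicKey) {
	s.registered[user] = pub
}

// NewNonce returns 16 random bytes for the client to sign.
func (s *Server) NewNonce() ([]byte, error) {
	nonce := make([]byte, 16)
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return nonce, nil
}

// Verify checks that sig is a valid signature over nonce by the user's key.
func (s *Server) Verify(user string, nonce, sig []byte) bool {
	pub, ok := s.registered[user]
	if !ok || len(pub) != ed25519.PublicKeySize {
		return false
	}
	return ed25519.Verify(pub, nonce, sig)
}

// NewIdentity generates a key pair; the private key stays with the client.
func NewIdentity() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// SignNonce is the client side of the handshake.
func SignNonce(priv ed25519.PrivateKey, nonce []byte) []byte {
	return ed25519.Sign(priv, nonce)
}

func main() {
	pub, priv, err := NewIdentity()
	if err != nil {
		panic(err)
	}
	srv := NewServer()
	srv.Register("alice", pub) // only the public key is shared

	nonce, err := srv.NewNonce()
	if err != nil {
		panic(err)
	}
	sig := SignNonce(priv, nonce)
	fmt.Println("alice verified:", srv.Verify("alice", nonce, sig))
}
```

Because a fresh nonce is issued for each connect, a captured signature is of no use for a later connection attempt.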

NKeys may be assigned to accounts, which are securely isolated communication contexts. Servers may be configured with multiple accounts to support true multi-tenancy, separating topology from the data flow. The decision to silo data is driven by use case rather than software limitations, and operators only need to maintain one NATS deployment. An account can span several NATS server clusters (enterprise multi-tenancy), be limited to just one cluster (data silo by cluster), or exist within a subset of a cluster or even in a single server.

Between accounts, specific data can be shared or exchanged through a service (request/reply) or a stream (publish/subscribe). Data flows only when both accounts agree to it, which allows for decentralized account management.
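One way to picture that mutual agreement is as a pair of declarations that must both be present before any data crosses an account boundary. The Go sketch below models just that rule; Account, Export, Import, CanFlow, and the subject names are hypothetical and are not NATS configuration or API.

```go
package main

import "fmt"

// Account is a hypothetical isolated communication context.
type Account struct {
	Name    string
	exports map[string]bool            // subjects this account offers to others
	imports map[string]map[string]bool // source account name -> subjects accepted
}

// NewAccount returns an account that shares nothing and accepts nothing.
func NewAccount(name string) *Account {
	return &Account{
		Name:    name,
		exports: make(map[string]bool),
		imports: make(map[string]map[string]bool),
	}
}

// Export declares that this account is willing to share subject.
func (a *Account) Export(subject string) {
	a.exports[subject] = true
}

// Import declares that this account accepts subject from the named account.
func (a *Account) Import(from, subject string) {
	if a.imports[from] == nil {
		a.imports[from] = make(map[string]bool)
	}
	a.imports[from][subject] = true
}

// CanFlow reports whether subject may flow from one account to another.
// Both sides must have agreed: the source exports and the destination imports.
func CanFlow(from, to *Account, subject string) bool {
	return from.exports[subject] && to.imports[from.Name][subject]
}

func main() {
	billing := NewAccount("billing")
	reports := NewAccount("reports")

	billing.Export("invoices.created")
	fmt.Println("export only:", CanFlow(billing, reports, "invoices.created"))

	reports.Import("billing", "invoices.created")
	fmt.Println("export and import:", CanFlow(billing, reports, "invoices.created"))
}
```

Neither account needs a central administrator to approve the exchange; each one manages only its own half of the agreement.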

Of course, NATS will continue to support TLS connections, but applications can go a step further by using NKeys to sign payloads.

Connecting it all together

These features, along with other work the NATS team is doing on scalability, resiliency at scale, and security, will allow NATS to provide secure and decentralized global connectivity. We’ll be discussing these in depth in three sessions at KubeCon + CloudNativeCon North America in December, and we’d love it if you can attend. We always enjoy a good conversation about solving hard problems like this, and we invite you to visit the NATS maintainers at the Synadia booth to learn more.