Guest post by Eli Goldberg and Omri Zamir, Platform Team at Salt Security

At Salt Security, we pioneered API security. Purpose-built to protect APIs across their entire life cycle, the Salt platform enables our customers to prevent API attacks. If our platform is down, the attacks our customers continually experience won’t be blocked. Consequently, downtime is not an option. 

We built the platform as microservices running on Kubernetes from the get-go. As we grew, we started running into backward-compatibility issues. To address them, we decided to move to gRPC, but that introduced another challenge: load balancing, which Kubernetes doesn't handle well for gRPC's long-lived connections. That's how we stumbled across Linkerd. Not only did it help with gRPC load balancing, it also improved the efficiency, reliability, and overall performance of our platform.

The platform, infrastructure, and team

The Salt platform continuously monitors and learns from customer traffic, allowing it to identify anomalies and sophisticated API attacks within seconds. Our customers range from mid-size companies to Fortune 500 enterprises and include household names such as The Home Depot, AON, Telefonica Brasil, Equinix, City National Bank, KHealth, and many more.

At any given time, countless attacks are carried out across the internet. As gateways into a company’s most valuable data and services, APIs are an attractive target for malicious hackers.

At Salt Security, we host the metadata of our customers' APIs and run AI and ML against that data to find and stop attacks. It's imperative that we minimize downtime while ingesting all that customer traffic. Consider a customer whose traffic averages around 20,000 requests per second (RPS). Downtime of one second represents 20,000 opportunities for malicious attacks, a significant risk! A lapse in protection may last only a few seconds, but its consequences can be devastating: those seconds are a perfect window of opportunity for a denial-of-service attack or PII exposure.

Avoiding downtime is a core part of our job. We are part of the platform team, along with two other engineers. To ensure we have no service disruption, we built our platform as ~40 microservices running in multiple Kubernetes clusters that span both Azure and AWS regions: if one provider experiences a substantial outage, our platform will immediately trigger a failover to keep the Salt service available to our customers.

gRPC: a single point of truth for messaging 

In 2020, Salt started to grow quickly, and the platform needed to scale, which led to a few challenges we had to tackle fast. First and foremost, as the architecture and our services began to evolve quickly, so did the messages exchanged between them.

To enable our teams to move fast and with confidence, we needed a way to ensure that no change to an API call could break the services that depend on it.

We found gRPC to be the perfect candidate.

Why gRPC, or in our case Protobuf specifically? Think about how data is serialized with Protobuf (Protocol Buffers) as opposed to JSON or other serialization frameworks like Kryo. When a field is removed from a message (or an object), the services that produce and consume that message may roll out the change at different times.

Let’s say you have two services: service A and service B. Service A sends service B a message with three fields. Removing one of those fields from the message, or changing a field's name, could leave one of the services unable to deserialize, or open, the message. Protobuf serialization addresses that problem by identifying every field by number and letting you reserve the numbers of deleted fields so they can never be reused, or misread, by accident. That, combined with tools such as Buf, allowed us to build a single repository in which we declare all company-internal APIs and catch breaking changes at compile time in a CI pipeline.
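
To make that concrete, here is a minimal sketch of what this looks like in a .proto file; the package, message, and field names are purely illustrative, not our actual schema:

```protobuf
syntax = "proto3";

package salt.example.v1;  // hypothetical package, for illustration only

message TrafficEvent {
  // The wire format identifies fields by number, not by name, so a reader
  // built against an older version of the schema simply skips numbers it
  // doesn't recognize.
  string api_endpoint = 1;
  int64  received_at  = 2;

  // Field 3 used to be `string client_ip`. Reserving its number and name
  // makes the compiler (and Buf's breaking-change checks) reject any
  // attempt to reuse them later.
  reserved 3;
  reserved "client_ip";
}
```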

Moving to gRPC standardized our internal API and created a single point of truth for messages passed between our services. In short, the main driver to adopt gRPC was ensuring backward compatibility. 

That being said, gRPC has many other benefits, including strongly typed APIs, excellent performance, and support for more than ten languages. But, as we learned, it also came with a challenge of its own: load balancing. gRPC uses long-lived HTTP/2 connections under the hood, so Kubernetes' native connection-level (TCP) load balancing pins all of a client's requests to a single pod. Since all our microservices are replicated for load balancing and high availability, having no way to distribute cross-service traffic between replicas broke a mission-critical requirement.
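
To see why, here is a minimal Go sketch of a plain gRPC client (the in-cluster address is hypothetical); this isn't our code, just the default behavior that defeats connection-level balancing:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// A plain gRPC client opens one long-lived HTTP/2 connection to the
	// Service's ClusterIP. Every RPC is multiplexed over that single
	// connection, so kube-proxy's per-connection balancing pins all of
	// this client's traffic to whichever replica accepted the connection.
	conn, err := grpc.Dial(
		"analysis-svc.platform.svc.cluster.local:50051", // hypothetical address
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
	// Generated stubs would issue all of their RPCs over conn here.
}
```

What's needed instead is request-level (L7) balancing, which is exactly what a service mesh proxy provides.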

Linkerd: the answer to gRPC’s load balancing

To solve the gRPC load balancing issue, we researched a number of solutions and ended up investigating Envoy, Istio, and Linkerd in more depth. While we were familiar with the term “service mesh,” we had no significant production experience with any of these solutions. So we set out to assess each tool and determine which was the best fit for our use case.

Since Salt is such a fast-growing company, one of the key things we look at when evaluating any technology is the level of effort it takes to maintain it. We have to be smart about how we allocate our resources. 

We started with Linkerd, and it turned out to be so easy to get started that we gave up on evaluating the other two; we loved it from the get-go. Within hours of stumbling upon it online, we had deployed it in our dev environment. Three more days and it was up and running with our new service in production. Next, we started migrating our services to gRPC and adding them to the service mesh.
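
For a sense of how little work that involved, the steps below roughly mirror Linkerd's getting-started flow at the time; the namespace and deployment names are illustrative, and newer Linkerd releases split the install into a couple more steps:

```bash
linkerd check --pre                     # verify the cluster is ready for the install
linkerd install | kubectl apply -f -    # install the control plane
linkerd check                           # wait until everything reports healthy

# Mesh an existing workload by injecting the sidecar proxy into its manifest:
kubectl get deploy analysis-svc -n platform -o yaml \
  | linkerd inject - \
  | kubectl apply -f -
```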

A new world of visibility, reliability, and security 

Once our services were meshed, we realized we had something far more powerful than a load balancing solution for gRPC. Linkerd opened up a new world of visibility, reliability, and security — a huge a-ha moment. 

Service-to-service communication was now fully mTLSed, preventing potential bad actors from “sniffing” our traffic in case of an internal breach. With Linkerd’s latest gRPC retry feature, momentary network errors now look like small delays instead of hard failures that used to trigger an investigation, only to find out it was nothing.
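
As a rough sketch of how that retry behavior is configured, Linkerd enables retries per route through a ServiceProfile; the service and route names below are hypothetical:

```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: analysis-svc.platform.svc.cluster.local
  namespace: platform
spec:
  routes:
  - name: POST /salt.example.v1.AnalysisService/Enrich
    condition:
      method: POST
      pathRegex: /salt\.example\.v1\.AnalysisService/Enrich
    isRetryable: true       # transient failures on this route are retried
  retryBudget:              # caps retries so they can't amplify an outage
    retryRatio: 0.2
    minRetriesPerSecond: 10
    ttl: 10s
```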

We’ve also come to love the Linkerd dashboard! We can now see our internal live traffic and how services communicate with each other on a beautiful map. Thanks to a table displaying all request latencies, we can spot even the slightest regression in backend performance. And Linkerd’s tap feature showed us live, per-endpoint aggregations of requests, which flagged excessive calls between services that we weren’t aware of.
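
In recent Linkerd releases, the dashboard, metrics, and tap all live in the viz extension. A few of the commands we reach for (deployment and namespace names are illustrative):

```bash
linkerd viz dashboard                             # open the web dashboard and live service map
linkerd viz stat deploy -n platform               # success rates, RPS, and latency percentiles
linkerd viz tap deploy/analysis-svc -n platform   # live stream of individual requests
```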

Linkerd’s CLI is very easy to use. The check command provides instant feedback on what is wrong with a deployment, and that’s generally all you need to know. For example, when resource pressure causes one of the control plane’s pods to be evicted, linkerd check will tell you those pods are missing. That’s pretty neat!
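
The two invocations we run most often are below; check validates the control plane, and the --proxy flag extends the same checks to the injected data-plane proxies:

```bash
linkerd check           # validate the control plane installation
linkerd check --proxy   # also validate the data-plane proxies injected into workloads
```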

We also quickly realized that Linkerd isn’t just a tool for production: it offers the same kind of monitoring and visibility capabilities as dedicated logging, metrics, or tracing platforms. If it can show us what’s wrong in production, why not use it before we get to production? Today, Linkerd has become a key part of our development stack.

It’s so simple to operate that we don’t need anyone to be “in charge” of maintaining Linkerd — it literally “just works!”

Those are a lot of extra benefits we didn’t consider critical during our evaluation, yet, after experiencing them, we are totally sold.

The Linkerd community

As you can tell by now, we really like the Linkerd technology. But the community also makes a huge difference. The documentation is excellent and covers every use case we’ve had so far. It also includes those we plan to implement soon, such as traffic splits, canary deployments, and more. If you don’t find an answer in the docs, Slack is a great resource. The community is very helpful and super responsive. You can usually receive an answer within a few hours. People share their thoughts and solutions all the time, which is really nice. 

Increased efficiency, reliability, and performance within one week!

After only one week of work, we were already seeing tangible results. Using Linkerd, we improved the performance, efficiency, and reliability of the Salt Security platform. We can now observe and monitor service- and RPC-specific metrics and take action in real time. Finally, all service-to-service communication is now encrypted, providing a much higher level of security within our clusters.

In our business, scalability and performance are key. We’re growing rapidly, and so are the demands on our platform. More customers means more ingested real-time traffic. Recently we were able to increase traffic almost 10x without any issues. That elasticity is critical to ensure flawless service for our customers. 

Linkerd also helped us identify excessive calls that had slipped under our radar, leading to wasteful resource consumption. Because we use Linkerd in development, we now discover these excessive calls before they go to production, thus avoiding them altogether. 

Finally, with no one really required to maintain Linkerd, we can devote those precious development hours to things like canary deployments, traffic splits, and chaos engineering, which makes our platform even more robust. With Linkerd and gRPC in place, we are ready to tackle those things next! 

The cloud native effect 

It should be clear by now that avoiding downtime is mission-critical to Salt; there’s little room for error, especially for our team. While a few years ago that may have sounded like an almost insurmountable task, today, cloud-native technologies make it possible. Built entirely on Kubernetes, our microservices communicate via gRPC while Linkerd ensures messages are encrypted. The service mesh also provides deep, real-time insight into what’s going on at the traffic layer, allowing us to stay a step ahead of potential issues.

Although we set out to solve the load balancing issue with Linkerd, it ended up improving the overall performance of the platform. For us, that means fewer fire drills and more quality time with family and friends. It also means we now get to work on more advanced projects and deliver new features faster — our customers certainly appreciate that.