Case Study: Netflix

Netflix: Increasing Developer Productivity and Defeating the Thundering Herd with gRPC

Company: Netflix      Location: Los Gatos, California     Industry: Streaming media provider

Challenge

Netflix developed its own technology stack for interservice communication using HTTP/1.1, covering about 98% of the total microservices that power the Netflix product. For several years, that stack supported the stellar growth of the company’s streaming media business. But by 2015, the platform team realized that it had also “perpetuated a few architectural patterns that we were struggling with, and at scale were impacting engineering’s productivity,” says Tim Bozarth, Director of Runtime Platform Engineering. Clients for interacting with remote services were often wrapped with handwritten code, which was time-consuming and “created opportunities for issues to arise, for bugs to be introduced, and for additional complexity to breed,” he says. Plus, when a team built a service that defined an API, there wasn’t a clear way to annotate and describe exactly how that API functioned, making it challenging to discover, audit, and understand what APIs were available in the ecosystem. In search of a new solution, the team also wanted service clients to work across languages, with an emphasis on Java and Node.js.

Solution

There were efforts to build an RPC stack internally, but after a month-long evaluation of several technologies, the Runtime Platform team chose to implement and extend gRPC instead. “gRPC was comfortably at the top in terms of encapsulating all of these responsibilities together in one easy-to-consume package,” says Bozarth. gRPC is a high-performance RPC (Remote Procedure Call) framework developed by Google and optimized for the large-scale, multi-platform nature of cloud native computing environments. It connects services across languages, clouds, and data centers, and connects mobile devices to backend servers.

Impact

Developer productivity, which was the team’s biggest driver, got a big boost. For example, for each client, hundreds of lines of custom cache management code were replaced by just 2-3 lines of configuration in the proto. Creating a client, which could take up to 2-3 weeks, now takes a matter of minutes. As a result, the time to market has been reduced by orders of magnitude. Additionally, the fact that clients no longer contain handwritten code means that there are far fewer errors when interacting with remote services. Latency has also improved. “We’ve seen an incredible reduction in P99s for gRPC-oriented services,” says Bozarth. “We’ve also seen a squishing and a narrowing of our latency windows consistently across the board.”

“There are a number of people who struggled with the complexity of their clients and the challenges of operations, that they chose to rewrite their applications in gRPC because the value proposition that it made was so substantial.”

 

— Tim Bozarth, Director of Platform Engineering, Netflix

For most of the 130 million Netflix subscribers waiting for the next season of Stranger Things, remote procedure calls (RPC) probably don’t mean anything.

 

But for the Runtime Platform team responsible for improving developer productivity at the company, RPC was becoming a bottleneck to building and supporting the high-availability services that Netflix users expect.

Netflix had developed its own technology stack for interservice communication using HTTP/1.1; this “glue for all service communication” covered about 98% of the total microservices that powered the Netflix product, says Tim Bozarth, Director of Platform Engineering. For several years, that stack supported the stellar growth of the company’s streaming media business.

But by 2015, Bozarth’s team realized that it had also “perpetuated a few architectural patterns that we were struggling with, and at scale were impacting engineering’s productivity,” he says. Clients for interacting with remote services were often wrapped with handwritten code, which was time-consuming and “created opportunities for issues to arise, for bugs to be introduced, and for additional complexity to breed,” he says. Plus, when a team built a service that defined an API, there wasn’t a clear way to annotate and describe exactly how that API functioned, or to see, audit, and understand what APIs existed for both the service and the broader ecosystem.

“When we picked gRPC, we were betting that it would get the adoption and a lot of other people building useful things in open source. I think largely that bet has paid off.”

 

— William Thurston, Senior Software Engineer, Netflix

While others at the company considered building a new RPC system internally, Bozarth and his team embarked on a month-long evaluation of several technologies. In the end, they chose gRPC. “It was comfortably at the top in terms of encapsulating all of these responsibilities together in one easy-to-consume package,” says Bozarth. “The things that we cared about the most were the architectural understanding in the IDL (proto) that is packaged as a part of gRPC, and the code generation that is derived from that proto. Those were by far the most interesting for us because they addressed two really key problems that we’d been facing at scale: not having clear APIs as well as lots and lots of handwritten client code.”
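The IDL Bozarth describes is the Protocol Buffers service definition that gRPC builds on: a single proto file declares the API, and clients and servers are generated from it in each language. The sketch below is purely illustrative, with hypothetical service and message names, not Netflix’s actual protos:

```protobuf
// Hypothetical example; service and message names are illustrative,
// not taken from Netflix's codebase.
syntax = "proto3";

package demo.catalog.v1;

// One proto definition serves as both the documented API contract and
// the source for generated clients in Java, Node.js, and other languages.
service TitleCatalog {
  rpc GetTitle(GetTitleRequest) returns (GetTitleResponse);
}

message GetTitleRequest {
  string title_id = 1;
}

message GetTitleResponse {
  string title_id = 1;
  string display_name = 2;
}
```

Because the proto is the single source of truth, the two problems Bozarth names, unclear APIs and handwritten client code, are addressed at once: the definition is discoverable and auditable, and the client code is generated rather than written by hand.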

In addition to addressing those productivity-oriented problems, the team wanted a solution that wasn’t specifically coupled with Java, since engineers at Netflix had started to use other languages as well, like Node.js, Python, and Ruby—and that promise of cross-language compatibility and code generation existed with gRPC. The implementation with Java applications went smoothly, with the team spending the first eight months taking the pieces of customization that existed in the company’s own internal RPC stack and porting them over to the gRPC environment.

Making gRPC work with other languages took a greater effort. “If you have a Java server and a Node.js client, the cross-language generation and communication work very well from a protocol standpoint,” says Bozarth. “What’s different are the mechanics that exist in the other languages for customization, in terms of actual feature completeness and the idioms. So with Node.js, we had to do a lot of enhancement and a fair amount of wrapping. It took us almost a year to get the interceptor mechanics merged, but we were able to contribute back the entire interceptor layer to JavaScript. That’s a huge win.” (There’s now substantial traffic between Node.js and Java being done over gRPC at Netflix.)
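An interceptor, the mechanism Bozarth's team contributed to the JavaScript implementation, is a wrapper that sees every outgoing call and can add behavior (logging, auth metadata, retries) without touching generated client code. The sketch below shows the general pattern only; it is not the real @grpc/grpc-js interceptor API, and all names are illustrative:

```javascript
// Simplified sketch of the interceptor pattern: each interceptor takes
// the next handler in the chain and returns a wrapped handler. This is
// NOT the @grpc/grpc-js API; it only illustrates the idea.

// Records when a call enters and leaves this layer.
function loggingInterceptor(next) {
  return (request) => {
    const log = [`-> ${request.method}`];
    const response = next(request);
    log.push(`<- ${request.method}`);
    response.log = log; // attach the trace for demonstration
    return response;
  };
}

// Adds metadata before the call proceeds down the chain.
function authInterceptor(next) {
  return (request) => {
    const withAuth = { ...request, metadata: { authorization: "token" } };
    return next(withAuth);
  };
}

// Compose interceptors around a terminal invoke function, outermost first.
function buildChain(interceptors, invoke) {
  return interceptors.reduceRight((next, i) => i(next), invoke);
}

// Terminal call: in real gRPC this would put the request on the wire.
const invoke = (request) => ({
  method: request.method,
  ok: true,
  metadata: request.metadata,
});

const call = buildChain([loggingInterceptor, authInterceptor], invoke);
const res = call({ method: "GetTitle" });
```

The payoff of this layering is that cross-cutting customizations, the kind Netflix had previously hand-wrapped around each client, live in one reusable chain instead of being rewritten per service.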

“We’ve been able to effectively defeat the thundering herd problem by changing how the server limits concurrency adaptively over time leveraging the gRPC mechanics. gRPC made it architecturally simple, and we were able to approach this effort in a way that we couldn’t have before.”

 

— Tim Bozarth, Director of Platform Engineering, Netflix

With gRPC in place, developer productivity, which was always the team’s biggest driver, got a big boost: For each client, for example, hundreds of lines of custom cache management code were eliminated. “We’ve turned a very tedious, error-prone process into what amounts to maybe two or three lines of annotation, extra definition in your proto file, and we just generate those interactions for you,” says Senior Software Engineer William Thurston. Creating a client went from 2-3 weeks to minutes. “You can get started with a running application in a matter of minutes and then get that application working in a matter of hours,” says Bozarth. Time to market, which was typically three weeks before gRPC, has been reduced by orders of magnitude.

Plus, the fact that clients no longer contain handwritten code means that a common source of application errors has been eliminated. “It’s virtually bug-free code because it’s heavily vetted and generated, and that increases productivity and lowers the operational burden,” says Thurston.

Latency has also been impacted. “We’ve seen an incredible reduction in P99s for gRPC-oriented services,” says Bozarth. “We’ve also seen a squishing and a narrowing of our latency windows consistently across the board.”

“We believe that gRPC is a really powerful and important foundation for us as we move forward.”

 

— Tim Bozarth, Director of Platform Engineering, Netflix

Today, a huge part of the internal service-to-service communication at Netflix runs on gRPC. “The adoption has been a success and continues to move forward, especially in the Java space,” says Bozarth. All new Java development starts with a gRPC-enabled application. And while there’s no push to rewrite existing applications, he says, “there are a number of people who struggled with the complexity of their clients and the challenges of operations, that they chose to rewrite their applications in gRPC because the value proposition that it made was so substantial.”

The team also found gRPC to be invaluable for a project involving adaptive concurrency limits—a critical issue for any business that requires the highest service availability—which they’ve open sourced. “We’ve been able to effectively defeat the thundering herd problem by changing how the server does concurrency limits adaptively over time using the gRPC mechanics,” says Bozarth. “gRPC made it architecturally simple. We were able to approach this effort in a way that we couldn’t have done before.”
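The idea behind adaptive concurrency limits is that the server adjusts how many requests it accepts in flight based on observed outcomes, shedding excess load immediately instead of letting a herd of retries queue up. Below is a minimal AIMD-style (additive increase, multiplicative decrease) sketch; Netflix’s open-sourced library implements this family of algorithms in Java along with more sophisticated gradient-based limits, so this is an illustration of the concept, not their implementation:

```javascript
// Minimal sketch of an adaptive concurrency limit in the AIMD style:
// healthy responses grow the limit by one, overload signals shrink it
// multiplicatively. Illustrative only; not Netflix's implementation.
class AimdLimit {
  constructor({ initial = 10, min = 1, max = 100, backoffRatio = 0.9 } = {}) {
    this.limit = initial;
    this.min = min;
    this.max = max;
    this.backoffRatio = backoffRatio;
    this.inflight = 0;
  }

  // Returns true if the request may proceed. Callers that get false
  // shed load immediately rather than queueing, which is what blunts
  // a thundering herd.
  tryAcquire() {
    if (this.inflight >= this.limit) return false;
    this.inflight += 1;
    return true;
  }

  // Report the outcome: ok grows the limit additively; a timeout or
  // overload signal shrinks it multiplicatively.
  release(ok) {
    this.inflight -= 1;
    this.limit = ok
      ? Math.min(this.max, this.limit + 1)
      : Math.max(this.min, Math.floor(this.limit * this.backoffRatio));
  }
}

// Usage: at the limit, further requests are rejected instead of queued,
// and an overload signal tightens the limit for subsequent traffic.
const limiter = new AimdLimit({ initial: 2, max: 10 });
limiter.tryAcquire();               // true: 1 in flight
limiter.tryAcquire();               // true: 2 in flight, at the limit
const third = limiter.tryAcquire(); // false: shed load
limiter.release(false);             // overload: limit drops toward min
```

Hooking a limiter like this into the server’s request path is what the quote refers to as “leveraging the gRPC mechanics”: because every call flows through a common layer, the limit can be enforced in one place for all services.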

As early adopters of gRPC, Bozarth and Thurston say they’ve benefited from the community as much as they’ve given back. “One of the reasons we picked gRPC is we were making a bet that it would get the adoption and there would be a lot of other people building useful things in open source, and I think largely that has paid off,” says Thurston.

For Netflix, this is the exact place they want to be. “As the industry’s changing and as new, powerful technologies are emerging, we’re fairly early in the adoption curve,” says Bozarth. “But if you’re trying to build a large distributed system, RPC is critical to its long-term success. We believe that gRPC is a really powerful and important foundation for us as we move forward.”