How Kubernetes empowered Nubank engineers to deploy 700 times a week

Challenge

As Nubank’s customer base reached 23 million and the engineering team grew from 30 to more than 520 over the past few years, the fintech startup’s immutable infrastructure created challenges. “Our deployments were dependent on spinning a whole stack or cloning our whole infrastructure to iterate all the development,” says Renan Capaverde, Director of Engineering at Nubank. “So it was getting slower and more painful over time.”

Solution

The company had embraced Docker containers early on. “We were certain that Docker was already really beneficial to us, and we wanted to go more into containerization,” says Capaverde. Finding an orchestrator was the next step, and the team chose Kubernetes. The Nubank cloud native platform also includes Prometheus, Thanos, and Grafana for monitoring, and Fluentd for logging.

Impact

The developer experience has been greatly improved; deployment has gone from 90 minutes to 15 minutes for production environments. Today, Nubank engineers are deploying 700 times a week. Plus, the team estimates that Nubank has gained about 30% cost efficiency.

Company: Nubank
Challenges: Efficiency, Velocity
Location: Brazil
Cloud Type: Hybrid, Multi, Public
Product Type: Installer
Published: July 10, 2020
CNCF projects used: Fluentd, Kubernetes, Prometheus

Deployment time: went from 90 minutes to 15 minutes
Cost efficiency: increased 30%
Engineers are deploying 700 times a week

A fintech startup founded in 2013 in Brazil, Nubank was never weighed down by legacy infrastructure.

Early on, the company embraced Docker containers and ran almost all of its infrastructure on AWS.

Which is not to say that Nubank didn’t have its challenges as its customer base reached 23 million and the engineering team grew from 30 to more than 520 over the past few years. “We had an immutable infrastructure, and we were growing very, very fast,” says Renan Capaverde, Director of Engineering at Nubank. “Our deployments were dependent on spinning a whole stack or cloning our whole infrastructure to iterate all the development. So it was getting slower and more painful over time.”

Other pain points included load balancing for applications and difficulties with adding new security group rules in AWS.

“At the end of the day, we were certain that Docker was already really beneficial to us, and we wanted to go more into containerization,” says Capaverde. Finding an orchestrator was the next step. Nubank’s data infrastructure team had already been running a Spark cluster on top of Mesos, and the team looked at several different technologies including Docker Swarm. Ultimately, “it felt like the right choice was Kubernetes,” he says. “Kubernetes had a lot of great abstractions and a lot of support from the community and Cloud Native Computing Foundation, and all the expertise from Google deploying Borg.”

When Nubank began its migration to Kubernetes, “the first thing that we wanted was to empower developers for running their applications,” says Capaverde. “Our applications were already cloud native, so they had a really good architecture for the projects themselves. They were as scalable as we needed.”

Initially, the team considered using Minikube for the developer environment and rolling out Kubernetes for test and staging, then sharding the architecture in production. Then they thought they could accelerate the adoption of Kubernetes early on by migrating the staging environment first. But that plan, which would have involved a single Kubernetes cluster for all the shards in staging, was ultimately scrapped out of concerns about single points of failure, as well as the team’s lack of Kubernetes experience.

“What we did instead was deploy the kops solution for Kubernetes, and we did reverse engineering of the kops confirmations to see what Kubernetes was all about and the ins and outs of configurations,” says Capaverde. “We did our own automation, so we have our own distribution of Kubernetes that’s not kops, that’s not kube-aws, it’s not something else.”

Later, as “we saw the direction that the Kubernetes project was taking,” he says, the team started using kubeadm for provisioning. “We used our tools in production to create a new cluster, so we can see how everything works,” says Software Engineer Yago Nobre. The new shard took on about 20 customers for testing, and after a couple of months, “at the point that we were confident that the solution works, we started to migrate the other shards, one by one.”
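The shard-by-shard rollout described above can be sketched as a simple loop with a health gate. This is only an illustration of the general pattern, not Nubank's tooling: the function `migrate_one_by_one`, the `migrate` and `is_healthy` callbacks, and the shard names are all hypothetical.

```python
from typing import Callable, Iterable, List


def migrate_one_by_one(
    shards: Iterable[str],
    migrate: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Migrate shards sequentially, stopping at the first unhealthy one.

    Returns the shards that were migrated and passed their health check.
    """
    done: List[str] = []
    for shard in shards:
        migrate(shard)
        if not is_healthy(shard):
            # Stop here so the remaining shards stay on the old infrastructure.
            break
        done.append(shard)
    return done


if __name__ == "__main__":
    # Hypothetical shard names; pretend the third shard fails its checks.
    result = migrate_one_by_one(
        ["shard-a", "shard-b", "shard-c"],
        migrate=lambda shard: None,
        is_healthy=lambda shard: shard != "shard-c",
    )
    print(result)
```

Stopping at the first failure keeps the blast radius to a single shard, which matches the cautious, incremental approach the team describes.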

The Nubank cloud native platform also includes Prometheus, Thanos, and Grafana for monitoring, and Fluentd for logging.

To help engineers use the new platform, the team held extensive training throughout the company for the first year. They also developed a command line interface, called NuCLI, with 500+ shortcuts for automation such as getting logs from applications or hard deploys. “This abstraction was easier for people, and at the beginning of the migration, mitigated the lack of expertise in Kubernetes,” says Capaverde.
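As a rough illustration of how a shortcut layer can hide Kubernetes details from developers, the following Python sketch expands a short command into the `kubectl` invocation it stands for. It does not reflect NuCLI's actual design, which the source does not describe: the shortcut names, the `build_command` function, and the service and namespace values are hypothetical. The sketch only builds the argument list and does not execute anything.

```python
import shlex
from typing import Dict, List

# Hypothetical shortcut table: short name -> kubectl argument template.
# The kubectl subcommands (logs, rollout restart) are real; the shortcut
# names are invented for illustration.
SHORTCUTS: Dict[str, List[str]] = {
    "logs": ["kubectl", "logs", "deployment/{service}",
             "--namespace", "{namespace}", "--tail", "100"],
    "restart": ["kubectl", "rollout", "restart", "deployment/{service}",
                "--namespace", "{namespace}"],
}


def build_command(shortcut: str, service: str, namespace: str = "default") -> List[str]:
    """Expand a shortcut into the full kubectl argument list (not executed)."""
    if shortcut not in SHORTCUTS:
        raise KeyError(f"unknown shortcut: {shortcut}")
    return [part.format(service=service, namespace=namespace)
            for part in SHORTCUTS[shortcut]]


if __name__ == "__main__":
    cmd = build_command("logs", service="payments", namespace="staging")
    print(shlex.join(cmd))
```

Keeping the mapping in one table means a developer only needs to remember the shortcut, while the platform team can change the underlying command in a single place.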

Initially, Nubank, a Clojure shop, had some issues with CPU sets and the JVM, and had to be conservative in its resource allocation to avoid risk and move faster. This initially prevented the cost savings that the team expected with Kubernetes. Almost a year and a half later, with further tuning of autoscaling and budgets, the team estimates that Nubank has gained about 30% cost efficiency.

And that’s just the beginning. “We’re just finishing a new rollout of Kubernetes, moving to 1.16, and also a new kernel version that integrates better with the JVM so we can benefit from the container support tag for the JVM, and also tuning requests and limits from applications,” says Capaverde.
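To make the link between container limits and JVM sizing concrete, here is a small Python sketch. It assumes a container-aware JVM that derives its maximum heap from the container memory limit through the HotSpot option `-XX:MaxRAMPercentage`. The 2 GiB limit, the 75% figure, and the function name `max_heap_bytes` are illustrative assumptions, not Nubank's settings.

```python
def max_heap_bytes(memory_limit_bytes: int, max_ram_percentage: float) -> int:
    """Estimate the max heap a container-aware JVM would choose.

    Assumes the JVM sees the container memory limit and applies
    -XX:MaxRAMPercentage to it. Real JVMs also apply alignment and
    other ergonomics, so treat the result as an approximation.
    """
    if not 0 < max_ram_percentage <= 100:
        raise ValueError("max_ram_percentage must be in (0, 100]")
    return int(memory_limit_bytes * max_ram_percentage / 100)


GIB = 1024 ** 3

# Illustrative values only (assumptions, not taken from the source).
limit = 2 * GIB
heap = max_heap_bytes(limit, 75.0)
headroom = limit - heap  # left for metaspace, thread stacks, off-heap buffers

print(f"heap ~ {heap / GIB:.2f} GiB, headroom ~ {headroom / GIB:.2f} GiB")
```

The headroom matters because a container that exceeds its memory limit is killed, so the heap cannot safely claim the whole limit.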

There have been other benefits too. “Kubernetes has really great abstractions like readiness probe and [liveness] probe,” says Capaverde. “Instead of having to boot, we are using our blue/green strategy for deployments. We have 400+ microservices, so you can imagine deployment when we had to wait for our EC2 instance to boot and then start containers. With Kubernetes we just have to start the containers.”
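For readers unfamiliar with the probes mentioned here, the sketch below builds a container spec fragment as a Python dictionary. The field names (`livenessProbe`, `readinessProbe`, `httpGet`, `initialDelaySeconds`, `periodSeconds`) follow the Kubernetes Pod API; the endpoint paths, timings, image, and the helper name `container_with_probes` are assumptions chosen for illustration.

```python
import json


def container_with_probes(name: str, image: str, port: int) -> dict:
    """Build a container spec fragment with liveness and readiness probes.

    Paths and timings are illustrative assumptions.
    """
    return {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Liveness: the kubelet restarts the container if this check fails.
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": port},
            "initialDelaySeconds": 30,
            "periodSeconds": 10,
        },
        # Readiness: the pod receives Service traffic only while this passes.
        "readinessProbe": {
            "httpGet": {"path": "/ready", "port": port},
            "periodSeconds": 5,
        },
    }


if __name__ == "__main__":
    spec = container_with_probes("example-service", "example/image:1.0", 8080)
    print(json.dumps(spec, indent=2))
```

Because readiness gates traffic, a new set of containers can be started alongside the old ones and only begin serving once they report ready, which is what makes a blue/green switch safe.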

As a result, deployment has gone from 90 minutes to 15 minutes for production environments. And that, says Nobre, was “the main benefit because it helps the developer experience.” Today, Nubank engineers are deploying 700 times a week. “For a bank you would say that’s insane,” Capaverde says with a laugh. “But it’s not insane because with Kubernetes and canary deployments, it’s easier to roll back a change because it’s also faster to deploy. People are shipping more often and with more confidence.”
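The figures above translate into a few derived numbers, worked out in the short Python snippet below. The inputs come from the text; averaging the weekly deploy count over seven days is an assumption made only for illustration.

```python
before_minutes = 90
after_minutes = 15
deploys_per_week = 700

# How many times faster a production deployment is now.
speedup = before_minutes / after_minutes

# Share of the original deployment time that was eliminated.
reduction_pct = (before_minutes - after_minutes) / before_minutes * 100

# Average deploys per day, assuming deploys are spread over all 7 days.
deploys_per_day = deploys_per_week / 7

print(f"{speedup:.0f}x faster, {reduction_pct:.0f}% less time, "
      f"about {deploys_per_day:.0f} deploys per day")
```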

Kubernetes has also facilitated Nubank’s expansion into other countries, which began with Mexico in 2019. “If we were doing everything based on the cloud, we would have to set up new AWS accounts and figure out the limits of the new AWS accounts,” says Capaverde. “Using the Kubernetes abstractions makes it much easier to iterate over infrastructure and iterate over applications much faster, and it’s easier to develop infrastructure.”

Over the past three years, Nubank has gone from having a 5-person platform team handling Kubernetes, Kafka, monitoring, and any number of other things to a 50-person, dedicated foundation team that includes people with deep Kubernetes experience.

In the process, they’ve learned some lessons that they would share with other organizations making the move to Kubernetes. “Share experiences and involve everyone when you start the migration,” says Nobre. “In the year between the staging migration and the prod migration, we got a lot of feedback because people were using it, so we could improve our infrastructure, tooling, and everything before we went to the production environment.”

From Capaverde’s perspective, “people talk about Kubernetes like it’s a silver bullet that will solve your problems. It’s kind of the opposite of that: Kubernetes is a beast, it’s a monster. It helps a lot, but you have to understand it, dominate it, tame it.”

And it’s an ongoing process. For Nubank’s immediate future, that means making decisions about interservice communication and security and improving the efficiency of resource usage. “We are on a very good trajectory,” says Capaverde. “All the decisions that we’ve been making focus on long-term scalability and creating the right abstractions. That makes the next phase that we’re about to tackle more incremental than disruptive. So it’s easier to make continuous progress towards our goals.”