CASE STUDY

Square: How Vitess Enables ‘Near Unlimited Scale’ for Cash App

Challenge

Four years ago, Square branched out into peer-to-peer transactions via its Cash App. After some steady growth, the app rocketed in popularity in 2016, reaching millions of users over just a few months. But its large monolith of a few hundred thousand lines of code was built on the assumption of one single MySQL database and wasn’t designed to scale.

Solution

To solve the long-term scalability problem, they turned to Vitess, the open source database clustering system for horizontal scaling of MySQL. “The nice thing is if we solved it once, then we have a piece of infrastructure that would be the cookie-cutter way we would scale out other systems that are built the same way,” says Engineering Manager Jon Tirsen.

Impact

The first shard split with Vitess took place in November 2017, and “there was less than a second of downtime, so it was not user visible,” says Tirsen. Nowadays, “it’s completely routine. We did ten shard splits last week.” Looking ahead at Cash App’s continued growth, Tirsen says, “You have to keep on working on it, but Vitess does provide you essentially with near unlimited scale.”

INDUSTRY

Financial Services

LOCATION

United States

CLOUD TYPE

Private

CHALLENGES

Scaling

PRODUCT TYPE

Installer

CNCF Projects Used

Vitess

Only 5% of the system had to be changed

10 shard splits a week

Shard splits with less than a second of downtime

For millions of people, from taxi drivers to market vendors to big businesses, Square has made getting paid by credit card much easier since it launched its card reader and mobile app in 2010.

Four years ago, the company branched out into peer-to-peer transactions via its Cash App. After some steady growth, the app rocketed in popularity in 2016, reaching millions of users over just a few months and landing at the top of the App Store’s most popular downloads.

The only problem? “We had a large monolith of a few hundred thousand lines of code that was built on the assumption of one single MySQL database; it was never really designed to scale from the start,” says Engineering Manager Jon Tirsen. With users increasing by the minute, the company had to dedicate more and more expensive hardware for its database. But that clearly wasn’t a long-term solution, so Tirsen’s team of three was tapped to solve the scalability problem for Cash App. “Because we had the growth trajectory, we really needed to solve it very, very quickly to step up to the challenge of the product side,” he says.

The team’s first pass was to move data into a key-value store that they had built on top of MySQL. “It’s an inherently scalable storage platform, but less feature rich,” Tirsen says. Plus, this solution would have required rewriting hundreds of thousands of lines of code and thousands of different queries, and “there was just no way we would have had the time to do it,” he says.

So the team proposed building MySQL sharding themselves, inside the application. At the time, addressing the problem in the application rather than in the infrastructure layer was the default way to solve scale at Square, as well as at many other companies like Facebook or Pinterest. The team had begun drawing up plans when someone else at Square suggested they look at Vitess, the open source database clustering system for horizontal scaling of MySQL.

The project was still relatively new, but after Tirsen started digging into the code, he realized it would work for Cash App. Vitess met two of the team’s key requirements: query routing at the infrastructure or platform layer (sending each query to the right database) and online shard splitting without downtime.
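To make those two requirements concrete, here is a minimal Python sketch of range-based query routing and a shard split. It is an illustration only, not Vitess code or its API: the `Router` class, the 16-bit key space, the `customer_id` sharding key, and the shard names are all assumptions made for this example.

```python
import hashlib
from bisect import bisect_right

KEY_SPACE = 1 << 16  # hypothetical 16-bit key space, for illustration only


def key_for(customer_id: int) -> int:
    """Hash a sharding key (a hypothetical customer id) into the key space."""
    digest = hashlib.sha256(str(customer_id).encode()).digest()
    return int.from_bytes(digest[:2], "big")


class Router:
    """Maps key ranges to shards; shard i owns [starts[i], starts[i + 1])."""

    def __init__(self):
        # Start with a single shard that owns the whole key space.
        self.starts = [0]
        self.shards = ["shard-0"]

    def route(self, customer_id: int) -> str:
        """Return the shard that owns this customer's hashed key."""
        idx = bisect_right(self.starts, key_for(customer_id)) - 1
        return self.shards[idx]

    def split(self, shard: str) -> None:
        """Split one shard's key range at its midpoint into two shards."""
        i = self.shards.index(shard)
        lo = self.starts[i]
        hi = self.starts[i + 1] if i + 1 < len(self.starts) else KEY_SPACE
        mid = (lo + hi) // 2
        self.starts.insert(i + 1, mid)
        self.shards[i:i + 1] = [f"{shard}a", f"{shard}b"]
```

In this sketch, callers only ever use `route()`. A split changes the routing table, not the calling code, which is the property that lets the application stay largely unchanged while the data is redistributed underneath it.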

Additionally, “the nice thing is if we solved it once, then we have a piece of infrastructure that would be the cookie-cutter way we would scale out other systems that are built the same way,” says Tirsen. “So that was kind of exciting, and it would mean that we didn’t need to completely rebuild our system. We could just make Vitess work, and the system would work. Maybe we had to change 5% of the system rather than 95% of the system.”

“You have to keep on working on it, but Vitess does provide you essentially with near unlimited scale.”

— Jon Tirsen, Engineering Manager at Square

Tirsen’s team spent about a year making those changes, as well as adapting Vitess to work inside the overall Square infrastructure. To enable an external database to work with Vitess, the team rebuilt a lot of the shard-splitting workflows.

“The biggest thing we did was we changed the way sharding works,” says Tirsen. “Before, Vitess did shard splits by stopping replication, but because we can’t control our external databases, we changed that to instead use MySQL’s built-in support for consistent snapshots, where you can view the database at a fixed point in time, even if the database is still getting updated. Then you can make a copy of the database based on that consistent snapshot.”
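The consistent-snapshot idea can be sketched as the sequence of statements a copy job might issue on a single MySQL connection. `START TRANSACTION WITH CONSISTENT SNAPSHOT` is a real MySQL (InnoDB) statement, but everything else here is an assumption for illustration: the helper names, the table names, and the simple `SELECT *` copy. It is not the workflow Square upstreamed, which would also have to handle details such as recording a replication position.

```python
from typing import Callable, Iterable, List


def snapshot_copy_statements(tables: Iterable[str]) -> List[str]:
    """Build the SQL for reading several tables at one fixed point in time."""
    stmts = [
        # Consistent snapshots require REPEATABLE READ isolation in InnoDB.
        "SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ",
        # Every read in this transaction now sees the same point in time,
        # even while other connections keep writing.
        "START TRANSACTION WITH CONSISTENT SNAPSHOT",
    ]
    # Hypothetical copy step: read each table in full from the snapshot.
    stmts += [f"SELECT * FROM `{table}`" for table in tables]
    stmts.append("COMMIT")
    return stmts


def run_copy(execute: Callable[[str], object], tables: Iterable[str]) -> None:
    """Send the statements through any execute function, e.g. a DB cursor's."""
    for stmt in snapshot_copy_statements(tables):
        execute(stmt)
```

Because all reads happen inside one snapshot transaction, the copy does not need replication to be stopped on the source, which matters when the source is a database the team cannot control.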

That work has been upstreamed, and is now commonly used throughout the Vitess community. “It’s an amazing bonus that we can work together on solving problems across companies that don’t normally have a formal way of working together,” Tirsen says.

Next, the team practiced the shard splits in staging and testing environments—a lot. “You can practice the entire shard split except the last query reroute, where you basically start writing to the new shard, without affecting any production traffic at all,” says Tirsen.

“While we were doing these very dramatic changes to our architecture to make it scalable, the feature teams were building incredible features on top of Cash App.”

— Jon Tirsen, Engineering Manager at Square

The first shard split with Vitess took place at 5 a.m. ET in November 2017, and “there was less than a second of downtime, so it was not user visible,” says Tirsen. “It was incredibly exciting.” Nowadays, he adds, “it’s completely routine. We did ten shard splits last week.”

Overall, Tirsen says he is proudest of this fact: “We didn’t have to completely change how developers built applications, so while we were doing these very dramatic changes to our architecture to make it scalable, the feature teams were building incredible features on top of Cash App, on top of our platform. So it was like changing the engines of the airplane while it was still in the air flying.” In fact, during that year of work, marquee features like the Cash card were launched.

Looking ahead at Cash App’s continued growth, Tirsen recognizes that solving the problem of scalability will always be a work in progress for his team. They’re currently building a developer platform based on Kubernetes, Prometheus, Envoy, Jaeger, and other CNCF technologies, which Tirsen envisions supporting potentially thousands of developers. “Vitess is going to become part of this essential underlying infrastructure to make sure that everything that we build becomes scalable by default,” he says. “You have to keep on working on it, but Vitess does provide you essentially with near unlimited scale.”