BlackRock: Rolling Out Kubernetes in Production in 100 Days
BlackRock operates a very controlled static deployment scheme, which has allowed for scalability over the years. But in their data science division, there was a need for more dynamic access to resources. Says Michael Francis, a Managing Director in BlackRock’s Product Group: “Managing complex Python installations on users’ desktops is really hard because everyone ends up with slightly different environments. We have existing environments that do these things, but we needed to make it real, expansive and scalable. Being able to spin that up on demand, tear it down, make that much more dynamic, became a critical thought process for us. It’s not so much that we had to solve our main core production problem, it’s how do we extend that? How do we evolve?”
“Our goal was: How do you give people tools rapidly without having to install them on their desktop?” says Francis. And the team hit the goal within 100 days. Francis is pleased with the results and says, “We’re going to use this infrastructure for lots of other application workloads as time goes on. It’s not just data science; it’s this style of application that needs the dynamism… What’s interesting is that just having this technology there is changing the way our developers are starting to think about their future development.”
CHALLENGES: Availability, Efficiency, Scaling, Velocity
One of the management objectives for BlackRock’s Product Group employees in 2017 was to “build cool stuff.”
Led by Managing Director Michael Francis, a cross-sectional group of 20 did just that: They rolled out a full production Kubernetes environment and released a new investor research web app on it. In 100 days.
For a company that’s the world’s largest asset manager, “just equipment procurement can take 100 days sometimes, let alone from inception to delivery,” says Karl Wieman, a Senior System Administrator. “It was an aggressive schedule. But it moved the dial.” In fact, the project achieved two goals: It solved a business problem (creating the needed web app) and provided real-world, in-production experience with Kubernetes, a cloud-native technology that the company was eager to explore. “It’s not so much that we had to solve our main core production problem, it’s how do we extend that? How do we evolve?” says Francis. The ultimate success of this project, beyond delivering the app, lies in the fact that “we’ve managed to integrate a radically new thought process into a controlled infrastructure that we didn’t want to change.”
After all, in its three decades of existence, BlackRock has “a very well-established environment for managing our compute resources,” says Francis. “We manage large cluster processes on machines, so we do a lot of orchestration and management for our main production processes in a way that’s very cloudish in concept. We’re able to manage them in a very controlled, static deployment scheme, and that has given us a huge amount of scalability.”
Though that works well for the core production, the company has found that some data science workloads require more dynamic access to resources. “It’s a very bursty process,” says Francis, who is head of data for the company’s Aladdin investment management platform division.
Aladdin, which connects the people, information and technology needed for money management in real time, is used internally and is also sold as a platform to other asset managers and insurance companies. “We want to be able to give every investor access to data science, meaning Python notebooks, or even something much more advanced, like a MapReduce engine based on Spark,” says Francis. But “managing complex Python installations on users’ desktops is really hard because everyone ends up with slightly different environments. Docker allows us to flatten that environment.”
Still, challenges remain. “If you have a shared cluster, you get this storming herd problem where everyone wants to do the same thing at the same time,” says Francis. “You could put limits on it, but you’d have to build an infrastructure to define limits for our processes, and the Python notebooks weren’t really designed for that. We have existing environments that do these things, but we needed to make it real, expansive, and scalable. Being able to spin that up on demand, tear it down, and make that much more dynamic, became a critical thought process for us.”
Made up of managers from technology, infrastructure, production operations, development and information security, Francis’s team was able to look at the problem holistically and come up with a solution that made sense for BlackRock. “Our initial straw man was that we were going to build everything using Ansible and run it all using some completely different distributed environment,” says Francis. “That would have been absolutely the wrong thing to do. Had we gone off on our own as the dev team and developed this solution, it would have been a very different product. And it would have been very expensive. We would not have gone down the route of running under our existing orchestration system. Because we don’t understand it. These guys [in operations and infrastructure] understand it. Having the multidisciplinary team allowed us to get to the right solutions and that actually meant we didn’t build anywhere near the amount we thought we were going to end up building.”
In search of a solution in which they could manage usage on a user-by-user level, Francis’s team gravitated to Red Hat’s OpenShift Kubernetes offering. The company had already experimented with other cloud-native environments, but the team liked that Kubernetes was open source, and “we felt the winds were blowing in the direction of Kubernetes long term,” says Francis. “Typically we make technology choices that we believe are going to be here in 5-10 years’ time, in some form. And right now, in this space, Kubernetes feels like the one that’s going to be there.” Adds Uri Morris, Vice President of Production Operations: “When you see that the non-Google committers to Kubernetes overtook the Google committers, that’s an indicator of the momentum.”
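As an illustration of what managing usage on a user-by-user level can look like in Kubernetes, the sketch below builds a ResourceQuota manifest for a per-user namespace. The source does not describe how BlackRock configured this: the one-namespace-per-user layout, the `notebooks-<user>` naming scheme, the function name, and the specific limits are all hypothetical.

```python
import json


def user_quota_manifest(user, cpu="2", memory="4Gi", pods="5"):
    """Build a Kubernetes ResourceQuota for a hypothetical per-user namespace.

    The namespace naming scheme ("notebooks-<user>") and the default limits
    are illustrative assumptions, not values from the case study.
    """
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "user-quota", "namespace": f"notebooks-{user}"},
        "spec": {
            # Caps on the total resources all pods in the namespace may claim.
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "limits.cpu": cpu,
                "limits.memory": memory,
                "pods": pods,
            }
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so this output could be piped
    # to `kubectl apply -f -`.
    print(json.dumps(user_quota_manifest("alice"), indent=2))
```

With a quota like this in place, one user's burst of notebook pods is bounded by that user's namespace rather than competing unchecked for the whole shared cluster.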
Once that decision was made, the major challenge was figuring out how to make Kubernetes work within BlackRock’s existing framework. “It’s about understanding how we can operate, manage and support a platform like this, in addition to tacking it onto our existing technology platform,” says Project Manager Michael Maskallis. “All the controls we have in place, the change management process, the software development lifecycle, onboarding processes we go through—how can we do all these things?”
The first (anticipated) speed bump was working around issues behind BlackRock’s corporate firewalls. “One of our challenges is there are no firewalls in most open source software,” says Francis. “So almost all install scripts fail in some bizarre way, and pulling down packages doesn’t necessarily work.” The team ran into these types of problems using Minikube and did a few small pushes back to the open source project.
There were also questions about service discovery. “You can think of Aladdin as a cloud of services with APIs between them that allows us to build applications rapidly,” says Francis. “It’s all on a proprietary message bus, which gives us all sorts of advantages but at the same time, how does that play in a third party [platform]?”
Another issue they had to navigate was that in BlackRock’s existing system, the messaging protocol has different instances in the different development, test and production environments. While Kubernetes enables a more DevOps-style model, that model didn’t make sense for BlackRock. “I think what we are very proud of is that the ability for us to push into production is still incredibly rapid in this [new] infrastructure, but we have the control points in place, and we didn’t have to disrupt everything,” says Francis. “A lot of the cost of this development was thinking how best to leverage our internal tools. So it was less costly than we actually thought it was going to be.”
The project leveraged tools associated with the messaging bus, for example. “The way that the Kubernetes cluster will talk to our internal messaging platform is through a gateway program, and this gateway program already has built-in checks and throttles,” says Morris. “We can use them to control and potentially throttle the requests coming in from Kubernetes’s very elastic infrastructure to the production infrastructure. We’ll continue to go in that direction. It enables us to scale as we need to from the operational perspective.”
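The source says only that the gateway program has built-in checks and throttles; it does not describe the mechanism. As a rough illustration of how a gateway can cap requests from an elastic cluster, here is a minimal token-bucket sketch. The token bucket is a common throttling technique chosen for this example, not BlackRock's confirmed approach, and the class name and parameters are hypothetical.

```python
import time


class TokenBucket:
    """Allow about `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the bucket can be tested without sleeping
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if a request may pass, False if it should be throttled."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5, capacity=10)
    passed = sum(bucket.allow() for _ in range(100))
    print(f"{passed} of 100 back-to-back requests passed")
```

A limiter of this kind lets short bursts through while holding the sustained request rate to a level the downstream production system can absorb.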
The solution also had to be complementary with BlackRock’s centralized operational support team structure. “The core infrastructure components of Kubernetes are hooked into our existing orchestration framework, which means that anyone in our support team has both control and visibility to the cluster using the existing operational tools,” Morris explains. “That means that I don’t need to hire more people.”
With those points established, the team created a procedure for the project: “We rolled this out first to a development environment, then moved on to a testing environment and then eventually to two production environments, in that sequential order,” says Maskallis. “That drove a lot of our learning curve. We have all these moving parts, the software components on the infrastructure side, the software components with Kubernetes directly, the interconnectivity with the rest of the environment that we operate here at BlackRock, and how we connect all these pieces. If we came across issues, we fixed them, and then moved on to the different environments to replicate that until we eventually ended up in our production environment where this particular cluster is supposed to live.”
The team had weekly one-hour working sessions with all the members (who are located around the world) participating, and smaller breakout or deep-dive meetings focusing on specific technical details. Possible solutions would be reported back to the group and debated the following week. “I think what made it a successful experiment was people had to work to learn, and they shared their experiences with others,” says Vice President and Software Developer Fouad Semaan. Then, Francis says, “We gave our engineers the space to do what they’re good at. This hasn’t been top-down.”
They were guided by one key axiom: stay focused and avoid scope creep. This meant that they wouldn’t use features that weren’t in the core of Kubernetes and Docker. But if there was a real need, they’d build the features themselves. Luckily, Francis says, “Because of the rapidity of the development, a lot of things we thought we would have to build ourselves have been rolled into the core product. [The package manager Helm is one example.] People have similar problems.”
By the end of the 100 days, the app was up and running for internal BlackRock users. The initial capacity of 30 users was hit within hours, and capacity was quickly increased to 150. “People were immediately all over it,” says Francis. In the next phase of this project, the team plans to scale up the cluster to add more capacity.
Even more importantly, they now have in-production experience with Kubernetes that they can continue to build on—and a complete framework for rolling out new applications. “We’re going to use this infrastructure for lots of other application workloads as time goes on. It’s not just data science; it’s this style of application that needs the dynamism,” says Francis. “Is it the right place to move our core production processes onto? It might be. We’re not at a point where we can say yes or no, but we felt that having real production experience with something like Kubernetes at some form and scale would allow us to understand that. I think we’re 6-12 months away from making a [large scale] decision. We need to gain experience of running the system in production, we need to understand failure modes and how best to manage operational issues.”
For other big companies considering a project like this, Francis says commitment and dedication are key: “We got the signoff from [senior management] from day one, with the commitment that we were able to get the right people. If I had to isolate what makes something complex like this succeed, I would say senior hands-on people who can actually drive it make a huge difference.” With that in place, he adds, “My message to other enterprises like us is you can actually integrate Kubernetes into an existing, well-orchestrated machinery. You don’t have to throw out everything you do. And using Kubernetes made a complex problem significantly easier.”