Babylon’s mission is to put accessible and affordable healthcare services in the hands of every person on earth.

Since its launch in the U.K. in 2013, the startup has facilitated millions of digital consultations around the world, and that’s just the start. “We try to combine different types of technology with the medical expertise that we have in-house to build products that will help patients manage and understand their health, and also help doctors be more efficient at what they do,” says Jérémie Vallée, AI Infrastructure Lead at Babylon.

A large number of these products leverage machine learning and artificial intelligence, and in 2019, researchers hit a pain point. “We have some servers in-house where our researchers were doing a lot of AI experiments and some training of models, and we came to a point where we didn’t have enough compute in-house to run a particular experiment,” says Vallée.

Babylon had migrated its user-facing applications to a Kubernetes platform in 2018, “and we had a lot of Kubernetes knowledge thanks to the migration,” he adds. To optimize some of the models that had been created, the team turned to Kubeflow, a toolkit for machine learning on Kubernetes. “We tried to create a Kubernetes core server, we deployed Kubeflow, and we orchestrated the whole experiment, which ended up being a really good success,” he says.

Based on that experience, Vallée’s team was tasked with building a self-service platform to help Babylon’s AI teams become more efficient, and by extension help get products to market faster. “Kubernetes is a great platform for machine learning because it comes with all the scheduling and scalability that you need,” says Vallée.

Babylon must keep data inside each country in which it operates, which requires a multi-region, multi-cloud strategy, and some countries might not have a public cloud provider at all. “We wanted to make this platform portable so that we can run training jobs anywhere,” he says. “Kubernetes offered a base layer that allows you to deploy the platform outside of the cloud provider, and then deploy whatever tooling you need. That was a very good selling point for us.”

Once the team decided to build the Babylon AI Research platform on top of Kubernetes, they referred to the Cloud Native Landscape to build out the stack: Prometheus and Grafana for monitoring; an Istio service mesh to control networking on the training platform and govern what access the workflows have; Helm to deploy the stack; and Flux to manage the GitOps part of the pipeline.

The cloud native AI platform has had a huge impact at Babylon. The first research projects run on the platform mostly involved machine learning and natural language processing. These experiments required a huge amount of compute (1,600 CPUs and 3.2 TB of RAM), which was much more than Babylon had in-house. Plus, access to compute used to take hours, or sometimes even days, depending on how busy the platform team was. “Now, with Kubernetes and the self-service platform that we provide, it’s pretty much instantaneous,” says Vallée.

Another important type of work that’s done on the platform is clinical validation for new applications such as Babylon’s Symptom Checker, which calculates the probability of a disease given the evidence input by the user. “Being in healthcare, we want all of our models to be safe before they’re going to hit production,” says Vallée. Using Argo for GitOps “enabled us to scale the process massively.”
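
To make “the probability of a disease given the evidence” concrete, here is a minimal sketch of Bayes’ rule for a single yes/no piece of evidence. It is an illustration only and not Babylon’s model: the function name, its parameters, and the numbers are all hypothetical, and a real symptom checker would combine many pieces of evidence across many conditions.

```python
def posterior(prior: float, sensitivity: float, false_positive_rate: float) -> float:
    """Return P(disease | evidence) for one binary piece of evidence.

    prior               -- P(disease) before seeing the evidence
    sensitivity         -- P(evidence | disease)
    false_positive_rate -- P(evidence | no disease)
    """
    # Total probability of observing the evidence at all.
    p_evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    # Bayes' rule: P(d | e) = P(e | d) * P(d) / P(e)
    return sensitivity * prior / p_evidence


# Hypothetical numbers chosen purely for illustration.
p = posterior(prior=0.01, sensitivity=0.9, false_positive_rate=0.05)
print(f"P(disease | evidence) = {p:.3f}")
```

With these made-up inputs, a rare condition remains fairly unlikely even after strong evidence is reported. Checking that kind of behavior across many cases is one reason validation matters before a model reaches production.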

For more on Babylon’s cloud native journey, read the full case study.