Case Study

Mercedes-Benz

How Mercedes-Benz Expanded Its Kubernetes Fleet Management with Cluster API to Public Clouds

Published: June 12, 2023

Projects used

Kubernetes

By the numbers

6000+ Kubernetes Nodes

Nearly 1000 Kubernetes Clusters

15 minutes: Automated Cluster Creation & everything in Self-service

100% Faster Rolling Upgrades

How Mercedes-Benz, in collaboration with Mercedes-Benz Tech Innovation, Expanded Its Kubernetes Fleet Management with Cluster API to Public Clouds

Challenge

Mercedes-Benz is one of the most successful automotive companies in the world. As a wholly owned subsidiary of Mercedes-Benz, Mercedes-Benz Tech Innovation GmbH creates digital products and software solutions in collaboration with its parent company. Kubernetes is a crucial element of our technology stack, and we have relied on it since version 0.9. However, because no Kubernetes fleet management solution existed when we adopted it, we had to build our own.

“We are thrilled to have had the opportunity to work closely with our colleagues from Mercedes-Benz Tech Innovation for the last 7+ years. With confidence, we look forward to future prospects and further development within our Kubernetes environments. As our collaboration continues, we will expand the feature set for our customers, providing a more efficient and user-friendly experience.”

Thorsten Schenk-Trautmann, Product Owner CaaS, Mercedes-Benz Group AG

Our organizations’ Kubernetes fleet, running on a managed, OpenStack-based on-premises IaaS, had grown rapidly. We had been using a self-written Terraform pipeline to provision the infrastructure and the Kubernetes clusters, but it had become overly complex and almost impossible to manage, so we needed a more streamlined solution. We also wanted to be better prepared for public cloud environments like AWS, as we were planning to offer a multi-cloud setup.

Solution

After researching our options, we decided to migrate to Cluster API, a Kubernetes project that provides a declarative API for provisioning and managing Kubernetes clusters. Cluster API allows us to manage our clusters using Kubernetes itself. It provides a more consistent experience for our team of 30 developers and platform engineers, as it reduces the number of necessary frameworks and tools to get the job done. Additionally, Cluster API is designed to work across different cloud providers and infrastructures, which means that we can use the same tools and workflows to manage our clusters regardless of where they are running.
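
To illustrate what "declarative" and "same workflow across providers" can look like, here is a minimal Python sketch that builds a Cluster API Cluster object as a plain dictionary. It is not our actual configuration: the cluster names, the namespace, the pod CIDR, the use of KubeadmControlPlane, and the provider API versions are all assumptions chosen for illustration, and the provider versions in particular depend on the installed provider release.

```python
import json

CAPI_VERSION = "cluster.x-k8s.io/v1beta1"


def cluster_manifest(name, namespace, infra_api_version, infra_kind,
                     pod_cidr="192.168.0.0/16"):
    """Build a Cluster API `Cluster` object as a plain dict.

    Only the infrastructure reference differs between providers; the rest
    of the object stays the same.
    """
    return {
        "apiVersion": CAPI_VERSION,
        "kind": "Cluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "clusterNetwork": {"pods": {"cidrBlocks": [pod_cidr]}},
            # Assumption: a kubeadm-based control plane provider.
            "controlPlaneRef": {
                "apiVersion": "controlplane.cluster.x-k8s.io/v1beta1",
                "kind": "KubeadmControlPlane",
                "name": f"{name}-control-plane",
            },
            "infrastructureRef": {
                "apiVersion": infra_api_version,
                "kind": infra_kind,
                "name": name,
            },
        },
    }


# Hypothetical example values for an on-premises and a public cloud cluster.
on_prem = cluster_manifest(
    "team-a-dev", "team-a",
    "infrastructure.cluster.x-k8s.io/v1alpha7", "OpenStackCluster",
)
public = cluster_manifest(
    "team-a-prod", "team-a",
    "infrastructure.cluster.x-k8s.io/v1beta2", "AWSCluster",
)

if __name__ == "__main__":
    print(json.dumps(on_prem, indent=2))
```

Both objects go through the same function and differ only in their infrastructure reference, which is the property that lets one set of tools serve several clouds.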

Impact

Adopting Cluster API has been transformational for our Kubernetes platform. One of the most significant benefits has been the use of reconciliation loops, which have replaced our legacy cluster lifecycle management based on Terraform and classic pipelines. This has enabled us to always know the state of our infrastructure and to eliminate the possibility of snowflakes or persistent manual changes. With reconciliation loops in place, we have seen a notable reduction in the time it takes to deploy new changes to our Kubernetes clusters, resulting in faster delivery times for our users.
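
The idea behind a reconciliation loop can be shown with a toy Python model. This is a simplified sketch, not Cluster API code: the machine names and flavors are hypothetical, and a real controller talks to the Kubernetes API and the cloud provider instead of comparing dictionaries.

```python
def reconcile(desired, observed):
    """Return the actions needed to make `observed` match `desired`.

    Both arguments map a resource name to its configuration.
    """
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in observed:
            actions.append(("create", name))
        elif observed[name] != spec:
            actions.append(("update", name))
    for name in sorted(observed):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_actions(actions, desired, observed):
    """Apply the actions to a copy of `observed` and return the new state."""
    state = dict(observed)
    for verb, name in actions:
        if verb == "delete":
            del state[name]
        else:
            state[name] = desired[name]
    return state


desired = {
    "worker-1": {"flavor": "m1.large"},
    "worker-2": {"flavor": "m1.large"},
}
# worker-2 was resized by hand (a snowflake) and worker-3 was created manually.
observed = {
    "worker-1": {"flavor": "m1.large"},
    "worker-2": {"flavor": "m1.xlarge"},
    "worker-3": {"flavor": "m1.small"},
}

actions = reconcile(desired, observed)
print(actions)  # [('update', 'worker-2'), ('delete', 'worker-3')]

observed = apply_actions(actions, desired, observed)
print(reconcile(desired, observed))  # []
```

Because the comparison runs on every pass, a manual change never persists: it shows up as a difference and is reverted the next time the loop runs.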

“With Cluster API, we now have a reliable way to ensure that the infrastructure state always matches the desired configuration. This was not always the case with our legacy fleet management using Terraform, where a state mismatch could have occurred over time. Thanks to Cluster API, this mismatch has been eliminated, leading to a more stable and predictable infrastructure.”

Francesco Della Coletta, Reliability Engineer, Mercedes-Benz Tech Innovation GmbH

Additionally, we have successfully tested rolling upgrades which only need to be triggered once, thereby reducing the need for restarts of pipelines. This has been a major win for us as we can ensure high availability of our services, avoid downtime and minimize disruptions to our users.

“Kubernetes rolling upgrades are now much smoother thanks to Cluster API. This is particularly advantageous for users who rely on PodDisruptionBudgets (PDB) to ensure application availability. In cases where a Pod fails to become ready during a failover, our legacy pipeline required the upgrade job to be re-run after the issue was resolved. However, with Cluster API and its continuous reconciliation loops, the upgrade process can resume immediately upon the Pod becoming ready, thereby minimizing downtime and reducing the risk of failed upgrades.”

Tobias Giese, Software Engineer, Mercedes-Benz Tech Innovation GmbH
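
The difference between re-running a failed job and resuming automatically can be sketched with a toy Python model. It is an illustration under stated assumptions, not how Cluster API is implemented: the node names, versions, and the per-pass "disruptions allowed" numbers are hypothetical, and a real upgrade drains and replaces machines while evictions honor PodDisruptionBudgets.

```python
def upgrade_step(nodes, target_version, disruptions_allowed):
    """Run one reconciliation pass of a rolling upgrade.

    Upgrades at most one outdated node per pass, and only if the disruption
    budget currently permits it. Returns "done", "progressed" or "blocked".
    """
    outdated = sorted(n for n, v in nodes.items() if v != target_version)
    if not outdated:
        return "done"
    if disruptions_allowed < 1:
        # Nothing fails here; the next pass simply tries again.
        return "blocked"
    nodes[outdated[0]] = target_version
    return "progressed"


nodes = {"node-a": "v1.26", "node-b": "v1.26"}
# Disruptions allowed on each pass: the budget is exhausted on passes 2 and 3
# (for example, a Pod is not ready yet) and recovers afterwards.
budget_per_pass = [1, 0, 0, 1, 1]

history = []
for allowed in budget_per_pass:
    history.append(upgrade_step(nodes, "v1.27", allowed))

print(history)  # ['progressed', 'blocked', 'blocked', 'progressed', 'done']
```

A blocked pass leaves the upgrade pending instead of failed, so it continues as soon as the budget allows another disruption, without anyone re-triggering it.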

The ability to use pluggable Cluster API cloud providers has been another game-changer for us. We were able to onboard new public clouds with much greater ease and speed, enabling us to scale our Kubernetes infrastructure quickly and efficiently.

Another significant benefit of Cluster API has been its open source nature. By leveraging an open source project, we have been able to focus on our core business while relying on the contributions and support of the broader community. The collaborative nature of the project has also allowed us to learn from others in the industry, gain new insights, and build better solutions. We have been able to contribute to the open source and CNCF community, which has allowed us to give back and help others in the industry.

“Being part of the greater Kubernetes ecosystem, especially inside SIG cluster-lifecycle, and thereby following their community principles, did not only shape our own internal way of working but also gave us the chance to be an active contributor rather than a passive consumer of software.”

Johannes Frey, Software Engineer, Mercedes-Benz Tech Innovation GmbH

Finally, Cluster API has enabled us to scale our Kubernetes infrastructure significantly. We started with a fleet of 200 clusters using Terraform, which was becoming unmanageable. Today, we have almost 1000 clusters, all managed efficiently through Cluster API and multiple management clusters. We can now manage our infrastructure with ease, knowing that we have a scalable and reliable system in place.

In conclusion, the benefits of Cluster API have been numerous for us. We have been able to eliminate snowflakes, reduce deployment time, perform rolling upgrades, onboard new clouds faster, standardize our infrastructure, and leverage the power of open source.

If you want to learn how we have migrated from Terraform to Cluster API with zero downtime, take a look at our talk at KubeCon 2022:


Our Philosophy – Everything as a Self-Service

At Mercedes-Benz and its subsidiary companies, such as Mercedes-Benz Tech Innovation, we prioritize shorter time to market, and we believe in empowering our teams with self-service solutions. To achieve this, we have embraced a FOSS-first strategy. By adopting this self-service and FOSS-first approach, we have been able to reduce the time required for routine infrastructure tasks and focus on delivering value to our users. In the end, this gives users of our platform the freedom to concentrate on their core business rather than losing time tackling infrastructure-related challenges.

Watch our keynote at KubeCon + CloudNativeCon 2022 in Valencia for more insights into how our work with Kubernetes has evolved over the last decade:

By "everything as a self-service" we mean that we strive to make everything available through APIs and UIs, giving our users the flexibility to manage their own resources as they see fit. For example, we handle load balancers on-premises through the OpenStack Cloud Controller Manager (OCCM), enabling our users to easily provision and manage load balancers via Kubernetes Services of type LoadBalancer.

Similarly, our approach to firewalls is to maintain a deny-all policy within the cluster, while allowing users to open ports through NetworkPolicies as needed. In addition, we added the capability to restrict ingress traffic to load balancers from specific IP CIDRs using loadBalancerSourceRanges for Kubernetes Services of type LoadBalancer.
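
As a rough illustration of these building blocks, the following Python sketch assembles the corresponding Kubernetes objects as plain dictionaries. The object names, labels, ports, and CIDRs are hypothetical, and the deny-all NetworkPolicy shown is the generic Kubernetes pattern, which may differ from how the policy is enforced on our platform.

```python
import ipaddress


def load_balancer_service(name, selector, port, source_ranges):
    """Build a Service of type LoadBalancer restricted to the given CIDRs."""
    # Validate each CIDR early; ip_network raises ValueError on bad input.
    cidrs = [str(ipaddress.ip_network(r)) for r in source_ranges]
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "LoadBalancer",
            "selector": selector,
            "ports": [{"protocol": "TCP", "port": port, "targetPort": port}],
            "loadBalancerSourceRanges": cidrs,
        },
    }


def deny_all_policy(namespace):
    """Select every Pod in the namespace and allow no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_ingress_policy(namespace, name, pod_labels, port):
    """Open one TCP port for the Pods matching `pod_labels`."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": pod_labels},
            "policyTypes": ["Ingress"],
            "ingress": [{"ports": [{"protocol": "TCP", "port": port}]}],
        },
    }


# Hypothetical example: expose a web app only to two documentation CIDRs.
service = load_balancer_service(
    "web", {"app": "web"}, 443, ["192.0.2.0/24", "198.51.100.0/24"]
)
policies = [
    deny_all_policy("team-a"),
    allow_ingress_policy("team-a", "allow-web", {"app": "web"}, 443),
]
```

Validating the CIDRs before building the Service catches typos before the object ever reaches the cluster.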

To further enhance self-service, we have developed an API that gives users an interface to manage their own clusters. This API is complemented by a user-friendly UI that makes cluster management easy.

Screenshot showing Mercedes-Benz' self-service portal

Please note: All values are example values and do not reflect actual cluster information.

Lastly, to ensure that our clusters are secure and adhere to best practices, we use tools like our custom CaaS Inspector, which is based on Popeye, to inspect our clusters and identify areas for improvement. Based on our experience and industry best practices, the CaaS Inspector calculates a score for the clusters, providing insight into the overall condition of the clusters’ configuration.

Screenshot showing Mercedes-Benz' CaaS inspector tool

What’s Next?

Going forward, we are excited about the next steps in our use of Cluster API, especially about leveraging the Cluster API Runtime SDK to further enhance the capabilities of our cluster management system.

In addition, we are exploring GitOps with FluxCD as a means of deploying add-ons like metrics exporters or custom controllers to the workload clusters. We currently rely on Ansible for this purpose; GitOps would allow us to reconcile these add-ons continuously and to manage their versions more efficiently.

Furthermore, we are planning to build a custom cluster controller that will handle all necessary meta information in CRDs for the clusters, as well as the infrastructure that must be created before Cluster API can provision clusters. This will give us even more flexibility and control over our cluster management system and help us further optimize our workflows while increasing the stability and scalability of our Kubernetes fleet.