LY Corporation

How LY Corporation Scaled Kubernetes from 5 to 1,300+ Clusters with Full Automation on Private Infrastructure

Executive Summary

LY Corporation built a fully automated private Kubernetes platform that grew from 5 production clusters in July 2017 to more than 1,300 clusters and 40,000+ nodes today. By using Kubernetes Custom Resource Definitions (CRDs) and custom controllers as an infrastructure control plane, the company cut provisioning from weeks to hours, eliminated manual recovery, and enabled 15 engineers to operate at this scale reliably and securely.

Company Overview

LY Corporation is a Tokyo-based technology company operating large-scale digital services. As engineering teams expanded and service demand increased, the company needed an infrastructure platform capable of scaling predictably while maintaining strong reliability and security standards.

Challenges:

Automation, Efficiency, Reliability, Scaling, Security

Industry:

Software

Location:

Japan

Cloud Type:

On-prem, Private

Published:

July 2, 2026

By the numbers

1,300+

Kubernetes clusters managed

40,000+

nodes operated on private infrastructure

15

engineers running the platform at hyperscale

Challenge: Scaling Infrastructure Without Linear Team Growth

Operating Kubernetes at large scale required more than deploying clusters. Manual provisioning and ad hoc operations would not scale, and planned maintenance had to proceed without disrupting running services.

As internal adoption increased, several challenges became evident:

Provisioning environments often took more than a week
Releases required cross-team coordination
Nodes remained unpatched for extended periods, increasing security risk
Recovering failed nodes could take several days
Maintenance and upgrades risked service interruption at scale

Infrastructure operations were limiting business velocity, and simply expanding the operations team was not sustainable. LY needed a declarative, automated model that could scale without proportional growth in operational overhead—with zero-downtime operations as a first-class requirement.

Solution: Growing with Kubernetes from Day One

LY began evaluating Kubernetes in 2015 with the goal of running it at scale on private infrastructure. In 2016, the platform team prototyped a Kubernetes-native way to describe and reconcile cluster state using ThirdPartyResource (TPR).

In 2017, as Custom Resource Definitions (CRDs) became available, the team moved to a production-ready design and launched with 5 Kubernetes clusters in July 2017.

The team designed a custom resource called KubernetesCluster to represent the desired state of each cluster. Custom controllers continuously reconcile that state on the company’s OpenStack-based private cloud by:

Provisioning virtual machines
Configuring load balancers
Bootstrapping control plane components
Creating and scaling worker nodes

Provisioning a new cluster became as simple as applying a resource definition.
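
The reconciliation idea can be illustrated with a minimal sketch. This is not LY's controller: the field names (control_plane_replicas, worker_replicas), the status fields, and the step ordering are assumptions made for illustration, and the infrastructure calls are replaced by simple counters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSpec:
    """Desired state, as a hypothetical KubernetesCluster resource might declare it."""
    name: str
    control_plane_replicas: int = 3
    worker_replicas: int = 3


@dataclass
class ClusterStatus:
    """Observed state that the controller compares against the spec."""
    control_plane_vms: int = 0
    load_balancer_ready: bool = False
    control_plane_bootstrapped: bool = False
    worker_nodes: int = 0


def reconcile(spec: ClusterSpec, status: ClusterStatus) -> List[str]:
    """Take at most one corrective step and return the actions performed.

    An empty list means the observed state already matches the desired state.
    Real infrastructure calls are stood in for by updates to `status`.
    """
    if status.control_plane_vms < spec.control_plane_replicas:
        status.control_plane_vms += 1
        return ["provision control plane VM"]
    if not status.load_balancer_ready:
        status.load_balancer_ready = True
        return ["configure load balancer"]
    if not status.control_plane_bootstrapped:
        status.control_plane_bootstrapped = True
        return ["bootstrap control plane"]
    if status.worker_nodes < spec.worker_replicas:
        status.worker_nodes += 1
        return ["create worker node"]
    if status.worker_nodes > spec.worker_replicas:
        status.worker_nodes -= 1
        return ["remove worker node"]
    return []


def run_until_converged(spec: ClusterSpec, status: ClusterStatus, max_steps: int = 100) -> List[str]:
    """Call reconcile repeatedly until nothing is left to do."""
    log: List[str] = []
    for _ in range(max_steps):
        actions = reconcile(spec, status)
        if not actions:
            break
        log.extend(actions)
    return log
```

Because each pass only compares desired and observed state, the same loop covers first-time provisioning and later scaling: changing worker_replicas in the spec and running the loop again produces only the missing steps.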

This Kubernetes-native model let the team automate cluster lifecycle and operations at scale, and it made zero-downtime cluster upgrades and maintenance possible. Upgrades use health-gated rolling node replacement: new nodes are added and must pass load-balancer health checks before old ones are drained and removed. The control plane and etcd run as redundant replicas that are rotated one node at a time, so the Kubernetes API and running services stay available throughout.
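
A health-gated rolling replacement can be sketched as follows. The function and its callbacks (create_node, is_healthy) are hypothetical stand-ins for VM creation and load-balancer health checks, and halting on the first unhealthy replacement is one possible policy, not necessarily the one LY uses.

```python
from typing import Callable, List


def rolling_replace(
    old_nodes: List[str],
    create_node: Callable[[str], str],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Replace nodes one at a time and return the resulting node pool.

    An old node is removed only after its replacement passes the health
    check, so the pool never shrinks below its original size.
    """
    pool = list(old_nodes)
    for old in old_nodes:
        new = create_node(old)
        pool.append(new)          # add capacity before removing any
        if not is_healthy(new):
            pool.remove(new)      # discard the failed replacement
            return pool           # halt the rollout; the old node keeps serving
        pool.remove(old)          # stands in for drain and delete
    return pool
```

For example, rolling_replace(["a", "b"], lambda old: old + "-v2", lambda node: True) returns ["a-v2", "b-v2"], while a failing health check for "b-v2" leaves "b" in the pool alongside "a-v2".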

What began as 5 clusters steadily expanded—first to dozens, then to hundreds—as Kubernetes became the default runtime for new services. Today, the platform manages more than 1,300 clusters and 40,000+ nodes.

Built-In Observability, Reliability, and Developer Experience

Rather than delivering vanilla Kubernetes, LY provides clusters with a curated set of production-ready add-ons deployed by default, improving developer experience and operational consistency.

Each cluster includes:

Prometheus with node-exporter and kube-state-metrics, automatically collecting metrics
Alertmanager with predefined alert rules
Centralized logging pipelines
Ingress controllers and cert-manager for automated TLS certificate issuance and rotation

These defaults ensure consistent observability, reliability, and security across all clusters. To prevent configuration drift and long-lived security exposure, nodes are automatically recreated every 3–4 months in alignment with Kubernetes lifecycle policies, eliminating manual recovery procedures.
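
Age-based rotation reduces to selecting the nodes that have outlived a threshold. The sketch below assumes a 90-day limit (the article only says 3–4 months) and a plain mapping of node names to creation times; both are illustrative choices rather than details of LY's implementation.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Assumption: 90 days approximates the lower end of the "3-4 months" window.
MAX_NODE_AGE = timedelta(days=90)


def nodes_due_for_rotation(
    created_at: Dict[str, datetime],
    now: datetime,
    max_age: timedelta = MAX_NODE_AGE,
) -> List[str]:
    """Return the names of nodes at least max_age old, oldest first."""
    due = [name for name, created in created_at.items() if now - created >= max_age]
    return sorted(due, key=lambda name: created_at[name])
```

Feeding the returned names into a rolling replacement keeps rotation on the same code path as upgrades, rather than introducing a separate recovery procedure.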

Results: Operating at Hyperscale with a Small Platform Team

Nearly a decade after beginning Kubernetes evaluation, LY operates one of Japan’s largest private, on-premises Kubernetes environments.

Operational Impact

1,300+ Kubernetes clusters
40,000+ nodes under management
15 engineers operating at hyperscale
Provisioning reduced from weeks to hours
Zero manual node recovery
100% lifecycle-aligned node rotation every 3–4 months

Beyond operational efficiency, the platform changed how product teams build and ship. The first product to run in production on the platform had to be re-architected as a Twelve-Factor app when it moved from VMs to Kubernetes, and that investment paid off across several dimensions:

Scaling for spikes: campaigns drive traffic from several times to tens of times the normal level (up to ~2,000 req/s). Scaling out used to take days of manual work; now the team changes only the desired number of nodes or Pods, and the platform provisions the capacity automatically with no hands-on effort.
Release speed: with a release flow that runs entirely through GitHub and CI/CD, the time from merging a pull request to a completed release dropped from several hours to about 10 minutes.
On-demand environments: standing up a purpose-built environment (e.g., for performance testing) now requires only creating another KubernetesCluster resource.
No more manual recovery: the platform health-checks the underlying VMs and automatically recreates and rejoins a node on failure, freeing the team from VM-failure on-call.

The first product team’s success made the platform the de facto standard for services across the company. More broadly, developers adopted cloud native practices such as graceful shutdown with SIGTERM, accelerating deployment cycles and reducing production risk.
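
Graceful shutdown matters because Kubernetes sends SIGTERM to a container before stopping it, for example when a node is drained. A minimal sketch of the pattern is shown below; the serve loop and its handle and cleanup callbacks are hypothetical and stand in for a real request loop.

```python
import signal
import threading

# Set once SIGTERM has been received; checked by the serving loop.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: only record the request, the loop does the cleanup."""
    shutdown_requested.set()


def install_handler():
    """Register the handler for SIGTERM. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def serve(requests, handle, cleanup):
    """Handle requests until shutdown is requested, then clean up.

    A request that has started always finishes; no new request is
    started once the shutdown flag is set.
    """
    completed = 0
    for request in requests:
        if shutdown_requested.is_set():
            break
        handle(request)
        completed += 1
    cleanup()  # e.g. close connections, flush buffers
    return completed
```

A service would call install_handler() once at startup and then run serve(...). Keeping the handler to a single flag update avoids doing real work inside the signal handler.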

“Treating Kubernetes as our infrastructure control plane didn’t just let a small team run 40,000+ nodes reliably and securely—it turned infrastructure into a self-service platform where product teams scale and release on their own.”
Shota Yoshimura, Senior Platform Engineer, LY Corporation

Lessons Learned

Treat Kubernetes as a programmable platform, not just an orchestrator
Design for zero-downtime maintenance early—at scale it becomes a requirement, not an option
Standardize observability and security to improve developer experience
Eliminate manual recovery before scale makes it unsustainable

Future Plans and Community Engagement

LY continues to expand automation and strengthen its security posture while exploring deeper integration with emerging cloud native technologies. The company shares its operational experience at events such as KubeCon and participates in Cloud Native Community Japan, contributing back to the ecosystem and remaining committed to collaboration and knowledge sharing.