Case Study

OpenAI

How OpenAI Reduced Fluent Bit CPU Usage by 50% and Freed 30,000 Cores

This case study won the CNCF Case Study Contest and was presented as a keynote at KubeCon + CloudNativeCon North America on Wednesday, November 12th, 2025, by Fabian Ponce, Member of Technical Staff at OpenAI.

Challenge: When logging infrastructure hit its limits

OpenAI processes over 9 petabytes of logs daily, a demanding workload for an organization where every CPU cycle matters for AI research and production inference. Everything runs on Kubernetes, with each node running multiple DaemonSets: Fluent Bit, the OpenTelemetry Collector, Datadog agents, and more. Due to OpenAI’s explosive growth, some workloads were even shipping logs twice to different log stores, creating what Ponce describes as “a splattering of DaemonSets.”

The busiest hosts started hitting Linux CFS throttling events. When logging DaemonSets get throttled, logs get dropped. In AI research, where every log line might contain critical information, lost logs are unacceptable. The team couldn’t allocate more CPU, and infrastructure was at capacity.
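CFS throttling shows up directly in the kernel’s cgroup accounting, so it is cheap to confirm on a suspect node. A minimal check, assuming cgroup v2 (the slice path below is illustrative and varies by node and Kubernetes version):

```shell
# Read the CPU statistics for the pod's cgroup; the slice path is illustrative.
cat /sys/fs/cgroup/kubepods.slice/<pod-slice>/cpu.stat
# Among the fields reported:
#   nr_periods      total CFS enforcement periods
#   nr_throttled    periods in which the cgroup exhausted its CPU quota
#   throttled_usec  total time spent throttled, in microseconds
```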

The solution: Finding the root cause with perf

The discovery

During Ponce’s first month on the observability team, he used perf, a Linux profiling tool, to profile their Fluent Bit deployment.
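The talk doesn’t list the exact commands, but a typical on-CPU profiling pass with perf against a running process looks something like this (the fluent-bit process name and the sampling parameters are assumptions):

```shell
# Sample the Fluent Bit process at ~99 Hz with call graphs for 30 seconds,
# then summarize which functions and syscalls consumed the most CPU time.
perf record -F 99 -g -p "$(pidof fluent-bit)" -- sleep 30
perf report
```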

The report revealed something unexpected: rather than string processing dominating CPU time, the stat syscall was consuming more CPU time than anything else. The root cause lay in how Fluent Bit monitors log files. By default, it uses Linux’s inotify API to detect file changes. However, inotify events don’t include how much data was written, so Fluent Bit immediately calls stat after every event to check the file size and records it in its database, so that it knows where it left off and how much data is available to read.

At OpenAI’s scale, with pods flushing logs line by line, this creates a syscall storm: millions of unnecessary syscalls, with threads constantly spinning at processor speed.
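A minimal C sketch of the pattern, not Fluent Bit’s actual code, shows why the syscall count tracks the flush rate: the watcher gets one IN_MODIFY event per flush and must stat the file each time just to learn its new size.

```c
/* Illustrative sketch (not Fluent Bit's code): an inotify-driven tailer
 * must stat() the file after every write event to learn its new size. */
#include <stdio.h>
#include <sys/inotify.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <logfile>\n", argv[0]);
        return 1;
    }

    int fd = inotify_init1(0);
    if (fd < 0 || inotify_add_watch(fd, argv[1], IN_MODIFY) < 0) {
        perror("inotify");
        return 1;
    }

    char buf[4096];
    for (;;) {
        /* One IN_MODIFY event arrives per flush of the log file... */
        if (read(fd, buf, sizeof(buf)) <= 0)
            break;

        /* ...but the event carries no size information, so the tailer must
         * stat() to learn how much new data there is. With line-by-line
         * flushing, this becomes roughly one extra syscall per log line. */
        struct stat st;
        if (stat(argv[1], &st) == 0)
            printf("file grew to %lld bytes\n", (long long)st.st_size);
    }
    return 0;
}
```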

The implementation

The solution was remarkably simple: disable inotify entirely and revert to stat-based polling with a one-line configuration change, inotify: false. Ponce deployed it to a test compute cluster as his first production deploy, and CPU usage immediately dropped by 50%.
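In the tail input plugin, the relevant knob is the Inotify_Watcher option. A minimal sketch in the classic configuration format (the path and the surrounding input settings are assumptions, and the exact key spelling differs between the classic and YAML formats):

```ini
[INPUT]
    Name              tail
    # Illustrative path; match whatever the deployment already tails.
    Path              /var/log/containers/*.log
    # Disable the inotify watcher so the plugin falls back to stat-based
    # polling of the watched files.
    Inotify_Watcher   false
```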

The team rolled out the change fleet-wide, tuning stat polling aggressively, as frequently as once per second for their chattiest pods. Even polling that often, they were making orders of magnitude fewer syscalls than with inotify: under inotify, every flushed line had triggered a stat, while polling caps it at one stat per file per second.

“We reduced CPU usage of Fluent Bit by 50% across our entire fleet, providing much-needed capacity to our entire research and applied infrastructure. This is the power of the CNCF ecosystem: when you deeply understand these tools, you can optimize them in ways that benefit everyone.”

Fabian Ponce, Member of Technical Staff, OpenAI

Because CPU on each Kubernetes node is shared among co-located workloads, reducing Fluent Bit’s footprint improved performance for everything running alongside it. Those 30,000 cores went back to running ChatGPT, serving inference requests, and running AI experiments.

Contributing back

While disabling inotify solved OpenAI’s immediate problem, there’s a better long-term solution. Ponce explains: “What Fluent Bit really needs is just a window. I received the inotify event, I’ll schedule a stat sometime within like 500 milliseconds.” This debouncing approach would maintain inotify’s reactivity while eliminating the syscall explosion.
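A rough sketch of that idea in C, not the patch actually proposed to the maintainers: the first inotify event in a quiet period schedules a single stat 500 ms out, and every further event inside that window is absorbed by it.

```c
/* Illustrative debounce sketch (not Fluent Bit code): coalesce inotify
 * events so each burst of writes costs at most one stat() per window. */
#include <poll.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

#define WINDOW_MS 500

static long long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000LL + ts.tv_nsec / 1000000LL;
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <logfile>\n", argv[0]);
        return 1;
    }

    int fd = inotify_init1(IN_NONBLOCK);
    inotify_add_watch(fd, argv[1], IN_MODIFY);

    char buf[4096];
    long long due = -1;   /* monotonic time of the next scheduled stat(), if any */

    for (;;) {
        /* Sleep until either a new event arrives or the scheduled stat() is due. */
        int timeout = -1;
        if (due >= 0) {
            long long remaining = due - now_ms();
            timeout = remaining > 0 ? (int)remaining : 0;
        }

        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        poll(&pfd, 1, timeout);

        /* Drain every queued event. The first event in a quiet period schedules
         * one stat() WINDOW_MS in the future; later events are simply absorbed. */
        if (pfd.revents & POLLIN) {
            while (read(fd, buf, sizeof(buf)) > 0) {
                if (due < 0)
                    due = now_ms() + WINDOW_MS;
            }
        }

        /* Window expired: a single stat() covers every write seen since it opened. */
        if (due >= 0 && now_ms() >= due) {
            struct stat st;
            if (stat(argv[1], &st) == 0)
                printf("file is now %lld bytes\n", (long long)st.st_size);
            due = -1;
        }
    }
}
```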

When Ponce discussed this with the Fluent Bit maintainers, they were receptive to the enhancement. OpenAI plans to contribute it upstream so the entire CNCF community can benefit.

Industry: AI
Published: November 12, 2025

Projects used: Fluent Bit, Kubernetes, OpenTelemetry

By the numbers

50% reduction in Fluent Bit CPU utilization

30,000 CPU cores returned to production capacity

9+ petabytes of logs processed daily

Impact: Building OLogs: A petabyte-scale platform on CNCF foundations

OpenAI’s optimization enabled OLogs, the company’s internal log platform, which processes 9+ petabytes daily and is built entirely on CNCF projects.

The platform provides OLogs Query Language (“OQL”) for quick searches and full SQL access for complex analysis. The team is also building “wide events” to store high-cardinality data that would explode traditional time-series databases.

“You can run perf in 15 minutes if you have root on a system,” Ponce emphasizes. This optimization shows you don’t need petabyte scale to benefit; you just need the right combination of log volume and continuous flushing behavior.

Certain issues only appear at extreme scale, but when you discover edge cases and contribute fixes back, you improve tools for everyone.

“The cloud era has increased the amount of distributed systems thinking a lot, and that’s been great for reliability and scalability, but there is no substitute for knowing how to optimize for hardware.”

Fabian Ponce, Member of Technical Staff, OpenAI

Looking ahead: Scaling observability in-house

OpenAI is reducing third-party observability costs by bringing more capabilities in-house on the CNCF stack. The observability team has grown from 7-8 engineers to 16, functioning as a data infrastructure team with CNCF projects as their foundation.

The 50% CPU reduction and 30,000 freed cores came from deep engagement with the CNCF ecosystem. OpenAI understood how Fluent Bit worked under the hood and tuned it for their environment. Now they’re contributing that knowledge back so the next organization encountering this edge case won’t need weeks of investigation. That’s the cloud native promise: excellent tools that scale, communities that collaborate, and improvements that lift all boats.


About OpenAI’s Observability Team: OpenAI’s observability team has grown from 7-8 engineers to 16, serving engineers across research and applied teams. They focus on processing petabyte-scale telemetry in CPU-constrained environments while maintaining reliability for mission-critical AI workloads.