Establishing unified observability and governance across hybrid platforms with OpenTelemetry
Introduction
A globally renowned financial institution modernized a large application estate from mainframes and vendor platforms to cloud‑native Java and Python services running on Kubernetes across private and public clouds. Applications are deployed across two geographically separated data centers for high availability, while batch processing relies on AutoSys, with Job Information Language (JIL) files stored in GitHub/Bitbucket and job metadata in PostgreSQL. The end‑to‑end customer journey depends heavily on multiple third‑party systems, each operating its own monitoring or observability platform. Applications span Kubernetes and non‑Kubernetes environments and expose interfaces through REST, SOAP, and command‑line access, which complicates unified observability, increases operational overhead, and lengthens incident timelines.
Challenges
The organization faced vendor lock‑in and rising costs from multi‑tool sprawl and uncontrolled telemetry ingestion and retention. Hybrid visibility was limited: legacy platforms and Kubernetes estates were observed separately, dashboards proliferated, and alerting was largely reactive, with more than half of the alerts on legacy systems amounting to noise. Logs, metrics, traces, and events were not adequately correlated, pushing mean time to detect (MTTD) and mean time to resolve (MTTR) higher during peak periods. Incident remediation depended on manual scripts and scattered runbooks (Jive/Confluence), while long‑running batch jobs exposed little intermediate state, complicating root cause analysis (RCA) across mainframe and external dependencies. There was also no synthetic monitoring for legacy or third‑party systems, service maps of end‑to‑end customer journeys were incomplete, and centralized, persona‑based views with Service Level Indicator (SLI) and Service Level Objective (SLO) governance were missing.
Solution
In late 2025, after repeated batch job failures across multiple technology stacks, Infosys suggested adopting a single unified observability and automation platform to replace scattered monitoring tools and manual troubleshooting. With one implementation, the team consolidated logs, metrics, traces, events, service maps, and synthetic monitoring—giving every batch system a common operational backbone.
A phased rollout enabled a smooth migration across legacy and cloud environments. As visibility gaps surfaced, the team built custom instrumentation to track intermediate job states and improve correlation across mainframe and distributed dependencies.
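A minimal sketch of what such step‑level batch instrumentation could look like with the OpenTelemetry Python SDK is shown below. The collector endpoint, service name, and job/step names are illustrative assumptions, not the institution's actual implementation.

```python
"""Sketch: exposing intermediate batch job states as OpenTelemetry spans and events.
Assumes opentelemetry-sdk and the OTLP gRPC exporter are installed."""
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Ship spans to a local OpenTelemetry Collector gateway (default OTLP/gRPC port).
provider = TracerProvider(
    resource=Resource.create({"service.name": "eod-settlement-batch"})  # hypothetical service name
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)


def run_job(job_id: str, steps):
    """Wrap the whole job in a parent span and each step in a child span.

    `steps` is a list of (name, callable) pairs; each callable returns a row count.
    """
    with tracer.start_as_current_span("batch.job", attributes={"job.id": job_id}) as job_span:
        for name, step_fn in steps:
            with tracer.start_as_current_span(f"batch.step.{name}") as step_span:
                try:
                    rows = step_fn()
                    # Record intermediate state so long-running jobs expose progress mid-flight.
                    step_span.add_event("step.completed", {"rows.processed": rows})
                except Exception as exc:
                    step_span.record_exception(exc)
                    step_span.set_status(Status(StatusCode.ERROR, str(exc)))
                    job_span.add_event("job.failed_at_step", {"step": name})
                    raise
```

In this shape, every step boundary and failure lands in the same trace as the surrounding distributed calls, which is what enables the cross‑dependency correlation described above.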
Role‑based dashboards became the primary troubleshooting and reporting interface: executives received high‑level service health views, SREs used deep operational insights for RCA, and developers accessed step‑level failure details.
The platform also introduced auto‑healing by triggering predefined Ansible remediation scripts whenever known error patterns appeared, resolving many recurring issues without human intervention. With synthetic monitoring and end‑to‑end maps of all batch flows, failures are now detected earlier and resolved faster.
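A compact sketch of the pattern‑to‑playbook mapping behind such an auto‑healing hook is shown below. The error signatures, playbook paths, and alert payload fields are hypothetical stand‑ins for the predefined remediation scripts described above.

```python
"""Sketch: map known error patterns to predefined Ansible remediation playbooks.
Patterns, playbook paths, and alert fields are illustrative assumptions."""
import re
import subprocess

# Known error signature -> remediation playbook (illustrative pairs).
REMEDIATIONS = [
    (re.compile(r"stale NFS file handle", re.I), "playbooks/remount_nfs.yml"),
    (re.compile(r"OOMKilled", re.I), "playbooks/restart_batch_pod.yml"),
    (re.compile(r"connection pool exhausted", re.I), "playbooks/recycle_db_connections.yml"),
]


def autoheal(alert: dict) -> bool:
    """Run the first matching playbook for an incoming alert payload.

    Returns True if a remediation was triggered, False if the alert needs a human.
    """
    message = alert.get("annotations", {}).get("description", "")
    for pattern, playbook in REMEDIATIONS:
        if pattern.search(message):
            # ansible-playbook is assumed to be on the automation runner's PATH.
            subprocess.run(
                ["ansible-playbook", playbook,
                 "-e", f"alert_fingerprint={alert.get('fingerprint', '')}"],
                check=True,
                timeout=600,
            )
            return True
    return False
```

Unmatched alerts fall through to the normal on‑call path, so the hook only automates the recurring, well‑understood failure modes.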


Before/After
| Dimension | Before | After |
| --- | --- | --- |
| Tooling | Multiple vendors with tight coupling and high license/storage costs | OpenTelemetry‑first pipelines with backend portability and right‑sized retention |
| Signal Correlation | Logs, metrics, and traces handled in silos | Trace ID correlation across signals and events (sketched after this table) |
| Alerting | Reactive and noisy; RCA driven by war‑rooms | Actionable alerts with noise suppression and faster RCA |
| Coverage | Either Kubernetes or legacy, but rarely both | Unified visibility across K8s, non‑K8s, batch, and third‑party |
| Dashboards | URL sprawl and non‑persona views | Centralized, persona‑based dashboards |
| Governance | Limited SLI/SLO tracking | Reliability governance with trend reporting |
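One way the trace ID correlation in the table can be achieved on the application side is to stamp the active trace context onto every log record, so logs and traces can be joined in the backend. The sketch below assumes the OpenTelemetry Python API; the log format and logger name are illustrative.

```python
"""Sketch: inject the current trace/span IDs into application log records.
Field names and the log format are assumptions for illustration."""
import logging
from opentelemetry import trace


class TraceContextFilter(logging.Filter):
    """Copy the active span's trace and span IDs into each log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = f"{ctx.trace_id:032x}" if ctx.is_valid else "-"
        record.span_id = f"{ctx.span_id:016x}" if ctx.is_valid else "-"
        return True


handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))
logger = logging.getLogger("batch")
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())
logger.setLevel(logging.INFO)

logger.info("settlement step completed")  # carries the trace ID when emitted inside a span
```

Packages such as opentelemetry-instrumentation-logging can perform this injection automatically; the manual filter above simply makes the mechanism explicit.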
Impact
By consolidating on OpenTelemetry and correlating signals end‑to‑end, the platform reduced alert noise by roughly 50%, with anomaly suppression eliminating an additional 70–85%; MTTD and MTTR improved by about 40%, incident escalations fell by ~35%, SRE and developer productivity rose by ~30%, and overall observability costs declined by ~20% thanks to vendor‑neutral backends, all while sustaining 99.99% availability across the multi‑data‑center Kubernetes estate.
KPI Table
| KPI | Delta (improvement) | Measurement Source |
| --- | --- | --- |
| Alert Noise (%) | ↓ ~50% | Alert logs, rule hits |
| MTTD (min) | ↓ ~40% | Signal‑to‑alert timeline (see the sketch after this table) |
| MTTR (min) | ↓ ~40% | Incident system |
| Incident Escalations | ↓ ~35% | On‑call escalation logs |
| Platform Uptime | 99.99% | SLO dashboard |
| SRE/Dev Productivity | ↑ ~30% | Time‑on‑call, MTTK, toil surveys |
| Cost to Observe ($/month) | ↓ ~20% | Licensing + storage BOM |
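As a rough illustration of how the MTTD and MTTR figures could be derived from the measurement sources above, the sketch below computes both (in minutes) from incident records exported from the incident system. The field names and sample timestamps are assumptions, not actual incident data.

```python
"""Sketch: derive MTTD/MTTR (minutes) from exported incident records.
Field names and sample data are illustrative assumptions."""
from datetime import datetime
from statistics import mean

FMT = "%Y-%m-%dT%H:%M:%S"


def _minutes(start: str, end: str) -> float:
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60


def mttd_mttr(incidents: list[dict]) -> tuple[float, float]:
    """MTTD = first anomalous signal -> alert; MTTR = alert -> resolution."""
    mttd = mean(_minutes(i["first_signal_at"], i["alerted_at"]) for i in incidents)
    mttr = mean(_minutes(i["alerted_at"], i["resolved_at"]) for i in incidents)
    return round(mttd, 1), round(mttr, 1)


# Example: two incidents from a hypothetical monthly export.
sample = [
    {"first_signal_at": "2025-11-03T01:10:00", "alerted_at": "2025-11-03T01:14:00",
     "resolved_at": "2025-11-03T01:52:00"},
    {"first_signal_at": "2025-11-17T23:40:00", "alerted_at": "2025-11-17T23:43:00",
     "resolved_at": "2025-11-18T00:19:00"},
]
print(mttd_mttr(sample))  # (3.5, 37.0)
```

Comparing these averages across reporting periods is what produces the delta figures shown in the table.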
Looking forward
The enterprise will continue strengthening its end‑to‑end observability platform by expanding capabilities across each layer of the architecture. Within both data centers, the footprint of OpenTelemetry Collector Gateways and Cluster Receivers will grow to support higher‑volume batch workloads and provide richer instrumentation for long‑running, compute‑intensive processes. All telemetry (logs, metrics, traces, and events) will continue to shift away from disparate APM tools such as Dynatrace, Datadog, and Splunk toward a single OpenTelemetry‑native pipeline backed by Prometheus, Loki/Elasticsearch, Jaeger, and a home‑built analytical data mart for long‑term storage and financial data monetization.
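From the application side, that OpenTelemetry‑native pipeline amounts to pushing all signals over OTLP to a Collector gateway, which is assumed to fan traces out to Jaeger, metrics to Prometheus, and logs to Loki/Elasticsearch. The sketch below shows the metrics wiring only; the gateway endpoint, resource attributes, and instrument names are illustrative assumptions.

```python
"""Sketch: app-side metrics wiring for an OpenTelemetry-native pipeline.
Endpoint, resource attributes, and instrument names are assumptions."""
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Push metrics every 15 s to the Collector gateway over OTLP/gRPC.
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://otel-gateway:4317", insecure=True),
    export_interval_millis=15_000,
)
metrics.set_meter_provider(MeterProvider(
    resource=Resource.create({
        "service.name": "batch-orchestrator",       # hypothetical service
        "deployment.environment": "dc1",             # hypothetical data center label
    }),
    metric_readers=[reader],
))
meter = metrics.get_meter(__name__)

# Long-running batch jobs report progress and duration instead of staying dark.
rows_processed = meter.create_counter("batch.rows.processed", unit="1")
step_duration = meter.create_histogram("batch.step.duration", unit="s")

rows_processed.add(50_000, {"job": "eod-settlement", "step": "enrich"})
step_duration.record(742.3, {"job": "eod-settlement", "step": "enrich"})
```

Because the application only speaks OTLP, swapping or adding backends behind the gateway (Prometheus, the data mart, or anything else) requires no code changes, which is the portability benefit noted in the Before/After table.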
As the architecture matures, the initiative will deliver a fully unified, proactive, AI‑driven, and self‑correcting observability ecosystem, transforming complex hybrid batch operations into a predictable, resilient, cost‑efficient, and insight‑rich environment for the entire enterprise.