Observability and Governance at Scale in Financial Services with Prometheus
Introduction
A leading U.S. financial services company offering life, disability, and long-term care insurance, annuities, and wealth management operates under stringent security and compliance mandates. Its engineering ecosystem spans over 1,000 GitLab projects that use Terraform, AWS CDK, Kubernetes, cron jobs, and Control-M workloads. Infosys was engaged to build a unified observability and governance platform using Prometheus and Grafana, consolidating fragmented tooling into a single pane of glass for metrics, compliance, and automated incident response.
The Challenge: Fragmented tooling and compliance risk at scale
The client’s platform engineering team managed a sprawling cloud-native estate with over 1,000 GitLab projects, hundreds of Kubernetes workloads, Kafka streams, Databricks and Spark jobs, and Aurora databases, all under heavy regulatory scrutiny. Reliability, security, and compliance had become increasingly complex to maintain.
Logs and metrics were scattered across Kafka, GitLab, Splunk, Aurora, and container registries, with no single view of system health or compliance posture. CI/CD compliance checks were inconsistently enforced, and vulnerabilities in containers and packages took nearly two weeks to remediate. Manual API key rotations, hand-driven compliance checks, and the absence of automated task routing created persistent audit risks.
Non-compliant deployments were slipping through, vulnerability remediation lagged behind SLAs, and governance teams relied on manual processes that didn’t scale. The business needed a unified, automated approach, and it had to be built on open-source, cloud-native foundations to avoid vendor lock-in.
The Solution: Prometheus as a unified governance platform
The Infosys team designed and implemented a unified observability and governance platform combining Prometheus for scalable metrics collection and alerting with Grafana as a centralized command center for visualization, policy enforcement, and automated actions.
The team executed a phased implementation:
- Data integration: Connected telemetry from Kafka, Databricks, Kubernetes, Aurora, GitLab, Splunk, and container registries. Prometheus scrapes metrics from GitLab, Kubernetes, and Spark workloads. AWS CloudWatch feeds infrastructure-level metrics (CPU, memory, network, Aurora DB performance). GitLab pipeline logs flow through the Grafana Agent into Loki for centralized log querying.
- Dashboards and analytics: Built Grafana dashboards for CI/CD compliance, vulnerability tracking, infrastructure health, Aurora cost optimization, and Kubernetes resource utilization, all powered by Prometheus metrics and by Redshift-based trend analytics over Splunk and container registry data.
- Alerting and automation: Configured Prometheus alert rules and Grafana alerting to trigger webhooks for automated task creation. Alerts flow to an Action Items Service that auto-creates Jira and ServiceNow tickets, with status synced back to Grafana dashboards. Notifications also reach teams via Slack and email.
- Security and access control: Implemented SSO, RBAC, and audit logging to ensure secure, role-based access to dashboards and governance workflows.
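The alert-to-ticket flow described in the alerting and automation phase can be illustrated with a minimal sketch. It assumes the Action Items Service receives an Alertmanager-style webhook JSON payload (an `alerts` array whose entries carry `status`, `labels`, `annotations`, and `fingerprint`). The `ActionItem` type, the `ROUTES` table, and the `severity` and `team` label names are hypothetical and are not taken from the client's implementation.

```python
from dataclasses import dataclass

# Hypothetical routing table: which ticketing system handles which severity.
ROUTES = {"critical": "servicenow", "warning": "jira"}


@dataclass(frozen=True)
class ActionItem:
    key: str          # alert fingerprint, used to deduplicate repeat notifications
    title: str
    destination: str  # "jira" or "servicenow" (assumed targets)
    team: str


def action_items_from_webhook(payload: dict) -> list[ActionItem]:
    """Turn the firing alerts in a webhook payload into action items."""
    items = []
    seen = set()
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # a resolved alert would update an existing ticket instead
        key = alert.get("fingerprint", "")
        if key in seen:
            continue  # skip duplicates within the same notification
        seen.add(key)
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        items.append(
            ActionItem(
                key=key,
                title=annotations.get("summary", labels.get("alertname", "unnamed alert")),
                destination=ROUTES.get(labels.get("severity", ""), "jira"),
                team=labels.get("team", "unassigned"),
            )
        )
    return items


# Made-up payload for illustration only.
EXAMPLE_PAYLOAD = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "fingerprint": "a1b2c3",
            "labels": {"alertname": "PipelineNonCompliant", "severity": "critical", "team": "payments"},
            "annotations": {"summary": "Pipeline skipped a required compliance check"},
        },
        {
            "status": "resolved",
            "fingerprint": "d4e5f6",
            "labels": {"alertname": "ImageCriticalCVE", "severity": "warning"},
            "annotations": {},
        },
    ],
}

for item in action_items_from_webhook(EXAMPLE_PAYLOAD):
    print(item)
```

Keying each item on the alert fingerprint is one way to keep repeated notifications from opening duplicate tickets; a real service would also persist the key so that resolved alerts can close or update the matching ticket.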
Solution architecture
The architecture integrates data sources, ETL and analytics, and visualization/alerting into a unified observability and governance platform.
Data sources
- GitLab Pipelines: CI/CD logs and metrics for build and deployment compliance
- AWS Services: EC2, RDS (Aurora), and other resources monitored via CloudWatch
- Databricks & Spark: Data processing jobs and Spark metrics for performance insights
- Kafka Streams: Event-driven data pipelines for real-time observability
- Kubernetes Workloads: Container and cluster state metrics for application health
ETL and analytics
- Centralized Data Warehouse: GitLab logs and Splunk data processed through ETL pipelines into Amazon Redshift for compliance and trend analytics
- Metrics Collection: Prometheus scrapes from GitLab, Kubernetes, and Spark; AWS CloudWatch provides infrastructure metrics
- Data Processing: Databricks jobs transform raw logs and metrics for advanced analytics; Spark metrics integrated for workload optimization
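As an illustration of how pipeline compliance could become something Prometheus can scrape, the sketch below renders per-project results in the Prometheus text exposition format. The metric name `gitlab_pipeline_compliant`, the `project` label, and the input shape are assumptions made for this example; the source does not say which exporters or metric names the team used, and in practice a Prometheus client library would normally generate this output.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_compliance_metrics(results: dict[str, bool]) -> str:
    """Render one gauge sample per project: 1 if compliant, 0 if not."""
    lines = [
        "# HELP gitlab_pipeline_compliant 1 if the latest pipeline passed all compliance checks.",
        "# TYPE gitlab_pipeline_compliant gauge",
    ]
    for project in sorted(results):  # sorted for stable output
        value = 1 if results[project] else 0
        label = escape_label_value(project)
        lines.append(f'gitlab_pipeline_compliant{{project="{label}"}} {value}')
    return "\n".join(lines) + "\n"


# Made-up project names for illustration only.
print(render_compliance_metrics({"payments/api": True, "claims/etl": False}))
```

With a gauge shaped like this, a PromQL expression such as `avg(gitlab_pipeline_compliant)` would yield the fraction of compliant projects, which is the kind of figure a compliance dashboard panel could plot.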
Visualization and notifications
- Grafana Dashboards: Unified view of CI/CD compliance, infrastructure health, and vulnerability trends with custom panels for Aurora cost optimization and Kubernetes resource utilization
- Alerting & Incident Response: Prometheus alert rules and Grafana alerts for SLA breaches, compliance failures, and vulnerabilities; notifications via Slack, email, Jira, and ServiceNow
Technology stack
- Grafana (Dashboards, Alerting, RBAC, SSO)
- Prometheus / Grafana Agent (Metrics Collection and Alerts)
- Loki (Log Backend)
- GitLab (CI/CD Pipelines)
- Splunk → Redshift (Log ETL & Analytics)
- AWS Aurora + CloudWatch Exporter (DB Metrics)
- Kubernetes (Workload Metrics)
- Container Registry (Image Scan CVEs)
- Action Items Service + DB (Task Creation, SLA Tracking)
- Jira / ServiceNow / Email / Chat (Notifications)
The Impact: From fragmented tooling to automated governance
The results were immediate and measurable:
- 95% pipeline compliance achieved, reducing non-compliant deployments by over 60%. Prometheus alert rules catch compliance violations in real time, and Grafana alerts auto-create remediation tickets before code reaches production.
- Median vulnerability fix time reduced to 3 days, down from nearly 2 weeks. Container image CVEs and package vulnerabilities are surfaced in Grafana dashboards with automated ticket creation and SLA tracking.
- ~18% reduction in Aurora database costs through performance and usage optimization dashboards that correlate Prometheus metrics with CloudWatch data, enabling right-sizing and query tuning.
- 1,000+ GitLab projects monitored, cutting observability blind spots by 75%. Engineers now have a single Grafana dashboard to answer compliance, performance, and security questions across the entire estate.
- 50%+ reduction in manual governance work. Automated alerting, ticketing, and task assignment replaced hand-driven compliance checks and manual API key rotation tracking.
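To make the remediation figures above concrete, here is a minimal sketch of how a median fix time and an SLA breach count could be derived from ticket timestamps. The ticket data is invented for illustration, and the 7-day `SLA` is an assumption; the source does not state the actual SLA or how the Action Items Service stores its records.

```python
from datetime import datetime, timedelta
from statistics import median

SLA = timedelta(days=7)  # assumed remediation SLA; not stated in the source


def median_fix_days(tickets) -> float:
    """Median open-to-close time, in days, over closed tickets only."""
    seconds = [
        (closed - opened).total_seconds()
        for opened, closed in tickets
        if closed is not None
    ]
    # statistics.median raises StatisticsError if no ticket has been closed yet.
    return median(seconds) / 86400


def sla_breaches(tickets, now: datetime) -> int:
    """Count tickets whose fix time (or current age, if still open) exceeds the SLA."""
    return sum(1 for opened, closed in tickets if (closed or now) - opened > SLA)


# Made-up (opened, closed) pairs; None means the ticket is still open.
EXAMPLE_TICKETS = [
    (datetime(2024, 1, 1), datetime(2024, 1, 3)),
    (datetime(2024, 1, 2), datetime(2024, 1, 5)),
    (datetime(2024, 1, 4), datetime(2024, 1, 14)),
    (datetime(2024, 1, 10), None),
]

print(median_fix_days(EXAMPLE_TICKETS))
print(sla_breaches(EXAMPLE_TICKETS, now=datetime(2024, 1, 20)))
```

A median is used rather than a mean so that a few long-running tickets do not dominate the figure, which matches how the fix-time result is reported.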
A unified platform for observability and governance
By consolidating metrics, logs, and compliance insights into a single platform built on Prometheus and Grafana, the organization replaced fragmented monitoring tools and manual governance processes with centralized visibility and automated workflows. Engineering and governance teams can now monitor over 1,000 GitLab projects through unified Grafana dashboards, while Prometheus-driven alerts automatically trigger remediation workflows through Jira, ServiceNow, and Slack.
This approach significantly improved compliance enforcement, reduced vulnerability remediation time, and optimized infrastructure costs while decreasing manual governance effort. Built on open-source, cloud-native technologies running on Kubernetes and AWS, the platform also provides a scalable foundation for expanding observability and governance as the organization’s cloud environment continues to grow.