Modern software delivery is no longer constrained by application code — it is constrained by the platform that runs it. This article presents the design of a cloud-native Internal Developer Platform (IDP) built on Kubernetes and CNCF ecosystem tools, demonstrating how Infrastructure as Code (IaC), GitOps, and security-first pipelines can be combined into a cohesive, operationally consistent platform. While some implementations use managed AKS, the architectural patterns apply equally to any CNCF-conformant Kubernetes distribution.
Modern distributed systems commonly face the following operational challenges, which motivated this platform design:
- Deployment inconsistencies across environments caused by manual processes
- Lack of infrastructure versioning and drift control, leading to environment divergence
- Hardcoded secrets and weak security posture embedded in CI/CD pipelines
- Inefficient scaling strategies that generate unnecessary cost overhead
- Limited disaster recovery and rollback mechanisms when deployments fail
- Fragmented observability, making root cause analysis slow and unreliable
The architecture described here directly addresses each of these gaps through declarative, automated, and policy-driven controls.
Design principles
The platform follows CNCF-aligned principles that guided every architectural decision:
- Declarative infrastructure — all resources are version-controlled and reproducible
- GitOps-based deployment using Argo CD, with Git as the single source of truth for cluster state
- Immutable infrastructure and containerised workloads — no manual changes to running systems
- Security-by-design across design-time threat modeling, CI/CD, and runtime
- Observability as a core platform capability, not an optional post-deployment add-on
- Separation of concerns across infrastructure, platform, and application layers through modular design.
High-level architecture
The platform is structured into three logical layers with clear separation of responsibilities. In our experience, collapsing these layers early introduced significant maintenance complexity, so the separation is also reflected in the repository structure, with distinct source trees for infrastructure, platform, and applications. The Infrastructure Layer bootstraps the Argo CD GitOps controller. Once initialized, Argo CD continuously monitors and reconciles both Platform Layer components and Application Layer resources to match the desired state defined in Git.

Figure 1: End-to-End Cloud-Native Platform Architecture
1. Infrastructure layer
Responsible for provisioning all cloud resources using Terraform, structured into reusable modules:
- Virtual Networks (VNet), subnets, and Network Security Groups
- Managed Kubernetes Cluster
- Container Registry
- Identity, access configurations and Secret Stores
2. Platform layer
Built on Kubernetes and CNCF ecosystem tools, installed and managed declaratively from a separate repository or from separate directories:
- Argo CD — GitOps controller for continuous reconciliation
- Istio — service mesh for traffic control, mTLS, and service-level observability
- Prometheus — metrics collection and alerting
- Grafana — dashboards and visualization
- Loki — centralised log aggregation
- Kyverno — Policy as Code enforcement at admission time
3. Application layer
Microservices deployed as containerised workloads, independently managed through Git:
- Independently deployable services with no shared deployment schedules
- Helm-based packaging for consistent environment promotion
- Git-driven deployment lifecycle with full audit trail
End-to-end deployment workflow
The platform implements a multi-stage delivery workflow that enforces strict separation between application build, security validation, and infrastructure provisioning. This section follows a change through that workflow, from static code analysis through build to deployment.

Stage 1: Platform prerequisites
The workflow begins with a minimal set of foundational components required to run the automation and pipelines.
- A container image registry for storing versioned and signed artifacts
- A Terraform remote backend for state management and team collaboration
- A secure cloud provider service connection for pipeline execution
Stage 2: Application pipeline
The application pipeline triggers on every commit to application repositories (Java or Angular services). Its responsibility is to produce a secure, validated, and deployable container image. Each change flows through the following stages:
- Source code build and compilation
- Unit and integration testing
- Static Application Security Testing (SAST), for example static code analysis
- Dependency vulnerability scanning using Trivy
- Container image creation
- Image signing using Cosign to ensure integrity and provenance
- Publishing the signed image to the container registry
Only verified, versioned, and tamper-evident artifacts are introduced into the platform. The pipeline configuration below shows the Cosign signing steps used in the pipeline.
Cosign image signing and verification
# Step 1: Build the container image
# (cosign signs images in a registry, so the image must be pushed before Step 3)
- task: Docker@2
  displayName: 'Build Container Image'
  inputs:
    command: build
    repository: $(ACR_NAME).azurecr.io/$(IMAGE_NAME)
    tags: $(Build.BuildId)

# Step 2: Fetch OIDC token via Workload Identity Federation
- task: AzureCLI@2
  displayName: 'Fetch OIDC Token'
  inputs:
    azureSubscription: '$(SERVICE_CONNECTION)'
    scriptType: bash
    scriptLocation: inlineScript
    addSpnToEnvironment: true  # exposes the federated token to the script as $idToken
    inlineScript: |
      echo "##vso[task.setvariable variable=AZURE_FEDERATED_TOKEN;issecret=true]$idToken"

# Step 3: Sign image with Cosign (keyless via Azure Pipelines OIDC)
- script: |
    cosign sign \
      --yes \
      --identity-token="$AZURE_FEDERATED_TOKEN" \
      $(ACR_NAME).azurecr.io/$(IMAGE_NAME):$(Build.BuildId)
  displayName: 'Sign Image with Cosign'
  env:
    # Secret variables are not exported automatically and must be mapped explicitly
    AZURE_FEDERATED_TOKEN: $(AZURE_FEDERATED_TOKEN)
Stage 3: Security validation pipeline
Before any deployment or infrastructure change is executed, a dedicated security validation pipeline enforces an additional trust boundary. This pipeline validates both artifacts and deployment configurations:
- Verification of container image signatures using Cosign
- Image vulnerability scanning using Trivy against a defined severity threshold
- Kubernetes manifest validation using KubeSec to detect insecure configuration patterns
Only workloads that pass all three checks are considered compliant and eligible for deployment.
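The three checks above can be sketched as Azure Pipelines steps. This is an illustrative sketch under stated assumptions, not the pipeline configuration used for the platform: the variables `SIGNER_IDENTITY` and `OIDC_ISSUER`, the manifest path, and the `HIGH,CRITICAL` threshold are all hypothetical, and the image reference assumes the tag produced by the application pipeline is available to this pipeline.

```yaml
# Hypothetical security validation steps; variable names, paths, and thresholds are assumptions.
steps:
  # 1. Verify the keyless signature. cosign 2.x requires the expected
  #    certificate identity and OIDC issuer to be stated explicitly.
  - script: |
      cosign verify \
        --certificate-identity="$(SIGNER_IDENTITY)" \
        --certificate-oidc-issuer="$(OIDC_ISSUER)" \
        $(ACR_NAME).azurecr.io/$(IMAGE_NAME):$(Build.BuildId)
    displayName: 'Verify Image Signature'

  # 2. Fail the job when HIGH or CRITICAL vulnerabilities are found.
  - script: |
      trivy image \
        --severity HIGH,CRITICAL \
        --exit-code 1 \
        $(ACR_NAME).azurecr.io/$(IMAGE_NAME):$(Build.BuildId)
    displayName: 'Scan Image with Trivy'

  # 3. Score the rendered Kubernetes manifest with KubeSec.
  - script: |
      kubesec scan manifests/deployment.yaml
    displayName: 'Validate Manifest with KubeSec'
```

Because each step is a separate script, a non-zero exit code from any tool stops the job before later stages run. How strictly the KubeSec score gates the build is a policy choice that this sketch leaves open.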
Stage 4: Infrastructure provisioning pipeline
Once security validation succeeds, the infrastructure provisioning pipeline runs. This stage establishes the Kubernetes foundation:
- Provisioning of virtual networking (VNets, subnets, routing)
- Deployment of a managed Kubernetes cluster with auto-scaling node pools
- Installation of Argo CD as the GitOps controller (one of the platform components)
- Bootstrapping of Argo CD Application resources
- Connecting infrastructure Git repositories to Argo CD
The Terraform cluster module below reflects the configuration used, including Key Vault integration via the CSI driver and Calico network policy enforcement:
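Terraform Kubernetes cluster module (modules/aks/main.tf)
resource "azurerm_kubernetes_cluster" "main" {
  name                = var.cluster_name
  resource_group_name = var.resource_group_name
  location            = var.location   # required by the provider; variable name assumed
  dns_prefix          = var.dns_prefix # variable name assumed

  # System node pool with cluster autoscaling between 2 and 10 nodes
  default_node_pool {
    name                 = "system"
    vm_size              = var.node_vm_size # variable name assumed
    auto_scaling_enabled = true
    min_count            = 2
    max_count            = 10
  }

  identity {
    type = "SystemAssigned"
  }

  # Azure CNI for pod networking, Calico for network policy enforcement
  network_profile {
    network_plugin = "azure"
    network_policy = "calico"
  }

  # Enables the Key Vault provider for the Secrets Store CSI driver
  key_vault_secrets_provider {
    secret_rotation_enabled = true
  }
}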
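The bootstrapping step in this stage is often implemented with the Argo CD app-of-apps pattern, in which a single root Application points at a directory of child Application manifests. The manifest below is a minimal sketch of such a root Application; the name `platform-root` and the `bootstrap` path are hypothetical, and the repository URL reuses the placeholder style used elsewhere in this article.

```yaml
# Hypothetical root Application for an app-of-apps bootstrap; names and paths are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/gitops-repo
    targetRevision: main
    path: bootstrap            # directory containing the child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd          # child Applications are created in the Argo CD namespace
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

With this in place, the provisioning pipeline only needs to apply one manifest after installing Argo CD. Every other platform component and application is then discovered from Git.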
Stage 5: GitOps deployment model
After infrastructure provisioning, the platform follows a GitOps model where Git is the single source of truth. Argo CD continuously reconciles platform and application layer components by monitoring the Kubernetes manifests and Helm charts stored in Git. Changes pushed to Git are automatically applied to the live clusters, keeping cluster state in sync with the repository. This enables:
- Automated reconciliation without manual kubectl use
- Full auditability via Git history and sync status
- Easy rollbacks using standard Git workflows
The Argo CD Application manifest below shows how a microservice is configured for automated sync with self-healing and pruning enabled:
Argo CD Application: Automated GitOps Sync
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: microservice-api
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: internal-developer-platform
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/gitops-repo
    targetRevision: main
    path: apps/microservice-api/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # Remove resources deleted from Git
      selfHeal: true   # Revert manual changes to cluster
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
Stage 6: Runtime request flow
Once the infrastructure and application workloads are deployed, external users send requests to the cloud load balancer, which forwards traffic to the API Gateway or Ingress layer. The gateway matches each request's host and URL path to the appropriate Kubernetes Service, which then distributes requests across healthy application Pods for processing and response delivery.
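The host and path mapping described above can be illustrated with a standard Kubernetes Ingress resource. This is a sketch only: the host name, path, port, and ingress class are hypothetical, and a platform that uses an Istio gateway for ingress would express the same routing with Istio resources instead.

```yaml
# Hypothetical Ingress; host, path, port, and ingress class are assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: microservice-api
  namespace: production
spec:
  ingressClassName: nginx          # depends on the ingress controller installed
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /orders
            pathType: Prefix       # matches /orders and everything beneath it
            backend:
              service:
                name: microservice-api
                port:
                  number: 8080
```

The Service named in the backend selects the application Pods by label, and only Pods that pass their readiness checks receive traffic.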
Security architecture
Security is treated as a cross-cutting concern integrated throughout the entire platform lifecycle — not a layer applied after deployment. The approach spans supply chain integrity, policy enforcement, runtime protection, and secret management.

1. Supply chain security
Security begins at the artifact level by ensuring only trusted and verified components enter the system:
- Trivy scans container images and dependencies for known vulnerabilities
- KubeSec validates Kubernetes manifests to identify insecure config early in the lifecycle
- Cosign provides cryptographic signing and verification of container images, ensuring integrity and provenance through keyless signing via OIDC
Together, these controls ensure only scanned, validated, and signed artifacts are eligible for deployment.
2. Policy enforcement with Kyverno
At the cluster level, Kyverno enforces policies at admission time, rejecting non-compliant workloads before they are created. The policy below enforces one of our baseline standards: every Pod image must carry an explicit tag, and the latest tag is disallowed.
Kyverno ClusterPolicy — Disallow Latest Tag
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
  annotations:
    policies.kyverno.io/title: Disallow Latest Tag
    policies.kyverno.io/description: >-
      Require image tags to be pinned to a specific version.
      The 'latest' tag is mutable and can lead to unpredictable deployments.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    # Rule 1: every container image must include an explicit tag
    - name: require-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: >-
          An image tag is required. Specify a versioned tag.
        pattern:
          spec:
            containers:
              - image: "*:*"
    # Rule 2: the tag must not be 'latest'
    - name: disallow-latest-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Image tag 'latest' is not permitted."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
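To make the effect of the policy concrete, the two Pod manifests below show one workload that the policy would reject and one that it would admit. The Pod names and the image are purely illustrative.

```yaml
# Illustrative Pods only; names and images are hypothetical.
# Rejected: the image uses the mutable 'latest' tag.
apiVersion: v1
kind: Pod
metadata:
  name: rejected-example
spec:
  containers:
    - name: web
      image: nginx:latest
---
# Admitted: the image is pinned to a specific version tag.
apiVersion: v1
kind: Pod
metadata:
  name: admitted-example
spec:
  containers:
    - name: web
      image: nginx:1.27.0
```

An image reference with no tag at all, such as `nginx`, would also be rejected, because it does not match the `*:*` pattern in the first rule.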
3. Runtime Security
Pre-deployment controls are necessary but not sufficient. Runtime security mechanisms monitor system behaviour and detect anomalies during workload execution:
- Falco provides real-time detection of suspicious activity within containers and the host environment, with alerts integrated into the observability stack
- AppArmor enforces kernel-level security profiles to restrict container capabilities and reduce the attack surface
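As an illustration of the kind of runtime detection Falco performs, the custom rule below flags an interactive shell started inside a container. It is a sketch rather than a rule taken from the platform: it assumes the default Falco ruleset is loaded, because the `spawned_process` and `container` macros are defined there, and the rule name and priority are hypothetical choices.

```yaml
# Hypothetical custom Falco rule; relies on macros from the default Falco ruleset.
- rule: Shell Spawned in Container
  desc: Detect a shell process starting inside any container
  condition: spawned_process and container and proc.name in (bash, sh)
  output: >
    Shell spawned in container
    (user=%user.name container=%container.name command=%proc.cmdline)
  priority: WARNING
  tags: [container, shell]
```

Alerts produced by rules like this can be forwarded to the same alerting channels as the rest of the observability stack, which keeps runtime security events next to metrics and logs during an investigation.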
4. Secrets management
Sensitive data is managed outside application and deployment artifacts to eliminate exposure risk:
- Key Vault, integrated via the CSI Secrets Store driver, provides secure and dynamic secret injection into workloads at pod startup
- Secrets are never stored in Git repositories or embedded within container images
- Secret rotation is handled centrally in Key Vault and picked up automatically by running workloads
This approach ensures secret management remains centralised, auditable, and secure by design.
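The Key Vault integration described above is typically wired up with a SecretProviderClass and a CSI volume on the Pod. The manifests below are a minimal sketch: the vault name, identity, tenant, secret name, image, and mount path are all placeholders, and the identity mode shown is only one of the options supported by the Azure provider.

```yaml
# Hypothetical SecretProviderClass and Pod; all names and IDs are placeholders.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-secrets
  namespace: production
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "true"
    userAssignedIdentityID: "<managed-identity-client-id>"
    keyvaultName: "<key-vault-name>"
    tenantId: "<tenant-id>"
    objects: |
      array:
        - |
          objectName: db-password
          objectType: secret
---
apiVersion: v1
kind: Pod
metadata:
  name: microservice-api
  namespace: production
spec:
  containers:
    - name: app
      image: myregistry.azurecr.io/microservice-api:1.4.2
      volumeMounts:
        - name: secrets-store
          mountPath: /mnt/secrets      # secret appears as /mnt/secrets/db-password
          readOnly: true
  volumes:
    - name: secrets-store
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: app-secrets
```

Mounting secrets as files means the application reads the current value from disk, so a rotated secret can be picked up without rebuilding the image or changing anything in Git.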
5. Networking and traffic management
The networking layer combines Kubernetes-native primitives with Istio’s service mesh capabilities to provide secure, observable, and policy-driven traffic management:
- Kubernetes Services expose workloads internally with stable DNS-based discovery
- Azure Load Balancer provides external ingress with DDoS protection at the network perimeter
- Istio manages traffic routing, mTLS encryption between services, and service-level observability
- Calico enforces Kubernetes network policies, restricting lateral movement between namespaces
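Restricting lateral movement between namespaces can be expressed with a standard Kubernetes NetworkPolicy, which Calico then enforces. The policy below is a minimal sketch with a hypothetical name and namespace: it allows ingress to Pods in `production` only from other Pods in the same namespace.

```yaml
# Hypothetical NetworkPolicy; name and namespace are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only
  namespace: production
spec:
  podSelector: {}            # applies to every Pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}    # no namespaceSelector, so only Pods in this namespace match
```

A policy this strict also blocks traffic from ingress gateways and monitoring agents that run in other namespaces, so in practice it needs additional rules that allow those specific sources.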
A key lesson from rolling out Istio mTLS was that enabling Strict mode cluster-wide too early caused connectivity issues, because not all workloads had sidecars injected. Istio supports Permissive mode (accepts both plaintext and mTLS) and Strict mode (accepts only mTLS). The fix was to start in Permissive mode and then apply PeerAuthentication in Strict mode one namespace at a time, only after confirming full sidecar injection in that namespace.
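The staged rollout described above can be sketched with two PeerAuthentication resources. This sketch assumes `istio-system` is the mesh root namespace, which is the Istio default, and uses `production` as an example of a namespace where sidecar injection has already been confirmed.

```yaml
# Mesh-wide default: accept both plaintext and mTLS while sidecars are rolled out.
# Assumes istio-system is the root namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: PERMISSIVE
---
# Namespace-level override: enforce mTLS once every workload in the namespace has a sidecar.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT
```

Because the namespace-scoped policy takes precedence over the mesh-wide one, each namespace can be moved to Strict mode independently and reverted by deleting a single resource.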
Observability stack
Observability is implemented as a unified system of three complementary tools, with metrics and logs feeding into a shared Grafana interface:
| Tool | Signal Type | Primary Use |
| --- | --- | --- |
| Prometheus | Metrics | Resource utilisation, SLO tracking, alerting |
| Grafana | Visualisation | Dashboards, SLA reporting, incident response |
| Loki | Logs | Centralised log aggregation, correlation with traces |
We adopted Prometheus, Grafana, and Loki to align with a Kubernetes-native observability model. Prometheus handles metrics, Loki handles log aggregation using lightweight label-based indexing, and Grafana provides a unified visualization layer. This reduces operational cost and complexity compared to maintaining a separate Elasticsearch and Kibana stack.
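As an example of the alerting role Prometheus plays here, the rule file below raises an alert when the share of failing requests in the mesh stays high. It is a sketch under assumptions: it relies on the `istio_requests_total` metric that Istio sidecars export, and the group name, 5% threshold, and 10-minute window are hypothetical values rather than the platform's actual SLOs.

```yaml
# Hypothetical Prometheus alerting rule; threshold, window, and names are assumptions.
groups:
  - name: platform-slo
    rules:
      - alert: HighRequestErrorRate
        # Ratio of 5xx responses to all requests over the last 5 minutes
        expr: |
          sum(rate(istio_requests_total{response_code=~"5.."}[5m]))
            /
          sum(rate(istio_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of mesh requests are failing with 5xx responses"
```

The `for` clause keeps short error spikes from paging anyone. The alert fires only when the condition holds continuously for the whole window.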
Infrastructure as code strategy
Terraform is structured into modular components that reflect the modular platform layers, enabling independent versioning and testing of each:
| Module | Responsibility |
| --- | --- |
| network | VNet, subnets, NSGs, peering configurations |
| aks | Managed Kubernetes cluster, node pools, RBAC, Key Vault integration |
| security | Policies, Defender for Containers, audit logging |
| platform-services | Argo CD, Istio, Prometheus, Grafana, Loki, Kyverno |
Environment separation is handled using per-environment variable files:
- dev.tfvars — reduced node counts, relaxed policies, faster iteration
- staging.tfvars — production-equivalent topology with synthetic load testing
- prod.tfvars — full node pools, strict policies, backup schedules enabled
This structure ensures reusability, consistency across environments, and controlled environment-specific customisation without duplicating module code.
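One way to select the per-environment variable file is a pipeline parameter that is substituted into the Terraform command. The Azure Pipelines fragment below is a sketch of that idea: the parameter name is hypothetical, and it assumes the `.tfvars` files sit in the pipeline's working directory and that the remote backend is already configured for the chosen environment.

```yaml
# Hypothetical pipeline fragment; parameter name and file locations are assumptions.
parameters:
  - name: environment
    type: string
    default: dev
    values:
      - dev
      - staging
      - prod

steps:
  - script: |
      terraform init -input=false
      terraform plan -input=false \
        -var-file="${{ parameters.environment }}.tfvars" \
        -out=tfplan
    displayName: 'Terraform Plan (${{ parameters.environment }})'
```

Restricting the parameter to a fixed list of values prevents a typo from silently planning against a missing or unintended variable file.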
Key outcomes
The following outcomes were observed in our internal lab and staging environments after full platform adoption.
| Metric | Observed Change |
| --- | --- |
| Deployment reliability | Improved to ~95% success rate (from ~70% with manual processes) |
| Infrastructure provisioning time | Reduced from hours/days to under 15 minutes via Terraform automation |
| Deployment frequency | Increased from weekly to multiple releases per day |
| Configuration drift incidents | Reduced to near-zero through GitOps continuous reconciliation |
| Pre-production vulnerability detection | 80% of findings caught before reaching staging |
| Manual kubectl operations | Reduced to near-zero for routine deployments |
Challenges and lessons learned
Navigating the CNCF ecosystem showed the risk of adopting too many overlapping tools early. The key lesson was to let architecture drive tooling decisions and to defer additions like OpenTelemetry until the platform stabilized.
Maintaining clear separation between infrastructure, platform, and application layers was essential for long-term maintainability. Early coupling of platform tools such as Argo CD and Istio with application code increased complexity and was later corrected by moving them into separate repositories or directories.
GitOps improved consistency and traceability but introduced synchronization issues during repository restructuring. These were resolved using the Argo CD app-of-apps pattern and application health checks.
Moving security checks earlier in the pipeline, by running Trivy and KubeSec immediately after build, improved feedback speed and reduced late-stage failures.
Conclusion
This architecture shows how Kubernetes and CNCF tools can be combined to build a secure, automated, and scalable platform, where the real value comes from how deployment, security, and observability work together as a system. The core design decisions are to establish clear layer separation early, integrate security from the start, and adopt GitOps with Argo CD from day one. Future improvements focus on multi-cluster management with Argo CD ApplicationSets, stronger policy enforcement using Kyverno, deeper zero-trust networking via Istio, and adding distributed tracing through OpenTelemetry integrated into the observability stack.