Modern software delivery is no longer constrained by application code — it is constrained by the platform that runs it. This article presents the design of a cloud-native Internal Developer Platform (IDP) built on Kubernetes and CNCF ecosystem tools, demonstrating how Infrastructure as Code (IaC), GitOps, and security-first pipelines can be combined into a cohesive, operationally consistent platform. While some implementations use managed AKS, the architectural patterns apply equally to any CNCF-conformant Kubernetes distribution.

Modern distributed systems commonly face the following operational challenges, which motivated this platform design:

- Deployment inconsistencies across environments caused by manual processes
- Lack of infrastructure versioning and drift control, leading to environment divergence
- Hardcoded secrets and weak security posture embedded in CI/CD pipelines
- Inefficient scaling strategies that generate unnecessary cost overhead
- Limited disaster recovery and rollback mechanisms when deployments fail
- Fragmented observability making root cause analysis slow and unreliable

The architecture described here directly addresses each of these gaps through declarative, automated, and policy-driven controls.

Design principles

The platform follows CNCF-aligned principles that guided every architectural decision:

High-level architecture

The platform is structured into three logical layers with clear separation of responsibilities; collapsing these layers early in the project introduced significant maintenance complexity. The same separation is reflected in the repository layout, with distinct source trees for infrastructure, platform, and applications. The Infrastructure Layer bootstraps the Argo CD GitOps controller. Once initialized, Argo CD continuously monitors and reconciles both Platform Components and Application Layer resources to match the desired state defined in Git.

Figure 1: End-to-End Cloud-Native Platform Architecture

1. Infrastructure layer

Responsible for provisioning all cloud resources using Terraform, structured into reusable modules:

2. Platform layer

Built on Kubernetes and CNCF ecosystem tools, installed and managed declaratively in a separate repository or in separate directories:

3. Application layer

Microservices deployed as containerised workloads, independently managed through Git:

End-to-end deployment workflow

The platform implements a multi-stage delivery workflow that enforces strict separation between application build, security validation, and infrastructure provisioning. This section traces a change through that workflow, from static code analysis through build to deployment.

Figure 2: Cluster Architecture with End-to-End Pipeline Flow (Application, Security, and Infrastructure)

Stage 1: Platform prerequisites

The workflow begins with a minimal set of foundational components required to run the automation and pipelines.

Stage 2: Application pipeline

The application pipeline triggers on every commit to application repositories (Java or Angular services). Its responsibility is to produce a secure, validated, and deployable container image. Each change flows through the following stages:

Only verified, versioned, and tamper-evident artifacts are introduced into the platform. The configuration below shows the Cosign signing steps used in the pipeline.

Cosign image signing and verification 

# Step 1: Build the container image
# Note: the image must also be pushed to the registry before Cosign can sign it
- task: Docker@2
  displayName: 'Build Container Image'
  inputs:
    command: build
    repository: $(ACR_NAME).azurecr.io/$(IMAGE_NAME)
    tags: $(Build.BuildId)

# Step 2: Fetch OIDC token via Workload Identity Federation
- task: AzureCLI@2
  displayName: 'Fetch OIDC Token'
  inputs:
    azureSubscription: '$(SERVICE_CONNECTION)'
    scriptType: bash
    scriptLocation: inlineScript
    addSpnToEnvironment: true
    inlineScript: |
      # With addSpnToEnvironment, the federated token is exposed to the script as $idToken
      echo "##vso[task.setvariable variable=AZURE_FEDERATED_TOKEN;issecret=true]$idToken"

# Step 3: Sign image with Cosign (keyless via Azure Pipelines OIDC)
# Keyless signing requires a Fulcio instance that trusts this OIDC issuer
- script: |
    cosign sign \
      --yes \
      --identity-token="$AZURE_FEDERATED_TOKEN" \
      $(ACR_NAME).azurecr.io/$(IMAGE_NAME):$(Build.BuildId)
  displayName: 'Sign Image with Cosign'
  env:
    # Secret variables are not exported automatically, so map it explicitly
    AZURE_FEDERATED_TOKEN: $(AZURE_FEDERATED_TOKEN)
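
The verification half of this flow could look like the step below. This is a minimal sketch, not the pipeline's actual configuration: it assumes Cosign v2 on the build agent, and `SIGNER_IDENTITY` and `OIDC_ISSUER` are hypothetical pipeline variables holding the expected certificate identity and issuer.

```yaml
# Hypothetical verification step (variable names are assumptions)
- script: |
    # cosign verify exits non-zero if no signature matches the expected
    # identity and issuer, which fails this pipeline step
    cosign verify \
      --certificate-identity="$(SIGNER_IDENTITY)" \
      --certificate-oidc-issuer="$(OIDC_ISSUER)" \
      $(ACR_NAME).azurecr.io/$(IMAGE_NAME):$(Build.BuildId)
  displayName: 'Verify Image Signature with Cosign'
```

Pinning both the identity and the issuer matters for keyless signatures: without them, any valid certificate from the same Sigstore instance would be accepted.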

Stage 3: Security validation pipeline

Before any deployment or infrastructure change is executed, a dedicated security validation pipeline enforces an additional trust boundary. This pipeline validates both artifacts and deployment configurations:

Only workloads that pass all three checks are considered compliant and eligible for deployment.
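
As an illustration of what such checks can look like, the sketch below runs an image vulnerability scan and a manifest scan. It assumes Trivy and Kubesec (both named later in this article) are installed on the build agent; `MANIFEST_PATH` is a hypothetical variable pointing at a Kubernetes manifest, and the severity threshold is an example rather than the platform's actual setting.

```yaml
# Artifact check: fail the step if HIGH or CRITICAL vulnerabilities are found
- script: |
    trivy image \
      --exit-code 1 \
      --severity HIGH,CRITICAL \
      $(ACR_NAME).azurecr.io/$(IMAGE_NAME):$(Build.BuildId)
  displayName: 'Scan Image with Trivy'

# Configuration check: Kubesec prints a JSON risk report for the manifest
# (gating on the reported score is left out of this sketch)
- script: |
    kubesec scan $(MANIFEST_PATH)
  displayName: 'Scan Manifest with Kubesec'
```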

Stage 4: Infrastructure provisioning pipeline

Once security validation succeeds, the infrastructure provisioning pipeline is triggered. This stage establishes the Kubernetes foundation:

The Terraform cluster module below reflects the configuration used, including Key Vault integration via the CSI driver and Calico network policy enforcement:

Terraform k8s cluster Module (modules/aks/main.tf)

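resource "azurerm_kubernetes_cluster" "main" {
  name                = var.cluster_name
  location            = var.location   # required by the provider; variable name assumed
  resource_group_name = var.resource_group_name
  dns_prefix          = var.dns_prefix # required by the provider; variable name assumed

  # System node pool with cluster autoscaling between 2 and 10 nodes
  default_node_pool {
    name                 = "system"
    vm_size              = var.node_vm_size # node VM size; variable name assumed
    auto_scaling_enabled = true
    min_count            = 2
    max_count            = 10
  }

  # Managed identity for the cluster control plane
  identity {
    type = "SystemAssigned"
  }

  # Azure CNI with Calico network policy enforcement
  network_profile {
    network_plugin = "azure"
    network_policy = "calico"
  }

  # Key Vault integration via the Secrets Store CSI driver add-on
  key_vault_secrets_provider {
    secret_rotation_enabled = true
  }
}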

Stage 5: GitOps deployment model

After infrastructure provisioning, the platform follows a GitOps model where Git is the single source of truth. Argo CD continuously reconciles platform and application layer components by monitoring the Kubernetes manifests and Helm charts stored in Git. Changes pushed to Git are automatically applied to live clusters, keeping them in sync with the declared state. This enables:

The Argo CD Application CRD below shows how a microservice is configured for automated sync with self-healing and pruning enabled:

Argo CD Application CRD — Automated GitOps Sync

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: microservice-api
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: internal-developer-platform
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/gitops-repo
    targetRevision: main
    path: apps/microservice-api/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true        # Remove resources deleted from Git
      selfHeal: true     # Revert manual changes to cluster
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

Stage 6: Runtime request flow

Once the infrastructure and application workloads are deployed, external users send requests to the cloud load balancer, which forwards traffic to the API Gateway or Ingress layer. The gateway routes each request by host and URL path to the appropriate Kubernetes Service, which then distributes requests across healthy application Pods for processing and response delivery.
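
The routing step can be illustrated with a standard Kubernetes Ingress. This is a sketch under assumptions: the hostname, path, port, and ingress class are hypothetical, and the Service name simply reuses `microservice-api` from the Argo CD example above.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: microservice-api
  namespace: production
spec:
  ingressClassName: nginx          # assumption: depends on the installed controller
  rules:
    - host: api.example.com        # hypothetical hostname
      http:
        paths:
          - path: /api             # requests under /api go to the Service below
            pathType: Prefix
            backend:
              service:
                name: microservice-api
                port:
                  number: 8080     # hypothetical Service port
```

The Service then load-balances across Pods that pass their readiness probes, which is what "healthy" means in practice here.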

Security architecture

Security is treated as a cross-cutting concern integrated throughout the entire platform lifecycle — not a layer applied after deployment. The approach spans supply chain integrity, policy enforcement, runtime protection, and secret management.

Figure 3: Security Controls Across the Delivery Lifecycle

1. Supply chain security

Security begins at the artifact level by ensuring only trusted and verified components enter the system:

Together, these controls ensure only scanned, validated, and signed artifacts are eligible for deployment.
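
One way to enforce the "signed artifacts only" rule at admission time is a Kyverno `verifyImages` policy. The sketch below is an assumption about how this could be wired up, not a policy taken from the platform: the registry pattern, subject, and issuer values are hypothetical placeholders.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
    - name: verify-cosign-keyless
      match:
        any:
        - resources:
            kinds:
              - Pod
      verifyImages:
        - imageReferences:
            - "example.azurecr.io/*"            # hypothetical registry
          attestors:
            - entries:
                - keyless:
                    subject: "https://example.com/pipeline-identity"  # hypothetical signer identity
                    issuer: "https://issuer.example.com"              # hypothetical OIDC issuer
                    rekor:
                      url: https://rekor.sigstore.dev
```

Pods whose images match the pattern but carry no signature from the expected identity are rejected before they are created.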

2. Policy enforcement with Kyverno

At the cluster level, Kyverno enforces policies at admission time, rejecting non-compliant workloads before they are created in the cluster. The policy below enforces one of our baseline standards: disallowing the use of the latest image tag across all pods.

Kyverno ClusterPolicy — Disallow Latest Tag

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
  annotations:
    policies.kyverno.io/title: Disallow Latest Tag
    policies.kyverno.io/description: >-
      Require image tags to be pinned to a specific version.
      The 'latest' tag is mutable and can lead to unpredictable deployments.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: require-image-tag
      match:
        any:
        - resources:
            kinds:
              - Pod
      validate:
        message: >-
          The 'latest' image tag is not allowed. Specify a versioned tag.
        pattern:
          spec:
            containers:
            - image: "*:*"
    - name: disallow-latest-tag
      match:
        any:
        - resources:
            kinds:
              - Pod
      validate:
        message: "Image tag 'latest' is not permitted."
        pattern:
          spec:
            containers:
            - image: "!*:latest"

3. Runtime security

Pre-deployment controls are necessary but not sufficient. Runtime security mechanisms monitor system behaviour and detect anomalies during workload execution:

4. Secrets management

Sensitive data is managed outside application and deployment artifacts to eliminate exposure risk:

This approach ensures secret management remains centralised, auditable, and secure by design.
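
With the Key Vault CSI integration enabled in the Terraform module, workloads typically reference secrets through a `SecretProviderClass`. The following is a minimal sketch; the vault name, identity client ID, tenant ID, and secret name are all hypothetical placeholders.

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-keyvault-secrets
  namespace: production
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "true"
    userAssignedIdentityID: "<identity-client-id>"   # placeholder
    keyvaultName: "example-keyvault"                 # hypothetical vault name
    tenantId: "<tenant-id>"                          # placeholder
    objects: |
      array:
        - |
          objectName: db-password                    # hypothetical secret name
          objectType: secret
```

A Pod mounts this class as a CSI volume, so the secret value is read from Key Vault at mount time and never appears in Git or in the container image.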

5. Networking and traffic management

The networking layer combines Kubernetes-native primitives with Istio’s service mesh capabilities to provide secure, observable, and policy-driven traffic management:

A key lesson from the Istio mTLS rollout was that enabling Strict mode cluster-wide too early caused connectivity issues, because not all workloads had sidecars injected. Istio supports Permissive mode (accepts both plaintext and mTLS traffic) and Strict mode (accepts only mTLS traffic). The fix was to start in Permissive mode and then gradually apply PeerAuthentication in Strict mode per namespace, only after confirming full sidecar injection in that namespace.
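
The staged rollout can be sketched with two `PeerAuthentication` resources. This assumes `istio-system` is the mesh root namespace (the Istio default) and uses `production` as an example of a namespace whose workloads all have sidecars.

```yaml
# Mesh-wide default: accept both plaintext and mTLS during migration
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: PERMISSIVE
---
# Namespace-level override: enforce mTLS once sidecar injection is confirmed
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT
```

Because the namespace-level policy takes precedence over the mesh-wide one, each namespace can be tightened independently while the rest of the mesh stays in Permissive mode.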

Observability stack

Observability is implemented as a unified stack of three complementary tools, with metrics and logs feeding into a shared Grafana interface:

| Tool | Signal Type | Primary Use |
|---|---|---|
| Prometheus | Metrics | Resource utilisation, SLO tracking, alerting |
| Grafana | Visualisation | Dashboards, SLA reporting, incident response |
| Loki | Logs | Centralised log aggregation, correlation with traces |

We adopted Prometheus, Grafana, and Loki to align with a Kubernetes-native observability model. Prometheus handles metrics, Loki handles log aggregation using lightweight label-based indexing, and Grafana provides a unified visualisation layer. This reduces operational cost and complexity compared to maintaining a separate Elasticsearch and Kibana stack.
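
The shared Grafana interface comes down to registering both backends as data sources. A minimal provisioning sketch follows; the in-cluster service URLs and the `monitoring` namespace are assumptions that depend on how the charts are installed.

```yaml
# Grafana data source provisioning file
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-server.monitoring.svc:9090   # hypothetical service URL
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki.monitoring.svc:3100                # hypothetical service URL
```

With both sources in one Grafana instance, a dashboard can place a metrics panel next to the matching log stream for the same time range.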

Infrastructure as code strategy

Terraform is organised into modules that mirror the platform layers, so each module can be versioned and tested independently:

| Module | Responsibility |
|---|---|
| network | VNet, subnets, NSGs, peering configurations |
| Managed k8s Cluster | K8s cluster, node pools, RBAC, Key Vault integration |
| security | Policies, Defender for Containers, audit logging |
| platform-services | Argo CD, Istio, Prometheus, Grafana, Loki, Kyverno |

Environment separation is handled using per-environment variable files:

This structure ensures reusability, consistency across environments, and controlled environment-specific customisation without duplicating module code.

Key outcomes

The following outcomes were observed in our internal lab and staging environments after full platform adoption.

| Metric | Observed Change |
|---|---|
| Deployment reliability | Improved to ~95% success rate (from ~70% with manual processes) |
| Infrastructure provisioning time | Reduced from hours/days to under 15 minutes via Terraform automation |
| Deployment frequency | Increased from weekly to multiple releases per day |
| Configuration drift incidents | Near-zero; eliminated by GitOps continuous reconciliation |
| Pre-production vulnerability detection | 80% of findings caught before reaching staging |
| Manual kubectl operations | Reduced to near-zero for routine deployments |

Challenges and lessons learned

Navigating the CNCF ecosystem showed the risk of adopting too many overlapping tools early. The key lesson was to let architecture drive tooling decisions and to defer additions such as OpenTelemetry until the platform had stabilized.

Maintaining clear separation between infrastructure, platform, and application layers was essential for long-term maintainability. Early coupling of platform tools such as Argo CD and Istio with application code increased complexity and was later corrected by moving each layer into its own repository or directory.

GitOps improved consistency and traceability but introduced synchronization issues during repository restructuring. These were resolved using the Argo CD app-of-apps pattern and application health checks.

Moving security checks earlier in the pipeline, running Trivy and Kubesec immediately after the build, improved feedback speed and reduced late-stage failures.
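
For readers unfamiliar with app-of-apps, the pattern is a parent Argo CD Application whose source directory contains other Application manifests. The sketch below is illustrative only: the `argocd-apps` path is hypothetical, and the repository URL reuses the placeholder from the earlier example.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/gitops-repo
    targetRevision: main
    path: argocd-apps            # hypothetical directory of child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd            # child Applications are created in the Argo CD namespace
  syncPolicy:
    automated:
      prune: true                # removing a child manifest from Git removes the Application
      selfHeal: true
```

During a repository restructure, only the child manifests' `path` fields need to change in Git; the parent picks up the new definitions on its next sync.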

Conclusion

This architecture shows how Kubernetes and CNCF tools can be combined to build a secure, automated, and scalable platform, where the real value comes from how deployment, security, and observability work together as a system. The core design decisions are to establish clear layer separation early, integrate security from the start, and adopt GitOps with Argo CD from day one. Future improvements focus on multi-cluster management with Argo CD ApplicationSets, stronger policy enforcement using Kyverno, deeper zero-trust networking via Istio, and adding distributed tracing through OpenTelemetry integrated into the observability stack.