Stage 1 - Governed AKS Delivery Foundation

Stage 1 runbook / live terminal loop


Business context

A regulated payment workflow creates the problem. The service provides visibility. Stage 1 proves the delivery model.

Exceptions happen

Not every transaction stays on the automated path. Some payments require exception handling because of thresholds, missing metadata, suspicious duplicates, or blocked destinations.

Status must be visible

Operators and internal consumers need to know whether a case is pending review, under investigation, approved, or rejected.

Delivery must be governed

The service is delivered as a Spring Boot API on Azure Kubernetes Service (AKS) with clear Infrastructure, Platform, and Application ownership, shared observability, and documented troubleshooting.

The result is credible

The result is an observable operating baseline: controlled delivery, team accountability, runtime signals, and evidence strong enough for a technical discussion.

flowchart TB
  classDef control fill:#4a4a4a,color:#ffffff,stroke:#2f2f2f,stroke-width:1px;
  classDef repo fill:#f2ebff,color:#1d231f,stroke:#8b6ee8,stroke-width:1px;
  classDef internal fill:#eefae9,color:#1d231f,stroke:#73b86e,stroke-width:1px;
  classDef env fill:#dcebf7,color:#1d231f,stroke:#5a94c8,stroke-width:2px,stroke-dasharray: 5 4;
  classDef azure fill:#fff2e8,color:#1d231f,stroke:#d98a52,stroke-width:1px;
  classDef cluster fill:#eef4ff,color:#1d231f,stroke:#5f84f5,stroke-width:2px;
  classDef workload fill:#1e1e1e,color:#ffffff,stroke:#222222,stroke-width:1px;
  classDef obs fill:#fff8f2,color:#1d231f,stroke:#ee9656,stroke-width:1px;

  subgraph CD["Controlled Delivery Path"]
    direction TB
    GH["GitHub Repository"]:::repo --> GHA["GitHub Actions"]:::repo
    GHA --> OIDC["Microsoft Entra ID / OIDC"]:::repo
    OIDC --> HELM["Terraform / Helm Delivery"]:::repo
  end
  style CD fill:#4a4a4a,stroke:#4a4a4a,color:#ffffff

  subgraph IA["Internal Access Path"]
    direction TB
    DEV["Internal Dev"]:::internal --> CLI["Azure CLI / kubectl context"]:::internal
    CLI --> PF["kubectl port-forward"]:::internal
  end
  style IA fill:#4a4a4a,stroke:#4a4a4a,color:#ffffff

  subgraph AZ["Azure Dev Environment"]
    direction LR
    subgraph TFSTATE["Resource Group: rg-stage1-tfstate"]
      direction TB
      SA["Azure Storage Account"]:::azure
      STATE["Terraform Remote State"]:::azure
      SA --> STATE
    end
    subgraph APPRG["Resource Group: rg-stage1-aks / Private Virtual Network"]
      direction TB
      subgraph AKS["AKS Platform"]
        direction LR
        APP["Java Spring Boot Payment Exception Review API"]:::workload
        GRAF["Prometheus / Grafana"]:::obs
      end
      DB["Azure PostgreSQL"]:::azure
      AKS --> DB
    end
  end
  style AZ fill:#dcebf7,stroke:#5a94c8,stroke-width:2px,stroke-dasharray: 5 4
  style AKS fill:#eef4ff,stroke:#5f84f5,stroke-width:2px

  HELM --> AKS
  PF --> AKS

Business and operating context in one view: controlled delivery path, internal access path, Azure dev environment, AKS platform, PostgreSQL, Terraform state, and shared observability.

Why this project exists

This project connects a real market signal to a concrete build: governed AKS delivery, documented end to end.

Market signal

Recruiter conversations and Montreal job analysis pointed to the same pattern: regulated teams increasingly expect Azure, Kubernetes, CI/CD, observability, and platform ownership from day one.

Stack mobility matters

A narrow production stack can be valuable, but it can also limit future options when hiring demand shifts. Kubernetes and governed delivery patterns create a more portable base.

Complementary proof

Theory gives the map. Practice proves the route. Stage 1 connects both by turning Kubernetes concepts into a governed AKS delivery foundation with CI/CD, observability, FinOps, and operational evidence.

Built at human scale

Stage 1 is not presented as a perfect enterprise platform. It is a documented, one-person, 125h34 delivery process with evidence strong enough for a hiring-manager discussion.

Three-team operating model

Stage 1 is built around a clean separation of responsibility rather than one blended DevOps persona.

Infrastructure

Owns the Azure foundations, AKS, the PostgreSQL base hosting pattern, the remote Terraform backend, and cloud access prerequisites.

Terraform

AKS

OIDC

Platform

Owns Kubernetes bootstrap resources, namespace standards, workload delivery conventions, shared observability, and platform dashboards.

Namespaces

Helm

Grafana

Application

Owns the Spring Boot service, Docker image, Helm chart values, probes, metrics exposure, and service troubleshooting.

Java

Docker

Actuator

Infrastructure path: backend bootstrap, Azure foundation, AKS, and cloud prerequisites.

Platform path: governed Kubernetes resources and shared runtime conventions.

Application path: build, package, and deploy the Spring Boot service inside the governed boundary.

Controlled delivery journey

The important thing is not just that the app runs, but that each layer has a clear delivery path and ownership boundary.

1. Infrastructure: Provision Azure and AKS through the Infrastructure-owned Terraform path.

2. Platform: Create the governed Kubernetes application boundary and runtime prerequisites.

3. Application: Build, push, and deploy the Spring Boot image through the approved Helm workflow.

4. Observability: Install the shared monitoring stack and wire the dashboard delivery model.

Infrastructure and platform lifecycle dependency map

Infrastructure as Code lifecycle map: Terraform, Kubernetes resources, Helm releases, and dashboard ConfigMaps are sequenced intentionally so infrastructure, platform, observability, and application changes reconcile in the right order.
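The sequencing idea can be sketched as a small dependency graph. The example below uses Python's standard `graphlib` to derive one valid provisioning order and its reverse for teardown. The step names and most edges are illustrative assumptions, not the repository's actual map; the only ordering the install script states explicitly is that dashboard ConfigMaps must exist before Grafana starts.

```python
from graphlib import TopologicalSorter

# Hypothetical step names. Each key lists the steps it depends on.
DEPENDENCIES = {
    "terraform_backend": set(),
    "azure_foundation_and_aks": {"terraform_backend"},
    "kubernetes_resources": {"azure_foundation_and_aks"},
    "dashboard_configmaps": {"kubernetes_resources"},
    "observability_helm_release": {"dashboard_configmaps"},
    "application_helm_release": {"kubernetes_resources"},
}


def delivery_order(dependencies):
    """Return one valid provisioning order (raises CycleError on a cycle)."""
    return list(TopologicalSorter(dependencies).static_order())


def teardown_order(dependencies):
    """Teardown walks the same graph in reverse."""
    return list(reversed(delivery_order(dependencies)))


if __name__ == "__main__":
    print(" -> ".join(delivery_order(DEPENDENCIES)))
```

Modeling the order as data rather than as script sequence makes an accidental cycle fail fast instead of surfacing as a stuck rollout.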

Application runtime

Stage 1 does not stop at deployment. The service exposes a clear operational run path for internal consumers and platform verification.

Client / Internal consumer

Service status and delivery identity: GET /api/payment-exceptions/service-status

Returns the service status, version, validation mode, and environment-facing delivery metadata.

Payment exception lifecycle state: GET /api/payment-exceptions/{id}/status

Returns the simulated payment exception progression used to prove the business path.

RECEIVED VALIDATING PENDING_REVIEW APPROVED REJECTED ESCALATED
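The lifecycle states can be modeled as a small state machine. The sketch below is a Python model for illustration, not the Spring Boot service's code. The state names come from the list above; the allowed transitions, including treating ESCALATED as a final state, are assumptions.

```python
from enum import Enum


class ExceptionStatus(Enum):
    RECEIVED = "RECEIVED"
    VALIDATING = "VALIDATING"
    PENDING_REVIEW = "PENDING_REVIEW"
    APPROVED = "APPROVED"
    REJECTED = "REJECTED"
    ESCALATED = "ESCALATED"


# Assumed transitions, for illustration only.
ALLOWED_TRANSITIONS = {
    ExceptionStatus.RECEIVED: {ExceptionStatus.VALIDATING},
    ExceptionStatus.VALIDATING: {ExceptionStatus.PENDING_REVIEW},
    ExceptionStatus.PENDING_REVIEW: {
        ExceptionStatus.APPROVED,
        ExceptionStatus.REJECTED,
        ExceptionStatus.ESCALATED,
    },
    ExceptionStatus.APPROVED: set(),
    ExceptionStatus.REJECTED: set(),
    ExceptionStatus.ESCALATED: set(),
}


def can_transition(current, target):
    """True when the assumed lifecycle allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS[current]


def is_terminal(status):
    """A status is terminal when no further transition is defined for it."""
    return not ALLOWED_TRANSITIONS[status]
```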

Configuration validation signal: GET /api/payment-exceptions/config-check

Surfaces whether runtime configuration resolves the expected values for the service.

Operational actuator endpoints: /actuator/*

/actuator/health for readiness and liveness checks
/actuator/info for runtime metadata
/actuator/prometheus for Prometheus scraping
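A client-side check of these endpoints could look like the sketch below. It only builds URLs and interprets response bodies, so it runs without a cluster. The base URL (for example a local port-forward on `http://localhost:8080`) and the helper names are assumptions. The `{"status": "UP"}` body is the standard Spring Boot Actuator health response.

```python
import json

ACTUATOR_PATHS = {
    "health": "/actuator/health",
    "info": "/actuator/info",
    "prometheus": "/actuator/prometheus",
}


def actuator_url(base_url: str, name: str) -> str:
    """Join an assumed base URL with one of the actuator paths."""
    return base_url.rstrip("/") + ACTUATOR_PATHS[name]


def is_up(health_body: str) -> bool:
    """True when the health JSON reports status UP."""
    try:
        payload = json.loads(health_body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "UP"


def metric_names(prometheus_body: str) -> set[str]:
    """Collect metric names from Prometheus text exposition lines."""
    names = set()
    for line in prometheus_body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The name ends at the first '{' (labels) or space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names
```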

Why this matters

This run path is the operational contract of the Stage 1 service. It gives the Application team a business-facing API, gives the Platform team health and scrape endpoints, and gives shared observability a reliable signal source.

application/docs/openapi.yaml

Business path

Metrics path

Probe path

Dashboard path

Grafana dashboard for the backend service

Runtime signals are not abstract. They feed the service dashboard and make the deployment observable once the app is live in AKS.

FinOps choices were explicit

Stage 1 was designed to stay credible without pretending cloud cost does not matter. The repo documents both the technical baseline and the cost-control tradeoffs.

Design choices

Cost-aware by design

The Azure path is not treated as always-on by default. Local validation exists specifically to reduce cloud spend and shorten iteration time before using AKS.

The AKS-managed resource group shown here costs about CA$1.16 per day while it exists.
Using the local kind cluster for most iterations avoids carrying that Azure runtime cost during everyday development.
The repo explicitly recommends destroying AKS and Azure resources after use, so you do not keep paying that daily platform cost outside active work windows.
The remote Terraform backend stays almost negligible at < CA$0.01 per day, so the expensive part is the live runtime, not the state layer.
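The effect of teardown on the bill can be approximated with simple arithmetic. The sketch below uses the two daily figures quoted above and nothing else, so it is a rough model rather than a reproduction of the Azure cost view. The CA$0.01 backend rate is an upper bound, since the text only says "< CA$0.01".

```python
AKS_RUNTIME_CAD_PER_DAY = 1.16    # AKS-managed resource group, while it exists
STATE_BACKEND_CAD_PER_DAY = 0.01  # upper bound for the remote Terraform backend


def monthly_cost_cad(runtime_days: float, days_in_month: int = 30) -> float:
    """Estimate a month where the runtime exists for only `runtime_days` days."""
    runtime = AKS_RUNTIME_CAD_PER_DAY * runtime_days
    backend = STATE_BACKEND_CAD_PER_DAY * days_in_month  # backend is never torn down
    return round(runtime + backend, 2)


if __name__ == "__main__":
    for days in (3, 30):
        print(f"{days} runtime days: about CA${monthly_cost_cad(days):.2f}")
```

Under these assumptions, three active days come to roughly CA$3.78, while leaving the runtime up for 30 days comes to roughly CA$35.10.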

ADR-backed

Baseline vs fallback

The VM-size decision is documented in the ADRs. The repo keeps a stronger default while still acknowledging cheaper fallback options for personal Azure subscriptions.

Standard_D2als_v6 stays the default Stage 1 baseline.
Standard_B2als_v2 is documented as the lower-cost fallback when quota and region availability allow it, specifically to make the project more accessible on tighter personal budgets.
In Canada Central Linux pricing, Standard_B2als_v2 is about $30.51/month per node versus about $65.41/month for Standard_D2als_v6, a roughly 53% node-compute reduction.
The cost screenshots show the practical target state: about CA$3.59 forecast for the month when the estate is kept minimal and destroyed after use instead of left running continuously.
Quota-awareness and subscription-budget realism are part of the delivery story, even when the repo keeps a stronger default baseline.
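The quoted node-compute reduction can be checked directly from the two monthly prices. This is a minimal sketch; the prices are the ones cited above and may not match current Azure pricing.

```python
D2ALS_V6_MONTHLY = 65.41  # Standard_D2als_v6, per node, as quoted above
B2ALS_V2_MONTHLY = 30.51  # Standard_B2als_v2, per node, as quoted above


def percent_reduction(baseline: float, fallback: float) -> float:
    """Relative saving of the fallback price against the baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - fallback) / baseline * 100


if __name__ == "__main__":
    saving = percent_reduction(D2ALS_V6_MONTHLY, B2ALS_V2_MONTHLY)
    print(f"Node-compute reduction: about {saving:.0f}%")
```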

Cost analysis by Azure service: roughly CA$0.60 for virtual machines, CA$0.47 for storage, CA$0.10 for virtual network, and about CA$0.01 for Azure DNS. The platform is small enough to reason about cost component by component.

Low monthly Azure infrastructure cost view

FinOps proof point: the captured Azure view shows CA$1.17 actual cost at that moment and a CA$3.59 monthly forecast, which supports the strategy of local-first development plus teardown of unused cloud runtime.

Tech stack evolution

Stage 1 is delivery credibility. Stage 2 is governance and shared-platform maturity. Stage 3 is enterprise architecture scale and hybrid realism.

Stage 1

Governed delivery foundation

One stateful regulated internal service delivered safely through a controlled AKS path.

Formalized environments

Local

Dev

Scope: one governed internal service boundary, not yet a shared multi-tenant platform.

AKS

Terraform

GitHub Actions

GitHub

GitHub Releases

Docker

GHCR

Helm

Kustomize

Spring Boot

Azure Storage backend

Federated OIDC

Azure PostgreSQL

Local PostgreSQL

Linux / Ubuntu

Prometheus

Grafana

Stage 2

Governed shared platform

More tenants, stronger controls, centralized secrets, governed dependency updates, GitOps-style reconciliation, and broader operational governance.

Formalized environments

Dev

Prod

Isolation: multi-team and multi-tenant shared platform with namespace and policy boundaries.

OpenShift

Dependabot

ArgoCD

Vault

Elasticsearch

Kibana

Ansible

Stage 3

Enterprise hybrid platform

Identity maturity, broader observability layering, service-mesh traffic governance, hybrid-cloud credibility, and large-scale governance patterns.

Formalized environments

Local

Dev

Prod

Hybrid scope: Azure, AWS, and on-premises operating patterns standardized under one enterprise model.

AWS

Azure

OpenShift

OpenShift Service Mesh

EKS

Okta

Microsoft Entra ID

Active Directory / AD DS

Terragrunt

Datadog

Thanos

CloudWatch

On-premises

GitHub Actions by team

Workflow files stay in one flat directory, because GitHub only loads workflows from .github/workflows, but the team prefix in each filename makes accountability obvious.

Infrastructure

infrastructure-azure-provision.yml
infrastructure-azure-destroy.yml

Platform

platform-kubernetes-resources-provision.yml
platform-kubernetes-resources-destroy.yml
platform-observability-provision.yml
platform-observability-destroy.yml
platform-observability-validate.yml

Application

application-app-ci.yml
application-app-deploy.yml
application-app-destroy.yml
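Because ownership is encoded in the filename prefix, it can be recovered mechanically. The sketch below groups the workflow files listed above by owning team. The helper names are hypothetical, and the rule that the first hyphen-separated token is the team is inferred from the naming shown here.

```python
WORKFLOWS = [
    "infrastructure-azure-provision.yml",
    "infrastructure-azure-destroy.yml",
    "platform-kubernetes-resources-provision.yml",
    "platform-kubernetes-resources-destroy.yml",
    "platform-observability-provision.yml",
    "platform-observability-destroy.yml",
    "platform-observability-validate.yml",
    "application-app-ci.yml",
    "application-app-deploy.yml",
    "application-app-destroy.yml",
]

TEAM_PREFIXES = ("infrastructure", "platform", "application")


def owning_team(filename: str) -> str:
    """Map a workflow filename to its team using the leading prefix."""
    prefix = filename.split("-", 1)[0]
    if prefix not in TEAM_PREFIXES:
        raise ValueError(f"no owning team for {filename!r}")
    return prefix.capitalize()


def group_by_team(filenames):
    """Group workflow filenames under their owning team."""
    groups = {}
    for name in filenames:
        groups.setdefault(owning_team(name), []).append(name)
    return groups
```

A check like this could run in CI to reject a workflow file that does not declare an owner through its name.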

GitHub Actions deployment workflow with Helm

Delivery proof in GitHub Actions: the repo does not stop at diagrams, it drives a real Helm-based application deployment path.

Stage 1 observability proves the platform is alive

The immediate goal is simple: prove that AKS, Prometheus, and Grafana are up, scraping, and exposing usable platform signals. Stage 2 is where the focus shifts toward richer application metrics, logs, and alerting.

Stage 1

First prove the infrastructure and platform are up

If Grafana can already show Kubernetes, node, and namespace-level signals in AKS, then the cloud foundation, cluster monitoring stack, and shared platform path are working.

bootstrap the shared Prometheus and Grafana stack
confirm built-in dashboards render real AKS node and namespace data
use curated platform dashboards only as fast proof, not as the end state

Stage 2

Then go deeper into application signals

Once the platform path is stable, observability becomes more application-centered: request metrics, business-health dashboards, logs, and alerting contracts.

service-specific metrics such as request volume, error split, and DB pool usage
logs and alerting become first-class operational assets
the backend dashboard becomes one proof point inside a broader application-observability layer

Stage 1 proof: Grafana already exposes useful AKS dashboards out of the box. That is enough to prove the shared stack is alive.

Node-level visibility is a strong platform signal: infrastructure, exporters, and Prometheus ingestion are all working together.

Stage 2 direction: once the base is proven, the next layer is application metrics, then logs, then alerts around real service behavior.

IaC all the way to observability

The platform is not assembled manually from the console. Terraform creates the cloud and Kubernetes foundation, Helm reconciles the shared stack, and the dashboards live as versioned JSON in the repo. Kustomize turns them into Kubernetes ConfigMaps, and Grafana mounts them through Helm values.
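To make the dashboard step concrete, the sketch below builds the kind of ConfigMap manifest that a Kustomize `configMapGenerator` entry produces from a dashboard JSON file: one data key per file, holding the file content. Kustomize does this itself in the repo; the Python function is only an illustration of the resulting shape, and its name is hypothetical.

```python
import json


def dashboard_configmap(name: str, namespace: str, file_key: str, dashboard: dict) -> dict:
    """Build a ConfigMap manifest that carries one dashboard JSON document."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        # The data key becomes the filename Grafana sees once the ConfigMap is mounted.
        "data": {file_key: json.dumps(dashboard, indent=2)},
    }


if __name__ == "__main__":
    dashboard = {
        "title": "Payment Exception Review - Platform Runtime",
        "uid": "payment-exception-review-platform-runtime",
    }
    manifest = dashboard_configmap(
        name="payment-exception-review-platform-runtime-dashboard",
        namespace="monitoring",
        file_key="payment-exception-review-platform-runtime.json",
        dashboard=dashboard,
    )
    print(json.dumps(manifest, indent=2))
```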

Helm mounts into Grafana

infrastructure/azure/terraform/aks.tf
resource "azurerm_kubernetes_cluster" "aks" {
  name                = var.aks_cluster_name
  location            = azurerm_resource_group.aks.location
  resource_group_name = azurerm_resource_group.aks.name
  dns_prefix          = var.dns_prefix

  default_node_pool {
    name       = "default"
    node_count = var.node_count
    vm_size    = var.vm_size
  }
}

infrastructure/azure/terraform/postgresql.tf
resource "azurerm_postgresql_flexible_server" "app" {
  name                = var.postgres_server_name
  resource_group_name = azurerm_resource_group.aks.name
  sku_name            = var.postgres_sku_name
}


platform/kubernetes-resources/observability/grafana/dashboards/payment-exception-review-platform-runtime.json
{
  "title": "Payment Exception Review - Platform Runtime",
  "uid": "payment-exception-review-platform-runtime",
  "timezone": "utc",
  "panels": [
    {
      "title": "Dev namespace",
      "type": "stat",
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "targets": [
        {
          "editorMode": "code",
          "expr": "up{namespace=\"payment-exception-review-local\"}",
          "refId": "A"
        }
      ]
    }
  ]
}


platform/kubernetes-resources/observability/grafana/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: monitoring

configMapGenerator:
  - name: payment-exception-review-platform-runtime-dashboard
    files:
      - payment-exception-review-platform-runtime.json=dashboards/payment-exception-review-platform-runtime.json

  - name: kubernetes-networking-namespace-pods-curated
    files:
      - kubernetes-networking-namespace-pods-curated.json=dashboards/kubernetes-networking-namespace-pods-curated.json

generatorOptions:
  disableNameSuffixHash: true


platform/kubernetes-resources/observability/grafana/kube-prometheus-stack-grafana-values.yaml
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: payment-exception-review
          folder: Payment Exception Review
          type: file
          options:
            path: /var/lib/grafana/dashboards/payment-exception-review

  extraConfigmapMounts:
    - name: payment-exception-review-platform-runtime
      configMap: payment-exception-review-platform-runtime-dashboard
      mountPath: /var/lib/grafana/dashboards/payment-exception-review/payment-exception-review-platform-runtime.json
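
Grafana's file provider only picks up dashboards that sit under its configured `path`, so a mount outside that directory produces no dashboard. The sketch below checks that relationship on a values structure shaped like the excerpt above. The values are passed as a plain Python dict to keep the example dependency-free, and the function name is hypothetical.

```python
from pathlib import PurePosixPath


def unprovisioned_mounts(grafana_values: dict) -> list[str]:
    """Return the names of ConfigMap mounts that no dashboard provider path covers."""
    providers = grafana_values["dashboardProviders"]["dashboardproviders.yaml"]["providers"]
    provider_paths = [PurePosixPath(p["options"]["path"]) for p in providers]
    orphans = []
    for mount in grafana_values.get("extraConfigmapMounts", []):
        mount_path = PurePosixPath(mount["mountPath"])
        if not any(mount_path.is_relative_to(path) for path in provider_paths):
            orphans.append(mount["name"])
    return orphans


GRAFANA_VALUES = {
    "dashboardProviders": {
        "dashboardproviders.yaml": {
            "apiVersion": 1,
            "providers": [
                {
                    "name": "payment-exception-review",
                    "type": "file",
                    "options": {"path": "/var/lib/grafana/dashboards/payment-exception-review"},
                }
            ],
        }
    },
    "extraConfigmapMounts": [
        {
            "name": "payment-exception-review-platform-runtime",
            "configMap": "payment-exception-review-platform-runtime-dashboard",
            "mountPath": "/var/lib/grafana/dashboards/payment-exception-review/"
            "payment-exception-review-platform-runtime.json",
        }
    ],
}

if __name__ == "__main__":
    print(unprovisioned_mounts(GRAFANA_VALUES))
```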


platform/kubernetes-resources/observability/scripts/cluster/install_shared_observability_stack.sh
# Custom dashboard ConfigMaps must exist before Grafana starts.
"$SYNC_RELIABILITY_DASHBOARDS_SCRIPT"
kubectl apply -k "$DASHBOARD_KUSTOMIZE_DIR"

HELM_VALUES_ARGS=(
  -f "$VALUES_FILE_PROMETHEUS"
  -f "$VALUES_FILE_GRAFANA"
  -f "$VALUES_FILE_ALERTMANAGER"
  -f "$TEMP_VALUES"
)

helm upgrade --install "$RELEASE_NAME" \
  prometheus-community/kube-prometheus-stack \
  --namespace "$MONITORING_NAMESPACE" \
  --create-namespace \
  "${HELM_VALUES_ARGS[@]}" \
  --wait \
  --timeout "$OBSERVABILITY_HELM_TIMEOUT"
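
The order of the `-f` arguments matters, because Helm gives priority to the right-most values file. The sketch below approximates that merge in Python: nested maps are merged key by key, while lists and scalars from a later file replace earlier ones. It is a simplified model, not Helm's implementation (for example, it does not model deleting a key with `null`), and the keys in the usage example are illustrative.

```python
def merge_values(base: dict, override: dict) -> dict:
    """Recursively merge two values maps; entries in `override` win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = value  # scalars and lists are replaced, not combined
    return merged


def effective_values(*values_files: dict) -> dict:
    """Fold values files left to right, like repeated -f flags."""
    result = {}
    for values in values_files:
        result = merge_values(result, values)
    return result


if __name__ == "__main__":
    grafana_values = {"grafana": {"persistence": {"enabled": True}, "adminUser": "admin"}}
    temp_values = {"grafana": {"persistence": {"enabled": False}}}
    print(effective_values(grafana_values, temp_values))
```

In the script above this means anything set in `$TEMP_VALUES` takes precedence over the three checked-in values files.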


Troubleshooting is part of the asset, not an afterthought

The strongest operational signal in this repo is that the real failure modes were captured and explained.

Helm install timeout: Differentiated a generic rollout delay from an actually blocked release.
Prometheus query issue: Showed how observability failures can come from the query contract, not only the app.
Grafana datasource validation: Captured the local-vs-runtime mismatch around localhost assumptions.
Local PVC permission failure: Turned a repeated local Grafana failure into a durable local-only design fix.
Dashboard no-data case: Distinguished dashboard wiring issues from real metric coverage gaps.
Grafana v2 vs classic JSON: Documented the exact schema mismatch between the modern export format and file provisioning.

Scenario 1 evidence

Helm was blocked, not just slow

The release did not merely take longer than expected. Helm stayed in pending-upgrade and returned context deadline exceeded, while the Grafana pods were still stuck in PodInitializing.

STATUS: pending-upgrade
Error: context deadline exceeded

kubectl get pods -n monitoring
kube-prometheus-stack-grafana-...   0/3   PodInitializing
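
The "blocked, not slow" distinction can be expressed as a small check over captured command output. The sketch below reads the `STATUS:` line that `helm status` prints and combines it with the timeout error text. The heuristic itself (a pending status plus a deadline error means blocked) is an assumption drawn from this scenario, not a general Helm rule.

```python
from typing import Optional

# Helm release statuses that indicate an operation is still in flight.
PENDING_STATUSES = {"pending-install", "pending-upgrade", "pending-rollback"}


def parse_status(helm_status_output: str) -> Optional[str]:
    """Extract the value of the STATUS: line from `helm status` output."""
    for line in helm_status_output.splitlines():
        if line.startswith("STATUS:"):
            return line.split(":", 1)[1].strip()
    return None


def looks_blocked(helm_status_output: str, error_text: str) -> bool:
    """Heuristic: a pending release plus a deadline error suggests a blocked rollout."""
    status = parse_status(helm_status_output)
    return status in PENDING_STATUSES and "context deadline exceeded" in error_text


if __name__ == "__main__":
    print(looks_blocked("STATUS: pending-upgrade", "Error: context deadline exceeded"))
```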

Scenario 4 evidence

Grafana was the only unhealthy component

Prometheus, Alertmanager, and the operator were healthy, but Grafana stayed stuck in Init:CrashLoopBackOff. The init container logs showed permission errors under /var/lib/grafana.

chown: /var/lib/grafana/pdf: Permission denied
chown: /var/lib/grafana/png: Permission denied
chown: /var/lib/grafana/csv: Permission denied

Port-forward and curl proof for the Helm application

Fix-oriented proof: once the platform and application path are healthy again, port-forwarding and curl checks confirm the service is reachable and behaving correctly.

Stage 1 outcome

Executive summary: one governed delivery path, one replayable bootstrap, one measurable cloud-cost story, and one credible internal service running on AKS.

Bootstrap: 20 min

Monthly cost: < CA$4

Main entrypoint: 1 script

Teardown: 13 min

3-team ownership model
11 GitHub Actions workflows
AKS + managed PostgreSQL
Dashboards as code
7 troubleshooting scenarios
125h34 recorded build effort

The core result is operational simplicity with enterprise structure: more than 125 hours of recorded Stage 1 build work compressed into a replayable path that bootstraps the Terraform backend, AKS foundation, Kubernetes runtime, and shared observability baseline; validates the service path; then tears the cloud runtime down when it is not needed.

Executive takeaways

Time: full bootstrap is replayable in about 20 minutes instead of spread across manual team handoffs.
Effort: the Stage 1 platform represents 125h34 of recorded project work translated into a repeatable delivery foundation.
Cost: the demo footprint is forecast below CA$4/month when the runtime is destroyed after use.
Governance: Infrastructure, Platform, and Application responsibilities are separated in code and workflows.
Evidence: release, package, dashboards, runbooks, and teardown path are all captured as project assets.

Explore the full repository

Release proof

Stage 1 is frozen as a versioned release

The release points to stage1-v1.0.0 and commit 5d88fbb, so the proof is stable even after Stage 2 evolves the repository.

Package proof

The Spring Boot service is published to GHCR

The container package is public, versioned by image tags, and shows 40 downloads, proving the CI path produced a reusable artifact.

Governed AKS delivery for a regulated internal service.
