Stage 1 / Regulated Payment Exception Review Platform
Stage 1 Video Narrative

Governed AKS delivery for a regulated internal service.

A concise presentation page for the Payment Exception Review Status API: infrastructure, platform, application delivery, observability, and real troubleshooting, without demo-bloat or slide clutter.


Business context

A regulated payment workflow creates the problem. The service provides visibility. Stage 1 proves the delivery model.

Exceptions happen

Not every transaction stays on the automated path. Some payments require exception handling because of thresholds, missing metadata, suspicious duplicates, or blocked destinations.

Status must be visible

Operators and internal consumers need to know whether a case is pending review, under investigation, approved, or rejected.

Delivery must be governed

The service is delivered as a Spring Boot API on Azure AKS with clear Infrastructure, Platform, and Application ownership, shared observability, and documented troubleshooting.

The result is credible

The result is an observable operating baseline: controlled delivery, team accountability, runtime signals, and evidence strong enough for a technical discussion.

flowchart TB
  classDef control fill:#4a4a4a,color:#ffffff,stroke:#2f2f2f,stroke-width:1px;
  classDef repo fill:#f2ebff,color:#1d231f,stroke:#8b6ee8,stroke-width:1px;
  classDef internal fill:#eefae9,color:#1d231f,stroke:#73b86e,stroke-width:1px;
  classDef env fill:#dcebf7,color:#1d231f,stroke:#5a94c8,stroke-width:2px,stroke-dasharray: 5 4;
  classDef azure fill:#fff2e8,color:#1d231f,stroke:#d98a52,stroke-width:1px;
  classDef cluster fill:#eef4ff,color:#1d231f,stroke:#5f84f5,stroke-width:2px;
  classDef workload fill:#1e1e1e,color:#ffffff,stroke:#222222,stroke-width:1px;
  classDef obs fill:#fff8f2,color:#1d231f,stroke:#ee9656,stroke-width:1px;

  subgraph CD["Controlled Delivery Path"]
    direction TB
    GH["GitHub Repository"]:::repo --> GHA["GitHub Actions"]:::repo
    GHA --> OIDC["Microsoft Entra ID / OIDC"]:::repo
    OIDC --> HELM["Terraform / Helm Delivery"]:::repo
  end
  style CD fill:#4a4a4a,stroke:#4a4a4a,color:#ffffff

  subgraph IA["Internal Access Path"]
    direction TB
    DEV["Internal Dev"]:::internal --> CLI["Azure CLI / kubectl context"]:::internal
    CLI --> PF["kubectl port-forward"]:::internal
  end
  style IA fill:#4a4a4a,stroke:#4a4a4a,color:#ffffff

  subgraph AZ["Azure Dev Environment"]
    direction LR
    subgraph TFSTATE["Resource Group: rg-stage1-tfstate"]
      direction TB
      SA["Azure Storage Account"]:::azure
      STATE["Terraform Remote State"]:::azure
      SA --> STATE
    end
    subgraph APPRG["Resource Group: rg-stage1-aks / Private Virtual Network"]
      direction TB
      subgraph AKS["AKS Platform"]
        direction LR
        APP["Java Spring Boot Payment Exception Review API"]:::workload
        GRAF["Prometheus / Grafana"]:::obs
      end
      DB["Azure PostgreSQL"]:::azure
      AKS --> DB
    end
  end
  style AZ fill:#dcebf7,stroke:#5a94c8,stroke-width:2px,stroke-dasharray: 5 4
  style AKS fill:#eef4ff,stroke:#5f84f5,stroke-width:2px

  HELM --> AKS
  PF --> AKS
Business and operating context in one view: controlled delivery path, internal access path, Azure dev environment, AKS platform, PostgreSQL, Terraform state, and shared observability.

Why this project exists

This project connects a real market signal to a concrete build: governed AKS delivery, documented end to end.

Market signal

Recruiter conversations and an analysis of the Montreal job market pointed to the same pattern: regulated teams increasingly expect Azure, Kubernetes, CI/CD, observability, and platform ownership from day one.

Stack mobility matters

A narrow production stack can be valuable, but it can also limit future options when hiring demand shifts. Kubernetes and governed delivery patterns create a more portable base.

Complementary proof

Theory gives the map. Practice proves the route. Stage 1 connects both by turning Kubernetes concepts into a governed AKS delivery foundation with CI/CD, observability, FinOps, and operational evidence.

Built at human scale

Stage 1 is not presented as a perfect enterprise platform. It is a documented, one-person, 125h34 delivery process with evidence strong enough for a hiring-manager discussion.

Three-team operating model

Stage 1 is built around a clean separation of responsibility rather than one blended DevOps persona.

Infrastructure

Owns the Azure foundations, AKS, the PostgreSQL base hosting pattern, the remote Terraform backend, and cloud access prerequisites.

Terraform, AKS, OIDC

Platform

Owns Kubernetes bootstrap resources, namespace standards, workload delivery conventions, shared observability, and platform dashboards.

Namespaces, Helm, Grafana

Application

Owns the Spring Boot service, Docker image, Helm chart values, probes, metrics exposure, and service troubleshooting.

Java, Docker, Actuator
Infrastructure bootstrap path
Infrastructure path: backend bootstrap, Azure foundation, AKS, and cloud prerequisites.
Platform provision path
Platform path: governed Kubernetes resources and shared runtime conventions.
Application delivery path
Application path: build, package, and deploy the Spring Boot service inside the governed boundary.

Controlled delivery journey

The important thing is not just that the app runs, but that each layer has a clear delivery path and ownership boundary.

1. Infrastructure

Provision Azure and AKS through the Infrastructure-owned Terraform path.

2. Platform

Create the governed Kubernetes application boundary and runtime prerequisites.

3. Application

Build, push, and deploy the Spring Boot image through the approved Helm workflow.

4. Observability

Install the shared monitoring stack and wire the dashboard delivery model.

Infrastructure and platform lifecycle dependency map
Infrastructure as Code lifecycle map: Terraform, Kubernetes resources, Helm releases, and dashboard ConfigMaps are sequenced intentionally so infrastructure, platform, observability, and application changes reconcile in the right order.

Application runtime

Stage 1 does not stop at deployment. The service exposes a clear operational run path for internal consumers and platform verification.

Client / Internal consumer
Service status and delivery identity: GET /api/payment-exceptions/service-status

Returns the service status, version, validation mode, and environment-facing delivery metadata.

Payment exception lifecycle state: GET /api/payment-exceptions/{id}/status

Returns the simulated payment exception progression used to prove the business path.

RECEIVED, VALIDATING, PENDING_REVIEW, APPROVED, REJECTED, ESCALATED
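A consumer of the status endpoint can guard against unexpected values by checking the returned state against the six lifecycle states listed above. The sketch below is a minimal, hypothetical helper: the name `is_known_status` is not from the repository, and it assumes the state is already available as a plain string, since the shape of the response body is not shown on this page.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only for the six lifecycle states
# listed for GET /api/payment-exceptions/{id}/status.
is_known_status() {
  case "$1" in
    RECEIVED|VALIDATING|PENDING_REVIEW|APPROVED|REJECTED|ESCALATED) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: classify a value before acting on it.
state="PENDING_REVIEW"
if is_known_status "$state"; then
  echo "known state: $state"
else
  echo "unexpected state: $state" >&2
fi
```

The match is case-sensitive on purpose, so a lower-case or misspelled value is treated as unexpected rather than silently accepted.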
Configuration validation signal: GET /api/payment-exceptions/config-check

Surfaces whether runtime configuration resolves the expected values for the service.

Operational actuator endpoints: /actuator/*
  • /actuator/health for readiness and liveness checks
  • /actuator/info for runtime metadata
  • /actuator/prometheus for Prometheus scraping
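The run path above lends itself to a simple smoke check. The sketch below is illustrative only: `endpoint_paths` and `smoke_check` are hypothetical helpers, the per-case status endpoint is left out because no sample identifier is given here, and the base URL is whatever address the service is reachable on.

```shell
#!/bin/sh
# Paths taken from the run path described above.
endpoint_paths() {
  cat <<'EOF'
/api/payment-exceptions/service-status
/api/payment-exceptions/config-check
/actuator/health
/actuator/info
/actuator/prometheus
EOF
}

# Hypothetical smoke check: request each path and report failures.
# -f makes curl fail on HTTP errors; -sS hides progress but keeps errors.
smoke_check() {
  base_url="$1"
  failed=0
  for path in $(endpoint_paths); do
    if curl -fsS -o /dev/null --max-time 5 "${base_url}${path}"; then
      echo "ok   ${path}"
    else
      echo "FAIL ${path}"
      failed=1
    fi
  done
  return "$failed"
}
```

With a `kubectl port-forward` to the service in place, for example on local port 8080 (an assumed value), running `smoke_check http://localhost:8080` exits non-zero if any endpoint is unreachable or returns an HTTP error.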

Why this matters

This run path is the operational contract of the Stage 1 service. It gives the Application team a business-facing API, gives the Platform team health and scrape endpoints, and gives shared observability a reliable signal source.

Business path, Metrics path, Probe path, Dashboard path
Grafana dashboard for the backend service
Runtime signals are not abstract. They feed the service dashboard and make the deployment observable once the app is live in AKS.

FinOps choices were explicit

Stage 1 was designed to stay credible without pretending cloud cost does not matter. The repo documents both the technical baseline and the cost-control tradeoffs.

Design choices

Cost-aware by design

The Azure path is not treated as always-on by default. Local validation exists specifically to reduce cloud spend and shorten iteration time before using AKS.

  • The AKS-managed resource group shown here costs about CA$1.16 per day while it exists.
  • Using the local kind cluster for most iterations avoids carrying that Azure runtime cost during everyday development.
  • The repo explicitly recommends destroying AKS and Azure resources after use, so you do not keep paying that daily platform cost outside active work windows.
  • The remote Terraform backend stays almost negligible at < CA$0.01 per day, so the expensive part is the live runtime, not the state layer.
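To make the teardown argument concrete, the daily figures above can be projected over a month. This is back-of-the-envelope arithmetic that assumes a flat daily rate and a 30-day month; `projected_cost` is an illustrative helper, not part of the repository.

```shell
#!/bin/sh
# Illustrative helper: project a flat daily cost over a number of days.
projected_cost() {
  awk -v daily="$1" -v days="$2" 'BEGIN { printf "%.2f\n", daily * days }'
}

# AKS-managed resource group left running for a 30-day month.
projected_cost 1.16 30   # prints 34.80
# Remote Terraform backend at the stated upper bound of CA$0.01 per day.
projected_cost 0.01 30   # prints 0.30
```

Under these assumptions an always-on runtime would cost roughly CA$35 per month, while the state layer stays well under CA$1.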
ADR-backed

Baseline vs fallback

The VM-size decision is documented in the ADRs. The repo keeps a stronger default while still acknowledging cheaper fallback options for personal Azure subscriptions.

  • Standard_D2als_v6 stays the default Stage 1 baseline.
  • Standard_B2als_v2 is documented as the lower-cost fallback when quota and region availability allow it, specifically to make the project more accessible on tighter personal budgets.
  • In Canada Central Linux pricing, Standard_B2als_v2 is about $30.51/month per node versus about $65.41/month for Standard_D2als_v6, a roughly 53% node-compute reduction.
  • The cost screenshots show the practical target state: about CA$3.59 forecast for the month when the estate is kept minimal and destroyed after use instead of left running continuously.
  • Quota-awareness and subscription-budget realism are part of the delivery story, even when the repo keeps a stronger default baseline.
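The 53% figure follows directly from the two monthly node prices quoted above. As a quick check (the helper name `percent_reduction` is illustrative):

```shell
#!/bin/sh
# Illustrative helper: percentage saved when moving from a baseline
# price to a cheaper alternative.
percent_reduction() {
  awk -v base="$1" -v alt="$2" 'BEGIN { printf "%.1f\n", (base - alt) / base * 100 }'
}

# Standard_D2als_v6 (baseline) versus Standard_B2als_v2 (fallback),
# per node per month.
percent_reduction 65.41 30.51   # prints 53.4
```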
Azure cost analysis by service
Cost analysis by Azure service: roughly CA$0.60 for virtual machines, CA$0.47 for storage, CA$0.10 for virtual network, and about CA$0.01 for Azure DNS. The platform is small enough to reason about cost component by component.
Low monthly Azure infrastructure cost view
FinOps proof point: the captured Azure view shows CA$1.17 actual cost at that moment and a CA$3.59 monthly forecast, which supports the strategy of local-first development plus teardown of unused cloud runtime.

Tech stack evolution

Stage 1 is delivery credibility. Stage 2 is governance and shared-platform maturity. Stage 3 is enterprise architecture scale and hybrid realism.

Stage 1

Governed delivery foundation

One stateful regulated internal service delivered safely through a controlled AKS path.

Formalized environments
Local, Dev
Scope: one governed internal service boundary, not yet a shared multi-tenant platform.
AKS, Terraform, GitHub Actions, GitHub, GitHub Releases, Docker, GHCR, Helm, Kustomize, Spring Boot, Azure Storage backend, Federated OIDC, Azure PostgreSQL, Local PostgreSQL, Linux / Ubuntu, Prometheus, Grafana
Stage 2

Governed shared platform

More tenants, stronger controls, centralized secrets, governed dependency updates, GitOps-style reconciliation, and broader operational governance.

Formalized environments
Dev, Prod
Isolation: multi-team and multi-tenant shared platform with namespace and policy boundaries.
OpenShift, Dependabot, Argo CD, Vault, Elasticsearch, Kibana, Ansible
Stage 3

Enterprise hybrid platform

Identity maturity, broader observability layering, service-mesh traffic governance, hybrid-cloud credibility, and large-scale governance patterns.

Formalized environments
Local, Dev, Prod
Hybrid scope: Azure, AWS, and on-premises operating patterns standardized under one enterprise model.
AWS, Azure, OpenShift, OpenShift Service Mesh, EKS, Okta, Microsoft Entra ID, Active Directory / AD DS, Terragrunt, Datadog, Thanos, CloudWatch, On-premises

GitHub Actions by team

Workflow files stay flat because GitHub Actions only loads workflows from the top level of .github/workflows, but the team prefix in each file name makes accountability obvious.

Infrastructure

  • infrastructure-azure-provision.yml
  • infrastructure-azure-destroy.yml

Platform

  • platform-kubernetes-resources-provision.yml
  • platform-kubernetes-resources-destroy.yml
  • platform-observability-provision.yml
  • platform-observability-destroy.yml
  • platform-observability-validate.yml

Application

  • application-app-ci.yml
  • application-app-deploy.yml
  • application-app-destroy.yml
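Because every workflow file starts with its owning team, ownership can be derived from the file name alone. A minimal sketch, assuming the text before the first hyphen is always the team name (the helper `workflow_team` is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: derive the owning team from a workflow file name
# by keeping everything before the first hyphen.
workflow_team() {
  file=$(basename "$1")
  echo "${file%%-*}"
}

# Print team and file side by side for a few of the workflows above.
for wf in \
  infrastructure-azure-provision.yml \
  platform-observability-validate.yml \
  application-app-deploy.yml
do
  printf '%-14s %s\n' "$(workflow_team "$wf")" "$wf"
done
```

The same helper could back a review rule or a report that groups workflow runs by team, without any extra metadata file.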
GitHub Actions deployment workflow with Helm
Delivery proof in GitHub Actions: the repo does not stop at diagrams; it drives a real Helm-based application deployment path.

Stage 1 observability proves the platform is alive

The immediate goal is simple: prove that AKS, Prometheus, and Grafana are up, scraping, and exposing usable platform signals. Stage 2 is where the focus shifts toward richer application metrics, logs, and alerting.

Stage 1

First prove the infrastructure and platform are up

If Grafana can already show Kubernetes, node, and namespace-level signals in AKS, then the cloud foundation, cluster monitoring stack, and shared platform path are working.

  • bootstrap the shared Prometheus and Grafana stack
  • confirm built-in dashboards render real AKS node and namespace data
  • use curated platform dashboards only as fast proof, not as the end state
Stage 2

Then go deeper into application signals

Once the platform path is stable, observability becomes more application-centered: request metrics, business-health dashboards, logs, and alerting contracts.

  • service-specific metrics such as request volume, error split, and DB pool usage
  • logs and alerting become first-class operational assets
  • the backend dashboard becomes one proof point inside a broader application-observability layer
Auto-created Grafana dashboards in AKS
Stage 1 proof: Grafana already exposes useful AKS dashboards out of the box. That is enough to prove the shared stack is alive.
Grafana node explorer dashboard
Node-level visibility is a strong platform signal: infrastructure, exporters, and Prometheus ingestion are all working together.
Custom backend dashboard in Grafana
Stage 2 direction: once the base is proven, the next layer is application metrics, then logs, then alerts around real service behavior.

IaC all the way to observability

The platform is not assembled manually from the console. Terraform creates the cloud and Kubernetes foundation, Helm reconciles the shared stack, and the dashboards live as versioned JSON in the repo. Kustomize turns them into Kubernetes ConfigMaps, and Grafana mounts them through Helm values.

Helm mounts into Grafana
infrastructure/azure/terraform/aks.tf
resource "azurerm_kubernetes_cluster" "aks" {
  name                = var.aks_cluster_name
  location            = azurerm_resource_group.aks.location
  resource_group_name = azurerm_resource_group.aks.name
  dns_prefix          = var.dns_prefix

  default_node_pool {
    name       = "default"
    node_count = var.node_count
    vm_size    = var.vm_size
  }
}

infrastructure/azure/terraform/postgresql.tf
resource "azurerm_postgresql_flexible_server" "app" {
  name                = var.postgres_server_name
  resource_group_name = azurerm_resource_group.aks.name
  sku_name            = var.postgres_sku_name
}
platform/kubernetes-resources/observability/grafana/dashboards/payment-exception-review-platform-runtime.json
{
  "title": "Payment Exception Review - Platform Runtime",
  "uid": "payment-exception-review-platform-runtime",
  "timezone": "utc",
  "panels": [
    {
      "title": "Dev namespace",
      "type": "stat",
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "targets": [
        {
          "editorMode": "code",
          "expr": "up{namespace=\"payment-exception-review-local\"}",
          "refId": "A"
        }
      ]
    }
  ]
}
platform/kubernetes-resources/observability/grafana/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: monitoring

configMapGenerator:
  - name: payment-exception-review-platform-runtime-dashboard
    files:
      - payment-exception-review-platform-runtime.json=dashboards/payment-exception-review-platform-runtime.json

  - name: kubernetes-networking-namespace-pods-curated
    files:
      - kubernetes-networking-namespace-pods-curated.json=dashboards/kubernetes-networking-namespace-pods-curated.json

generatorOptions:
  disableNameSuffixHash: true
platform/kubernetes-resources/observability/grafana/kube-prometheus-stack-grafana-values.yaml
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: payment-exception-review
          folder: Payment Exception Review
          type: file
          options:
            path: /var/lib/grafana/dashboards/payment-exception-review

  extraConfigmapMounts:
    - name: payment-exception-review-platform-runtime
      configMap: payment-exception-review-platform-runtime-dashboard
      mountPath: /var/lib/grafana/dashboards/payment-exception-review/payment-exception-review-platform-runtime.json
platform/kubernetes-resources/observability/scripts/cluster/install_shared_observability_stack.sh
# Custom dashboard ConfigMaps must exist before Grafana starts.
"$SYNC_RELIABILITY_DASHBOARDS_SCRIPT"
kubectl apply -k "$DASHBOARD_KUSTOMIZE_DIR"

HELM_VALUES_ARGS=(
  -f "$VALUES_FILE_PROMETHEUS"
  -f "$VALUES_FILE_GRAFANA"
  -f "$VALUES_FILE_ALERTMANAGER"
  -f "$TEMP_VALUES"
)

helm upgrade --install "$RELEASE_NAME" \
  prometheus-community/kube-prometheus-stack \
  --namespace "$MONITORING_NAMESPACE" \
  --create-namespace \
  "${HELM_VALUES_ARGS[@]}" \
  --wait \
  --timeout "$OBSERVABILITY_HELM_TIMEOUT"

Troubleshooting is part of the asset, not an afterthought

The strongest operational signal in this repo is that the real failure modes were captured and explained.

Scenario 1 evidence

Helm was blocked, not just slow

The release did not merely take longer than expected. Helm stayed in pending-upgrade, returned context deadline exceeded, and Grafana pods were still not initializing.

STATUS: pending-upgrade
Error: context deadline exceeded

kubectl get pods -n monitoring
kube-prometheus-stack-grafana-...   0/3   PodInitializing
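Helm reports an interrupted operation through its pending-* release statuses, which is what separates "blocked" from "still deploying slowly". The sketch below assumes the status string has already been read from `helm status`; the helper name `release_is_blocked` is illustrative, and the right recovery step depends on the release history.

```shell
#!/bin/sh
# Illustrative helper: a release left in any pending-* state means an
# earlier install, upgrade, or rollback never completed.
release_is_blocked() {
  case "$1" in
    pending-install|pending-upgrade|pending-rollback) return 0 ;;
    *) return 1 ;;
  esac
}

status="pending-upgrade"   # value taken from the evidence above
if release_is_blocked "$status"; then
  echo "release is blocked in state: $status"
  echo "inspect revisions with: helm history <release> -n monitoring"
fi
```

A status of `deployed` or `failed` falls through to the non-blocked branch, so the check only fires for releases that need intervention before another upgrade can proceed.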
Scenario 4 evidence

Grafana was the only unhealthy component

Prometheus, Alertmanager, and the operator were healthy, but Grafana stayed stuck in Init:CrashLoopBackOff. The init container logs showed permission errors under /var/lib/grafana.

chown: /var/lib/grafana/pdf: Permission denied
chown: /var/lib/grafana/png: Permission denied
chown: /var/lib/grafana/csv: Permission denied
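The init container output can be reduced to the list of paths it failed to chown, which makes the failure easier to compare across restarts. A minimal sketch: `denied_paths` is an illustrative helper, and the sample input is the three log lines shown above.

```shell
#!/bin/sh
# Illustrative helper: read chown errors on stdin and print only the
# paths that were denied.
denied_paths() {
  sed -n 's/^chown: \(.*\): Permission denied$/\1/p'
}

denied_paths <<'EOF'
chown: /var/lib/grafana/pdf: Permission denied
chown: /var/lib/grafana/png: Permission denied
chown: /var/lib/grafana/csv: Permission denied
EOF
```

Against a live cluster the input would come from `kubectl logs` on the Grafana pod's init container, with the pod and container names taken from `kubectl describe pod`.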
Port-forward and curl proof for the Helm application
Fix-oriented proof: once the platform and application path are healthy again, port-forwarding and curl checks confirm the service is reachable and behaving correctly.

Stage 1 outcome

Executive summary: one governed delivery path, one replayable bootstrap, one measurable cloud-cost story, and one credible internal service running on AKS.

  • Bootstrap: 20 min
  • Cost: < CA$4 per month
  • Main entrypoint: 1 script
  • Teardown: 13 min
  • 3-team ownership model
  • 11 GitHub Actions workflows
  • AKS + managed PostgreSQL
  • Dashboards as code
  • 7 troubleshooting scenarios
  • 125h34 recorded build effort

The core result is operational simplicity with enterprise structure. More than 125 hours of recorded Stage 1 build work are compressed into a replayable path that bootstraps the Terraform backend, AKS foundation, Kubernetes runtime, and shared observability baseline, validates the service path, and then tears the cloud runtime down when it is not needed.

Executive takeaways

  • Time: full bootstrap is replayable in about 20 minutes instead of spread across manual team handoffs.
  • Effort: the Stage 1 platform represents 125h34 of recorded project work translated into a repeatable delivery foundation.
  • Cost: the demo footprint is forecast below CA$4/month when the runtime is destroyed after use.
  • Governance: Infrastructure, Platform, and Application responsibilities are separated in code and workflows.
  • Evidence: release, package, dashboards, runbooks, and teardown path are all captured as project assets.
GitHub release proof for Stage 1 v1.0.0
Release proof

Stage 1 is frozen as a versioned release

The release points to stage1-v1.0.0 and commit 5d88fbb, so the proof is stable even after Stage 2 evolves the repository.

GHCR package proof for the Spring Boot service
Package proof

The Spring Boot service is published to GHCR

The container package is public, versioned by image tags, and shows 40 downloads, which is evidence that the CI path produced a reusable artifact.