APP Azure Infrastructure — Spec + Implementation Plan (25 reviews, production-ready)

APP Azure Infrastructure Design

Migration of the APP application from on-premises Docker/MicroK8s to Azure Kubernetes Service with managed backing services.

Context

APP is a Java 17 / Vue.js microservices application for document processing with computational job execution. It currently runs on Docker on BTC servers with MicroK8s for orchestration, GitLab CI/CD, Oracle databases, and Minio object storage.

The organization has decided to migrate APP to Azure with a migration-first philosophy: minimize code changes, get running on Azure, then optimize.

Application Modules

| Module | Tech | Port | Purpose |
|---|---|---|---|
| APP-Backend | Java 17 | 8080 | Core backend, serves the Vue.js frontend |
| APP-Frontend | Vue.js | via backend | User-facing SPA |
| APP-Storage | Java 17 | 8081 (/storage/api/v1/...) | Storage microservice |
| APP-Importer | Java 17 | 8082 (/importer/api/v1/...) | File import, forwards to Storage service |

Supporting workloads:

  • Job pipeline: Job-Initializer → ScriptRunner → Job-Collector executing Matlab Runtime and Python scripts in container pods
  • Jobs are primarily user-triggered (upload/click → run), with occasional scheduled batch runs

Architecture: AKS-Centric

All workloads run in AKS. Managed Azure services provide databases, object storage, secrets, and SFTP ingestion. This approach was chosen because:

  • The team already runs MicroK8s with NGINX ingress — the K8s mental model transfers directly
  • The job pipeline requires K8s anyway (Jobs/CronJobs), so splitting workloads across platforms adds unnecessary complexity
  • At the expected scale (~1k-5k daily users), PaaS alternatives offer no meaningful advantage over AKS

AKS Cluster (one per environment)

Three static node pools plus Karpenter-managed job nodes:

| Node Pool | Workloads | Scaling |
|---|---|---|
| System | NGINX Ingress (App Routing Add-on), cert-manager, External Secrets Operator | Fixed |
| App | APP Backend, Frontend, Storage Svc, Importer Svc | Fixed or mild HPA |
| Monitoring | Grafana, Prometheus, Loki, Alloy (tainted: dedicated=monitoring:NoSchedule) | Fixed |

Job Execution: AKS Node Auto-Provisioning (Karpenter)

User-uploaded scripts (Matlab, Python, etc.) have unpredictable resource needs — from 50MB/2vCPU to 500GB/80vCPU. Static node pools cannot serve this range. Instead, AKS Node Auto-Provisioning (Karpenter) dynamically provisions right-sized VMs based on each job pod's resources.requests.

  • Allowed VM families: D-series (balanced compute), E-series (memory-optimized)
  • Max per node: 96 vCPU, 672 GiB RAM (E96as_v5 ceiling)
  • Scale from zero: nodes are provisioned on demand and terminated when idle (consolidation policy)
  • Taint: workload=job:NoSchedule — only job pods with matching toleration schedule here
  • Resource requests: set per job, either by user input (explicit CPU/RAM) or system estimation — application-level decision, not infrastructure
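
A minimal sketch of the job NodePool that k8s/karpenter/job-nodepool.yaml could contain, assuming AKS Node Auto-Provisioning exposes the upstream karpenter.sh NodePool API; the sku-family requirement key, the AKSNodeClass reference, and the limits values are illustrative and should be verified against the AKS NAP documentation and the actual CRDs in the repo.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: jobs
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: jobs                              # defined in job-nodeclass.yaml
      taints:
        - key: workload
          value: job
          effect: NoSchedule
      requirements:
        - key: karpenter.azure.com/sku-family   # assumed Azure-specific label for VM families
          operator: In
          values: ["D", "E"]
  limits:
    cpu: "384"                                  # caps total capacity of the whole pool, not one node (tune to expected concurrency)
    memory: 2688Gi
  disruption:
    consolidationPolicy: WhenEmpty              # release nodes once jobs finish (scale back to zero)
    consolidateAfter: 5m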

Ingress: AKS App Routing Add-on (Managed NGINX)

Azure-managed NGINX ingress controller. Chosen because:

  • Same NGINX Ingress syntax the team already uses on MicroK8s — configs transfer as-is
  • Azure manages the controller lifecycle (upgrades, scaling, HA)
  • Free — included with AKS
  • Auto-integrates with Azure DNS and Key Vault for TLS certificates
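
For illustration, a minimal Ingress showing how an existing MicroK8s NGINX config maps onto the add-on; the only AKS-specific change is the ingressClassName created by App Routing. Host names, service names, and the example annotation are placeholders.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: app
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: 100m   # existing NGINX annotations carry over as-is
spec:
  ingressClassName: webapprouting.kubernetes.azure.com  # IngressClass created by the App Routing add-on
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-backend
                port:
                  number: 8080
          - path: /storage
            pathType: Prefix
            backend:
              service:
                name: app-storage
                port:
                  number: 8081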

Observability: Grafana + Prometheus + Loki

The existing Grafana + Prometheus stack migrates into AKS on a dedicated monitoring node pool (tainted dedicated=monitoring:NoSchedule). Loki is added for log aggregation, Alloy for log collection. Azure Container Insights (basic) supplements with API server logs in Azure Portal.

| Component | Purpose | Deployment |
|---|---|---|
| Prometheus | Metrics scraping + alerting | AKS monitoring pool |
| Grafana | Dashboards | AKS monitoring pool |
| Loki | Log aggregation (Azure Blob backend) | AKS monitoring pool |
| Alloy | Log collection (DaemonSet) | All nodes |
| Container Insights | API server + audit logs | Azure Log Analytics |
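
As an illustration of the Azure Blob backend mentioned above, a fragment of what monitoring/loki-values.yaml might contain, assuming the grafana/loki Helm chart; the exact key names and the workload-identity flag differ between chart versions and must be verified.

loki:
  storage:
    type: azure
    bucketNames:
      chunks: loki-chunks            # container created by the storage module
    azure:
      accountName: <internal-storage-account>
      useFederatedToken: true        # authenticate via AKS Workload Identity (assumed flag)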

Networking & Security

VNet Layout: 10.0.0.0/16 (West Europe)

| Subnet | CIDR | Purpose | Access Control |
|---|---|---|---|
| aks-subnet | 10.0.0.0/20 | All AKS node pools | NSG: 80/443 from Internet + LB health probes |
| db-subnet | 10.0.16.0/24 | PostgreSQL Flexible Server (VNet-integrated) | NSG: 5432 from aks-subnet only |
| storage-pe-subnet | 10.0.17.0/24 | Blob Storage private endpoint | NSG: from aks-subnet only |
| sftp-subnet | 10.0.18.0/24 | Blob SFTP endpoint (public-facing) | NSG: port 22 from whitelisted IPs only |

Traffic Flow

Internet → Azure Load Balancer → NGINX Ingress (App Routing) → K8s Services
                                                                    ↓
                                                    Backend / Storage / Importer
                                                         ↓           ↓
                                                    PostgreSQL    Blob Storage

All managed services connect via private endpoints or VNet integration. No public IPs except the Ingress load balancer. SFTP endpoint is public but restricted to whitelisted source IPs.

Encryption

  • TLS 1.2+ on all connections (ingress, database, blob storage)
  • PostgreSQL: SSL enforced
  • Blob Storage: AES-256 encryption at rest (Azure-managed keys)
  • AKS: etcd encrypted at rest

Identity & Access

  • AKS Workload Identity for all service-to-Azure authentication (Key Vault, Blob Storage, PostgreSQL)
  • No passwords or connection strings in environment variables or ConfigMaps
  • Azure RBAC for cluster administrator access
  • Separate K8s namespaces per application concern (app, jobs, monitoring)
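
To make the "no connection strings in env vars" point concrete, a sketch of how a service identity is wired under Workload Identity, assuming a user-assigned managed identity is federated with the backend's ServiceAccount; names and the client ID are placeholders.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: app
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"
---
# The pod template then opts in with a label (typically set via Helm values):
#   metadata:
#     labels:
#       azure.workload.identity/use: "true"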

Network Policies

  • Default deny all ingress and egress
  • Explicit allow-list:
    • Backend ↔ Storage Svc ↔ Importer Svc (inter-service)
    • All services → PostgreSQL (5432)
    • Importer + Storage Svc → Blob Storage
    • Job pods: cluster-internal only (deny internet unless specifically required)
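
A sketch of the default-deny policy and one allow rule from the list above; the namespace and pod label selectors are assumptions, and the real manifests live under k8s/network-policies/ in app-deployment.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: app
spec:
  podSelector: {}                    # applies to every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-backend-to-storage
  namespace: app
spec:
  podSelector:
    matchLabels:
      app: app-storage
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: app-backend
      ports:
        - protocol: TCP
          port: 8081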

SFTP Access

  • Azure Blob Storage SFTP endpoint with local user accounts
  • SSH key authentication only (no passwords)
  • IP whitelist via NSG on sftp-subnet

Data & Storage

PostgreSQL Flexible Server

Single Azure Database for PostgreSQL Flexible Server instance per environment, hosting three databases:

| Database | Schemas | Used By |
|---|---|---|
| app_db | APP_ADMIN, APP_USER | Backend service |
| storage_db | STORAGE_ADMIN, STORAGE_USER | Storage service |
| importer_db | IMPORTER_ADMIN, IMPORTER_USER | Importer service |

Sizing (start small, scale via tfvars):

| Environment | SKU | Backup Retention | Redundancy |
|---|---|---|---|
| Dev | Burstable B1ms | 7 days | None |
| PreProd | Burstable B1ms | 7 days | None |
| Prod | Burstable B1ms | 35 days | None (geo-redundant backup requires GP tier) |

Scale-up path: change postgres_sku_name to GP_Standard_D2s_v3 and enable postgres_geo_redundant_backup — both require General Purpose tier.

Oracle → PostgreSQL migration:

  • Liquibase changelogs require one-time review for Oracle-specific SQL (sequences, data types, PL/SQL)
  • JPA/Hibernate dialect switch from Oracle12cDialect to PostgreSQLDialect
  • Liquibase migrations run on each service deployment as a Helm pre-upgrade hook Job (replacing today's init-container pattern; see Key Decisions under CI/CD)
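
As a sketch of the dialect and datasource switch, an application.yml fragment might look like the following; the actual property names, database names, and changelog path are listed as assumptions to validate in the implementation plan below.

spring:
  datasource:
    url: jdbc:postgresql://<postgres-fqdn>:5432/app_db?sslmode=require
    username: ${DB_USERNAME}
    password: ${DB_PASSWORD}
  jpa:
    database-platform: org.hibernate.dialect.PostgreSQLDialect   # was Oracle12cDialect
  liquibase:
    change-log: classpath:/liquibase/changelog.xml               # assumed path, verify against the image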

Azure Blob Storage

S3-compatibility layer enabled initially for migration-first approach — existing Java S3 client code works with an endpoint swap. Native Azure Blob SDK adoption deferred to a later phase.

| Container | Purpose | Access |
|---|---|---|
| sftp-ingest | Landing zone for SFTP uploads | SFTP users write, Importer reads |
| shared-storage | Internal object storage (replaces Minio) | Storage Svc + Importer read/write |
| job-artifacts | Matlab/Python job inputs and outputs | Job pods read/write |

Storage redundancy:

  • All environments: LRS (locally redundant) — upgrade Prod to ZRS when scale justifies it

No automatic deletion lifecycle policies. Data is retained until explicitly removed.

Repository Structure

Repositories are split by concern and change cadence. The application consists of multiple existing source repos (not listed here — they predate this infrastructure design). This section defines the infrastructure and deployment repos, plus the integration contract that each application repo must follow.

Infrastructure & Deployment Repos (created by this plan)

| Repo | Purpose | ArgoCD | GH Actions | Owner |
|---|---|---|---|---|
| app-infrastructure | Terraform (Azure resources) | No | Plan/apply | Infra |
| app-deployment | Helm charts + K8s manifests + monitoring (GitOps state) | Yes | Helm lint on PR | Shared |
| app-job-images | Matlab Runtime + Python runner Dockerfiles | No | Build runner images | App/Data |

Application Source Repos (existing, not managed by this plan)

The application consists of multiple existing repos. Each repo that produces a deployable container image must follow the integration contract below. The exact repo list should be filled in during onboarding.

| Repo | Image(s) Produced | Helm Values File | Notes |
|---|---|---|---|
| <TBD: backend repo> | app-backend | values-backend.yaml | Serves Vue.js frontend |
| <TBD: storage repo> | app-storage | values-storage.yaml | |
| <TBD: importer repo> | app-importer | values-importer.yaml | |
| <TBD: additional repos> | <image-name> | values-<name>.yaml | Add rows as needed |

Integration Contract for Application Repos

Each application source repo must:

  1. Build an OCI image and push to GHCR tagged with the git SHA
  2. Update its image tag in app-deployment via cross-repo push (using a GitHub App token with write access to app-deployment)
  3. Include [skip ci] in the commit message to avoid triggering the app-deployment lint workflow
  4. Own its Helm values file in app-deployment (e.g., values-backend.yaml) — this is where service-specific config lives (ports, env vars, resource requests, Liquibase config)
  5. Follow the branch strategy: develop → Dev, main → PreProd, manual ArgoCD sync → Prod

A reusable GitHub Actions workflow template for steps 1-3 should be provided in app-deployment under .github/workflow-templates/ for application repos to adopt.
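
A hedged sketch of what that reusable template could look like; the action versions, the DEPLOYMENT_REPO_TOKEN secret name, the org placeholder, and the .image.tag key in the values file are assumptions to adjust per repo.

name: build-and-update-tag
on:
  push:
    branches: [develop, main]
env:
  IMAGE_NAME: app-backend                      # per-repo value
  HELM_VALUES_FILE: values-backend.yaml        # per-repo value
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/${{ github.repository_owner }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
      - name: Update image tag in app-deployment
        env:
          GH_TOKEN: ${{ secrets.DEPLOYMENT_REPO_TOKEN }}   # GitHub App token with write access (assumed secret name)
        run: |
          git clone "https://x-access-token:${GH_TOKEN}@github.com/<org>/app-deployment.git"
          cd app-deployment
          yq -i ".image.tag = \"${GITHUB_SHA}\"" "helm/app-service/${HELM_VALUES_FILE}"
          git config user.name "ci-bot" && git config user.email "ci-bot@example.com"
          git commit -am "chore(${IMAGE_NAME}): bump image tag to ${GITHUB_SHA} [skip ci]"
          git push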

CI/CD: GitHub Enterprise + Actions + ArgoCD

Pipeline Flow

App source repo: Push → Build & Test → Build OCI Image → Push to GHCR → Cross-repo update tag in app-deployment
app-deployment repo: ArgoCD detects tag change → syncs to AKS
app-infrastructure repo: Terraform plan/apply (separate lifecycle)

Environment Promotion

| Trigger | Target |
|---|---|
| Push to feature branch (any app repo) | Build + test only (no deploy) |
| Merge to develop (any app repo) | Push image, update tag in app-deployment → ArgoCD auto-syncs Dev |
| Merge to main (any app repo) | Push image, update tag in app-deployment → ArgoCD auto-syncs PreProd |
| Manual sync in ArgoCD | Promote to Prod |

Key Decisions

  • Container registry: GitHub Container Registry (GHCR), images tagged with git SHA
  • Deployment method: ArgoCD watches app-deployment repo for Helm value changes and auto-syncs. Application repos build images and push updated tags to app-deployment via GitHub App token (cross-repo).
  • Secrets: GitHub Actions OIDC → Azure Workload Identity Federation (no stored Azure credentials in GitHub)
  • Database migrations: Liquibase runs as a Helm pre-upgrade hook Job (not an init container) to prevent race conditions in multi-replica deploys; see the sketch after this list
  • Job RBAC: Backend service account has Role + RoleBinding to create/manage K8s Jobs in jobs namespace
  • Job container images: Matlab Runtime and Python runner images in separate app-job-images repo (different build lifecycle)
  • Branch strategy: develop + main (matches current team workflow)
  • Adding new services: create a new values-<name>.yaml in app-deployment, add an ArgoCD Application manifest, and wire the source repo's CI to push image tags — no infrastructure changes needed
  • K8s backup: Velero with daily scheduled backups to Azure Blob Storage (168h retention)
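
For the Liquibase decision above, a sketch of the pre-upgrade hook Job that templates/migration-job.yaml implies; the helper names, the liquibase entrypoint, and the database secret key are assumptions to confirm against the real chart and images.

apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "app-service.fullname" . }}-migrations
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "0"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: liquibase
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          command: ["liquibase", "update"]               # assumed; the service image may run migrations differently
          envFrom:
            - secretRef:
                name: {{ .Values.database.secretName }}  # assumed values key for DB credentials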

Environments & Cost

Node Sizing (start small, scale via tfvars)

Dev and PreProd start with single-node pools. Prod runs 2 nodes each in the system and app pools for zero-downtime maintenance. Scale up by changing tfvars.

| | Dev | PreProd | Prod |
|---|---|---|---|
| AKS control plane | Standard (required by Karpenter) | Standard | Standard |
| AKS system pool | 1x Standard_D2s_v5 | 1x Standard_D2s_v5 | 2x Standard_D2s_v5 |
| AKS app pool | 1x Standard_D2s_v5 | 1x Standard_D2s_v5 | 2x Standard_D2s_v5 |
| AKS monitoring pool | 1x Standard_D2s_v5 | 1x Standard_D2s_v5 | 1x Standard_D2s_v5 |
| AKS job nodes | Karpenter (0→N on demand) | Karpenter (0→N) | Karpenter (0→N) |
| PostgreSQL | Burstable B1ms | Burstable B1ms | Burstable B1ms |
| Blob Storage | LRS | LRS | ZRS |
| Key Vault | Standard | Standard | Standard |

Job worker pods get their own PVCs via job-scratch StorageClass (StandardSSD_LRS, WaitForFirstConsumer binding for zone-awareness with Karpenter).
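
A sketch of that job-scratch StorageClass, using the Azure Disk CSI driver; parameter names match the in-tree driver defaults and should hold on any recent AKS version.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: job-scratch
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_LRS
reclaimPolicy: Delete                    # scratch data is disposable
volumeBindingMode: WaitForFirstConsumer  # bind only after Karpenter has placed the pod in a zone
allowVolumeExpansion: true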

Monthly Cost Estimate (West Europe, pay-as-you-go)

| Component | Dev | PreProd | Prod |
|---|---|---|---|
| AKS nodes (system+app+monitoring) | ~€120 | ~€120 | ~€180 |
| AKS control plane (Standard) | ~€60 | ~€60 | ~€60 |
| PostgreSQL | ~€15 | ~€15 | ~€25 |
| Blob Storage | ~€5 | ~€5 | ~€12 |
| Key Vault | ~€5 | ~€5 | ~€5 |
| Load Balancer | ~€20 | ~€20 | ~€20 |
| NAT Gateway | ~€40 | ~€40 | ~€40 |
| Log Analytics (basic) | ~€10 | ~€10 | ~€10 |
| Total | ~€275/mo | ~€275/mo | ~€352/mo |

Notes:

  • Current scale target: ~50-100 daily users, built ready to scale
  • Job pool nodes only incur cost when jobs are running (auto-scale from zero)
  • Grafana/Prometheus/Loki run in-cluster on the monitoring pool, so there is no extra Azure service cost beyond the node itself
  • GHCR storage is included with GitHub Enterprise
  • Scale-up path: change SKU/node counts in tfvars, terraform apply — no re-architecture needed
  • When scaling: add nodes, move PostgreSQL to General Purpose, enable geo-redundant backup

Out of Scope

  • Application code changes beyond Oracle→PostgreSQL dialect and Minio→Blob endpoint swap
  • Azure Blob SDK migration (deferred — S3-compat layer used initially)
  • WAF (Web Application Firewall) — can be added later via Azure Front Door if needed
  • Multi-region / disaster recovery beyond geo-redundant database backups
  • User authentication / identity provider integration (assumed handled at application level)

APP Azure Infrastructure Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Provision Azure infrastructure for the APP application migration from on-prem Docker/MicroK8s to AKS with managed backing services, deployed via Terraform, Helm, and ArgoCD.

Architecture: AKS-Centric with dedicated node pools (system, app, monitoring) + Karpenter NAP for jobs, managed PostgreSQL Flexible Server, Azure Blob Storage (dual accounts), ArgoCD GitOps deployment. Three environments: Dev, PreProd, Prod.

Tech Stack: Terraform (AzureRM provider), Helm 3, ArgoCD, GitHub Actions, Azure CLI (az)

Spec: docs/superpowers/specs/2026-04-13-hera-azure-infrastructure-design.md

Reviewed: 6 rounds × 4 cloud specialists + 1 backend architect = 25 reviews, 40+ fixes applied.


Handover Instructions

This plan was designed WITHOUT access to the application source code. The infrastructure side (Tasks 1-9) is complete and reviewed. The application-facing side (Tasks 10-13) uses assumptions that need validation against the real codebase.

What to execute as-is

Tasks 1-9 (infrastructure + K8s base config) are infra-only — they create Azure resources, networking, and K8s manifests with no dependency on application code. These have been through 25 specialist reviews and can be executed immediately:

  • Task 1: Terraform scaffolding (app-infrastructure repo)
  • Task 2: Networking module (VNet, subnets, NSGs, NAT Gateway)
  • Task 3: Database module (PostgreSQL Flexible Server)
  • Task 4: Storage module (dual Blob Storage accounts)
  • Task 5: Key Vault module
  • Task 6: AKS module (cluster, node pools, Karpenter NAP)
  • Task 7: Root module wiring + environment tfvars
  • Task 8: K8s base config (namespaces, network policies, RBAC, Karpenter CRDs)
  • Task 9: Observability stack (Prometheus, Grafana, Loki, Alloy, Velero)

What needs adaptation after reviewing source code

Tasks 10-13 need adjustment based on the actual application repos:

  • Task 10 (Helm chart): Verify service ports match reality (assumed 8080/8081/8082). Check health endpoint paths (/actuator/health/* assumed — may differ). Adjust resource requests based on actual service profiles. Map Liquibase changelog paths to real locations in the Docker images.
  • Task 11 (Terraform CI): Execute as-is — no app dependency.
  • Task 12 (App CI template): Adapt the workflow template to each real source repo. Fill in actual IMAGE_NAME, DOCKER_CONTEXT, and HELM_VALUES_FILE values. Add repo-specific build/test commands.
  • Task 13 (Bootstrap): Execute after Tasks 1-9 are applied. App deployment validation (Helm install) depends on Task 10 adjustments.

Assumptions to validate against source code

| Assumption | Where used | What to check |
|---|---|---|
| Services listen on 8080, 8081, 8082 | Helm values, network policies | Actual ports in Dockerfiles / Spring Boot config |
| Health endpoints at /actuator/health/liveness and /readiness | deployment.yaml | Actual health check paths |
| Spring Boot with SPRING_DATASOURCE_URL env var | Helm values | Actual env var names for DB connection |
| Liquibase changelogs at /liquibase/changelog.xml | migration-job.yaml | Actual changelog location in Docker image |
| Backend creates K8s Jobs for script execution | RBAC, network policies | How job creation actually works in the code |
| Frontend bundled in backend Docker image | No separate frontend deployment | Verify — may need separate Helm values |
| 3 databases: app_db, storage_db, importer_db | Terraform database module | Actual database names and schema requirements |
| Services communicate via HTTP (Backend → Storage, Importer → Storage) | Network policies | Actual inter-service call patterns |

Adding new services

When onboarding an additional application repo:

  1. Add a values-<name>.yaml in app-deployment/helm/app-service/
  2. Add an ArgoCD Application manifest in app-deployment/argocd/ (see the sketch after this list)
  3. Copy the CI workflow template to the source repo, set 3 env vars
  4. If the service needs DB access, add a database in the Terraform database module and a network policy egress rule
  5. No other infrastructure changes needed
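
Step 2 amounts to an ArgoCD Application like the following sketch; the repo URL, project, and destination namespace are placeholders, and syncPolicy.automated applies only to Dev/PreProd since Prod is promoted by manual sync.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-<name>
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<org>/app-deployment.git
    targetRevision: main
    path: helm/app-service
    helm:
      valueFiles:
        - values.yaml
        - values-<name>.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: app
  syncPolicy:
    automated:
      prune: true
      selfHeal: true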

Repository Split

| Repo | Purpose | ArgoCD | GH Actions | Owner |
|---|---|---|---|---|
| app-infrastructure | Terraform (Azure resources) | No | Plan/apply | Infra |
| app | Java 17 + Vue.js source code | No | Build, test, push images | App |
| app-deployment | Helm + K8s manifests + monitoring (GitOps state) | Yes | Helm lint on PR | Shared |
| app-job-images | Matlab Runtime + Python runner Dockerfiles | No | Build runner images | App/Data |

File Structures

Repo 1: app-infrastructure

app-infrastructure/
├── terraform/
│   ├── main.tf                      # Root module — wires all child modules
│   ├── variables.tf                 # Root input variables
│   ├── outputs.tf                   # Root outputs
│   ├── providers.tf                 # AzureRM + AzAPI provider config
│   ├── versions.tf                  # Required providers + Terraform version
│   ├── backend.tf                   # Azure Storage remote state backend (partial config)
│   ├── environments/
│   │   ├── dev.tfvars
│   │   ├── preprod.tfvars
│   │   └── prod.tfvars
│   └── modules/
│       ├── networking/              # VNet, subnets, NSGs, NAT Gateway, private DNS
│       ├── database/                # PostgreSQL Flexible Server, databases
│       ├── storage/                 # Dual Blob Storage (internal + SFTP), private endpoint
│       ├── keyvault/                # Key Vault + private endpoint
│       └── aks/                     # AKS cluster, node pools, NAP, Container Insights
├── .github/workflows/
│   └── terraform.yml                # Plan on PR, apply on merge
└── README.md

Repo 2: Application Source Repos (existing — NOT created by this plan)

The application consists of multiple existing repos (backend, frontend, storage, importer, etc.). Each repo that produces a deployable container image must adopt the integration contract:

  1. Build OCI image → push to GHCR tagged with git SHA
  2. Cross-repo push updated image tag to app-deployment (via GitHub App token)
  3. Include [skip ci] in the tag-update commit

A reusable workflow template is provided in app-deployment/.github/workflow-templates/ for adoption.

Fill in the actual repo names and image mappings during onboarding:

| Repo | Image(s) | Helm Values File |
|---|---|---|
| <TBD> | app-backend | values-backend.yaml |
| <TBD> | app-storage | values-storage.yaml |
| <TBD> | app-importer | values-importer.yaml |
| ... | ... | ... |

Repo 3: app-deployment (ArgoCD watches this)

app-deployment/
├── helm/
│   └── app-service/
│       ├── Chart.yaml
│       ├── values.yaml
│       ├── values-backend.yaml
│       ├── values-storage.yaml
│       ├── values-importer.yaml
│       └── templates/
│           ├── _helpers.tpl
│           ├── deployment.yaml
│           ├── service.yaml
│           ├── ingress.yaml
│           ├── hpa.yaml
│           ├── serviceaccount.yaml
│           └── migration-job.yaml   # Liquibase as pre-upgrade hook
├── k8s/
│   ├── namespaces.yaml
│   ├── pdbs.yaml
│   ├── resource-quotas.yaml
│   ├── rbac/
│   │   └── job-creator.yaml         # Role + RoleBinding for Backend → jobs namespace
│   ├── karpenter/
│   │   ├── job-nodepool.yaml
│   │   ├── job-nodeclass.yaml
│   │   └── job-storageclass.yaml
│   └── network-policies/
│       ├── default-deny.yaml
│       ├── app-services.yaml
│       ├── db-access.yaml
│       ├── blob-access.yaml
│       ├── jobs.yaml
│       └── monitoring.yaml
├── monitoring/
│   ├── prometheus-values.yaml
│   ├── loki-values.yaml
│   └── install.sh
├── argocd/
│   ├── app-backend.yaml             # ArgoCD Application per service
│   ├── app-storage.yaml
│   ├── app-importer.yaml
│   ├── base-config.yaml             # K8s base manifests (namespaces, netpol, etc.)
│   └── monitoring.yaml              # Observability stack
├── .github/
│   ├── workflows/
│   │   └── lint.yml                 # Helm lint on PR
│   └── workflow-templates/
│       └── build-and-update-tag.yml # Reusable workflow for app source repos
└── README.md

Repo 4: app-job-images

app-job-images/
├── matlab-runner/
│   └── Dockerfile
├── python-runner/
│   ├── Dockerfile
│   └── requirements.txt
├── .github/workflows/
│   └── build-runners.yml            # Build + push to GHCR on change
└── README.md

Task-to-repo mapping: Tasks 1-7 → app-infrastructure. Tasks 8-10 → app-deployment. Task 11 → app-infrastructure. Task 12 → app-deployment. Task 13 → cross-repo bootstrap.


Task 1: Repository Scaffolding and Terraform Bootstrap

Files:

  • Create: terraform/versions.tf

  • Create: terraform/providers.tf

  • Create: terraform/backend.tf

  • Create: terraform/variables.tf

  • Create: terraform/main.tf (empty root, populated in Task 7)

  • Create: terraform/outputs.tf (empty root, populated in Task 7)

  • Create: .gitignore

  • Step 1: Create the repository and .gitignore

mkdir -p app-infrastructure/terraform/modules app-infrastructure/terraform/environments
cd app-infrastructure
git init

.gitignore:

# Terraform
*.tfstate
*.tfstate.*
.terraform/
crash.log
override.tf
override.tf.json
*_override.tf
*_override.tf.json
*.tfplan

# IDE
.idea/
.vscode/
*.swp

# OS
.DS_Store
  • Step 2: Write versions.tf

terraform/versions.tf:

terraform {
  required_version = ">= 1.9.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
    azapi = {
      source  = "azure/azapi"
      version = "~> 2.0"
    }
  }
}
  • Step 3: Write providers.tf

terraform/providers.tf:

provider "azurerm" {
  features {
    key_vault {
      purge_soft_delete_on_destroy = false
    }
    resource_group {
      prevent_deletion_if_contains_resources = true
    }
  }

  subscription_id = var.subscription_id
}

provider "azapi" {}
  • Step 4: Write backend.tf

terraform/backend.tf:

# Partial backend configuration. The state file key is passed per-environment
# at init time via: terraform init -backend-config "key=app-${ENV}.tfstate"
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-app-tfstate"
    storage_account_name = "stappinfratfstate"
    container_name       = "tfstate"
  }
}
  • Step 5: Write root variables.tf with shared variables

terraform/variables.tf:

variable "subscription_id" {
  description = "Azure subscription ID"
  type        = string
}

variable "environment" {
  description = "Environment name: dev, preprod, or prod"
  type        = string
  validation {
    condition     = contains(["dev", "preprod", "prod"], var.environment)
    error_message = "Environment must be dev, preprod, or prod."
  }
}

variable "location" {
  description = "Azure region"
  type        = string
  default     = "westeurope"
}

variable "project_name" {
  description = "Project identifier used in resource naming"
  type        = string
  default     = "app"
}

variable "sftp_allowed_ips" {
  description = "List of IP addresses allowed to connect via SFTP"
  type        = list(string)
  default     = []
}

variable "aks_system_pool_count" {
  description = "Number of nodes in the AKS system pool"
  type        = number
  default     = 1
}

variable "aks_app_pool_count" {
  description = "Number of nodes in the AKS app pool"
  type        = number
  default     = 1
}

variable "aks_app_pool_max_count" {
  description = "Max nodes for AKS app pool autoscaler (0 = no autoscaling)"
  type        = number
  default     = 0
}

variable "node_auto_provisioning_enabled" {
  description = "Enable Karpenter-based Node Auto-Provisioning for job workloads"
  type        = bool
  default     = true
}

variable "postgres_sku_name" {
  description = "PostgreSQL Flexible Server SKU"
  type        = string
  default     = "B_Standard_B1ms"
}

variable "postgres_backup_retention_days" {
  description = "PostgreSQL backup retention in days"
  type        = number
  default     = 7
}

variable "postgres_geo_redundant_backup" {
  description = "Enable geo-redundant backup for PostgreSQL"
  type        = bool
  default     = false
}

variable "storage_replication_type" {
  description = "Blob Storage replication: LRS or ZRS"
  type        = string
  default     = "LRS"
  validation {
    condition     = contains(["LRS", "ZRS"], var.storage_replication_type)
    error_message = "Must be LRS or ZRS."
  }
}
  • Step 6: Create empty root main.tf and outputs.tf

terraform/main.tf:

# Root module — child modules wired in Task 7
locals {
  resource_prefix = "${var.project_name}-${var.environment}"
  common_tags = {
    project     = var.project_name
    environment = var.environment
    managed_by  = "terraform"
  }
}

terraform/outputs.tf:

# Root outputs — populated in Task 7
  • Step 7: Validate the bootstrap
cd terraform
terraform init -backend=false && terraform validate && terraform fmt -check -recursive

Expected: all pass with no errors.

  • Step 8: Commit
git add .
git commit -m "feat: scaffold terraform project with providers, backend, and root variables"

Task 2: Networking Module

Files:

  • Create: terraform/modules/networking/main.tf

  • Create: terraform/modules/networking/variables.tf

  • Create: terraform/modules/networking/outputs.tf

  • Step 1: Write networking variables.tf

terraform/modules/networking/variables.tf:

variable "resource_prefix" {
  description = "Prefix for resource names"
  type        = string
}

variable "location" {
  description = "Azure region"
  type        = string
}

variable "resource_group_name" {
  description = "Name of the resource group"
  type        = string
}

variable "vnet_address_space" {
  description = "VNet address space"
  type        = list(string)
  default     = ["10.0.0.0/16"]
}

variable "sftp_allowed_ips" {
  description = "IP addresses allowed to connect via SFTP"
  type        = list(string)
  default     = []
}

variable "nat_gateway_enabled" {
  description = "Enable NAT Gateway for AKS subnet egress"
  type        = bool
  default     = true
}

variable "tags" {
  description = "Resource tags"
  type        = map(string)
  default     = {}
}
  • Step 2: Write networking main.tf

terraform/modules/networking/main.tf:

locals {
  aks_subnet_cidr        = "10.0.0.0/20"
  db_subnet_cidr         = "10.0.16.0/24"
  storage_pe_subnet_cidr = "10.0.17.0/24"
  sftp_subnet_cidr       = "10.0.18.0/24"
}

resource "azurerm_virtual_network" "main" {
  name                = "vnet-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  address_space       = var.vnet_address_space
  tags                = var.tags
}

# --- Subnets ---

resource "azurerm_subnet" "aks" {
  name                 = "snet-aks"
  resource_group_name  = var.resource_group_name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = [local.aks_subnet_cidr]
}

resource "azurerm_subnet" "db" {
  name                 = "snet-db"
  resource_group_name  = var.resource_group_name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = [local.db_subnet_cidr]

  delegation {
    name = "postgresql-delegation"
    service_delegation {
      name    = "Microsoft.DBforPostgreSQL/flexibleServers"
      actions = ["Microsoft.Network/virtualNetworks/subnets/join/action"]
    }
  }
}

resource "azurerm_subnet" "storage_pe" {
  name                              = "snet-storage-pe"
  resource_group_name               = var.resource_group_name
  virtual_network_name              = azurerm_virtual_network.main.name
  address_prefixes                  = [local.storage_pe_subnet_cidr]
  private_endpoint_network_policies = "Disabled"
}

resource "azurerm_subnet" "sftp" {
  name                 = "snet-sftp"
  resource_group_name  = var.resource_group_name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = [local.sftp_subnet_cidr]
}

# --- NAT Gateway (for AKS egress) ---

resource "azurerm_public_ip" "nat" {
  count               = var.nat_gateway_enabled ? 1 : 0
  name                = "pip-nat-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  allocation_method   = "Static"
  sku                 = "Standard"
  tags                = var.tags
}

resource "azurerm_nat_gateway" "main" {
  count                   = var.nat_gateway_enabled ? 1 : 0
  name                    = "natgw-${var.resource_prefix}"
  location                = var.location
  resource_group_name     = var.resource_group_name
  sku_name                = "Standard"
  idle_timeout_in_minutes = 10
  tags                    = var.tags
}

resource "azurerm_nat_gateway_public_ip_association" "main" {
  count                = var.nat_gateway_enabled ? 1 : 0
  nat_gateway_id       = azurerm_nat_gateway.main[0].id
  public_ip_address_id = azurerm_public_ip.nat[0].id
}

resource "azurerm_subnet_nat_gateway_association" "aks" {
  count          = var.nat_gateway_enabled ? 1 : 0
  subnet_id      = azurerm_subnet.aks.id
  nat_gateway_id = azurerm_nat_gateway.main[0].id
}

# --- NSGs ---

resource "azurerm_network_security_group" "aks" {
  name                = "nsg-aks-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  tags                = var.tags

  security_rule {
    name                       = "AllowWebInbound"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_ranges    = ["80", "443"]
    source_address_prefix      = "Internet"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "AllowLBProbes"
    priority                   = 110
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "*"
    source_port_range          = "*"
    destination_port_range     = "*"
    source_address_prefix      = "AzureLoadBalancer"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "AllowKubeletAPI"
    priority                   = 120
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "10250"
    source_address_prefix      = "VirtualNetwork"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "AllowNodePortRange"
    priority                   = 130
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "30000-32767"
    source_address_prefix      = "AzureLoadBalancer"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "DenyAllInbound"
    priority                   = 4096
    direction                  = "Inbound"
    access                     = "Deny"
    protocol                   = "*"
    source_port_range          = "*"
    destination_port_range     = "*"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

resource "azurerm_network_security_group" "db" {
  name                = "nsg-db-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  tags                = var.tags

  security_rule {
    name                       = "AllowPostgresFromAKS"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "5432"
    source_address_prefix      = local.aks_subnet_cidr
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "DenyAllInbound"
    priority                   = 4096
    direction                  = "Inbound"
    access                     = "Deny"
    protocol                   = "*"
    source_port_range          = "*"
    destination_port_range     = "*"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

resource "azurerm_network_security_group" "storage_pe" {
  name                = "nsg-storage-pe-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  tags                = var.tags

  security_rule {
    name                       = "AllowFromAKS"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "443"
    source_address_prefix      = local.aks_subnet_cidr
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "DenyAllInbound"
    priority                   = 4096
    direction                  = "Inbound"
    access                     = "Deny"
    protocol                   = "*"
    source_port_range          = "*"
    destination_port_range     = "*"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

resource "azurerm_network_security_group" "sftp" {
  name                = "nsg-sftp-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  tags                = var.tags

  dynamic "security_rule" {
    for_each = length(var.sftp_allowed_ips) > 0 ? [1] : []
    content {
      name                       = "AllowSFTPFromWhitelist"
      priority                   = 100
      direction                  = "Inbound"
      access                     = "Allow"
      protocol                   = "Tcp"
      source_port_range          = "*"
      destination_port_range     = "22"
      source_address_prefixes    = var.sftp_allowed_ips
      destination_address_prefix = "*"
    }
  }

  security_rule {
    name                       = "DenyAllInbound"
    priority                   = 4096
    direction                  = "Inbound"
    access                     = "Deny"
    protocol                   = "*"
    source_port_range          = "*"
    destination_port_range     = "*"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

# --- NSG Associations ---

resource "azurerm_subnet_network_security_group_association" "aks" {
  subnet_id                 = azurerm_subnet.aks.id
  network_security_group_id = azurerm_network_security_group.aks.id
}

resource "azurerm_subnet_network_security_group_association" "db" {
  subnet_id                 = azurerm_subnet.db.id
  network_security_group_id = azurerm_network_security_group.db.id
}

resource "azurerm_subnet_network_security_group_association" "storage_pe" {
  subnet_id                 = azurerm_subnet.storage_pe.id
  network_security_group_id = azurerm_network_security_group.storage_pe.id
}

resource "azurerm_subnet_network_security_group_association" "sftp" {
  subnet_id                 = azurerm_subnet.sftp.id
  network_security_group_id = azurerm_network_security_group.sftp.id
}

# --- Private DNS Zone for PostgreSQL ---

resource "azurerm_private_dns_zone" "postgres" {
  name                = "privatelink.postgres.database.azure.com"
  resource_group_name = var.resource_group_name
  tags                = var.tags
}

resource "azurerm_private_dns_zone_virtual_network_link" "postgres" {
  name                  = "link-postgres"
  resource_group_name   = var.resource_group_name
  private_dns_zone_name = azurerm_private_dns_zone.postgres.name
  virtual_network_id    = azurerm_virtual_network.main.id
}
  • Step 3: Write networking outputs.tf

terraform/modules/networking/outputs.tf:

output "vnet_id" {
  value = azurerm_virtual_network.main.id
}

output "vnet_name" {
  value = azurerm_virtual_network.main.name
}

output "aks_subnet_id" {
  value = azurerm_subnet.aks.id
}

output "db_subnet_id" {
  value = azurerm_subnet.db.id
}

output "storage_pe_subnet_id" {
  value = azurerm_subnet.storage_pe.id
}

output "sftp_subnet_id" {
  value = azurerm_subnet.sftp.id
}

output "postgres_private_dns_zone_id" {
  value = azurerm_private_dns_zone.postgres.id
}

output "nat_gateway_id" {
  value = var.nat_gateway_enabled ? azurerm_nat_gateway.main[0].id : null
}
  • Step 4: Validate the module
cd terraform
terraform init -backend=false
terraform validate
terraform fmt -check -recursive

Expected: all pass.

  • Step 5: Commit
git add terraform/modules/networking/
git commit -m "feat: add networking module — VNet, subnets, NSGs, NAT gateway, private DNS"

Task 3: Database Module

Files:

  • Create: terraform/modules/database/main.tf

  • Create: terraform/modules/database/variables.tf

  • Create: terraform/modules/database/outputs.tf

  • Step 1: Write database variables.tf

terraform/modules/database/variables.tf:

variable "resource_prefix" {
  type = string
}

variable "location" {
  type = string
}

variable "resource_group_name" {
  type = string
}

variable "db_subnet_id" {
  description = "Subnet ID for PostgreSQL VNet integration"
  type        = string
}

variable "private_dns_zone_id" {
  description = "Private DNS zone ID for PostgreSQL"
  type        = string
}

variable "sku_name" {
  description = "PostgreSQL SKU name"
  type        = string
  default     = "B_Standard_B1ms"
}

variable "storage_mb" {
  description = "Storage in MB"
  type        = number
  default     = 32768
}

variable "backup_retention_days" {
  description = "Backup retention in days"
  type        = number
  default     = 7
}

variable "geo_redundant_backup_enabled" {
  description = "Enable geo-redundant backups"
  type        = bool
  default     = false
}

variable "administrator_login" {
  description = "PostgreSQL administrator login name"
  type        = string
  default     = "pgadmin"
}

# Bootstrap only. Password is stored in Terraform state. Post-provision: rotate
# immediately, store in Key Vault, and migrate to Azure AD authentication for
# all service connections. Set via TF_VAR_postgres_admin_password environment
# variable — never in tfvars.
variable "administrator_password" {
  description = "PostgreSQL administrator password"
  type        = string
  sensitive   = true
}

variable "tags" {
  type    = map(string)
  default = {}
}
  • Step 2: Write database main.tf

terraform/modules/database/main.tf:

resource "azurerm_postgresql_flexible_server" "main" {
  name                          = "psql-${var.resource_prefix}"
  resource_group_name           = var.resource_group_name
  location                      = var.location
  version                       = "16"
  administrator_login           = var.administrator_login
  administrator_password        = var.administrator_password
  sku_name                      = var.sku_name
  storage_mb                    = var.storage_mb
  backup_retention_days         = var.backup_retention_days
  geo_redundant_backup_enabled  = var.geo_redundant_backup_enabled
  delegated_subnet_id           = var.db_subnet_id
  private_dns_zone_id           = var.private_dns_zone_id
  public_network_access_enabled = false
  zone                          = "1"
  tags                          = var.tags

  authentication {
    password_auth_enabled         = true
    active_directory_auth_enabled = true
  }
}

resource "azurerm_postgresql_flexible_server_configuration" "require_ssl" {
  name      = "require_secure_transport"
  server_id = azurerm_postgresql_flexible_server.main.id
  value     = "ON"
}

resource "azurerm_postgresql_flexible_server_database" "app" {
  name      = "app_db"
  server_id = azurerm_postgresql_flexible_server.main.id
  charset   = "UTF8"
  collation = "en_US.utf8"
}

resource "azurerm_postgresql_flexible_server_database" "storage" {
  name      = "storage_db"
  server_id = azurerm_postgresql_flexible_server.main.id
  charset   = "UTF8"
  collation = "en_US.utf8"
}

resource "azurerm_postgresql_flexible_server_database" "importer" {
  name      = "importer_db"
  server_id = azurerm_postgresql_flexible_server.main.id
  charset   = "UTF8"
  collation = "en_US.utf8"
}
  • Step 3: Write database outputs.tf

terraform/modules/database/outputs.tf:

output "server_id" {
  value = azurerm_postgresql_flexible_server.main.id
}

output "server_fqdn" {
  value = azurerm_postgresql_flexible_server.main.fqdn
}

output "server_name" {
  value = azurerm_postgresql_flexible_server.main.name
}

output "database_names" {
  value = {
    app      = azurerm_postgresql_flexible_server_database.app.name
    storage  = azurerm_postgresql_flexible_server_database.storage.name
    importer = azurerm_postgresql_flexible_server_database.importer.name
  }
}
  • Step 4: Validate
cd terraform
terraform validate
terraform fmt -check -recursive
  • Step 5: Commit
git add terraform/modules/database/
git commit -m "feat: add database module — PostgreSQL Flexible Server with 3 databases"

Task 4: Storage Module

Files:

  • Create: terraform/modules/storage/main.tf

  • Create: terraform/modules/storage/variables.tf

  • Create: terraform/modules/storage/outputs.tf

  • Step 1: Write storage variables.tf

terraform/modules/storage/variables.tf:

variable "resource_prefix" {
  type = string
}

variable "location" {
  type = string
}

variable "resource_group_name" {
  type = string
}

variable "replication_type" {
  description = "LRS or ZRS"
  type        = string
  default     = "LRS"
}

variable "storage_pe_subnet_id" {
  description = "Subnet ID for Blob Storage private endpoint"
  type        = string
}

variable "vnet_id" {
  description = "VNet ID for private DNS zone link"
  type        = string
}

variable "sftp_allowed_ips" {
  description = "IP addresses allowed to connect via SFTP"
  type        = list(string)
  default     = []
}

variable "tags" {
  type    = map(string)
  default = {}
}
  • Step 2: Write storage main.tf

terraform/modules/storage/main.tf:

# --- Internal Blob Storage Account (private, no public access) ---

resource "azurerm_storage_account" "internal" {
  name                          = substr(replace("st${var.resource_prefix}int", "-", ""), 0, 24)
  resource_group_name           = var.resource_group_name
  location                      = var.location
  account_tier                  = "Standard"
  account_replication_type      = var.replication_type
  account_kind                  = "StorageV2"
  min_tls_version               = "TLS1_2"
  public_network_access_enabled = false
  tags                          = var.tags

  blob_properties {
    versioning_enabled = true
  }
}

# --- SFTP Storage Account (public for SFTP, IP-restricted) ---

resource "azurerm_storage_account" "sftp" {
  name                          = substr(replace("st${var.resource_prefix}sftp", "-", ""), 0, 24)
  resource_group_name           = var.resource_group_name
  location                      = var.location
  account_tier                  = "Standard"
  account_replication_type      = var.replication_type
  account_kind                  = "StorageV2"
  min_tls_version               = "TLS1_2"
  public_network_access_enabled = true
  is_hns_enabled                = true
  sftp_enabled                  = true
  tags                          = var.tags

  network_rules {
    default_action = "Deny"
    ip_rules       = var.sftp_allowed_ips
  }

  blob_properties {
    versioning_enabled = true
  }
}

# --- Blob Containers ---

resource "azurerm_storage_container" "shared_storage" {
  name                  = "shared-storage"
  storage_account_id    = azurerm_storage_account.internal.id
  container_access_type = "private"
}

resource "azurerm_storage_container" "job_artifacts" {
  name                  = "job-artifacts"
  storage_account_id    = azurerm_storage_account.internal.id
  container_access_type = "private"
}

resource "azurerm_storage_container" "loki_chunks" {
  name                  = "loki-chunks"
  storage_account_id    = azurerm_storage_account.internal.id
  container_access_type = "private"
}

resource "azurerm_storage_container" "velero_backups" {
  name                  = "velero-backups"
  storage_account_id    = azurerm_storage_account.internal.id
  container_access_type = "private"
}

resource "azurerm_storage_container" "sftp_ingest" {
  name                  = "sftp-ingest"
  storage_account_id    = azurerm_storage_account.sftp.id
  container_access_type = "private"
}

# --- Private Endpoint for Internal Blob ---

resource "azurerm_private_endpoint" "blob" {
  name                = "pe-blob-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  subnet_id           = var.storage_pe_subnet_id
  tags                = var.tags

  private_service_connection {
    name                           = "psc-blob-${var.resource_prefix}"
    private_connection_resource_id = azurerm_storage_account.internal.id
    subresource_names              = ["blob"]
    is_manual_connection           = false
  }

  private_dns_zone_group {
    name                 = "blobdns"
    private_dns_zone_ids = [azurerm_private_dns_zone.blob.id]
  }
}

resource "azurerm_private_dns_zone" "blob" {
  name                = "privatelink.blob.core.windows.net"
  resource_group_name = var.resource_group_name
  tags                = var.tags
}

resource "azurerm_private_dns_zone_virtual_network_link" "blob" {
  name                  = "link-blob"
  resource_group_name   = var.resource_group_name
  private_dns_zone_name = azurerm_private_dns_zone.blob.name
  virtual_network_id    = var.vnet_id
}
  • Step 3: Write storage outputs.tf

terraform/modules/storage/outputs.tf:

output "internal_storage_account_id" {
  value = azurerm_storage_account.internal.id
}

output "internal_storage_account_name" {
  value = azurerm_storage_account.internal.name
}

output "primary_blob_endpoint" {
  value = azurerm_storage_account.internal.primary_blob_endpoint
}

output "sftp_storage_account_id" {
  value = azurerm_storage_account.sftp.id
}

output "sftp_storage_account_name" {
  value = azurerm_storage_account.sftp.name
}

output "sftp_endpoint" {
  value = azurerm_storage_account.sftp.primary_blob_endpoint
}

output "container_names" {
  value = {
    sftp_ingest    = azurerm_storage_container.sftp_ingest.name
    shared_storage = azurerm_storage_container.shared_storage.name
    job_artifacts  = azurerm_storage_container.job_artifacts.name
    loki_chunks    = azurerm_storage_container.loki_chunks.name
    velero_backups = azurerm_storage_container.velero_backups.name
  }
}
  • Step 4: Validate
cd terraform
terraform validate
terraform fmt -check -recursive
  • Step 5: Commit
git add terraform/modules/storage/
git commit -m "feat: add storage module — dual accounts (internal + SFTP), containers, private endpoint"

Task 5: Key Vault Module

Files:

  • Create: terraform/modules/keyvault/main.tf

  • Create: terraform/modules/keyvault/variables.tf

  • Create: terraform/modules/keyvault/outputs.tf

  • Step 1: Write keyvault variables.tf

terraform/modules/keyvault/variables.tf:

variable "resource_prefix" {
  type = string
}

variable "location" {
  type = string
}

variable "resource_group_name" {
  type = string
}

variable "tenant_id" {
  description = "Azure AD tenant ID"
  type        = string
}

variable "keyvault_pe_subnet_id" {
  description = "Subnet ID for Key Vault private endpoint"
  type        = string
}

variable "vnet_id" {
  description = "VNet ID for private DNS zone link"
  type        = string
}

variable "tags" {
  type    = map(string)
  default = {}
}
  • Step 2: Write keyvault main.tf

terraform/modules/keyvault/main.tf:

# RBAC authorization is the default in AzureRM 4.x (enable_rbac_authorization removed).
# AKS role assignments (Key Vault Secrets User) work with the default RBAC mode.
resource "azurerm_key_vault" "main" {
  name                          = "kv-${var.resource_prefix}"
  location                      = var.location
  resource_group_name           = var.resource_group_name
  tenant_id                     = var.tenant_id
  sku_name                      = "standard"
  purge_protection_enabled      = true
  soft_delete_retention_days    = 90
  public_network_access_enabled = false
  tags                          = var.tags
}

# --- Private Endpoint for Key Vault ---

resource "azurerm_private_endpoint" "keyvault" {
  name                = "pe-kv-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  subnet_id           = var.keyvault_pe_subnet_id
  tags                = var.tags

  private_service_connection {
    name                           = "psc-kv-${var.resource_prefix}"
    private_connection_resource_id = azurerm_key_vault.main.id
    subresource_names              = ["vault"]
    is_manual_connection           = false
  }

  private_dns_zone_group {
    name                 = "kvdns"
    private_dns_zone_ids = [azurerm_private_dns_zone.keyvault.id]
  }
}

resource "azurerm_private_dns_zone" "keyvault" {
  name                = "privatelink.vaultcore.azure.net"
  resource_group_name = var.resource_group_name
  tags                = var.tags
}

resource "azurerm_private_dns_zone_virtual_network_link" "keyvault" {
  name                  = "link-keyvault"
  resource_group_name   = var.resource_group_name
  private_dns_zone_name = azurerm_private_dns_zone.keyvault.name
  virtual_network_id    = var.vnet_id
}
  • Step 3: Write keyvault outputs.tf

terraform/modules/keyvault/outputs.tf:

output "key_vault_id" {
  value = azurerm_key_vault.main.id
}

output "key_vault_name" {
  value = azurerm_key_vault.main.name
}

output "key_vault_uri" {
  value = azurerm_key_vault.main.vault_uri
}

output "key_vault_private_endpoint_ip" {
  value = azurerm_private_endpoint.keyvault.private_service_connection[0].private_ip_address
}
  • Step 4: Validate
cd terraform
terraform validate
terraform fmt -check -recursive
  • Step 5: Commit
git add terraform/modules/keyvault/
git commit -m "feat: add keyvault module with private endpoint"

Task 6: AKS Module

Files:

  • Create: terraform/modules/aks/main.tf

  • Create: terraform/modules/aks/variables.tf

  • Create: terraform/modules/aks/outputs.tf

  • Step 1: Write AKS variables.tf

terraform/modules/aks/variables.tf:

variable "resource_prefix" {
  type = string
}

variable "location" {
  type = string
}

variable "resource_group_name" {
  type = string
}

variable "aks_subnet_id" {
  description = "Subnet ID for AKS nodes"
  type        = string
}

variable "system_pool_vm_size" {
  description = "VM size for system pool (override in prod.tfvars for larger nodes)"
  type        = string
  default     = "Standard_D2s_v5"
}

variable "system_pool_count" {
  description = "Node count for system pool"
  type        = number
  default     = 1
}

variable "app_pool_count" {
  description = "Node count for app pool"
  type        = number
  default     = 1
}

variable "app_pool_max_count" {
  description = "Max node count for app pool autoscaler (0 = no autoscaling)"
  type        = number
  default     = 0
}

variable "node_auto_provisioning_enabled" {
  description = "Enable Karpenter-based Node Auto-Provisioning for job workloads"
  type        = bool
  default     = true
}

variable "key_vault_id" {
  description = "Key Vault ID for App Routing TLS integration"
  type        = string
}

variable "tags" {
  type    = map(string)
  default = {}
}
  • Step 2: Write AKS main.tf

terraform/modules/aks/main.tf:

resource "azurerm_kubernetes_cluster" "main" {
  name                = "aks-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  dns_prefix          = "aks-${var.resource_prefix}"
  kubernetes_version  = "1.31"
  sku_tier            = "Standard"  # Required for Node Auto-Provisioning (Karpenter)
  tags                = var.tags

  # Uncomment for production: lifecycle { prevent_destroy = true }

  # System node pool
  default_node_pool {
    name                        = "system"
    vm_size                     = var.system_pool_vm_size
    node_count                  = var.system_pool_count
    vnet_subnet_id              = var.aks_subnet_id
    os_disk_size_gb             = 50
    temporary_name_for_rotation = "systemtmp"

    node_labels = {
      "nodepool" = "system"
    }
  }

  identity {
    type = "SystemAssigned"
  }

  oidc_issuer_enabled       = true
  workload_identity_enabled = true

  network_profile {
    network_plugin    = "azure"
    network_policy    = "calico"
    service_cidr      = "10.1.0.0/16"
    dns_service_ip    = "10.1.0.10"
  }

  # App Routing Add-on (Managed NGINX)
  web_app_routing {
    dns_zone_ids = []
  }

  key_vault_secrets_provider {
    secret_rotation_enabled  = true
    secret_rotation_interval = "2m"
  }

  # Node Auto-Provisioning (Karpenter) for job workloads
  # Dynamically provisions right-sized VMs based on pod resource requests
  node_provisioning_profile {
    enabled = var.node_auto_provisioning_enabled
  }
}

# --- App Node Pool ---

resource "azurerm_kubernetes_cluster_node_pool" "app" {
  name                  = "app"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_D2s_v5"
  node_count            = var.app_pool_count
  min_count             = var.app_pool_max_count > 0 ? var.app_pool_count : null
  max_count             = var.app_pool_max_count > 0 ? var.app_pool_max_count : null
  auto_scaling_enabled  = var.app_pool_max_count > 0
  vnet_subnet_id        = var.aks_subnet_id
  os_disk_size_gb       = 50
  tags                  = var.tags

  node_labels = {
    "nodepool" = "app"
  }

  node_taints = []
}

# --- Monitoring Node Pool ---
# Isolates Grafana/Prometheus/Loki from app workloads

resource "azurerm_kubernetes_cluster_node_pool" "monitoring" {
  name                  = "monitor"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_D2s_v5"
  node_count            = 1
  vnet_subnet_id        = var.aks_subnet_id
  os_disk_size_gb       = 100
  tags                  = var.tags

  node_labels = {
    "nodepool" = "monitoring"
  }

  node_taints = ["dedicated=monitoring:NoSchedule"]
}

# Job nodes are NOT a static pool — they are provisioned dynamically by
# Karpenter (Node Auto-Provisioning) based on each job pod's resource requests.
# See k8s/karpenter/ for the NodePool + AKSNodeClass CRDs that define constraints.

# --- Container Insights (basic) ---
# Live container logs in Azure Portal alongside self-hosted Grafana stack

resource "azurerm_log_analytics_workspace" "aks" {
  name                = "law-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  sku                 = "PerGB2018"
  retention_in_days   = 30
  tags                = var.tags
}

resource "azurerm_monitor_diagnostic_setting" "aks" {
  name                       = "diag-aks-${var.resource_prefix}"
  target_resource_id         = azurerm_kubernetes_cluster.main.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.aks.id

  enabled_log {
    category = "kube-apiserver"
  }

  enabled_log {
    category = "kube-audit-admin"
  }

  enabled_log {
    category = "kube-controller-manager"
  }

  metric {
    category = "AllMetrics"
    enabled  = false
  }
}

# --- Role assignment: AKS → Key Vault Secrets User ---

resource "azurerm_role_assignment" "aks_keyvault" {
  scope                = var.key_vault_id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azurerm_kubernetes_cluster.main.key_vault_secrets_provider[0].secret_identity[0].object_id
}
  • Step 3: Write AKS outputs.tf

terraform/modules/aks/outputs.tf:

output "cluster_id" {
  value = azurerm_kubernetes_cluster.main.id
}

output "cluster_name" {
  value = azurerm_kubernetes_cluster.main.name
}

output "kube_config_raw" {
  value     = azurerm_kubernetes_cluster.main.kube_config_raw
  sensitive = true
}

output "oidc_issuer_url" {
  value = azurerm_kubernetes_cluster.main.oidc_issuer_url
}

output "kubelet_identity_object_id" {
  value = azurerm_kubernetes_cluster.main.kubelet_identity[0].object_id
}

output "cluster_identity_principal_id" {
  value = azurerm_kubernetes_cluster.main.identity[0].principal_id
}
  • Step 4: Validate
cd terraform
terraform validate
terraform fmt -check -recursive
  • Step 5: Commit
git add terraform/modules/aks/
git commit -m "feat: add AKS module — cluster, 3 node pools, workload identity, app routing"

Task 7: Root Module Integration and Environment Configs

Files:

  • Modify: terraform/main.tf

  • Modify: terraform/variables.tf

  • Modify: terraform/outputs.tf

  • Create: terraform/environments/dev.tfvars

  • Create: terraform/environments/preprod.tfvars

  • Create: terraform/environments/prod.tfvars

  • Step 1: Write the root main.tf wiring all modules

Replace the contents of terraform/main.tf:

locals {
  resource_prefix = "${var.project_name}-${var.environment}"
  common_tags = {
    project     = var.project_name
    environment = var.environment
    managed_by  = "terraform"
  }
}

data "azurerm_client_config" "current" {}

resource "azurerm_resource_group" "main" {
  name     = "rg-${local.resource_prefix}"
  location = var.location
  tags     = local.common_tags
}

# R2: Lock the resource group in production to prevent accidental deletion
resource "azurerm_management_lock" "rg_nodelete" {
  count      = var.environment == "prod" ? 1 : 0
  name       = "rg-nodelete"
  scope      = azurerm_resource_group.main.id
  lock_level = "CanNotDelete"
  notes      = "Prevent accidental deletion of production resource group"
}

module "networking" {
  source = "./modules/networking"

  resource_prefix     = local.resource_prefix
  location            = var.location
  resource_group_name = azurerm_resource_group.main.name
  sftp_allowed_ips    = var.sftp_allowed_ips
  nat_gateway_enabled = var.nat_gateway_enabled
  tags                = local.common_tags
}

module "database" {
  source = "./modules/database"

  resource_prefix              = local.resource_prefix
  location                     = var.location
  resource_group_name          = azurerm_resource_group.main.name
  db_subnet_id                 = module.networking.db_subnet_id
  private_dns_zone_id          = module.networking.postgres_private_dns_zone_id
  sku_name                     = var.postgres_sku_name
  backup_retention_days        = var.postgres_backup_retention_days
  geo_redundant_backup_enabled = var.postgres_geo_redundant_backup
  administrator_password       = var.postgres_admin_password
  tags                         = local.common_tags

  # Note: lifecycle blocks are not allowed on module calls; for production, set
  # prevent_destroy on the PostgreSQL server resource inside modules/database.
}

module "storage" {
  source = "./modules/storage"

  resource_prefix      = local.resource_prefix
  location             = var.location
  resource_group_name  = azurerm_resource_group.main.name
  replication_type     = var.storage_replication_type
  storage_pe_subnet_id = module.networking.storage_pe_subnet_id
  vnet_id              = module.networking.vnet_id
  sftp_allowed_ips     = var.sftp_allowed_ips
  tags                 = local.common_tags
}

module "keyvault" {
  source = "./modules/keyvault"

  resource_prefix       = local.resource_prefix
  location              = var.location
  resource_group_name   = azurerm_resource_group.main.name
  tenant_id             = data.azurerm_client_config.current.tenant_id
  keyvault_pe_subnet_id = module.networking.storage_pe_subnet_id  # Shared PE subnet (storage + KV)
  vnet_id               = module.networking.vnet_id
  tags                  = local.common_tags
}

module "aks" {
  source = "./modules/aks"

  resource_prefix     = local.resource_prefix
  location            = var.location
  resource_group_name = azurerm_resource_group.main.name
  aks_subnet_id       = module.networking.aks_subnet_id
  system_pool_vm_size = var.system_pool_vm_size
  system_pool_count   = var.aks_system_pool_count
  app_pool_count      = var.aks_app_pool_count
  app_pool_max_count  = var.aks_app_pool_max_count
  node_auto_provisioning_enabled = var.node_auto_provisioning_enabled
  key_vault_id        = module.keyvault.key_vault_id
  tags                = local.common_tags
}

# --- Azure Budget Alert ---
# Single subscription-level budget covering ALL environments combined.
# Only created once (in the dev environment apply) to avoid duplicates.
# Thresholds: €1.5k, €2.5k, €3.5k, €4.5k (warning), €5k (critical)
resource "azurerm_consumption_budget_subscription" "total" {
  count           = var.environment == "dev" ? 1 : 0
  name            = "budget-${var.project_name}-total"
  subscription_id = "/subscriptions/${var.subscription_id}"
  amount          = 5000
  time_grain      = "Monthly"

  time_period {
    start_date = "2026-05-01T00:00:00Z"  # Pinned — avoids plan drift from timestamp()
  }

  # €1,500 — first alert
  notification {
    enabled        = true
    threshold      = 30
    operator       = "GreaterThan"
    contact_emails = var.budget_alert_emails
  }

  # €2,500
  notification {
    enabled        = true
    threshold      = 50
    operator       = "GreaterThan"
    contact_emails = var.budget_alert_emails
  }

  # €3,500
  notification {
    enabled        = true
    threshold      = 70
    operator       = "GreaterThan"
    contact_emails = var.budget_alert_emails
  }

  # €4,500
  notification {
    enabled        = true
    threshold      = 90
    operator       = "GreaterThan"
    contact_emails = var.budget_alert_emails
  }

  # €5,000 — critical, at budget ceiling
  notification {
    enabled        = true
    threshold      = 100
    operator       = "GreaterThan"
    contact_emails = var.budget_alert_emails
  }

  lifecycle {
    ignore_changes = [time_period]
  }
}
  • Step 2: Add new variables to root variables.tf

Append to terraform/variables.tf:

variable "postgres_admin_password" {
  description = "PostgreSQL administrator password"
  type        = string
  sensitive   = true
}

variable "nat_gateway_enabled" {
  description = "Enable NAT Gateway for outbound internet from private subnets"
  type        = bool
  default     = true
}

variable "system_pool_vm_size" {
  description = "VM size for AKS system pool"
  type        = string
  default     = "Standard_D2s_v5"
}

variable "budget_alert_emails" {
  description = "Email addresses for budget alerts"
  type        = list(string)
  default     = []
}
  • Step 3: Write root outputs.tf

Replace terraform/outputs.tf:

output "resource_group_name" {
  value = azurerm_resource_group.main.name
}

output "aks_cluster_name" {
  value = module.aks.cluster_name
}

output "aks_oidc_issuer_url" {
  value = module.aks.oidc_issuer_url
}

output "postgres_server_fqdn" {
  value = module.database.server_fqdn
}

output "storage_account_name" {
  value = module.storage.internal_storage_account_name
}

output "key_vault_name" {
  value = module.keyvault.key_vault_name
}
  • Step 4: Write dev.tfvars

terraform/environments/dev.tfvars:

environment            = "dev"
subscription_id        = "REPLACE_WITH_SUBSCRIPTION_ID"

# AKS
aks_system_pool_count  = 1
aks_app_pool_count     = 1
aks_app_pool_max_count = 0
node_auto_provisioning_enabled = true  # Karpenter provisions job nodes dynamically

# PostgreSQL
postgres_sku_name              = "B_Standard_B1ms"
postgres_backup_retention_days = 7
postgres_geo_redundant_backup  = false

# Storage
storage_replication_type = "LRS"

# Networking
nat_gateway_enabled = true

# SFTP
sftp_allowed_ips = []

# Budget (subscription-level, created only in dev apply)
budget_alert_emails = ["REPLACE_WITH_EMAIL"]
  • Step 5: Write preprod.tfvars

terraform/environments/preprod.tfvars:

environment            = "preprod"
subscription_id        = "REPLACE_WITH_SUBSCRIPTION_ID"

# AKS — start single-node, scale via these values when needed
aks_system_pool_count  = 1
aks_app_pool_count     = 1
aks_app_pool_max_count = 0
node_auto_provisioning_enabled = true  # Karpenter provisions job nodes dynamically

# PostgreSQL — start burstable, upgrade to GP_Standard_D2s_v3 when needed
postgres_sku_name              = "B_Standard_B1ms"
postgres_backup_retention_days = 7
postgres_geo_redundant_backup  = false

# Storage
storage_replication_type = "LRS"

# Networking
nat_gateway_enabled = true

# SFTP
sftp_allowed_ips = []

# Budget (subscription-level, created only in dev apply)
budget_alert_emails = ["REPLACE_WITH_EMAIL"]
  • Step 6: Write prod.tfvars

terraform/environments/prod.tfvars:

environment            = "prod"
subscription_id        = "REPLACE_WITH_SUBSCRIPTION_ID"

# AKS — 2 nodes each for zero-downtime during Azure host maintenance
aks_system_pool_count  = 2
aks_app_pool_count     = 2
aks_app_pool_max_count = 0
node_auto_provisioning_enabled = true  # Karpenter provisions job nodes dynamically

# PostgreSQL — start burstable, upgrade to GP_Standard_D2s_v3 when needed
postgres_sku_name              = "B_Standard_B1ms"
postgres_backup_retention_days = 35
# Geo-redundant backup requires General Purpose tier — enable when upgrading SKU:
#   postgres_sku_name = "GP_Standard_D2s_v3"
#   postgres_geo_redundant_backup = true
postgres_geo_redundant_backup  = false

# Storage — ZRS for zone redundancy in prod
storage_replication_type = "ZRS"

# Networking
nat_gateway_enabled = true

# SFTP
sftp_allowed_ips = []

# Budget (subscription-level, created only in dev apply)
budget_alert_emails = ["REPLACE_WITH_EMAIL"]
  • Step 7: Validate the full configuration
cd terraform
terraform init -backend=false
terraform validate
terraform fmt -check -recursive
  • Step 8: Commit
git add terraform/main.tf terraform/outputs.tf terraform/variables.tf terraform/environments/
git commit -m "feat: wire root module with all child modules and environment configs"

Task 8: Kubernetes Base Configuration

Files:

  • Create: k8s/namespaces.yaml

  • Create: k8s/network-policies/default-deny.yaml

  • Create: k8s/network-policies/app-services.yaml

  • Create: k8s/network-policies/db-access.yaml

  • Create: k8s/network-policies/blob-access.yaml

  • Create: k8s/network-policies/jobs.yaml

  • Create: k8s/network-policies/monitoring.yaml

  • Create: k8s/pdbs.yaml

  • Create: k8s/resource-quotas.yaml

  • Create: k8s/rbac/job-creator.yaml

  • Create: k8s/karpenter/job-nodepool.yaml

  • Create: k8s/karpenter/job-nodeclass.yaml

  • Create: k8s/karpenter/job-storageclass.yaml

  • Step 1: Write namespaces.yaml

k8s/namespaces.yaml:

apiVersion: v1
kind: Namespace
metadata:
  name: app
  labels:
    purpose: application
---
apiVersion: v1
kind: Namespace
metadata:
  name: jobs
  labels:
    purpose: job-execution
---
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    purpose: observability
---
apiVersion: v1
kind: Namespace
metadata:
  name: velero
  labels:
    purpose: backup
  • Step 2: Write default-deny.yaml

k8s/network-policies/default-deny.yaml:

# Applied to the app, jobs, monitoring, and velero namespaces
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: app
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: jobs
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: monitoring
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: velero
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  • Step 3: Write app-services.yaml

k8s/network-policies/app-services.yaml:

# Allow inter-service communication: Backend <-> Storage <-> Importer
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-inter-service
  namespace: app
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/part-of: app
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow from other app services
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/part-of: app
    # Allow from ingress controller
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: app-routing-system
  egress:
    # Allow to other app services
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/part-of: app
    # Allow DNS
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
  • Step 4: Write db-access.yaml

k8s/network-policies/db-access.yaml:

# Allow app services to reach PostgreSQL (10.0.16.0/24:5432)
# Note: jobs namespace intentionally excluded — job pods do not need direct database access
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-postgres
  namespace: app
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/part-of: app
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.16.0/24
      ports:
        - protocol: TCP
          port: 5432
  • Step 5: Write blob-access.yaml

k8s/network-policies/blob-access.yaml:

# Allow Importer and Storage services to reach Blob Storage (10.0.17.0/24:443)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-blob
  namespace: app
spec:
  podSelector:
    matchExpressions:
      - key: app.kubernetes.io/name
        operator: In
        values:
          - app-storage
          - app-importer
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.17.0/24
      ports:
        - protocol: TCP
          port: 443
---
# Allow job pods to reach Blob Storage for artifacts
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-blob
  namespace: jobs
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.17.0/24
      ports:
        - protocol: TCP
          port: 443
  • Step 6: Write jobs.yaml

k8s/network-policies/jobs.yaml:

# Allow job pods to communicate within the jobs namespace
# and reach app services (for callbacks), but deny internet
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-jobs-internal
  namespace: jobs
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector: {}
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: app
  egress:
    # Allow DNS
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow to app namespace
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: app
    # Allow within jobs namespace
    - to:
        - podSelector: {}
  • Step 7: Write monitoring.yaml

k8s/network-policies/monitoring.yaml:

# Prometheus: scrape egress to all namespaces on common metrics ports
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  policyTypes:
    - Egress
  egress:
    # Scrape targets across all namespaces
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 9090
        - protocol: TCP
          port: 8080
        - protocol: TCP
          port: 9100
        - protocol: TCP
          port: 10250
    # Allow DNS
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# Grafana: allow ingress on 3000 + egress to Loki/Prometheus for datasource queries
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-grafana
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: grafana
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: app-routing-system
      ports:
        - protocol: TCP
          port: 3000
  egress:
    # Query Prometheus and Loki datasources within monitoring namespace
    - to:
        - podSelector: {}
      ports:
        - protocol: TCP
          port: 9090
        - protocol: TCP
          port: 3100
    # DNS
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# Loki: allow ingress on 3100 + egress to Azure Blob Storage for chunk persistence
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-loki
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: loki
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: app
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: jobs
        # Allow Alloy to push logs
        - podSelector: {}
      ports:
        - protocol: TCP
          port: 3100
  egress:
    # Azure Blob Storage private endpoint for chunk/index persistence
    - to:
        - ipBlock:
            cidr: 10.0.17.0/24
      ports:
        - protocol: TCP
          port: 443
    # DNS
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# Velero: egress to Blob Storage for backups + DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-velero-egress
  namespace: velero
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.17.0/24
      ports:
        - protocol: TCP
          port: 443
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# Alloy: egress to Loki + K8s API for service discovery, ingress for Prometheus scrape
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-alloy
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: alloy
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Prometheus scrapes Alloy metrics on port 12345
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus
      ports:
        - protocol: TCP
          port: 12345
  egress:
    # Push logs to Loki
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: loki
      ports:
        - protocol: TCP
          port: 3100
    # K8s API server for pod discovery (discovery.kubernetes)
    # AKS serves the API on 443; 6443 kept for compatibility with other clusters
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 443
        - protocol: TCP
          port: 6443
---
# Monitoring namespace: DNS egress for all pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: monitoring
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
  • Step 8: Write pdbs.yaml

k8s/pdbs.yaml:

# PodDisruptionBudgets for observability stack
# Using maxUnavailable (not minAvailable) to avoid blocking node drains
# on single-replica deployments. With 1 replica, minAvailable:1 would
# prevent any voluntary disruption (drains hang indefinitely).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prometheus-pdb
  namespace: monitoring
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: prometheus
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: grafana-pdb
  namespace: monitoring
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: grafana
  • Step 9: Write resource-quotas.yaml

k8s/resource-quotas.yaml:

# ResourceQuotas per namespace to prevent runaway resource consumption
apiVersion: v1
kind: ResourceQuota
metadata:
  name: app-quota
  namespace: app
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: jobs-quota
  namespace: jobs
spec:
  hard:
    # Must align with Karpenter NodePool limits (192 vCPU / 2048 GiB)
    requests.cpu: "192"
    requests.memory: 2048Gi
    limits.cpu: "192"
    limits.memory: 2048Gi
  • Step 10: Write Job creator RBAC

The Backend service needs to create K8s Jobs in the jobs namespace for script execution. This Role + RoleBinding grants the backend's service account the required permissions.

k8s/rbac/job-creator.yaml:

# Role granting Job lifecycle management in the jobs namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: job-creator
  namespace: jobs
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "watch", "delete"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
# Bind to the backend service account
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: backend-job-creator
  namespace: jobs
subjects:
  - kind: ServiceAccount
    name: app-backend
    namespace: app
roleRef:
  kind: Role
  name: job-creator
  apiGroup: rbac.authorization.k8s.io
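
Once applied, the binding can be sanity-checked with kubectl impersonation (the service account name app-backend assumes the Backend release is installed under that name, as in Task 10):

kubectl auth can-i create jobs.batch -n jobs --as=system:serviceaccount:app:app-backend
# Expected: yes
kubectl auth can-i delete pods -n jobs --as=system:serviceaccount:app:app-backend
# Expected: no (the Role only grants read access to pods)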
  • Step 11: Write Karpenter NodePool for job workloads

k8s/karpenter/job-nodepool.yaml:

# Karpenter NodePool — defines constraints for dynamically provisioned job nodes.
# Nodes are created on-demand when job pods are pending, and consolidated/terminated
# when idle. VM size is selected automatically based on pod resource requests.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: job-workers
spec:
  template:
    metadata:
      labels:
        nodepool: job
    spec:
      taints:
        - key: workload
          value: job
          effect: NoSchedule
      # Force node recycling after 24h to pick up OS patches
      expireAfter: 24h
      requirements:
        # Allow D-series (balanced) and E-series (memory-optimized) VMs
        - key: karpenter.azure.com/sku-family
          operator: In
          values: ["D", "E"]
        # Only use v5 generation for cost/performance balance
        - key: karpenter.azure.com/sku-version
          operator: In
          values: ["v5"]
        # Limit max VM size to 96 vCPU (E96as_v5 = 96 vCPU, 672 GiB)
        - key: karpenter.azure.com/sku-cpu
          operator: Lt
          values: ["97"]
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: job-workers
  # Consolidation: only terminate nodes when fully empty (not underutilized)
  # to avoid killing nodes mid-job-execution during bursty workloads
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s
  # Limit total resources across all Karpenter-managed job nodes
  # Allows ~2 concurrent max-sized jobs (96 vCPU + 672 GiB each)
  limits:
    cpu: "192"
    memory: "2048Gi"
  • Step 12: Write AKSNodeClass for job nodes

k8s/karpenter/job-nodeclass.yaml:

apiVersion: karpenter.azure.com/v1alpha2
kind: AKSNodeClass
metadata:
  name: job-workers
spec:
  # Must match the AKS subnet so job nodes join the same VNet
  # Replace with your actual subnet resource ID after terraform apply
  vnetSubnetID: /subscriptions/SUBSCRIPTION_ID/resourceGroups/rg-app-ENV/providers/Microsoft.Network/virtualNetworks/vnet-app-ENV/subnets/snet-aks
  osDiskSizeGB: 100
  imageFamily: Ubuntu2204
  • Step 13: Write StorageClass for job worker PVCs

Job pods get their own PVCs for scratch space (script inputs/outputs, temp data). Azure Disk PVCs are zone-pinned, so we use volumeBindingMode: WaitForFirstConsumer — the PVC waits until the pod is scheduled to a node, then provisions the disk in the same zone as that node. This prevents zone mismatch with Karpenter-provisioned nodes.

k8s/karpenter/job-storageclass.yaml:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: job-scratch
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_LRS
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
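
To tie these pieces together, here is a minimal sketch of the kind of Job the Backend might submit: the toleration and nodepool label match the Karpenter NodePool above, the resource requests drive Karpenter's VM selection, and the scratch volume uses the job-scratch StorageClass as a generic ephemeral volume. The image name, sizes, and per-job scratch pattern are illustrative assumptions, not deliverables.

apiVersion: batch/v1
kind: Job
metadata:
  name: example-script-run        # hypothetical; generated per run by the Backend
  namespace: jobs
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        nodepool: job                     # matches the Karpenter NodePool label
      tolerations:
        - key: workload
          operator: Equal
          value: job
          effect: NoSchedule              # matches the NodePool taint
      containers:
        - name: script-runner
          image: ghcr.io/OWNER/scriptrunner:latest   # illustrative image
          resources:
            requests:
              cpu: "4"
              memory: 16Gi                # drives Karpenter's VM sizing
            limits:
              cpu: "4"
              memory: 16Gi
          volumeMounts:
            - name: scratch
              mountPath: /scratch
      volumes:
        - name: scratch
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                storageClassName: job-scratch
                resources:
                  requests:
                    storage: 50Gi

Because job-scratch uses WaitForFirstConsumer, the scratch disk is only provisioned once Karpenter has placed the pod, in that node's zone.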
  • Step 14: Validate YAML syntax
for f in k8s/namespaces.yaml k8s/network-policies/*.yaml k8s/pdbs.yaml k8s/resource-quotas.yaml k8s/rbac/*.yaml k8s/karpenter/*.yaml; do
  echo "--- $f ---"
  yq '.' "$f" > /dev/null && echo "OK" || echo "FAIL"
done

Expected: all OK.

  • Step 15: Commit
git add k8s/
git commit -m "feat: add K8s base config — namespaces, network policies, PDBs, quotas, Karpenter job pool"

Task 9: Observability Stack (Helm Values)

Files:

  • Create: k8s/observability/prometheus-values.yaml

  • Create: k8s/observability/loki-values.yaml

  • Create: k8s/observability/install.sh

  • Step 1: Write prometheus-values.yaml

Helm chart: prometheus-community/kube-prometheus-stack

k8s/observability/prometheus-values.yaml:

# Values for kube-prometheus-stack Helm chart
# Includes Prometheus + Grafana

prometheus:
  prometheusSpec:
    nodeSelector:
      nodepool: monitoring
    tolerations:
      - key: dedicated
        operator: Equal
        value: monitoring
        effect: NoSchedule
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    # R8: Configure startupProbe for slow-starting Prometheus instances
    # The kube-prometheus-stack chart supports startupProbe via prometheusSpec;
    # tune failureThreshold and periodSeconds in per-environment overrides if
    # Prometheus takes longer than the default 15-minute window to replay WAL.

grafana:
  nodeSelector:
    nodepool: monitoring
  tolerations:
    - key: dedicated
      operator: Equal
      value: monitoring
      effect: NoSchedule
  persistence:
    enabled: true
    size: 10Gi
  adminPassword: "" # Set via --set at deploy time or External Secrets
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki.monitoring.svc.cluster.local:3100
      access: proxy
      isDefault: false

alertmanager:
  alertmanagerSpec:
    nodeSelector:
      nodepool: monitoring
    tolerations:
      - key: dedicated
        operator: Equal
        value: monitoring
        effect: NoSchedule

nodeExporter:
  enabled: true

kubeStateMetrics:
  enabled: true
  • Step 2: Write loki-values.yaml

Helm chart: grafana/loki

k8s/observability/loki-values.yaml:

# Values for grafana/loki Helm chart (single binary mode)

deploymentMode: SingleBinary

loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  storage:
    type: azure
    azure:
      container_name: loki-chunks
      # account_name and account_key should be injected via External Secrets from Key Vault
      # Injected at deploy time via --set overrides in install.sh:
      #   --set loki.storage.azure.account_name=<name>
      #   --set loki.storage.azure.account_key=<key>
      # Or use External Secrets Operator to populate a K8s Secret
      account_name: ""
      account_key: ""
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: azure
        schema: v13
        index:
          prefix: index_
          period: 24h

singleBinary:
  replicas: 1
  nodeSelector:
    nodepool: monitoring
  tolerations:
    - key: dedicated
      operator: Equal
      value: monitoring
      effect: NoSchedule
  persistence:
    size: 10Gi

read:
  replicas: 0

write:
  replicas: 0

backend:
  replicas: 0

# Log collection handled by Grafana Alloy — see install.sh

gateway:
  enabled: false
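
If the External Secrets route is preferred over --set overrides, a sketch of the ExternalSecret could look like the following. The ClusterSecretStore name (azure-keyvault) and the Key Vault secret names are assumptions; how the Loki chart consumes the resulting Secret (env expansion vs. templated values) remains a deploy-time decision.

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: loki-azure-storage
  namespace: monitoring
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-keyvault          # assumed ClusterSecretStore backed by the Key Vault
    kind: ClusterSecretStore
  target:
    name: loki-azure-storage      # K8s Secret consumed at Loki deploy time
  data:
    - secretKey: account_name
      remoteRef:
        key: loki-storage-account-name   # assumed Key Vault secret name
    - secretKey: account_key
      remoteRef:
        key: loki-storage-account-key    # assumed Key Vault secret name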
  • Step 3: Write install instructions as a deploy script

k8s/observability/install.sh:

#!/usr/bin/env bash
set -euo pipefail

NAMESPACE="monitoring"

# Add Helm repos
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (Prometheus + Grafana)
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace "$NAMESPACE" \
  --create-namespace \
  --values "$(dirname "$0")/prometheus-values.yaml" \
  --wait

# Install Loki
helm upgrade --install loki grafana/loki \
  --namespace "$NAMESPACE" \
  --values "$(dirname "$0")/loki-values.yaml" \
  --wait

# Install Alloy (replaces Promtail — Grafana's unified telemetry collector)
helm upgrade --install alloy grafana/alloy \
  --namespace "$NAMESPACE" \
  --set "alloy.configMap.content=loki.write \"default\" { endpoint { url = \"http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push\" } }\nloki.source.kubernetes \"pods\" { targets = discovery.kubernetes.pods.targets\n forward_to = [loki.write.default.receiver] }\ndiscovery.kubernetes \"pods\" { role = \"pod\" }" \
  --wait

# Install Velero for K8s resource backup to Azure Blob Storage
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm upgrade --install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --set "initContainers[0].name=velero-plugin-for-microsoft-azure" \
  --set "initContainers[0].image=velero/velero-plugin-for-microsoft-azure:v1.10.0" \
  --set "initContainers[0].volumeMounts[0].mountPath=/target" \
  --set "initContainers[0].volumeMounts[0].name=plugins" \
  --set "configuration.backupStorageLocation[0].provider=azure" \
  --set "configuration.backupStorageLocation[0].bucket=velero-backups" \
  --set "configuration.backupStorageLocation[0].config.storageAccount=${VELERO_STORAGE_ACCOUNT}" \
  --set "configuration.backupStorageLocation[0].config.resourceGroup=${RESOURCE_GROUP}" \
  --set "schedules.daily.schedule=0 2 * * *" \
  --set "schedules.daily.template.ttl=168h" \
  --wait

echo "Observability + backup stack installed"
echo "Grafana: kubectl port-forward -n $NAMESPACE svc/prometheus-grafana 3000:80"
echo "Velero: velero get backup-locations"
  • Step 4: Validate YAML syntax
for f in k8s/observability/*.yaml; do
  echo "--- $f ---"
  yq '.' "$f" > /dev/null && echo "OK" || echo "FAIL"
done
shellcheck k8s/observability/install.sh
  • Step 5: Commit
git add k8s/observability/
git commit -m "feat: add observability stack — Prometheus, Grafana, Loki Helm values"

Task 10: Application Helm Chart

Files:

  • Create: helm/app-service/Chart.yaml

  • Create: helm/app-service/values.yaml

  • Create: helm/app-service/values-backend.yaml

  • Create: helm/app-service/values-storage.yaml

  • Create: helm/app-service/values-importer.yaml

  • Create: helm/app-service/templates/_helpers.tpl

  • Create: helm/app-service/templates/deployment.yaml

  • Create: helm/app-service/templates/service.yaml

  • Create: helm/app-service/templates/ingress.yaml

  • Create: helm/app-service/templates/migration-job.yaml

  • Create: helm/app-service/templates/hpa.yaml

  • Create: helm/app-service/templates/serviceaccount.yaml

  • Step 1: Write Chart.yaml

helm/app-service/Chart.yaml:

apiVersion: v2
name: app-service
description: Shared Helm chart for APP microservices
type: application
version: 0.1.0
appVersion: "1.0.0"
  • Step 2: Write default values.yaml

helm/app-service/values.yaml:

replicaCount: 1

image:
  repository: ghcr.io/OWNER/app-backend
  tag: latest
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 8080

ingress:
  enabled: false
  className: webapprouting.kubernetes.azure.com
  host: ""
  path: /
  pathType: Prefix
  tlsKeyVaultUri: ""  # e.g. https://kv-app-prod.vault.azure.net/certificates/app-tls

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi

nodeSelector:
  nodepool: app

# Tolerations for scheduling on tainted nodes.
# For job deployments, use:
#   tolerations:
#     - key: workload
#       operator: Equal
#       value: job
#       effect: NoSchedule
tolerations: []

serviceAccount:
  create: true
  annotations: {}

hpa:
  enabled: false
  minReplicas: 1
  maxReplicas: 3
  targetCPUUtilizationPercentage: 80

liquibase:
  enabled: false
  image:
    repository: ghcr.io/OWNER/app-backend
    tag: latest
  changelogPath: /liquibase/changelog.xml

# Env vars for the main application container
env: []

# Env vars for the Liquibase migration Job (runs as pre-upgrade hook,
# NOT as an init container — avoids race conditions in multi-replica deploys).
# Set in per-service values files.
liquibaseEnv: []

labels:
  app.kubernetes.io/part-of: app
  • Step 3: Write per-service values overrides

helm/app-service/values-backend.yaml:

image:
  repository: ghcr.io/OWNER/app-backend
  # tag is rewritten automatically by CI (Task 12)
  tag: latest

service:
  port: 8080

ingress:
  enabled: true
  host: app.example.com
  path: /

liquibase:
  enabled: true
  changelogPath: /liquibase/changelog.xml

liquibaseEnv:
  - name: LIQUIBASE_COMMAND_URL
    value: "jdbc:postgresql://$(POSTGRES_HOST):5432/app_db?sslmode=require"
  - name: LIQUIBASE_COMMAND_USERNAME
    value: "pgadmin"
  - name: LIQUIBASE_COMMAND_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-credentials
        key: password

env:
  - name: SPRING_DATASOURCE_URL
    value: "jdbc:postgresql://$(POSTGRES_HOST):5432/app_db?sslmode=require"
  - name: SPRING_PROFILES_ACTIVE
    value: "azure"

helm/app-service/values-storage.yaml:

image:
  repository: ghcr.io/OWNER/app-storage
  # tag is rewritten automatically by CI (Task 12)
  tag: latest

service:
  port: 8081

ingress:
  enabled: false

liquibase:
  enabled: true
  changelogPath: /liquibase/changelog.xml

liquibaseEnv:
  - name: LIQUIBASE_COMMAND_URL
    value: "jdbc:postgresql://$(POSTGRES_HOST):5432/storage_db?sslmode=require"
  - name: LIQUIBASE_COMMAND_USERNAME
    value: "pgadmin"
  - name: LIQUIBASE_COMMAND_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-credentials
        key: password

env:
  - name: SPRING_DATASOURCE_URL
    value: "jdbc:postgresql://$(POSTGRES_HOST):5432/storage_db?sslmode=require"
  - name: SPRING_PROFILES_ACTIVE
    value: "azure"

helm/app-service/values-importer.yaml:

image:
  repository: ghcr.io/OWNER/app-importer
  # tag is rewritten automatically by CI (Task 12)
  tag: latest

service:
  port: 8082

ingress:
  enabled: false

liquibase:
  enabled: true
  changelogPath: /liquibase/changelog.xml

liquibaseEnv:
  - name: LIQUIBASE_COMMAND_URL
    value: "jdbc:postgresql://$(POSTGRES_HOST):5432/importer_db?sslmode=require"
  - name: LIQUIBASE_COMMAND_USERNAME
    value: "pgadmin"
  - name: LIQUIBASE_COMMAND_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-credentials
        key: password

env:
  - name: SPRING_DATASOURCE_URL
    value: "jdbc:postgresql://$(POSTGRES_HOST):5432/importer_db?sslmode=require"
  - name: SPRING_PROFILES_ACTIVE
    value: "azure"
  • Step 4: Write _helpers.tpl

helm/app-service/templates/_helpers.tpl:

{{/*
Expand the name of the chart release into a fullname.
*/}}
{{- define "app-service.fullname" -}}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Common labels
*/}}
{{- define "app-service.labels" -}}
helm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}
{{ include "app-service.selectorLabels" . }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- range $key, $value := .Values.labels }}
{{ $key }}: {{ $value | quote }}
{{- end }}
{{- end }}

{{/*
Selector labels
*/}}
{{- define "app-service.selectorLabels" -}}
app.kubernetes.io/name: {{ include "app-service.fullname" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}
  • Step 5: Write deployment.yaml template

helm/app-service/templates/deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "app-service.fullname" . }}
  labels:
    {{- include "app-service.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "app-service.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "app-service.labels" . | nindent 8 }}
    spec:
      serviceAccountName: {{ include "app-service.fullname" . }}
      nodeSelector:
        {{- toYaml .Values.nodeSelector | nindent 8 }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      # Liquibase runs as a pre-upgrade hook Job (see migration-job.yaml),
      # NOT as an init container — avoids race conditions in multi-replica deploys.
      containers:
        - name: {{ include "app-service.fullname" . }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - containerPort: {{ .Values.service.port }}
              protocol: TCP
          env:
            {{- toYaml .Values.env | nindent 12 }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          startupProbe:
            httpGet:
              path: /actuator/health/liveness
              port: {{ .Values.service.port }}
            failureThreshold: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: {{ .Values.service.port }}
            initialDelaySeconds: 60
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: {{ .Values.service.port }}
            initialDelaySeconds: 15
            periodSeconds: 5
  • Step 6: Write service.yaml template

helm/app-service/templates/service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: {{ include "app-service.fullname" . }}
  labels:
    {{- include "app-service.labels" . | nindent 4 }}
spec:
  type: {{ .Values.service.type }}
  ports:
    - port: {{ .Values.service.port }}
      targetPort: {{ .Values.service.port }}
      protocol: TCP
  selector:
    {{- include "app-service.selectorLabels" . | nindent 4 }}
  • Step 7: Write ingress.yaml template

helm/app-service/templates/ingress.yaml:

{{- if .Values.ingress.enabled }}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ include "app-service.fullname" . }}
  {{- if .Values.ingress.tlsKeyVaultUri }}
  annotations:
    kubernetes.azure.com/tls-cert-keyvault-uri: {{ .Values.ingress.tlsKeyVaultUri }}
  {{- end }}
spec:
  ingressClassName: {{ .Values.ingress.className }}
  rules:
    - host: {{ .Values.ingress.host }}
      http:
        paths:
          - path: {{ .Values.ingress.path }}
            pathType: {{ .Values.ingress.pathType }}
            backend:
              service:
                name: {{ include "app-service.fullname" . }}
                port:
                  number: {{ .Values.service.port }}
  tls:
    - hosts:
        - {{ .Values.ingress.host }}
      secretName: {{ include "app-service.fullname" . }}-tls
{{- end }}
  • Step 8: Write migration-job.yaml template (Liquibase pre-upgrade hook)

helm/app-service/templates/migration-job.yaml:

{{- if .Values.liquibase.enabled }}
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "app-service.fullname" . }}-migrate
  labels:
    {{- include "app-service.labels" . | nindent 4 }}
  annotations:
    "helm.sh/hook": pre-upgrade,pre-install
    "helm.sh/hook-weight": "-1"
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  backoffLimit: 3
  template:
    metadata:
      labels:
        {{- include "app-service.selectorLabels" . | nindent 8 }}
    spec:
      restartPolicy: Never
      nodeSelector:
        {{- toYaml .Values.nodeSelector | nindent 8 }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      containers:
        - name: liquibase
          image: "{{ .Values.liquibase.image.repository }}:{{ .Values.liquibase.image.tag }}"
          command: ["liquibase"]
          args:
            - "--changelog-file={{ .Values.liquibase.changelogPath }}"
            - "update"
          env:
            {{- toYaml .Values.liquibaseEnv | nindent 12 }}
{{- end }}
  • Step 9: Write hpa.yaml and serviceaccount.yaml templates

helm/app-service/templates/hpa.yaml:

{{- if .Values.hpa.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "app-service.fullname" . }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "app-service.fullname" . }}
  minReplicas: {{ .Values.hpa.minReplicas }}
  maxReplicas: {{ .Values.hpa.maxReplicas }}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.hpa.targetCPUUtilizationPercentage }}
{{- end }}

helm/app-service/templates/serviceaccount.yaml:

{{- if .Values.serviceAccount.create }}
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ include "app-service.fullname" . }}
  annotations:
    {{- toYaml .Values.serviceAccount.annotations | nindent 4 }}
  labels:
    {{- include "app-service.labels" . | nindent 4 }}
{{- end }}
  • Step 10: Lint the Helm chart
helm lint helm/app-service/
helm lint helm/app-service/ -f helm/app-service/values-backend.yaml
helm lint helm/app-service/ -f helm/app-service/values-storage.yaml
helm lint helm/app-service/ -f helm/app-service/values-importer.yaml

Expected: all pass with no errors.
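
Beyond linting, rendering a release locally is a quick way to inspect the generated manifests before deploying (the release name app-backend is just an example):

helm template app-backend helm/app-service/ -f helm/app-service/values-backend.yaml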

  • Step 11: Commit
git add helm/
git commit -m "feat: add shared Helm chart for APP microservices with per-service overrides"

Task 11: GitHub Actions -- Terraform Workflow

Files:

  • Create: .github/workflows/terraform.yml

  • Step 1: Write terraform.yml

.github/workflows/terraform.yml:

name: Terraform

on:
  pull_request:
    paths:
      - "terraform/**"
  push:
    branches:
      - develop
      - main
    paths:
      - "terraform/**"

permissions:
  id-token: write
  contents: read
  pull-requests: write

env:
  ARM_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
  ARM_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}
  ARM_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
  ARM_USE_OIDC: "true"
  TF_VERSION: "1.9.0"

jobs:
  determine-env:
    runs-on: ubuntu-latest
    outputs:
      environment: ${{ steps.env.outputs.environment }}
    steps:
      - id: env
        run: |
          if [[ "${{ github.ref }}" == "refs/heads/main" ]]; then
            echo "environment=preprod" >> "$GITHUB_OUTPUT"
          elif [[ "${{ github.ref }}" == "refs/heads/develop" ]]; then
            echo "environment=dev" >> "$GITHUB_OUTPUT"
          else
            echo "environment=dev" >> "$GITHUB_OUTPUT"
          fi

  plan:
    needs: determine-env
    runs-on: ubuntu-latest
    environment: ${{ needs.determine-env.outputs.environment }}
    defaults:
      run:
        working-directory: terraform
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Terraform Init
        run: terraform init -backend-config="key=app-${{ needs.determine-env.outputs.environment }}.tfstate"

      - name: Terraform Format Check
        run: terraform fmt -check -recursive

      - name: Terraform Validate
        run: terraform validate

      - name: Terraform Plan
        id: plan
        run: |
          terraform plan \
            -var-file="environments/${{ needs.determine-env.outputs.environment }}.tfvars" \
            -var="postgres_admin_password=${{ secrets.POSTGRES_ADMIN_PASSWORD }}" \
            -out=tfplan \
            -no-color
        continue-on-error: true

      - name: Upload Plan Artifact
        if: success() || steps.plan.outcome == 'success'
        uses: actions/upload-artifact@v4
        with:
          name: tfplan-${{ needs.determine-env.outputs.environment }}
          path: terraform/tfplan
          retention-days: 5

      - name: Comment PR with Plan
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const output = `#### Terraform Plan 📖
            **Environment:** \`${{ needs.determine-env.outputs.environment }}\`
            **Result:** \`${{ steps.plan.outcome }}\`

            <details><summary>Show Plan</summary>

            \`\`\`
            ${{ steps.plan.outputs.stdout }}
            \`\`\`

            </details>`;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });

      - name: Plan Status
        if: steps.plan.outcome == 'failure'
        run: exit 1

  apply:
    needs: [determine-env, plan]
    if: github.event_name == 'push' && (github.ref == 'refs/heads/develop' || github.ref == 'refs/heads/main')
    runs-on: ubuntu-latest
    environment: ${{ needs.determine-env.outputs.environment }}
    defaults:
      run:
        working-directory: terraform
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Terraform Init
        run: terraform init -backend-config="key=app-${{ needs.determine-env.outputs.environment }}.tfstate"

      - name: Download Plan Artifact
        uses: actions/download-artifact@v4
        with:
          name: tfplan-${{ needs.determine-env.outputs.environment }}
          path: terraform

      - name: Terraform Apply
        run: terraform apply tfplan

  apply-prod:
    needs: [plan]
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: prod
    defaults:
      run:
        working-directory: terraform
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Terraform Init
        run: terraform init -backend-config="key=app-prod.tfstate"

      - name: Terraform Plan (Prod)
        run: |
          terraform plan \
            -var-file="environments/prod.tfvars" \
            -var="postgres_admin_password=${{ secrets.POSTGRES_ADMIN_PASSWORD }}" \
            -out=tfplan-prod

      - name: Terraform Apply (Prod)
        run: terraform apply tfplan-prod
  • Step 2: Validate workflow syntax
yq '.' .github/workflows/terraform.yml > /dev/null && echo "YAML OK" || echo "YAML FAIL"
  • Step 3: Commit
git add .github/workflows/terraform.yml
git commit -m "feat: add GitHub Actions workflow for Terraform plan/apply"

Task 12: Reusable CI Workflow Template (lives in app-deployment repo)

Each application source repo needs a CI workflow that builds images and updates tags in app-deployment. Instead of duplicating this across repos, provide a reusable workflow template.

Files:

  • Create: app-deployment/.github/workflow-templates/build-and-update-tag.yml

  • Create: .github/workflows/app-deploy.yml (example for app source repos to copy)

  • Step 1: Write app-deploy.yml

Note: This workflow lives in each application source repo (not app-deployment). The version below is a template; each app repo copies it and adjusts the IMAGE_NAME, DOCKER_CONTEXT, and HELM_VALUES_FILE values.

.github/workflows/app-deploy.yml:

# APP Service CI — Template for each application source repo.
# Copy this file to each app repo's .github/workflows/ and set the env vars below.
#
# Required secrets (set at org level for all app repos):
#   DEPLOY_APP_ID          — GitHub App ID with write access to app-deployment
#   DEPLOY_APP_PRIVATE_KEY — GitHub App private key
name: CI

on:
  pull_request:
  push:
    branches: [develop, main]

permissions:
  id-token: write
  contents: read
  packages: write

env:
  REGISTRY: ghcr.io
  # ── Per-repo config (change these when copying to a new repo) ──
  IMAGE_NAME: app-backend              # GHCR image name
  DOCKER_CONTEXT: .                    # Docker build context path
  HELM_VALUES_FILE: values-backend.yaml # Corresponding file in app-deployment

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and test
        run: echo "Replace with your build/test commands"

  push-image:
    needs: build-and-test
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        uses: docker/build-push-action@v6
        with:
          context: ${{ env.DOCKER_CONTEXT }}
          push: true
          tags: |
            ${{ env.REGISTRY }}/${{ github.repository_owner }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
            ${{ env.REGISTRY }}/${{ github.repository_owner }}/${{ env.IMAGE_NAME }}:latest

  # Cross-repo: update image tag in app-deployment so ArgoCD picks it up
  update-deployment:
    needs: push-image
    runs-on: ubuntu-latest
    steps:
      - name: Generate GitHub App Token
        id: app-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.DEPLOY_APP_ID }}
          private-key: ${{ secrets.DEPLOY_APP_PRIVATE_KEY }}
          repositories: app-deployment

      - name: Checkout app-deployment repo
        uses: actions/checkout@v4
        with:
          repository: ${{ github.repository_owner }}/app-deployment
          token: ${{ steps.app-token.outputs.token }}

      - name: Update image tag
        run: |
          sed -i "s|tag:.*|tag: ${{ github.sha }}|" \
            "helm/app-service/${{ env.HELM_VALUES_FILE }}"

      - name: Commit and push updated tags
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add helm/app-service/values-*.yaml
          git diff --cached --quiet || git commit -m "chore: update image tags to ${{ github.sha }} [skip ci]"
          git push
  • Step 2: Validate workflow syntax
yq '.' .github/workflows/app-deploy.yml > /dev/null && echo "YAML OK" || echo "YAML FAIL"
  • Step 3: Commit
git add .github/workflows/app-deploy.yml
git commit -m "feat: add GitHub Actions workflow for app build and deploy"

Task 13: Dev Environment Bootstrap and Validation

This task is executed manually after all code is committed. It provisions the Dev environment and validates end-to-end.

  • Step 1: Create the Terraform state backend
az group create --name rg-app-tfstate --location westeurope
az storage account create \
  --name stappinfratfstate \
  --resource-group rg-app-tfstate \
  --location westeurope \
  --sku Standard_LRS \
  --min-tls-version TLS1_2
az storage container create \
  --name tfstate \
  --account-name stappinfratfstate
  • Step 2: Initialize and apply Terraform for Dev
cd terraform
terraform init -backend-config="key=app-dev.tfstate"
terraform plan \
  -var-file="environments/dev.tfvars" \
  -var="postgres_admin_password=REPLACE_WITH_SECURE_PASSWORD" \
  -out=dev.tfplan
terraform apply dev.tfplan
  • Step 3: Store postgres admin password in Key Vault
az keyvault secret set \
  --vault-name kv-app-dev \
  --name postgres-admin-password \
  --value "REPLACE_WITH_SECURE_PASSWORD"
  • Step 4: Verify Azure resources exist
az group show --name rg-app-dev --query "properties.provisioningState" -o tsv
# Expected: Succeeded

az aks show --resource-group rg-app-dev --name aks-app-dev --query "provisioningState" -o tsv
# Expected: Succeeded

az postgres flexible-server show --resource-group rg-app-dev --name psql-app-dev --query "state" -o tsv
# Expected: Ready

# Internal storage account
az storage account show --resource-group rg-app-dev --name stappdevint --query "provisioningState" -o tsv
# Expected: Succeeded

# SFTP storage account
az storage account show --resource-group rg-app-dev --name stappdevsftp --query "provisioningState" -o tsv
# Expected: Succeeded

# NAT Gateway
az network nat gateway show --resource-group rg-app-dev --name natgw-app-dev --query "provisioningState" -o tsv
# Expected: Succeeded
  • Step 5: Connect to AKS and apply K8s base config
az aks get-credentials --resource-group rg-app-dev --name aks-app-dev --overwrite-existing
kubectl apply -f k8s/namespaces.yaml
kubectl apply -f k8s/network-policies/
kubectl apply -f k8s/pdbs.yaml
kubectl apply -f k8s/resource-quotas.yaml
kubectl apply -f k8s/rbac/
kubectl apply -f k8s/karpenter/
  • Step 6: Verify namespaces and network policies
kubectl get namespaces app jobs monitoring
# Expected: all Active

kubectl get networkpolicies -n app
# Expected: default-deny-all, allow-app-inter-service, allow-egress-postgres, allow-egress-blob

kubectl get networkpolicies -n jobs
# Expected: default-deny-all, allow-egress-blob, allow-jobs-internal (no postgres — jobs don't access DB)
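
Optionally, spot-check database reachability through the egress policy from a throwaway pod carrying the app label. The admin username follows the Helm values from Task 10; psql will prompt for the password:

PGHOST="$(terraform -chdir=terraform output -raw postgres_server_fqdn)"
kubectl run pg-check -n app --rm -it --restart=Never \
  --labels=app.kubernetes.io/part-of=app \
  --image=postgres:16 -- \
  psql "host=${PGHOST} user=pgadmin dbname=postgres sslmode=require"
# Expected: a psql prompt after entering the password (confirms DNS + allow-egress-postgres)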
  • Step 7: Install observability stack
chmod +x k8s/observability/install.sh
bash k8s/observability/install.sh
  • Step 8: Verify observability pods are running
kubectl get pods -n monitoring
# Expected: prometheus-*, grafana-*, loki-*, alertmanager-*, node-exporter-* all Running

kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &
# Open http://localhost:3000 — Grafana login should appear
kill %1
  • Step 9 (optional, prod only): Enable DDoS Protection Plan
# Uncomment and run for prod environment only:
# az network ddos-protection create \
#   --resource-group rg-app-prod \
#   --name ddos-app-prod \
#   --location westeurope
# az network vnet update \
#   --resource-group rg-app-prod \
#   --name vnet-app-prod \
#   --ddos-protection-plan ddos-app-prod \
#   --ddos-protection true
  • Step 10: Commit README

Create a README.md with quickstart instructions covering the steps above, then:

git add README.md
git commit -m "docs: add README with bootstrap and deployment instructions"