APP Azure Infrastructure — Spec + Implementation Plan (25 reviews, production-ready)

APP Azure Infrastructure Design

Migration of the APP application from on-premises Docker/MicroK8s to Azure Kubernetes Service with managed backing services.

Context

APP is a Java 17 / Vue.js microservices application for document processing with computational job execution. It currently runs on Docker on BTC servers with MicroK8s for orchestration, GitLab CI/CD, Oracle databases, and Minio object storage.

The organization has decided to migrate APP to Azure with a migration-first philosophy: minimize code changes, get running on Azure, then optimize.

Application Modules

| Module | Tech | Port | Purpose |
|---|---|---|---|
| APP-Backend | Java 17 | 8080 | Core backend, serves the Vue.js frontend |
| APP-Frontend | Vue.js | via backend | User-facing SPA |
| APP-Storage | Java 17 | 8081 (/storage/api/v1/...) | Storage microservice |
| APP-Importer | Java 17 | 8082 (/importer/api/v1/...) | File import, forwards to Storage service |

Supporting workloads:

  • Job pipeline: Job-Initializer → ScriptRunner → Job-Collector executing Matlab Runtime and Python scripts in container pods
  • Jobs are primarily user-triggered (upload/click → run), with occasional scheduled batch runs

Architecture: AKS-Centric

All workloads run in AKS. Managed Azure services provide databases, object storage, secrets, and SFTP ingestion. This approach was chosen because:

  • The team already runs MicroK8s with NGINX ingress — the K8s mental model transfers directly
  • The job pipeline requires K8s anyway (Jobs/CronJobs), so splitting workloads across platforms adds unnecessary complexity
  • At the expected scale (~1k-5k daily users), PaaS alternatives offer no meaningful advantage over AKS

AKS Cluster (one per environment)

Three static node pools plus Karpenter-managed job nodes:

| Node Pool | Workloads | Scaling |
|---|---|---|
| System | NGINX Ingress (App Routing Add-on), cert-manager, External Secrets Operator | Fixed |
| App | APP Backend, Frontend, Storage Svc, Importer Svc | Fixed or mild HPA |
| Monitoring | Grafana, Prometheus, Loki, Alloy (tainted: dedicated=monitoring:NoSchedule) | Fixed |

Job Execution: AKS Node Auto-Provisioning (Karpenter)

User-uploaded scripts (Matlab, Python, etc.) have unpredictable resource needs — from 50MB/2vCPU to 500GB/80vCPU. Static node pools cannot serve this range. Instead, AKS Node Auto-Provisioning (Karpenter) dynamically provisions right-sized VMs based on each job pod's resources.requests.

  • Allowed VM families: D-series (balanced compute), E-series (memory-optimized)
  • Max per node: 96 vCPU, 672 GiB RAM (E96as_v5 ceiling)
  • Scale from zero: nodes are provisioned on demand and terminated when idle (consolidation policy)
  • Taint: workload=job:NoSchedule — only job pods with matching toleration schedule here
  • Resource requests: set per job, either by user input (explicit CPU/RAM) or system estimation — application-level decision, not infrastructure
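
A minimal sketch of the job NodePool that k8s/karpenter/job-nodepool.yaml could contain, assuming AKS Node Auto-Provisioning exposes the upstream karpenter.sh NodePool API; the sku-family requirement key, the AKSNodeClass reference, and the limits values are illustrative and should be verified against the AKS NAP documentation and the actual CRDs in the repo.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: jobs
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: jobs                              # defined in job-nodeclass.yaml
      taints:
        - key: workload
          value: job
          effect: NoSchedule
      requirements:
        - key: karpenter.azure.com/sku-family   # assumed Azure-specific label for VM families
          operator: In
          values: ["D", "E"]
  limits:
    cpu: "384"                                  # caps total capacity of the whole pool, not one node (tune to expected concurrency)
    memory: 2688Gi
  disruption:
    consolidationPolicy: WhenEmpty              # release nodes once jobs finish (scale back to zero)
    consolidateAfter: 5m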

Ingress: AKS App Routing Add-on (Managed NGINX)

Azure-managed NGINX ingress controller. Chosen because:

  • Same NGINX Ingress syntax the team already uses on MicroK8s — configs transfer as-is
  • Azure manages the controller lifecycle (upgrades, scaling, HA)
  • Free — included with AKS
  • Auto-integrates with Azure DNS and Key Vault for TLS certificates
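
For illustration, a minimal Ingress showing how an existing MicroK8s NGINX config maps onto the add-on; the only AKS-specific change is the ingressClassName created by App Routing. Host names, service names, and the example annotation are placeholders.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: app
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: 100m   # existing NGINX annotations carry over as-is
spec:
  ingressClassName: webapprouting.kubernetes.azure.com  # IngressClass created by the App Routing add-on
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-backend
                port:
                  number: 8080
          - path: /storage
            pathType: Prefix
            backend:
              service:
                name: app-storage
                port:
                  number: 8081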

Observability: Grafana + Prometheus + Loki

The existing Grafana + Prometheus stack migrates into AKS on a dedicated monitoring node pool (tainted dedicated=monitoring:NoSchedule). Loki is added for log aggregation, Alloy for log collection. Azure Container Insights (basic) supplements with API server logs in Azure Portal.

| Component | Purpose | Deployment |
|---|---|---|
| Prometheus | Metrics scraping + alerting | AKS monitoring pool |
| Grafana | Dashboards | AKS monitoring pool |
| Loki | Log aggregation (Azure Blob backend) | AKS monitoring pool |
| Alloy | Log collection (DaemonSet) | All nodes |
| Container Insights | API server + audit logs | Azure Log Analytics |
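
As an illustration of the Azure Blob backend mentioned above, a fragment of what monitoring/loki-values.yaml might contain, assuming the grafana/loki Helm chart; the exact key names and the workload-identity flag differ between chart versions and must be verified.

loki:
  storage:
    type: azure
    bucketNames:
      chunks: loki-chunks            # container created by the storage module
    azure:
      accountName: <internal-storage-account>
      useFederatedToken: true        # authenticate via AKS Workload Identity (assumed flag)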

Networking & Security

VNet Layout: 10.0.0.0/16 (West Europe)

| Subnet | CIDR | Purpose | Access Control |
|---|---|---|---|
| aks-subnet | 10.0.0.0/20 | All AKS node pools | NSG: 80/443 from Internet + LB health probes |
| db-subnet | 10.0.16.0/24 | PostgreSQL Flexible Server (VNet-integrated) | NSG: 5432 from aks-subnet only |
| storage-pe-subnet | 10.0.17.0/24 | Blob Storage private endpoint | NSG: from aks-subnet only |
| sftp-subnet | 10.0.18.0/24 | Blob SFTP endpoint (public-facing) | NSG: port 22 from whitelisted IPs only |

Traffic Flow

Internet → Azure Load Balancer → NGINX Ingress (App Routing) → K8s Services
                                                                    ↓
                                                    Backend / Storage / Importer
                                                         ↓           ↓
                                                    PostgreSQL    Blob Storage

All managed services connect via private endpoints or VNet integration. No public IPs except the Ingress load balancer. SFTP endpoint is public but restricted to whitelisted source IPs.

Encryption

  • TLS 1.2+ on all connections (ingress, database, blob storage)
  • PostgreSQL: SSL enforced
  • Blob Storage: AES-256 encryption at rest (Azure-managed keys)
  • AKS: etcd encrypted at rest

Identity & Access

  • AKS Workload Identity for all service-to-Azure authentication (Key Vault, Blob Storage, PostgreSQL)
  • No passwords or connection strings in environment variables or ConfigMaps
  • Azure RBAC for cluster administrator access
  • Separate K8s namespaces per application concern (app, jobs, monitoring)
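
To make the "no connection strings in env vars" point concrete, a sketch of how a service identity is wired under Workload Identity, assuming a user-assigned managed identity is federated with the backend's ServiceAccount; names and the client ID are placeholders.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: app
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"
---
# The pod template then opts in with a label (typically set via Helm values):
#   metadata:
#     labels:
#       azure.workload.identity/use: "true"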

Network Policies

  • Default deny all ingress and egress
  • Explicit allow-list:
    • Backend ↔ Storage Svc ↔ Importer Svc (inter-service)
    • All services → PostgreSQL (5432)
    • Importer + Storage Svc → Blob Storage
    • Job pods: cluster-internal only (deny internet unless specifically required)
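
A sketch of the default-deny policy and one allow rule from the list above; the namespace and pod label selectors are assumptions, and the real manifests live under k8s/network-policies/ in app-deployment.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: app
spec:
  podSelector: {}                    # applies to every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-backend-to-storage
  namespace: app
spec:
  podSelector:
    matchLabels:
      app: app-storage
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: app-backend
      ports:
        - protocol: TCP
          port: 8081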

SFTP Access

  • Azure Blob Storage SFTP endpoint with local user accounts
  • SSH key authentication only (no passwords)
  • IP whitelist via NSG on sftp-subnet

Data & Storage

PostgreSQL Flexible Server

Single Azure Database for PostgreSQL Flexible Server instance per environment, hosting three databases:

| Database | Schemas | Used By |
|---|---|---|
| app_db | APP_ADMIN, APP_USER | Backend service |
| storage_db | STORAGE_ADMIN, STORAGE_USER | Storage service |
| importer_db | IMPORTER_ADMIN, IMPORTER_USER | Importer service |

Sizing (start small, scale via tfvars):

| Environment | SKU | Backup Retention | Redundancy |
|---|---|---|---|
| Dev | Burstable B1ms | 7 days | None |
| PreProd | Burstable B1ms | 7 days | None |
| Prod | Burstable B1ms | 35 days | None (geo-redundant backup requires GP tier) |

Scale-up path: change postgres_sku_name to GP_Standard_D2s_v3 and enable postgres_geo_redundant_backup — both require General Purpose tier.

Oracle → PostgreSQL migration:

  • Liquibase changelogs require one-time review for Oracle-specific SQL (sequences, data types, PL/SQL)
  • JPA/Hibernate dialect switch from Oracle12cDialect to PostgreSQLDialect
  • Liquibase migrations run on each service deployment as a Helm pre-upgrade hook Job (replacing today's init-container pattern; see Key Decisions under CI/CD)
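
As a sketch of the dialect and datasource switch, an application.yml fragment might look like the following; the actual property names, database names, and changelog path are listed as assumptions to validate in the implementation plan below.

spring:
  datasource:
    url: jdbc:postgresql://<postgres-fqdn>:5432/app_db?sslmode=require
    username: ${DB_USERNAME}
    password: ${DB_PASSWORD}
  jpa:
    database-platform: org.hibernate.dialect.PostgreSQLDialect   # was Oracle12cDialect
  liquibase:
    change-log: classpath:/liquibase/changelog.xml               # assumed path, verify against the image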

Azure Blob Storage

S3-compatibility layer enabled initially for migration-first approach — existing Java S3 client code works with an endpoint swap. Native Azure Blob SDK adoption deferred to a later phase.

| Container | Purpose | Access |
|---|---|---|
| sftp-ingest | Landing zone for SFTP uploads | SFTP users write, Importer reads |
| shared-storage | Internal object storage (replaces Minio) | Storage Svc + Importer read/write |
| job-artifacts | Matlab/Python job inputs and outputs | Job pods read/write |

Storage redundancy:

  • All environments: LRS (locally redundant) — upgrade Prod to ZRS when scale justifies it

No automatic deletion lifecycle policies. Data is retained until explicitly removed.

Repository Structure

Repositories are split by concern and change cadence. The application consists of multiple existing source repos (not listed here — they predate this infrastructure design). This section defines the infrastructure and deployment repos, plus the integration contract that each application repo must follow.

Infrastructure & Deployment Repos (created by this plan)

| Repo | Purpose | ArgoCD | GH Actions | Owner |
|---|---|---|---|---|
| app-infrastructure | Terraform (Azure resources) | No | Plan/apply | Infra |
| app-deployment | Helm charts + K8s manifests + monitoring (GitOps state) | Yes | Helm lint on PR | Shared |
| app-job-images | Matlab Runtime + Python runner Dockerfiles | No | Build runner images | App/Data |

Application Source Repos (existing, not managed by this plan)

The application consists of multiple existing repos. Each repo that produces a deployable container image must follow the integration contract below. The exact repo list should be filled in during onboarding.

| Repo | Image(s) Produced | Helm Values File | Notes |
|---|---|---|---|
| <TBD: backend repo> | app-backend | values-backend.yaml | Serves Vue.js frontend |
| <TBD: storage repo> | app-storage | values-storage.yaml | |
| <TBD: importer repo> | app-importer | values-importer.yaml | |
| <TBD: additional repos> | <image-name> | values-<name>.yaml | Add rows as needed |

Integration Contract for Application Repos

Each application source repo must:

  1. Build an OCI image and push to GHCR tagged with the git SHA
  2. Update its image tag in app-deployment via cross-repo push (using a GitHub App token with write access to app-deployment)
  3. Include [skip ci] in the commit message to avoid triggering the app-deployment lint workflow
  4. Own its Helm values file in app-deployment (e.g., values-backend.yaml) — this is where service-specific config lives (ports, env vars, resource requests, Liquibase config)
  5. Follow the branch strategy: develop → Dev, main → PreProd, manual ArgoCD sync → Prod

A reusable GitHub Actions workflow template for steps 1-3 should be provided in app-deployment under .github/workflow-templates/ for application repos to adopt.
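
A hedged sketch of what that reusable template could look like; the action versions, the DEPLOYMENT_REPO_TOKEN secret name, the org placeholder, and the .image.tag key in the values file are assumptions to adjust per repo.

name: build-and-update-tag
on:
  push:
    branches: [develop, main]
env:
  IMAGE_NAME: app-backend                      # per-repo value
  HELM_VALUES_FILE: values-backend.yaml        # per-repo value
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/${{ github.repository_owner }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
      - name: Update image tag in app-deployment
        env:
          GH_TOKEN: ${{ secrets.DEPLOYMENT_REPO_TOKEN }}   # GitHub App token with write access (assumed secret name)
        run: |
          git clone "https://x-access-token:${GH_TOKEN}@github.com/<org>/app-deployment.git"
          cd app-deployment
          yq -i ".image.tag = \"${GITHUB_SHA}\"" "helm/app-service/${HELM_VALUES_FILE}"
          git config user.name "ci-bot" && git config user.email "ci-bot@example.com"
          git commit -am "chore(${IMAGE_NAME}): bump image tag to ${GITHUB_SHA} [skip ci]"
          git push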

CI/CD: GitHub Enterprise + Actions + ArgoCD

Pipeline Flow

App source repo: Push → Build & Test → Build OCI Image → Push to GHCR → Cross-repo update tag in app-deployment
app-deployment repo: ArgoCD detects tag change → syncs to AKS
app-infrastructure repo: Terraform plan/apply (separate lifecycle)

Environment Promotion

| Trigger | Target |
|---|---|
| Push to feature branch (any app repo) | Build + test only (no deploy) |
| Merge to develop (any app repo) | Push image, update tag in app-deployment → ArgoCD auto-syncs Dev |
| Merge to main (any app repo) | Push image, update tag in app-deployment → ArgoCD auto-syncs PreProd |
| Manual sync in ArgoCD | Promote to Prod |

Key Decisions

  • Container registry: GitHub Container Registry (GHCR), images tagged with git SHA
  • Deployment method: ArgoCD watches app-deployment repo for Helm value changes and auto-syncs. Application repos build images and push updated tags to app-deployment via GitHub App token (cross-repo).
  • Secrets: GitHub Actions OIDC → Azure Workload Identity Federation (no stored Azure credentials in GitHub)
  • Database migrations: Liquibase runs as a Helm pre-upgrade hook Job (not an init container) to prevent race conditions in multi-replica deploys; see the sketch after this list
  • Job RBAC: Backend service account has Role + RoleBinding to create/manage K8s Jobs in jobs namespace
  • Job container images: Matlab Runtime and Python runner images in separate app-job-images repo (different build lifecycle)
  • Branch strategy: develop + main (matches current team workflow)
  • Adding new services: create a new values-<name>.yaml in app-deployment, add an ArgoCD Application manifest, and wire the source repo's CI to push image tags — no infrastructure changes needed
  • K8s backup: Velero with daily scheduled backups to Azure Blob Storage (168h retention)
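
For the Liquibase decision above, a sketch of the pre-upgrade hook Job that templates/migration-job.yaml implies; the helper names, the liquibase entrypoint, and the database secret key are assumptions to confirm against the real chart and images.

apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "app-service.fullname" . }}-migrations
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "0"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: liquibase
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          command: ["liquibase", "update"]               # assumed; the service image may run migrations differently
          envFrom:
            - secretRef:
                name: {{ .Values.database.secretName }}  # assumed values key for DB credentials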

Environments & Cost

Node Sizing (start small, scale via tfvars)

Dev and PreProd start with single-node pools. Prod runs 2 nodes each in the system and app pools for zero-downtime maintenance. Scale up by changing tfvars.

| | Dev | PreProd | Prod |
|---|---|---|---|
| AKS control plane | Standard (required by Karpenter) | Standard | Standard |
| AKS system pool | 1x Standard_D2s_v5 | 1x Standard_D2s_v5 | 2x Standard_D2s_v5 |
| AKS app pool | 1x Standard_D2s_v5 | 1x Standard_D2s_v5 | 2x Standard_D2s_v5 |
| AKS monitoring pool | 1x Standard_D2s_v5 | 1x Standard_D2s_v5 | 1x Standard_D2s_v5 |
| AKS job nodes | Karpenter (0→N on demand) | Karpenter (0→N) | Karpenter (0→N) |
| PostgreSQL | Burstable B1ms | Burstable B1ms | Burstable B1ms |
| Blob Storage | LRS | LRS | ZRS |
| Key Vault | Standard | Standard | Standard |

Job worker pods get their own PVCs via job-scratch StorageClass (StandardSSD_LRS, WaitForFirstConsumer binding for zone-awareness with Karpenter).
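
A sketch of that job-scratch StorageClass, using the Azure Disk CSI driver; parameter names match the in-tree driver defaults and should hold on any recent AKS version.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: job-scratch
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_LRS
reclaimPolicy: Delete                    # scratch data is disposable
volumeBindingMode: WaitForFirstConsumer  # bind only after Karpenter has placed the pod in a zone
allowVolumeExpansion: true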

Monthly Cost Estimate (West Europe, pay-as-you-go)

| Component | Dev | PreProd | Prod |
|---|---|---|---|
| AKS nodes (system+app+monitoring) | ~€120 | ~€120 | ~€180 |
| AKS control plane (Standard) | ~€60 | ~€60 | ~€60 |
| PostgreSQL | ~€15 | ~€15 | ~€25 |
| Blob Storage | ~€5 | ~€5 | ~€12 |
| Key Vault | ~€5 | ~€5 | ~€5 |
| Load Balancer | ~€20 | ~€20 | ~€20 |
| NAT Gateway | ~€40 | ~€40 | ~€40 |
| Log Analytics (basic) | ~€10 | ~€10 | ~€10 |
| Total | ~€275/mo | ~€275/mo | ~€352/mo |

Notes:

  • Current scale target: ~50-100 daily users, built ready to scale
  • Job pool nodes only incur cost when jobs are running (auto-scale from zero)
  • Grafana/Prometheus/Loki run in-cluster on the monitoring pool, so there is no extra Azure service cost beyond the node itself
  • GHCR storage is included with GitHub Enterprise
  • Scale-up path: change SKU/node counts in tfvars, terraform apply — no re-architecture needed
  • When scaling: add nodes, move PostgreSQL to General Purpose, enable geo-redundant backup

Out of Scope

  • Application code changes beyond Oracle→PostgreSQL dialect and Minio→Blob endpoint swap
  • Azure Blob SDK migration (deferred — S3-compat layer used initially)
  • WAF (Web Application Firewall) — can be added later via Azure Front Door if needed
  • Multi-region / disaster recovery beyond geo-redundant database backups
  • User authentication / identity provider integration (assumed handled at application level)

APP Azure Infrastructure Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Provision Azure infrastructure for the APP application migration from on-prem Docker/MicroK8s to AKS with managed backing services, deployed via Terraform, Helm, and ArgoCD.

Architecture: AKS-Centric with dedicated node pools (system, app, monitoring) + Karpenter NAP for jobs, managed PostgreSQL Flexible Server, Azure Blob Storage (dual accounts), ArgoCD GitOps deployment. Three environments: Dev, PreProd, Prod.

Tech Stack: Terraform (AzureRM provider), Helm 3, ArgoCD, GitHub Actions, Azure CLI (az)

Spec: docs/superpowers/specs/2026-04-13-hera-azure-infrastructure-design.md

Reviewed: 6 rounds × 4 cloud specialists + 1 backend architect = 25 reviews, 40+ fixes applied.


Handover Instructions

This plan was designed WITHOUT access to the application source code. The infrastructure side (Tasks 1-9) is complete and reviewed. The application-facing side (Tasks 10-13) uses assumptions that need validation against the real codebase.

What to execute as-is

Tasks 1-9 (infrastructure + K8s base config) are infra-only — they create Azure resources, networking, and K8s manifests with no dependency on application code. These have been through 25 specialist reviews and can be executed immediately:

  • Task 1: Terraform scaffolding (app-infrastructure repo)
  • Task 2: Networking module (VNet, subnets, NSGs, NAT Gateway)
  • Task 3: Database module (PostgreSQL Flexible Server)
  • Task 4: Storage module (dual Blob Storage accounts)
  • Task 5: Key Vault module
  • Task 6: AKS module (cluster, node pools, Karpenter NAP)
  • Task 7: Root module wiring + environment tfvars
  • Task 8: K8s base config (namespaces, network policies, RBAC, Karpenter CRDs)
  • Task 9: Observability stack (Prometheus, Grafana, Loki, Alloy, Velero)

What needs adaptation after reviewing source code

Tasks 10-13 need adjustment based on the actual application repos:

  • Task 10 (Helm chart): Verify service ports match reality (assumed 8080/8081/8082). Check health endpoint paths (/actuator/health/* assumed — may differ). Adjust resource requests based on actual service profiles. Map Liquibase changelog paths to real locations in the Docker images.
  • Task 11 (Terraform CI): Execute as-is — no app dependency.
  • Task 12 (App CI template): Adapt the workflow template to each real source repo. Fill in actual IMAGE_NAME, DOCKER_CONTEXT, and HELM_VALUES_FILE values. Add repo-specific build/test commands.
  • Task 13 (Bootstrap): Execute after Tasks 1-9 are applied. App deployment validation (Helm install) depends on Task 10 adjustments.

Assumptions to validate against source code

| Assumption | Where used | What to check |
|---|---|---|
| Services listen on 8080, 8081, 8082 | Helm values, network policies | Actual ports in Dockerfiles / Spring Boot config |
| Health endpoints at /actuator/health/liveness and /readiness | deployment.yaml | Actual health check paths |
| Spring Boot with SPRING_DATASOURCE_URL env var | Helm values | Actual env var names for DB connection |
| Liquibase changelogs at /liquibase/changelog.xml | migration-job.yaml | Actual changelog location in Docker image |
| Backend creates K8s Jobs for script execution | RBAC, network policies | How job creation actually works in the code |
| Frontend bundled in backend Docker image | No separate frontend deployment | Verify — may need separate Helm values |
| 3 databases: app_db, storage_db, importer_db | Terraform database module | Actual database names and schema requirements |
| Services communicate via HTTP (Backend → Storage, Importer → Storage) | Network policies | Actual inter-service call patterns |

Adding new services

When onboarding an additional application repo:

  1. Add a values-<name>.yaml in app-deployment/helm/app-service/
  2. Add an ArgoCD Application manifest in app-deployment/argocd/ (see the sketch after this list)
  3. Copy the CI workflow template to the source repo, set 3 env vars
  4. If the service needs DB access, add a database in the Terraform database module and a network policy egress rule
  5. No other infrastructure changes needed
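
Step 2 amounts to an ArgoCD Application like the following sketch; the repo URL, project, and destination namespace are placeholders, and syncPolicy.automated applies only to Dev/PreProd since Prod is promoted by manual sync.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-<name>
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<org>/app-deployment.git
    targetRevision: main
    path: helm/app-service
    helm:
      valueFiles:
        - values.yaml
        - values-<name>.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: app
  syncPolicy:
    automated:
      prune: true
      selfHeal: true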

Repository Split

| Repo | Purpose | ArgoCD | GH Actions | Owner |
|---|---|---|---|---|
| app-infrastructure | Terraform (Azure resources) | No | Plan/apply | Infra |
| app | Java 17 + Vue.js source code | No | Build, test, push images | App |
| app-deployment | Helm + K8s manifests + monitoring (GitOps state) | Yes | Helm lint on PR | Shared |
| app-job-images | Matlab Runtime + Python runner Dockerfiles | No | Build runner images | App/Data |

File Structures

Repo 1: app-infrastructure

app-infrastructure/
├── terraform/
│   ├── main.tf                      # Root module — wires all child modules
│   ├── variables.tf                 # Root input variables
│   ├── outputs.tf                   # Root outputs
│   ├── providers.tf                 # AzureRM + AzAPI provider config
│   ├── versions.tf                  # Required providers + Terraform version
│   ├── backend.tf                   # Azure Storage remote state backend (partial config)
│   ├── environments/
│   │   ├── dev.tfvars
│   │   ├── preprod.tfvars
│   │   └── prod.tfvars
│   └── modules/
│       ├── networking/              # VNet, subnets, NSGs, NAT Gateway, private DNS
│       ├── database/                # PostgreSQL Flexible Server, databases
│       ├── storage/                 # Dual Blob Storage (internal + SFTP), private endpoint
│       ├── keyvault/                # Key Vault + private endpoint
│       └── aks/                     # AKS cluster, node pools, NAP, Container Insights
├── .github/workflows/
│   └── terraform.yml                # Plan on PR, apply on merge
└── README.md

Repo 2: Application Source Repos (existing — NOT created by this plan)

The application consists of multiple existing repos (backend, frontend, storage, importer, etc.). Each repo that produces a deployable container image must adopt the integration contract:

  1. Build OCI image → push to GHCR tagged with git SHA
  2. Cross-repo push updated image tag to app-deployment (via GitHub App token)
  3. Include [skip ci] in the tag-update commit

A reusable workflow template is provided in app-deployment/.github/workflow-templates/ for adoption.

Fill in the actual repo names and image mappings during onboarding:

| Repo | Image(s) | Helm Values File |
|---|---|---|
| <TBD> | app-backend | values-backend.yaml |
| <TBD> | app-storage | values-storage.yaml |
| <TBD> | app-importer | values-importer.yaml |
| ... | ... | ... |

Repo 3: app-deployment (ArgoCD watches this)

app-deployment/
├── helm/
│   └── app-service/
│       ├── Chart.yaml
│       ├── values.yaml
│       ├── values-backend.yaml
│       ├── values-storage.yaml
│       ├── values-importer.yaml
│       └── templates/
│           ├── _helpers.tpl
│           ├── deployment.yaml
│           ├── service.yaml
│           ├── ingress.yaml
│           ├── hpa.yaml
│           ├── serviceaccount.yaml
│           └── migration-job.yaml   # Liquibase as pre-upgrade hook
├── k8s/
│   ├── namespaces.yaml
│   ├── pdbs.yaml
│   ├── resource-quotas.yaml
│   ├── rbac/
│   │   └── job-creator.yaml         # Role + RoleBinding for Backend → jobs namespace
│   ├── karpenter/
│   │   ├── job-nodepool.yaml
│   │   ├── job-nodeclass.yaml
│   │   └── job-storageclass.yaml
│   └── network-policies/
│       ├── default-deny.yaml
│       ├── app-services.yaml
│       ├── db-access.yaml
│       ├── blob-access.yaml
│       ├── jobs.yaml
│       └── monitoring.yaml
├── monitoring/
│   ├── prometheus-values.yaml
│   ├── loki-values.yaml
│   └── install.sh
├── argocd/
│   ├── app-backend.yaml             # ArgoCD Application per service
│   ├── app-storage.yaml
│   ├── app-importer.yaml
│   ├── base-config.yaml             # K8s base manifests (namespaces, netpol, etc.)
│   └── monitoring.yaml              # Observability stack
├── .github/
│   ├── workflows/
│   │   └── lint.yml                 # Helm lint on PR
│   └── workflow-templates/
│       └── build-and-update-tag.yml # Reusable workflow for app source repos
└── README.md

Repo 4: app-job-images

app-job-images/
├── matlab-runner/
│   └── Dockerfile
├── python-runner/
│   ├── Dockerfile
│   └── requirements.txt
├── .github/workflows/
│   └── build-runners.yml            # Build + push to GHCR on change
└── README.md

Task-to-repo mapping: Tasks 1-7 → app-infrastructure. Tasks 8-10 → app-deployment. Task 11 → app-infrastructure. Task 12 → app-deployment. Task 13 → cross-repo bootstrap.


Task 1: Repository Scaffolding and Terraform Bootstrap

Files:

  • Create: terraform/versions.tf

  • Create: terraform/providers.tf

  • Create: terraform/backend.tf

  • Create: terraform/variables.tf

  • Create: terraform/main.tf (empty root, populated in Task 7)

  • Create: terraform/outputs.tf (empty root, populated in Task 7)

  • Create: .gitignore

  • Step 1: Create the repository and .gitignore

mkdir -p app-infrastructure/terraform/modules app-infrastructure/terraform/environments
cd app-infrastructure
git init

.gitignore:

# Terraform
*.tfstate
*.tfstate.*
.terraform/
crash.log
override.tf
override.tf.json
*_override.tf
*_override.tf.json
*.tfplan

# IDE
.idea/
.vscode/
*.swp

# OS
.DS_Store
  • Step 2: Write versions.tf

terraform/versions.tf:

terraform {
  required_version = ">= 1.9.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
    azapi = {
      source  = "azure/azapi"
      version = "~> 2.0"
    }
  }
}
  • Step 3: Write providers.tf

terraform/providers.tf:

provider "azurerm" {
  features {
    key_vault {
      purge_soft_delete_on_destroy = false
    }
    resource_group {
      prevent_deletion_if_contains_resources = true
    }
  }

  subscription_id = var.subscription_id
}

provider "azapi" {}
  • Step 4: Write backend.tf

terraform/backend.tf:

# Partial backend configuration. The state file key is passed per-environment
# at init time via: terraform init -backend-config "key=app-${ENV}.tfstate"
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-app-tfstate"
    storage_account_name = "stappinfratfstate"
    container_name       = "tfstate"
  }
}
  • Step 5: Write root variables.tf with shared variables

terraform/variables.tf:

variable "subscription_id" {
  description = "Azure subscription ID"
  type        = string
}

variable "environment" {
  description = "Environment name: dev, preprod, or prod"
  type        = string
  validation {
    condition     = contains(["dev", "preprod", "prod"], var.environment)
    error_message = "Environment must be dev, preprod, or prod."
  }
}

variable "location" {
  description = "Azure region"
  type        = string
  default     = "westeurope"
}

variable "project_name" {
  description = "Project identifier used in resource naming"
  type        = string
  default     = "app"
}

variable "sftp_allowed_ips" {
  description = "List of IP addresses allowed to connect via SFTP"
  type        = list(string)
  default     = []
}

variable "aks_system_pool_count" {
  description = "Number of nodes in the AKS system pool"
  type        = number
  default     = 1
}

variable "aks_app_pool_count" {
  description = "Number of nodes in the AKS app pool"
  type        = number
  default     = 1
}

variable "aks_app_pool_max_count" {
  description = "Max nodes for AKS app pool autoscaler (0 = no autoscaling)"
  type        = number
  default     = 0
}

variable "node_auto_provisioning_enabled" {
  description = "Enable Karpenter-based Node Auto-Provisioning for job workloads"
  type        = bool
  default     = true
}

variable "postgres_sku_name" {
  description = "PostgreSQL Flexible Server SKU"
  type        = string
  default     = "B_Standard_B1ms"
}

variable "postgres_backup_retention_days" {
  description = "PostgreSQL backup retention in days"
  type        = number
  default     = 7
}

variable "postgres_geo_redundant_backup" {
  description = "Enable geo-redundant backup for PostgreSQL"
  type        = bool
  default     = false
}

variable "storage_replication_type" {
  description = "Blob Storage replication: LRS or ZRS"
  type        = string
  default     = "LRS"
  validation {
    condition     = contains(["LRS", "ZRS"], var.storage_replication_type)
    error_message = "Must be LRS or ZRS."
  }
}
  • Step 6: Create empty root main.tf and outputs.tf

terraform/main.tf:

# Root module — child modules wired in Task 7
locals {
  resource_prefix = "${var.project_name}-${var.environment}"
  common_tags = {
    project     = var.project_name
    environment = var.environment
    managed_by  = "terraform"
  }
}

terraform/outputs.tf:

# Root outputs — populated in Task 7
  • Step 7: Validate the bootstrap
cd terraform
terraform init -backend=false && terraform validate && terraform fmt -check -recursive

Expected: all pass with no errors.

  • Step 8: Commit
git add .
git commit -m "feat: scaffold terraform project with providers, backend, and root variables"

Task 2: Networking Module

Files:

  • Create: terraform/modules/networking/main.tf

  • Create: terraform/modules/networking/variables.tf

  • Create: terraform/modules/networking/outputs.tf

  • Step 1: Write networking variables.tf

terraform/modules/networking/variables.tf:

variable "resource_prefix" {
  description = "Prefix for resource names"
  type        = string
}

variable "location" {
  description = "Azure region"
  type        = string
}

variable "resource_group_name" {
  description = "Name of the resource group"
  type        = string
}

variable "vnet_address_space" {
  description = "VNet address space"
  type        = list(string)
  default     = ["10.0.0.0/16"]
}

variable "sftp_allowed_ips" {
  description = "IP addresses allowed to connect via SFTP"
  type        = list(string)
  default     = []
}

variable "nat_gateway_enabled" {
  description = "Enable NAT Gateway for AKS subnet egress"
  type        = bool
  default     = true
}

variable "tags" {
  description = "Resource tags"
  type        = map(string)
  default     = {}
}
  • Step 2: Write networking main.tf

terraform/modules/networking/main.tf:

locals {
  aks_subnet_cidr        = "10.0.0.0/20"
  db_subnet_cidr         = "10.0.16.0/24"
  storage_pe_subnet_cidr = "10.0.17.0/24"
  sftp_subnet_cidr       = "10.0.18.0/24"
}

resource "azurerm_virtual_network" "main" {
  name                = "vnet-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  address_space       = var.vnet_address_space
  tags                = var.tags
}

# --- Subnets ---

resource "azurerm_subnet" "aks" {
  name                 = "snet-aks"
  resource_group_name  = var.resource_group_name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = [local.aks_subnet_cidr]
}

resource "azurerm_subnet" "db" {
  name                 = "snet-db"
  resource_group_name  = var.resource_group_name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = [local.db_subnet_cidr]

  delegation {
    name = "postgresql-delegation"
    service_delegation {
      name    = "Microsoft.DBforPostgreSQL/flexibleServers"
      actions = ["Microsoft.Network/virtualNetworks/subnets/join/action"]
    }
  }
}

resource "azurerm_subnet" "storage_pe" {
  name                              = "snet-storage-pe"
  resource_group_name               = var.resource_group_name
  virtual_network_name              = azurerm_virtual_network.main.name
  address_prefixes                  = [local.storage_pe_subnet_cidr]
  private_endpoint_network_policies = "Disabled"
}

resource "azurerm_subnet" "sftp" {
  name                 = "snet-sftp"
  resource_group_name  = var.resource_group_name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = [local.sftp_subnet_cidr]
}

# --- NAT Gateway (for AKS egress) ---

resource "azurerm_public_ip" "nat" {
  count               = var.nat_gateway_enabled ? 1 : 0
  name                = "pip-nat-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  allocation_method   = "Static"
  sku                 = "Standard"
  tags                = var.tags
}

resource "azurerm_nat_gateway" "main" {
  count                   = var.nat_gateway_enabled ? 1 : 0
  name                    = "natgw-${var.resource_prefix}"
  location                = var.location
  resource_group_name     = var.resource_group_name
  sku_name                = "Standard"
  idle_timeout_in_minutes = 10
  tags                    = var.tags
}

resource "azurerm_nat_gateway_public_ip_association" "main" {
  count                = var.nat_gateway_enabled ? 1 : 0
  nat_gateway_id       = azurerm_nat_gateway.main[0].id
  public_ip_address_id = azurerm_public_ip.nat[0].id
}

resource "azurerm_subnet_nat_gateway_association" "aks" {
  count          = var.nat_gateway_enabled ? 1 : 0
  subnet_id      = azurerm_subnet.aks.id
  nat_gateway_id = azurerm_nat_gateway.main[0].id
}

# --- NSGs ---

resource "azurerm_network_security_group" "aks" {
  name                = "nsg-aks-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  tags                = var.tags

  security_rule {
    name                       = "AllowWebInbound"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_ranges    = ["80", "443"]
    source_address_prefix      = "Internet"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "AllowLBProbes"
    priority                   = 110
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "*"
    source_port_range          = "*"
    destination_port_range     = "*"
    source_address_prefix      = "AzureLoadBalancer"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "AllowKubeletAPI"
    priority                   = 120
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "10250"
    source_address_prefix      = "VirtualNetwork"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "AllowNodePortRange"
    priority                   = 130
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "30000-32767"
    source_address_prefix      = "AzureLoadBalancer"
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "DenyAllInbound"
    priority                   = 4096
    direction                  = "Inbound"
    access                     = "Deny"
    protocol                   = "*"
    source_port_range          = "*"
    destination_port_range     = "*"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

resource "azurerm_network_security_group" "db" {
  name                = "nsg-db-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  tags                = var.tags

  security_rule {
    name                       = "AllowPostgresFromAKS"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "5432"
    source_address_prefix      = local.aks_subnet_cidr
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "DenyAllInbound"
    priority                   = 4096
    direction                  = "Inbound"
    access                     = "Deny"
    protocol                   = "*"
    source_port_range          = "*"
    destination_port_range     = "*"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

resource "azurerm_network_security_group" "storage_pe" {
  name                = "nsg-storage-pe-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  tags                = var.tags

  security_rule {
    name                       = "AllowFromAKS"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "443"
    source_address_prefix      = local.aks_subnet_cidr
    destination_address_prefix = "*"
  }

  security_rule {
    name                       = "DenyAllInbound"
    priority                   = 4096
    direction                  = "Inbound"
    access                     = "Deny"
    protocol                   = "*"
    source_port_range          = "*"
    destination_port_range     = "*"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

resource "azurerm_network_security_group" "sftp" {
  name                = "nsg-sftp-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  tags                = var.tags

  dynamic "security_rule" {
    for_each = length(var.sftp_allowed_ips) > 0 ? [1] : []
    content {
      name                       = "AllowSFTPFromWhitelist"
      priority                   = 100
      direction                  = "Inbound"
      access                     = "Allow"
      protocol                   = "Tcp"
      source_port_range          = "*"
      destination_port_range     = "22"
      source_address_prefixes    = var.sftp_allowed_ips
      destination_address_prefix = "*"
    }
  }

  security_rule {
    name                       = "DenyAllInbound"
    priority                   = 4096
    direction                  = "Inbound"
    access                     = "Deny"
    protocol                   = "*"
    source_port_range          = "*"
    destination_port_range     = "*"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

# --- NSG Associations ---

resource "azurerm_subnet_network_security_group_association" "aks" {
  subnet_id                 = azurerm_subnet.aks.id
  network_security_group_id = azurerm_network_security_group.aks.id
}

resource "azurerm_subnet_network_security_group_association" "db" {
  subnet_id                 = azurerm_subnet.db.id
  network_security_group_id = azurerm_network_security_group.db.id
}

resource "azurerm_subnet_network_security_group_association" "storage_pe" {
  subnet_id                 = azurerm_subnet.storage_pe.id
  network_security_group_id = azurerm_network_security_group.storage_pe.id
}

resource "azurerm_subnet_network_security_group_association" "sftp" {
  subnet_id                 = azurerm_subnet.sftp.id
  network_security_group_id = azurerm_network_security_group.sftp.id
}

# --- Private DNS Zone for PostgreSQL ---

resource "azurerm_private_dns_zone" "postgres" {
  name                = "privatelink.postgres.database.azure.com"
  resource_group_name = var.resource_group_name
  tags                = var.tags
}

resource "azurerm_private_dns_zone_virtual_network_link" "postgres" {
  name                  = "link-postgres"
  resource_group_name   = var.resource_group_name
  private_dns_zone_name = azurerm_private_dns_zone.postgres.name
  virtual_network_id    = azurerm_virtual_network.main.id
}
  • Step 3: Write networking outputs.tf

terraform/modules/networking/outputs.tf:

output "vnet_id" {
  value = azurerm_virtual_network.main.id
}

output "vnet_name" {
  value = azurerm_virtual_network.main.name
}

output "aks_subnet_id" {
  value = azurerm_subnet.aks.id
}

output "db_subnet_id" {
  value = azurerm_subnet.db.id
}

output "storage_pe_subnet_id" {
  value = azurerm_subnet.storage_pe.id
}

output "sftp_subnet_id" {
  value = azurerm_subnet.sftp.id
}

output "postgres_private_dns_zone_id" {
  value = azurerm_private_dns_zone.postgres.id
}

output "nat_gateway_id" {
  value = var.nat_gateway_enabled ? azurerm_nat_gateway.main[0].id : null
}
  • Step 4: Validate the module
cd terraform
terraform init -backend=false
terraform validate
terraform fmt -check -recursive

Expected: all pass.

  • Step 5: Commit
git add terraform/modules/networking/
git commit -m "feat: add networking module — VNet, subnets, NSGs, NAT gateway, private DNS"

Task 3: Database Module

Files:

  • Create: terraform/modules/database/main.tf

  • Create: terraform/modules/database/variables.tf

  • Create: terraform/modules/database/outputs.tf

  • Step 1: Write database variables.tf

terraform/modules/database/variables.tf:

variable "resource_prefix" {
  type = string
}

variable "location" {
  type = string
}

variable "resource_group_name" {
  type = string
}

variable "db_subnet_id" {
  description = "Subnet ID for PostgreSQL VNet integration"
  type        = string
}

variable "private_dns_zone_id" {
  description = "Private DNS zone ID for PostgreSQL"
  type        = string
}

variable "sku_name" {
  description = "PostgreSQL SKU name"
  type        = string
  default     = "B_Standard_B1ms"
}

variable "storage_mb" {
  description = "Storage in MB"
  type        = number
  default     = 32768
}

variable "backup_retention_days" {
  description = "Backup retention in days"
  type        = number
  default     = 7
}

variable "geo_redundant_backup_enabled" {
  description = "Enable geo-redundant backups"
  type        = bool
  default     = false
}

variable "administrator_login" {
  description = "PostgreSQL administrator login name"
  type        = string
  default     = "pgadmin"
}

# Bootstrap only. Password is stored in Terraform state. Post-provision: rotate
# immediately, store in Key Vault, and migrate to Azure AD authentication for
# all service connections. Set via TF_VAR_postgres_admin_password environment
# variable — never in tfvars.
variable "administrator_password" {
  description = "PostgreSQL administrator password"
  type        = string
  sensitive   = true
}

variable "tags" {
  type    = map(string)
  default = {}
}
  • Step 2: Write database main.tf

terraform/modules/database/main.tf:

resource "azurerm_postgresql_flexible_server" "main" {
  name                          = "psql-${var.resource_prefix}"
  resource_group_name           = var.resource_group_name
  location                      = var.location
  version                       = "16"
  administrator_login           = var.administrator_login
  administrator_password        = var.administrator_password
  sku_name                      = var.sku_name
  storage_mb                    = var.storage_mb
  backup_retention_days         = var.backup_retention_days
  geo_redundant_backup_enabled  = var.geo_redundant_backup_enabled
  delegated_subnet_id           = var.db_subnet_id
  private_dns_zone_id           = var.private_dns_zone_id
  public_network_access_enabled = false
  zone                          = "1"
  tags                          = var.tags

  authentication {
    password_auth_enabled         = true
    active_directory_auth_enabled = true
  }
}

resource "azurerm_postgresql_flexible_server_configuration" "require_ssl" {
  name      = "require_secure_transport"
  server_id = azurerm_postgresql_flexible_server.main.id
  value     = "ON"
}

resource "azurerm_postgresql_flexible_server_database" "app" {
  name      = "app_db"
  server_id = azurerm_postgresql_flexible_server.main.id
  charset   = "UTF8"
  collation = "en_US.utf8"
}

resource "azurerm_postgresql_flexible_server_database" "storage" {
  name      = "storage_db"
  server_id = azurerm_postgresql_flexible_server.main.id
  charset   = "UTF8"
  collation = "en_US.utf8"
}

resource "azurerm_postgresql_flexible_server_database" "importer" {
  name      = "importer_db"
  server_id = azurerm_postgresql_flexible_server.main.id
  charset   = "UTF8"
  collation = "en_US.utf8"
}
  • Step 3: Write database outputs.tf

terraform/modules/database/outputs.tf:

output "server_id" {
  value = azurerm_postgresql_flexible_server.main.id
}

output "server_fqdn" {
  value = azurerm_postgresql_flexible_server.main.fqdn
}

output "server_name" {
  value = azurerm_postgresql_flexible_server.main.name
}

output "database_names" {
  value = {
    app      = azurerm_postgresql_flexible_server_database.app.name
    storage  = azurerm_postgresql_flexible_server_database.storage.name
    importer = azurerm_postgresql_flexible_server_database.importer.name
  }
}
  • Step 4: Validate
cd terraform
terraform validate
terraform fmt -check -recursive
  • Step 5: Commit
git add terraform/modules/database/
git commit -m "feat: add database module — PostgreSQL Flexible Server with 3 databases"

Task 4: Storage Module

Files:

  • Create: terraform/modules/storage/main.tf

  • Create: terraform/modules/storage/variables.tf

  • Create: terraform/modules/storage/outputs.tf

  • Step 1: Write storage variables.tf

terraform/modules/storage/variables.tf:

variable "resource_prefix" {
  type = string
}

variable "location" {
  type = string
}

variable "resource_group_name" {
  type = string
}

variable "replication_type" {
  description = "LRS or ZRS"
  type        = string
  default     = "LRS"
}

variable "storage_pe_subnet_id" {
  description = "Subnet ID for Blob Storage private endpoint"
  type        = string
}

variable "vnet_id" {
  description = "VNet ID for private DNS zone link"
  type        = string
}

variable "sftp_allowed_ips" {
  description = "IP addresses allowed to connect via SFTP"
  type        = list(string)
  default     = []
}

variable "tags" {
  type    = map(string)
  default = {}
}
  • Step 2: Write storage main.tf

terraform/modules/storage/main.tf:

# --- Internal Blob Storage Account (private, no public access) ---

resource "azurerm_storage_account" "internal" {
  name                          = substr(replace("st${var.resource_prefix}int", "-", ""), 0, 24)
  resource_group_name           = var.resource_group_name
  location                      = var.location
  account_tier                  = "Standard"
  account_replication_type      = var.replication_type
  account_kind                  = "StorageV2"
  min_tls_version               = "TLS1_2"
  public_network_access_enabled = false
  tags                          = var.tags

  blob_properties {
    versioning_enabled = true
  }
}

# --- SFTP Storage Account (public for SFTP, IP-restricted) ---

resource "azurerm_storage_account" "sftp" {
  name                          = substr(replace("st${var.resource_prefix}sftp", "-", ""), 0, 24)
  resource_group_name           = var.resource_group_name
  location                      = var.location
  account_tier                  = "Standard"
  account_replication_type      = var.replication_type
  account_kind                  = "StorageV2"
  min_tls_version               = "TLS1_2"
  public_network_access_enabled = true
  is_hns_enabled                = true
  sftp_enabled                  = true
  tags                          = var.tags

  network_rules {
    default_action = "Deny"
    ip_rules       = var.sftp_allowed_ips
  }

  blob_properties {
    versioning_enabled = true
  }
}

# --- Blob Containers ---

resource "azurerm_storage_container" "shared_storage" {
  name                  = "shared-storage"
  storage_account_id    = azurerm_storage_account.internal.id
  container_access_type = "private"
}

resource "azurerm_storage_container" "job_artifacts" {
  name                  = "job-artifacts"
  storage_account_id    = azurerm_storage_account.internal.id
  container_access_type = "private"
}

resource "azurerm_storage_container" "loki_chunks" {
  name                  = "loki-chunks"
  storage_account_id    = azurerm_storage_account.internal.id
  container_access_type = "private"
}

resource "azurerm_storage_container" "velero_backups" {
  name                  = "velero-backups"
  storage_account_id    = azurerm_storage_account.internal.id
  container_access_type = "private"
}

resource "azurerm_storage_container" "sftp_ingest" {
  name                  = "sftp-ingest"
  storage_account_id    = azurerm_storage_account.sftp.id
  container_access_type = "private"
}

# --- Private Endpoint for Internal Blob ---

resource "azurerm_private_endpoint" "blob" {
  name                = "pe-blob-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  subnet_id           = var.storage_pe_subnet_id
  tags                = var.tags

  private_service_connection {
    name                           = "psc-blob-${var.resource_prefix}"
    private_connection_resource_id = azurerm_storage_account.internal.id
    subresource_names              = ["blob"]
    is_manual_connection           = false
  }

  private_dns_zone_group {
    name                 = "blobdns"
    private_dns_zone_ids = [azurerm_private_dns_zone.blob.id]
  }
}

resource "azurerm_private_dns_zone" "blob" {
  name                = "privatelink.blob.core.windows.net"
  resource_group_name = var.resource_group_name
  tags                = var.tags
}

resource "azurerm_private_dns_zone_virtual_network_link" "blob" {
  name                  = "link-blob"
  resource_group_name   = var.resource_group_name
  private_dns_zone_name = azurerm_private_dns_zone.blob.name
  virtual_network_id    = var.vnet_id
}
  • Step 3: Write storage outputs.tf

terraform/modules/storage/outputs.tf:

output "internal_storage_account_id" {
  value = azurerm_storage_account.internal.id
}

output "internal_storage_account_name" {
  value = azurerm_storage_account.internal.name
}

output "primary_blob_endpoint" {
  value = azurerm_storage_account.internal.primary_blob_endpoint
}

output "sftp_storage_account_id" {
  value = azurerm_storage_account.sftp.id
}

output "sftp_storage_account_name" {
  value = azurerm_storage_account.sftp.name
}

output "sftp_endpoint" {
  value = azurerm_storage_account.sftp.primary_blob_endpoint
}

output "container_names" {
  value = {
    sftp_ingest    = azurerm_storage_container.sftp_ingest.name
    shared_storage = azurerm_storage_container.shared_storage.name
    job_artifacts  = azurerm_storage_container.job_artifacts.name
    loki_chunks    = azurerm_storage_container.loki_chunks.name
    velero_backups = azurerm_storage_container.velero_backups.name
  }
}
  • Step 4: Validate
cd terraform
terraform validate
terraform fmt -check -recursive
  • Step 5: Commit
git add terraform/modules/storage/
git commit -m "feat: add storage module — dual accounts (internal + SFTP), containers, private endpoint"

Task 5: Key Vault Module

Files:

  • Create: terraform/modules/keyvault/main.tf

  • Create: terraform/modules/keyvault/variables.tf

  • Create: terraform/modules/keyvault/outputs.tf

  • Step 1: Write keyvault variables.tf

terraform/modules/keyvault/variables.tf:

variable "resource_prefix" {
  type = string
}

variable "location" {
  type = string
}

variable "resource_group_name" {
  type = string
}

variable "tenant_id" {
  description = "Azure AD tenant ID"
  type        = string
}

variable "keyvault_pe_subnet_id" {
  description = "Subnet ID for Key Vault private endpoint"
  type        = string
}

variable "vnet_id" {
  description = "VNet ID for private DNS zone link"
  type        = string
}

variable "tags" {
  type    = map(string)
  default = {}
}
  • Step 2: Write keyvault main.tf

terraform/modules/keyvault/main.tf:

# RBAC authorization is the default in AzureRM 4.x (enable_rbac_authorization removed).
# AKS role assignments (Key Vault Secrets User) work with the default RBAC mode.
resource "azurerm_key_vault" "main" {
  name                          = "kv-${var.resource_prefix}"
  location                      = var.location
  resource_group_name           = var.resource_group_name
  tenant_id                     = var.tenant_id
  sku_name                      = "standard"
  purge_protection_enabled      = true
  soft_delete_retention_days    = 90
  public_network_access_enabled = false
  tags                          = var.tags
}

# --- Private Endpoint for Key Vault ---

resource "azurerm_private_endpoint" "keyvault" {
  name                = "pe-kv-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  subnet_id           = var.keyvault_pe_subnet_id
  tags                = var.tags

  private_service_connection {
    name                           = "psc-kv-${var.resource_prefix}"
    private_connection_resource_id = azurerm_key_vault.main.id
    subresource_names              = ["vault"]
    is_manual_connection           = false
  }

  private_dns_zone_group {
    name                 = "kvdns"
    private_dns_zone_ids = [azurerm_private_dns_zone.keyvault.id]
  }
}

resource "azurerm_private_dns_zone" "keyvault" {
  name                = "privatelink.vaultcore.azure.net"
  resource_group_name = var.resource_group_name
  tags                = var.tags
}

resource "azurerm_private_dns_zone_virtual_network_link" "keyvault" {
  name                  = "link-keyvault"
  resource_group_name   = var.resource_group_name
  private_dns_zone_name = azurerm_private_dns_zone.keyvault.name
  virtual_network_id    = var.vnet_id
}
  • Step 3: Write keyvault outputs.tf

terraform/modules/keyvault/outputs.tf:

output "key_vault_id" {
  value = azurerm_key_vault.main.id
}

output "key_vault_name" {
  value = azurerm_key_vault.main.name
}

output "key_vault_uri" {
  value = azurerm_key_vault.main.vault_uri
}

output "key_vault_private_endpoint_ip" {
  value = azurerm_private_endpoint.keyvault.private_service_connection[0].private_ip_address
}
  • Step 4: Validate
cd terraform
terraform validate
terraform fmt -check -recursive
  • Step 5: Commit
git add terraform/modules/keyvault/
git commit -m "feat: add keyvault module with private endpoint"

Task 6: AKS Module

Files:

  • Create: terraform/modules/aks/main.tf

  • Create: terraform/modules/aks/variables.tf

  • Create: terraform/modules/aks/outputs.tf

  • Step 1: Write AKS variables.tf

terraform/modules/aks/variables.tf:

variable "resource_prefix" {
  type = string
}

variable "location" {
  type = string
}

variable "resource_group_name" {
  type = string
}

variable "aks_subnet_id" {
  description = "Subnet ID for AKS nodes"
  type        = string
}

variable "system_pool_vm_size" {
  description = "VM size for system pool (override in prod.tfvars for larger nodes)"
  type        = string
  default     = "Standard_D2s_v5"
}

variable "system_pool_count" {
  description = "Node count for system pool"
  type        = number
  default     = 1
}

variable "app_pool_count" {
  description = "Node count for app pool"
  type        = number
  default     = 1
}

variable "app_pool_max_count" {
  description = "Max node count for app pool autoscaler (0 = no autoscaling)"
  type        = number
  default     = 0
}

variable "node_auto_provisioning_enabled" {
  description = "Enable Karpenter-based Node Auto-Provisioning for job workloads"
  type        = bool
  default     = true
}

variable "key_vault_id" {
  description = "Key Vault ID for App Routing TLS integration"
  type        = string
}

variable "tags" {
  type    = map(string)
  default = {}
}
  • Step 2: Write AKS main.tf

terraform/modules/aks/main.tf:

resource "azurerm_kubernetes_cluster" "main" {
  name                = "aks-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  dns_prefix          = "aks-${var.resource_prefix}"
  kubernetes_version  = "1.31"
  sku_tier            = "Standard"  # Required for Node Auto-Provisioning (Karpenter)
  tags                = var.tags

  # Uncomment for production: lifecycle { prevent_destroy = true }

  # System node pool
  default_node_pool {
    name                        = "system"
    vm_size                     = var.system_pool_vm_size
    node_count                  = var.system_pool_count
    vnet_subnet_id              = var.aks_subnet_id
    os_disk_size_gb             = 50
    temporary_name_for_rotation = "systemtmp"

    node_labels = {
      "nodepool" = "system"
    }
  }

  identity {
    type = "SystemAssigned"
  }

  oidc_issuer_enabled       = true
  workload_identity_enabled = true

  network_profile {
    network_plugin    = "azure"
    network_policy    = "calico"
    service_cidr      = "10.1.0.0/16"
    dns_service_ip    = "10.1.0.10"
  }

  # App Routing Add-on (Managed NGINX)
  web_app_routing {
    dns_zone_ids = []
  }

  key_vault_secrets_provider {
    secret_rotation_enabled  = true
    secret_rotation_interval = "2m"
  }

  # Node Auto-Provisioning (Karpenter) for job workloads
  # Dynamically provisions right-sized VMs based on pod resource requests
  node_provisioning_profile {
    enabled = var.node_auto_provisioning_enabled
  }
}

# --- App Node Pool ---

resource "azurerm_kubernetes_cluster_node_pool" "app" {
  name                  = "app"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_D2s_v5"
  node_count            = var.app_pool_count
  min_count             = var.app_pool_max_count > 0 ? var.app_pool_count : null
  max_count             = var.app_pool_max_count > 0 ? var.app_pool_max_count : null
  auto_scaling_enabled  = var.app_pool_max_count > 0
  vnet_subnet_id        = var.aks_subnet_id
  os_disk_size_gb       = 50
  tags                  = var.tags

  node_labels = {
    "nodepool" = "app"
  }

  node_taints = []
}

# --- Monitoring Node Pool ---
# Isolates Grafana/Prometheus/Loki from app workloads

resource "azurerm_kubernetes_cluster_node_pool" "monitoring" {
  name                  = "monitor"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_D2s_v5"
  node_count            = 1
  vnet_subnet_id        = var.aks_subnet_id
  os_disk_size_gb       = 100
  tags                  = var.tags

  node_labels = {
    "nodepool" = "monitoring"
  }

  node_taints = ["dedicated=monitoring:NoSchedule"]
}

# Job nodes are NOT a static pool — they are provisioned dynamically by
# Karpenter (Node Auto-Provisioning) based on each job pod's resource requests.
# See k8s/karpenter/ for the NodePool + AKSNodeClass CRDs that define constraints.

# --- Container Insights (basic) ---
# Live container logs in Azure Portal alongside self-hosted Grafana stack

resource "azurerm_log_analytics_workspace" "aks" {
  name                = "law-${var.resource_prefix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  sku                 = "PerGB2018"
  retention_in_days   = 30
  tags                = var.tags
}

resource "azurerm_monitor_diagnostic_setting" "aks" {
  name                       = "diag-aks-${var.resource_prefix}"
  target_resource_id         = azurerm_kubernetes_cluster.main.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.aks.id

  enabled_log {
    category = "kube-apiserver"
  }

  enabled_log {
    category = "kube-audit-admin"
  }

  enabled_log {
    category = "kube-controller-manager"
  }

  metric {
    category = "AllMetrics"
    enabled  = false
  }
}

# --- Role assignment: AKS → Key Vault Secrets User ---

resource "azurerm_role_assignment" "aks_keyvault" {
  scope                = var.key_vault_id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azurerm_kubernetes_cluster.main.key_vault_secrets_provider[0].secret_identity[0].object_id
}
  • Step 3: Write AKS outputs.tf

terraform/modules/aks/outputs.tf:

output "cluster_id" {
  value = azurerm_kubernetes_cluster.main.id
}

output "cluster_name" {
  value = azurerm_kubernetes_cluster.main.name
}

output "kube_config_raw" {
  value     = azurerm_kubernetes_cluster.main.kube_config_raw
  sensitive = true
}

output "oidc_issuer_url" {
  value = azurerm_kubernetes_cluster.main.oidc_issuer_url
}

output "kubelet_identity_object_id" {
  value = azurerm_kubernetes_cluster.main.kubelet_identity[0].object_id
}

output "cluster_identity_principal_id" {
  value = azurerm_kubernetes_cluster.main.identity[0].principal_id
}
  • Step 4: Validate
cd terraform
terraform validate
terraform fmt -check -recursive
  • Step 5: Commit
git add terraform/modules/aks/
git commit -m "feat: add AKS module — cluster, 3 node pools, workload identity, app routing"

Task 7: Root Module Integration and Environment Configs

Files:

  • Modify: terraform/main.tf

  • Modify: terraform/variables.tf

  • Modify: terraform/outputs.tf

  • Create: terraform/environments/dev.tfvars

  • Create: terraform/environments/preprod.tfvars

  • Create: terraform/environments/prod.tfvars

  • Step 1: Write the root main.tf wiring all modules

Replace the contents of terraform/main.tf:

locals {
  resource_prefix = "${var.project_name}-${var.environment}"
  common_tags = {
    project     = var.project_name
    environment = var.environment
    managed_by  = "terraform"
  }
}

data "azurerm_client_config" "current" {}

resource "azurerm_resource_group" "main" {
  name     = "rg-${local.resource_prefix}"
  location = var.location
  tags     = local.common_tags
}

# R2: Lock the resource group in production to prevent accidental deletion
resource "azurerm_management_lock" "rg_nodelete" {
  count      = var.environment == "prod" ? 1 : 0
  name       = "rg-nodelete"
  scope      = azurerm_resource_group.main.id
  lock_level = "CanNotDelete"
  notes      = "Prevent accidental deletion of production resource group"
}

module "networking" {
  source = "./modules/networking"

  resource_prefix     = local.resource_prefix
  location            = var.location
  resource_group_name = azurerm_resource_group.main.name
  sftp_allowed_ips    = var.sftp_allowed_ips
  nat_gateway_enabled = var.nat_gateway_enabled
  tags                = local.common_tags
}

module "database" {
  source = "./modules/database"

  resource_prefix              = local.resource_prefix
  location                     = var.location
  resource_group_name          = azurerm_resource_group.main.name
  db_subnet_id                 = module.networking.db_subnet_id
  private_dns_zone_id          = module.networking.postgres_private_dns_zone_id
  sku_name                     = var.postgres_sku_name
  backup_retention_days        = var.postgres_backup_retention_days
  geo_redundant_backup_enabled = var.postgres_geo_redundant_backup
  administrator_password       = var.postgres_admin_password
  tags                         = local.common_tags

  # Note: lifecycle blocks are not allowed on module calls; for production, set
  # prevent_destroy on the PostgreSQL server resource inside modules/database.
}

module "storage" {
  source = "./modules/storage"

  resource_prefix      = local.resource_prefix
  location             = var.location
  resource_group_name  = azurerm_resource_group.main.name
  replication_type     = var.storage_replication_type
  storage_pe_subnet_id = module.networking.storage_pe_subnet_id
  vnet_id              = module.networking.vnet_id
  sftp_allowed_ips     = var.sftp_allowed_ips
  tags                 = local.common_tags
}

module "keyvault" {
  source = "./modules/keyvault"

  resource_prefix       = local.resource_prefix
  location              = var.location
  resource_group_name   = azurerm_resource_group.main.name
  tenant_id             = data.azurerm_client_config.current.tenant_id
  keyvault_pe_subnet_id = module.networking.storage_pe_subnet_id  # Shared PE subnet (storage + KV)
  vnet_id               = module.networking.vnet_id
  tags                  = local.common_tags
}

module "aks" {
  source = "./modules/aks"

  resource_prefix     = local.resource_prefix
  location            = var.location
  resource_group_name = azurerm_resource_group.main.name
  aks_subnet_id       = module.networking.aks_subnet_id
  system_pool_vm_size = var.system_pool_vm_size
  system_pool_count   = var.aks_system_pool_count
  app_pool_count      = var.aks_app_pool_count
  app_pool_max_count  = var.aks_app_pool_max_count
  node_auto_provisioning_enabled = var.node_auto_provisioning_enabled
  key_vault_id        = module.keyvault.key_vault_id
  tags                = local.common_tags
}

# --- Azure Budget Alert ---
# Single subscription-level budget covering ALL environments combined.
# Only created once (in the dev environment apply) to avoid duplicates.
# Thresholds: €1.5k, €2.5k, €3.5k, €4.5k (warning), €5k (critical)
resource "azurerm_consumption_budget_subscription" "total" {
  count           = var.environment == "dev" ? 1 : 0
  name            = "budget-${var.project_name}-total"
  subscription_id = "/subscriptions/${var.subscription_id}"
  amount          = 5000
  time_grain      = "Monthly"

  time_period {
    start_date = "2026-05-01T00:00:00Z"  # Pinned — avoids plan drift from timestamp()
  }

  # €1,500 — first alert
  notification {
    enabled        = true
    threshold      = 30
    operator       = "GreaterThan"
    contact_emails = var.budget_alert_emails
  }

  # €2,500
  notification {
    enabled        = true
    threshold      = 50
    operator       = "GreaterThan"
    contact_emails = var.budget_alert_emails
  }

  # €3,500
  notification {
    enabled        = true
    threshold      = 70
    operator       = "GreaterThan"
    contact_emails = var.budget_alert_emails
  }

  # €4,500
  notification {
    enabled        = true
    threshold      = 90
    operator       = "GreaterThan"
    contact_emails = var.budget_alert_emails
  }

  # €5,000 — critical, at budget ceiling
  notification {
    enabled        = true
    threshold      = 100
    operator       = "GreaterThan"
    contact_emails = var.budget_alert_emails
  }

  lifecycle {
    ignore_changes = [time_period]
  }
}
  • Step 2: Add new variables to root variables.tf

Append to terraform/variables.tf:

variable "postgres_admin_password" {
  description = "PostgreSQL administrator password"
  type        = string
  sensitive   = true
}

variable "nat_gateway_enabled" {
  description = "Enable NAT Gateway for outbound internet from private subnets"
  type        = bool
  default     = true
}

variable "system_pool_vm_size" {
  description = "VM size for AKS system pool"
  type        = string
  default     = "Standard_D2s_v5"
}

variable "budget_alert_emails" {
  description = "Email addresses for budget alerts"
  type        = list(string)
  default     = []
}
  • Step 3: Write root outputs.tf

Replace terraform/outputs.tf:

output "resource_group_name" {
  value = azurerm_resource_group.main.name
}

output "aks_cluster_name" {
  value = module.aks.cluster_name
}

output "aks_oidc_issuer_url" {
  value = module.aks.oidc_issuer_url
}

output "postgres_server_fqdn" {
  value = module.database.server_fqdn
}

output "storage_account_name" {
  value = module.storage.internal_storage_account_name
}

output "key_vault_name" {
  value = module.keyvault.key_vault_name
}
  • Step 4: Write dev.tfvars

terraform/environments/dev.tfvars:

environment            = "dev"
subscription_id        = "REPLACE_WITH_SUBSCRIPTION_ID"

# AKS
aks_system_pool_count  = 1
aks_app_pool_count     = 1
aks_app_pool_max_count = 0
node_auto_provisioning_enabled = true  # Karpenter provisions job nodes dynamically

# PostgreSQL
postgres_sku_name              = "B_Standard_B1ms"
postgres_backup_retention_days = 7
postgres_geo_redundant_backup  = false

# Storage
storage_replication_type = "LRS"

# Networking
nat_gateway_enabled = true

# SFTP
sftp_allowed_ips = []

# Budget (subscription-level, created only in dev apply)
budget_alert_emails = ["REPLACE_WITH_EMAIL"]
  • Step 5: Write preprod.tfvars

terraform/environments/preprod.tfvars:

environment            = "preprod"
subscription_id        = "REPLACE_WITH_SUBSCRIPTION_ID"

# AKS — start single-node, scale via these values when needed
aks_system_pool_count  = 1
aks_app_pool_count     = 1
aks_app_pool_max_count = 0
node_auto_provisioning_enabled = true  # Karpenter provisions job nodes dynamically

# PostgreSQL — start burstable, upgrade to GP_Standard_D2s_v3 when needed
postgres_sku_name              = "B_Standard_B1ms"
postgres_backup_retention_days = 7
postgres_geo_redundant_backup  = false

# Storage
storage_replication_type = "LRS"

# Networking
nat_gateway_enabled = true

# SFTP
sftp_allowed_ips = []

# Budget (subscription-level, created only in dev apply)
budget_alert_emails = ["REPLACE_WITH_EMAIL"]
  • Step 6: Write prod.tfvars

terraform/environments/prod.tfvars:

environment            = "prod"
subscription_id        = "REPLACE_WITH_SUBSCRIPTION_ID"

# AKS — 2 nodes each for zero-downtime during Azure host maintenance
aks_system_pool_count  = 2
aks_app_pool_count     = 2
aks_app_pool_max_count = 0
node_auto_provisioning_enabled = true  # Karpenter provisions job nodes dynamically

# PostgreSQL — start burstable, upgrade to GP_Standard_D2s_v3 when needed
postgres_sku_name              = "B_Standard_B1ms"
postgres_backup_retention_days = 35
# Geo-redundant backup requires General Purpose tier — enable when upgrading SKU:
#   postgres_sku_name = "GP_Standard_D2s_v3"
#   postgres_geo_redundant_backup = true
postgres_geo_redundant_backup  = false

# Storage — ZRS for zone redundancy in prod
storage_replication_type = "ZRS"

# Networking
nat_gateway_enabled = true

# SFTP
sftp_allowed_ips = []

# Budget (subscription-level, created only in dev apply)
budget_alert_emails = ["REPLACE_WITH_EMAIL"]
  • Step 7: Validate the full configuration
cd terraform
terraform init -backend=false
terraform validate
terraform fmt -check -recursive
  • Step 8: Commit
git add terraform/main.tf terraform/outputs.tf terraform/variables.tf terraform/environments/
git commit -m "feat: wire root module with all child modules and environment configs"

Task 8: Kubernetes Base Configuration

Files:

  • Create: k8s/namespaces.yaml

  • Create: k8s/network-policies/default-deny.yaml

  • Create: k8s/network-policies/app-services.yaml

  • Create: k8s/network-policies/db-access.yaml

  • Create: k8s/network-policies/blob-access.yaml

  • Create: k8s/network-policies/jobs.yaml

  • Create: k8s/network-policies/monitoring.yaml

  • Create: k8s/pdbs.yaml

  • Create: k8s/resource-quotas.yaml

  • Create: k8s/rbac/job-creator.yaml

  • Create: k8s/karpenter/job-nodepool.yaml

  • Create: k8s/karpenter/job-nodeclass.yaml

  • Create: k8s/karpenter/job-storageclass.yaml

  • Step 1: Write namespaces.yaml

k8s/namespaces.yaml:

apiVersion: v1
kind: Namespace
metadata:
  name: app
  labels:
    purpose: application
---
apiVersion: v1
kind: Namespace
metadata:
  name: jobs
  labels:
    purpose: job-execution
---
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    purpose: observability
---
apiVersion: v1
kind: Namespace
metadata:
  name: velero
  labels:
    purpose: backup
  • Step 2: Write default-deny.yaml

k8s/network-policies/default-deny.yaml:

# Applied to the app, jobs, monitoring, and velero namespaces
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: app
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: jobs
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: monitoring
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: velero
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  • Step 3: Write app-services.yaml

k8s/network-policies/app-services.yaml:

# Allow inter-service communication: Backend <-> Storage <-> Importer
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-inter-service
  namespace: app
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/part-of: app
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow from other app services
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/part-of: app
    # Allow from ingress controller
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: app-routing-system
  egress:
    # Allow to other app services
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/part-of: app
    # Allow DNS
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
  • Step 4: Write db-access.yaml

k8s/network-policies/db-access.yaml:

# Allow app services to reach PostgreSQL (10.0.16.0/24:5432)
# Note: jobs namespace intentionally excluded — job pods do not need direct database access
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-postgres
  namespace: app
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/part-of: app
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.16.0/24
      ports:
        - protocol: TCP
          port: 5432
  • Step 5: Write blob-access.yaml

k8s/network-policies/blob-access.yaml:

# Allow Importer and Storage services to reach Blob Storage (10.0.17.0/24:443)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-blob
  namespace: app
spec:
  podSelector:
    matchExpressions:
      - key: app.kubernetes.io/name
        operator: In
        values:
          - app-storage
          - app-importer
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.17.0/24
      ports:
        - protocol: TCP
          port: 443
---
# Allow job pods to reach Blob Storage for artifacts
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-blob
  namespace: jobs
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.17.0/24
      ports:
        - protocol: TCP
          port: 443
  • Step 6: Write jobs.yaml

k8s/network-policies/jobs.yaml:

# Allow job pods to communicate within the jobs namespace
# and reach app services (for callbacks), but deny internet
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-jobs-internal
  namespace: jobs
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector: {}
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: app
  egress:
    # Allow DNS
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow to app namespace
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: app
    # Allow within jobs namespace
    - to:
        - podSelector: {}
  • Step 7: Write monitoring.yaml

k8s/network-policies/monitoring.yaml:

# Prometheus: scrape egress to all namespaces on common metrics ports
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  policyTypes:
    - Egress
  egress:
    # Scrape targets across all namespaces
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 9090
        - protocol: TCP
          port: 8080
        - protocol: TCP
          port: 9100
        - protocol: TCP
          port: 10250
    # Allow DNS
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# Grafana: allow ingress on 3000 + egress to Loki/Prometheus for datasource queries
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-grafana
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: grafana
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: app-routing-system
      ports:
        - protocol: TCP
          port: 3000
  egress:
    # Query Prometheus and Loki datasources within monitoring namespace
    - to:
        - podSelector: {}
      ports:
        - protocol: TCP
          port: 9090
        - protocol: TCP
          port: 3100
    # DNS
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# Loki: allow ingress on 3100 + egress to Azure Blob Storage for chunk persistence
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-loki
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: loki
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: app
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: jobs
        # Allow Alloy to push logs
        - podSelector: {}
      ports:
        - protocol: TCP
          port: 3100
  egress:
    # Azure Blob Storage private endpoint for chunk/index persistence
    - to:
        - ipBlock:
            cidr: 10.0.17.0/24
      ports:
        - protocol: TCP
          port: 443
    # DNS
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# Velero: egress to Blob Storage for backups + DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-velero-egress
  namespace: velero
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.17.0/24
      ports:
        - protocol: TCP
          port: 443
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# Alloy: egress to Loki + K8s API for service discovery, ingress for Prometheus scrape
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-alloy
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: alloy
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Prometheus scrapes Alloy metrics on port 12345
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus
      ports:
        - protocol: TCP
          port: 12345
  egress:
    # Push logs to Loki
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: loki
      ports:
        - protocol: TCP
          port: 3100
    # K8s API server for pod discovery (discovery.kubernetes)
    # AKS serves the API on 443; 6443 kept for compatibility with other clusters
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 443
        - protocol: TCP
          port: 6443
---
# Monitoring namespace: DNS egress for all pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: monitoring
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
  • Step 8: Write pdbs.yaml

k8s/pdbs.yaml:

# PodDisruptionBudgets for observability stack
# Using maxUnavailable (not minAvailable) to avoid blocking node drains
# on single-replica deployments. With 1 replica, minAvailable:1 would
# prevent any voluntary disruption (drains hang indefinitely).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prometheus-pdb
  namespace: monitoring
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: prometheus
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: grafana-pdb
  namespace: monitoring
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: grafana
  • Step 9: Write resource-quotas.yaml

k8s/resource-quotas.yaml:

# ResourceQuotas per namespace to prevent runaway resource consumption
apiVersion: v1
kind: ResourceQuota
metadata:
  name: app-quota
  namespace: app
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: jobs-quota
  namespace: jobs
spec:
  hard:
    # Must align with Karpenter NodePool limits (192 vCPU / 2048 GiB)
    requests.cpu: "192"
    requests.memory: 2048Gi
    limits.cpu: "192"
    limits.memory: 2048Gi
  • Step 10: Write Job creator RBAC

The Backend service needs to create K8s Jobs in the jobs namespace for script execution. This Role + RoleBinding grants the backend's service account the required permissions.

k8s/rbac/job-creator.yaml:

# Role granting Job lifecycle management in the jobs namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: job-creator
  namespace: jobs
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "watch", "delete"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
# Bind to the backend service account
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: backend-job-creator
  namespace: jobs
subjects:
  - kind: ServiceAccount
    name: app-backend
    namespace: app
roleRef:
  kind: Role
  name: job-creator
  apiGroup: rbac.authorization.k8s.io
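
Once applied, the binding can be sanity-checked with kubectl impersonation (the service account name app-backend assumes the Backend release is installed under that name, as in Task 10):

kubectl auth can-i create jobs.batch -n jobs --as=system:serviceaccount:app:app-backend
# Expected: yes
kubectl auth can-i delete pods -n jobs --as=system:serviceaccount:app:app-backend
# Expected: no (the Role only grants read access to pods)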
  • Step 11: Write Karpenter NodePool for job workloads

k8s/karpenter/job-nodepool.yaml:

# Karpenter NodePool — defines constraints for dynamically provisioned job nodes.
# Nodes are created on-demand when job pods are pending, and consolidated/terminated
# when idle. VM size is selected automatically based on pod resource requests.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: job-workers
spec:
  template:
    metadata:
      labels:
        nodepool: job
    spec:
      taints:
        - key: workload
          value: job
          effect: NoSchedule
      # Force node recycling after 24h to pick up OS patches
      expireAfter: 24h
      requirements:
        # Allow D-series (balanced) and E-series (memory-optimized) VMs
        - key: karpenter.azure.com/sku-family
          operator: In
          values: ["D", "E"]
        # Only use v5 generation for cost/performance balance
        - key: karpenter.azure.com/sku-version
          operator: In
          values: ["v5"]
        # Limit max VM size to 96 vCPU (E96as_v5 = 96 vCPU, 672 GiB)
        - key: karpenter.azure.com/sku-cpu
          operator: Lt
          values: ["97"]
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: job-workers
  # Consolidation: only terminate nodes when fully empty (not underutilized)
  # to avoid killing nodes mid-job-execution during bursty workloads
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s
  # Limit total resources across all Karpenter-managed job nodes
  # Allows ~2 concurrent max-sized jobs (96 vCPU + 672 GiB each)
  limits:
    cpu: "192"
    memory: "2048Gi"
  • Step 12: Write AKSNodeClass for job nodes

k8s/karpenter/job-nodeclass.yaml:

apiVersion: karpenter.azure.com/v1alpha2
kind: AKSNodeClass
metadata:
  name: job-workers
spec:
  # Must match the AKS subnet so job nodes join the same VNet
  # Replace with your actual subnet resource ID after terraform apply
  vnetSubnetID: /subscriptions/SUBSCRIPTION_ID/resourceGroups/rg-app-ENV/providers/Microsoft.Network/virtualNetworks/vnet-app-ENV/subnets/snet-aks
  osDiskSizeGB: 100
  imageFamily: Ubuntu2204
  • Step 13: Write StorageClass for job worker PVCs

Job pods get their own PVCs for scratch space (script inputs/outputs, temp data). Azure Disk PVCs are zone-pinned, so we use volumeBindingMode: WaitForFirstConsumer — the PVC waits until the pod is scheduled to a node, then provisions the disk in the same zone as that node. This prevents zone mismatch with Karpenter-provisioned nodes.

k8s/karpenter/job-storageclass.yaml:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: job-scratch
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_LRS
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
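
To tie these pieces together, here is a minimal sketch of the kind of Job the Backend might submit: the toleration and nodepool label match the Karpenter NodePool above, the resource requests drive Karpenter's VM selection, and the scratch volume uses the job-scratch StorageClass as a generic ephemeral volume. The image name, sizes, and per-job scratch pattern are illustrative assumptions, not deliverables.

apiVersion: batch/v1
kind: Job
metadata:
  name: example-script-run        # hypothetical; generated per run by the Backend
  namespace: jobs
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        nodepool: job                     # matches the Karpenter NodePool label
      tolerations:
        - key: workload
          operator: Equal
          value: job
          effect: NoSchedule              # matches the NodePool taint
      containers:
        - name: script-runner
          image: ghcr.io/OWNER/scriptrunner:latest   # illustrative image
          resources:
            requests:
              cpu: "4"
              memory: 16Gi                # drives Karpenter's VM sizing
            limits:
              cpu: "4"
              memory: 16Gi
          volumeMounts:
            - name: scratch
              mountPath: /scratch
      volumes:
        - name: scratch
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                storageClassName: job-scratch
                resources:
                  requests:
                    storage: 50Gi

Because job-scratch uses WaitForFirstConsumer, the scratch disk is only provisioned once Karpenter has placed the pod, in that node's zone.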
  • Step 14: Validate YAML syntax
for f in k8s/namespaces.yaml k8s/network-policies/*.yaml k8s/pdbs.yaml k8s/resource-quotas.yaml k8s/rbac/*.yaml k8s/karpenter/*.yaml; do
  echo "--- $f ---"
  yq '.' "$f" > /dev/null && echo "OK" || echo "FAIL"
done

Expected: all OK.

  • Step 15: Commit
git add k8s/
git commit -m "feat: add K8s base config — namespaces, network policies, PDBs, quotas, Karpenter job pool"

Task 9: Observability Stack (Helm Values)

Files:

  • Create: k8s/observability/prometheus-values.yaml

  • Create: k8s/observability/loki-values.yaml

  • Create: k8s/observability/install.sh

  • Step 1: Write prometheus-values.yaml

Helm chart: prometheus-community/kube-prometheus-stack

k8s/observability/prometheus-values.yaml:

# Values for kube-prometheus-stack Helm chart
# Includes Prometheus + Grafana

prometheus:
  prometheusSpec:
    nodeSelector:
      nodepool: monitoring
    tolerations:
      - key: dedicated
        operator: Equal
        value: monitoring
        effect: NoSchedule
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    # R8: Configure startupProbe for slow-starting Prometheus instances
    # The kube-prometheus-stack chart supports startupProbe via prometheusSpec;
    # tune failureThreshold and periodSeconds in per-environment overrides if
    # Prometheus takes longer than the default 15-minute window to replay WAL.

grafana:
  nodeSelector:
    nodepool: monitoring
  tolerations:
    - key: dedicated
      operator: Equal
      value: monitoring
      effect: NoSchedule
  persistence:
    enabled: true
    size: 10Gi
  adminPassword: "" # Set via --set at deploy time or External Secrets
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki.monitoring.svc.cluster.local:3100
      access: proxy
      isDefault: false

alertmanager:
  alertmanagerSpec:
    nodeSelector:
      nodepool: monitoring
    tolerations:
      - key: dedicated
        operator: Equal
        value: monitoring
        effect: NoSchedule

nodeExporter:
  enabled: true

kubeStateMetrics:
  enabled: true
  • Step 2: Write loki-values.yaml

Helm chart: grafana/loki

k8s/observability/loki-values.yaml:

# Values for grafana/loki Helm chart (single binary mode)

deploymentMode: SingleBinary

loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  storage:
    type: azure
    azure:
      container_name: loki-chunks
      # account_name and account_key should be injected via External Secrets from Key Vault
      # Injected at deploy time via --set overrides in install.sh:
      #   --set loki.storage.azure.account_name=<name>
      #   --set loki.storage.azure.account_key=<key>
      # Or use External Secrets Operator to populate a K8s Secret
      account_name: ""
      account_key: ""
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: azure
        schema: v13
        index:
          prefix: index_
          period: 24h

singleBinary:
  replicas: 1
  nodeSelector:
    nodepool: monitoring
  tolerations:
    - key: dedicated
      operator: Equal
      value: monitoring
      effect: NoSchedule
  persistence:
    size: 10Gi

read:
  replicas: 0

write:
  replicas: 0

backend:
  replicas: 0

# Log collection handled by Grafana Alloy — see install.sh

gateway:
  enabled: false
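
If the External Secrets route is preferred over --set overrides, a sketch of the ExternalSecret could look like the following. The ClusterSecretStore name (azure-keyvault) and the Key Vault secret names are assumptions; how the Loki chart consumes the resulting Secret (env expansion vs. templated values) remains a deploy-time decision.

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: loki-azure-storage
  namespace: monitoring
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-keyvault          # assumed ClusterSecretStore backed by the Key Vault
    kind: ClusterSecretStore
  target:
    name: loki-azure-storage      # K8s Secret consumed at Loki deploy time
  data:
    - secretKey: account_name
      remoteRef:
        key: loki-storage-account-name   # assumed Key Vault secret name
    - secretKey: account_key
      remoteRef:
        key: loki-storage-account-key    # assumed Key Vault secret name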
  • Step 3: Write install instructions as a deploy script

k8s/observability/install.sh:

#!/usr/bin/env bash
set -euo pipefail

NAMESPACE="monitoring"

# Add Helm repos
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (Prometheus + Grafana)
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace "$NAMESPACE" \
  --create-namespace \
  --values "$(dirname "$0")/prometheus-values.yaml" \
  --wait

# Install Loki
helm upgrade --install loki grafana/loki \
  --namespace "$NAMESPACE" \
  --values "$(dirname "$0")/loki-values.yaml" \
  --wait

# Install Alloy (replaces Promtail — Grafana's unified telemetry collector)
helm upgrade --install alloy grafana/alloy \
  --namespace "$NAMESPACE" \
  --set "alloy.configMap.content=loki.write \"default\" { endpoint { url = \"http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push\" } }\nloki.source.kubernetes \"pods\" { targets = discovery.kubernetes.pods.targets\n forward_to = [loki.write.default.receiver] }\ndiscovery.kubernetes \"pods\" { role = \"pod\" }" \
  --wait

# Install Velero for K8s resource backup to Azure Blob Storage
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm upgrade --install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --set "initContainers[0].name=velero-plugin-for-microsoft-azure" \
  --set "initContainers[0].image=velero/velero-plugin-for-microsoft-azure:v1.10.0" \
  --set "initContainers[0].volumeMounts[0].mountPath=/target" \
  --set "initContainers[0].volumeMounts[0].name=plugins" \
  --set "configuration.backupStorageLocation[0].provider=azure" \
  --set "configuration.backupStorageLocation[0].bucket=velero-backups" \
  --set "configuration.backupStorageLocation[0].config.storageAccount=${VELERO_STORAGE_ACCOUNT}" \
  --set "configuration.backupStorageLocation[0].config.resourceGroup=${RESOURCE_GROUP}" \
  --set "schedules.daily.schedule=0 2 * * *" \
  --set "schedules.daily.template.ttl=168h" \
  --wait

echo "Observability + backup stack installed"
echo "Grafana: kubectl port-forward -n $NAMESPACE svc/prometheus-grafana 3000:80"
echo "Velero: velero get backup-locations"
  • Step 4: Validate YAML syntax
for f in k8s/observability/*.yaml; do
  echo "--- $f ---"
  yq '.' "$f" > /dev/null && echo "OK" || echo "FAIL"
done
shellcheck k8s/observability/install.sh
  • Step 5: Commit
git add k8s/observability/
git commit -m "feat: add observability stack — Prometheus, Grafana, Loki Helm values"

Task 10: Application Helm Chart

Files:

  • Create: helm/app-service/Chart.yaml

  • Create: helm/app-service/values.yaml

  • Create: helm/app-service/values-backend.yaml

  • Create: helm/app-service/values-storage.yaml

  • Create: helm/app-service/values-importer.yaml

  • Create: helm/app-service/templates/_helpers.tpl

  • Create: helm/app-service/templates/deployment.yaml

  • Create: helm/app-service/templates/service.yaml

  • Create: helm/app-service/templates/ingress.yaml

  • Create: helm/app-service/templates/migration-job.yaml

  • Create: helm/app-service/templates/hpa.yaml

  • Create: helm/app-service/templates/serviceaccount.yaml

  • Step 1: Write Chart.yaml

helm/app-service/Chart.yaml:

apiVersion: v2
name: app-service
description: Shared Helm chart for APP microservices
type: application
version: 0.1.0
appVersion: "1.0.0"
  • Step 2: Write default values.yaml

helm/app-service/values.yaml:

replicaCount: 1

image:
  repository: ghcr.io/OWNER/app-backend
  tag: latest
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 8080

ingress:
  enabled: false
  className: webapprouting.kubernetes.azure.com
  host: ""
  path: /
  pathType: Prefix
  tlsKeyVaultUri: ""  # e.g. https://kv-app-prod.vault.azure.net/certificates/app-tls

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi

nodeSelector:
  nodepool: app

# Tolerations for scheduling on tainted nodes.
# For job deployments, use:
#   tolerations:
#     - key: workload
#       operator: Equal
#       value: job
#       effect: NoSchedule
tolerations: []

serviceAccount:
  create: true
  annotations: {}

hpa:
  enabled: false
  minReplicas: 1
  maxReplicas: 3
  targetCPUUtilizationPercentage: 80

liquibase:
  enabled: false
  image:
    repository: ghcr.io/OWNER/app-backend
    tag: latest
  changelogPath: /liquibase/changelog.xml

# Env vars for the main application container
env: []

# Env vars for the Liquibase migration Job (runs as pre-upgrade hook,
# NOT as an init container — avoids race conditions in multi-replica deploys).
# Set in per-service values files.
liquibaseEnv: []

labels:
  app.kubernetes.io/part-of: app
  • Step 3: Write per-service values overrides

helm/app-service/values-backend.yaml:

image:
  repository: ghcr.io/OWNER/app-backend
  # tag is rewritten automatically by CI (Task 12)
  tag: latest

service:
  port: 8080

ingress:
  enabled: true
  host: app.example.com
  path: /

liquibase:
  enabled: true
  changelogPath: /liquibase/changelog.xml

liquibaseEnv:
  - name: LIQUIBASE_COMMAND_URL
    value: "jdbc:postgresql://$(POSTGRES_HOST):5432/app_db?sslmode=require"
  - name: LIQUIBASE_COMMAND_USERNAME
    value: "pgadmin"
  - name: LIQUIBASE_COMMAND_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-credentials
        key: password

env:
  - name: SPRING_DATASOURCE_URL
    value: "jdbc:postgresql://$(POSTGRES_HOST):5432/app_db?sslmode=require"
  - name: SPRING_PROFILES_ACTIVE
    value: "azure"

helm/app-service/values-storage.yaml:

image:
  repository: ghcr.io/OWNER/app-storage
  # tag is rewritten automatically by CI (Task 12)
  tag: latest

service:
  port: 8081

ingress:
  enabled: false

liquibase:
  enabled: true
  changelogPath: /liquibase/changelog.xml

liquibaseEnv:
  - name: LIQUIBASE_COMMAND_URL
    value: "jdbc:postgresql://$(POSTGRES_HOST):5432/storage_db?sslmode=require"
  - name: LIQUIBASE_COMMAND_USERNAME
    value: "pgadmin"
  - name: LIQUIBASE_COMMAND_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-credentials
        key: password

env:
  - name: SPRING_DATASOURCE_URL
    value: "jdbc:postgresql://$(POSTGRES_HOST):5432/storage_db?sslmode=require"
  - name: SPRING_PROFILES_ACTIVE
    value: "azure"

helm/app-service/values-importer.yaml:

image:
  repository: ghcr.io/OWNER/app-importer
  # tag is rewritten automatically by CI (Task 12)
  tag: latest

service:
  port: 8082

ingress:
  enabled: false

liquibase:
  enabled: true
  changelogPath: /liquibase/changelog.xml

liquibaseEnv:
  - name: LIQUIBASE_COMMAND_URL
    value: "jdbc:postgresql://$(POSTGRES_HOST):5432/importer_db?sslmode=require"
  - name: LIQUIBASE_COMMAND_USERNAME
    value: "pgadmin"
  - name: LIQUIBASE_COMMAND_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-credentials
        key: password

env:
  - name: SPRING_DATASOURCE_URL
    value: "jdbc:postgresql://$(POSTGRES_HOST):5432/importer_db?sslmode=require"
  - name: SPRING_PROFILES_ACTIVE
    value: "azure"
  • Step 4: Write _helpers.tpl

helm/app-service/templates/_helpers.tpl:

{{/*
Expand the name of the chart release into a fullname.
*/}}
{{- define "app-service.fullname" -}}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Common labels
*/}}
{{- define "app-service.labels" -}}
helm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}
{{ include "app-service.selectorLabels" . }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- range $key, $value := .Values.labels }}
{{ $key }}: {{ $value | quote }}
{{- end }}
{{- end }}

{{/*
Selector labels
*/}}
{{- define "app-service.selectorLabels" -}}
app.kubernetes.io/name: {{ include "app-service.fullname" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}
  • Step 5: Write deployment.yaml template

helm/app-service/templates/deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "app-service.fullname" . }}
  labels:
    {{- include "app-service.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "app-service.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "app-service.labels" . | nindent 8 }}
    spec:
      serviceAccountName: {{ include "app-service.fullname" . }}
      nodeSelector:
        {{- toYaml .Values.nodeSelector | nindent 8 }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      # Liquibase runs as a pre-upgrade hook Job (see migration-job.yaml),
      # NOT as an init container — avoids race conditions in multi-replica deploys.
      containers:
        - name: {{ include "app-service.fullname" . }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - containerPort: {{ .Values.service.port }}
              protocol: TCP
          env:
            {{- toYaml .Values.env | nindent 12 }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          startupProbe:
            httpGet:
              path: /actuator/health/liveness
              port: {{ .Values.service.port }}
            failureThreshold: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: {{ .Values.service.port }}
            initialDelaySeconds: 60
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: {{ .Values.service.port }}
            initialDelaySeconds: 15
            periodSeconds: 5
  • Step 6: Write service.yaml template

helm/app-service/templates/service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: {{ include "app-service.fullname" . }}
  labels:
    {{- include "app-service.labels" . | nindent 4 }}
spec:
  type: {{ .Values.service.type }}
  ports:
    - port: {{ .Values.service.port }}
      targetPort: {{ .Values.service.port }}
      protocol: TCP
  selector:
    {{- include "app-service.selectorLabels" . | nindent 4 }}
  • Step 7: Write ingress.yaml template

helm/app-service/templates/ingress.yaml:

{{- if .Values.ingress.enabled }}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ include "app-service.fullname" . }}
  {{- if .Values.ingress.tlsKeyVaultUri }}
  annotations:
    kubernetes.azure.com/tls-cert-keyvault-uri: {{ .Values.ingress.tlsKeyVaultUri }}
  {{- end }}
spec:
  ingressClassName: {{ .Values.ingress.className }}
  rules:
    - host: {{ .Values.ingress.host }}
      http:
        paths:
          - path: {{ .Values.ingress.path }}
            pathType: {{ .Values.ingress.pathType }}
            backend:
              service:
                name: {{ include "app-service.fullname" . }}
                port:
                  number: {{ .Values.service.port }}
  tls:
    - hosts:
        - {{ .Values.ingress.host }}
      secretName: {{ include "app-service.fullname" . }}-tls
{{- end }}
  • Step 8: Write migration-job.yaml template (Liquibase pre-upgrade hook)

helm/app-service/templates/migration-job.yaml:

{{- if .Values.liquibase.enabled }}
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "app-service.fullname" . }}-migrate
  labels:
    {{- include "app-service.labels" . | nindent 4 }}
  annotations:
    "helm.sh/hook": pre-upgrade,pre-install
    "helm.sh/hook-weight": "-1"
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  backoffLimit: 3
  template:
    metadata:
      labels:
        {{- include "app-service.selectorLabels" . | nindent 8 }}
    spec:
      restartPolicy: Never
      nodeSelector:
        {{- toYaml .Values.nodeSelector | nindent 8 }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      containers:
        - name: liquibase
          image: "{{ .Values.liquibase.image.repository }}:{{ .Values.liquibase.image.tag }}"
          command: ["liquibase"]
          args:
            - "--changelog-file={{ .Values.liquibase.changelogPath }}"
            - "update"
          env:
            {{- toYaml .Values.liquibaseEnv | nindent 12 }}
{{- end }}
  • Step 9: Write hpa.yaml and serviceaccount.yaml templates

helm/app-service/templates/hpa.yaml:

{{- if .Values.hpa.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "app-service.fullname" . }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "app-service.fullname" . }}
  minReplicas: {{ .Values.hpa.minReplicas }}
  maxReplicas: {{ .Values.hpa.maxReplicas }}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.hpa.targetCPUUtilizationPercentage }}
{{- end }}

helm/app-service/templates/serviceaccount.yaml:

{{- if .Values.serviceAccount.create }}
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ include "app-service.fullname" . }}
  annotations:
    {{- toYaml .Values.serviceAccount.annotations | nindent 4 }}
  labels:
    {{- include "app-service.labels" . | nindent 4 }}
{{- end }}
  • Step 10: Lint the Helm chart
helm lint helm/app-service/
helm lint helm/app-service/ -f helm/app-service/values-backend.yaml
helm lint helm/app-service/ -f helm/app-service/values-storage.yaml
helm lint helm/app-service/ -f helm/app-service/values-importer.yaml

Expected: all pass with no errors.
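
Beyond linting, rendering a release locally is a quick way to inspect the generated manifests before deploying (the release name app-backend is just an example):

helm template app-backend helm/app-service/ -f helm/app-service/values-backend.yaml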

  • Step 11: Commit
git add helm/
git commit -m "feat: add shared Helm chart for APP microservices with per-service overrides"

Task 11: GitHub Actions -- Terraform Workflow

Files:

  • Create: .github/workflows/terraform.yml

  • Step 1: Write terraform.yml

.github/workflows/terraform.yml:

name: Terraform

on:
  pull_request:
    paths:
      - "terraform/**"
  push:
    branches:
      - develop
      - main
    paths:
      - "terraform/**"

permissions:
  id-token: write
  contents: read
  pull-requests: write

env:
  ARM_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
  ARM_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}
  ARM_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
  ARM_USE_OIDC: "true"
  TF_VERSION: "1.9.0"

jobs:
  determine-env:
    runs-on: ubuntu-latest
    outputs:
      environment: ${{ steps.env.outputs.environment }}
    steps:
      - id: env
        run: |
          if [[ "${{ github.ref }}" == "refs/heads/main" ]]; then
            echo "environment=preprod" >> "$GITHUB_OUTPUT"
          elif [[ "${{ github.ref }}" == "refs/heads/develop" ]]; then
            echo "environment=dev" >> "$GITHUB_OUTPUT"
          else
            echo "environment=dev" >> "$GITHUB_OUTPUT"
          fi

  plan:
    needs: determine-env
    runs-on: ubuntu-latest
    environment: ${{ needs.determine-env.outputs.environment }}
    defaults:
      run:
        working-directory: terraform
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Terraform Init
        run: terraform init -backend-config="key=app-${{ needs.determine-env.outputs.environment }}.tfstate"

      - name: Terraform Format Check
        run: terraform fmt -check -recursive

      - name: Terraform Validate
        run: terraform validate

      - name: Terraform Plan
        id: plan
        run: |
          terraform plan \
            -var-file="environments/${{ needs.determine-env.outputs.environment }}.tfvars" \
            -var="postgres_admin_password=${{ secrets.POSTGRES_ADMIN_PASSWORD }}" \
            -out=tfplan \
            -no-color
        continue-on-error: true

      - name: Upload Plan Artifact
        if: success() || steps.plan.outcome == 'success'
        uses: actions/upload-artifact@v4
        with:
          name: tfplan-${{ needs.determine-env.outputs.environment }}
          path: terraform/tfplan
          retention-days: 5

      - name: Comment PR with Plan
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const output = `#### Terraform Plan 📖
            **Environment:** \`${{ needs.determine-env.outputs.environment }}\`
            **Result:** \`${{ steps.plan.outcome }}\`

            <details><summary>Show Plan</summary>

            \`\`\`
            ${{ steps.plan.outputs.stdout }}
            \`\`\`

            </details>`;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });

      - name: Plan Status
        if: steps.plan.outcome == 'failure'
        run: exit 1

  apply:
    needs: [determine-env, plan]
    if: github.event_name == 'push' && (github.ref == 'refs/heads/develop' || github.ref == 'refs/heads/main')
    runs-on: ubuntu-latest
    environment: ${{ needs.determine-env.outputs.environment }}
    defaults:
      run:
        working-directory: terraform
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Terraform Init
        run: terraform init -backend-config="key=app-${{ needs.determine-env.outputs.environment }}.tfstate"

      - name: Download Plan Artifact
        uses: actions/download-artifact@v4
        with:
          name: tfplan-${{ needs.determine-env.outputs.environment }}
          path: terraform

      - name: Terraform Apply
        run: terraform apply tfplan

  apply-prod:
    needs: [plan]
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: prod
    defaults:
      run:
        working-directory: terraform
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Terraform Init
        run: terraform init -backend-config="key=app-prod.tfstate"

      - name: Terraform Plan (Prod)
        run: |
          terraform plan \
            -var-file="environments/prod.tfvars" \
            -var="postgres_admin_password=${{ secrets.POSTGRES_ADMIN_PASSWORD }}" \
            -out=tfplan-prod

      - name: Terraform Apply (Prod)
        run: terraform apply tfplan-prod
  • Step 2: Validate workflow syntax
yq '.' .github/workflows/terraform.yml > /dev/null && echo "YAML OK" || echo "YAML FAIL"
  • Step 3: Commit
git add .github/workflows/terraform.yml
git commit -m "feat: add GitHub Actions workflow for Terraform plan/apply"

Task 12: Reusable CI Workflow Template (lives in app-deployment repo)

Each application source repo needs a CI workflow that builds images and updates tags in app-deployment. Instead of duplicating this across repos, provide a reusable workflow template.

Files:

  • Create: app-deployment/.github/workflow-templates/build-and-update-tag.yml

  • Create: .github/workflows/app-deploy.yml (example for app source repos to copy)

  • Step 1: Write app-deploy.yml

Note: This workflow lives in each application source repo (not app-deployment). The version below is a template; each app repo copies it and adjusts the IMAGE_NAME, DOCKER_CONTEXT, and HELM_VALUES_FILE values.

.github/workflows/app-deploy.yml:

# APP Service CI — Template for each application source repo.
# Copy this file to each app repo's .github/workflows/ and set the env vars below.
#
# Required secrets (set at org level for all app repos):
#   DEPLOY_APP_ID          — GitHub App ID with write access to app-deployment
#   DEPLOY_APP_PRIVATE_KEY — GitHub App private key
name: CI

on:
  pull_request:
  push:
    branches: [develop, main]

permissions:
  id-token: write
  contents: read
  packages: write

env:
  REGISTRY: ghcr.io
  # ── Per-repo config (change these when copying to a new repo) ──
  IMAGE_NAME: app-backend              # GHCR image name
  DOCKER_CONTEXT: .                    # Docker build context path
  HELM_VALUES_FILE: values-backend.yaml # Corresponding file in app-deployment

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and test
        run: echo "Replace with your build/test commands"

  push-image:
    needs: build-and-test
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        uses: docker/build-push-action@v6
        with:
          context: ${{ env.DOCKER_CONTEXT }}
          push: true
          tags: |
            ${{ env.REGISTRY }}/${{ github.repository_owner }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
            ${{ env.REGISTRY }}/${{ github.repository_owner }}/${{ env.IMAGE_NAME }}:latest

  # Cross-repo: update image tag in app-deployment so ArgoCD picks it up
  update-deployment:
    needs: push-image
    runs-on: ubuntu-latest
    steps:
      - name: Generate GitHub App Token
        id: app-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.DEPLOY_APP_ID }}
          private-key: ${{ secrets.DEPLOY_APP_PRIVATE_KEY }}
          repositories: app-deployment

      - name: Checkout app-deployment repo
        uses: actions/checkout@v4
        with:
          repository: ${{ github.repository_owner }}/app-deployment
          token: ${{ steps.app-token.outputs.token }}

      - name: Update image tag
        run: |
          sed -i "s|tag:.*|tag: ${{ github.sha }}|" \
            "helm/app-service/${{ env.HELM_VALUES_FILE }}"

      - name: Commit and push updated tags
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add helm/app-service/values-*.yaml
          git diff --cached --quiet || git commit -m "chore: update image tags to ${{ github.sha }} [skip ci]"
          git push
  • Step 2: Validate workflow syntax
yq '.' .github/workflows/app-deploy.yml > /dev/null && echo "YAML OK" || echo "YAML FAIL"
  • Step 3: Commit
git add .github/workflows/app-deploy.yml
git commit -m "feat: add GitHub Actions workflow for app build and deploy"

Task 13: Dev Environment Bootstrap and Validation

This task is executed manually after all code is committed. It provisions the Dev environment and validates end-to-end.

  • Step 1: Create the Terraform state backend
az group create --name rg-app-tfstate --location westeurope
az storage account create \
  --name stappinfratfstate \
  --resource-group rg-app-tfstate \
  --location westeurope \
  --sku Standard_LRS \
  --min-tls-version TLS1_2
az storage container create \
  --name tfstate \
  --account-name stappinfratfstate
  • Step 2: Initialize and apply Terraform for Dev
cd terraform
terraform init -backend-config="key=app-dev.tfstate"
terraform plan \
  -var-file="environments/dev.tfvars" \
  -var="postgres_admin_password=REPLACE_WITH_SECURE_PASSWORD" \
  -out=dev.tfplan
terraform apply dev.tfplan
  • Step 3: Store postgres admin password in Key Vault
az keyvault secret set \
  --vault-name kv-app-dev \
  --name postgres-admin-password \
  --value "REPLACE_WITH_SECURE_PASSWORD"
  • Step 4: Verify Azure resources exist
az group show --name rg-app-dev --query "properties.provisioningState" -o tsv
# Expected: Succeeded

az aks show --resource-group rg-app-dev --name aks-app-dev --query "provisioningState" -o tsv
# Expected: Succeeded

az postgres flexible-server show --resource-group rg-app-dev --name psql-app-dev --query "state" -o tsv
# Expected: Ready

# Internal storage account
az storage account show --resource-group rg-app-dev --name stappdevint --query "provisioningState" -o tsv
# Expected: Succeeded

# SFTP storage account
az storage account show --resource-group rg-app-dev --name stappdevsftp --query "provisioningState" -o tsv
# Expected: Succeeded

# NAT Gateway
az network nat gateway show --resource-group rg-app-dev --name natgw-app-dev --query "provisioningState" -o tsv
# Expected: Succeeded
  • Step 5: Connect to AKS and apply K8s base config
az aks get-credentials --resource-group rg-app-dev --name aks-app-dev --overwrite-existing
kubectl apply -f k8s/namespaces.yaml
kubectl apply -f k8s/network-policies/
kubectl apply -f k8s/pdbs.yaml
kubectl apply -f k8s/resource-quotas.yaml
kubectl apply -f k8s/rbac/
kubectl apply -f k8s/karpenter/
  • Step 6: Verify namespaces and network policies
kubectl get namespaces app jobs monitoring
# Expected: all Active

kubectl get networkpolicies -n app
# Expected: default-deny-all, allow-app-inter-service, allow-egress-postgres, allow-egress-blob

kubectl get networkpolicies -n jobs
# Expected: default-deny-all, allow-egress-blob, allow-jobs-internal (no postgres — jobs don't access DB)
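
Optionally, spot-check database reachability through the egress policy from a throwaway pod carrying the app label. The admin username follows the Helm values from Task 10; psql will prompt for the password:

PGHOST="$(terraform -chdir=terraform output -raw postgres_server_fqdn)"
kubectl run pg-check -n app --rm -it --restart=Never \
  --labels=app.kubernetes.io/part-of=app \
  --image=postgres:16 -- \
  psql "host=${PGHOST} user=pgadmin dbname=postgres sslmode=require"
# Expected: a psql prompt after entering the password (confirms DNS + allow-egress-postgres)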
  • Step 7: Install observability stack
chmod +x k8s/observability/install.sh
bash k8s/observability/install.sh
  • Step 8: Verify observability pods are running
kubectl get pods -n monitoring
# Expected: prometheus-*, grafana-*, loki-*, alertmanager-*, node-exporter-* all Running

kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &
# Open http://localhost:3000 — Grafana login should appear
kill %1
  • Step 9 (optional, prod only): Enable DDoS Protection Plan
# Uncomment and run for prod environment only:
# az network ddos-protection create \
#   --resource-group rg-app-prod \
#   --name ddos-app-prod \
#   --location westeurope
# az network vnet update \
#   --resource-group rg-app-prod \
#   --name vnet-app-prod \
#   --ddos-protection-plan ddos-app-prod \
#   --ddos-protection true
  • Step 10: Commit README

Create a README.md with quickstart instructions covering the steps above, then:

git add README.md
git commit -m "docs: add README with bootstrap and deployment instructions"