Migration of the APP application from on-premises Docker/MicroK8s to Azure Kubernetes Service with managed backing services.
Context
APP is a Java 17 / Vue.js microservices application for document processing with computational job execution. It currently runs on on-premises BTC servers using Docker with MicroK8s for orchestration, GitLab CI/CD, Oracle databases, and MinIO object storage.
The organization has decided to migrate APP to Azure with a migration-first philosophy: minimize code changes, get running on Azure, then optimize.
Application Modules
| Module | Tech | Port | Purpose |
| --- | --- | --- | --- |
| APP-Backend | Java 17 | 8080 | Core backend, serves the Vue.js frontend |
| APP-Frontend | Vue.js | via backend | User-facing SPA |
| APP-Storage | Java 17 | 8081 `/storage/api/v1/...` | Storage microservice |
| APP-Importer | Java 17 | 8082 `/importer/api/v1/...` | File import, forwards to Storage service |
Supporting workloads:
Job pipeline: Job-Initializer → ScriptRunner → Job-Collector executing Matlab Runtime and Python scripts in container pods
Jobs are primarily user-triggered (upload/click → run), with occasional scheduled batch runs
Architecture: AKS-Centric
All workloads run in AKS. Managed Azure services provide databases, object storage, secrets, and SFTP ingestion. This approach was chosen because:
The team already runs MicroK8s with NGINX ingress — the K8s mental model transfers directly
The job pipeline requires K8s anyway (Jobs/CronJobs), so splitting workloads across platforms adds unnecessary complexity
At the expected scale (~1k-5k daily users), PaaS alternatives offer no meaningful advantage over AKS
AKS Cluster (one per environment)
Three static node pools (system, app, monitoring) plus Karpenter-managed job nodes:
Job Execution: AKS Node Auto-Provisioning (Karpenter)
User-uploaded scripts (Matlab, Python, etc.) have unpredictable resource needs — from 50MB/2vCPU to 500GB/80vCPU. Static node pools cannot serve this range. Instead, AKS Node Auto-Provisioning (Karpenter) dynamically provisions right-sized VMs based on each job pod's resources.requests.
Allowed VM families: D-series (balanced compute), E-series (memory-optimized)
Max per node: 96 vCPU, 672 GiB RAM (E96as_v5 ceiling)
Scale from zero: nodes are provisioned on demand and terminated when idle (consolidation policy)
Taint: workload=job:NoSchedule — only job pods with matching toleration schedule here
Resource requests: set per job, either by user input (explicit CPU/RAM) or by system estimation — an application-level decision, not infrastructure (see the example Job below)
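As a concrete illustration, a job pod that should land on a Karpenter-provisioned node needs the matching toleration, the job-node selector, and explicit resource requests. The image name, Job name, and request values below are placeholders, not taken from the application code — a minimal sketch only:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-script-run          # hypothetical name
  namespace: jobs
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      # Only schedule onto Karpenter-managed job nodes
      nodeSelector:
        nodepool: jobs
      tolerations:
        - key: workload
          value: job
          effect: NoSchedule
      containers:
        - name: runner
          image: ghcr.io/ORG/scriptrunner:TAG   # placeholder image
          resources:
            requests:
              cpu: "4"              # per-job values from user input or system estimation
              memory: 16Gi
            limits:
              cpu: "4"
              memory: 16Gi
```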
Ingress: Managed NGINX
Same NGINX Ingress syntax the team already uses on MicroK8s — configs transfer as-is (see the example after this list)
Azure manages the controller lifecycle (upgrades, scaling, HA)
Free — included with AKS
Auto-integrates with Azure DNS and Key Vault for TLS certificates
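For example, an Ingress that already works on MicroK8s should carry over unchanged; only the IngressClass name may need adjusting to whatever the managed add-on registers. Host, service name, annotation, and TLS secret below are illustrative placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-backend
  namespace: app
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: 100m   # example NGINX annotation, carries over as-is
spec:
  ingressClassName: nginx   # adjust if the AKS managed add-on exposes a different class name
  tls:
    - hosts: [app.example.com]
      secretName: app-tls    # issued via the Key Vault integration
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-backend
                port:
                  number: 8080
```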
Observability: Grafana + Prometheus + Loki
The existing Grafana + Prometheus stack migrates into AKS on a dedicated monitoring node pool (tainted dedicated=monitoring:NoSchedule). Loki is added for log aggregation, Alloy for log collection. Azure Container Insights (basic) supplements with API server logs in Azure Portal.
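To pin the stack onto the tainted monitoring pool, each chart's values need a matching toleration and node selector. A minimal sketch — the node pool label (`agentpool: monitoring`) and the exact value paths depend on the charts actually used (e.g., kube-prometheus-stack, loki, alloy) and are assumptions here:

```yaml
# Example values fragment for a monitoring chart (value paths vary per chart)
nodeSelector:
  agentpool: monitoring        # assumed monitoring node pool label
tolerations:
  - key: dedicated
    operator: Equal
    value: monitoring
    effect: NoSchedule
```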
All managed services connect via private endpoints or VNet integration. No public IPs except the Ingress load balancer. SFTP endpoint is public but restricted to whitelisted source IPs.
Encryption
TLS 1.2+ on all connections (ingress, database, blob storage)
PostgreSQL: SSL enforced
Blob Storage: AES-256 encryption at rest (Azure-managed keys)
AKS: etcd encrypted at rest
Identity & Access
AKS Workload Identity for all service-to-Azure authentication (Key Vault, Blob Storage, PostgreSQL) — see the sketch after this list
No passwords or connection strings in environment variables or ConfigMaps
Azure RBAC for cluster administrator access
Separate K8s namespaces per application concern (app, jobs, monitoring)
Job pods: cluster-internal only (deny internet unless specifically required)
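In practice, Workload Identity means each Kubernetes service account is annotated with the client ID of a user-assigned managed identity, and pods using it carry the opt-in label. The client ID below is a placeholder:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend
  namespace: app
  annotations:
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"  # managed identity client ID (placeholder)
# Pods using this service account must also carry the opt-in label in the
# Deployment pod template:
#   metadata:
#     labels:
#       azure.workload.identity/use: "true"
```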
SFTP Access
Azure Blob Storage SFTP endpoint with local user accounts
SSH key authentication only (no passwords)
IP whitelist via NSG on sftp-subnet
Data & Storage
PostgreSQL Flexible Server
Single Azure Database for PostgreSQL Flexible Server instance per environment, hosting three databases:
| Database | Schemas | Used By |
| --- | --- | --- |
| app_db | APP_ADMIN, APP_USER | Backend service |
| storage_db | STORAGE_ADMIN, STORAGE_USER | Storage service |
| importer_db | IMPORTER_ADMIN, IMPORTER_USER | Importer service |
Sizing (start small, scale via tfvars):
| Environment | SKU | Backup Retention | Redundancy |
| --- | --- | --- | --- |
| Dev | Burstable B1ms | 7 days | None |
| PreProd | Burstable B1ms | 7 days | None |
| Prod | Burstable B1ms | 35 days | None (requires GP tier) |
Scale-up path: change postgres_sku_name to GP_Standard_D2s_v3 and enable postgres_geo_redundant_backup — both require General Purpose tier.
Oracle → PostgreSQL migration:
Liquibase changelogs require one-time review for Oracle-specific SQL (sequences, data types, PL/SQL)
JPA/Hibernate dialect switch from Oracle12cDialect to PostgreSQLDialect (see the sketch after this list)
Liquibase migrations run automatically on each service deployment — as a Helm pre-upgrade hook Job (see CI/CD section) rather than the init containers used today
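Assuming standard Spring Boot configuration (the actual property and env var names still need validation against the codebase, per the assumptions table later in this plan), the dialect and datasource change amounts to something like this application.yml fragment:

```yaml
# Hypothetical application.yml fragment — confirm real property names in the source repos
spring:
  datasource:
    # Typically injected as SPRING_DATASOURCE_URL via Helm values / Key Vault
    url: jdbc:postgresql://psql-app-dev.postgres.database.azure.com:5432/app_db?sslmode=require
  jpa:
    properties:
      hibernate:
        dialect: org.hibernate.dialect.PostgreSQLDialect   # was org.hibernate.dialect.Oracle12cDialect
```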
Azure Blob Storage
An S3-compatibility layer is enabled initially to support the migration-first approach — the existing Java S3 client code keeps working with only an endpoint swap. Native Azure Blob SDK adoption is deferred to a later phase.
| Container | Purpose | Access |
| --- | --- | --- |
| sftp-ingest | Landing zone for SFTP uploads | SFTP users write, Importer reads |
| shared-storage | Internal object storage (replaces Minio) | Storage Svc + Importer read/write |
| job-artifacts | Matlab/Python job inputs and outputs | Job pods read/write |
Storage redundancy:
All environments: LRS (locally redundant) — upgrade Prod to ZRS when scale justifies it
No automatic deletion lifecycle policies. Data is retained until explicitly removed.
Repository Structure
Repositories are split by concern and change cadence. The application consists of multiple existing source repos (not listed here — they predate this infrastructure design). This section defines the infrastructure and deployment repos, plus the integration contract that each application repo must follow.
Infrastructure & Deployment Repos (created by this plan)
Application Source Repos (existing, not managed by this plan)
Each repo that produces a deployable container image must follow the integration contract below. The exact repo list should be filled in during onboarding.
| Repo | Image(s) Produced | Helm Values File | Notes |
| --- | --- | --- | --- |
| `<TBD: backend repo>` | app-backend | values-backend.yaml | Serves Vue.js frontend |
| `<TBD: storage repo>` | app-storage | values-storage.yaml | |
| `<TBD: importer repo>` | app-importer | values-importer.yaml | |
| `<TBD: additional repos>` | `<image-name>` | `values-<name>.yaml` | Add rows as needed |
Integration Contract for Application Repos
Each application source repo must:
1. Build an OCI image and push to GHCR tagged with the git SHA
2. Update its image tag in app-deployment via cross-repo push (using a GitHub App token with write access to app-deployment)
3. Include [skip ci] in the commit message to avoid triggering the app-deployment lint workflow
4. Own its Helm values file in app-deployment (e.g., values-backend.yaml) — this is where service-specific config lives (ports, env vars, resource requests, Liquibase config)
5. Follow the branch strategy: develop → Dev, main → PreProd, manual ArgoCD sync → Prod
A reusable GitHub Actions workflow template for steps 1-3 should be provided in app-deployment under .github/workflow-templates/ for application repos to adopt.
CI/CD: GitHub Enterprise + Actions + ArgoCD
Pipeline Flow
App source repo: Push → Build & Test → Build OCI Image → Push to GHCR → Cross-repo update tag in app-deployment
app-deployment repo: ArgoCD detects tag change → syncs to AKS
app-infrastructure repo: Terraform plan/apply (separate lifecycle)
Environment Promotion
Trigger
Target
Push to feature branch (any app repo)
Build + test only (no deploy)
Merge to develop (any app repo)
Push image, update tag in app-deployment → ArgoCD auto-syncs Dev
Merge to main (any app repo)
Push image, update tag in app-deployment → ArgoCD auto-syncs PreProd
Deployment method: ArgoCD watches app-deployment repo for Helm value changes and auto-syncs. Application repos build images and push updated tags to app-deployment via GitHub App token (cross-repo).
Secrets: GitHub Actions OIDC → Azure Workload Identity Federation (no stored Azure credentials in GitHub)
Database migrations: Liquibase runs as a Helm pre-upgrade hook Job (not an init container) — prevents race conditions in multi-replica deploys (see the sketch after this list)
Job RBAC: Backend service account has Role + RoleBinding to create/manage K8s Jobs in jobs namespace
Job container images: Matlab Runtime and Python runner images in separate app-job-images repo (different build lifecycle)
Branch strategy: develop + main (matches current team workflow)
Adding new services: create a new values-<name>.yaml in app-deployment, add an ArgoCD Application manifest, and wire the source repo's CI to push image tags — no infrastructure changes needed
K8s backup: Velero with daily scheduled backups to Azure Blob Storage (168h retention)
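A minimal sketch of the pre-upgrade hook Job referenced above, as a Helm chart template; the image, Liquibase invocation, and changelog path are assumptions to be replaced with real values during Task 10:

```yaml
# helm/app-service/templates/migration-job.yaml — sketch only
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-liquibase
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "0"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: liquibase
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          # Assumed invocation — changelog path and DB connection flags must be
          # mapped to the real image layout and credentials (Task 10)
          command: ["liquibase", "update", "--changelog-file=/liquibase/changelog.xml"]
```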
Environments & Cost
Node Sizing (start small, scale via tfvars)
Dev/PreProd start single-node per pool. Prod runs 2 nodes in the system and app pools for zero-downtime maintenance (monitoring stays single-node). Scale up by changing tfvars.
| Component | Dev | PreProd | Prod |
| --- | --- | --- | --- |
| AKS control plane | Standard (required by Karpenter) | Standard | Standard |
| AKS system pool | 1x Standard_D2s_v5 | 1x Standard_D2s_v5 | 2x Standard_D2s_v5 |
| AKS app pool | 1x Standard_D2s_v5 | 1x Standard_D2s_v5 | 2x Standard_D2s_v5 |
| AKS monitoring pool | 1x Standard_D2s_v5 | 1x Standard_D2s_v5 | 1x Standard_D2s_v5 |
| AKS job nodes | Karpenter (0→N on demand) | Karpenter (0→N) | Karpenter (0→N) |
| PostgreSQL | Burstable B1ms | Burstable B1ms | Burstable B1ms |
| Blob Storage | LRS | LRS | ZRS |
| Key Vault | Standard | Standard | Standard |
Job worker pods get their own PVCs via job-scratch StorageClass (StandardSSD_LRS, WaitForFirstConsumer binding for zone-awareness with Karpenter).
Monthly Cost Estimate (West Europe, pay-as-you-go)
| Component | Dev | PreProd | Prod |
| --- | --- | --- | --- |
| AKS nodes (system+app+monitoring) | ~€120 | ~€120 | ~€180 |
| AKS control plane (Standard) | ~€60 | ~€60 | ~€60 |
| PostgreSQL | ~€15 | ~€15 | ~€25 |
| Blob Storage | ~€5 | ~€5 | ~€12 |
| Key Vault | ~€5 | ~€5 | ~€5 |
| Load Balancer | ~€20 | ~€20 | ~€20 |
| NAT Gateway | ~€40 | ~€40 | ~€40 |
| Log Analytics (basic) | ~€10 | ~€10 | ~€10 |
| **Total** | **~€275/mo** | **~€275/mo** | **~€352/mo** |
Notes:
Current scale target: ~50-100 daily users, built ready to scale
Job pool nodes only incur cost when jobs are running (auto-scale from zero)
Grafana/Prometheus/Loki run in-cluster on the monitoring pool — no managed-service fees beyond the node cost already counted above
GHCR storage is included with GitHub Enterprise
Scale-up path: change SKU/node counts in tfvars, terraform apply — no re-architecture needed
When scaling: add nodes, move PostgreSQL to General Purpose, enable geo-redundant backup
Out of Scope
Application code changes beyond Oracle→PostgreSQL dialect and Minio→Blob endpoint swap
Azure Blob SDK migration (deferred — S3-compat layer used initially)
WAF (Web Application Firewall) — can be added later via Azure Front Door if needed
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Provision Azure infrastructure for the APP application migration from on-prem Docker/MicroK8s to AKS with managed backing services, deployed via Terraform, Helm, and ArgoCD.
Architecture: AKS-Centric with dedicated node pools (system, app, monitoring) + Karpenter NAP for jobs, managed PostgreSQL Flexible Server, Azure Blob Storage (dual accounts), ArgoCD GitOps deployment. Three environments: Dev, PreProd, Prod.
This plan was designed WITHOUT access to the application source code. The infrastructure side (Tasks 1-9) is complete and reviewed. The application-facing side (Tasks 10-13) uses assumptions that need validation against the real codebase.
What to execute as-is
Tasks 1-9 (infrastructure + K8s base config) are infra-only — they create Azure resources, networking, and K8s manifests with no dependency on application code. These have been through 25 specialist reviews and can be executed immediately:
Tasks 10-13 need adjustment based on the actual application repos:
Task 10 (Helm chart): Verify service ports match reality (assumed 8080/8081/8082). Check health endpoint paths (/actuator/health/* assumed — may differ). Adjust resource requests based on actual service profiles. Map Liquibase changelog paths to real locations in the Docker images.
Task 11 (Terraform CI): Execute as-is — no app dependency.
Task 12 (App CI template): Adapt the workflow template to each real source repo. Fill in actual IMAGE_NAME, DOCKER_CONTEXT, and HELM_VALUES_FILE values. Add repo-specific build/test commands.
Task 13 (Bootstrap): Execute after Tasks 1-9 are applied. App deployment validation (Helm install) depends on Task 10 adjustments.
Assumptions to validate against source code
| Assumption | Where used | What to check |
| --- | --- | --- |
| Services listen on 8080, 8081, 8082 | Helm values, network policies | Actual ports in Dockerfiles / Spring Boot config |
| Health endpoints at /actuator/health/liveness and /readiness | deployment.yaml | Actual health check paths |
| Spring Boot with SPRING_DATASOURCE_URL env var | Helm values | Actual env var names for DB connection |
| Liquibase changelogs at /liquibase/changelog.xml | migration-job.yaml | Actual changelog location in Docker image |
| Backend creates K8s Jobs for script execution | RBAC, network policies | How job creation actually works in the code |
| Frontend bundled in backend Docker image | No separate frontend deployment | Verify — may need separate Helm values |
| 3 databases: app_db, storage_db, importer_db | Terraform database module | Actual database names and schema requirements |
| Services communicate via HTTP (Backend → Storage, Importer → Storage) | Network policies | Actual inter-service call patterns |
Adding new services
When onboarding an additional application repo:
1. Add a values-<name>.yaml in app-deployment/helm/app-service/
2. Add an ArgoCD Application manifest in app-deployment/argocd/ (see the example below)
3. Copy the CI workflow template to the source repo, set 3 env vars
4. If the service needs DB access, add a database in the Terraform database module and a network policy egress rule
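An illustrative ArgoCD Application for a new service in the Dev environment; the name, repo URL, and target revision are placeholders to adapt per service and environment:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-<name>-dev             # placeholder
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<org>/app-deployment.git   # placeholder
    targetRevision: develop
    path: helm/app-service
    helm:
      valueFiles:
        - values-<name>.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: app
  syncPolicy:
    automated:
      prune: true
      selfHeal: true    # Dev/PreProd auto-sync; omit automated sync for Prod (manual sync)
```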
Application Source Repos (existing — NOT created by this plan)
The application consists of multiple existing repos (backend, frontend, storage, importer, etc.). Each repo that produces a deployable container image must adopt the integration contract:
Build OCI image → push to GHCR tagged with git SHA
Cross-repo push updated image tag to app-deployment (via GitHub App token)
Include [skip ci] in the tag-update commit
A reusable workflow template is provided in app-deployment/.github/workflow-templates/ for adoption.
Fill in the actual repo names and image mappings during onboarding:
```hcl
# Partial backend configuration. The state file key is passed per-environment
# at init time via: terraform init -backend-config "key=app-${ENV}.tfstate"
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-app-tfstate"
    storage_account_name = "stappinfratfstate"
    container_name       = "tfstate"
  }
}
```
Step 5: Write root variables.tf with shared variables
terraform/variables.tf:
variable"subscription_id" {
description="Azure subscription ID"type=string
}
variable"environment" {
description="Environment name: dev, preprod, or prod"type=stringvalidation {
condition=contains(["dev", "preprod", "prod"], var.environment)
error_message="Environment must be dev, preprod, or prod."
}
}
variable"location" {
description="Azure region"type=stringdefault="westeurope"
}
variable"project_name" {
description="Project identifier used in resource naming"type=stringdefault="app"
}
variable"sftp_allowed_ips" {
description="List of IP addresses allowed to connect via SFTP"type=list(string)
default=[]
}
variable"aks_system_pool_count" {
description="Number of nodes in the AKS system pool"type=numberdefault=1
}
variable"aks_app_pool_count" {
description="Number of nodes in the AKS app pool"type=numberdefault=1
}
variable"aks_app_pool_max_count" {
description="Max nodes for AKS app pool autoscaler (0 = no autoscaling)"type=numberdefault=0
}
variable"node_auto_provisioning_enabled" {
description="Enable Karpenter-based Node Auto-Provisioning for job workloads"type=booldefault=true
}
variable"postgres_sku_name" {
description="PostgreSQL Flexible Server SKU"type=stringdefault="B_Standard_B1ms"
}
variable"postgres_backup_retention_days" {
description="PostgreSQL backup retention in days"type=numberdefault=7
}
variable"postgres_geo_redundant_backup" {
description="Enable geo-redundant backup for PostgreSQL"type=booldefault=false
}
variable"storage_replication_type" {
description="Blob Storage replication: LRS or ZRS"type=stringdefault="LRS"validation {
condition=contains(["LRS", "ZRS"], var.storage_replication_type)
error_message="Must be LRS or ZRS."
}
}
environment="dev"subscription_id="REPLACE_WITH_SUBSCRIPTION_ID"# AKSaks_system_pool_count=1aks_app_pool_count=1aks_app_pool_max_count=0node_auto_provisioning_enabled=true# Karpenter provisions job nodes dynamically# PostgreSQLpostgres_sku_name="B_Standard_B1ms"postgres_backup_retention_days=7postgres_geo_redundant_backup=false# Storagestorage_replication_type="LRS"# Networkingnat_gateway_enabled=true# SFTPsftp_allowed_ips=[]
# Budget (subscription-level, created only in dev apply)budget_alert_emails=["REPLACE_WITH_EMAIL"]
Step 5: Write preprod.tfvars
terraform/environments/preprod.tfvars:
environment="preprod"subscription_id="REPLACE_WITH_SUBSCRIPTION_ID"# AKS — start single-node, scale via these values when neededaks_system_pool_count=1aks_app_pool_count=1aks_app_pool_max_count=0node_auto_provisioning_enabled=true# Karpenter provisions job nodes dynamically# PostgreSQL — start burstable, upgrade to GP_Standard_D2s_v3 when neededpostgres_sku_name="B_Standard_B1ms"postgres_backup_retention_days=7postgres_geo_redundant_backup=false# Storagestorage_replication_type="LRS"# Networkingnat_gateway_enabled=true# SFTPsftp_allowed_ips=[]
# Budget (subscription-level, created only in dev apply)budget_alert_emails=["REPLACE_WITH_EMAIL"]
Step 6: Write prod.tfvars
terraform/environments/prod.tfvars:
environment="prod"subscription_id="REPLACE_WITH_SUBSCRIPTION_ID"# AKS — 2 nodes each for zero-downtime during Azure host maintenanceaks_system_pool_count=2aks_app_pool_count=2aks_app_pool_max_count=0node_auto_provisioning_enabled=true# Karpenter provisions job nodes dynamically# PostgreSQL — start burstable, upgrade to GP_Standard_D2s_v3 when neededpostgres_sku_name="B_Standard_B1ms"postgres_backup_retention_days=35# Geo-redundant backup requires General Purpose tier — enable when upgrading SKU:# postgres_sku_name = "GP_Standard_D2s_v3"# postgres_geo_redundant_backup = truepostgres_geo_redundant_backup=false# Storage — ZRS for zone redundancy in prodstorage_replication_type="ZRS"# Networkingnat_gateway_enabled=true# SFTPsftp_allowed_ips=[]
# Budget (subscription-level, created only in dev apply)budget_alert_emails=["REPLACE_WITH_EMAIL"]
The Backend service needs to create K8s Jobs in the jobs namespace for script execution. This Role + RoleBinding grants the backend's service account the required permissions.
k8s/rbac/job-creator.yaml:
```yaml
# Role granting Job lifecycle management in the jobs namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: job-creator
  namespace: jobs
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "watch", "delete"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
# Bind to the backend service account
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: backend-job-creator
  namespace: jobs
subjects:
  - kind: ServiceAccount
    name: app-backend
    namespace: app
roleRef:
  kind: Role
  name: job-creator
  apiGroup: rbac.authorization.k8s.io
```
Step 11: Write Karpenter NodePool for job workloads
k8s/karpenter/job-nodepool.yaml:
```yaml
# Karpenter NodePool — defines constraints for dynamically provisioned job nodes.
# Nodes are created on-demand when job pods are pending, and consolidated/terminated
# when idle. VM size is selected automatically based on pod resource requests.
# Note: NodePool lives in the karpenter.sh API group; only AKSNodeClass is in karpenter.azure.com.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: job-workers
spec:
  template:
    metadata:
      labels:
        nodepool: jobs
    spec:
      taints:
        - key: workload
          value: job
          effect: NoSchedule
      # Force node recycling after 24h to pick up OS patches
      expireAfter: 24h
      requirements:
        # Allow D-series (balanced) and E-series (memory-optimized) VMs
        - key: karpenter.azure.com/sku-family
          operator: In
          values: ["D", "E"]
        # Only use v5 generation for cost/performance balance
        - key: karpenter.azure.com/sku-version
          operator: In
          values: ["v5"]
        # Limit max VM size to 96 vCPU (E96as_v5 = 96 vCPU, 672 GiB)
        - key: karpenter.azure.com/sku-cpu
          operator: Lt
          values: ["97"]
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: job-workers
  # Consolidation: only terminate nodes when fully empty (not underutilized)
  # to avoid killing nodes mid-job-execution during bursty workloads
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s
  # Limit total resources across all Karpenter-managed job nodes
  # Allows ~2 concurrent max-sized jobs (96 vCPU + 672 GiB each)
  limits:
    cpu: "192"
    memory: "2048Gi"
```
Step 11: Write AKSNodeClass for job nodes
k8s/karpenter/job-nodeclass.yaml:
```yaml
apiVersion: karpenter.azure.com/v1alpha2
kind: AKSNodeClass
metadata:
  name: job-workers
spec:
  # Must match the AKS subnet so job nodes join the same VNet
  # Replace with your actual subnet resource ID after terraform apply
  vnetSubnetID: /subscriptions/SUBSCRIPTION_ID/resourceGroups/rg-app-ENV/providers/Microsoft.Network/virtualNetworks/vnet-app-ENV/subnets/snet-aks
  osDiskSizeGB: 100
  imageFamily: Ubuntu2204
```
Step 12: Write StorageClass for job worker PVCs
Job pods get their own PVCs for scratch space (script inputs/outputs, temp data). Azure Disk PVCs are zone-pinned, so we use volumeBindingMode: WaitForFirstConsumer — the PVC waits until the pod is scheduled to a node, then provisions the disk in the same zone as that node. This prevents zone mismatch with Karpenter-provisioned nodes.
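A manifest matching that description — assuming the Azure Disk CSI driver that AKS ships by default; the file path is an assumed convention:

```yaml
# k8s/storage/job-scratch.yaml (assumed path)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: job-scratch
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_LRS
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer   # provision the disk in the zone of the node the pod lands on
```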
Task 12: Reusable CI Workflow Template (lives in app-deployment repo)
Each application source repo needs a CI workflow that builds images and updates tags in app-deployment. Instead of duplicating this across repos, provide a reusable workflow template.
Create: .github/workflows/app-deploy.yml (example for app source repos to copy)
Step 1: Write app-deploy.yml
Note: This workflow lives in each application source repo (not app-deployment). The version below is a template — each app repo copies it and adjusts the SERVICE_NAME and DOCKER_CONTEXT values.
.github/workflows/app-deploy.yml:
```yaml
# APP Service CI — Template for each application source repo.
# Copy this file to each app repo's .github/workflows/ and set the env vars below.
#
# Required secrets (set at org level for all app repos):
#   DEPLOY_APP_ID          — GitHub App ID with write access to app-deployment
#   DEPLOY_APP_PRIVATE_KEY — GitHub App private key
name: CI

on:
  pull_request:
  push:
    branches: [develop, main]

permissions:
  id-token: write
  contents: read
  packages: write

env:
  REGISTRY: ghcr.io
  # ── Per-repo config (change these when copying to a new repo) ──
  IMAGE_NAME: app-backend                # GHCR image name
  DOCKER_CONTEXT: .                      # Docker build context path
  HELM_VALUES_FILE: values-backend.yaml  # Corresponding file in app-deployment

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and test
        run: echo "Replace with your build/test commands"

  push-image:
    needs: build-and-test
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push
        uses: docker/build-push-action@v6
        with:
          context: ${{ env.DOCKER_CONTEXT }}
          push: true
          tags: |
            ${{ env.REGISTRY }}/${{ github.repository_owner }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
            ${{ env.REGISTRY }}/${{ github.repository_owner }}/${{ env.IMAGE_NAME }}:latest

  # Cross-repo: update image tag in app-deployment so ArgoCD picks it up
  update-deployment:
    needs: push-image
    runs-on: ubuntu-latest
    steps:
      - name: Generate GitHub App Token
        id: app-token
        uses: actions/create-github-app-token@v1
        with:
          app-id: ${{ secrets.DEPLOY_APP_ID }}
          private-key: ${{ secrets.DEPLOY_APP_PRIVATE_KEY }}
          repositories: app-deployment
      - name: Checkout app-deployment repo
        uses: actions/checkout@v4
        with:
          repository: ${{ github.repository_owner }}/app-deployment
          token: ${{ steps.app-token.outputs.token }}
      - name: Update image tag
        run: |
          sed -i "s|tag:.*|tag: ${{ github.sha }}|" \
            "helm/app-service/${{ env.HELM_VALUES_FILE }}"
      - name: Commit and push updated tags
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add helm/app-service/values-*.yaml
          git diff --cached --quiet || git commit -m "chore: update image tags to ${{ github.sha }} [skip ci]"
          git push
```
git add .github/workflows/app-deploy.yml
git commit -m "feat: add GitHub Actions workflow for app build and deploy"
Task 13: Dev Environment Bootstrap and Validation
This task is executed manually after all code is committed. It provisions the Dev environment and validates end-to-end.
Step 1: Create the Terraform state backend
az group create --name rg-app-tfstate --location westeurope
az storage account create \
--name stappinfratfstate \
--resource-group rg-app-tfstate \
--location westeurope \
--sku Standard_LRS \
--min-tls-version TLS1_2
az storage container create \
--name tfstate \
--account-name stappinfratfstate
Step 2: Initialize and apply Terraform for Dev
cd terraform
terraform init -backend-config="key=app-dev.tfstate"
terraform plan \
-var-file="environments/dev.tfvars" \
-var="postgres_admin_password=REPLACE_WITH_SECURE_PASSWORD" \
-out=dev.tfplan
terraform apply dev.tfplan
Step 3: Store postgres admin password in Key Vault
az keyvault secret set \
--vault-name kv-app-dev \
--name postgres-admin-password \
--value "REPLACE_WITH_SECURE_PASSWORD"
Step 4: Verify Azure resources exist
az group show --name rg-app-dev --query "properties.provisioningState" -o tsv
# Expected: Succeeded
az aks show --resource-group rg-app-dev --name aks-app-dev --query "provisioningState" -o tsv
# Expected: Succeeded
az postgres flexible-server show --resource-group rg-app-dev --name psql-app-dev --query "state" -o tsv
# Expected: Ready

# Internal storage account
az storage account show --resource-group rg-app-dev --name stappdevint --query "provisioningState" -o tsv
# Expected: Succeeded

# SFTP storage account
az storage account show --resource-group rg-app-dev --name stappdevsftp --query "provisioningState" -o tsv
# Expected: Succeeded

# NAT Gateway
az network nat gateway show --resource-group rg-app-dev --name natgw-app-dev --query "provisioningState" -o tsv
# Expected: Succeeded
Step 5: Connect to AKS and apply K8s base config
az aks get-credentials --resource-group rg-app-dev --name aks-app-dev --overwrite-existing
kubectl apply -f k8s/namespaces.yaml
kubectl apply -f k8s/network-policies/
kubectl apply -f k8s/karpenter/
Step 6: Verify namespaces and network policies
kubectl get namespaces app jobs monitoring
# Expected: all Active
kubectl get networkpolicies -n app
# Expected: default-deny-all, allow-app-inter-service, allow-egress-postgres, allow-egress-blob
kubectl get networkpolicies -n jobs
# Expected: default-deny-all, allow-egress-blob, allow-jobs-internal (no postgres — jobs don't access DB)
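For reference, the default-deny-all policy this verification expects (applied from k8s/network-policies/ in Step 5) would look roughly like the following per namespace; the allow-* policies then reopen only the documented paths:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: jobs          # an equivalent policy exists in the app namespace
spec:
  podSelector: {}          # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```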