System Design Document
Last Updated: March 2025
Author: Barrett Little
This document outlines the architecture and implementation for deploying Lightning Network Daemon (LND) infrastructure on Google Cloud Platform (GCP). The system is designed to support global, 24/7 financial transactions with enterprise-grade security, high availability, and operational maturity. This infrastructure serves as a foundational platform for building Lightning Network applications and services.
- Introduction
- Architecture Principles
- Organizational Structure
- Network Architecture
- Lightning Core Infrastructure
- Data Persistence and Backup
- Security Model
- Observability and Operations
- Infrastructure as Code
- References
The Lightning Network (BOLT specifications) is a Layer 2 payment protocol operating on Bitcoin that enables instant, high-volume micropayments. Operating Lightning infrastructure requires:
- 24/7 Uptime: Nodes must remain online to monitor channels and respond to payments
- Low Latency: Sub-second response times for payment routing and forwarding
- Data Integrity: Channel state must be preserved with zero data loss
- Security: Private keys and channel data must be protected from unauthorized access
- Disaster Recovery: Channel backup files (Static Channel Backups) must be maintained
This infrastructure is architected to provide:
- High Availability: Multi-zone deployment with automated failover
- Security in Depth: Network isolation, encryption at rest and in transit, least-privilege IAM
- Operational Excellence: Comprehensive observability, automated deployments, documented runbooks
- Cost Efficiency: Right-sized resources with budget controls and monitoring
- Developer Experience: Clear abstractions and reusable modules for application teams
We maintain strict separation between development, non-production, and production environments:
- Development: Rapid iteration with relaxed controls for experimentation
- Non-Production: Staging environment mirroring production for integration testing
- Production: Customer-facing services with maximum security controls
Common resources are centralized to reduce operational overhead:
- DNS, logging, monitoring, and container registries are managed once
- Billing and security monitoring provide organization-wide visibility
- Secrets management provides a single source of truth for credentials
Following GCP's security foundations guide, we implement:
- VPC Service Controls for data exfiltration prevention
- Shared VPC architecture for centralized network security
- Organization policies enforcing security baselines
- Workload Identity for pod-to-GCP authentication without static credentials
All infrastructure is defined in Terraform and managed through GitOps workflows:
- Version-controlled infrastructure changes with PR reviews
- Automated drift detection and compliance checking
- Reproducible environments across all stages
foo.xyz (Organization)
│
├── fldr-bootstrap
│ └── Automation service accounts and IAM
│
├── fldr-common
│ ├── prj-common-dns-hub
│ ├── prj-common-logging
│ ├── prj-common-billing
│ ├── prj-common-secrets
│ ├── prj-common-scc (Security Command Center)
│ ├── prj-common-container-registry
│ └── prj-common-monitoring
│
├── fldr-production
│ ├── prj-prod-shared-base-network
│ ├── prj-prod-shared-restricted-network
│ ├── prj-prod-secrets
│ ├── prj-prod-monitoring
│ └── fldr-prod-applications
│ └── Lightning application projects
│
├── fldr-nonproduction
│ ├── prj-nonprod-shared-base-network
│ ├── prj-nonprod-shared-restricted-network
│ ├── prj-nonprod-secrets
│ ├── prj-nonprod-monitoring
│ └── fldr-nonprod-applications
│ └── Lightning application projects
│
└── fldr-development
├── prj-dev-shared-base-network
├── prj-dev-shared-restricted-network
├── prj-dev-secrets
├── prj-dev-monitoring
└── fldr-dev-applications
└── Lightning application projects
Contains automation resources for organization-wide infrastructure management:
- Terraform Cloud service accounts
- GitHub Actions OIDC federation
- CI/CD pipeline IAM roles
Provides shared services consumed by all environments:
- DNS Hub: Central Cloud DNS zone management
- Logging: Organization-level Cloud Logging sink aggregation
- Billing: BigQuery exports and budget alerts
- Secrets: Secret Manager for cross-environment secrets
- Security Command Center: Centralized security findings and compliance
- Container Registry: Artifact Registry for Docker images
- Monitoring: Cloud Monitoring dashboards and alerting policies
Each environment folder (fldr-production, fldr-nonproduction, fldr-development) follows an identical structure:
- Shared VPC Host Projects: Network infrastructure with base and restricted tiers
- Secrets Project: Environment-specific credentials and certificates
- Monitoring Project: Environment-scoped dashboards and alerts
- Applications Folder: Contains all application-specific projects
We implement Shared VPC to centralize network administration while allowing application teams autonomy:
Host Project: prj-{env}-shared-base-network
├── vpc-{env}-shared-base
│ ├── subnet-{env}-gke-nodes (10.0.0.0/20)
│ ├── subnet-{env}-gke-pods (10.1.0.0/16)
│ └── subnet-{env}-gke-services (10.2.0.0/20)
Host Project: prj-{env}-shared-restricted-network
├── vpc-{env}-shared-restricted
│ ├── subnet-{env}-databases (10.10.0.0/24)
│ └── subnet-{env}-management (10.10.1.0/24)
Base Network: For workloads requiring internet egress and moderate security controls
- GKE clusters running stateless services
- Application APIs and web services
- Cloud NAT for controlled egress
Restricted Network: For sensitive workloads with strict access controls
- Cloud SQL databases containing channel state
- Secrets management services
- VPC Service Controls perimeter enforcement
- Cloud NAT: Provides outbound internet connectivity without exposing nodes
- Private Google Access: Enables private access to GCP APIs
- VPC Peering: Connects to partner networks when required
- Cloud VPN/Interconnect: For hybrid connectivity (future)
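A minimal Terraform sketch of the Cloud NAT setup follows, assuming a regional Cloud Router in the base VPC; the region and resource names are placeholders.

```hcl
# Sketch: Cloud Router + Cloud NAT for controlled egress from private nodes
resource "google_compute_router" "nat_router" {
  project = "prj-prod-shared-base-network"
  name    = "cr-prod-shared-base-nat"
  region  = "us-central1"
  network = google_compute_network.vpc_base.id
}

resource "google_compute_router_nat" "nat" {
  project                            = "prj-prod-shared-base-network"
  name                               = "nat-prod-shared-base"
  router                             = google_compute_router.nat_router.name
  region                             = "us-central1"
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"

  log_config {
    enable = true
    filter = "ERRORS_ONLY"
  }
}
```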
fldr-{env}-applications
└── prj-{env}-lightning-core
├── gke-{env}-lightning-cluster
│ ├── bitcoin-core (StatefulSet)
│ └── lnd-nodes (StatefulSet)
├── sql-{env}-lnd-database (Cloud SQL PostgreSQL)
└── bkt-{env}-lightning-backups (Cloud Storage)
Bitcoin Core runs as a StatefulSet to provide blockchain data to LND:
Configuration:
- Pruned mode disabled (full node required for Lightning)
- `txindex=1` enabled for transaction lookups
- ZMQ enabled for real-time block and transaction notifications
- RPC authentication using cookie file stored in Secret Manager
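These settings translate into a small bitcoin.conf. The sketch below is illustrative: the ZMQ ports and allowed RPC subnet are assumptions, and credentials are sourced from Secret Manager at deploy time rather than committed.

```ini
# Sketch: bitcoin.conf matching the configuration above (values illustrative)
server=1
txindex=1
prune=0                          # full node; pruning disabled for Lightning
zmqpubrawblock=tcp://0.0.0.0:28332
zmqpubrawtx=tcp://0.0.0.0:28333
rpcbind=0.0.0.0
rpcallowip=10.0.0.0/20           # restrict RPC to the GKE node subnet
# RPC cookie generated at startup is synced to Secret Manager per the design above
```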
Storage:
- Persistent disk: 1TB SSD for blockchain data (~600GB as of 2025)
- Disk snapshots scheduled daily for disaster recovery
- Local SSD for chainstate database (performance-critical)
Resources:
- CPU: 4 vCPUs
- Memory: 8GB RAM
- Disk: 1TB persistent SSD + 375GB local SSD
LND (Lightning Network Daemon) runs as a StatefulSet with the following architecture:
Database Backend:
- Uses PostgreSQL backend instead of default bbolt
- Cloud SQL for PostgreSQL with high availability configuration
- Private IP only, accessed via Cloud SQL Proxy
Wallet Security:
- Multi-signature wallet configuration using LND's native multisig support
- 2-of-3 signing threshold with keys distributed across Cloud HSM, offline hardware wallet, and hot key in Secret Manager
- Hot key used for automated channel operations; cold keys required for on-chain fund movements
Configuration:
- Watchtower client enabled with multiple independent watchtower services for channel monitoring redundancy
- Anchor outputs commitment type for fee management
- Neutrino disabled (using Bitcoin Core for chain backend)
- TLS certificate stored in Secret Manager
- Macaroon authentication with separate admin/readonly credentials
- RPC rate limiting: 100 requests/second per client to prevent resource exhaustion
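A hedged lnd.conf excerpt covering the main settings listed above is shown below. The hostnames, DSN, and paths are placeholders, secrets are injected at runtime, and exact option names vary slightly across LND versions.

```ini
# Sketch: lnd.conf excerpt (values illustrative; secrets injected at runtime)
[Application Options]
tlscertpath=/secrets/tls/tls.cert
tlskeypath=/secrets/tls/tls.key
listen=0.0.0.0:9735

[Bitcoin]
bitcoin.mainnet=true
bitcoin.node=bitcoind

[Bitcoind]
bitcoind.rpchost=bitcoin-core.lightning.svc.cluster.local:8332
bitcoind.zmqpubrawblock=tcp://bitcoin-core.lightning.svc.cluster.local:28332
bitcoind.zmqpubrawtx=tcp://bitcoin-core.lightning.svc.cluster.local:28333

[db]
db.backend=postgres
# DSN points at the Cloud SQL Proxy sidecar; password supplied via Secret Manager
db.postgres.dsn=postgres://lnd:<password>@127.0.0.1:5432/lnd

[wtclient]
wtclient.active=true
```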
Storage:
- Channel state: Cloud SQL PostgreSQL
- Static channel backups (SCB): Cloud Storage with versioning
- TLS certificates and macaroons: Secret Manager
- Channel graph cache: Persistent disk (10GB)
Resources:
- CPU: 4 vCPUs (bursts to 8)
- Memory: 16GB RAM
- Network: Premium tier for low latency
High Availability:
- Pod anti-affinity rules ensure distribution across zones
- Readiness and liveness probes for automated recovery
- PodDisruptionBudget to maintain minimum availability during maintenance
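A sketch of the PodDisruptionBudget and zone anti-affinity follows, assuming the LND pods carry an app: lnd label in a lightning namespace (both names are illustrative).

```yaml
# Sketch: keep at least one LND pod available and spread replicas across zones
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: lnd-pdb
  namespace: lightning
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: lnd
---
# Excerpt from the LND StatefulSet pod template
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: lnd
        topologyKey: topology.kubernetes.io/zone
```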
GKE Cluster Specifications:
- Release channel: Regular (for balance of stability and features)
- Node pools: Separate pools for Bitcoin Core and LND for resource isolation
- Workload Identity enabled for pod-to-GCP authentication
- Binary Authorization for container image verification
- Network policy enforcement via Calico
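A Terraform sketch of the cluster settings called out above (release channel, Workload Identity, Binary Authorization, Calico network policy); the project, location, and network references are placeholders.

```hcl
# Sketch: GKE cluster with the hardening features listed above
resource "google_container_cluster" "lightning" {
  project  = "prj-prod-lightning-core"
  name     = "gke-prod-lightning-cluster"
  location = "us-central1"

  network    = "projects/prj-prod-shared-base-network/global/networks/vpc-prod-shared-base"
  subnetwork = "projects/prj-prod-shared-base-network/regions/us-central1/subnetworks/subnet-prod-gke-nodes"

  release_channel {
    channel = "REGULAR"
  }

  workload_identity_config {
    workload_pool = "prj-prod-lightning-core.svc.id.goog"
  }

  binary_authorization {
    evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
  }

  network_policy {
    enabled  = true
    provider = "CALICO"
  }

  # Node pools are managed separately (see the pool specs below)
  remove_default_node_pool = true
  initial_node_count       = 1
}
```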
Node Pool: bitcoin-core
machine_type: n2-standard-4
disk_size_gb: 50
disk_type: pd-ssd
min_nodes: 1
max_nodes: 1 # StatefulSet with single replica
Node Pool: lnd-nodes
machine_type: n2-standard-4
disk_size_gb: 50
disk_type: pd-ssd
min_nodes: 2
max_nodes: 4
autoscaling: enabled
Cloud SQL Configuration:
- Version: PostgreSQL 14
- Tier: db-custom-4-16384 (4 vCPU, 16GB RAM)
- High availability: Regional configuration with automatic failover
- Backup: Automated daily backups with 7-day retention
- Point-in-time recovery: Enabled with 7-day recovery window
- Encryption: Customer-managed encryption keys (CMEK) in Cloud KMS
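A hedged Terraform sketch of the instance is shown below, assuming a KMS key and Service Networking connection created elsewhere; names and region are illustrative.

```hcl
# Sketch: regional (HA) Cloud SQL instance with backups, PITR, and CMEK
resource "google_sql_database_instance" "lnd" {
  project             = "prj-prod-lightning-core"
  name                = "sql-prod-lnd-database"
  region              = "us-central1"
  database_version    = "POSTGRES_14"
  encryption_key_name = google_kms_crypto_key.sql.id # CMEK, defined elsewhere
  deletion_protection = true

  settings {
    tier              = "db-custom-4-16384"
    availability_type = "REGIONAL" # automatic failover to standby

    ip_configuration {
      ipv4_enabled    = false # private IP only
      # Requires an existing Service Networking connection on this VPC
      private_network = "projects/prj-prod-shared-restricted-network/global/networks/vpc-prod-shared-restricted"
    }

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      transaction_log_retention_days = 7
      backup_retention_settings {
        retained_backups = 7
      }
    }
  }
}
```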
Database Schema:
- LND manages schema via internal migrations
- Read replicas for analytics workloads (future)
Static Channel Backups (SCB) are critical for channel recovery:
Backup Strategy:
- Automated SCB file upload to Cloud Storage on every channel state change
- Object versioning enabled for historical recovery
- Cross-region replication to secondary region
- Cross-cloud replication to AWS S3 via Storage Transfer Service for defense against GCP-specific incidents
- Lifecycle policy: Retain 30 days of versions
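A Terraform sketch of the backup bucket with versioning and the 30-day noncurrent-version lifecycle; the dual-region location is illustrative, and the S3 replication is configured separately via Storage Transfer Service.

```hcl
# Sketch: versioned backup bucket; old object versions pruned after 30 days
resource "google_storage_bucket" "lightning_backups" {
  project  = "prj-prod-lightning-core"
  name     = "bkt-prod-lightning-backups"
  location = "NAM4" # dual-region (illustrative) for cross-region durability

  uniform_bucket_level_access = true

  versioning {
    enabled = true
  }

  lifecycle_rule {
    condition {
      days_since_noncurrent_time = 30
      with_state                 = "ARCHIVED" # applies to noncurrent versions only
    }
    action {
      type = "Delete"
    }
  }
}
```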
Recovery Procedure:
- Latest SCB file retrieved from Cloud Storage
- LND `--reset-wallet-transactions` flag used to rescan the blockchain
- Channels force-closed to recover funds on-chain
Cloud Storage Bucket Configuration:
bkt-{env}-lightning-backups/
├── scb/
│ └── channel.backup (versioned)
├── macaroons/
│ ├── admin.macaroon
│ └── readonly.macaroon
├── tls/
│ ├── tls.cert
│ └── tls.key
└── config/
└── lnd.conf
Backup Validation:
- Daily automated restore tests in non-production environment
- Alerting on backup failures via Cloud Monitoring
Following the principle of least privilege:
Service Accounts:
- `sa-lnd-node@prj-{env}-lightning-core.iam.gserviceaccount.com`
  - Permissions: Cloud SQL Client, Secret Manager Accessor, Storage Object Creator
- `sa-bitcoin-core@prj-{env}-lightning-core.iam.gserviceaccount.com`
  - Permissions: Storage Object Viewer (for snapshots)
Workload Identity Binding:
Kubernetes ServiceAccount: lnd-node
GCP ServiceAccount: sa-lnd-node@...
Binding: iam.gke.io/gcp-service-account=sa-lnd-node@...
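A sketch of that binding in Terraform, assuming the lnd-node ServiceAccount lives in a lightning namespace (the namespace name is illustrative); the Kubernetes ServiceAccount is then annotated with iam.gke.io/gcp-service-account as shown above.

```hcl
# Sketch: allow the lnd-node Kubernetes ServiceAccount to impersonate the GCP SA
resource "google_service_account" "lnd_node" {
  project      = "prj-prod-lightning-core"
  account_id   = "sa-lnd-node"
  display_name = "LND node workload identity"
}

resource "google_service_account_iam_member" "lnd_workload_identity" {
  service_account_id = google_service_account.lnd_node.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:prj-prod-lightning-core.svc.id.goog[lightning/lnd-node]"
}
```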
All sensitive data stored in Secret Manager:
- LND wallet password (auto-unlocking via startup script)
- Bitcoin Core RPC credentials
- Database connection strings
- TLS certificates and private keys
- Macaroon files
- Multi-sig hot key (cold keys stored offline in hardware wallets)
Access Controls:
- Secrets accessed via Workload Identity (no static credentials)
- Audit logging enabled for all secret access
- Automatic rotation for database credentials (90-day cycle)
- Cloud HSM integration for cryptographic operations on multi-sig keys
Firewall Rules:
- Default deny ingress on all VPCs
- Explicit allow rules for:
- Bitcoin P2P (port 8333) from Cloud NAT IPs
- LND P2P (port 9735) from Cloud NAT IPs
- Internal service-to-service communication
- Egress restricted via VPC Service Controls in production
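A hedged Terraform sketch of the LND P2P allow rule layered over a low-priority deny-all ingress rule; the target tag and source ranges are placeholders to be replaced with the expected peer ranges.

```hcl
# Sketch: default-deny ingress plus an explicit allow for LND peer traffic
resource "google_compute_firewall" "deny_all_ingress" {
  project   = "prj-prod-shared-base-network"
  name      = "fw-prod-deny-all-ingress"
  network   = "vpc-prod-shared-base"
  direction = "INGRESS"
  priority  = 65534 # evaluated after more specific allow rules

  deny {
    protocol = "all"
  }
  source_ranges = ["0.0.0.0/0"]
}

resource "google_compute_firewall" "allow_lnd_p2p" {
  project   = "prj-prod-shared-base-network"
  name      = "fw-prod-allow-lnd-p2p"
  network   = "vpc-prod-shared-base"
  direction = "INGRESS"
  priority  = 1000

  allow {
    protocol = "tcp"
    ports    = ["9735"]
  }
  target_tags   = ["lnd-node"]       # illustrative tag on the LND node pool
  source_ranges = ["203.0.113.0/24"] # placeholder: replace with permitted peer ranges
}
```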
TLS Everywhere:
- LND gRPC/REST API requires TLS 1.3
- Cloud SQL connections via TLS
- Internal service mesh with mTLS (future: Istio)
- Cloud Audit Logs for all admin and data access
- Security Command Center for security findings
- Organization policy constraints enforcing:
- VM external IP restrictions
- Uniform bucket-level access on Cloud Storage
- Domain-restricted sharing
- OS Login requirement on VMs
GCP Native Tools:
- Cloud Monitoring for metrics and alerting
- Cloud Logging for structured log aggregation
- Cloud Trace for distributed request tracing
Datadog Integration:
- Agent deployed via DaemonSet for enhanced observability
- Custom dashboards for Lightning-specific metrics
- Anomaly detection for channel balance changes
Infrastructure:
- GKE node CPU/memory utilization
- Pod restart count and crash loop detection
- Persistent disk IOPS and throughput
Bitcoin Core:
- Block height lag (should be 0)
- Peer connection count
- Mempool transaction count
- ZMQ notification latency
LND:
- Channel count (active, pending, inactive)
- Total channel capacity (local and remote balance)
- Payment success/failure rate
- Forwarding fee revenue
- HTLC processing latency
- Database query performance
Critical alerts configured with PagerDuty escalation:
- Bitcoin Core sync lag > 2 blocks
- LND node offline > 5 minutes
- Channel force closure detected
- Database connection failures
- Backup upload failures
- Pod crash loop
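As one example, a hedged Terraform sketch of the "LND node offline" alert using a metric-absence condition; the metric type is a placeholder for whatever liveness metric the monitoring integration actually exports, and the PagerDuty notification channel is assumed to be defined elsewhere.

```hcl
# Sketch: alert when no LND heartbeat metric is written for 5 minutes
resource "google_monitoring_alert_policy" "lnd_offline" {
  project      = "prj-prod-monitoring"
  display_name = "LND node offline > 5 minutes"
  combiner     = "OR"

  conditions {
    display_name = "No LND heartbeat"
    condition_absent {
      # Placeholder metric name; replace with the exported liveness metric
      filter   = "metric.type=\"custom.googleapis.com/lnd/heartbeat\" AND resource.type=\"k8s_container\""
      duration = "300s"
      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_COUNT"
      }
    }
  }

  # PagerDuty notification channel defined elsewhere
  notification_channels = [google_monitoring_notification_channel.pagerduty.id]
}
```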
Structured Logging:
- JSON format with standardized fields
- Correlation IDs for request tracing
- Log levels: ERROR, WARN, INFO, DEBUG
Log Retention:
- 30 days in Cloud Logging
- 1 year in BigQuery for analysis
- Audit logs retained for 7 years
Cross-Cloud Log Export:
- Real-time log streaming to AWS S3 via Pub/Sub push subscription and Lambda for tamper-evident audit trail
- Protects forensic evidence in the event of GCP account compromise
- S3 bucket configured with Object Lock (compliance mode) and MFA delete
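The GCP side of that pipeline can be sketched in Terraform as an aggregated log sink publishing to a Pub/Sub topic, whose push subscription targets the separately managed Lambda endpoint; names and the filter below are illustrative.

```hcl
# Sketch: route audit logs to Pub/Sub for export to S3 (Lambda endpoint managed in AWS)
resource "google_pubsub_topic" "audit_export" {
  project = "prj-common-logging"
  name    = "audit-log-export"
}

resource "google_logging_project_sink" "audit_to_pubsub" {
  project     = "prj-prod-lightning-core"
  name        = "sink-audit-to-pubsub"
  destination = "pubsub.googleapis.com/${google_pubsub_topic.audit_export.id}"
  filter      = "logName:\"cloudaudit.googleapis.com\""

  unique_writer_identity = true
}

# Grant the sink's writer identity permission to publish to the topic
resource "google_pubsub_topic_iam_member" "sink_publisher" {
  project = "prj-common-logging"
  topic   = google_pubsub_topic.audit_export.name
  role    = "roles/pubsub.publisher"
  member  = google_logging_project_sink.audit_to_pubsub.writer_identity
}
```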
Repository Structure:
terraform/
├── modules/
│ ├── folder/
│ ├── project/
│ ├── shared-vpc/
│ ├── gke-cluster/
│ └── lightning-core/
├── environments/
│ ├── development/
│ ├── nonproduction/
│ └── production/
└── bootstrap/
State Management:
- Terraform Cloud for remote state
- State locking to prevent concurrent modifications
- Workspace-per-environment pattern
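The corresponding backend configuration is a small cloud block per environment; the organization and workspace names below are placeholders.

```hcl
# Sketch: environments/production/backend.tf (organization/workspace names illustrative)
terraform {
  cloud {
    organization = "foo-xyz"

    workspaces {
      name = "gcp-lightning-production"
    }
  }

  required_version = ">= 1.5.0"
}
```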
Example Module: Lightning Core Project
# modules/lightning-core/main.tf
variable "org_id" {
description = "GCP Organization ID"
type = string
}
variable "environment_code" {
description = "Environment code (d/n/p)"
type = string
validation {
condition = contains(["d", "n", "p"], var.environment_code)
error_message = "Must be d (dev), n (nonprod), or p (prod)"
}
}
variable "folder_id" {
description = "Parent folder ID"
type = string
}
locals {
env_map = {
d = "dev"
n = "nonprod"
p = "prod"
}
environment = local.env_map[var.environment_code]
project_id = "prj-${local.environment}-lightning-core"
}
resource "google_project" "lightning_core" {
name = local.project_id
project_id = "${local.project_id}-${random_id.suffix.hex}"
folder_id = var.folder_id
labels = {
environment = local.environment
application = "lightning"
managed_by = "terraform"
}
}
resource "random_id" "suffix" {
byte_length = 4
}
# Enable required APIs
resource "google_project_service" "services" {
for_each = toset([
"container.googleapis.com",
"sqladmin.googleapis.com",
"secretmanager.googleapis.com",
"monitoring.googleapis.com",
"logging.googleapis.com",
])
project = google_project.lightning_core.project_id
service = each.key
disable_on_destroy = false
}
GitHub Actions Workflow:
name: Terraform Deploy
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
terraform:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Authenticate to GCP
uses: google-github-actions/auth@v1
with:
workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
service_account: ${{ secrets.SA_EMAIL }}
- name: Setup Terraform
uses: hashicorp/setup-terraform@v2
with:
terraform_version: 1.5.0
cli_config_credentials_token: ${{ secrets.TF_CLOUD_TOKEN }}
- name: Terraform Init
run: terraform init
- name: Terraform Plan
run: terraform plan -out=tfplan
- name: Terraform Apply
if: github.ref == 'refs/heads/main'
run: terraform apply tfplan
Following GCP naming best practices:
| Resource Type | Pattern | Example |
|---|---|---|
| Folder | `fldr-{purpose}` | `fldr-production` |
| Project | `prj-{env}-{purpose}` | `prj-prod-lightning-core` |
| VPC | `vpc-{env}-{tier}` | `vpc-prod-shared-base` |
| Subnet | `subnet-{env}-{purpose}` | `subnet-prod-gke-nodes` |
| GKE Cluster | `gke-{env}-{purpose}` | `gke-prod-lightning-cluster` |
| Cloud SQL | `sql-{env}-{purpose}` | `sql-prod-lnd-database` |
| Storage Bucket | `bkt-{env}-{purpose}` | `bkt-prod-lightning-backups` |
Environment codes: dev, nonprod, prod
- GCP Security Foundations Guide
- Shared VPC Best Practices
- GKE Security Hardening
- Cloud SQL High Availability
- Secret Manager Documentation
- Cloud HSM Documentation