Lightning Network Infrastructure on Google Cloud Platform

System Design Document
Last Updated: March 2025
Author: Barrett Little

Executive Summary

This document outlines the architecture and implementation for deploying Lightning Network Daemon (LND) infrastructure on Google Cloud Platform (GCP). The system is designed to support global, 24/7 financial transactions with enterprise-grade security, high availability, and operational maturity. This infrastructure serves as a foundational platform for building Lightning Network applications and services.

Table of Contents

  1. Introduction
  2. Architecture Principles
  3. Organizational Structure
  4. Network Architecture
  5. Lightning Core Infrastructure
  6. Data Persistence and Backup
  7. Security Model
  8. Observability and Operations
  9. Infrastructure as Code
  10. References

Introduction

Lightning Network Requirements

The Lightning Network (BOLT specifications) is a Layer 2 payment protocol operating on Bitcoin that enables instant, high-volume micropayments. Operating Lightning infrastructure requires:

  • 24/7 Uptime: Nodes must remain online to monitor channels and respond to payments
  • Low Latency: Sub-second response times for payment routing and forwarding
  • Data Integrity: Channel state must be preserved with zero data loss
  • Security: Private keys and channel data must be protected from unauthorized access
  • Disaster Recovery: Channel backup files (Static Channel Backups) must be maintained

Design Goals

This infrastructure is architected to provide:

  1. High Availability: Multi-zone deployment with automated failover
  2. Security in Depth: Network isolation, encryption at rest and in transit, least-privilege IAM
  3. Operational Excellence: Comprehensive observability, automated deployments, documented runbooks
  4. Cost Efficiency: Right-sized resources with budget controls and monitoring
  5. Developer Experience: Clear abstractions and reusable modules for application teams

Architecture Principles

Multi-Environment Isolation

We maintain strict separation between development, non-production, and production environments:

  • Development: Rapid iteration with relaxed controls for experimentation
  • Non-Production: Staging environment mirroring production for integration testing
  • Production: Customer-facing services with maximum security controls

Shared Services Model

Common resources are centralized to reduce operational overhead:

  • DNS, logging, monitoring, and container registries are managed once
  • Billing and security monitoring provide organization-wide visibility
  • Secrets management provides a single source of truth for credentials

Security-First Design

Following GCP's security foundations guide, we implement:

  • VPC Service Controls for data exfiltration prevention
  • Shared VPC architecture for centralized network security
  • Organization policies enforcing security baselines
  • Workload Identity for pod-to-GCP authentication without static credentials

Infrastructure as Code

All infrastructure is defined in Terraform and managed through GitOps workflows:

  • Version-controlled infrastructure changes with PR reviews
  • Automated drift detection and compliance checking
  • Reproducible environments across all stages

Organizational Structure

Folder Hierarchy

foo.xyz (Organization)
│
├── fldr-bootstrap
│   └── Automation service accounts and IAM
│
├── fldr-common
│   ├── prj-common-dns-hub
│   ├── prj-common-logging
│   ├── prj-common-billing
│   ├── prj-common-secrets
│   ├── prj-common-scc (Security Command Center)
│   ├── prj-common-container-registry
│   └── prj-common-monitoring
│
├── fldr-production
│   ├── prj-prod-shared-base-network
│   ├── prj-prod-shared-restricted-network
│   ├── prj-prod-secrets
│   ├── prj-prod-monitoring
│   └── fldr-prod-applications
│       └── Lightning application projects
│
├── fldr-nonproduction
│   ├── prj-nonprod-shared-base-network
│   ├── prj-nonprod-shared-restricted-network
│   ├── prj-nonprod-secrets
│   ├── prj-nonprod-monitoring
│   └── fldr-nonprod-applications
│       └── Lightning application projects
│
└── fldr-development
    ├── prj-dev-shared-base-network
    ├── prj-dev-shared-restricted-network
    ├── prj-dev-secrets
    ├── prj-dev-monitoring
    └── fldr-dev-applications
        └── Lightning application projects

Folder Responsibilities

fldr-bootstrap

Contains automation resources for organization-wide infrastructure management:

  • Terraform Cloud service accounts
  • GitHub Actions OIDC federation
  • CI/CD pipeline IAM roles

fldr-common

Provides shared services consumed by all environments:

  • DNS Hub: Central Cloud DNS zone management
  • Logging: Organization-level Cloud Logging sink aggregation
  • Billing: BigQuery exports and budget alerts
  • Secrets: Secret Manager for cross-environment secrets
  • Security Command Center: Centralized security findings and compliance
  • Container Registry: Artifact Registry for Docker images
  • Monitoring: Cloud Monitoring dashboards and alerting policies

Environment Folders

Each environment folder (fldr-production, fldr-nonproduction, fldr-development) follows an identical structure:

  • Shared VPC Host Projects: Network infrastructure with base and restricted tiers
  • Secrets Project: Environment-specific credentials and certificates
  • Monitoring Project: Environment-scoped dashboards and alerts
  • Applications Folder: Contains all application-specific projects

Network Architecture

Shared VPC Design

We implement Shared VPC to centralize network administration while allowing application teams autonomy:

Host Project: prj-{env}-shared-base-network
├── vpc-{env}-shared-base
│   ├── subnet-{env}-gke-nodes (10.0.0.0/20)
│   ├── subnet-{env}-gke-pods (10.1.0.0/16)
│   └── subnet-{env}-gke-services (10.2.0.0/20)

Host Project: prj-{env}-shared-restricted-network
├── vpc-{env}-shared-restricted
│   ├── subnet-{env}-databases (10.10.0.0/24)
│   └── subnet-{env}-management (10.10.1.0/24)
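
The subnet layout above maps directly onto Terraform. A minimal sketch for the production base network, assuming a us-central1 region and illustrative resource addresses (not the actual module from this repository):

# Shared VPC host project and the GKE node subnet with its secondary ranges
resource "google_compute_shared_vpc_host_project" "base" {
  project = "prj-prod-shared-base-network"
}

resource "google_compute_subnetwork" "gke_nodes" {
  name          = "subnet-prod-gke-nodes"
  project       = google_compute_shared_vpc_host_project.base.project
  region        = "us-central1"          # assumed region
  network       = "vpc-prod-shared-base"
  ip_cidr_range = "10.0.0.0/20"

  # Secondary ranges consumed by GKE for pod and service IPs
  secondary_ip_range {
    range_name    = "subnet-prod-gke-pods"
    ip_cidr_range = "10.1.0.0/16"
  }
  secondary_ip_range {
    range_name    = "subnet-prod-gke-services"
    ip_cidr_range = "10.2.0.0/20"
  }

  # Private Google Access, per the Connectivity section below
  private_ip_google_access = true
}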

Network Tiers

Base Network: For workloads requiring internet egress and moderate security controls

  • GKE clusters running stateless services
  • Application APIs and web services
  • Cloud NAT for controlled egress

Restricted Network: For sensitive workloads with strict access controls

  • Cloud SQL databases containing channel state
  • Secrets management services
  • VPC Service Controls perimeter enforcement

Connectivity

  • Cloud NAT: Provides outbound internet connectivity without exposing nodes
  • Private Google Access: Enables private access to GCP APIs
  • VPC Peering: Connects to partner networks when required
  • Cloud VPN/Interconnect: For hybrid connectivity (future)
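
The Cloud NAT piece, as a hedged Terraform sketch (router and NAT names, region, and logging choices are assumptions):

resource "google_compute_router" "base" {
  name    = "cr-prod-shared-base"
  project = "prj-prod-shared-base-network"
  region  = "us-central1"
  network = "vpc-prod-shared-base"
}

resource "google_compute_router_nat" "base" {
  name                               = "nat-prod-shared-base"
  project                            = "prj-prod-shared-base-network"
  region                             = "us-central1"
  router                             = google_compute_router.base.name
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"

  # Log NAT errors for egress troubleshooting
  log_config {
    enable = true
    filter = "ERRORS_ONLY"
  }
}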

Lightning Core Infrastructure

Application Project Structure

fldr-{env}-applications
└── prj-{env}-lightning-core
    ├── gke-{env}-lightning-cluster
    │   ├── bitcoin-core (StatefulSet)
    │   └── lnd-nodes (StatefulSet)
    ├── sql-{env}-lnd-database (Cloud SQL PostgreSQL)
    └── bkt-{env}-lightning-backups (Cloud Storage)

Bitcoin Core Deployment

Bitcoin Core runs as a StatefulSet to provide blockchain data to LND:

Configuration:

  • Pruned mode disabled (full node required for Lightning)
  • txindex=1 enabled for transaction lookups
  • ZMQ enabled for real-time block and transaction notifications
  • RPC authentication using cookie file stored in Secret Manager
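
These settings correspond roughly to the bitcoin.conf sketch below; ports and the cookie path are illustrative, not taken from the actual deployment:

# bitcoin.conf (sketch)
server=1
prune=0        # pruning disabled; LND needs a full node here
txindex=1      # transaction index for lookups

# ZMQ notifications consumed by LND
zmqpubrawblock=tcp://0.0.0.0:28332
zmqpubrawtx=tcp://0.0.0.0:28333

# RPC auth via cookie file (surfaced to LND through Secret Manager)
rpccookiefile=/data/.cookie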

Storage:

  • Persistent disk: 1TB SSD for blockchain data (~600GB as of 2025)
  • Disk snapshots scheduled daily for disaster recovery
  • Local SSD for chainstate database (performance-critical)

Resources:

  • CPU: 4 vCPUs
  • Memory: 8GB RAM
  • Disk: 1TB persistent SSD + 375GB local SSD

LND Deployment

LND (Lightning Network Daemon) runs as a StatefulSet with the following architecture:

Database Backend:

  • Channel state and wallet data stored in Cloud SQL PostgreSQL (db.backend=postgres) rather than LND's default embedded bbolt database, enabling managed backups and regional failover (details under Data Persistence and Backup)

Wallet Security:

  • Multi-signature wallet configuration using LND's native multisig support
  • 2-of-3 signing threshold with keys distributed across Cloud HSM, offline hardware wallet, and hot key in Secret Manager
  • Hot key used for automated channel operations; cold keys required for on-chain fund movements

Configuration:

  • Watchtower client enabled with multiple independent watchtower services for channel monitoring redundancy
  • Anchor outputs commitment type for fee management
  • Neutrino disabled (using Bitcoin Core for chain backend)
  • TLS certificate stored in Secret Manager
  • Macaroon authentication with separate admin/readonly credentials
  • RPC rate limiting: 100 requests/second per client to prevent resource exhaustion
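
A sketch of how these choices might appear in lnd.conf, using option names from LND's sample configuration; hosts, the Postgres DSN, and file paths are illustrative:

; lnd.conf (sketch)
[Application Options]
tlscertpath=/secrets/tls.cert
adminmacaroonpath=/secrets/admin.macaroon

[Bitcoin]
bitcoin.mainnet=true
; Neutrino disabled: use the Bitcoin Core StatefulSet as chain backend
bitcoin.node=bitcoind

[Bitcoind]
; RPC credentials are injected from Secret Manager at startup
bitcoind.rpchost=bitcoin-core:8332
bitcoind.zmqpubrawblock=tcp://bitcoin-core:28332
bitcoind.zmqpubrawtx=tcp://bitcoin-core:28333

[db]
db.backend=postgres

[postgres]
db.postgres.dsn=postgres://lnd@127.0.0.1:5432/lnd?sslmode=require

[wtclient]
; watchtower client; tower URIs added via lncli wtclient add
wtclient.active=true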

Storage:

  • Channel state: Cloud SQL PostgreSQL
  • Static channel backups (SCB): Cloud Storage with versioning
  • TLS certificates and macaroons: Secret Manager
  • Channel graph cache: Persistent disk (10GB)

Resources:

  • CPU: 4 vCPUs (bursts to 8)
  • Memory: 16GB RAM
  • Network: Premium tier for low latency

High Availability:

  • Pod anti-affinity rules ensure distribution across zones
  • Readiness and liveness probes for automated recovery
  • PodDisruptionBudget to maintain minimum availability during maintenance
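
A minimal sketch of these controls as Kubernetes manifests (labels, names, and the one-pod floor are illustrative):

# PodDisruptionBudget: keep at least one lnd pod through voluntary disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: lnd-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: lnd
---
# In the lnd StatefulSet pod template: spread replicas across zones
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: lnd
        topologyKey: topology.kubernetes.io/zone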

Kubernetes Configuration

GKE Cluster Specifications:

  • Release channel: Regular (for a balance of stability and features)
  • Node pools: Separate pools for Bitcoin Core and LND for resource isolation
  • Workload Identity enabled for pod-to-GCP authentication
  • Binary Authorization for container image verification
  • Network policy enforcement via Calico
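
A condensed Terraform sketch of a cluster matching these specifications; the project, region, and Shared VPC paths are assumptions, and the node pools listed next are managed as separate resources:

resource "google_container_cluster" "lightning" {
  name     = "gke-prod-lightning-cluster"
  project  = "prj-prod-lightning-core"
  location = "us-central1"             # regional cluster for multi-zone nodes

  # Node pools defined separately; drop the default pool
  remove_default_node_pool = true
  initial_node_count       = 1

  release_channel {
    channel = "REGULAR"
  }

  # Shared VPC subnet and the GKE secondary ranges from Network Architecture
  network    = "projects/prj-prod-shared-base-network/global/networks/vpc-prod-shared-base"
  subnetwork = "projects/prj-prod-shared-base-network/regions/us-central1/subnetworks/subnet-prod-gke-nodes"
  ip_allocation_policy {
    cluster_secondary_range_name  = "subnet-prod-gke-pods"
    services_secondary_range_name = "subnet-prod-gke-services"
  }

  workload_identity_config {
    workload_pool = "prj-prod-lightning-core.svc.id.goog"
  }

  binary_authorization {
    evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
  }

  network_policy {
    enabled  = true
    provider = "CALICO"
  }
}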

Node Pool: bitcoin-core

machine_type: n2-standard-4
disk_size_gb: 50
disk_type: pd-ssd
min_nodes: 1
max_nodes: 1  # StatefulSet with single replica

Node Pool: lnd-nodes

machine_type: n2-standard-4
disk_size_gb: 50
disk_type: pd-ssd
min_nodes: 2
max_nodes: 4
autoscaling: enabled

Data Persistence and Backup

PostgreSQL Database

Cloud SQL Configuration:

  • Version: PostgreSQL 14
  • Tier: db-custom-4-16384 (4 vCPU, 16GB RAM)
  • High availability: Regional configuration with automatic failover
  • Backup: Automated daily backups with 7-day retention
  • Point-in-time recovery: Enabled with 7-day recovery window
  • Encryption: Customer-managed encryption keys (CMEK) in Cloud KMS
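
As a Terraform sketch (region, network path, and the KMS key name are illustrative):

resource "google_sql_database_instance" "lnd" {
  name             = "sql-prod-lnd-database"
  project          = "prj-prod-lightning-core"
  region           = "us-central1"        # assumed region
  database_version = "POSTGRES_14"

  # CMEK; the key resource name is a placeholder
  encryption_key_name = "projects/prj-prod-lightning-core/locations/us-central1/keyRings/kr-lightning/cryptoKeys/cloudsql"

  settings {
    tier              = "db-custom-4-16384"   # 4 vCPU, 16GB RAM
    availability_type = "REGIONAL"             # HA with automatic failover

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      transaction_log_retention_days = 7
      backup_retention_settings {
        retained_backups = 7
      }
    }

    ip_configuration {
      ipv4_enabled    = false    # private IP only, restricted network
      private_network = "projects/prj-prod-shared-restricted-network/global/networks/vpc-prod-shared-restricted"
    }
  }
}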

Database Schema:

  • LND manages schema via internal migrations
  • Read replicas for analytics workloads (future)

Static Channel Backups

Static Channel Backups (SCB) are critical for channel recovery:

Backup Strategy:

  • Automated SCB file upload to Cloud Storage on every channel state change
  • Object versioning enabled for historical recovery
  • Cross-region replication to secondary region
  • Cross-cloud replication to AWS S3 via Storage Transfer Service for defense against GCP-specific incidents
  • Lifecycle policy: Retain 30 days of versions
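
The versioning and retention pieces translate to Terraform roughly as follows; the bucket location is assumed, and the cross-region/cross-cloud copies are handled by Storage Transfer Service jobs not shown here:

resource "google_storage_bucket" "lightning_backups" {
  name     = "bkt-prod-lightning-backups"
  project  = "prj-prod-lightning-core"
  location = "US"                    # multi-region, assumed

  uniform_bucket_level_access = true

  # Keep every historical channel.backup revision
  versioning {
    enabled = true
  }

  # Delete noncurrent versions after 30 days
  lifecycle_rule {
    condition {
      days_since_noncurrent_time = 30
      with_state                 = "ARCHIVED"
    }
    action {
      type = "Delete"
    }
  }
}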

Recovery Procedure:

  • Wallet restored from seed, then the latest SCB file retrieved from Cloud Storage
  • Backup applied with lncli restorechanbackup, which asks channel peers to force-close
  • Funds swept back on-chain once the force-close transactions confirm

Backup Storage and Validation

Cloud Storage Bucket Configuration:

bkt-{env}-lightning-backups/
├── scb/
│   └── channel.backup (versioned)
├── macaroons/
│   ├── admin.macaroon
│   └── readonly.macaroon
├── tls/
│   ├── tls.cert
│   └── tls.key
└── config/
    └── lnd.conf

Backup Validation:

  • Daily automated restore tests in non-production environment
  • Alerting on backup failures via Cloud Monitoring

Security Model

Identity and Access Management

Following the principle of least privilege:

Service Accounts:

  • sa-lnd-node@prj-{env}-lightning-core.iam.gserviceaccount.com
    • Permissions: Cloud SQL Client, Secret Manager Accessor, Storage Object Creator
  • sa-bitcoin-core@prj-{env}-lightning-core.iam.gserviceaccount.com
    • Permissions: Storage Object Viewer (for snapshots)

Workload Identity Binding:

Kubernetes ServiceAccount: lnd-node
GCP ServiceAccount: sa-lnd-node@...
Binding: iam.gke.io/gcp-service-account=sa-lnd-node@...
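
In Terraform, this binding is a workloadIdentityUser grant on the GCP service account; the Kubernetes namespace ("lightning") is an assumption:

resource "google_service_account_iam_member" "lnd_workload_identity" {
  service_account_id = "projects/prj-prod-lightning-core/serviceAccounts/sa-lnd-node@prj-prod-lightning-core.iam.gserviceaccount.com"
  role               = "roles/iam.workloadIdentityUser"

  # member format: serviceAccount:{workload_pool}[{namespace}/{ksa_name}]
  member = "serviceAccount:prj-prod-lightning-core.svc.id.goog[lightning/lnd-node]"
}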

Secrets Management

All sensitive data stored in Secret Manager:

  • LND wallet password (auto-unlocking via startup script)
  • Bitcoin Core RPC credentials
  • Database connection strings
  • TLS certificates and private keys
  • Macaroon files
  • Multi-sig hot key (cold keys stored offline in hardware wallets)

Access Controls:

  • Secrets accessed via Workload Identity (no static credentials)
  • Audit logging enabled for all secret access
  • Automatic rotation for database credentials (90-day cycle)
  • Cloud HSM integration for cryptographic operations on multi-sig keys

Network Security

Firewall Rules:

  • Default deny ingress on all VPCs
  • Explicit allow rules for:
    • Bitcoin P2P (port 8333) from Cloud NAT IPs
    • LND P2P (port 9735) from Cloud NAT IPs
    • Internal service-to-service communication
  • Egress restricted via VPC Service Controls in production
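
A sketch of the default-deny plus one explicit allow in Terraform; names, priorities, and source ranges are illustrative (in practice the P2P rules are scoped per the list above):

resource "google_compute_firewall" "deny_all_ingress" {
  name      = "fw-prod-deny-all-ingress"
  project   = "prj-prod-shared-base-network"
  network   = "vpc-prod-shared-base"
  direction = "INGRESS"
  priority  = 65534          # evaluated last; explicit allows win

  deny {
    protocol = "all"
  }
  source_ranges = ["0.0.0.0/0"]
}

resource "google_compute_firewall" "allow_lnd_p2p" {
  name      = "fw-prod-allow-lnd-p2p"
  project   = "prj-prod-shared-base-network"
  network   = "vpc-prod-shared-base"
  direction = "INGRESS"
  priority  = 1000

  allow {
    protocol = "tcp"
    ports    = ["9735"]      # LND P2P
  }
  source_ranges = ["0.0.0.0/0"]   # placeholder; restrict as required
  target_tags   = ["lnd-node"]
}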

TLS Everywhere:

  • LND gRPC/REST API requires TLS 1.3
  • Cloud SQL connections via TLS
  • Internal service mesh with mTLS (future: Istio)

Compliance and Auditing

  • Cloud Audit Logs for all admin and data access
  • Security Command Center for security findings
  • Organization policy constraints enforcing:
    • VM external IP restrictions
    • Uniform bucket-level access on Cloud Storage
    • Domain-restricted sharing
    • OS Login requirement on VMs
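
Two of these constraints expressed as Terraform organization policies, as a sketch (the remaining constraints follow the same pattern):

# Deny external IPs on all VMs
resource "google_organization_policy" "no_external_ip" {
  org_id     = var.org_id
  constraint = "compute.vmExternalIpAccess"

  list_policy {
    deny {
      all = true
    }
  }
}

# Enforce uniform bucket-level access on Cloud Storage
resource "google_organization_policy" "uniform_bucket_access" {
  org_id     = var.org_id
  constraint = "storage.uniformBucketLevelAccess"

  boolean_policy {
    enforced = true
  }
}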

Observability and Operations

Monitoring Stack

GCP Native Tools:

  • Cloud Monitoring for metrics and alerting
  • Cloud Logging for structured log aggregation
  • Cloud Trace for distributed request tracing

Datadog Integration:

  • Agent deployed via DaemonSet for enhanced observability
  • Custom dashboards for Lightning-specific metrics
  • Anomaly detection for channel balance changes

Key Metrics

Infrastructure:

  • GKE node CPU/memory utilization
  • Pod restart count and crash loop detection
  • Persistent disk IOPS and throughput

Bitcoin Core:

  • Block height lag (should be 0)
  • Peer connection count
  • Mempool transaction count
  • ZMQ notification latency

LND:

  • Channel count (active, pending, inactive)
  • Total channel capacity (local and remote balance)
  • Payment success/failure rate
  • Forwarding fee revenue
  • HTLC processing latency
  • Database query performance

Alerting

Critical alerts configured with PagerDuty escalation:

  • Bitcoin Core sync lag > 2 blocks
  • LND node offline > 5 minutes
  • Channel force closure detected
  • Database connection failures
  • Backup upload failures
  • Pod crash loop
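
As an illustration, the "LND node offline" alert might be wired up as below. The metric filter is a placeholder (the real liveness signal could come from Datadog or a custom metric), and the PagerDuty channel ID is assumed to arrive as a variable:

resource "google_monitoring_alert_policy" "lnd_offline" {
  project      = "prj-prod-monitoring"
  display_name = "LND node offline > 5 minutes"
  combiner     = "OR"

  conditions {
    display_name = "lnd not reporting up"
    condition_threshold {
      # Placeholder metric; substitute the deployment's liveness signal
      filter          = "metric.type=\"custom.googleapis.com/lnd/up\" AND resource.type=\"k8s_container\""
      comparison      = "COMPARISON_LT"
      threshold_value = 1
      duration        = "300s"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  notification_channels = [var.pagerduty_channel_id]   # hypothetical variable
}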

Logging

Structured Logging:

  • JSON format with standardized fields
  • Correlation IDs for request tracing
  • Log levels: ERROR, WARN, INFO, DEBUG

Log Retention:

  • 30 days in Cloud Logging
  • 1 year in BigQuery for analysis
  • Audit logs retained for 7 years

Cross-Cloud Log Export:

  • Real-time log streaming to AWS S3 via Pub/Sub push subscription and Lambda for tamper-evident audit trail
  • Protects forensic evidence in the event of GCP account compromise
  • S3 bucket configured with Object Lock (compliance mode) and MFA delete
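
On the GCP side, the export starts with an organization-level log sink into Pub/Sub; a sketch with illustrative names (the push subscription and Lambda are configured separately):

resource "google_logging_organization_sink" "audit_export" {
  name             = "sk-org-audit-export"
  org_id           = var.org_id
  include_children = true

  # Route audit logs to the Pub/Sub topic feeding the AWS pipeline
  destination = "pubsub.googleapis.com/projects/prj-common-logging/topics/audit-export"
  filter      = "logName:\"cloudaudit.googleapis.com\""
}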

Infrastructure as Code

Terraform Architecture

Repository Structure:

terraform/
├── modules/
│   ├── folder/
│   ├── project/
│   ├── shared-vpc/
│   ├── gke-cluster/
│   └── lightning-core/
├── environments/
│   ├── development/
│   ├── nonproduction/
│   └── production/
└── bootstrap/

State Management:

  • Terraform Cloud for remote state
  • State locking to prevent concurrent modifications
  • Workspace-per-environment pattern

Example Module: Lightning Core Project

# modules/lightning-core/main.tf
variable "org_id" {
  description = "GCP Organization ID"
  type        = string
}

variable "environment_code" {
  description = "Environment code (d/n/p)"
  type        = string
  validation {
    condition     = contains(["d", "n", "p"], var.environment_code)
    error_message = "Must be d (dev), n (nonprod), or p (prod)"
  }
}

variable "folder_id" {
  description = "Parent folder ID"
  type        = string
}

locals {
  env_map = {
    d = "dev"
    n = "nonprod"
    p = "prod"
  }
  environment = local.env_map[var.environment_code]
  project_id  = "prj-${local.environment}-lightning-core"
}

resource "google_project" "lightning_core" {
  name       = local.project_id
  project_id = "${local.project_id}-${random_id.suffix.hex}"
  folder_id  = var.folder_id
  
  labels = {
    environment = local.environment
    application = "lightning"
    managed_by  = "terraform"
  }
}

resource "random_id" "suffix" {
  byte_length = 4
}

# Enable required APIs
resource "google_project_service" "services" {
  for_each = toset([
    "container.googleapis.com",
    "sqladmin.googleapis.com",
    "secretmanager.googleapis.com",
    "monitoring.googleapis.com",
    "logging.googleapis.com",
  ])
  
  project = google_project.lightning_core.project_id
  service = each.key
  
  disable_on_destroy = false
}

CI/CD Pipeline

GitHub Actions Workflow:

name: Terraform Deploy
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  terraform:
    runs-on: ubuntu-latest
    # OIDC token needed for Workload Identity Federation auth below
    permissions:
      contents: read
      id-token: write
    steps:
      - uses: actions/checkout@v3
      
      - name: Authenticate to GCP
        uses: google-github-actions/auth@v1
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.SA_EMAIL }}
      
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.5.0
          cli_config_credentials_token: ${{ secrets.TF_CLOUD_TOKEN }}
      
      - name: Terraform Init
        run: terraform init
        
      - name: Terraform Plan
        run: terraform plan -out=tfplan
        
      - name: Terraform Apply
        if: github.ref == 'refs/heads/main'
        run: terraform apply tfplan

Naming Conventions

Following GCP naming best practices:

Resource Type    Pattern                   Example
-------------    -------                   -------
Folder           fldr-{purpose}            fldr-production
Project          prj-{env}-{purpose}       prj-prod-lightning-core
VPC              vpc-{env}-{tier}          vpc-prod-shared-base
Subnet           subnet-{env}-{purpose}    subnet-prod-gke-nodes
GKE Cluster      gke-{env}-{purpose}       gke-prod-lightning-cluster
Cloud SQL        sql-{env}-{purpose}       sql-prod-lnd-database
Storage Bucket   bkt-{env}-{purpose}       bkt-prod-lightning-backups

Environment codes: dev, nonprod, prod

References

Google Cloud Platform

  • Google Cloud security foundations guide: https://cloud.google.com/architecture/security-foundations
  • Shared VPC: https://cloud.google.com/vpc/docs/shared-vpc

Lightning Network

  • BOLT specifications: https://github.com/lightning/bolts
  • LND: https://github.com/lightningnetwork/lnd

Infrastructure as Code

  • Terraform: https://developer.hashicorp.com/terraform
  • Terraform Google provider: https://registry.terraform.io/providers/hashicorp/google/latest

Kubernetes

  • Kubernetes documentation: https://kubernetes.io/docs/
  • Google Kubernetes Engine: https://cloud.google.com/kubernetes-engine/docs
