1,260 commits across 20+ distinct work streams. For context, the entire infrastructure repository had roughly 1,800 total commits in this period. Laura authored approximately 70% of all infrastructure changes.
Built a replacement for the legacy DynamoDB-based configuration system from scratch, spanning Go packages, Ruby client, Terraform modules, Lambda functions, and CLI tooling.
Scope:
- Created `gopkg/configdbv2` package with full DynamoDB-backed configuration storage, rollback logic, changelog generation, metadata management, sensitive data masking, and test stubs
- Implemented `ruby/manager: implement ConfigDBv2` for the Manager service to read/write from the new system
- Built `ruby/ably-env: implement AblyEnv::ConfigDBv2` with CLI commands (`config v2 get`, `validate`, `sync`, `clear-scope`)
- Created `configdb-replicator` Lambda function for cross-region replication
- Added Cartography API views for ConfigDBv2 with changelog and audit endpoints
- Built YACE metrics scraping for ConfigDBv2 DynamoDB tables and Lambda functions
- Created Terraform module for ConfigDBv2 infrastructure
- Rolled out progressively: sandbox → nonprod → prod (with rollbacks and re-enables along the way)
- Added S3 failover for configuration reads in Manager
- Removed legacy ConfigDB code (Go, Ruby, Manager, Terraform) after migration
Impact: Replaced a critical piece of Ably's infrastructure configuration system, affecting how every realtime cluster is configured and deployed.
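The sensitive data masking mentioned in the scope above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `sensitiveMarkers` list, the `isSensitive` and `maskConfig` names, and the idea of deciding sensitivity from the key name are hypothetical and are not taken from the `configdbv2` package.

```go
package main

import (
	"fmt"
	"strings"
)

// sensitiveMarkers is a hypothetical list of substrings that flag a key as
// sensitive. The real package may decide this differently (e.g. via metadata).
var sensitiveMarkers = []string{"password", "secret", "token"}

// isSensitive reports whether a key name contains any sensitive marker.
func isSensitive(key string) bool {
	lower := strings.ToLower(key)
	for _, marker := range sensitiveMarkers {
		if strings.Contains(lower, marker) {
			return true
		}
	}
	return false
}

// maskConfig returns a copy of cfg with sensitive values replaced by a fixed
// placeholder, leaving the input map untouched.
func maskConfig(cfg map[string]string) map[string]string {
	masked := make(map[string]string, len(cfg))
	for key, value := range cfg {
		if isSensitive(key) {
			masked[key] = "********"
			continue
		}
		masked[key] = value
	}
	return masked
}

func main() {
	cfg := map[string]string{"region": "eu-west-1", "db_password": "hunter2"}
	masked := maskConfig(cfg)
	fmt.Println(masked["region"], masked["db_password"])
}
```

Returning a copy rather than mutating the input keeps the unmasked values available to callers that legitimately need them, while changelog or audit output can be rendered from the masked copy.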
Migrated the entire ably-env Ruby CLI to ablyctl (Go), then extended it significantly beyond parity.
Migration work (INF-6858):
- Created migration plan with Claude, tracked progress through multiple phases
- Ported all major command groups: autoscaling, config, secrets, crisis, release, instance, terraform, rabbitmq, logs, admin, netmap, routing policies, clusters
- Standardised command conventions with a style guide and CLAUDE.md
- Extracted inline Run closures across every package for testability
- Added shared flags package, toolkit helpers, confirmation prompts
- Split large single-file packages into logical files
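Extracting inline Run closures for testability can be sketched as follows. This is a minimal illustration under stated assumptions: the `Command` type is a hypothetical stand-in for whatever CLI framework ablyctl uses (the source does not name it), and the `clusters list` subcommand, `runClustersList`, and `captureOutput` are invented for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// Command is a hypothetical stand-in for the CLI framework's command type.
type Command struct {
	Use string
	Run func(out io.Writer, args []string) error
}

// runClustersList is a named handler rather than an anonymous closure.
// Because it receives its writer and arguments as parameters, a test can
// call it directly without building the whole command tree.
func runClustersList(out io.Writer, args []string) error {
	if len(args) == 0 {
		return fmt.Errorf("at least one cluster name is required")
	}
	for _, name := range args {
		fmt.Fprintf(out, "cluster: %s\n", name)
	}
	return nil
}

// newClustersListCommand wires the named handler into the command.
func newClustersListCommand() *Command {
	return &Command{Use: "clusters list", Run: runClustersList}
}

// captureOutput runs a handler against an in-memory buffer, as a test would.
func captureOutput(run func(io.Writer, []string) error, args []string) (string, error) {
	var buf bytes.Buffer
	err := run(&buf, args)
	return buf.String(), err
}

func main() {
	cmd := newClustersListCommand()
	// Fixed sample arguments keep the example deterministic.
	if err := cmd.Run(os.Stdout, []string{"sandbox", "nonprod"}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

A test can then assert on `captureOutput(runClustersList, args)` directly, which is harder to do when the logic lives inside a closure assigned at command construction time.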
New capabilities beyond ably-env:
- VictoriaMetrics querying (`metrics` command)
- Loki log querying (`logs` command with LogQL)
- Alertmanager querying (`alerts` command)
- AWS EBS volume management
- AWS subnet listing
- CloudTrail event lookup
- VM cardinality metrics
- Scalr workspace management (`terraform scalr`)
- WAF capacity checking
- CloudWatch metrics querying
- Lifecycle testing framework (SSM-based instance testing)
- Auto-update functionality with S3 bucket distribution
- Sentry error reporting and command telemetry
- Shell completion for all commands (cluster, region, site)
- Claude agent safety hook (permission gating for automated operations)
- Interactive version selection for config rollback
- Parallel instance connections and commands
- `--plain` and `--tail` flags for log output
- Config edit command
Impact: The primary infrastructure operations tool used by the entire infrastructure team, now in Go with better performance, testability, and extensibility.
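Running a command against many instances in parallel, as listed among the new capabilities, might look roughly like the sketch below. It is not ablyctl's implementation: `Result`, `runOnAll`, the concurrency limit, and the instance IDs are all hypothetical, and the per-instance work is reduced to a plain function.

```go
package main

import (
	"fmt"
	"sync"
)

// Result holds the outcome of running a command on one instance.
type Result struct {
	Instance string
	Output   string
	Err      error
}

// runOnAll runs fn against every instance with at most limit calls in flight.
// Results are returned in the same order as the input slice.
func runOnAll(instances []string, limit int, fn func(instance string) (string, error)) []Result {
	if limit < 1 {
		limit = 1
	}
	results := make([]Result, len(instances))
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup

	for i, instance := range instances {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit calls are already running
		go func(i int, instance string) {
			defer wg.Done()
			defer func() { <-sem }()
			out, err := fn(instance)
			// Each goroutine writes only to its own index, so no mutex is needed.
			results[i] = Result{Instance: instance, Output: out, Err: err}
		}(i, instance)
	}
	wg.Wait()
	return results
}

func main() {
	instances := []string{"i-aaa", "i-bbb", "i-ccc"} // hypothetical instance IDs
	results := runOnAll(instances, 2, func(instance string) (string, error) {
		return "uptime ok on " + instance, nil
	})
	for _, r := range results {
		fmt.Println(r.Instance, r.Output, r.Err)
	}
}
```

Bounding concurrency with a semaphore keeps a large fleet from opening every connection at once, and collecting errors per instance lets one failure be reported without discarding the other results.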
End-to-end backup solution across all AWS regions for EBS, RDS, and Cassandra.
Scope:
- Created `backup-vault` and `backup-plan` Terraform modules
- Deployed backup vaults across all 13 nonprod regions and all prod regions
- Deployed backup plans across all nonprod and prod regions
- Upgraded backup modules to AWS Provider v6
- Created Scalr workspaces for backup-nonprod and backup-prod
- Bootstrapped backup AWS accounts (prod-backup, nonprod-backup)
- Added SSO access for backup accounts
- Created DR backup buckets
- Enabled offsite backups for Cassandra, EBS (vmstorage), and RDS
- Added KMS key management for backup plans
- Configured cross-account backup vault access with unique SIDs per statement
- Created `ablyctl backup` commands
Impact: Ably now has cross-region disaster recovery backups for all critical data stores, a capability that didn't exist before.
Built a complete automated secrets backup system.
Scope:
- Created `crypt-backup` Lambda function (Go)
- Implemented `go/pkg/crypt/server` server-side crypt package
- Added regional and S3 failover for secret reads
- Added Manager-level regional failover and S3 failover for secrets
- Integrated Sentry for error tracking
- Added YACE metrics scraping for the Lambda
- Created alerting for Lambda function errors
- Added `cli: add automatic secrets backup` command
- Deployed in prod with monitoring
- Fixed edge cases (deleting recreated secrets, YAML round-trip integer values)
Impact: Automated backup of all encrypted secrets with multi-region failover, protecting against data loss in the secrets management system.
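The regional and S3 failover for secret reads described above amounts to an ordered fallback chain. The sketch below illustrates that idea only. The `Source` interface, `mapSource`, `readWithFailover`, and the source names are hypothetical, and in-memory maps stand in for the real regional stores and S3.

```go
package main

import (
	"errors"
	"fmt"
)

// Source is a hypothetical abstraction over one location a secret can be
// read from (a regional store, an S3 backup, and so on).
type Source interface {
	Name() string
	Read(key string) (string, error)
}

// mapSource is an in-memory Source used for illustration only.
type mapSource struct {
	name string
	data map[string]string
	down bool // simulates an unavailable backend
}

func (s mapSource) Name() string { return s.name }

func (s mapSource) Read(key string) (string, error) {
	if s.down {
		return "", errors.New("backend unavailable")
	}
	v, ok := s.data[key]
	if !ok {
		return "", errors.New("key not found")
	}
	return v, nil
}

// readWithFailover tries each source in order and returns the first
// successful read together with the name of the source that served it.
// If every source fails, the individual errors are joined and returned.
func readWithFailover(key string, sources ...Source) (value, from string, err error) {
	if len(sources) == 0 {
		return "", "", errors.New("no sources configured")
	}
	var errs []error
	for _, s := range sources {
		v, readErr := s.Read(key)
		if readErr == nil {
			return v, s.Name(), nil
		}
		errs = append(errs, fmt.Errorf("%s: %w", s.Name(), readErr))
	}
	return "", "", errors.Join(errs...)
}

func main() {
	data := map[string]string{"api_key": "abc123"}
	primary := mapSource{name: "primary-region", down: true}
	secondary := mapSource{name: "secondary-region", down: true}
	s3 := mapSource{name: "s3-backup", data: data}

	v, from, err := readWithFailover("api_key", primary, secondary, s3)
	fmt.Println(v, from, err)
}
```

This sketch falls through to the next source on any error, including a missing key. A production implementation would probably distinguish "unavailable" from "not found" so that a genuinely deleted secret is not resurrected from a stale backup.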
Built a new tool to replace CDKTF for generating Terraform configuration.
Scope:
- Created `go/pkg/tfgen` package from scratch
- Added `tfgen realtime` package for realtime cluster Terraform generation
- Integrated into infratool (`go/tools/infratool: add tfgen command`)
- Integrated into ablyctl (`go/tools/ablyctl: use tfgen for terraform realtime`)
- Integrated into ably-env (`ruby/ably-env: use tfgen for terraform realtime`)
- Removed CDKTF for realtime stacks
- Generated Terraform for all realtime clusters
Impact: Eliminated the CDKTF dependency (TypeScript/Node.js) for Terraform generation, replacing it with a native Go tool that integrates directly into the infrastructure toolchain.
Major overhaul of logging and monitoring systems.
Self-hosted Loki:
- Deployed Loki across nonprod and prod
- Created Loki containers with tuning (query timeout, gRPC message size, clustering)
- Added memcached caching (scaled up multiple times)
- Deployed Loki queriers for read scaling
- Added external NLB for Loki access
- Created basic-auth-proxy (NGINX) for authentication
- Built Logs Cluster View Grafana dashboard
- Switched default data source from Grafana Cloud to self-hosted Loki
- Deleted Grafana Cloud Logs resources after migration
Vector migration (from Promtail):
- Added Vector to AMI, deploy manifests, and manager configuration
- Created Vector aggregator container and configuration
- Added disk buffering, throttling, and pipeline configuration
- Added Vector monitoring alerts
- Deprecated and removed Promtail
VictoriaMetrics:
- Upgraded vmstorage instance types (x2gd.large to r7g.xlarge)
- Added vmselect instances
- Enabled EBS snapshots and backups for vmstorage
- Deprecated vmbackupmanager
- Created victoriametrics-exporter with tests and mocking support
- Added vm commands to ablyctl
Grafana dashboards:
- Migrated multiple dashboards to Grafonnet (CloudFront, Data Center Instance View)
- Created Cassandra Cluster View, Realtime Container View, Logs Cluster View
- Updated Queue Cluster View with percentages and published message rates
Impact: Moved the entire logging pipeline from Grafana Cloud to self-hosted Loki (cost reduction), replaced Promtail with Vector (better performance/reliability), and improved monitoring dashboards across the board.
Migrated push queues to quorum queues and built comprehensive RabbitMQ management tooling.
Scope:
- Created RabbitMQ cluster-config Terraform module
- Migrated push queues to quorum (sandbox → nonprod → prod)
- Created push-quorum vhost and associated resources
- Upgraded RabbitMQ to 4.2
- Migrated from rabbitmq.config to rabbitmq.conf
- Removed custom auth plugin
- Enabled shovel plugin
- Built `rabbitmq migrate-to-quorum` ably-env command with skip-broken-queues resilience
- Added `delete-queue`, `delete-queues-matching`, `--filter` to list-queues
- Added list-users, --json output, queue type display
- Fixed quorum migration for broken shovel status API and unacked queues
- Fixed Reaper for RabbitMQ 4.x
- Added RabbitMQ load testing enhancements to ably-env
- Added `ablybench rabbit receiver` for benchmarking
- Updated Grafana dashboards for queue monitoring
- Rationalised queue module security groups and ports
Impact: Migrated to quorum queues for better data safety and fault tolerance, with comprehensive tooling for ongoing RabbitMQ operations.
Systematic performance tuning of the log processor.
Changes:
- Hoisted regex compilation to package level
- Flushed CSV once after processing instead of per line
- Cached AllFields() result instead of recomputing per line
- Pre-allocated and reused CSV row slice
- Eliminated double-buffered compressed data
- Pre-sized LogData maps to avoid growth during population
- Right-sized compressed buffer to avoid over-allocation
- Fixed missing break in S3 download retry loop
- Removed unused FloatRegex and CSV_EOL
Impact: Reduced CPU and memory allocation in a Lambda function that processes every log line from every realtime instance.
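Several of the changes above follow well-known Go patterns, sketched below in isolation. This is not the log processor's code: the `key=value` pattern, `linesToCSV`, `fetchWithRetry`, and `errTransient` are hypothetical names chosen for the example. The sketch shows a regex compiled once at package level, a reused row slice, a pre-sized map, a single CSV flush after the loop, and a retry loop that stops at the first success.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"errors"
	"fmt"
	"regexp"
)

// kvPattern is compiled once at package initialisation instead of on every
// call. The key=value pattern is an assumption made for this sketch.
var kvPattern = regexp.MustCompile(`(\w+)=(\S+)`)

// errTransient stands in for a retryable download failure.
var errTransient = errors.New("transient failure")

// linesToCSV extracts the requested fields from each line and renders CSV.
func linesToCSV(lines, fields []string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	row := make([]string, len(fields))             // allocated once, reused per line
	values := make(map[string]string, len(fields)) // pre-sized for the expected keys

	for _, line := range lines {
		for k := range values {
			delete(values, k) // reset the map without reallocating it
		}
		for _, m := range kvPattern.FindAllStringSubmatch(line, -1) {
			values[m[1]] = m[2]
		}
		for i, f := range fields {
			row[i] = values[f]
		}
		if err := w.Write(row); err != nil {
			return "", err
		}
	}
	w.Flush() // flush once after all lines, not once per line
	if err := w.Error(); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// fetchWithRetry calls fetch up to maxAttempts times and reports how many
// calls were made. The break stops the loop at the first success.
func fetchWithRetry(maxAttempts int, fetch func() ([]byte, error)) ([]byte, int, error) {
	var data []byte
	var err error
	attempts := 0
	for i := 0; i < maxAttempts; i++ {
		attempts++
		data, err = fetch()
		if err == nil {
			break // without this, a successful download would be repeated
		}
	}
	return data, attempts, err
}

func main() {
	out, err := linesToCSV([]string{"level=info msg=started", "msg=stopped"}, []string{"level", "msg"})
	fmt.Print(out)
	fmt.Println(err)
}
```

Reusing `row` is safe here because `csv.Writer.Write` copies the fields into its own buffer before returning, so the slice can be overwritten on the next iteration.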
Built dynamic enterprise monitoring from Terraform modules.
Scope:
- Created `enterprise-monitoring` Terraform module
- Built adminapi Go package with tests (account monitoring, app monitoring resources)
- Created adminapi Terraform provider resources
- Added enterprise recording rules and alert expressions
- Enabled monitoring per-account via Terraform
- Added business hours paging configuration
- Created enterprise monitoring dashboard
- Migrated from legacy static monitoring to dynamic per-account monitoring
- Added enterprise account reporting to ably-env (with --markdown, --csv, --missing flags)
Impact: Enterprise customer monitoring is now managed through Terraform rather than manual configuration, with automatic onboarding.
Extended the internal infrastructure API with multiple new capabilities.
Scope:
- Added `/v1/accounts` endpoint with DynamoDB storage and tests
- Added ConfigDBv2 views with changelog, audit, and `/current` endpoints
- Added playbooks API endpoint with linting (INF-6974)
- Added playbooks client and OpenAPI spec regeneration
- Added cluster views (security, observability, deployment, queue, Cassandra)
- Added config filtering, placement constraints view fixes
- Added cluster name alias redirects
- Migrated playbooks from Confluence to repository-based system
Impact: Cartography API is now the central infrastructure intelligence service, with configuration, playbooks, and account data all queryable.
Migrated from legacy "environment/data_center" naming to "cluster/site" across multiple systems.
Scope:
- Migrated ably-env CLI commands to new naming
- Updated alertmanager templates to handle cluster and site
- Updated vmalert rules to use cluster and site in expressions
- Migrated Grafana dashboards to new naming
- Updated enterprise monitoring to use cluster label
- Migrated Concourse pipeline naming
- Added `clusters naming` command and debugging skills to ably-env
Impact: Standardised naming conventions across the infrastructure stack, eliminating confusion between legacy and modern terminology.
Systematic upgrade of Terraform modules from AWS Provider v5 to v6.
Scope:
- Upgraded backup-vault, backup-plan modules
- Upgraded website modules (nonprod and prod)
- Created Provider v6 upgrade guide and Copilot instructions
- Piloted on crypt-backup and configdb nonprod
- Enabled same-region RDS backups as part of the upgrade
Impact: Keeps Terraform modules current with the latest AWS provider, unblocking new AWS features.
Rebuilt deployment infrastructure (Concourse CI).
Scope:
- Created new deployment module (Terraform)
- Created modern deployment cluster structure
- Migrated Concourse to new nonprod and prod infrastructure
- Created deployment DNS zones
- Fixed Concourse worker for Ubuntu 24.04
- Added autoscaling parameters
- Upgraded deployment RDS
Impact: Modernised the CI/CD deployment infrastructure.
Significant improvements to the build and test pipeline.
Scope:
- Refactored CI workflows (renamed "CI" to "Containers", separated manager/ably-env/on-call-dashboard)
- Created dynamic matrix for per-container build jobs
- Added Docker buildx caching to GHA backend
- Fixed multi-arch manifest race condition
- Created infratool Go CI tool to replace Rake
- Added pull_request triggers for reliable PR path filtering
- Added infratool `image check` and `--json` commands
- Added Copilot setup steps for GitHub
- Moved to standardrb for Ruby linting
Impact: Faster, more reliable CI pipelines with better container build caching and matrix parallelism.
Created a framework for building and packaging Go services.
Scope:
- Created `go/server/` directory structure with packaging workflow
- Built `go/pkg/cli` standard entrypoint for server CLI tools
- Created equip CLI for service unit file generation
- Added parrot example service
- Created gktools assert package with tests (running containers, server testing)
- Created host package
- Added CI for uploading gktools
- Restructured assert to use the cli package
- Introduced d/ctl service pattern with port registry
Impact: Standardised framework for building, testing, and deploying infrastructure services.
Major restructuring of the repository.
Scope:
- Restructured the repository into a monorepo
- Migrated Go modules (`go.mod` to `go/go.mod`)
- Moved terraform providers to `go/terraform/`
- Moved containerised services to `go/services/`
- Moved ablybench, ablyctl, infratool to `go/tools/`
- Migrated `ablyaws` from ablyctl to shared `gopkg`
- Created shared Go packages (configdb, configdbv2, crypt, ablyaws, adminapi)
- Created AWS interface generator for mocking
- Architecture documentation in `docs/` directory
Impact: Cleaner repository structure with shared packages enabling code reuse across tools and services.
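The AWS interface generator exists to support mocking, and the pattern it serves can be shown by hand. The sketch below is an assumption-laden illustration: `ObjectGetter`, `fakeObjectGetter`, `loadVersion`, and the bucket and key names are hypothetical, and the method signature is deliberately simplified rather than mirroring the AWS SDK.

```go
package main

import (
	"errors"
	"fmt"
)

// ObjectGetter is a hypothetical narrow interface covering the single
// storage operation this code needs. It is simplified and does not mirror
// the AWS SDK's method signature.
type ObjectGetter interface {
	GetObject(bucket, key string) ([]byte, error)
}

// fakeObjectGetter is an in-memory test double that satisfies ObjectGetter.
type fakeObjectGetter struct {
	objects map[string][]byte // keyed by "bucket/key"
}

func (f fakeObjectGetter) GetObject(bucket, key string) ([]byte, error) {
	data, ok := f.objects[bucket+"/"+key]
	if !ok {
		return nil, errors.New("object not found")
	}
	return data, nil
}

// loadVersion depends only on the interface, so tests can pass the fake
// while production code passes a wrapper around the real client.
func loadVersion(getter ObjectGetter, bucket string) (string, error) {
	data, err := getter.GetObject(bucket, "VERSION")
	if err != nil {
		return "", fmt.Errorf("load version from %s: %w", bucket, err)
	}
	return string(data), nil
}

func main() {
	fake := fakeObjectGetter{objects: map[string][]byte{
		"releases/VERSION": []byte("1.2.3"), // hypothetical bucket and contents
	}}
	version, err := loadVersion(fake, "releases")
	fmt.Println(version, err)
}
```

Generating such interfaces from the SDK client types, rather than writing them by hand, keeps the interface definitions in step with the SDK as it changes.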
Scope:
- Created WAF rule to block CloudFront host header requests
- Made WAF rule action configurable (count → block progression)
- Added pusher-pubnub regex pattern set
- Scoped credentials per caller for ably-env and ablyctl (INF-6938)
- Built Claude agent safety hook (permission gating)
- Removed orbit rule group from prod WAF
Impact: Improved security posture for public-facing infrastructure and internal tooling.
Built an internal dashboard for on-call management.
Scope:
- Created incident timeline and statistics views
- Added calendar view
- Built incident view and index page
- Added in-memory caching
- Improved styling and routing
- Fixed incident paging
Built infrastructure-specific AI tooling and skills.
Scope:
- Created CLAUDE.md files across the repository (root, Go, Ruby, Terraform, ablyctl, modules)
- Built skills: jira-ticket, backlog, query-alerts, query-metrics, query-logs, debug-live, test-manager, dev-cluster
- Added agent permission checker hook for safe automated operations
- Created automatic copilot feedback loop for jira-ticket skill
- Shared step scaling investigation and AI workflow learnings with team
Throughout the year, ongoing operational work:
- Customer onboarding: Duolingo, Gorgias CNAMEs, Hivebrite, Kraken, EA, Bloke Design, Reflag, Leya, Aristocrat, Lightspeed scaling
- AMI upgrades and rollouts (multiple cycles)
- Terraform module updates across all clusters (realtime, observability, cassandra, queue, security, deployment, api, website)
- Incident response tooling (incident.io integration, alertmanager webhook config)
- Realtime upgrades (7.27, 7.32, 7.54)
- Docker upgrades and fixes
- Manager bug fixes (ECR retry, Docker API changes, credentials, container age metrics, Reaper)
- ably-env bug fixes (config, crisis, rollback, scaling, validation)
| Area | Scale |
|---|---|
| Total commits | 1,260 |
| Share of all repo commits | ~70% |
| Major systems built from scratch | ConfigDBv2, crypt-backup, tfgen, backup infrastructure, enterprise monitoring, service framework, on-call dashboard |
| Major migrations completed | ablyctl (Ruby → Go), Loki (Grafana Cloud → self-hosted), Vector (Promtail replacement), quorum queues, Provider v6, legacy naming, monorepo restructure |
| Terraform modules created | backup-vault, backup-plan, deployment, enterprise-monitoring, cloudflare-exporter, configdb, rabbit cluster-config |
| Go packages created | configdbv2, configdb, crypt/server, tfgen, adminapi, cli, host, servertest, lifecycle, equip |
| CLI commands added | 40+ new commands across ablyctl and ably-env |
| Grafana dashboards | 6 created or migrated to Grafonnet |