At Intercom, we have hundreds of SQS dead letter queues (DLQs) - nearly a thousand - each with its own paging priority level. The challenge we set out to solve was verifying which of these queues actually need to page our on-call engineers, particularly for lower-priority issues.
As the number of queues grows, it becomes increasingly difficult to maintain the right signal-to-noise ratio, and poor signal has a real cost: engineers get woken up in the middle of the night for issues that just aren't that important. The manual review process was becoming unsustainable - engineers would need to gather data from multiple sources (Terraform infrastructure, Honeycomb observability datasets, production metrics), analyze each queue's health and business impact, make decisions about appropriate paging tiers, and then implement approved changes across the infrastructure.
This is exactly the kind of systematic, data-intensive work that Claude excels at, but we ran into some interesting challenges around state management and reliability that led us to explore two different architectural approaches.
Our first approach was a comprehensive Claude Code command that tried to handle the entire workflow in a guided, interactive manner. The command would:
- Enumerate services and let engineers select which one to review
- Parse Terraform files across multiple regions to discover tier 1-2 queues
- Query two different Honeycomb datasets (RagChecker for worker context, production for metrics)
- Generate individual queue analysis files using ERB templates (a sketch follows after this list)
- Create structured GitHub issues with dynamic table of contents and team voting checkboxes
- Support a 4-phase workflow: Analysis → Decision Collection → Implementation → Completion
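To make the template step concrete, here's a stripped-down sketch of how a per-queue analysis file might be rendered with ERB. The fields and values are invented for the example; the real templates capture considerably more context:

```ruby
require "erb"

# Hypothetical per-queue analysis template; the fields shown here are
# illustrative, not the real template's schema.
TEMPLATE = <<~ERB
  ## <%= queue[:name] %> (tier <%= queue[:tier] %>)
  - 30-day job volume: <%= queue[:volume] %>
  - DLQ error rate: <%= queue[:error_rate] %>%
  - Recommendation: <%= queue[:recommendation] %>
ERB

queue = {
  name: "production-queue-1", tier: 2, volume: 1_500_000,
  error_rate: 0.0, recommendation: "Downgrade paging to business hours only"
}

puts ERB.new(TEMPLATE).result_with_hash(queue: queue)
```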
This approach worked well for the data collection and analysis phases. Claude is exceptionally good at parsing infrastructure code, querying APIs, and synthesizing information into clear recommendations. We successfully completed reviews covering 1.2B+ jobs across 14 queues with this method.
However, we hit reliability issues around state management and workflow progression. The command would sometimes lose track of where it was in the multi-phase process, especially when resuming after interruptions. File organization became inconsistent - different phases would create files in different patterns, and the command would sometimes get confused about what data had already been collected. The interactive nature also made it unsuitable for automation scenarios.
Recognizing these limitations, we pivoted to building a dedicated Ruby CLI tool that separates concerns between Claude's strengths (analysis, data synthesis) and what needed to be rock-solid reliable (state persistence, workflow orchestration).
The CLI tool provides commands like:
```
dlq-review init --service inbox --repo intercom/intercom
dlq-review add-queue production-queue-1 --tier 2
dlq-review set-metrics production-queue-1 --volume 1500000 --error-rate 0.0
dlq-review generate-templates
```
In this model, Claude uses the CLI as a structured data store throughout the process. It still does all the heavy lifting - parsing Terraform, querying Honeycomb, making recommendations - but now it persists every piece of information through explicit CLI commands rather than trying to manage complex file hierarchies and state internally.
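For a sense of the shape of the tool, here's a minimal sketch of what such a CLI could look like. The Thor-plus-single-JSON-file structure below is illustrative, not a description of our actual implementation, which has more commands and validation:

```ruby
#!/usr/bin/env ruby
require "thor"
require "json"

# Illustrative sketch of a dlq-review-style CLI: every command persists its
# effect to a JSON state file, so Claude never has to hold state itself.
class DlqReview < Thor
  STATE_FILE = "dlq_review_state.json"

  def self.exit_on_failure?
    true # fail with a non-zero exit code so callers can detect errors
  end

  desc "init", "Start a review for a service"
  option :service, required: true
  option :repo, required: true
  def init
    write_state("service" => options[:service], "repo" => options[:repo], "queues" => {})
  end

  desc "add-queue NAME", "Register a queue for review"
  option :tier, type: :numeric, required: true
  def add_queue(name)
    update_state { |state| state["queues"][name] = { "tier" => options[:tier] } }
  end

  desc "set-metrics NAME", "Record metrics for a previously added queue"
  option :volume, type: :numeric
  option :error_rate, type: :numeric
  def set_metrics(name)
    update_state do |state|
      state["queues"].fetch(name).merge!("volume" => options[:volume],
                                         "error_rate" => options[:error_rate])
    end
  end

  private

  def update_state
    state = JSON.parse(File.read(STATE_FILE))
    yield state
    write_state(state)
  end

  def write_state(state)
    File.write(STATE_FILE, JSON.pretty_generate(state))
  end
end

DlqReview.start(ARGV)
```

The key property is that each invocation reads, mutates, and rewrites the state file, so the full review state is always inspectable on disk rather than living in Claude's context.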
Side by side, the two approaches compare like this:

Assisted Command:
- Claude manages all state internally
- File organization handled by Claude
- Phase progression controlled by command logic
- Interactive workflow with human checkpoints
- Suitable for engineer-guided reviews
CLI Tool:
- Explicit state management through structured commands
- Predictable file organization (timestamped directories, consistent naming)
- Phase progression externally orchestrated
- Atomic operations with audit trail (sketched below)
- Suitable for automation scenarios
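The last two properties of the CLI tool are worth making concrete. Here's one way to get atomic state changes, an audit trail, and a timestamped directory layout; the temp-file-plus-rename and append-only log shown are illustrative choices, not necessarily how the real tool works:

```ruby
require "json"
require "fileutils"
require "time"

# Illustrative pattern: each state change is written atomically
# (temp file + rename) and recorded in an append-only audit log.
class ReviewStore
  def initialize(root = "reviews")
    # Predictable layout: one timestamped directory per review run.
    @dir = File.join(root, Time.now.utc.strftime("%Y%m%dT%H%M%SZ"))
    FileUtils.mkdir_p(@dir)
    @state_path = File.join(@dir, "state.json")
    @audit_path = File.join(@dir, "audit.log")
  end

  def commit(command, state)
    tmp = "#{@state_path}.tmp"
    File.write(tmp, JSON.pretty_generate(state))
    File.rename(tmp, @state_path) # atomic replace on POSIX filesystems
    File.open(@audit_path, "a") { |f| f.puts "#{Time.now.utc.iso8601} #{command}" }
  end
end

store = ReviewStore.new
store.commit("add-queue production-queue-1 --tier 2",
             { "queues" => { "production-queue-1" => { "tier" => 2 } } })
```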
A major factor pushing us toward the CLI tool was the desire to make this runnable in remote environments as part of automation pipelines. The assisted command requires constant human interaction - it gets stuck when it encounters unexpected data, needs guidance on decision-making, and requires an engineer to shepherd it through the multi-phase workflow.
For automation scenarios, we need something that can run reliably without human intervention. The CLI tool provides that foundation - Claude can execute a deterministic sequence of commands, each one either succeeds or fails cleanly, and the entire workflow can be scripted and monitored.
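In that world, a scheduled job could drive a review as a plain script, something like the sketch below (the command sequence is simplified for illustration):

```ruby
#!/usr/bin/env ruby
# Sketch of an unattended driver: run the review as a fixed sequence of CLI
# commands, stopping at the first failure so the scheduler sees a clean
# non-zero exit code.
STEPS = [
  %w[dlq-review init --service inbox --repo intercom/intercom],
  %w[dlq-review add-queue production-queue-1 --tier 2],
  %w[dlq-review set-metrics production-queue-1 --volume 1500000 --error-rate 0.0],
  %w[dlq-review generate-templates]
]

STEPS.each do |step|
  puts "==> #{step.join(' ')}"
  system(*step) or abort "aborting: `#{step.join(' ')}` failed"
end
```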
The trade-off is increased complexity upfront (building and maintaining the CLI tool) versus the flexibility to run this process automatically on a schedule, integrate it with CI/CD pipelines, or have it triggered by infrastructure changes.
Through this process, we learned that Claude is exceptionally strong at:
- Multi-source data gathering and synthesis
- Infrastructure code parsing and analysis
- Template generation and content creation
- Making reasoned recommendations based on complex criteria
But it struggles with:
- Complex state management across long-running processes
- Maintaining file organization consistency over time
- Self-recovery when workflows get interrupted
- Deterministic behavior in automation scenarios
The CLI tool approach addresses these by moving state management out of Claude's direct control, but raises questions about whether we're over-engineering the solution. Are we hitting fundamental limitations of AI-assisted workflows, or could the original command have been structured better to avoid these pitfalls?
Some specific areas where we're curious about best practices:
- State Management: Is there a better pattern for Claude to manage complex, multi-phase workflows internally?
- File Organization: Are there techniques to help Claude maintain consistent file hierarchies over long processes?
- Error Recovery: What's the best way to make Claude-driven workflows resumable after interruptions?
- Automation Balance: Where's the sweet spot between AI flexibility and deterministic automation requirements?
We've implemented both approaches and they each serve different use cases:
- Assisted Command: Has been merged and is ready for use with some handholding. It works well but requires engineers with good Claude skills to guide it through any issues that arise - it's not universally accessible to any engineer.
- CLI Tool: We're pausing development on this approach for now. If we decide we need fully automated execution in the future, this would likely be the pattern we'd pursue, but for now we're satisfied with the assisted command as our next step.
The question remains whether the CLI tool represented the right architectural choice or if there are patterns that would let us achieve better reliability in the assisted approach while keeping the workflow simpler and more directly Claude-managed.
We'd love input from the Anthropic team on whether this evolution makes sense or if we're missing simpler solutions to the state management and reliability challenges we encountered.