At Intercom, we have hundreds of SQS dead letter queues (DLQs) - nearly a thousand - each with its own paging priority level. The challenge we set out to solve was verifying which of these queues actually need to page our on-call engineers, particularly for lower-priority issues.
As the number of queues grows, it becomes increasingly difficult to maintain the right signal-to-noise ratio, and poor signal has a real cost: engineers get woken up in the middle of the night for issues that just aren't that important. The manual review process was becoming unsustainable - engineers would need to gather data from multiple sources (Terraform infrastructure, Honeycomb observability datasets, production metrics), analyze each queue's health and business impact, make decisions about appropriate paging tiers, and then implement approved changes across the infrastructure.
This is exactly the kind of systematic, data-intensive work that Claude excels at, but we ran into some interesting challenges around state management and reliability that led us to explore two different architectural approaches.
Our first approach was a comprehensive Claude Code command that tried to handle the entire workflow in a guided, interactive manner. The command would:
- Enumerate services and let engineers select which one to review
- Parse Terraform files across multiple regions to discover tier 1-2 queues
- Query two different Honeycomb datasets (RagChecker for worker context, production for metrics)
- Generate individual queue analysis files using ERB templates (a sketch follows after this list)
- Create structured GitHub issues with dynamic table of contents and team voting checkboxes
- Support a 4-phase workflow: Analysis → Decision Collection → Implementation → Completion
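To make the template step concrete, here's a stripped-down sketch of how a per-queue analysis file might be rendered with ERB. The fields and values are invented for the example; the real templates capture considerably more context:

```ruby
require "erb"

# Hypothetical per-queue analysis template; the fields shown here are
# illustrative, not the real template's schema.
TEMPLATE = <<~ERB
  ## <%= queue[:name] %> (tier <%= queue[:tier] %>)
  - 30-day job volume: <%= queue[:volume] %>
  - DLQ error rate: <%= queue[:error_rate] %>%
  - Recommendation: <%= queue[:recommendation] %>
ERB

queue = {
  name: "production-queue-1", tier: 2, volume: 1_500_000,
  error_rate: 0.0, recommendation: "Downgrade paging to business hours only"
}

puts ERB.new(TEMPLATE).result_with_hash(queue: queue)
```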
This approach worked well for the data collection and analysis phases. Claude is exceptionally good at parsing infrastructure code, querying APIs, and synthesizing information into clear recommendations. We successfully completed reviews covering 1.2B+ jobs across 14 queues with this method.
However, we hit reliability issues around state management and workflow progression. The command would sometimes lose track of where it was in the multi-phase process, especially when resuming after interruptions. File organization became inconsistent - different phases would create files in different patterns, and the command would sometimes get confused about what data had already been collected. The interactive nature also made it unsuitable for automation scenarios.
Recognizing these limitations, we pivoted to building a dedicated Ruby CLI tool that separates concerns between Claude's strengths (analysis, data synthesis) and what needed to be rock-solid reliable (state persistence, workflow orchestration).
The CLI tool provides commands like:
```
dlq-review init --service inbox --repo intercom/intercom
dlq-review add-queue production-queue-1 --tier 2
dlq-review set-metrics production-queue-1 --volume 1500000 --error-rate 0.0
dlq-review generate-templates
```
In this model, Claude uses the CLI as a structured data store throughout the process. It still does all the heavy lifting - parsing Terraform, querying Honeycomb, making recommendations - but now it persists every piece of information through explicit CLI commands rather than trying to manage complex file hierarchies and state internally.
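For a sense of the shape of the tool, here's a minimal sketch of what such a CLI could look like. The Thor-plus-single-JSON-file structure below is illustrative, not a description of our actual implementation, which has more commands and validation:

```ruby
#!/usr/bin/env ruby
require "thor"
require "json"

# Illustrative sketch of a dlq-review-style CLI: every command persists its
# effect to a JSON state file, so Claude never has to hold state itself.
class DlqReview < Thor
  STATE_FILE = "dlq_review_state.json"

  def self.exit_on_failure?
    true # fail with a non-zero exit code so callers can detect errors
  end

  desc "init", "Start a review for a service"
  option :service, required: true
  option :repo, required: true
  def init
    write_state("service" => options[:service], "repo" => options[:repo], "queues" => {})
  end

  desc "add-queue NAME", "Register a queue for review"
  option :tier, type: :numeric, required: true
  def add_queue(name)
    update_state { |state| state["queues"][name] = { "tier" => options[:tier] } }
  end

  desc "set-metrics NAME", "Record metrics for a previously added queue"
  option :volume, type: :numeric
  option :error_rate, type: :numeric
  def set_metrics(name)
    update_state do |state|
      state["queues"].fetch(name).merge!("volume" => options[:volume],
                                         "error_rate" => options[:error_rate])
    end
  end

  private

  def update_state
    state = JSON.parse(File.read(STATE_FILE))
    yield state
    write_state(state)
  end

  def write_state(state)
    File.write(STATE_FILE, JSON.pretty_generate(state))
  end
end

DlqReview.start(ARGV)
```

The key property is that each invocation reads, mutates, and rewrites the state file, so the full review state is always inspectable on disk rather than living in Claude's context.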
Side by side, the two approaches compare like this:

Assisted Command:
- Claude manages all state internally
- File organization handled by Claude
- Phase progression controlled by command logic
- Interactive workflow with human checkpoints
- Suitable for engineer-guided reviews
CLI Tool:
- Explicit state management through structured commands
- Predictable file organization (timestamped directories, consistent naming)
- Phase progression externally orchestrated
- Atomic operations with audit trail (sketched below)
- Suitable for automation scenarios
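The last two properties of the CLI tool are worth making concrete. Here's one way to get atomic state changes, an audit trail, and a timestamped directory layout; the temp-file-plus-rename and append-only log shown are illustrative choices, not necessarily how the real tool works:

```ruby
require "json"
require "fileutils"
require "time"

# Illustrative pattern: each state change is written atomically
# (temp file + rename) and recorded in an append-only audit log.
class ReviewStore
  def initialize(root = "reviews")
    # Predictable layout: one timestamped directory per review run.
    @dir = File.join(root, Time.now.utc.strftime("%Y%m%dT%H%M%SZ"))
    FileUtils.mkdir_p(@dir)
    @state_path = File.join(@dir, "state.json")
    @audit_path = File.join(@dir, "audit.log")
  end

  def commit(command, state)
    tmp = "#{@state_path}.tmp"
    File.write(tmp, JSON.pretty_generate(state))
    File.rename(tmp, @state_path) # atomic replace on POSIX filesystems
    File.open(@audit_path, "a") { |f| f.puts "#{Time.now.utc.iso8601} #{command}" }
  end
end

store = ReviewStore.new
store.commit("add-queue production-queue-1 --tier 2",
             { "queues" => { "production-queue-1" => { "tier" => 2 } } })
```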
A major factor pushing us toward the CLI tool was the desire to make this runnable in remote environments as part of automation pipelines. The assisted command requires constant human interaction - it gets stuck when it encounters unexpected data, needs guidance on decision-making, and requires an engineer to shepherd it through the multi-phase workflow.
For automation scenarios, we need something that can run reliably without human intervention. The CLI tool provides that foundation - Claude can execute a deterministic sequence of commands, each one either succeeds or fails cleanly, and the entire workflow can be scripted and monitored.
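In that world, a scheduled job could drive a review as a plain script, something like the sketch below (the command sequence is simplified for illustration):

```ruby
#!/usr/bin/env ruby
# Sketch of an unattended driver: run the review as a fixed sequence of CLI
# commands, stopping at the first failure so the scheduler sees a clean
# non-zero exit code.
STEPS = [
  %w[dlq-review init --service inbox --repo intercom/intercom],
  %w[dlq-review add-queue production-queue-1 --tier 2],
  %w[dlq-review set-metrics production-queue-1 --volume 1500000 --error-rate 0.0],
  %w[dlq-review generate-templates]
]

STEPS.each do |step|
  puts "==> #{step.join(' ')}"
  system(*step) or abort "aborting: `#{step.join(' ')}` failed"
end
```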
The trade-off is increased complexity upfront (building and maintaining the CLI tool) versus the flexibility to run this process automatically on a schedule, integrate it with CI/CD pipelines, or have it triggered by infrastructure changes.
Through this process, we learned that Claude is exceptionally strong at:
- Multi-source data gathering and synthesis
- Infrastructure code parsing and analysis
- Template generation and content creation
- Making reasoned recommendations based on complex criteria
But it struggles with:
- Complex state management across long-running processes
- Maintaining file organization consistency over time
- Self-recovery when workflows get interrupted
- Deterministic behavior in automation scenarios
The CLI tool approach addresses these by moving state management out of Claude's direct control, but raises questions about whether we're over-engineering the solution. Are we hitting fundamental limitations of AI-assisted workflows, or could the original command have been structured better to avoid these pitfalls?
Some specific areas where we're curious about best practices:
- State Management: Is there a better pattern for Claude to manage complex, multi-phase workflows internally?
- File Organization: Are there techniques to help Claude maintain consistent file hierarchies over long processes?
- Error Recovery: What's the best way to make Claude-driven workflows resumable after interruptions?
- Automation Balance: Where's the sweet spot between AI flexibility and deterministic automation requirements?
We've implemented both approaches and they each serve different use cases:
- Assisted Command: Has been merged and is ready for use with some handholding. It works well but requires engineers with good Claude skills to guide it through any issues that arise - it's not universally accessible to any engineer.
- CLI Tool: We're pausing development on this approach for now. If we decide we need fully automated execution in the future, this would likely be the pattern we'd pursue, but for now we're satisfied with the assisted command as our next step.
The question remains whether the CLI tool represented the right architectural choice or if there are patterns that would let us achieve better reliability in the assisted approach while keeping the workflow simpler and more directly Claude-managed.
We'd love input from the Anthropic team on whether this evolution makes sense or if we're missing simpler solutions to the state management and reliability challenges we encountered.