Generic AI Agent Troubleshooting Prompt: Terraform Cloud Workspace Failures

Context

Use this prompt when investigating any failing Terraform Cloud (TFC) workspaces. This methodology emphasizes temporal analysis, change detection, and environmental discovery to understand when failures began and what changes triggered them. Applicable to any Terraform infrastructure provisioning issues.

Prerequisites

  • Access to Terraform Cloud API
  • GitHub CLI tools available
  • Git repository access
  • Terraform CLI installed

20-Step Generic Troubleshooting Procedure

Phase 1: Temporal Analysis & Change Detection

Step 1: Establish Failure Timeline

Using the TFC API, analyze the workspace's run history (an example query follows this list):
- Identify when failures first started occurring
- Find the last successful run before failures began
- Map failure frequency and patterns over time
- Note any correlation with deployment schedules
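
For example, a minimal sketch using curl and jq (assumes jq is installed and $TFC_TOKEN is set; --globoff stops curl from treating the brackets in page[size] as a URL glob):

# List recent runs with status and timestamp to locate the first failure
curl -s --globoff -H "Authorization: Bearer $TFC_TOKEN" \
  "https://app.terraform.io/api/v2/workspaces/{workspace_id}/runs?page[size]=50" \
  | jq -r '.data[] | [.id, .attributes.status, .attributes."created-at"] | @tsv'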

Step 2: Compare Failed vs. Successful Run Configurations

Extract and compare configurations between:
- Last successful run
- First failed run
- Most recent failed run
Identify differences in module versions, variables, provider versions, and Terraform versions (an example comparison is sketched below).
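
A hedged sketch using the same configuration-version endpoint shown in the Investigation Tools section below (run IDs are placeholders; assumes a bash-compatible shell and jq):

# Fetch configuration-version metadata for the last good and first bad runs, then diff
curl -s -H "Authorization: Bearer $TFC_TOKEN" \
  "https://app.terraform.io/api/v2/runs/{last_good_run_id}/configuration-version" > good-config.json
curl -s -H "Authorization: Bearer $TFC_TOKEN" \
  "https://app.terraform.io/api/v2/runs/{first_bad_run_id}/configuration-version" > bad-config.json
diff <(jq -S . good-config.json) <(jq -S . bad-config.json)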

Step 3: Analyze Failure Pattern Evolution

Track how error messages have changed over time:
- Are errors consistent or evolving?
- Do failures occur in different phases (plan vs apply)?
- Are there dependency or timing-related patterns?
- Have workarounds been attempted, and what were the results?

Step 4: Map External Dependencies and Timing

Investigate external factors during failure timeframe:
- Provider version releases and breaking changes
- Module registry updates
- API deprecations or service changes
- Network, DNS, or connectivity issues

Phase 2: Environmental Discovery & Context Analysis

Step 5: Scan Repository Change History

Analyze git repository during failure period:
git log --oneline --since="[last-success-date]" --until="[failure-date]"
- Identify commits between success and failure
- Review commit messages for infrastructure changes
- Check for merge commits and PR patterns
- Look for version bumps or configuration changes
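
To see what actually changed between the last good and first failing commits (the commit references are placeholders):

# Summarize files changed between the last successful and first failing commits
git diff --stat [last-successful-commit] [first-failing-commit]

# Show the full diff restricted to Terraform files
git diff [last-successful-commit] [first-failing-commit] -- '*.tf' '*.tfvars'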

Step 6: Examine Multi-Workspace Impact

Investigate if this is isolated or systemic:
- Query the TFC API for other workspace failures in the same timeframe (example below)
- Check related workspaces using similar modules
- Analyze organization-wide patterns
- Identify shared dependencies or modules
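
One way to check for organization-wide impact (a hedged sketch; assumes jq is available and the token can list all workspaces in the organization):

# List every workspace in the organization, then reuse the Step 1 run query per workspace
curl -s --globoff -H "Authorization: Bearer $TFC_TOKEN" \
  "https://app.terraform.io/api/v2/organizations/{org_name}/workspaces?page[size]=100" \
  | jq -r '.data[] | [.id, .attributes.name] | @tsv'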

Step 7: Discover Module and Provider Evolution

Research upstream changes:
- Check module registry for version releases during failure period
- Review provider changelogs and breaking changes
- Investigate dependency chain updates
- Look for security patches or mandatory upgrades

Step 8: Analyze Resource State Drift

Compare Terraform state with actual infrastructure:
- Identify resources that may have been modified outside Terraform
- Check for policy changes or access restrictions
- Look for resource lifecycle or naming conflicts
- Investigate state file corruption or inconsistencies
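
A refresh-only plan is a quick way to surface drift without proposing changes (a sketch; requires Terraform 0.15.4 or later and access to the state backend):

# Detect drift between recorded state and real infrastructure
terraform plan -refresh-only -detailed-exitcode

# Inspect individual resources recorded in state
terraform state list
terraform state show 'aws_s3_bucket.example'   # example address, substitute your own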

Phase 3: Systematic Failure Analysis

Step 9: Clone and Reproduce Environment

Set up local investigation environment:
git clone [infrastructure-repo-url]
git checkout [last-successful-commit]
Compare the working state with the failing configuration locally

Step 10: Progressive Configuration Analysis

Step through changes chronologically:
- Start with last known good configuration
- Apply changes incrementally until failure reproduces
- Identify the specific change that introduced the failure
- Understand the cascade effects of that change
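
git bisect can automate this chronological walk when each candidate commit can be tested locally (a hedged sketch; the plan command stands in for whatever check reproduces your failure):

# Binary-search the commit range for the change that introduced the failure
git bisect start
git bisect bad                            # the current (failing) commit
git bisect good [last-successful-commit]
# At each commit git bisect checks out, run the reproduction check, e.g.:
terraform init -input=false && terraform plan -input=false
# ...then mark the result with `git bisect good` or `git bisect bad`
git bisect reset                          # restore the original checkout when done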

Step 11: Deep Dive Error Analysis

Analyze failure logs systematically:
- Extract complete error traces and stack traces
- Identify root cause vs. symptom errors
- Map error messages to specific resources or modules
- Research error patterns in documentation and communities
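
To pull the complete plan log of a failed run for offline analysis (a sketch; assumes jq; the plan's log-read-url is a temporary signed URL):

# Find the plan ID attached to the failed run
PLAN_ID=$(curl -s -H "Authorization: Bearer $TFC_TOKEN" \
  "https://app.terraform.io/api/v2/runs/{run_id}" \
  | jq -r '.data.relationships.plan.data.id')

# Download the full plan log
curl -s -H "Authorization: Bearer $TFC_TOKEN" \
  "https://app.terraform.io/api/v2/plans/$PLAN_ID" \
  | jq -r '.data.attributes."log-read-url"' \
  | xargs curl -s > failed-plan.log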

Step 12: Validate Hypotheses with Targeted Tests

Create minimal reproduction cases:
- Isolate suspected problematic components
- Test individual modules or resources
- Validate version compatibility matrices
- Confirm environmental assumptions
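
A minimal reproduction can often be a throwaway directory that exercises only the suspect component (a sketch of the workflow; the actual configuration depends on what is being isolated):

# Reproduce in an isolated directory containing only the suspected module or resource
mkdir -p tf-repro && cd tf-repro
# Copy in just enough configuration to exercise the suspect component,
# pinning the module/provider versions under suspicion, then:
terraform init -input=false
terraform validate
terraform plan -input=false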

Phase 4: Solution Development & Validation

Step 13: Research Known Solutions and Workarounds

Search for existing solutions:
- Check module documentation and migration guides
- Review GitHub issues in relevant repositories
- Search community forums and Stack Overflow
- Consult Terraform registry and provider docs
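
The GitHub CLI listed in the prerequisites can search upstream issue trackers directly (a sketch; the repository and search string are placeholders, with the AWS provider shown only as an example):

# Search a provider's issue tracker for the key phrase from the error message
gh issue list --repo hashicorp/terraform-provider-aws \
  --search "[key phrase from the error message]" --state all --limit 20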

Step 14: Develop Targeted Remediation

Based on root cause analysis, implement fixes such as:
- Version pinning or upgrades
- Configuration parameter adjustments
- Resource adoption or import strategies
- Provider configuration modifications
- State manipulation if necessary
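
For the adoption/import and state-manipulation cases above, a hedged sketch (all addresses and IDs are placeholders):

# Adopt a resource that was created outside Terraform
terraform import 'aws_s3_bucket.example' '[existing-bucket-name]'

# Move a resource to a new address after a module refactor, instead of destroy/recreate
terraform state mv 'module.old_name.aws_s3_bucket.example' 'module.new_name.aws_s3_bucket.example'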

Step 15: Local Validation and Testing

Validate the fix comprehensively:
terraform init -upgrade
terraform validate
terraform plan -detailed-exitcode
- Ensure errors are resolved
- Verify no unintended side effects
- Test with realistic data and scenarios

Step 16: Staged Rollout Strategy

Plan deployment approach:
- Consider using terraform plan -target for selective fixes
- Plan rollback procedures
- Test in non-production environments first
- Prepare monitoring and validation checkpoints
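
For example, a targeted plan and apply limits the blast radius of the fix (the module address is a placeholder; -target is best reserved for exceptional situations like this):

# Plan and apply only the components touched by the fix
terraform plan -target='module.affected_module' -out=fix.tfplan
terraform apply fix.tfplan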

Phase 5: Knowledge Capture & Prevention

Step 17: Document Root Cause and Resolution

Create comprehensive documentation:
- Timeline of events and changes
- Root cause explanation with evidence
- Step-by-step resolution procedure
- Code changes and configuration updates
- Testing and validation results

Step 18: Establish Monitoring and Alerting

Prevent future occurrences:
- Set up alerts for similar failure patterns
- Monitor upstream dependencies for changes
- Implement automated testing for critical paths
- Create health checks for key components

Step 19: Create Regression Prevention

Implement safeguards:
- Add validation rules or constraints
- Create automated tests for failure scenarios
- Document upgrade procedures and compatibility matrices
- Establish change management processes
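
A simple safeguard is a pre-merge check run in CI before changes reach TFC (a minimal sketch; -backend=false lets validation run without state access):

# Example pre-merge checks to catch syntax and schema regressions early
terraform fmt -check -recursive
terraform init -backend=false -input=false
terraform validate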

Step 20: Share Knowledge and Update Procedures

Disseminate learnings:
- Update team runbooks and procedures
- Share findings with broader community if applicable
- Create reusable troubleshooting patterns
- Enhance organizational knowledge base

Expected Outcomes

After following these steps, you should have:

  • ✅ Clear timeline of when and why the failure occurred
  • ✅ Identified the specific change that triggered the issue
  • ✅ Developed and validated a targeted fix
  • ✅ Documented the issue for future reference
  • ✅ Implemented prevention measures
  • ✅ Enhanced organizational troubleshooting capabilities

Common Investigation Patterns

1. Version Cascade Failures

  • Upstream module updates changing defaults
  • Provider version compatibility issues
  • Terraform version deprecations
  • Dependency chain breaking changes

2. Resource Lifecycle Issues

  • Resources created outside Terraform
  • State drift from manual changes
  • Import/adoption conflicts
  • Resource naming or tagging changes

3. Environmental Changes

  • Cloud provider API changes
  • Network or security policy updates
  • DNS or connectivity modifications
  • Service deprecations or migrations

4. Configuration Drift

  • Gradual accumulation of small changes
  • Merge conflicts or incomplete updates
  • Variable scope or precedence issues
  • Module parameter evolution

Investigation Tools and Techniques

TFC API Queries

# Get workspace run history with details
# (--globoff stops curl from treating the brackets in page[size] as a URL glob)
curl --globoff -H "Authorization: Bearer $TFC_TOKEN" \
  "https://app.terraform.io/api/v2/workspaces/{workspace_id}/runs?page[size]=20"

# Compare configurations between runs
curl -H "Authorization: Bearer $TFC_TOKEN" \
  "https://app.terraform.io/api/v2/runs/{run_id}/configuration-version"

Git Analysis Commands

# Find changes in specific time window
git log --oneline --since="2025-09-01" --until="2025-10-01"

# Show file changes over time
git log -p --since="2025-09-01" -- main.tf

# Find when specific content was introduced
git log -S "resource_adoption" --source --all

Terraform State Analysis

# Compare state files
terraform show -json > current-state.json
terraform show -json [previous-state-file] > previous-state.json
diff current-state.json previous-state.json

# Validate resource drift
terraform plan -detailed-exitcode

Success Indicators

  • ✅ Failure timeline clearly established
  • ✅ Root cause identified with supporting evidence
  • ✅ Specific triggering change pinpointed
  • ✅ Fix validated locally and in staging
  • ✅ Resolution documented for future reference
  • ✅ Prevention measures implemented
  • ✅ Knowledge shared with team

Usage: Copy this prompt and provide it to any AI agent along with the TFC workspace details and a basic failure description. The agent will systematically investigate the temporal aspects of the failure and discover the environmental context that led to the issue.
