Use this prompt when investigating any failing Terraform Cloud (TFC) workspace. The methodology emphasizes temporal analysis, change detection, and environmental discovery to understand when failures began and what changes triggered them, and it applies to any Terraform infrastructure provisioning issue. Before starting, make sure you have:
- Access to Terraform Cloud API
- GitHub CLI tools available
- Git repository access
- Terraform CLI installed
Step 1: Establish Failure Timeline
Using the TFC API, analyze the workspace's run history:
- Identify when failures first started occurring
- Find the last successful run before failures began
- Map failure frequency and patterns over time
- Note any correlation with deployment schedules
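A minimal sketch of pulling the run history, assuming jq is installed, $TFC_TOKEN is set, and $WORKSPACE_ID holds the workspace's external ID (--globoff keeps the [ ] brackets in the query string literal):
# List recent runs with their status and creation time, newest first
curl -s --globoff -H "Authorization: Bearer $TFC_TOKEN" \
  "https://app.terraform.io/api/v2/workspaces/$WORKSPACE_ID/runs?page[size]=50" \
  | jq -r '.data[] | [.id, .attributes.status, .attributes["created-at"]] | @tsv'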
Step 2: Compare Failed vs. Successful Run Configurations
Extract and compare configurations between:
- Last successful run
- First failed run
- Most recent failed run
Identify differences in module versions, variables, provider versions, and Terraform versions.
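One way to pull each run's configuration version for a side-by-side diff, assuming $RUN_ID and $CONFIG_VERSION_ID are set for the runs of interest (the download endpoint below follows the public TFC API, but verify the exact path against current docs):
# Fetch configuration version metadata for a given run
curl -s -H "Authorization: Bearer $TFC_TOKEN" \
  "https://app.terraform.io/api/v2/runs/$RUN_ID/configuration-version" | jq '.data.attributes'
# Download the configuration archive for offline diffing
curl -s -L -H "Authorization: Bearer $TFC_TOKEN" -o config.tar.gz \
  "https://app.terraform.io/api/v2/configuration-versions/$CONFIG_VERSION_ID/download"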
Step 3: Analyze Failure Pattern Evolution
Track how error messages have changed over time:
- Are errors consistent or evolving?
- Do failures occur in different phases (plan vs apply)?
- Are there dependency or timing-related patterns?
- Have workarounds been attempted, and what were the results?
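Run objects carry status timestamps that help separate plan-phase from apply-phase failures; a rough filter over errored runs (field names as documented in the TFC API, verify against your API version):
# Show only errored runs along with their messages and status timestamps
curl -s --globoff -H "Authorization: Bearer $TFC_TOKEN" \
  "https://app.terraform.io/api/v2/workspaces/$WORKSPACE_ID/runs?page[size]=50" \
  | jq '.data[] | select(.attributes.status == "errored")
        | {id: .id, message: .attributes.message, timestamps: .attributes["status-timestamps"]}'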
Step 4: Map External Dependencies and Timing
Investigate external factors during failure timeframe:
- Provider version releases and breaking changes
- Module registry updates
- API deprecations or service changes
- Network, DNS, or connectivity issues
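Provider releases that shipped during the failure window are quick to list with the GitHub CLI; for example (the provider repository below is only an illustration):
# List recent provider releases and their publish dates
gh release list --repo hashicorp/terraform-provider-aws --limit 20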
Step 5: Scan Repository Change History
Analyze git repository during failure period:
git log --oneline --since="[last-success-date]" --until="[failure-date]"
- Identify commits between success and failure
- Review commit messages for infrastructure changes
- Check for merge commits and PR patterns
- Look for version bumps or configuration changes
Step 6: Examine Multi-Workspace Impact
Investigate if this is isolated or systemic:
- Query TFC API for other workspace failures in same timeframe
- Check related workspaces using similar modules
- Analyze organization-wide patterns
- Identify shared dependencies or modules
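A sketch for spotting organization-wide impact, assuming $TFC_ORG holds the organization name; repeat the run-history query from Step 1 for each workspace of interest:
# List all workspaces in the organization as a starting point
curl -s --globoff -H "Authorization: Bearer $TFC_TOKEN" \
  "https://app.terraform.io/api/v2/organizations/$TFC_ORG/workspaces?page[size]=100" \
  | jq -r '.data[] | [.id, .attributes.name] | @tsv'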
Step 7: Discover Module and Provider Evolution
Research upstream changes:
- Check module registry for version releases during failure period
- Review provider changelogs and breaking changes
- Investigate dependency chain updates
- Look for security patches or mandatory upgrades
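Module release history is available from the registry's versions endpoint; a sketch using a public module as a placeholder (adjust the path for a private registry):
# List published versions of a registry module
curl -s "https://registry.terraform.io/v1/modules/terraform-aws-modules/vpc/aws/versions" \
  | jq -r '.modules[0].versions[].version'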
Step 8: Analyze Resource State Drift
Compare Terraform state with actual infrastructure:
- Identify resources that may have been modified outside Terraform
- Check for policy changes or access restrictions
- Look for resource lifecycle or naming conflicts
- Investigate state file corruption or inconsistencies
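A refresh-only plan (Terraform 0.15.4+) is a low-risk way to surface drift without proposing configuration changes; a minimal sketch:
# Report differences between state and real infrastructure only
terraform plan -refresh-only
# With -detailed-exitcode, exit code 2 signals that drift (or other changes) was detected
terraform plan -refresh-only -detailed-exitcode; echo "exit code: $?"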
Step 9: Clone and Reproduce Environment
Set up local investigation environment:
git clone [infrastructure-repo-url]
git checkout [last-successful-commit]
Compare the last known-good configuration with the failing configuration locally
Step 10: Progressive Configuration Analysis
Step through changes chronologically:
- Start with last known good configuration
- Apply changes incrementally until failure reproduces
- Identify the specific change that introduced the failure
- Understand the cascade effects of that change
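When many commits landed between the last success and the first failure, git bisect automates this chronological walk; a sketch assuming terraform plan alone reproduces the failure:
# Binary-search the commit range for the change that breaks the plan
git bisect start
git bisect bad HEAD
git bisect good [last-successful-commit]
# bisect marks a commit bad whenever the command exits non-zero
git bisect run sh -c 'terraform init -input=false >/dev/null && terraform plan -input=false'
git bisect reset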
Step 11: Deep Dive Error Analysis
Analyze failure logs systematically:
- Extract complete error traces and stack traces
- Identify root cause vs. symptom errors
- Map error messages to specific resources or modules
- Research error patterns in documentation and communities
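Complete plan logs can be pulled through the API instead of scraping the UI; a sketch assuming $RUN_ID is set (the plan relationship and log-read-url attribute follow the TFC API, verify the field names):
# Resolve the plan ID for a failed run, then download its full log
PLAN_ID=$(curl -s -H "Authorization: Bearer $TFC_TOKEN" \
  "https://app.terraform.io/api/v2/runs/$RUN_ID" | jq -r '.data.relationships.plan.data.id')
LOG_URL=$(curl -s -H "Authorization: Bearer $TFC_TOKEN" \
  "https://app.terraform.io/api/v2/plans/$PLAN_ID" | jq -r '.data.attributes["log-read-url"]')
curl -s "$LOG_URL" > plan.log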
Step 12: Validate Hypotheses with Targeted Tests
Create minimal reproduction cases:
- Isolate suspected problematic components
- Test individual modules or resources
- Validate version compatibility matrices
- Confirm environmental assumptions
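One way to narrow scope without exercising the whole configuration is a targeted plan against the suspected component (the module address below is illustrative):
# Plan only the suspected module to see whether it reproduces the failure in isolation
terraform plan -target=module.networking -input=false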
Step 13: Research Known Solutions and Workarounds
Search for existing solutions:
- Check module documentation and migration guides
- Review GitHub issues in relevant repositories
- Search community forums and Stack Overflow
- Consult Terraform registry and provider docs
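The GitHub CLI can search issues in the relevant repositories directly, assuming a recent gh version (the repository and search string are placeholders):
# Search open issues in a provider repo for the observed error message
gh search issues "error message fragment" --repo hashicorp/terraform-provider-aws --state open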
Step 14: Develop Targeted Remediation
Based on root cause analysis, implement fixes such as:
- Version pinning or upgrades
- Configuration parameter adjustments
- Resource adoption or import strategies
- Provider configuration modifications
- State manipulation if necessary
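Where the fix is adopting resources that already exist, terraform import brings them under management; a sketch with a hypothetical resource address and ID:
# Import an existing resource into state (address and ID are illustrative)
terraform import aws_s3_bucket.logs my-existing-logs-bucket
On Terraform 1.5 or later, a declarative import block in configuration is an alternative worth considering, since the adoption then shows up in a reviewable plan.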
Step 15: Local Validation and Testing
Validate fix comprehensively:
terraform init -upgrade
terraform validate
terraform plan -detailed-exitcode
- Ensure errors are resolved
- Verify no unintended side effects
- Test with realistic data and scenarios
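The -detailed-exitcode flag lets a validation script distinguish "no changes", "error", and "changes pending"; a small sketch:
# Exit codes: 0 = no changes, 1 = error, 2 = plan succeeded with pending changes
terraform plan -detailed-exitcode -input=false
case $? in
  0) echo "clean: no changes" ;;
  2) echo "plan succeeded with pending changes - review before apply" ;;
  *) echo "plan failed - fix is not yet complete" ;;
esac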
Step 16: Staged Rollout Strategy
Plan deployment approach:
- Consider using terraform plan -target for selective fixes
- Plan rollback procedures
- Test in non-production environments first
- Prepare monitoring and validation checkpoints
Step 17: Document Root Cause and Resolution
Create comprehensive documentation:
- Timeline of events and changes
- Root cause explanation with evidence
- Step-by-step resolution procedure
- Code changes and configuration updates
- Testing and validation results
Step 18: Establish Monitoring and Alerting
Prevent future occurrences:
- Set up alerts for similar failure patterns
- Monitor upstream dependencies for changes
- Implement automated testing for critical paths
- Create health checks for key components
Step 19: Create Regression Prevention
Implement safeguards:
- Add validation rules or constraints
- Create automated tests for failure scenarios
- Document upgrade procedures and compatibility matrices
- Establish change management processes
Step 20: Share Knowledge and Update Procedures
Disseminate learnings:
- Update team runbooks and procedures
- Share findings with broader community if applicable
- Create reusable troubleshooting patterns
- Enhance organizational knowledge base
After following these steps, you should have:
- ✅ Clear timeline of when and why the failure occurred
- ✅ Identified the specific change that triggered the issue
- ✅ Developed and validated a targeted fix
- ✅ Documented the issue for future reference
- ✅ Implemented prevention measures
- ✅ Enhanced organizational troubleshooting capabilities
Common failure patterns to check against the evidence you gather:
Version and dependency changes:
- Upstream module updates changing defaults
- Provider version compatibility issues
- Terraform version deprecations
- Dependency chain breaking changes
State and resource drift:
- Resources created outside Terraform
- State drift from manual changes
- Import/adoption conflicts
- Resource naming or tagging changes
Environmental changes:
- Cloud provider API changes
- Network or security policy updates
- DNS or connectivity modifications
- Service deprecations or migrations
Incremental configuration changes:
- Gradual accumulation of small changes
- Merge conflicts or incomplete updates
- Variable scope or precedence issues
- Module parameter evolution
Useful commands for the investigation:
# Get workspace run history with details (--globoff keeps the [ ] brackets in the query literal)
curl --globoff -H "Authorization: Bearer $TFC_TOKEN" \
"https://app.terraform.io/api/v2/workspaces/{workspace_id}/runs?page[size]=20"
# Compare configurations between runs
curl -H "Authorization: Bearer $TFC_TOKEN" \
"https://app.terraform.io/api/v2/runs/{run_id}/configuration-version"# Find changes in specific time window
git log --oneline --since="2025-09-01" --until="2025-10-01"
# Show file changes over time
git log -p --since="2025-09-01" -- main.tf
# Find when specific content was introduced
git log -S "resource_adoption" --source --all
# Compare state files
terraform show -json > current-state.json
terraform show -json tfstate-backup > previous-state.json
diff current-state.json previous-state.json
# Validate resource drift
terraform plan -detailed-exitcode
Success criteria checklist:
- ✅ Failure timeline clearly established
- ✅ Root cause identified with supporting evidence
- ✅ Specific triggering change pinpointed
- ✅ Fix validated locally and in staging
- ✅ Resolution documented for future reference
- ✅ Prevention measures implemented
- ✅ Knowledge shared with team
Usage: Copy this prompt and provide it to any AI agent along with the TFC workspace details and a basic description of the failure. The agent will systematically investigate the temporal aspects of the failure and discover the environmental context that led to the issue.