AI Agent Troubleshooting Prompt: EKS Terraform Cloud Workspace Failures

Context

Use this prompt when investigating failing Terraform Cloud (TFC) workspaces that provision EKS clusters with Takeda Building Blocks. The methodology comes from resolving TFC workspace failures caused by VPC CNI resource adoption issues.

Prerequisites

- Access to Terraform Cloud API
- GitHub CLI tools available
- Git repository access
- Terraform CLI installed

20-Step Troubleshooting Procedure

Phase 1: Initial Investigation & Data Gathering

Step 1: Retrieve TFC Workspace Run Details

Using the TFC API, fetch the details of the failing run identified by the provided workspace and run ID. Extract:
- Run status and error messages
- Plan/apply logs
- Workspace configuration
- Module versions being used
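
The retrieval in this step can be sketched with curl against the Terraform Cloud v2 API. This is a minimal sketch under stated assumptions: the function names are hypothetical, `TFC_TOKEN` is assumed to hold an API token, and `TFC_API` is assumed to point at the public app.terraform.io endpoint unless overridden.

```shell
# Assumption: the public TFC endpoint; override TFC_API for another host.
TFC_API="${TFC_API:-https://app.terraform.io/api/v2}"

# Build the URL for a run, side-loading its plan and apply records.
tfc_run_url() {
  printf '%s/runs/%s?include=plan,apply\n' "$TFC_API" "$1"
}

# Fetch the run document as JSON (run ID looks like run-XXXX).
tfc_get_run() {
  curl --silent --show-error --fail \
    --header "Authorization: Bearer ${TFC_TOKEN:?set TFC_TOKEN to a TFC API token}" \
    --header "Content-Type: application/vnd.api+json" \
    "$(tfc_run_url "$1")"
}
```

If jq happens to be installed (it is not listed in the prerequisites), something like `tfc_get_run [run-id] | jq -r '.data.attributes.status'` should print the run status.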

Step 2: Extract Complete Apply Logs

Retrieve the full apply logs from the TFC run, focusing on:
- Error messages containing "does not exist"
- Resource adoption failures
- Kubernetes provider errors
- Module-specific failures
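
One possible way to pull the full apply log is to read the apply record and follow its `log-read-url` attribute, which is a temporary link to the raw log. The sketch below assumes jq is installed, `TFC_TOKEN` holds an API token, and the apply ID was taken from the run document (`.data.relationships.apply.data.id`); the function names are hypothetical.

```shell
# Assumption: the public TFC endpoint; override TFC_API for another host.
TFC_API="${TFC_API:-https://app.terraform.io/api/v2}"

# URL of the apply record, which carries a temporary link to the raw log.
tfc_apply_url() {
  printf '%s/applies/%s\n' "$TFC_API" "$1"
}

# Download the raw apply log for an apply ID (apply-XXXX) into a file.
tfc_save_apply_log() {
  apply_id="$1"
  out_file="$2"
  log_url=$(curl --silent --show-error --fail \
    --header "Authorization: Bearer ${TFC_TOKEN:?set TFC_TOKEN to a TFC API token}" \
    "$(tfc_apply_url "$apply_id")" | jq -r '.data.attributes["log-read-url"]') || return 1
  # The log link is pre-authorised, so no token header is sent here.
  curl --silent --show-error --fail --location --output "$out_file" "$log_url"
}
```

Saving the log to a file keeps it available for repeated searches, since the log link expires.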

Step 3: Identify Root Error Patterns

Analyze logs for patterns like:
- "The resource 'aws-node' does not exist"
- "The resource 'amazon-vpc-cni' does not exist"
- Resource import/adoption failures
- Version compatibility issues
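
The pattern check above can be applied to a saved log with a small helper. This is a minimal sketch, assuming the apply log is already in a local file; `summarize_log` is a hypothetical name, and the two strings are the ones quoted in this step.

```shell
# Count how often each known error string appears in a saved apply log.
summarize_log() {
  log_file="$1"
  for pattern in \
    "The resource 'aws-node' does not exist" \
    "The resource 'amazon-vpc-cni' does not exist"
  do
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    count=$(grep -c -F -- "$pattern" "$log_file" || true)
    printf '%s\t%s\n' "$count" "$pattern"
  done
}
```

Running `summarize_log apply.log` prints one tab-separated count per pattern; `grep -n -F "does not exist" apply.log` then shows the matching lines with their line numbers.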

Step 4: Determine Building Block Versions

From the TFC logs, identify:
- Terraform module versions in use
- Building block version (e.g., EKSClusterResources v4.1.2)
- Provider versions (AWS, Kubernetes, Helm)
- Any recent version upgrades

Phase 2: Building Block Analysis

Step 5: Clone Building Block Repository

Clone the relevant building block repository and check out the version used by the failing workspace:
git clone https://github.com/oneTakeda/terraform-aws-EKSClusterResources.git
cd terraform-aws-EKSClusterResources
git checkout [version-tag]

Step 6: Examine Building Block Variables

Review variables.tf in the building block to understand:
- Available configuration options
- Default values for key parameters
- Recent changes in variable defaults
- Resource adoption settings
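
To get a quick overview of the defaults, the variable names and their single-line default values can be listed with awk. This is a rough sketch, not a full HCL parser: `list_variable_defaults` is a hypothetical helper, it assumes each `variable` block starts at the beginning of a line, and multi-line defaults (maps, lists) show only their opening bracket.

```shell
# Print "name = default" for every variable that declares a default.
list_variable_defaults() {
  awk '
    # Remember the name from lines such as: variable "resource_adoption" {
    /^variable[ \t]+"/ { split($0, parts, "\""); name = parts[2] }
    # Print the text after "default =" together with the current variable name.
    /^[ \t]*default[ \t]*=/ {
      value = $0
      sub(/^[ \t]*default[ \t]*=[ \t]*/, "", value)
      print name " = " value
    }
  ' "$1"
}
```

For example, `list_variable_defaults variables.tf | grep -i adopt` narrows the output to adoption-related settings.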

Step 7: Analyze Version History

Check git history and releases to identify:
- Changes in default values between versions
- Breaking changes or behavioral modifications
- Migration guides or upgrade notes
- Backward compatibility issues
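
Default changes between two versions can be surfaced with plain git. The sketch below assumes the building block tags its releases and keeps its inputs in variables.tf; `compare_versions` is a hypothetical helper and the refs passed to it are whatever tags or commits the repository actually uses.

```shell
# Show what changed in variables.tf between two refs of a cloned repository.
compare_versions() {
  repo="$1"
  old="$2"
  new="$3"
  # Commits that touched variables.tf between the two versions.
  git -C "$repo" log --oneline "$old..$new" -- variables.tf
  # Exact changes to variable declarations and defaults.
  git -C "$repo" diff "$old" "$new" -- variables.tf
}
```

A call such as `compare_versions ./terraform-aws-EKSClusterResources [old-tag] [new-tag]` prints the commit list followed by the diff.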

Step 8: Research Resource Adoption Behavior

Investigate how the building block handles:
- Existing EKS-created resources
- Resource adoption vs. creation
- Helm chart management
- Kubernetes resource management

Phase 3: Repository Investigation

Step 9: Clone Target Infrastructure Repository

Clone the failing infrastructure repository:
git clone [infrastructure-repo-url]
cd [repo-name]
git checkout [appropriate-branch]

Step 10: Examine Current Configuration

Review the module configuration in main.tf:
- Building block version being used
- Parameters passed to the module
- Missing or default parameters
- Resource adoption settings
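
Whether a given argument is set at all can be checked with grep. This is a minimal sketch: `show_setting` is a hypothetical helper that matches simple `name = value` lines and knows nothing about HCL block structure, so results from a file with several module blocks need a manual look.

```shell
# Print "line:text" for every assignment to the given argument name.
# Exits non-zero when the argument is not set anywhere in the file.
show_setting() {
  grep -n -E "^[[:space:]]*$2[[:space:]]*=" "$1"
}
```

For instance, `show_setting main.tf version` shows the pinned module version, and `show_setting main.tf resource_adoption || echo "not set, building block default applies"` flags a missing adoption setting.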

Step 11: Check Variable Definitions

Review parameters.auto.tfvars and variables.tf:
- Shared tags configuration
- Environment-specific settings
- Cluster configuration
- Any version-specific requirements

Step 12: Validate Configuration Structure

Ensure the configuration is well formed:
- Module source and version
- Required parameters present
- Terraform backend configuration
- Provider version compatibility

Phase 4: Root Cause Analysis & Fix Development

Step 13: Correlate Building Block Changes with Failures

Compare:
- Building block default changes between versions
- Current configuration parameters
- Missing parameters that could resolve the issue
- Version upgrade impact on existing resources

Step 14: Develop Targeted Fix

Based on the analysis, implement a fix such as:
- Adding the resource_adoption = false parameter to the module block
- Updating configuration for new version requirements
- Adjusting provider settings
- Modifying resource management approach

Step 15: Local Validation

Run the Terraform workflow locally:
- terraform init (validate module download)
- terraform validate (check syntax)
- terraform plan (verify fix effectiveness)
- Ensure original errors are resolved
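
The local workflow above can be wrapped in one function that stops at the first failing command and then checks the plan output for the original error. This is a sketch under assumptions: `validate_fix` is a hypothetical name, the string "does not exist" stands in for whatever error the failing TFC run reported, and local credentials are assumed to be sufficient to produce a plan.

```shell
# Run init, validate and plan in a directory; return 2 if the plan output
# still contains the error string seen in the failing TFC run.
validate_fix() {
  dir="$1"
  log="${2:-plan.log}"
  terraform -chdir="$dir" init -input=false || return 1
  terraform -chdir="$dir" validate || return 1
  # Keep the plan output in a file so it can be attached to the write-up.
  terraform -chdir="$dir" plan -input=false -no-color > "$log" 2>&1 || {
    cat "$log"
    return 1
  }
  if grep -q -F "does not exist" "$log"; then
    return 2
  fi
  return 0
}
```

Calling `validate_fix . plan.log` from the repository root leaves the plan output in plan.log for later reference.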

Step 16: Handle Local Testing Limitations

Address local vs. remote workspace differences:
- Modify locals.tf for local testing if needed
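- Handle workspace name parsing issues (in a purely local run, terraform.workspace is "default" unless another workspace is selected, so logic that parses the TFC workspace name may fail)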
- Ensure validation works in both contexts
- Document temporary modifications

Phase 5: Documentation & Knowledge Sharing

Step 17: Create KEDB Issue

Create a comprehensive GitHub issue in terraform-Takeda-KEDB with:
- Clear problem description and symptoms
- Root cause explanation
- Step-by-step resolution
- Code examples and configuration changes
- Version compatibility notes
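
The issue can be drafted and filed with the GitHub CLI. The sketch below is illustrative only: the function names are hypothetical, the section titles simply mirror the list above, and the repository owner in `KEDB_REPO` is an assumption borrowed from the building block clone URL, so adjust it to wherever terraform-Takeda-KEDB actually lives.

```shell
# Assumption: the KEDB repository sits under the same owner as the building block.
KEDB_REPO="${KEDB_REPO:-oneTakeda/terraform-Takeda-KEDB}"

# Print an empty issue skeleton with one section per item listed above.
kedb_issue_body() {
  cat <<'EOF'
Problem and symptoms:

Root cause:

Resolution steps:

Configuration changes:

Version compatibility:
EOF
}

# File the issue from a body file that has been filled in by hand.
create_kedb_issue() {
  gh issue create --repo "$KEDB_REPO" --title "$1" --body-file "$2"
}
```

A typical flow would be `kedb_issue_body > issue.md`, editing issue.md, then `create_kedb_issue "[title]" issue.md`.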

Step 18: Document Fix Validation

Add detailed validation results to the KEDB issue:
- Terraform workflow execution results
- Before/after error comparison
- Plan output confirmation
- Link to original failing TFC run

Step 19: Commit and Prepare Deployment

Commit the changes with a descriptive message, then prepare them for deployment:
- Reference the KEDB issue
- Explain the root cause briefly
- Note the TFC run being fixed
- Stage the change for deployment (if permissions allow)
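
A commit message covering these points could be assembled as follows. This is only a sketch: `commit_message` is a hypothetical helper and the trailer labels ("Root cause", "TFC run", "Refs") are an illustrative layout, not a team convention taken from the source.

```shell
# Build a commit message from a subject, root cause, TFC run ID and KEDB issue reference.
commit_message() {
  printf '%s\n\nRoot cause: %s\nTFC run: %s\nRefs: %s\n' "$1" "$2" "$3" "$4"
}
```

The output can be piped straight into git, for example `commit_message "[subject]" "[root cause]" "[run-id]" "[kedb-issue]" | git commit -F -`.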

Step 20: Create Reusable Methodology

Document the troubleshooting process:
- Create step-by-step guide
- Include common patterns and solutions
- Share methodology for similar issues
- Update team knowledge base

Expected Outcomes

After following these steps, you should have:

✅ Identified the root cause of the TFC workspace failure
✅ Developed and validated a targeted fix
✅ Documented the issue and resolution in KEDB
✅ Created a deployable solution
✅ Established a reusable troubleshooting methodology

Common Patterns to Watch For

- Version Upgrade Issues: Building blocks changing default behavior
- Resource Adoption Conflicts: New versions trying to manage existing resources
- Provider Compatibility: Version mismatches between Terraform providers
- Kubernetes Resource Management: Conflicts between Terraform and native EKS resources

Success Indicators

- Original error messages no longer appear in terraform plan
- Plan shows expected resource creation/modification
- No resource adoption or import conflicts
- Clean terraform validate and plan execution
- Comprehensive documentation created for future reference

Usage: Copy this prompt and provide it to any AI agent along with the specific TFC workspace details and failure symptoms. The agent should be able to systematically work through the investigation and resolution process.

shalomb/AI_Agent_EKS_TFC_Troubleshooting_Guide.md