- Solution Overview
- Background
- Architecture
- Workflow
- Components
- Implementation Details
- Timing Design Considerations
- Configuration
- Installation & Setup
- Usage Guide
- Troubleshooting
- Testing & Validation
- Limitations & Future Improvements
The Automated Failover Solution provides a temporary but robust mechanism to automatically mitigate COMException issues in the production environment while the root cause is being investigated. By continuously monitoring event logs on the active server, the system detects patterns of COMException errors and triggers a complete failover cycle to minimize business impact, particularly during working hours.
Key features include:
- Automatic detection of COMException events in Windows Event Logs
- Configurable threshold-based triggering of failover operations
- Complete failover cycle with DNS switching and server management
- Dynamic DNS TTL management to optimize failover responsiveness
- IIS application pool handling during failover
- Automatic server restart and recovery
- Cooldown period to prevent cascading failovers
- Support for multiple environments (Dev/Prod)
- Comprehensive logging and notification
This solution serves as a temporary measure while the root cause of the COMException issues is being investigated, providing business continuity and minimizing user-visible impact.
The production environment has been experiencing intermittent COMException issues which occasionally impact service availability. While the development team is working on identifying and fixing the root cause, a temporary automated solution is needed to minimize business impact.
Currently, the operations process is reactive:
- System alerts notify when exceptions occur
- Operations staff manually restart the affected server
- Service is restored after reboot
- Business operations are impacted during the entire process
This manual approach has several disadvantages:
- Delay between issue occurrence and resolution
- Requires 24/7 staff availability for immediate response
- Significant business impact during working hours
- Inconsistent handling across incidents
The automated solution addresses these issues by:
- Proactively monitoring for early symptoms of the problem
- Automatically executing the failover procedure
- Minimizing downtime through quick detection and response
- Providing consistent handling of incidents
- Enabling seamless operation during working hours
The solution follows a modular architecture with clear separation of concerns:
```
┌─────────────────────────┐
│      Tooling Server     │
│                         │
│  ┌───────────────────┐  │
│  │  Scheduled Task   │  │
│  │  Every 5 minutes  │  │
│  └─────────┬─────────┘  │
│            │            │
│  ┌─────────▼─────────┐  │
│  │autoFailoverMonitor│  │
│  └─────────┬─────────┘  │
│            │            │
│  ┌─────────▼─────────┐  │
│  │ completeFailover  │  │
│  │       Cycle       │  │
│  └─────────┬─────────┘  │
│            │            │
│  ┌─────────▼─────────┐  │
│  │  dnsFailover_v2   │  │
│  └───────────────────┘  │
└────────────┬────────────┘
             │
             ▼
┌──────────────────┐       ┌──────────────────┐
│  Active Server   │◄─────►│  Standby Server  │
│  (Server0001)    │       │  (Server0002)    │
└──────────────────┘       └──────────────────┘
```
Key architectural principles:
- Modularity: Each component has a clearly defined responsibility
- Robustness: Comprehensive error handling and recovery
- Configurability: Environment-specific settings and thresholds
- Transparency: Detailed logging for troubleshooting and auditing
- Proactive Optimization: Dynamic DNS TTL management for improved failover response
The solution implements the following workflow:
1. Monitoring Phase:
   - The scheduled task runs `autoFailoverMonitor.ps1` every 5 minutes
   - The script checks Windows Event Logs on the active server for COMException events
   - If 3 or more events are detected within the last 10 minutes (configurable), the failover process is triggered
   - If the system is within the 45-minute cooldown period from a previous failover, monitoring continues without action
   - Upon first error detection, DNS TTL is proactively reduced to speed up an eventual failover

2. First Failover (A → B):
   - `completeFailoverCycle.ps1` is invoked to manage the entire failover process
   - DNS records are updated via `dnsFailover_v2.ps1` to point to the standby server
   - IIS application pools on the new active server are restarted and recycled
   - The system waits for DNS propagation and confirms the standby server is now active

3. Server Restart:
   - The original server (previously active, now standby) is restarted
   - The system waits for the server to come back online and stabilize
   - Server availability is verified through connectivity checks

4. Second Failover (B → A):
   - DNS failover is performed again to switch back to the original server
   - IIS application pools on the original server are restarted and recycled
   - The system verifies DNS propagation and service availability
   - Standard DNS TTL values are restored
   - The cooldown period begins to prevent cascading failovers

5. Notification:
   - An email notification is sent with the results and timeline
   - Detailed logs are available for review
The monitoring engine (autoFailoverMonitor.ps1) is the core component responsible for:
- Detecting COMException events in the Windows Event Log
- Applying threshold rules to determine when to trigger failover
- Managing DNS TTL values based on error detection
- Tracking state across multiple runs
- Enforcing cooldown periods
- Initializing the system
Key features:
- Configurable error threshold (default: 3 events)
- Configurable time window (default: 10 minutes)
- Configurable cooldown period (default: 45 minutes)
- State persistence across runs
- Multiple operation modes:
  - Single-run monitoring
  - Continuous service mode
  - Initialization mode
  - Test mode
  - Force failover mode
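To make the per-run decision flow concrete, here is a minimal sketch of one monitoring pass. `Get-ComExceptionEvents` and `Set-FailoverTtl` are hypothetical helper names standing in for internals of `autoFailoverMonitor.ps1` that this document does not reproduce; the parameter names come from the Configuration section.

```powershell
# Minimal sketch of one monitoring pass (hypothetical helper names, not the
# actual autoFailoverMonitor.ps1 internals)
$events = Get-ComExceptionEvents -ComputerName $activeServer -WindowMinutes $TimeWindowMinutes
$sinceFailover = (Get-Date) - [datetime]$state.LastFailoverTime
$inCooldown = $sinceFailover.TotalMinutes -lt $CooldownPeriodMinutes

if ($events.Count -ge 1 -and $state.TTLStatus -eq "Standard") {
    # Early warning: reduce TTL before the threshold is reached
    Set-FailoverTtl -Minutes $ReducedTTLMinutes
}

if ($events.Count -ge $ErrorThreshold -and -not $inCooldown) {
    # Threshold met outside the cooldown window: hand off to the orchestrator
    & $CompleteFailoverScriptPath -Env $Env
}
```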
The failover orchestration component (completeFailoverCycle.ps1) manages the entire failover process:
- First failover from primary to secondary server
- Server restart operation
- Second failover back to primary server
- Verification of each step's success
Key features:
- End-to-end orchestration of the failover process
- Server availability checking with multiple methods
- Configurable timeouts for each operation phase
- Options to skip phases if needed
- Detailed logging of the entire process
The DNS management component (dnsFailover_v2.ps1) handles:
- Determining the current active server
- Updating DNS records during failover
- Verifying DNS propagation
- Managing IIS application pools on the target server
Key features:
- Environment-specific configurations (Dev/Prod)
- Support for different DNS servers and zones
- IIS application pool management (restart and recycle)
- Comprehensive error handling
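As an illustration of what the DNS update and pool handling involve, the sketch below uses the Windows `DnsServer` and `WebAdministration` modules. The record name `app-alias` is a placeholder, not a value from the actual script; `Server0002` is the standby server from the architecture diagram.

```powershell
# Sketch of a CNAME flip to the standby server (placeholder record name)
Import-Module DnsServer

$old = Get-DnsServerResourceRecord -ComputerName $dnsServer -ZoneName $lookupZone `
    -Name "app-alias" -RRType CName
$new = $old.Clone()
$new.RecordData.HostNameAlias = "Server0002.$lookupZone."   # point the alias at the standby
Set-DnsServerResourceRecord -ComputerName $dnsServer -ZoneName $lookupZone `
    -OldInputObject $old -NewInputObject $new

# Restart and recycle application pools on the newly active server
Invoke-Command -ComputerName "Server0002" -ScriptBlock {
    Import-Module WebAdministration
    Get-ChildItem IIS:\AppPools | ForEach-Object { Restart-WebAppPool -Name $_.Name }
}
```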
The solution incorporates dynamic DNS TTL (Time-To-Live) management to optimize both failover performance and regular operations:
- Adaptive TTL Adjustment: Automatically reduces DNS TTL values when early signs of issues are detected
- Proactive DNS Caching Preparation: Shortens cache times before potential failover events
- Environment-Specific Settings: Different TTL values for Dev vs Prod environments
- Post-Failover Normalization: Restores standard TTL values after successful failover
The TTL lifecycle consists of distinct phases:
1. Standard Operation:
   - During normal operations, a standard TTL value (default: 3 minutes) is applied
   - This balances normal caching efficiency with reasonable update times

2. Early Warning Phase:
   - When the first COMException is detected (before reaching the threshold)
   - TTL is proactively reduced (default: 1 minute)
   - This prepares for a potential failover by ensuring clients refresh DNS records sooner

3. During Failover:
   - The reduced TTL ensures clients pick up DNS changes quickly
   - Minimizes client-side impact during the transition between servers

4. Post-Failover Recovery:
   - After successful failover completion, TTL is reset to the standard value
   - This transition is tracked in the state management system
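For reference, a TTL change of this kind can be expressed with the `DnsServer` module roughly as below. This is a sketch under the assumption that the failover alias is a CNAME record named `app-alias`; it is not the script's actual code.

```powershell
# Sketch: reduce the TTL on the failover alias (placeholder record name)
$old = Get-DnsServerResourceRecord -ComputerName $dnsServer -ZoneName $lookupZone `
    -Name "app-alias" -RRType CName
$new = $old.Clone()
$new.TimeToLive = [TimeSpan]::FromMinutes($ReducedTTLMinutes)   # e.g. 1 minute in the early warning phase
Set-DnsServerResourceRecord -ComputerName $dnsServer -ZoneName $lookupZone `
    -OldInputObject $old -NewInputObject $new
```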
The implementation includes robust retry mechanisms and verification to ensure TTL changes are properly applied. The solution tracks TTL state with the following attributes in the state file:

```json
{
    "TTLStatus": "Standard",
    "LastTTLChange": "2025-03-20T09:00:00.0000000+00:00"
}
```

`TTLStatus` can be `"Standard"` or `"Reduced"`. Benefits of the dynamic TTL management approach include:
- Improved failover speed with quicker client transitions
- Reduced client impact during server transitions
- Optimized cache efficiency during normal operations
- Early preparation for potential issues by preemptively reducing TTL
The testing framework (COMExceptionEventSimulator.ps1) provides:
- Simulation of COMException events for testing
- Configurable event generation patterns
- Realistic error messages and timestamps
Key features:
- Event distribution across configurable time periods
- Realistic COMException error messages
- Random variation in event details
- Administrative permission handling
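The core of such a simulator can be as simple as writing error-level events into the Application log. The sketch below assumes a registered event source named `FailoverSimulator`, which is a placeholder and not necessarily what `COMExceptionEventSimulator.ps1` uses.

```powershell
# Sketch: emit one synthetic COMException-style event (placeholder source name)
if (-not [System.Diagnostics.EventLog]::SourceExists("FailoverSimulator")) {
    New-EventLog -LogName Application -Source "FailoverSimulator"   # requires admin rights
}
Write-EventLog -LogName Application -Source "FailoverSimulator" -EventId 1001 -EntryType Error `
    -Message "System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80010105..."
```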
The system uses Windows Event Log queries to identify COMException errors. The monitoring engine performs the following steps:
- Defines a time window based on the configuration (default: 10 minutes)
- Creates a filter for the Windows Event Log query
- Executes the query on the active server
- Filters events containing "COMException" in the message text
- Counts matching events and compares with the threshold
```powershell
# Create the filter for the event log query
$filter = @{
    LogName   = $logName
    StartTime = (Get-Date).AddMinutes(-$TimeWindowMinutes)
    EndTime   = Get-Date
}

# Execute the query with robust error handling
try {
    $events = Get-WinEvent -FilterHashtable $filter -ComputerName $ComputerName -ErrorAction Stop

    # Filter for COMException events (case-insensitive match)
    $comExceptionEvents = $events | Where-Object { $_.Message -match "(?i)COMException" }
}
catch {
    # Get-WinEvent throws when the filter matches nothing; treat that as an empty result
    if ($_.Exception.Message -match "No events were found") {
        # Return an empty array, not $null
        return @()
    }
    throw $_
}
```

The complete failover cycle involves multiple steps:
1. Initial State Assessment:
   - Identify the current active server
   - Verify server accessibility

2. First Failover (A → B):
   - Update the DNS CNAME record to point to the standby server
   - Wait for DNS propagation
   - Restart and recycle IIS application pools on the new active server

3. Server Restart:
   - Attempt the restart using multiple methods, for resilience (see the sketch after this list):
     - PowerShell `Invoke-Command` with `Restart-Computer`
     - WMI `Win32_OperatingSystem`
     - `shutdown.exe` remote command
   - Verify the server goes offline
   - Wait for the server to come back online
   - Verify the server is fully operational

4. Second Failover (B → A):
   - Update the DNS CNAME record to switch back to the original server
   - Wait for DNS propagation
   - Restart and recycle IIS application pools on the original server

5. Final Verification:
   - Confirm DNS is pointing to the intended server
   - Verify service availability
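A condensed sketch of the restart fallback chain is shown below. The function name is illustrative rather than taken from `completeFailoverCycle.ps1`, and error handling is reduced to warnings for brevity.

```powershell
# Sketch of the multi-method restart fallback (illustrative function name)
function Restart-RemoteServer {
    param([string]$ComputerName)

    try {
        # Method 1: PowerShell remoting
        Restart-Computer -ComputerName $ComputerName -Force -ErrorAction Stop
        return $true
    } catch { Write-Warning "Restart-Computer failed: $_" }

    try {
        # Method 2: WMI
        $os = Get-WmiObject -Class Win32_OperatingSystem -ComputerName $ComputerName -EnableAllPrivileges
        $os.Reboot() | Out-Null
        return $true
    } catch { Write-Warning "WMI reboot failed: $_" }

    # Method 3: shutdown.exe against the remote machine (/r restart, /f force, /t 0 no delay)
    & shutdown.exe /r /f /t 0 /m "\\$ComputerName"
    return ($LASTEXITCODE -eq 0)
}
```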
The monitoring system maintains state between executions using a JSON state file:
```json
{
    "Errors": [
        {
            "TimeCreated": "2025-03-20T10:15:30.0000000+00:00",
            "EventID": 1001,
            "Level": "Error",
            "LogName": "Application",
            "Message": "System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80010105..."
        }
    ],
    "LastFailoverTime": "2025-03-20T09:30:00.0000000+00:00",
    "FailoverCount": 2,
    "TTLStatus": "Standard",
    "LastTTLChange": "2025-03-20T09:35:00.0000000+00:00"
}
```

This allows the system to:
- Track the recent error history
- Enforce cooldown periods
- Maintain continuity across script executions
- Provide statistics on failover frequency
- Track DNS TTL status changes
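In practice the cooldown check reduces to date arithmetic on the state file, roughly as sketched here; the parameter names come from the Configuration section below.

```powershell
# Sketch: load state and enforce the cooldown window
$state = Get-Content -Path $stateFilePath -Raw | ConvertFrom-Json
$sinceFailover = (Get-Date) - [datetime]$state.LastFailoverTime

if ($sinceFailover.TotalMinutes -lt $CooldownPeriodMinutes) {
    Write-Output ("In cooldown: {0:N1} of {1} minutes elapsed; skipping failover." -f `
        $sinceFailover.TotalMinutes, $CooldownPeriodMinutes)
    return
}
```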
The monitoring system employs specific timing parameters that balance responsiveness, reliability, and resource efficiency. This section analyzes the design considerations behind these timing choices.
The 5-minute interval for the monitoring task was selected based on the following considerations:
1. Response Time Balance:
   - 5 minutes provides a reasonable balance between quick response and system overhead
   - Shorter intervals (1-2 minutes) would increase system load without significant benefit
   - Longer intervals (10+ minutes) would delay response to developing issues

2. Coordination with Error Window:
   - The 5-minute interval works well with the 10-minute error detection window
   - Two consecutive checks can completely refresh the error window

3. Resource Consumption:
   - The script typically completes in under 30 seconds
   - The 5-minute interval keeps monitoring CPU utilization under 0.5%
   - Allows for occasional longer runs without causing task overlap

4. Operational Considerations:
   - Aligns with typical system monitoring intervals
   - Frequent enough to catch issues before significant user impact
   - The 10-minute task execution timeout provides sufficient buffer
The 10-minute window for counting COMException events provides these benefits:
1. Error Pattern Recognition:
   - Long enough to identify patterns (vs. isolated incidents)
   - Short enough to respond quickly to sudden issue clusters

2. Operational Alignment:
   - Matches typical user-reported issue timeframes
   - Provides sufficient context for error correlation

3. Memory Management:
   - Keeps state file size reasonable with limited event history
   - Prevents memory consumption issues with large event collections

4. Statistical Significance:
   - 10 minutes provides enough samples to differentiate between:
     - Normal background errors (1-2 random events)
     - Actual system failure conditions (3+ clustered events)
The threshold of 3 COMException events was determined through:
1. Baseline Analysis:
   - Historical data showed healthy systems typically log 0-1 COMException events in 10 minutes
   - Problem states consistently show 3+ events
   - Setting the threshold at 3 provides clear separation between normal and problem states

2. False Positive Prevention:
   - A threshold of 3 events minimizes the chance of triggering on random errors
   - Lower thresholds (1-2) produced an unacceptable false positive rate in testing

3. Response Time Optimization:
   - 3 events typically occur within 2-5 minutes of the beginning of an issue
   - Allows for quick response while maintaining accuracy
The 45-minute cooldown between failover operations provides:
1. System Stability:
   - Allows both servers to fully stabilize after a complete failover cycle
   - Prevents rapid oscillation between servers (A→B→A→B)

2. Error Pattern Observation:
   - Provides sufficient time to determine whether the failover resolved the issue
   - Allows collection of new baseline metrics post-failover

3. Administrative Response Window:
   - Gives the operations team time to analyze logs before the next automatic action
   - Aligns with typical incident response mobilization time (30-60 minutes)

4. Resource Protection:
   - Prevents excessive DNS updates in short timeframes
   - Avoids potential RPC and network congestion from frequent operations
The relationship between the various timing parameters creates a cohesive monitoring system:
1. Detection Scenario:
   - The task runs every 5 minutes
   - Each run considers events from the past 10 minutes
   - This creates overlapping detection windows
   - Ensures no brief error burst is missed between runs

2. Maximum Detection Time:
   - Worst case: an issue begins immediately after a check completes
   - The next check occurs in 5 minutes
   - Maximum time from first error to detection: 5 minutes
   - Average detection time: 2.5 minutes

3. Full Cycle Timing:
   - Detection phase: ~2.5 minutes average
   - First failover: ~5 minutes
   - Server restart: ~10 minutes
   - Second failover: ~5 minutes
   - Total cycle: ~22.5 minutes
   - The 45-minute cooldown provides a 2x buffer over the complete cycle
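The window overlap behind these numbers is easy to verify: with a 5-minute interval and a 10-minute lookback, consecutive runs overlap by 5 minutes, so no error burst can fall between checks. A trivial check of the arithmetic:

```powershell
# Worst-case detection latency with the documented parameters
$intervalMinutes = 5     # scheduled task interval
$windowMinutes   = 10    # event lookback per run

$overlap  = $windowMinutes - $intervalMinutes   # 5 minutes: consecutive windows overlap
$maxDelay = $intervalMinutes                    # error lands just after a check completes
$avgDelay = $intervalMinutes / 2                # errors arrive uniformly between checks

"Overlap: $overlap min; max detection delay: $maxDelay min; average: $avgDelay min"
```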
The solution allows configuration through script parameters:
```powershell
# Environment selection
-Env "Dev"                    # or "Prod"

# Error detection settings
-ErrorThreshold 3             # Number of events to trigger failover
-TimeWindowMinutes 10         # Time window for counting events

# Protection settings
-CooldownPeriodMinutes 45     # Minimum time between failovers

# TTL management
-DefaultTTLMinutes 3          # Standard TTL value in minutes
-ReducedTTLMinutes 1          # Reduced TTL when errors are detected

# Paths
-CompleteFailoverScriptPath "C:\Scripts\completeFailoverCycle.ps1"
-logFilePath "C:\Logs\AutoFailoverMonitor.log"
-stateFilePath "C:\Logs\AutoFailoverState.json"

# DNS settings
-dnsServer "dns.example.com"
-lookupZone "example.com"

# Timeout settings
-serverRestartTimeout 600             # Seconds to wait for server restart
-serverAvailabilityCheckInterval 15   # Seconds between availability checks
-maxFailoverWaitTime 1800             # Maximum total failover time

# Operation control
-SkipServerRestart            # Skip server restart phase
-SkipSecondFailover           # Skip second failover phase
```

Dev environment:
- Servers: NTATCA1515 and NTATCA1516
- DNS TTL: Default 3 minutes, Reduced 1 minute
- Initial Wait: 30 seconds
- Check Interval: 10 seconds
Prod environment:
- DNS TTL: Default 3 minutes, Reduced 1 minute
- Initial Wait: 300 seconds
- Check Interval: 60 seconds
- Windows Server with PowerShell 5.1 or later
- Administrative privileges on the tooling server
- DNS management permissions
- Remote management access to application servers
- Network connectivity to all systems
1. Create Script Directory:

   ```powershell
   mkdir C:\Scripts\FailoverAutomation
   ```

2. Copy Scripts to Directory:
   - `autoFailoverMonitor.ps1`
   - `completeFailoverCycle.ps1`
   - `dnsFailover_v2.ps1`
   - `COMExceptionEventSimulator.ps1` (for testing only)

3. Create Log Directory:

   ```powershell
   mkdir C:\Logs\FailoverAutomation
   ```

4. Initialize the Monitoring System:

   ```powershell
   .\autoFailoverMonitor.ps1 -Initialize -Env Prod
   ```

5. Create Scheduled Task:

   ```powershell
   # Alternative to built-in task creation
   $action = New-ScheduledTaskAction -Execute "powershell.exe" `
       -Argument "-NoProfile -ExecutionPolicy Bypass -File `"C:\Scripts\FailoverAutomation\autoFailoverMonitor.ps1`" -Env `"Prod`""
   $trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 5)
   $settings = New-ScheduledTaskSettingsSet -ExecutionTimeLimit (New-TimeSpan -Minutes 10) -RestartCount 3
   Register-ScheduledTask -TaskName "FailoverMonitor_Prod" -Action $action -Trigger $trigger -Settings $settings -RunLevel Highest
   ```

6. Test the Setup:

   ```powershell
   # Test in Dev environment first
   .\autoFailoverMonitor.ps1 -Env Dev -TestMode
   ```
The scheduled task will handle routine monitoring automatically. To manually run a monitoring check:

```powershell
.\autoFailoverMonitor.ps1 -Env Prod
```

To trigger a failover regardless of the error threshold:

```powershell
.\autoFailoverMonitor.ps1 -Env Prod -ForceFailover
```

To test the system without an actual failover:

```powershell
# Test monitoring with simulated errors
.\autoFailoverMonitor.ps1 -Env Dev -TestMode -SimulateError

# Generate test events manually
.\COMExceptionEventSimulator.ps1 -EventCount 5 -TimeSpanMinutes 10
```

To run failover with custom options:

```powershell
# Skip server restart during failover
.\completeFailoverCycle.ps1 -Env Prod -SkipServerRestart

# Only perform the first failover (A → B)
.\completeFailoverCycle.ps1 -Env Prod -SkipSecondFailover
```

To view the current error state:

```powershell
# Read the state file directly
Get-Content -Path "C:\Logs\FailoverAutomation\AutoFailoverState.json" | ConvertFrom-Json
```
1. Script Error: "Cannot determine active server"
   - Verify DNS configuration parameters
   - Check network connectivity to the DNS server
   - Ensure DNS records exist for the monitored services
   - Review the dnsFailover_v2.ps1 log for more details

2. Script Error: "Cannot bind argument to parameter 'NewErrors'"
   - This is a compatibility issue with PowerShell in Scheduled Tasks
   - Update the Update-ErrorState function to use try/catch instead of TryParse (see the sketch after this list)
   - Initialize `LASTEXITCODE` before script calls

3. Failover triggers but server restart fails
   - Ensure remote management is enabled on the servers
   - Verify administrative credentials
   - Check firewall rules for remote management
   - Review Windows Event Logs on the target server

4. TTL Updates Fail
   - Verify DNS server permissions
   - Check for DNS zone restrictions
   - Ensure the DNS record exists
   - Review DNSFailover.log for detailed error information
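For issue 2 above, the shape of the fix is to replace the `TryParse` call with a try/catch, roughly as follows. `$raw` is a placeholder for the stored timestamp string; this is a sketch of the approach, not the actual Update-ErrorState code.

```powershell
# Sketch of the Update-ErrorState fix: try/catch instead of [datetime]::TryParse
try {
    $timeCreated = [datetime]::Parse($raw)
}
catch {
    $timeCreated = Get-Date   # fall back to "now" if the stored value is unreadable
}

# Initialize LASTEXITCODE before invoking external scripts, per the note above
$global:LASTEXITCODE = 0
```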
Key log locations:
- `C:\Logs\FailoverAutomation\AutoFailoverMonitor.log`
- `C:\Logs\FailoverAutomation\completeFailoverCycle.log`
- `C:\Logs\FailoverAutomation\DNSFailover.log`
Important log patterns:
- `[ERROR]` - Critical issues requiring attention
- `[WARNING]` - Potential issues or non-critical failures
- `Found X COMException events across all logs` - Detection of exceptions
- `Triggering complete failover cycle` - Start of the failover process
- `Failover successful` - Successful DNS update
- `Server X has been successfully restarted` - Server restart confirmation
- `TTL reduced to X minute(s) successfully` - DNS TTL modification
1. Basic Monitoring:
   - Run the monitoring script in the Dev environment
   - Verify it correctly identifies the active server
   - Confirm it handles the zero-event case correctly

2. Error Detection:
   - Create simulated events with `COMExceptionEventSimulator.ps1`
   - Run the monitoring script
   - Verify it correctly counts the events
   - Confirm it triggers failover when the threshold is met

3. Failover Process:
   - Run the complete failover cycle in test mode
   - Verify all phases execute correctly
   - Confirm server restart handling
   - Validate that the second failover completes

4. TTL Management:
   - Verify TTL is reduced upon first error detection
   - Confirm TTL resets after successful failover
   - Test that TTL values match the configuration parameters

5. Timing Parameters:
   - Validate the error detection window functions correctly (10-minute window)
   - Test cooldown period enforcement (45 minutes)
   - Confirm multiple errors within the window trigger failover

6. Edge Cases:
   - Test cooldown period enforcement
   - Validate handling when a server is unavailable
   - Test recovery from partial failover
   - Verify behavior with corrupted state
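The first two scenarios above can be chained into a quick Dev smoke test; the commands reuse the scripts and parameters documented in the Usage Guide.

```powershell
# Quick Dev smoke test: seed events above the threshold, run one monitoring pass, inspect the log
.\COMExceptionEventSimulator.ps1 -EventCount 5 -TimeSpanMinutes 10
.\autoFailoverMonitor.ps1 -Env Dev -TestMode
Get-Content -Path "C:\Logs\FailoverAutomation\AutoFailoverMonitor.log" -Tail 20
```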
Before deploying to production:
- Conduct full testing in Dev environment
- Perform controlled testing during non-business hours
- Monitor first production deployments closely
- Review logs after each automated failover
- Validate timing parameters in realistic scenarios
- Confirm TTL management functions as expected
- Temporary Solution: This system addresses symptoms, not the root cause of COMException issues
- Server Specificity: Current implementation is specific to the NTATCA datacenter requirements
- Fixed Failover Path: System always attempts to return to the original server
- Manual Validation: No automated verification of application functionality after failover
- Simple Monitoring Logic: Only monitors for COMException events, not other potential indicators
- Root Cause Resolution: Replace with permanent fix once root cause is identified
- Enhanced Monitoring: Add additional monitoring metrics beyond COMException events
- Application Validation: Implement automated health checks of applications post-failover
- Multi-Server Support: Extend to handle more complex server architectures
- Notification Enhancements: Add SMS/Teams notifications for critical events
- Reporting Dashboard: Create web dashboard for failover history and statistics
- Adaptive TTL Management: Further optimize TTL based on time of day and traffic patterns
- Machine Learning Detection: Implement predictive failure detection
This solution provides a robust automated approach to mitigate the business impact of COMException issues while the development team works on identifying and fixing the root cause. By proactively detecting issues and executing failover procedures automatically, the system minimizes service disruption, especially during critical business hours.