- Solution Overview
- Background
- Architecture
- Workflow
- Components
- Implementation Details
- Timing Design Considerations
- Configuration
- Installation & Setup
- Usage Guide
- Troubleshooting
- Testing & Validation
- Limitations & Future Improvements
The Automated Failover Solution provides a temporary but robust mechanism to automatically mitigate COMException issues in the production environment while the root cause is being investigated. By continuously monitoring event logs on the active server, the system detects patterns of COMException errors and triggers a complete failover cycle to minimize business impact, particularly during working hours.
Key features include:
- Automatic detection of COMException events in Windows Event Logs
- Configurable threshold-based triggering of failover operations
- Complete failover cycle with DNS switching and server management
- Dynamic DNS TTL management to optimize failover responsiveness
- IIS application pool handling during failover
- Automatic server restart and recovery
- Cooldown period to prevent cascading failovers
- Support for multiple environments (Dev/Prod)
- Comprehensive logging and notification
This solution serves as a temporary measure while the root cause of the COMException issues is being investigated, providing business continuity and minimizing user-visible impact.
The production environment has been experiencing intermittent COMException issues which occasionally impact service availability. While the development team is working on identifying and fixing the root cause, a temporary automated solution is needed to minimize business impact.
Currently, the operations process is reactive:
- System alerts notify when exceptions occur
- Operations staff manually restart the affected server
- Service is restored after reboot
- Business operations are impacted during the entire process
This manual approach has several disadvantages:
- Delay between issue occurrence and resolution
- Requires 24/7 staff availability for immediate response
- Significant business impact during working hours
- Inconsistent handling across incidents
The automated solution addresses these issues by:
- Proactively monitoring for early symptoms of the problem
- Automatically executing the failover procedure
- Minimizing downtime through quick detection and response
- Providing consistent handling of incidents
- Enabling seamless operation during working hours
The solution follows a modular architecture with clear separation of concerns:
```
┌─────────────────────────┐
│      Tooling Server     │
│                         │
│  ┌───────────────────┐  │
│  │  Scheduled Task   │  │
│  │  Every 5 minutes  │  │
│  └─────────┬─────────┘  │
│            │            │
│  ┌─────────▼─────────┐  │
│  │autoFailoverMonitor│  │
│  └─────────┬─────────┘  │
│            │            │
│  ┌─────────▼─────────┐  │
│  │ completeFailover  │  │
│  │       Cycle       │  │
│  └─────────┬─────────┘  │
│            │            │
│  ┌─────────▼─────────┐  │
│  │  dnsFailover_v2   │  │
│  └───────────────────┘  │
└────────────┬────────────┘
             │
             ▼
┌──────────────────┐       ┌──────────────────┐
│  Active Server   │◄─────►│  Standby Server  │
│  (Server0001)    │       │  (Server0002)    │
└──────────────────┘       └──────────────────┘
```
Key architectural principles:
- Modularity: Each component has a clearly defined responsibility
- Robustness: Comprehensive error handling and recovery
- Configurability: Environment-specific settings and thresholds
- Transparency: Detailed logging for troubleshooting and auditing
- Proactive Optimization: Dynamic DNS TTL management for improved failover response
The solution implements the following workflow:
1. Monitoring Phase:
   - The scheduled task runs `autoFailoverMonitor.ps1` every 5 minutes
   - The script checks Windows Event Logs on the active server for COMException events
   - If 3 or more events are detected within the last 10 minutes (configurable), the failover process is triggered
   - If the system is within the 45-minute cooldown period from a previous failover, monitoring continues without action
   - Upon first error detection, DNS TTL is proactively reduced to speed up an eventual failover

2. First Failover (A → B):
   - `completeFailoverCycle.ps1` is invoked to manage the entire failover process
   - DNS records are updated via `dnsFailover_v2.ps1` to point to the standby server
   - IIS application pools on the new active server are restarted and recycled
   - The system waits for DNS propagation and confirms the standby server is now active

3. Server Restart:
   - The original server (previously active, now standby) is restarted
   - The system waits for the server to come back online and stabilize
   - Server availability is verified through connectivity checks

4. Second Failover (B → A):
   - DNS failover is performed again to switch back to the original server
   - IIS application pools on the original server are restarted and recycled
   - The system verifies DNS propagation and service availability
   - Standard DNS TTL values are restored
   - The cooldown period begins to prevent cascading failovers

5. Notification:
   - An email notification is sent with the results and timeline
   - Detailed logs are available for review
The monitoring engine (autoFailoverMonitor.ps1) is the core component responsible for:
- Detecting COMException events in the Windows Event Log
- Applying threshold rules to determine when to trigger failover
- Managing DNS TTL values based on error detection
- Tracking state across multiple runs
- Enforcing cooldown periods
- Initializing the system
Key features:
- Configurable error threshold (default: 3 events)
- Configurable time window (default: 10 minutes)
- Configurable cooldown period (default: 45 minutes)
- State persistence across runs
- Multiple operation modes:
  - Single-run monitoring
  - Continuous service mode
  - Initialization mode
  - Test mode
  - Force failover mode
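To make the per-run decision flow concrete, here is a minimal sketch of one monitoring pass. `Get-ComExceptionEvents` and `Set-FailoverTtl` are hypothetical helper names standing in for internals of `autoFailoverMonitor.ps1` that this document does not reproduce; the parameter names come from the Configuration section.

```powershell
# Minimal sketch of one monitoring pass (hypothetical helper names, not the
# actual autoFailoverMonitor.ps1 internals)
$events = Get-ComExceptionEvents -ComputerName $activeServer -WindowMinutes $TimeWindowMinutes
$sinceFailover = (Get-Date) - [datetime]$state.LastFailoverTime
$inCooldown = $sinceFailover.TotalMinutes -lt $CooldownPeriodMinutes

if ($events.Count -ge 1 -and $state.TTLStatus -eq "Standard") {
    # Early warning: reduce TTL before the threshold is reached
    Set-FailoverTtl -Minutes $ReducedTTLMinutes
}

if ($events.Count -ge $ErrorThreshold -and -not $inCooldown) {
    # Threshold met outside the cooldown window: hand off to the orchestrator
    & $CompleteFailoverScriptPath -Env $Env
}
```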
The failover orchestration component (completeFailoverCycle.ps1) manages the entire failover process:
- First failover from primary to secondary server
- Server restart operation
- Second failover back to primary server
- Verification of each step's success
Key features:
- End-to-end orchestration of the failover process
- Server availability checking with multiple methods
- Configurable timeouts for each operation phase
- Options to skip phases if needed
- Detailed logging of the entire process
The DNS management component (dnsFailover_v2.ps1) handles:
- Determining the current active server
- Updating DNS records during failover
- Verifying DNS propagation
- Managing IIS application pools on the target server
Key features:
- Environment-specific configurations (Dev/Prod)
- Support for different DNS servers and zones
- IIS application pool management (restart and recycle)
- Comprehensive error handling
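As an illustration of what the DNS update and pool handling involve, the sketch below uses the Windows `DnsServer` and `WebAdministration` modules. The record name `app-alias` is a placeholder, not a value from the actual script; `Server0002` is the standby server from the architecture diagram.

```powershell
# Sketch of a CNAME flip to the standby server (placeholder record name)
Import-Module DnsServer

$old = Get-DnsServerResourceRecord -ComputerName $dnsServer -ZoneName $lookupZone `
    -Name "app-alias" -RRType CName
$new = $old.Clone()
$new.RecordData.HostNameAlias = "Server0002.$lookupZone."   # point the alias at the standby
Set-DnsServerResourceRecord -ComputerName $dnsServer -ZoneName $lookupZone `
    -OldInputObject $old -NewInputObject $new

# Restart and recycle application pools on the newly active server
Invoke-Command -ComputerName "Server0002" -ScriptBlock {
    Import-Module WebAdministration
    Get-ChildItem IIS:\AppPools | ForEach-Object { Restart-WebAppPool -Name $_.Name }
}
```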
The solution incorporates dynamic DNS TTL (Time-To-Live) management to optimize both failover performance and regular operations:
- Adaptive TTL Adjustment: Automatically reduces DNS TTL values when early signs of issues are detected
- Proactive DNS Caching Preparation: Shortens cache times before potential failover events
- Environment-Specific Settings: Different TTL values for Dev vs Prod environments
- Post-Failover Normalization: Restores standard TTL values after successful failover
The TTL lifecycle consists of distinct phases:
1. Standard Operation:
   - During normal operations, a standard TTL value (default: 3 minutes) is applied
   - This balances normal caching efficiency with reasonable update times

2. Early Warning Phase:
   - When the first COMException is detected (before reaching the threshold)
   - TTL is proactively reduced (default: 1 minute)
   - This prepares for a potential failover by ensuring clients refresh DNS records sooner

3. During Failover:
   - The reduced TTL ensures clients pick up DNS changes quickly
   - Minimizes client-side impact during the transition between servers

4. Post-Failover Recovery:
   - After successful failover completion, TTL is reset to the standard value
   - This transition is tracked in the state management system
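For reference, a TTL change of this kind can be expressed with the `DnsServer` module roughly as below. This is a sketch under the assumption that the failover alias is a CNAME record named `app-alias`; it is not the script's actual code.

```powershell
# Sketch: reduce the TTL on the failover alias (placeholder record name)
$old = Get-DnsServerResourceRecord -ComputerName $dnsServer -ZoneName $lookupZone `
    -Name "app-alias" -RRType CName
$new = $old.Clone()
$new.TimeToLive = [TimeSpan]::FromMinutes($ReducedTTLMinutes)   # e.g. 1 minute in the early warning phase
Set-DnsServerResourceRecord -ComputerName $dnsServer -ZoneName $lookupZone `
    -OldInputObject $old -NewInputObject $new
```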
The implementation includes robust retry mechanisms and verification to ensure TTL changes are properly applied. The solution tracks TTL state with the following attributes in the state file:

```json
{
    "TTLStatus": "Standard",
    "LastTTLChange": "2025-03-20T09:00:00.0000000+00:00"
}
```

`TTLStatus` can be `"Standard"` or `"Reduced"`. Benefits of the dynamic TTL management approach include:
- Improved failover speed with quicker client transitions
- Reduced client impact during server transitions
- Optimized cache efficiency during normal operations
- Early preparation for potential issues by preemptively reducing TTL
The testing framework (COMExceptionEventSimulator.ps1) provides:
- Simulation of COMException events for testing
- Configurable event generation patterns
- Realistic error messages and timestamps
Key features:
- Event distribution across configurable time periods
- Realistic COMException error messages
- Random variation in event details
- Administrative permission handling
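The core of such a simulator can be as simple as writing error-level events into the Application log. The sketch below assumes a registered event source named `FailoverSimulator`, which is a placeholder and not necessarily what `COMExceptionEventSimulator.ps1` uses.

```powershell
# Sketch: emit one synthetic COMException-style event (placeholder source name)
if (-not [System.Diagnostics.EventLog]::SourceExists("FailoverSimulator")) {
    New-EventLog -LogName Application -Source "FailoverSimulator"   # requires admin rights
}
Write-EventLog -LogName Application -Source "FailoverSimulator" -EventId 1001 -EntryType Error `
    -Message "System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80010105..."
```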
The system uses Windows Event Log queries to identify COMException errors. The monitoring engine performs the following steps:
- Defines a time window based on the configuration (default: 10 minutes)
- Creates a filter for the Windows Event Log query
- Executes the query on the active server
- Filters events containing "COMException" in the message text
- Counts matching events and compares with the threshold
```powershell
# Create the filter for the event log query
$filter = @{
    LogName   = $logName
    StartTime = (Get-Date).AddMinutes(-$TimeWindowMinutes)
    EndTime   = Get-Date
}

# Execute the query with robust error handling
try {
    $events = Get-WinEvent -FilterHashtable $filter -ComputerName $ComputerName -ErrorAction Stop

    # Filter for COMException events (case-insensitive match)
    $comExceptionEvents = $events | Where-Object { $_.Message -match "(?i)COMException" }
}
catch {
    # Get-WinEvent throws when the filter matches nothing; treat that as an empty result
    if ($_.Exception.Message -match "No events were found") {
        # Return an empty array, not $null
        return @()
    }
    throw $_
}
```

The complete failover cycle involves multiple steps:
1. Initial State Assessment:
   - Identify the current active server
   - Verify server accessibility

2. First Failover (A → B):
   - Update the DNS CNAME record to point to the standby server
   - Wait for DNS propagation
   - Restart and recycle IIS application pools on the new active server

3. Server Restart:
   - Attempt the restart using multiple methods, for resilience (see the sketch after this list):
     - PowerShell `Invoke-Command` with `Restart-Computer`
     - WMI `Win32_OperatingSystem`
     - `shutdown.exe` remote command
   - Verify the server goes offline
   - Wait for the server to come back online
   - Verify the server is fully operational

4. Second Failover (B → A):
   - Update the DNS CNAME record to switch back to the original server
   - Wait for DNS propagation
   - Restart and recycle IIS application pools on the original server

5. Final Verification:
   - Confirm DNS is pointing to the intended server
   - Verify service availability
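A condensed sketch of the restart fallback chain is shown below. The function name is illustrative rather than taken from `completeFailoverCycle.ps1`, and error handling is reduced to warnings for brevity.

```powershell
# Sketch of the multi-method restart fallback (illustrative function name)
function Restart-RemoteServer {
    param([string]$ComputerName)

    try {
        # Method 1: PowerShell remoting
        Restart-Computer -ComputerName $ComputerName -Force -ErrorAction Stop
        return $true
    } catch { Write-Warning "Restart-Computer failed: $_" }

    try {
        # Method 2: WMI
        $os = Get-WmiObject -Class Win32_OperatingSystem -ComputerName $ComputerName -EnableAllPrivileges
        $os.Reboot() | Out-Null
        return $true
    } catch { Write-Warning "WMI reboot failed: $_" }

    # Method 3: shutdown.exe against the remote machine (/r restart, /f force, /t 0 no delay)
    & shutdown.exe /r /f /t 0 /m "\\$ComputerName"
    return ($LASTEXITCODE -eq 0)
}
```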
The monitoring system maintains state between executions using a JSON state file:
```json
{
    "Errors": [
        {
            "TimeCreated": "2025-03-20T10:15:30.0000000+00:00",
            "EventID": 1001,
            "Level": "Error",
            "LogName": "Application",
            "Message": "System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80010105..."
        }
    ],
    "LastFailoverTime": "2025-03-20T09:30:00.0000000+00:00",
    "FailoverCount": 2,
    "TTLStatus": "Standard",
    "LastTTLChange": "2025-03-20T09:35:00.0000000+00:00"
}
```

This allows the system to:
- Track the recent error history
- Enforce cooldown periods
- Maintain continuity across script executions
- Provide statistics on failover frequency
- Track DNS TTL status changes
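In practice the cooldown check reduces to date arithmetic on the state file, roughly as sketched here; the parameter names come from the Configuration section below.

```powershell
# Sketch: load state and enforce the cooldown window
$state = Get-Content -Path $stateFilePath -Raw | ConvertFrom-Json
$sinceFailover = (Get-Date) - [datetime]$state.LastFailoverTime

if ($sinceFailover.TotalMinutes -lt $CooldownPeriodMinutes) {
    Write-Output ("In cooldown: {0:N1} of {1} minutes elapsed; skipping failover." -f `
        $sinceFailover.TotalMinutes, $CooldownPeriodMinutes)
    return
}
```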
The monitoring system employs specific timing parameters that balance responsiveness, reliability, and resource efficiency. This section analyzes the design considerations behind these timing choices.
The 5-minute interval for the monitoring task was selected based on the following considerations:
1. Response Time Balance:
   - 5 minutes provides a reasonable balance between quick response and system overhead
   - Shorter intervals (1-2 minutes) would increase system load without significant benefit
   - Longer intervals (10+ minutes) would delay response to developing issues

2. Coordination with Error Window:
   - The 5-minute interval works well with the 10-minute error detection window
   - Two consecutive checks can completely refresh the error window

3. Resource Consumption:
   - The script typically completes in under 30 seconds
   - The 5-minute interval keeps monitoring CPU utilization under 0.5%
   - Allows for occasional longer runs without causing task overlap

4. Operational Considerations:
   - Aligns with typical system monitoring intervals
   - Frequent enough to catch issues before significant user impact
   - The 10-minute task execution timeout provides sufficient buffer
The 10-minute window for counting COMException events provides these benefits:
1. Error Pattern Recognition:
   - Long enough to identify patterns (vs. isolated incidents)
   - Short enough to respond quickly to sudden issue clusters

2. Operational Alignment:
   - Matches typical user-reported issue timeframes
   - Provides sufficient context for error correlation

3. Memory Management:
   - Keeps state file size reasonable with limited event history
   - Prevents memory consumption issues with large event collections

4. Statistical Significance:
   - 10 minutes provides enough samples to differentiate between:
     - Normal background errors (1-2 random events)
     - Actual system failure conditions (3+ clustered events)
The threshold of 3 COMException events was determined through:
1. Baseline Analysis:
   - Historical data showed healthy systems typically log 0-1 COMException events in 10 minutes
   - Problem states consistently show 3+ events
   - Setting the threshold at 3 provides clear separation between normal and problem states

2. False Positive Prevention:
   - A threshold of 3 events minimizes the chance of triggering on random errors
   - Lower thresholds (1-2) produced an unacceptable false positive rate in testing

3. Response Time Optimization:
   - 3 events typically occur within 2-5 minutes of the beginning of an issue
   - Allows for quick response while maintaining accuracy
The 45-minute cooldown between failover operations provides:
1. System Stability:
   - Allows both servers to fully stabilize after a complete failover cycle
   - Prevents rapid oscillation between servers (A→B→A→B)

2. Error Pattern Observation:
   - Provides sufficient time to determine whether the failover resolved the issue
   - Allows collection of new baseline metrics post-failover

3. Administrative Response Window:
   - Gives the operations team time to analyze logs before the next automatic action
   - Aligns with typical incident response mobilization time (30-60 minutes)

4. Resource Protection:
   - Prevents excessive DNS updates in short timeframes
   - Avoids potential RPC and network congestion from frequent operations
The relationship between the various timing parameters creates a cohesive monitoring system:
1. Detection Scenario:
   - The task runs every 5 minutes
   - Each run considers events from the past 10 minutes
   - This creates overlapping detection windows
   - Ensures no brief error burst is missed between runs

2. Maximum Detection Time:
   - Worst case: an issue begins immediately after a check completes
   - The next check occurs in 5 minutes
   - Maximum time from first error to detection: 5 minutes
   - Average detection time: 2.5 minutes

3. Full Cycle Timing:
   - Detection phase: ~2.5 minutes average
   - First failover: ~5 minutes
   - Server restart: ~10 minutes
   - Second failover: ~5 minutes
   - Total cycle: ~22.5 minutes
   - The 45-minute cooldown provides a 2x buffer over the complete cycle
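The window overlap behind these numbers is easy to verify: with a 5-minute interval and a 10-minute lookback, consecutive runs overlap by 5 minutes, so no error burst can fall between checks. A trivial check of the arithmetic:

```powershell
# Worst-case detection latency with the documented parameters
$intervalMinutes = 5     # scheduled task interval
$windowMinutes   = 10    # event lookback per run

$overlap  = $windowMinutes - $intervalMinutes   # 5 minutes: consecutive windows overlap
$maxDelay = $intervalMinutes                    # error lands just after a check completes
$avgDelay = $intervalMinutes / 2                # errors arrive uniformly between checks

"Overlap: $overlap min; max detection delay: $maxDelay min; average: $avgDelay min"
```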
The solution allows configuration through script parameters:
```powershell
# Environment selection
-Env "Dev"                    # or "Prod"

# Error detection settings
-ErrorThreshold 3             # Number of events to trigger failover
-TimeWindowMinutes 10         # Time window for counting events

# Protection settings
-CooldownPeriodMinutes 45     # Minimum time between failovers

# TTL management
-DefaultTTLMinutes 3          # Standard TTL value in minutes
-ReducedTTLMinutes 1          # Reduced TTL when errors are detected

# Paths
-CompleteFailoverScriptPath "C:\Scripts\completeFailoverCycle.ps1"
-logFilePath "C:\Logs\AutoFailoverMonitor.log"
-stateFilePath "C:\Logs\AutoFailoverState.json"

# DNS settings
-dnsServer "dns.example.com"
-lookupZone "example.com"

# Timeout settings
-serverRestartTimeout 600             # Seconds to wait for server restart
-serverAvailabilityCheckInterval 15   # Seconds between availability checks
-maxFailoverWaitTime 1800             # Maximum total failover time

# Operation control
-SkipServerRestart            # Skip server restart phase
-SkipSecondFailover           # Skip second failover phase
```

Dev environment:
- Servers: NTATCA1515 and NTATCA1516
- DNS TTL: Default 3 minutes, Reduced 1 minute
- Initial Wait: 30 seconds
- Check Interval: 10 seconds
Prod environment:
- DNS TTL: Default 3 minutes, Reduced 1 minute
- Initial Wait: 300 seconds
- Check Interval: 60 seconds
- Windows Server with PowerShell 5.1 or later
- Administrative privileges on the tooling server
- DNS management permissions
- Remote management access to application servers
- Network connectivity to all systems
1. Create Script Directory:

   ```powershell
   mkdir C:\Scripts\FailoverAutomation
   ```

2. Copy Scripts to Directory:
   - `autoFailoverMonitor.ps1`
   - `completeFailoverCycle.ps1`
   - `dnsFailover_v2.ps1`
   - `COMExceptionEventSimulator.ps1` (for testing only)

3. Create Log Directory:

   ```powershell
   mkdir C:\Logs\FailoverAutomation
   ```

4. Initialize the Monitoring System:

   ```powershell
   .\autoFailoverMonitor.ps1 -Initialize -Env Prod
   ```

5. Create Scheduled Task:

   ```powershell
   # Alternative to built-in task creation
   $action = New-ScheduledTaskAction -Execute "powershell.exe" `
       -Argument "-NoProfile -ExecutionPolicy Bypass -File `"C:\Scripts\FailoverAutomation\autoFailoverMonitor.ps1`" -Env `"Prod`""
   $trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 5)
   $settings = New-ScheduledTaskSettingsSet -ExecutionTimeLimit (New-TimeSpan -Minutes 10) -RestartCount 3
   Register-ScheduledTask -TaskName "FailoverMonitor_Prod" -Action $action -Trigger $trigger -Settings $settings -RunLevel Highest
   ```

6. Test the Setup:

   ```powershell
   # Test in Dev environment first
   .\autoFailoverMonitor.ps1 -Env Dev -TestMode
   ```
The scheduled task will handle routine monitoring automatically. To manually run a monitoring check:

```powershell
.\autoFailoverMonitor.ps1 -Env Prod
```

To trigger a failover regardless of the error threshold:

```powershell
.\autoFailoverMonitor.ps1 -Env Prod -ForceFailover
```

To test the system without an actual failover:

```powershell
# Test monitoring with simulated errors
.\autoFailoverMonitor.ps1 -Env Dev -TestMode -SimulateError

# Generate test events manually
.\COMExceptionEventSimulator.ps1 -EventCount 5 -TimeSpanMinutes 10
```

To run failover with custom options:

```powershell
# Skip server restart during failover
.\completeFailoverCycle.ps1 -Env Prod -SkipServerRestart

# Only perform the first failover (A → B)
.\completeFailoverCycle.ps1 -Env Prod -SkipSecondFailover
```

To view the current error state:

```powershell
# Read the state file directly
Get-Content -Path "C:\Logs\FailoverAutomation\AutoFailoverState.json" | ConvertFrom-Json
```
1. Script Error: "Cannot determine active server"
   - Verify DNS configuration parameters
   - Check network connectivity to the DNS server
   - Ensure DNS records exist for the monitored services
   - Review the dnsFailover_v2.ps1 log for more details

2. Script Error: "Cannot bind argument to parameter 'NewErrors'"
   - This is a compatibility issue with PowerShell in Scheduled Tasks
   - Update the Update-ErrorState function to use try/catch instead of TryParse (see the sketch after this list)
   - Initialize `LASTEXITCODE` before script calls

3. Failover triggers but server restart fails
   - Ensure remote management is enabled on the servers
   - Verify administrative credentials
   - Check firewall rules for remote management
   - Review Windows Event Logs on the target server

4. TTL Updates Fail
   - Verify DNS server permissions
   - Check for DNS zone restrictions
   - Ensure the DNS record exists
   - Review DNSFailover.log for detailed error information
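For issue 2 above, the shape of the fix is to replace the `TryParse` call with a try/catch, roughly as follows. `$raw` is a placeholder for the stored timestamp string; this is a sketch of the approach, not the actual Update-ErrorState code.

```powershell
# Sketch of the Update-ErrorState fix: try/catch instead of [datetime]::TryParse
try {
    $timeCreated = [datetime]::Parse($raw)
}
catch {
    $timeCreated = Get-Date   # fall back to "now" if the stored value is unreadable
}

# Initialize LASTEXITCODE before invoking external scripts, per the note above
$global:LASTEXITCODE = 0
```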
Key log locations:
- `C:\Logs\FailoverAutomation\AutoFailoverMonitor.log`
- `C:\Logs\FailoverAutomation\completeFailoverCycle.log`
- `C:\Logs\FailoverAutomation\DNSFailover.log`
Important log patterns:
- `[ERROR]` - Critical issues requiring attention
- `[WARNING]` - Potential issues or non-critical failures
- `Found X COMException events across all logs` - Detection of exceptions
- `Triggering complete failover cycle` - Start of the failover process
- `Failover successful` - Successful DNS update
- `Server X has been successfully restarted` - Server restart confirmation
- `TTL reduced to X minute(s) successfully` - DNS TTL modification
1. Basic Monitoring:
   - Run the monitoring script in the Dev environment
   - Verify it correctly identifies the active server
   - Confirm it handles the zero-event case correctly

2. Error Detection:
   - Create simulated events with `COMExceptionEventSimulator.ps1`
   - Run the monitoring script
   - Verify it correctly counts the events
   - Confirm it triggers failover when the threshold is met

3. Failover Process:
   - Run the complete failover cycle in test mode
   - Verify all phases execute correctly
   - Confirm server restart handling
   - Validate that the second failover completes

4. TTL Management:
   - Verify TTL is reduced upon first error detection
   - Confirm TTL resets after successful failover
   - Test that TTL values match the configuration parameters

5. Timing Parameters:
   - Validate the error detection window functions correctly (10-minute window)
   - Test cooldown period enforcement (45 minutes)
   - Confirm multiple errors within the window trigger failover

6. Edge Cases:
   - Test cooldown period enforcement
   - Validate handling when a server is unavailable
   - Test recovery from partial failover
   - Verify behavior with corrupted state
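The first two scenarios above can be chained into a quick Dev smoke test; the commands reuse the scripts and parameters documented in the Usage Guide.

```powershell
# Quick Dev smoke test: seed events above the threshold, run one monitoring pass, inspect the log
.\COMExceptionEventSimulator.ps1 -EventCount 5 -TimeSpanMinutes 10
.\autoFailoverMonitor.ps1 -Env Dev -TestMode
Get-Content -Path "C:\Logs\FailoverAutomation\AutoFailoverMonitor.log" -Tail 20
```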
Before deploying to production:
- Conduct full testing in Dev environment
- Perform controlled testing during non-business hours
- Monitor first production deployments closely
- Review logs after each automated failover
- Validate timing parameters in realistic scenarios
- Confirm TTL management functions as expected
- Temporary Solution: This system addresses symptoms, not the root cause of COMException issues
- Server Specificity: Current implementation is specific to the NTATCA datacenter requirements
- Fixed Failover Path: System always attempts to return to the original server
- Manual Validation: No automated verification of application functionality after failover
- Simple Monitoring Logic: Only monitors for COMException events, not other potential indicators
- Root Cause Resolution: Replace with permanent fix once root cause is identified
- Enhanced Monitoring: Add additional monitoring metrics beyond COMException events
- Application Validation: Implement automated health checks of applications post-failover
- Multi-Server Support: Extend to handle more complex server architectures
- Notification Enhancements: Add SMS/Teams notifications for critical events
- Reporting Dashboard: Create web dashboard for failover history and statistics
- Adaptive TTL Management: Further optimize TTL based on time of day and traffic patterns
- Machine Learning Detection: Implement predictive failure detection
This solution provides a robust automated approach to mitigate the business impact of COMException issues while the development team works on identifying and fixing the root cause. By proactively detecting issues and executing failover procedures automatically, the system minimizes service disruption, especially during critical business hours.