- Hadoop HBase region server failures are not random, nor are they caused by Cloudera Manager, nor are they a result of GC activity
- That there is a root cause of the HBase failure
- That the impact caused by the region server failure is preventable through improved monitoring
- That, if this is not possible, logging or metric collection should be improved to diagnose this failure in the future
- Define time of occurrence
- Track back from the HBase region server logs to a possible underlying cause
- Determine the actor that initiated the activity that resulted in the failure
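
The first two investigation steps (pinning down the time of occurrence, then tracking back through the log) could be sketched roughly as follows. This is a minimal illustration, not a finished tool. It assumes each log entry starts with a log4j-style `YYYY-MM-DD HH:MM:SS,mmm LEVEL` prefix, which may not match the layout configured on the cluster. The function names and the sample lines are hypothetical; the sample lines are made up for the example and are not real HBase output.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout: "YYYY-MM-DD HH:MM:SS,mmm LEVEL rest-of-message".
# Adjust the pattern to the log4j layout actually configured on the cluster.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) (\w+)\s+(.*)$"
)


def parse_line(line):
    """Return (timestamp, level, message), or None for lines without a prefix."""
    match = LINE_RE.match(line)
    if match is None:
        return None  # e.g. a stack-trace continuation line
    stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    stamp += timedelta(milliseconds=int(match.group(2)))
    return stamp, match.group(3), match.group(4)


def time_of_occurrence(lines, levels=("FATAL",)):
    """Timestamp of the first entry at one of the given levels, or None."""
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and parsed[1] in levels:
            return parsed[0]
    return None


def events_before(lines, failure_time, window=timedelta(minutes=10)):
    """WARN and ERROR entries logged in the window leading up to failure_time."""
    start = failure_time - window
    found = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, level, _message = parsed
        if start <= stamp < failure_time and level in ("WARN", "ERROR"):
            found.append(parsed)
    return found


# Made-up sample lines for illustration only (not real HBase log output).
SAMPLE = [
    "2024-01-01 10:00:00,000 INFO  service started",
    "2024-01-01 10:04:30,250 WARN  slow response from dependency",
    "    at some.stack.Frame(Frame.java:1)",
    "2024-01-01 10:05:00,500 FATAL aborting",
]

if __name__ == "__main__":
    failed_at = time_of_occurrence(SAMPLE)
    print("time of occurrence:", failed_at)
    for stamp, level, message in events_before(SAMPLE, failed_at):
        print(stamp, level, message)
```

The entries returned by `events_before` are only candidates for the underlying cause. Determining the actor that initiated the activity would still require correlating them with other sources, such as client or job logs, by timestamp.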
Great first cut. My main concerns are the following: