exp-001.asciidoc

Title: EXP-001 HBase Region Server Failure

Hadoop HBase region server failures are not random, nor are they caused by Cloudera Manager, nor are they a result of GC activity.

That there is a root cause of the HBase failure
That the impact caused by the region server failure is preventable through improved monitoring
That if this is not possible, logging or metric collection should be improved so that this failure can be diagnosed in the future

That there exist enough logs/metrics to trace the issue to an actual root cause

Great first cut. My main concerns are the following:

Problem statement - I believe this might be closer to a hypothesis than a problem statement. I think a better problem statement might be something like "There are HBase region server failures that have occurred only in the production environment over the last N weeks. There are no obvious root causes, or the root causes previously hypothesized have been ruled out. These included: ... they were ruled out because..."
Hypothesis - rewrite your previous problem statement a little to make it testable. Maybe something like, "The root cause of the HBase region server failures observed only in production is increased load, new M/R jobs, increased data size, the rate of data size increase, a misconfiguration, a combination of these, or some other external factor that explains the recent rise in or occurrence of such failures."
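
One way to make a hypothesis like this testable is to check, for each candidate factor, how often a failure was preceded by an event of that kind. The following is only a minimal sketch under stated assumptions: the function name `fraction_preceded`, the 15-minute window, and all timestamps are hypothetical illustrations, not data from the actual cluster. It assumes failure times and candidate event times (for example M/R job starts or long GC pauses) have already been extracted from logs into lists of `datetime` values.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


def fraction_preceded(failures, events, window):
    """Return the fraction of failures that have at least one event
    in the closed interval [failure_time - window, failure_time]."""
    if not failures:
        return 0.0
    events = sorted(events)
    hits = 0
    for t in failures:
        lo = bisect_left(events, t - window)   # first event >= t - window
        hi = bisect_right(events, t)           # one past the last event <= t
        if hi > lo:
            hits += 1
    return hits / len(failures)


# Hypothetical example data (not taken from any real log).
day = datetime(2024, 1, 1)
failures = [day.replace(hour=10, minute=0),
            day.replace(hour=14, minute=30),
            day.replace(hour=22, minute=15)]
mr_job_starts = [day.replace(hour=9, minute=50),
                 day.replace(hour=18, minute=0),
                 day.replace(hour=22, minute=10)]
gc_pauses = [day.replace(hour=3, minute=0)]

window = timedelta(minutes=15)  # assumed look-back window
mr_fraction = fraction_preceded(failures, mr_job_starts, window)
gc_fraction = fraction_preceded(failures, gc_pauses, window)
```

A high fraction for one factor and a low fraction for another would not prove a root cause, but it would indicate which candidate explanations deserve a closer look and which can be deprioritized.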