@polynomial
Created February 26, 2015 02:38

Title: EXP-001 HBase Region Server Failure

Problem Statement

  • HBase region server failures are not random, nor are they caused by Cloudera Manager, nor are they a result of GC activity

Hypothesis

  • That there is a root cause of the HBase failure

  • That the impact caused by the region server failure is preventable through improved monitoring

  • That, if this is not possible, logging or metric collection should be improved so that this failure can be diagnosed in the future

Definitions:

  • This may not be needed?

Assumptions:

  • That enough logs/metrics exist to track the issue down to an actual root cause

Tools:

  • Logs: HBase region server, HBase master, HDFS DataNode, HDFS NameNode

  • Metrics: Datadog

Procedure:

  • Define time of occurrence

  • Track back from HBase region server logs to possible underlying cause

  • Determine actor that initiated activity that resulted in the failure
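
One possible way to carry out the "track back" step is to pull the lines that fall in a window before the time of occurrence out of each log and merge them into a single timeline. The sketch below is only an illustration under stated assumptions: it assumes every log entry starts with a log4j-style timestamp such as `2015-02-25 14:03:11,123`, and the names `parse_timestamp`, `lines_in_window` and `merged_timeline` are hypothetical helpers, not part of HBase or any other tool. Lines without a timestamp (for example stack trace frames) are treated as continuations of the previous entry.

```python
from datetime import datetime, timedelta

# Assumption: log entries begin with a log4j-style timestamp,
# e.g. "2015-02-25 14:03:11,123". Adjust if the logs use another layout.
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23  # number of characters in the timestamp prefix


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def lines_in_window(lines, failure_time, before, after=timedelta(0)):
    """Return (timestamp, line) pairs between failure_time - before and
    failure_time + after. Lines without a timestamp inherit the timestamp
    of the entry they follow."""
    start = failure_time - before
    end = failure_time + after
    selected = []
    current_ts = None
    for line in lines:
        ts = parse_timestamp(line)
        if ts is not None:
            current_ts = ts
        if current_ts is not None and start <= current_ts <= end:
            selected.append((current_ts, line))
    return selected


def merged_timeline(sources, failure_time, before, after=timedelta(0)):
    """Merge several logs into one timeline ordered by timestamp.

    sources maps a label (e.g. "regionserver") to an iterable of log lines.
    Returns a list of (timestamp, label, line) tuples.
    """
    entries = []
    for label, lines in sources.items():
        for ts, line in lines_in_window(lines, failure_time, before, after):
            entries.append((ts, label, line))
    # sorted() is stable, so continuation lines stay behind their entry.
    entries.sort(key=lambda entry: entry[0])
    return entries
```

To use it, pass the open log files (or lists of lines) for the region server, master, DataNode and NameNode as `sources`, together with the failure time from the first step and a look-back such as `timedelta(minutes=10)`. Reading the merged output from the bottom up gives the order in which to walk back from the failure.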

Results (Data):

Limitations:

Conclusions:

Future Work:

@mbbx6spp

Great first cut. My main concerns are the following:

  • Problem statement: I believe this might be closer to a hypothesis than to a problem statement. I think a better problem statement might be something like "There are HBase region server failures that have occurred only in the production environment over the last N weeks. There are no obvious root causes, or the root causes previously hypothesized have been ruled out. These included: ... they were ruled out because..."
  • Hypothesis: rewrite your previous problem statement a little to make it testable. Maybe something like, "The root cause of the HBase region server failures observed only in production is increased load, newly added M/R jobs, increased data size, the rate of data size increase, a misconfiguration, a combination of these, or some other external factor that explains the recent rise/occurrence of such failures."
