@polynomial, created February 26, 2015 18:05
EXP-001

Title: EXP-001 HBase Region Server Failure

Problem Statement

  • Timeout values, alerting, and the current metrics in the Hadoop cluster are not sufficient to prevent availability-impacting outages.

Hypothesis

  • That recent HBase region server failures are not a result of GC or CMF actions

  • That tuning could be done to improve automated recovery

  • That there are un-auditable activities happening on the cluster which make triage/RCA more difficult
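  • One way to check the GC part of this hypothesis is to look for long JVM pauses in the regionserver logs around the abort time. A minimal sketch, assuming JvmPauseMonitor messages are enabled and the Cloudera Manager log layout seen elsewhere in this report:

    # Long GC or host pauses are reported by the JvmPauseMonitor; an absence of
    # these near the abort time would support the "not GC" hypothesis.
    grep -h 'Detected pause in JVM or host machine' /var/log/hbase/*REGIONSERVER* \
      | grep '2015-02-23 2[23]:' | sort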

Definitions:

  • This may not be needed?

Assumptions:

  • That there are enough logs/metrics to track the issue down to an actual root cause

Tools:

  • Logs: HBase RegionServer and Master, HDFS DataNode and NameNode

  • Metrics: Datadog

Procedure:

  • Define the time of occurrence

  • Track back from the HBase RegionServer logs to the possible underlying cause

  • Determine the actor that initiated the activity that resulted in the failure
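  • A minimal sketch of this walkback, assuming the Cloudera Manager log layout under /var/log/hbase and /var/log/hadoop-hdfs and the hostnames that appear in the data below; adjust the date filters to the incident being triaged:

    # 1. Define the time of occurrence: when did the regionserver abort?
    grep -h FATAL /var/log/hbase/*REGIONSERVER* | sort | head

    # 2. Track back: WARN/ERROR lines in the regionserver log leading up to the abort.
    grep -hE 'WARN|ERROR' /var/log/hbase/*REGIONSERVER* | grep '2015-02-23 2[23]:' | sort

    # 3. Follow the cause into HDFS: the local datanode first, then the namenode.
    grep -h ERROR /var/log/hadoop-hdfs/*DATANODE* | grep '2015-02-23' | sort | tail -50
    ssh sf1r005s017.flexd-sf0.local \
      "grep 'Failed to place enough replicas' /var/log/hadoop-hdfs/*NAMENODE* | head"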

Results (Data):

  • Example of an HBase region server shutdown, from hadoop053:

    2015-02-23 23:21:33,589 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Failed close of HLog writer
    org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/hadoop053.flexd-sf0.local,60020,1422991587643/hadoop053.flexd-sf0.local%2C60020%2C1422991587643.1424762493366 could only be replicated to 0 nodes instead of minReplication (=1). There are 114 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:614)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:188)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:476)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
    at org.apache.hadoop.ipc.Client.call(Client.java:1411)
    at org.apache.hadoop.ipc.Client.call(Client.java:1364)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy21.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:391)
    at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy22.addBlock(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
    at com.sun.proxy.$Proxy23.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1473)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1290)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:536)
    2015-02-23 23:21:33,590 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server hadoop053.flexd-sf0.local,60020,1422991587643: Failed log close in log roller
    org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: #1424762493366
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.cleanupCurrentWriter(FSHLog.java:787)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:575)
    at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:97)
    at java.lang.Thread.run(Thread.java:745)
    Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/hadoop053.flexd-sf0.local,60020,1422991587643/hadoop053.flexd-sf0.local%2C60020%2C1422991587643.1424762493366 could only be replicated to 0 nodes instead of minReplication (=1). There are 114 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:614)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:188)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:476)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
    at org.apache.hadoop.ipc.Client.call(Client.java:1411)
    at org.apache.hadoop.ipc.Client.call(Client.java:1364)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy21.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:391)
    at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy22.addBlock(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
    at com.sun.proxy.$Proxy23.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1473)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1290)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:536)
    2015-02-23 23:21:33,615 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
    2015-02-23 23:21:33,663 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Failed log close in log roller
    2015-02-23 23:21:33,663 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
    2015-02-23 23:21:33,663 INFO org.apache.hadoop.ipc.RpcServer: Stopping server on 60020
    2015-02-23 23:21:33,664 INFO org.apache.hadoop.ipc.RpcServer: RpcServer.listener,port=60020: stopping
    2015-02-23 23:21:33,664 INFO org.apache.hadoop.ipc.RpcServer: RpcServer.responder: stopped
    2015-02-23 23:21:33,664 INFO org.apache.hadoop.ipc.RpcServer: RpcServer.responder: stopping
    2015-02-23 23:21:33,664 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread
    2015-02-23 23:21:33,670 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
    2015-02-23 23:21:33,670 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker interrupted while waiting for task, exiting: java.lang.InterruptedException
    2015-02-23 23:21:33,671 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker hadoop053.flexd-sf0.local,60020,1422991587643 exiting
    2015-02-23 23:21:33,683 INFO org.mortbay.log: Stopped [email protected]:60030
    2015-02-23 23:21:33,784 INFO org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: Stopping RegionServerSnapshotManager abruptly.
    2015-02-23 23:21:33,784 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: MemStoreFlusher.1 exiting
    2015-02-23 23:21:33,784 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: MemStoreFlusher.0 exiting
    2015-02-23 23:21:33,784 INFO org.apache.hadoop.hbase.regionserver.ServerNonceManager$1: regionserver60020.nonceCleaner exiting
    2015-02-23 23:21:33,784 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$CompactionChecker: regionserver60020.compactionChecker exiting
    2015-02-23 23:21:33,785 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server hadoop053.flexd-sf0.local,60020,1422991587643
    2015-02-23 23:21:33,786 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x44a5edb26a25cbe
  • HBase region server shutdown causes from hadoop049:

    2015-02-23 21:14:51,695 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-308208786-192.168.50.36-1363989756031:blk_1260528925_1099716500363 java.net.SocketTimeoutException: 75000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.82.15.139:54612 remote=/10.82.15.210:50010]
          at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
          at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
          at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
          at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
          at java.io.FilterInputStream.read(FilterInputStream.java:83)
          at java.io.FilterInputStream.read(FilterInputStream.java:83)
          at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2101)
          at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
          at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:810)
    2015-02-23 23:15:38,189 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/data/default/device_events/5dd4aa420dd063a48492db378c49499d/.tmp/12a9dd99a89f4c2d8db1a979d8846203 could only be replicated to 0 nodes instead of minReplication (=1).  There are 114 datanode(s) running and no node(s) are excluded in this operation.
          at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)
          at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:614)
          at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:188)
          at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:476)
          at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:415)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
          at org.apache.hadoop.ipc.Client.call(Client.java:1411)
          at org.apache.hadoop.ipc.Client.call(Client.java:1364)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
          at com.sun.proxy.$Proxy21.addBlock(Unknown Source)
          at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:391)
          at sun.reflect.GeneratedMethodAccessor113.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:606)
          at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
          at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
          at com.sun.proxy.$Proxy22.addBlock(Unknown Source)
          at sun.reflect.GeneratedMethodAccessor113.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:606)
          at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
          at com.sun.proxy.$Proxy23.addBlock(Unknown Source)
          at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1473)
          at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1290)
          at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:536)
    2015-02-23 23:15:47,370 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server hadoop049.flexd-sf0.local,60020,1422991589849: Replay of HLog required. Forcing server shutdown org.apache.hadoop.hbase.DroppedSnapshotException: region: device_events,6721a53b,1406930979994.5dd4aa420dd063a48492db378c49499d.
          at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1804)
          at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1662)
          at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1575)
          at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:461)
          at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:435)
          at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:66)
          at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:253)
          at java.lang.Thread.run(Thread.java:745)
    Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/data/default/device_events/5dd4aa420dd063a48492db378c49499d/.tmp/6ea0038a2db245e6a5250360e8e0891e could only be replicated to 0 nodes instead of minReplication (=1).  There are 114 datanode(s) running and no node(s) are excluded in this operation.
          at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)
          at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:614)
          at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:188)
          at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:476)
          at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:415)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
          at org.apache.hadoop.ipc.Client.call(Client.java:1411)
          at org.apache.hadoop.ipc.Client.call(Client.java:1364)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
          at com.sun.proxy.$Proxy21.addBlock(Unknown Source)
          at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:391)
          at sun.reflect.GeneratedMethodAccessor113.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:606)
          at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
          at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
          at com.sun.proxy.$Proxy22.addBlock(Unknown Source)
          at sun.reflect.GeneratedMethodAccessor113.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:606)
          at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
          at com.sun.proxy.$Proxy23.addBlock(Unknown Source)
          at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1473)
          at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1290)
          at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:536)
    2015-02-23 23:15:47,394 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
    2015-02-23 23:15:47,440 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Replay of HLog required. Forcing server shutdown
    2015-02-23 23:15:47,440 INFO org.apache.hadoop.ipc.RpcServer: Stopping server on 60020
    2015-02-23 23:15:47,440 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: MemStoreFlusher.1 exiting
    2015-02-23 23:15:47,441 INFO org.apache.hadoop.ipc.RpcServer: RpcServer.listener,port=60020: stopping
    2015-02-23 23:15:47,441 INFO org.apache.hadoop.ipc.RpcServer: RpcServer.responder: stopped
    2015-02-23 23:15:47,441 INFO org.apache.hadoop.ipc.RpcServer: RpcServer.responder: stopping
    2015-02-23 23:15:47,442 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread
    2015-02-23 23:15:47,447 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
    2015-02-23 23:15:47,447 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker interrupted while waiting for task, exiting: java.lang.InterruptedException
    2015-02-23 23:15:47,447 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker hadoop049.flexd-sf0.local,60020,1422991589849 exiting
    2015-02-23 23:15:47,473 INFO org.mortbay.log: Stopped [email protected]:60030
    2015-02-23 23:15:47,574 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: MemStoreFlusher.0 exiting
    2015-02-23 23:15:47,574 INFO org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: Stopping RegionServerSnapshotManager abruptly.
    2015-02-23 23:15:47,574 INFO org.apache.hadoop.hbase.regionserver.ServerNonceManager$1: regionserver60020.nonceCleaner exiting
    2015-02-23 23:15:47,574 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
    2015-02-23 23:15:47,574 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$CompactionChecker: regionserver60020.compactionChecker exiting
    2015-02-23 23:15:47,578 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server hadoop049.flexd-sf0.local,60020,1422991589849
    2015-02-23 23:15:47,578 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x14a5edb27fedb6e
    2015-02-23 23:15:47,580 INFO org.apache.zookeeper.ZooKeeper: Session: 0x14a5edb27fedb6e closed
    2015-02-23 23:15:47,580 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
    2015-02-23 23:15:47,640 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 4 regions to close
    2015-02-23 23:15:47,658 INFO org.apache.hadoop.hbase.regionserver.HStore: Closed cf
    2015-02-23 23:15:47,658 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 1065600
    2015-02-23 23:15:47,662 INFO org.apache.hadoop.hbase.regionserver.HStore: Closed cf
    2015-02-23 23:15:47,663 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 3910824
    2015-02-23 23:15:47,664 INFO org.apache.hadoop.hbase.regionserver.HStore: Closed cf
    2015-02-23 23:15:47,665 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 953712
    2015-02-23 23:15:47,676 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed device_events,e2be2bba,1406930979997.27add37206ade98a929e4c6ff27011a6.
    2015-02-23 23:15:47,676 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed device_events,6721a53b,1406930979994.5dd4aa420dd063a48492db378c49499d.
    2015-02-23 23:15:47,677 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed device_events,c869533e,1406930979996.1d9e2be0e0d4d93cc97ab78370508d7d.
    2015-02-23 23:15:47,679 INFO org.apache.hadoop.hbase.regionserver.HStore: Closed cf
    2015-02-23 23:15:47,679 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 1135752
    2015-02-23 23:15:47,679 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed device_events,bd70a3b5,1406930979996.053af1cdf4455d391f6816cd5c5c2f76.
    2015-02-23 23:15:47,840 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server hadoop049.flexd-sf0.local,60020,1422991589849; all regions closed.
    2015-02-23 23:15:47,841 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: regionserver60020-WAL.AsyncNotifier exiting
    2015-02-23 23:15:47,841 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: regionserver60020-WAL.AsyncSyncer0 exiting
    2015-02-23 23:15:47,842 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: regionserver60020-WAL.AsyncSyncer1 exiting
  • Cause of write failure, from namenode:

    hadoop-cmf-hdfs-NAMENODE-sf1r005s017.flexd-sf0.local.log.out.10:2015-02-23 23:21:32,878 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 3 to reach 3. For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
    hadoop-cmf-hdfs-NAMENODE-sf1r005s017.flexd-sf0.local.log.out.10:2015-02-23 23:21:32,879 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hbase (auth:SIMPLE) cause:java.io.IOException: File /hbase/WALs/hadoop053.flexd-sf0.local,60020,1422991587643/hadoop053.flexd-sf0.local%2C60020%2C1422991587643.1424762492868 could only be replicated to 0 nodes instead of minReplication (=1). There are 114 datanode(s) running and no node(s) are excluded in this operation.
    hadoop-cmf-hdfs-NAMENODE-sf1r005s017.flexd-sf0.local.log.out.10:2015-02-23 23:21:32,879 INFO org.apache.hadoop.ipc.Server: IPC Server handler 106 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.82.14.136:49261 Call#118756 Retry#0
  • The namenode failed to place replicas for multiple minutes, starting at:

    hadoop-cmf-hdfs-NAMENODE-sf1r005s017.flexd-sf0.local.log.out.10:2015-02-23 22:49:00,465 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 2 to reach 3. For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
  • Ending at:

    hadoop-cmf-hdfs-NAMENODE-sf1r005s017.flexd-sf0.local.log.out.10:2015-02-23 23:32:56,504 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3. For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
  • Capacity loss: the under-replicated block count climbs to 171k blocks (https://app.datadoghq.com/dash/dash/41141?from_ts=1424760391510&to_ts=1424760691510&tile_size=l), starting at:

    02/23 22:48 PST
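  • (Cross-check sketch, not part of the original triage data.) The same counter can be read straight from HDFS instead of Datadog, assuming a gateway node with the hdfs CLI and an HDFS superuser:

    # Summary counters from the namenode include "Under replicated blocks".
    sudo -u hdfs hdfs dfsadmin -report | head -20

    # Filesystem check; the summary reports under-replicated, missing, and corrupt blocks.
    sudo -u hdfs hdfs fsck / | grep -iE 'under.replicated|missing|corrupt' | head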
  • There was another operation putting both heavy ingress load and increased storage requirements on the cluster: https://app.datadoghq.com/dash/dash/41141?from_ts=1424738017225&to_ts=1424766524058&tile_size=l

    Not a problem in itself, but it probably caused the 'discovery' of the failing disk through write attempts on a non-scrubbed spindle. http://en.wikipedia.org/wiki/Data_scrubbing
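  • (Follow-up sketch, not from the original triage.) Confirming the failing spindle on the suspect datanode, assuming smartmontools is installed and the device name is known:

    # Kernel-level I/O errors logged when writes hit the bad sectors.
    dmesg | grep -iE 'i/o error|sector' | tail -20

    # Overall SMART health plus reallocated/pending sector counters (replace sdX).
    sudo smartctl -H -A /dev/sdX | egrep 'SMART overall|Reallocated_Sector|Current_Pending'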
  • Timeout failures on datanodes (ERROR lines per 10-minute bucket), from hadoop032:

    bens@hadoop032:/var/log/hadoop-hdfs$ cat hadoop-cmf-hdfs-DATANODE-hadoop032.flexd-sf0.local.log.out* | grep ERROR | sed 's/[0-9]:[0-9]*,[0-9]* .*//'   | sort | uniq -c
        1 2015-02-21 19:0
        1 2015-02-21 19:3
        1 2015-02-21 20:0
        1 2015-02-21 20:3
        1 2015-02-23 11:2
        3 2015-02-23 17:2
        5 2015-02-23 17:3
        8 2015-02-23 18:3
        2 2015-02-23 18:4
        6 2015-02-23 19:3
        3 2015-02-23 20:1
       15 2015-02-23 20:2
       27 2015-02-23 20:3
       90 2015-02-23 20:4
       19 2015-02-23 20:5
       27 2015-02-23 21:0
       52 2015-02-23 21:1
       54 2015-02-23 21:2
       49 2015-02-23 21:3
       67 2015-02-23 21:4
       43 2015-02-23 21:5
      129 2015-02-23 22:0
       27 2015-02-23 22:1
       75 2015-02-23 22:2
      182 2015-02-23 22:3
       63 2015-02-23 22:4
       25 2015-02-23 22:5
        4 2015-02-23 23:0
        2 2015-02-23 23:1
        4 2015-02-23 23:2
        2 2015-02-23 23:3
        1 2015-02-24 00:0
        1 2015-02-24 13:5
  • Not all nodes were affected to the same degree; the same count from hadoop049:

    root@hadoop049:/var/log/hadoop-hdfs# cat *DATANODE* | grep ERROR | sed 's/[0-9]:[0-9]*,[0-9]* .*//'   | sort | uniq -c
        1 2015-02-23 15:3
        1 2015-02-23 16:0
        1 2015-02-23 16:3
        2 2015-02-23 17:2
        9 2015-02-23 18:3
        2 2015-02-23 18:4
        1 2015-02-23 19:3
        8 2015-02-23 20:1
       13 2015-02-23 20:2
        7 2015-02-23 20:3
        6 2015-02-23 20:4
        1 2015-02-23 20:5
        1 2015-02-23 21:0
        2 2015-02-23 21:1
        2 2015-02-23 22:5
        7 2015-02-23 23:1
        1 2015-02-24 11:1
       39 2015-02-24 13:5
  • Cluster wide (ERROR lines per 10-minute bucket, across all datanodes):

    Collection methodology:
      # Gather the datanode ERROR lines from every host over ssh.
      $ for i in `cat hadoop-ips `; do ssh $i 'cat /var/log/hadoop-hdfs/*DATANODE* | grep ERROR' ;done >/tmp/hadoop-ERROR
      # Truncate each timestamp to its 10-minute bucket, drop the rest of the message, and count.
      $ cat /tmp/hadoop-ERROR | sed -e 's/[0-9]:[0-9]*,[0-9]* ERROR org.apache.hadoop.hdfs.server.datanode.DataNode://' -e 's/\(.*\) \(hadoop[0-9]*\).*/\1/'  | grep -v DataN | sort | uniq -c
       1 2015-02-23 13:0
       6 2015-02-23 13:3
       3 2015-02-23 13:4
       1 2015-02-23 13:5
       2 2015-02-23 14:0
       1 2015-02-23 14:2
       6 2015-02-23 14:3
       9 2015-02-23 14:4
       3 2015-02-23 14:5
      17 2015-02-23 15:0
      38 2015-02-23 15:1
       3 2015-02-23 15:3
       2 2015-02-23 16:0
       3 2015-02-23 16:3
      10 2015-02-23 17:0 <--- first ingress spike
     111 2015-02-23 17:2
     295 2015-02-23 17:3
       1 2015-02-23 17:5
       2 2015-02-23 18:0
       2 2015-02-23 18:1
    1342 2015-02-23 18:3 <--- ingress spike drops off, assuming errors on job side (needs to be further investigated)
     449 2015-02-23 18:4
       1 2015-02-23 19:0
       1 2015-02-23 19:1
       1 2015-02-23 19:2
     519 2015-02-23 19:3
     124 2015-02-23 19:4
       2 2015-02-23 20:0
     436 2015-02-23 20:1 <--- ingress plateau starts to waffle, assuming cluster data issues causing ingress performance problems
    1150 2015-02-23 20:2
     769 2015-02-23 20:3
     857 2015-02-23 20:4
    1180 2015-02-23 20:5
     724 2015-02-23 21:0
    1418 2015-02-23 21:1
    1790 2015-02-23 21:2
    2123 2015-02-23 21:3
    1980 2015-02-23 21:4
    1732 2015-02-23 21:5
    1083 2015-02-23 22:0
    1942 2015-02-23 22:1
    1075 2015-02-23 22:2
    1809 2015-02-23 22:3
    1148 2015-02-23 22:4 <--- disk failure on one node
     845 2015-02-23 22:5
     552 2015-02-23 23:0
     589 2015-02-23 23:1
     222 2015-02-23 23:2
     269 2015-02-23 23:3
     161 2015-02-24 00:0
     109 2015-02-24 00:3
     109 2015-02-24 01:0
       2 2015-02-24 01:3
       4 2015-02-24 02:0
       4 2015-02-24 02:3
       1 2015-02-24 03:3
       6 2015-02-24 04:0

Limitations:

Conclusions:

  • One failed disk in the Hadoop cluster caused an under-replicated block spike; the affected blocks were re-replicated accordingly

  • The disk failure, in combination with the load, caused datanode errors that pushed HLog replication far enough behind that some, but not all, of the region servers exited

  • The increase in datanode failures was a result of increased cluster usage

  • Determining the actor that initiated the work is currently difficult or impossible with the available audit data

Future Work:

  • More audit logging of operations is needed

  • Discussion about quotas (the cluster was almost full, <10% free)

  • Monitoring for network saturation should be improved

  • Timeouts for HBase should be investigated

  • Timeouts for datanode operations should be investigated

  • Alerts should be set for regionserver failure, datanode errors, and under replicated blocks
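  • A rough sketch of what such an alert check could look like, e.g. run from cron or wrapped as a Nagios/Datadog custom check; the threshold and the CLI usage are assumptions, not an agreed design:

    #!/bin/bash
    # Alert sketch: under-replicated block count plus local regionserver liveness.
    THRESHOLD=${1:-10000}   # placeholder threshold for under-replicated blocks

    # "Under replicated blocks: N" comes from the dfsadmin report summary.
    under=$(sudo -u hdfs hdfs dfsadmin -report 2>/dev/null \
            | awk -F': ' '/Under replicated blocks/ {print $2; exit}')

    if [ "${under:-0}" -gt "$THRESHOLD" ]; then
        echo "CRITICAL: $under under-replicated blocks (threshold $THRESHOLD)"
        exit 2
    fi

    # Regionserver liveness on this host (process check only; JMX would be richer).
    if ! pgrep -f org.apache.hadoop.hbase.regionserver.HRegionServer >/dev/null; then
        echo "CRITICAL: no HRegionServer process running on $(hostname)"
        exit 2
    fi

    echo "OK: ${under:-unknown} under-replicated blocks; regionserver running"
    exit 0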
