- Timeout values, alerting, and current metrics in the Hadoop cluster are not sufficient to prevent availability-impacting outage scenarios.
- That recent HBase region server failures are not a result of GC or CMF actions
- That tuning could be done to improve automated recovery
- That there are un-auditable activities happening on the cluster which make triage/RCA more difficult
- Define the time of occurrence (a grep sketch follows this list of steps)
- Track back from the HBase region server logs to the possible underlying cause
- Determine the actor that initiated the activity that resulted in the failure
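A minimal sketch of the first two steps, assuming a CDH-style region server log location and file naming (the /var/log/hbase path and the *REGIONSERVER* glob are assumptions):

# Pin the abort time, then pull the ERROR/FATAL lines immediately preceding it for the likely trigger.
$ grep -h 'ABORTING region server' /var/log/hbase/*REGIONSERVER*.log.out* | sort
$ grep -hB20 'ABORTING region server' /var/log/hbase/*REGIONSERVER*.log.out* | grep -E 'ERROR|FATAL'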
- Example of an HBase region server shutdown from hadoop053:
2015-02-23 23:21:33,589 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Failed close of HLog writer org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/hadoop053.flexd-sf0.local,60020,1422991587643/hadoop053.flexd-sf0.local%2C60020%2C1422991587643.1424762493366 could only be replicated to 0 nodes instead of minReplication (=1). There are 114 datanode(s) running and no node(s) are excluded in this operation. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:614) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:188) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:476) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at org.apache.hadoop.ipc.Client.call(Client.java:1411) at org.apache.hadoop.ipc.Client.call(Client.java:1364) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy21.addBlock(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:391) at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy22.addBlock(Unknown Source) at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294) at com.sun.proxy.$Proxy23.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1473) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1290) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:536) 2015-02-23 23:21:33,590 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server hadoop053.flexd-sf0.local,60020,1422991587643: Failed log close in log roller org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: #1424762493366 at org.apache.hadoop.hbase.regionserver.wal.FSHLog.cleanupCurrentWriter(FSHLog.java:787) at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:575) at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:97) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/hadoop053.flexd-sf0.local,60020,1422991587643/hadoop053.flexd-sf0.local%2C60020%2C1422991587643.1424762493366 could only be replicated to 0 nodes instead of minReplication (=1). There are 114 datanode(s) running and no node(s) are excluded in this operation. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:614) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:188) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:476) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at org.apache.hadoop.ipc.Client.call(Client.java:1411) at org.apache.hadoop.ipc.Client.call(Client.java:1364) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy21.addBlock(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:391) at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy22.addBlock(Unknown Source) at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294) at com.sun.proxy.$Proxy23.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1473) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1290) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:536) 2015-02-23 23:21:33,615 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [] 2015-02-23 23:21:33,663 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Failed log close in log roller 2015-02-23 23:21:33,663 INFO 
org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting. 2015-02-23 23:21:33,663 INFO org.apache.hadoop.ipc.RpcServer: Stopping server on 60020 2015-02-23 23:21:33,664 INFO org.apache.hadoop.ipc.RpcServer: RpcServer.listener,port=60020: stopping 2015-02-23 23:21:33,664 INFO org.apache.hadoop.ipc.RpcServer: RpcServer.responder: stopped 2015-02-23 23:21:33,664 INFO org.apache.hadoop.ipc.RpcServer: RpcServer.responder: stopping 2015-02-23 23:21:33,664 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread 2015-02-23 23:21:33,670 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer 2015-02-23 23:21:33,670 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker interrupted while waiting for task, exiting: java.lang.InterruptedException 2015-02-23 23:21:33,671 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker hadoop053.flexd-sf0.local,60020,1422991587643 exiting 2015-02-23 23:21:33,683 INFO org.mortbay.log: Stopped [email protected]:60030 2015-02-23 23:21:33,784 INFO org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: Stopping RegionServerSnapshotManager abruptly. 2015-02-23 23:21:33,784 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: MemStoreFlusher.1 exiting 2015-02-23 23:21:33,784 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: MemStoreFlusher.0 exiting 2015-02-23 23:21:33,784 INFO org.apache.hadoop.hbase.regionserver.ServerNonceManager$1: regionserver60020.nonceCleaner exiting 2015-02-23 23:21:33,784 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$CompactionChecker: regionserver60020.compactionChecker exiting 2015-02-23 23:21:33,785 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server hadoop053.flexd-sf0.local,60020,1422991587643 2015-02-23 23:21:33,786 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x44a5edb26a25cbe
- HBase region server shutdown causes from hadoop049:
2015-02-23 21:14:51,695 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-308208786-192.168.50.36-1363989756031:blk_1260528925_1099716500363 java.net.SocketTimeoutException: 75000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.82.15.139:54612 remote=/10.82.15.210:50010] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.FilterInputStream.read(FilterInputStream.java:83) at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2101) at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:810)
2015-02-23 23:15:38,189 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/data/default/device_events/5dd4aa420dd063a48492db378c49499d/.tmp/12a9dd99a89f4c2d8db1a979d8846203 could only be replicated to 0 nodes instead of minReplication (=1). There are 114 datanode(s) running and no node(s) are excluded in this operation. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:614) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:188) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:476) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at org.apache.hadoop.ipc.Client.call(Client.java:1411) at org.apache.hadoop.ipc.Client.call(Client.java:1364) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy21.addBlock(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:391) at sun.reflect.GeneratedMethodAccessor113.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy22.addBlock(Unknown Source) at sun.reflect.GeneratedMethodAccessor113.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294) at com.sun.proxy.$Proxy23.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1473) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1290) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:536)
2015-02-23 23:15:47,370 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server hadoop049.flexd-sf0.local,60020,1422991589849: Replay of HLog required. Forcing server shutdown org.apache.hadoop.hbase.DroppedSnapshotException: region: device_events,6721a53b,1406930979994.5dd4aa420dd063a48492db378c49499d. at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1804) at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1662) at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1575) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:461) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:435) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:66) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:253) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/data/default/device_events/5dd4aa420dd063a48492db378c49499d/.tmp/6ea0038a2db245e6a5250360e8e0891e could only be replicated to 0 nodes instead of minReplication (=1). There are 114 datanode(s) running and no node(s) are excluded in this operation. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:614) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:188) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:476) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at org.apache.hadoop.ipc.Client.call(Client.java:1411) at org.apache.hadoop.ipc.Client.call(Client.java:1364) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy21.addBlock(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:391) at sun.reflect.GeneratedMethodAccessor113.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy22.addBlock(Unknown Source) at sun.reflect.GeneratedMethodAccessor113.invoke(Unknown Source) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294) at com.sun.proxy.$Proxy23.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1473) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1290) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:536) 2015-02-23 23:15:47,394 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [] 2015-02-23 23:15:47,440 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Replay of HLog required. Forcing server shutdown 2015-02-23 23:15:47,440 INFO org.apache.hadoop.ipc.RpcServer: Stopping server on 60020 2015-02-23 23:15:47,440 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: MemStoreFlusher.1 exiting 2015-02-23 23:15:47,441 INFO org.apache.hadoop.ipc.RpcServer: RpcServer.listener,port=60020: stopping 2015-02-23 23:15:47,441 INFO org.apache.hadoop.ipc.RpcServer: RpcServer.responder: stopped 2015-02-23 23:15:47,441 INFO org.apache.hadoop.ipc.RpcServer: RpcServer.responder: stopping 2015-02-23 23:15:47,442 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread 2015-02-23 23:15:47,447 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer 2015-02-23 23:15:47,447 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker interrupted while waiting for task, exiting: java.lang.InterruptedException 2015-02-23 23:15:47,447 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker hadoop049.flexd-sf0.local,60020,1422991589849 exiting 2015-02-23 23:15:47,473 INFO org.mortbay.log: Stopped [email protected]:60030 2015-02-23 23:15:47,574 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: MemStoreFlusher.0 exiting 2015-02-23 23:15:47,574 INFO org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: Stopping RegionServerSnapshotManager abruptly. 2015-02-23 23:15:47,574 INFO org.apache.hadoop.hbase.regionserver.ServerNonceManager$1: regionserver60020.nonceCleaner exiting 2015-02-23 23:15:47,574 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting. 
2015-02-23 23:15:47,574 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$CompactionChecker: regionserver60020.compactionChecker exiting 2015-02-23 23:15:47,578 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server hadoop049.flexd-sf0.local,60020,1422991589849 2015-02-23 23:15:47,578 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x14a5edb27fedb6e 2015-02-23 23:15:47,580 INFO org.apache.zookeeper.ZooKeeper: Session: 0x14a5edb27fedb6e closed 2015-02-23 23:15:47,580 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2015-02-23 23:15:47,640 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 4 regions to close 2015-02-23 23:15:47,658 INFO org.apache.hadoop.hbase.regionserver.HStore: Closed cf 2015-02-23 23:15:47,658 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 1065600 2015-02-23 23:15:47,662 INFO org.apache.hadoop.hbase.regionserver.HStore: Closed cf 2015-02-23 23:15:47,663 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 3910824 2015-02-23 23:15:47,664 INFO org.apache.hadoop.hbase.regionserver.HStore: Closed cf 2015-02-23 23:15:47,665 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 953712 2015-02-23 23:15:47,676 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed device_events,e2be2bba,1406930979997.27add37206ade98a929e4c6ff27011a6. 2015-02-23 23:15:47,676 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed device_events,6721a53b,1406930979994.5dd4aa420dd063a48492db378c49499d. 2015-02-23 23:15:47,677 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed device_events,c869533e,1406930979996.1d9e2be0e0d4d93cc97ab78370508d7d. 2015-02-23 23:15:47,679 INFO org.apache.hadoop.hbase.regionserver.HStore: Closed cf 2015-02-23 23:15:47,679 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 1135752 2015-02-23 23:15:47,679 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed device_events,bd70a3b5,1406930979996.053af1cdf4455d391f6816cd5c5c2f76. 2015-02-23 23:15:47,840 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server hadoop049.flexd-sf0.local,60020,1422991589849; all regions closed. 2015-02-23 23:15:47,841 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: regionserver60020-WAL.AsyncNotifier exiting 2015-02-23 23:15:47,841 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: regionserver60020-WAL.AsyncSyncer0 exiting 2015-02-23 23:15:47,842 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: regionserver60020-WAL.AsyncSyncer1 exiting
- Cause of the write failure, from the NameNode:
hadoop-cmf-hdfs-NAMENODE-sf1r005s017.flexd-sf0.local.log.out.10:2015-02-23 23:21:32,878 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 3 to reach 3. For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
hadoop-cmf-hdfs-NAMENODE-sf1r005s017.flexd-sf0.local.log.out.10:2015-02-23 23:21:32,879 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hbase (auth:SIMPLE) cause:java.io.IOException: File /hbase/WALs/hadoop053.flexd-sf0.local,60020,1422991587643/hadoop053.flexd-sf0.local%2C60020%2C1422991587643.1424762492868 could only be replicated to 0 nodes instead of minReplication (=1). There are 114 datanode(s) running and no node(s) are excluded in this operation.
hadoop-cmf-hdfs-NAMENODE-sf1r005s017.flexd-sf0.local.log.out.10:2015-02-23 23:21:32,879 INFO org.apache.hadoop.ipc.Server: IPC Server handler 106 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.82.14.136:49261 Call#118756 Retry#0
- The NameNode experienced block placement failures for multiple minutes, starting at:
hadoop-cmf-hdfs-NAMENODE-sf1r005s017.flexd-sf0.local.log.out.10:2015-02-23 22:49:00,465 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 2 to reach 3. For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
- Ending at:
hadoop-cmf-hdfs-NAMENODE-sf1r005s017.flexd-sf0.local.log.out.10:2015-02-23 23:32:56,504 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3. For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
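The window above can be bracketed directly from the NameNode log; a minimal sketch, assuming the rotated logs are still on disk under /var/log/hadoop-hdfs:

$ grep -h 'Failed to place enough replicas' /var/log/hadoop-hdfs/hadoop-cmf-hdfs-NAMENODE-*.log.out* \
    | grep '^2015-02-23' | sort | sed -n '1p;$p'   # first and last warning of the day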
- Capacity loss: under-replicated blocks climb to 171k starting at 02/23 22:48 PST (https://app.datadoghq.com/dash/dash/41141?from_ts=1424760391510&to_ts=1424760691510&tile_size=l)
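The same figure can be pulled straight from the NameNode rather than the dashboard; a minimal sketch (exact output wording varies slightly between versions):

$ sudo -u hdfs hdfs dfsadmin -report | grep -i 'under replicated'
$ sudo -u hdfs hdfs fsck / 2>/dev/null | grep -i 'under-replicated'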
- There was another operation putting both heavy ingress load and increased storage demand on the cluster (https://app.datadoghq.com/dash/dash/41141?from_ts=1424738017225&to_ts=1424766524058&tile_size=l). This was not itself a problem, but it probably caused the 'discovery' of the failing disk via write attempts on a non-scrubbed spindle (http://en.wikipedia.org/wiki/Data_scrubbing).
- Timeout failures on datanodes (the sed expression below truncates each timestamp to a 10-minute bucket before counting):
bens@hadoop032:/var/log/hadoop-hdfs$ cat hadoop-cmf-hdfs-DATANODE-hadoop032.flexd-sf0.local.log.out* | grep ERROR | sed 's/[0-9]:[0-9]*,[0-9]* .*//' | sort | uniq -c
      1 2015-02-21 19:0
      1 2015-02-21 19:3
      1 2015-02-21 20:0
      1 2015-02-21 20:3
      1 2015-02-23 11:2
      3 2015-02-23 17:2
      5 2015-02-23 17:3
      8 2015-02-23 18:3
      2 2015-02-23 18:4
      6 2015-02-23 19:3
      3 2015-02-23 20:1
     15 2015-02-23 20:2
     27 2015-02-23 20:3
     90 2015-02-23 20:4
     19 2015-02-23 20:5
     27 2015-02-23 21:0
     52 2015-02-23 21:1
     54 2015-02-23 21:2
     49 2015-02-23 21:3
     67 2015-02-23 21:4
     43 2015-02-23 21:5
    129 2015-02-23 22:0
     27 2015-02-23 22:1
     75 2015-02-23 22:2
    182 2015-02-23 22:3
     63 2015-02-23 22:4
     25 2015-02-23 22:5
      4 2015-02-23 23:0
      2 2015-02-23 23:1
      4 2015-02-23 23:2
      2 2015-02-23 23:3
      1 2015-02-24 00:0
      1 2015-02-24 13:5
- Not all nodes were affected equally; hadoop049, for comparison:
root@hadoop049:/var/log/hadoop-hdfs# cat *DATANODE* | grep ERROR | sed 's/[0-9]:[0-9]*,[0-9]* .*//' | sort | uniq -c
      1 2015-02-23 15:3
      1 2015-02-23 16:0
      1 2015-02-23 16:3
      2 2015-02-23 17:2
      9 2015-02-23 18:3
      2 2015-02-23 18:4
      1 2015-02-23 19:3
      8 2015-02-23 20:1
     13 2015-02-23 20:2
      7 2015-02-23 20:3
      6 2015-02-23 20:4
      1 2015-02-23 20:5
      1 2015-02-23 21:0
      2 2015-02-23 21:1
      2 2015-02-23 22:5
      7 2015-02-23 23:1
      1 2015-02-24 11:1
     39 2015-02-24 13:5
- Cluster-wide:
Collection methodology:
$ for i in `cat hadoop-ips`; do ssh $i 'cat /var/log/hadoop-hdfs/*DATANODE* | grep ERROR'; done >/tmp/hadoop-ERROR
$ cat /tmp/hadoop-ERROR | sed -e 's/[0-9]:[0-9]*,[0-9]* ERROR org.apache.hadoop.hdfs.server.datanode.DataNode://' -e 's/\(.*\) \(hadoop[0-9]*\).*/\1/' | grep -v DataN | sort | uniq -c
      1 2015-02-23 13:0
      6 2015-02-23 13:3
      3 2015-02-23 13:4
      1 2015-02-23 13:5
      2 2015-02-23 14:0
      1 2015-02-23 14:2
      6 2015-02-23 14:3
      9 2015-02-23 14:4
      3 2015-02-23 14:5
     17 2015-02-23 15:0
     38 2015-02-23 15:1
      3 2015-02-23 15:3
      2 2015-02-23 16:0
      3 2015-02-23 16:3
     10 2015-02-23 17:0   <--- first ingress spike
    111 2015-02-23 17:2
    295 2015-02-23 17:3
      1 2015-02-23 17:5
      2 2015-02-23 18:0
      2 2015-02-23 18:1
   1342 2015-02-23 18:3   <--- ingress spike drops off, assuming errors on job side (needs to be further investigated)
    449 2015-02-23 18:4
      1 2015-02-23 19:0
      1 2015-02-23 19:1
      1 2015-02-23 19:2
    519 2015-02-23 19:3
    124 2015-02-23 19:4
      2 2015-02-23 20:0
    436 2015-02-23 20:1   <--- ingress plateau starts to waffle, assuming cluster data issues causing ingress performance problems
   1150 2015-02-23 20:2
    769 2015-02-23 20:3
    857 2015-02-23 20:4
   1180 2015-02-23 20:5
    724 2015-02-23 21:0
   1418 2015-02-23 21:1
   1790 2015-02-23 21:2
   2123 2015-02-23 21:3
   1980 2015-02-23 21:4
   1732 2015-02-23 21:5
   1083 2015-02-23 22:0
   1942 2015-02-23 22:1
   1075 2015-02-23 22:2
   1809 2015-02-23 22:3
   1148 2015-02-23 22:4   <--- disk failure on one node
    845 2015-02-23 22:5
    552 2015-02-23 23:0
    589 2015-02-23 23:1
    222 2015-02-23 23:2
    269 2015-02-23 23:3
    161 2015-02-24 00:0
    109 2015-02-24 00:3
    109 2015-02-24 01:0
      2 2015-02-24 01:3
      4 2015-02-24 02:0
      4 2015-02-24 02:3
      1 2015-02-24 03:3
      6 2015-02-24 04:0
- A single failed disk in the Hadoop cluster caused an under-replicated block spike, and the affected blocks were redistributed accordingly.
- The disk failure, in combination with the load, caused datanode errors that pushed HLog (WAL) replication so far behind that some, but not all, of the region servers exited.
- The increase in datanode failures was a result of increased cluster usage.
- Determining what initiated that work is currently difficult or impossible.
- More audit logging of operations is needed.
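A minimal sketch of what that could look like using the stock HDFS audit log. The hdfs.audit.logger property and RFAAUDIT appender are standard Hadoop log4j plumbing, but whether Cloudera Manager manages them differently on this cluster, and the /var/log/hadoop-hdfs/hdfs-audit.log path, are assumptions:

# Enable the NameNode audit log (off by default via the NullAppender) through the NameNode JVM options:
HADOOP_NAMENODE_OPTS="-Dhdfs.audit.logger=INFO,RFAAUDIT ${HADOOP_NAMENODE_OPTS}"

# Once it exists, write activity can be attributed by user, e.g. creates against device_events:
$ grep 'cmd=create' /var/log/hadoop-hdfs/hdfs-audit.log | grep 'device_events' \
    | sed 's/.*ugi=\([^ ]*\).*/\1/' | sort | uniq -c | sort -rn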
- Discussion about quotas (the cluster was almost full, <10% free).
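For reference, a minimal sketch of checking remaining capacity and applying a space quota to a heavy directory (the directory and quota size below are purely illustrative):

$ hdfs dfsadmin -report | grep -E 'DFS Remaining|DFS Used%'
$ sudo -u hdfs hdfs dfsadmin -setSpaceQuota 50t /some/heavy/dir   # illustrative path and size
$ hdfs dfs -count -q /some/heavy/dir                              # confirm the quota is in place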
- Monitoring for network saturation should be improved.
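Until proper dashboards exist, a minimal spot-check sketch using sar from the sysstat package (the eth0 interface name is an assumption):

$ sar -n DEV 10 3 | grep -E 'Average|eth0'   # rxkB/s and txkB/s per interface, averaged over ~30s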
- Timeouts for HBase should be investigated.
- Timeouts for datanode operations should be investigated.
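For the two timeout items above, a minimal sketch of pulling the currently deployed values for the most likely candidates. The property names are standard HDFS/HBase settings, and the 75000 ms in the DFSClient warning earlier appears consistent with the default 60 s dfs.client.socket-timeout plus a 5 s per-datanode extension for a 3-node pipeline, but which settings actually need changing is an assumption; a property absent from the file means the compiled-in default is in effect:

$ grep -A1 -E 'dfs.client.socket-timeout|dfs.socket.timeout|dfs.datanode.socket.write.timeout' /etc/hadoop/conf/hdfs-site.xml
$ grep -A1 -E 'hbase.rpc.timeout|zookeeper.session.timeout|hbase.regionserver.logroll.errors.tolerated' /etc/hbase/conf/hbase-site.xml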
- Alerts should be set for region server failure, datanode errors, and under-replicated blocks (an example check is sketched below).
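A minimal sketch of one such check, polling the NameNode JMX servlet for the under-replicated block count so it can be wired into cron, Nagios, or a Datadog custom check (the host:port, threshold, and JSON field formatting assumed below would need verification):

#!/bin/bash
# Hypothetical check: alert when under-replicated blocks exceed a threshold.
NN="sf1r005s017.flexd-sf0.local:50070"    # NameNode web UI host:port (assumed)
THRESHOLD=1000                            # assumed alerting threshold
COUNT=$(curl -s "http://${NN}/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem" \
          | grep -o '"UnderReplicatedBlocks" : [0-9]*' | grep -o '[0-9]*$')
if [ -z "$COUNT" ]; then
  echo "CRITICAL: could not read UnderReplicatedBlocks from ${NN}"; exit 2
elif [ "$COUNT" -gt "$THRESHOLD" ]; then
  echo "WARNING: ${COUNT} under-replicated blocks (threshold ${THRESHOLD})"; exit 1
fi
echo "OK: ${COUNT} under-replicated blocks"; exit 0

Region server and datanode process deaths could be caught the same way from their own JMX endpoints, or simply by alerting on the ERROR/FATAL rates already being extracted with the grep pipelines above.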