This gist is to document what we've seen, understood and which actions have been taken
- Users cannot log in - after klicking the login-button, there was neither a response nor a timeout
- Messages sent remain grayed out for a long time
- The login page could be opened
- A websocket connection from the client has been established
- Once logged in (once a session was established), users could browse also channels that they had not cached earlier
What we investigated in order to come to conclusions
- The system load at the application server was close to
0
- Load of the MongoDB was ~50%, but the primary node had had bursts earlier the day
- Connection from chat to database was working properly
- The DB accessed from the chat client in a mongo shell was responding quickly (read access only)
- When sending a message from a slow client, the message was handed over to webhooks quite quickly. Also, it was written into the db with a looooong delay
- The application itself has not been restarted be the process manager (forever was not complainign about a dead node due to e. g. an uncaught exception)
- API access was always possible, thus also the health check accessing
/api/v1/info
was fine all the time - Interaction via the API does not create a session, logging in via the browser does
- Restarting the node process via the forever service (service rocketchat restart) resolved the state: After being attached to the load balancer, the node was immediately catching client requests and responded to them properly
- The chat servers were not busy since there hardly were any sessions connected to them
- It seems as if the Database was not able to provide a session to the client with new sessions
- Since all read interaction was possible quickly, it seems as if the primary mongo node was the bottleneck (writes only via primary, reads also via secondaries)
- Since restarting the process with the same configuration resolved the issue, it seems as if the primary was ignored by the node process running earlier => An unresponsive primary may cause this behaviour.
- Writing to the replica set is done using OPLOG. May there be any configurations which influence how quickly the OPLOG is operated upon?
If there was a mechanism in Meteor or node to ignore a member of the replica set, why did some documents finally still get written to the database (even with the very long delay)?
- Implement a liveness probe on browser level in order to automatically detect these situations, restart the process if it fails. (Code already written, needs to be added to the ops-layers)
- Scale the DB to avoid 100% load bursts
- Improve monitoring and alert management
- Reproduce the situation
@thomasroehl could reproduce a state which is very close to our productive problems:
The setup consisted of a Rocketchat and a external MongoDB (single instance, no oplog) running on localhost. I did the following testflow:
- Login and Logout from RC, writing some messages and loading some new rooms -> everything worked as expected
- Set a write-lock on the MongoDB for the RC database:
db.fsyncLock()
- Try to Login with a User -> After clicking the login button, it got stuck in
please wait
-> no errors in RC - Try to write messages and load some new rooms with an already logged in user -> messages stuck in grey color; new rooms loaded as expected
- Unset the write-lock on the MongoDB for the RC database:
db.fsyncUnlock()
- Try to Login with a User -> worked as expected
Some insights:
- During the whole testflow no errors occured in RC
- After unlocking the MongoDB, all grey marked messages have been sent
- After having the write-lock on the MongoDB for some longer time (aprox. 10+min), a Mongo connection Socket broke with the following error (we didn't see this error in our production running an oplog-based replicaset!):
I20190401-15:08:38.980(2)? at SynchronousCursor._nextObject (packages/mongo/mongo_driver.js:1091:12)
I20190401-15:08:38.980(2)? at SynchronousCursor.forEach (packages/mongo/mongo_driver.js:1105:22)
I20190401-15:08:38.980(2)? at SynchronousCursor.map (packages/mongo/mongo_driver.js:1115:10)
I20190401-15:08:38.980(2)? at SynchronousCursor.fetch (packages/mongo/mongo_driver.js:1139:17)
I20190401-15:08:38.980(2)? at Cursor.(anonymous function) [as fetch] (packages/mongo/mongo_driver.js:887:44)
I20190401-15:08:38.980(2)? at Object.<anonymous> (packages/matb33_collection-hooks.js:445:81)
I20190401-15:08:38.980(2)? at Object.collection.(anonymous function) [as update] (packages/matb33_collection-hooks.js:146:21)
I20190401-15:08:38.980(2)? at ns.Collection.update (packages/mongo/collection.js:585:31)
I20190401-15:08:38.980(2)? at ns.Collection.Mongo.Collection.(anonymous function) [as update] (packages/dispatch_run-as-user.js:325:19)
I20190401-15:08:38.980(2)? at Object.ping (packages/konecty_multiple-instances-status.js:136:25)
I20190401-15:08:38.980(2)? at packages/konecty_multiple-instances-status.js:124:19
I20190401-15:08:38.981(2)? at Meteor.EnvironmentVariable.EVp.withValue (packages/meteor.js:1304:12)
I20190401-15:08:38.981(2)? at packages/meteor.js:620:25
I20190401-15:08:38.981(2)? at runWithEnvironment (packages/meteor.js:1356:24)
I20190401-15:08:38.981(2)? name: 'MongoNetworkError',
I20190401-15:08:38.981(2)? errorLabels: [ 'TransientTransactionError' ],
I20190401-15:08:38.981(2)? [Symbol(mongoErrorContextSymbol)]: {} }
I20190401-15:08:38.981(2)? Exception in setInterval callback: { MongoNetworkError: connection 5 to localhost:27017 timed out
I20190401-15:08:38.981(2)? at Socket.<anonymous> (/Users/user/.meteor/packages/npm-mongo/.3.1.1.1mmptof.qcqo++os+web.browser+web.browser.legacy+web.cordova/npm/node_modules/mongodb-core/lib/connection/connection.js:259:7)
- After that, RC was not able to reastablish the write connection to the Mongo again -> all reading operations such as loading new room content as still working, but writing operations as login or sending messages failed got stuck
- This state was only fixable by restarting the RC node process.
- After restarting the node process, all pending updates were pushed to the database, messages turned black. All tested with the following version:
+-----------------------------------------------+
| SERVER RUNNING |
+-----------------------------------------------+
| |
| Rocket.Chat Version: 0.73.0-0.9.11 |
| NodeJS Version: 8.11.4 - x64 |
| Platform: darwin |
| Process Port: 25984 |
| Site URL: http://localhost:3000/ |
| ReplicaSet OpLog: Disabled |
| Commit Hash: 50b5157c1a |
| Commit Branch: develop |
| |
+-----------------------------------------------+