Skip to content

Instantly share code, notes, and snippets.

@mrsimpson
Last active April 4, 2019 22:15
Show Gist options
  • Save mrsimpson/432a65113c1d7394b2d0b421213f38c8 to your computer and use it in GitHub Desktop.
Save mrsimpson/432a65113c1d7394b2d0b421213f38c8 to your computer and use it in GitHub Desktop.
Analysis: Writing to MongoDB with Rocket.Chat not possible or heavily delayed

This gist is to document what we've seen, understood and which actions have been taken

Phenomena

Obvious issues

  • Users cannot log in - after klicking the login-button, there was neither a response nor a timeout
  • Messages sent remain grayed out for a long time

At the same time

  • The login page could be opened
  • A websocket connection from the client has been established
  • Once logged in (once a session was established), users could browse also channels that they had not cached earlier

Analysis

What we investigated in order to come to conclusions

Infrastructure

  • The system load at the application server was close to 0
  • Load of the MongoDB was ~50%, but the primary node had had bursts earlier the day
  • Connection from chat to database was working properly

Application

  • The DB accessed from the chat client in a mongo shell was responding quickly (read access only)
  • When sending a message from a slow client, the message was handed over to webhooks quite quickly. Also, it was written into the db with a looooong delay
  • The application itself has not been restarted be the process manager (forever was not complainign about a dead node due to e. g. an uncaught exception)
  • API access was always possible, thus also the health check accessing /api/v1/info was fine all the time
  • Interaction via the API does not create a session, logging in via the browser does
  • Restarting the node process via the forever service (service rocketchat restart) resolved the state: After being attached to the load balancer, the node was immediately catching client requests and responded to them properly

Conclusions and guesses

  • The chat servers were not busy since there hardly were any sessions connected to them
  • It seems as if the Database was not able to provide a session to the client with new sessions
  • Since all read interaction was possible quickly, it seems as if the primary mongo node was the bottleneck (writes only via primary, reads also via secondaries)
  • Since restarting the process with the same configuration resolved the issue, it seems as if the primary was ignored by the node process running earlier => An unresponsive primary may cause this behaviour.
  • Writing to the replica set is done using OPLOG. May there be any configurations which influence how quickly the OPLOG is operated upon?

What's still really strange

If there was a mechanism in Meteor or node to ignore a member of the replica set, why did some documents finally still get written to the database (even with the very long delay)?

Actions

  • Implement a liveness probe on browser level in order to automatically detect these situations, restart the process if it fails. (Code already written, needs to be added to the ops-layers)
  • Scale the DB to avoid 100% load bursts
  • Improve monitoring and alert management
  • Reproduce the situation

Reproduction

@thomasroehl could reproduce a state which is very close to our productive problems:

The setup consisted of a Rocketchat and a external MongoDB (single instance, no oplog) running on localhost. I did the following testflow:

  1. Login and Logout from RC, writing some messages and loading some new rooms -> everything worked as expected
  2. Set a write-lock on the MongoDB for the RC database: db.fsyncLock()
  3. Try to Login with a User -> After clicking the login button, it got stuck in please wait -> no errors in RC
  4. Try to write messages and load some new rooms with an already logged in user -> messages stuck in grey color; new rooms loaded as expected
  5. Unset the write-lock on the MongoDB for the RC database: db.fsyncUnlock()
  6. Try to Login with a User -> worked as expected

Some insights:

  • During the whole testflow no errors occured in RC
  • After unlocking the MongoDB, all grey marked messages have been sent
  • After having the write-lock on the MongoDB for some longer time (aprox. 10+min), a Mongo connection Socket broke with the following error (we didn't see this error in our production running an oplog-based replicaset!):
I20190401-15:08:38.980(2)?     at SynchronousCursor._nextObject (packages/mongo/mongo_driver.js:1091:12)
I20190401-15:08:38.980(2)?     at SynchronousCursor.forEach (packages/mongo/mongo_driver.js:1105:22)
I20190401-15:08:38.980(2)?     at SynchronousCursor.map (packages/mongo/mongo_driver.js:1115:10)
I20190401-15:08:38.980(2)?     at SynchronousCursor.fetch (packages/mongo/mongo_driver.js:1139:17)
I20190401-15:08:38.980(2)?     at Cursor.(anonymous function) [as fetch] (packages/mongo/mongo_driver.js:887:44)
I20190401-15:08:38.980(2)?     at Object.<anonymous> (packages/matb33_collection-hooks.js:445:81)
I20190401-15:08:38.980(2)?     at Object.collection.(anonymous function) [as update] (packages/matb33_collection-hooks.js:146:21)
I20190401-15:08:38.980(2)?     at ns.Collection.update (packages/mongo/collection.js:585:31)
I20190401-15:08:38.980(2)?     at ns.Collection.Mongo.Collection.(anonymous function) [as update] (packages/dispatch_run-as-user.js:325:19)
I20190401-15:08:38.980(2)?     at Object.ping (packages/konecty_multiple-instances-status.js:136:25)
I20190401-15:08:38.980(2)?     at packages/konecty_multiple-instances-status.js:124:19
I20190401-15:08:38.981(2)?     at Meteor.EnvironmentVariable.EVp.withValue (packages/meteor.js:1304:12)
I20190401-15:08:38.981(2)?     at packages/meteor.js:620:25
I20190401-15:08:38.981(2)?     at runWithEnvironment (packages/meteor.js:1356:24)
I20190401-15:08:38.981(2)?   name: 'MongoNetworkError',
I20190401-15:08:38.981(2)?   errorLabels: [ 'TransientTransactionError' ],
I20190401-15:08:38.981(2)?   [Symbol(mongoErrorContextSymbol)]: {} }
I20190401-15:08:38.981(2)? Exception in setInterval callback: { MongoNetworkError: connection 5 to localhost:27017 timed out
I20190401-15:08:38.981(2)?     at Socket.<anonymous> (/Users/user/.meteor/packages/npm-mongo/.3.1.1.1mmptof.qcqo++os+web.browser+web.browser.legacy+web.cordova/npm/node_modules/mongodb-core/lib/connection/connection.js:259:7)

  • After that, RC was not able to reastablish the write connection to the Mongo again -> all reading operations such as loading new room content as still working, but writing operations as login or sending messages failed got stuck
  • This state was only fixable by restarting the RC node process.
  • After restarting the node process, all pending updates were pushed to the database, messages turned black. All tested with the following version:
+-----------------------------------------------+
|                 SERVER RUNNING                |
+-----------------------------------------------+
|                                               |
|  Rocket.Chat Version: 0.73.0-0.9.11           |
|       NodeJS Version: 8.11.4 - x64            |
|             Platform: darwin                  |
|         Process Port: 25984                   |
|             Site URL: http://localhost:3000/  |
|     ReplicaSet OpLog: Disabled                |
|          Commit Hash: 50b5157c1a              |
|        Commit Branch: develop                 |
|                                               |
+-----------------------------------------------+
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment