MongoDB is a document database that supports range and field queries (https://github.com/foxish/docker-mongodb/tree/master/kubernetes).
A single server can run either standalone or as part of a replica set. A "replica set" is a set of mongod instances with one primary. Primary: receives writes and services reads; can step down and become a secondary. Secondary: replicates the primary's oplog. If the primary goes down, the secondaries hold an election. Arbiter: used to achieve a majority vote when there is an even number of members; holds no data, doesn't need a dedicated node, and never becomes primary.
Replication is asynchronous. Failover: if the primary doesn't communicate with the other members for more than 10s (the default election timeout), the secondaries conduct an election.
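The 10s window is configurable. A mongo-shell sketch (MongoDB 3.2+, run against the primary; the default value shown is an assumption about your deployment):

```javascript
// electionTimeoutMillis controls how long members wait after losing
// contact with the primary before calling an election.
cfg = rs.conf()
cfg.settings.electionTimeoutMillis = 10000  // default: 10 seconds
rs.reconfig(cfg)
```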
Write concern: { w: <value>, j: <boolean>, wtimeout: <number> }
How writes are acknowledged in the system:
- w: 0 — no ack of the write.
- w: 1 — ack once the write has reached the primary (default).
- w: "majority" — ack from a majority of the data-bearing voting members.
- w: n — ack from n data-bearing members (including the primary).
- w: "<tag set>" — ack from members having a particular tag.
- j — whether the write must reach the on-disk journal before being acknowledged.
- wtimeout — how long to wait for the ack before giving up.
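As a sketch of how a write concern attaches to a single operation in the mongo shell (the collection name and document are made up):

```javascript
// Wait for a majority of voting members to acknowledge the write,
// require it to be journaled, and give up after 5 seconds.
db.orders.insertOne(
  { item: "widget", qty: 1 },
  { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
)
```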
Priority values are assigned to each node as floating point numbers between 0 and 1000. Priority 0 members can never become primary (they still vote unless their votes field is also 0). Higher-priority members call elections sooner and are more likely to win. Read concern: local returns the instance's most recent data with no guarantee it has been majority-committed; majority returns only data acknowledged by a majority of the members. (Whether reads go to the primary or a secondary is controlled by the read preference, not the read concern.)
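Priorities are set through the replica-set configuration. A hedged mongo-shell sketch (which member indexes to change is your choice):

```javascript
cfg = rs.conf()
cfg.members[1].priority = 2   // favoured to become primary (default is 1)
cfg.members[2].priority = 0   // can never become primary
rs.reconfig(cfg)
```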
- Arbiter: only votes, holds no data. Don't deploy more than 1 per replica set.
- Hidden: priority 0 and invisible to clients, so it never services application reads; it still votes and maintains a full copy of the data.
- Delayed: typically hidden; maintains a copy of the primary's data with a fixed delay, as protection against e.g. human error.
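A mongo-shell sketch of configuring a hidden, delayed member (member index and delay are assumptions):

```javascript
// Make member 2 a hidden, delayed secondary: clients never read from it,
// and it applies the oplog one hour behind the primary.
cfg = rs.conf()
cfg.members[2].priority = 0
cfg.members[2].hidden = true
cfg.members[2].slaveDelay = 3600   // seconds; renamed secondaryDelaySecs in 5.0+
rs.reconfig(cfg)
```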
A simple MongoDB replica set with three members. We start with an image that turns on replica sets for the instance by supplying the right command-line flags. This becomes the image that we supply to our petset with 3 replicas.
- After the pods are created, we pick any one pod, connect to its mongo instance, and execute rs.initiate(). That node turns into the primary.
- Then rs.add() the other two pods using their cluster domain names. For example:
  rs.add("mongodb-1.mongodb.default.svc.cluster.local")
  rs.add("mongodb-2.mongodb.default.svc.cluster.local")
- Automatic failover works with petsets out of the box.
- Adding new nodes involves finding the PRIMARY and running the corresponding rs.add(...) commands on it.
- Reading from secondaries requires executing rs.slaveOk() on connections to them.
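Instead of rs.initiate() followed by rs.add() calls, the whole set can be initiated in one step by passing a config document. A sketch using the petset DNS names from the example above (the replica-set name "rs0" is an assumption):

```javascript
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongodb-0.mongodb.default.svc.cluster.local:27017" },
    { _id: 1, host: "mongodb-1.mongodb.default.svc.cluster.local:27017" },
    { _id: 2, host: "mongodb-2.mongodb.default.svc.cluster.local:27017" }
  ]
})
```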
The number of members that can become unavailable while the cluster can still elect a primary. With 50 members, 7 of them voting, 46 members can go down (but only 3 of the voting members, since 4 of the 7 must remain to form a majority). WAN deployment: 1 member per DC across 3 DCs can tolerate a single DC going down.
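The arithmetic above can be sketched as follows (the function names are mine):

```javascript
// Votes needed to elect a primary among the voting members.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// Voting members that can fail while an election is still possible.
function votingFaultTolerance(votingMembers) {
  return votingMembers - majority(votingMembers);
}

// Total members that can fail: every non-voting member plus the
// tolerable share of the voting ones.
function totalFaultTolerance(totalMembers, votingMembers) {
  return (totalMembers - votingMembers) + votingFaultTolerance(votingMembers);
}

console.log(totalFaultTolerance(50, 7)); // 43 non-voting + 3 voting = 46
```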
OpLog size depends on the storage engine; there are 3 engines: in-memory, WiredTiger, and MMAPv1.
- MMAPv1 was long the default and preferred engine, due to maturity.
- WiredTiger had some known issues in the past but has been the default since 3.2.
New members, or secondaries that fall behind too far, must resync everything. Starting mongod with an empty datadir forces an initial sync; starting it with a copy of a recent datadir from another member of the set hastens it. This could be done using snapshots.
- To change the hostname of a secondary member, remove the old hostname from the replica set and add the new one.
- Alternatively, stop all members and reconfigure offline using the same datadirs.
Rollbacks: after a network partition, or when a secondary can't keep up with the primary, the primary may go down and a stale secondary gets elected primary. When the old primary rejoins the set (as a secondary), it must roll back the writes it accepted that were never replicated. Such a rollback will not happen if the write propagated to a healthy, reachable secondary, because that secondary can win the election.
Rebooting 2 secondaries simultaneously in a 3-member replica set forces the primary to step down, meaning it closes all sockets (Connection reset by peer) until one of the secondaries becomes available again.