We were discussing the design of an HA environment for Puppet, and what the potential impacts were of a "split-brain" situation. For the model that I was discussing...
...the "worst case" scenario is probably that the two Postgres databases become out-of-sync. This would require a network interruption that would cause:
- the two front-end Puppet masters to lose contact with each other, causing both to assume the master role, with the result that pgpool would be running on both nodes and each puppetdb process would be talking to its local pgpool instance.
- pgpool on one of the nodes to lose contact with the current master database server, causing it to erroneously promote the slave database.
- some puppet clients in the environment to see one master and others to see the other master.
At this point, the replication link between the two databases would be broken, but the data itself would not be corrupt. Assuming that each pgpool instance was now pointing at a different backend database, the content of the two databases could begin to diverge. This seems an unlikely failure mode, and one that is easy to detect (a simple SQL statement can be used to determine whether a database is currently acting as a master or a slave, so it would be easy to create an alert for this situation). Given the nature of the data (primarily collected facts and exported resources), the net impact of this would be very low (possibly some nodes would see stale exported resources), and our recovery would simply be to declare the data on one of the backends invalid and place that backend back into slave mode. Multiple communication channels between the front-end nodes would almost completely eliminate this failure.
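As a rough illustration of the detection step, here is a minimal Python sketch of such a check. It assumes PostgreSQL streaming replication, where `SELECT pg_is_in_recovery()` returns true on a slave and false on a master. The helper names (`is_standby`, `find_masters`, `split_brain`) and the host names in the usage note are hypothetical, and the connection is assumed to be any Python DB-API connection (for example, one opened with psycopg2). This is not necessarily the check used in the environment described above.

```python
from typing import Dict, List

# pg_is_in_recovery() is true on a streaming-replication slave (standby)
# and false on a master.
ROLE_QUERY = "SELECT pg_is_in_recovery()"


def is_standby(conn) -> bool:
    """Return True if the server behind this DB-API connection is a slave."""
    cur = conn.cursor()
    try:
        cur.execute(ROLE_QUERY)
        return bool(cur.fetchone()[0])
    finally:
        cur.close()


def find_masters(standby_by_host: Dict[str, bool]) -> List[str]:
    """Return the sorted names of all hosts currently acting as master."""
    return sorted(host for host, standby in standby_by_host.items() if not standby)


def split_brain(standby_by_host: Dict[str, bool]) -> bool:
    """More than one master at the same time means the databases can diverge."""
    return len(find_masters(standby_by_host)) > 1
```

A monitoring job could call `is_standby` against each backend, collect the results into a dictionary such as `{"db1": False, "db2": True}`, and raise an alert whenever `split_brain` returns true. Keeping the decision logic separate from the query makes it straightforward to test without a live database.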