TL;DR Back up the cluster-wide configuration virtual filesystem in a safe manner and plan disaster recovery for the case of a corrupt database, a situation more common than anticipated.
ORIGINAL POST Backup Cluster configuration - /etc/pve
A no-nonsense way to safely back up your /etc/pve files (pmxcfs) is actually very simple:
sqlite3 /var/lib/pve-cluster/config.db .dump > ~/config.dump.$(date --utc +%Z%Y%m%d%H%M%S).sql
This is safe to execute on a running node and only needs to run on any single node of the cluster; the result (at a specific point in time) will be exactly the same on every node.
Obviously, it makes more sense to save this somewhere other than the home directory ~, especially if you have dependable shared storage off the cluster. Ideally, you want a systemd timer, cron job or a hook in your other favourite backup method launching this.
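As a minimal sketch of what such a timer or cron job could launch, the function below wraps the one-liner above. The function name, the destination directory and the retention count are all hypothetical. The check that a dump ends with COMMIT; relies on my understanding that the sqlite3 shell closes a failed dump with a ROLLBACK line instead, so treat it as an assumption.

```shell
# Dump the pmxcfs database to a timestamped file and prune old dumps.
# All three arguments have illustrative defaults; adjust them to your setup.
backup_pmxcfs() {
    db="${1:-/var/lib/pve-cluster/config.db}"
    dest="${2:-/mnt/backup/pmxcfs}"   # hypothetical off-cluster mount
    keep="${3:-14}"                   # number of dumps to retain
    out="$dest/config.dump.$(date --utc +%Z%Y%m%d%H%M%S).sql"

    [ -r "$db" ] || return 1
    mkdir -p "$dest" || return 1

    # Write under a temporary name so a failed dump never looks complete.
    if sqlite3 "$db" .dump > "$out.tmp" \
        && [ "$(tail -n 1 "$out.tmp")" = "COMMIT;" ]; then
        mv "$out.tmp" "$out"
    else
        rm -f "$out.tmp"
        return 1
    fi

    # File names sort chronologically, so drop everything past the newest $keep.
    ls -1 "$dest"/config.dump.*.sql | sort -r \
        | tail -n +"$((keep + 1))" | xargs -r rm -f
}
```

Saved in a script that calls backup_pmxcfs, this could be launched from whichever scheduler you prefer. A non-zero exit status lets the scheduler report a failed run.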
You will ideally never need to recover from this backup. If a single node’s config database is corrupt, you are best off copying /var/lib/pve-cluster/config.db (while inactive) over from a healthy node and letting the implantee catch up with the cluster.
However, failing everything else, you will want to stop the cluster service, put aside the (possibly) corrupt database and get the last good state back:
systemctl stop pve-cluster
killall pmxcfs
mv /var/lib/pve-cluster/config.db{,.corrupt}
sqlite3 /var/lib/pve-cluster/config.db < ~/config.dump.<timestamp>.sql
systemctl start pve-cluster
NOTE Any leftover WAL will be ignored.
If you already have a corrupt .db file at hand (and nothing better), you may try your luck with .recover.
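A rough sketch of such a recovery attempt follows. It assumes a sqlite3 shell recent enough to offer .recover (3.29.0 or later, as far as I know). The function name and both file names are placeholders. The damaged file is only read, and the salvaged content goes into a new database.

```shell
# Salvage whatever is still readable from a damaged database into a new file.
recover_db() {
    src="$1"    # e.g. the config.db.corrupt that was put aside
    dst="$2"    # new database file, must not exist yet
    [ ! -e "$dst" ] || return 1
    # .recover emits SQL much like .dump does; replay it into the new file.
    sqlite3 "$src" .recover | sqlite3 "$dst"
}
```

Whatever comes out should be treated with suspicion: compare it against what you expect to find in /etc/pve before putting it in place of the real database.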
> TIP
> There’s a dedicated post on the topic of extracting only selected files.
The .dump command reads the database as if with a SELECT statement within a single transaction. It will block concurrent writes, but once it finishes, you have a “snapshot”. The result is a perfectly valid SQL set of commands to recreate your database.
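Since the dump is plain SQL, one way to gain some confidence in it is to replay it into a throwaway database and inspect the result, without touching the live config.db. The sketch below does that. The function name is made up for illustration, and the table count is only a crude guard against an empty dump.

```shell
# Replay a dump into a scratch database and run basic sanity checks on it.
verify_dump() {
    dump="$1"
    scratch="$(mktemp -d)" || return 1

    # Recreate the database from the SQL commands in the dump.
    if ! sqlite3 "$scratch/check.db" < "$dump"; then
        rm -rf "$scratch"
        return 1
    fi

    # integrity_check prints the single word "ok" for a sound database.
    result="$(sqlite3 "$scratch/check.db" 'PRAGMA integrity_check;')"
    # An empty dump would also pass the check above, so count the tables too.
    tables="$(sqlite3 "$scratch/check.db" 'SELECT count(*) FROM sqlite_master;')"
    rm -rf "$scratch"

    [ "$result" = "ok" ] && [ "${tables:-0}" -gt 0 ]
}
```

Called as verify_dump ~/config.dump.<timestamp>.sql, it returns zero only if the dump restores cleanly.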
There’s an alternative .save command (equivalent to .backup). It would produce a valid copy of the actual .db file, and while it is non-blocking, it copies the database page by page; if pages get dirty in the process, the copy needs to start over. The attempt could fail with Error: database is locked. If you insist on this method, you may need to set .timeout <milliseconds> first to get more luck with it.
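For completeness, a sketch of that variant. The sqlite3 shell runs dot-commands given as separate arguments in order, so the timeout is set before the copy starts. The function name and the 5000 ms default are arbitrary choices for illustration.

```shell
# Copy a live database file with .backup, waiting on locks instead of failing.
online_backup() {
    src="$1"
    dst="$2"
    wait_ms="${3:-5000}"   # how long to keep retrying a locked database
    sqlite3 "$src" ".timeout $wait_ms" ".backup '$dst'"
}
```

Unlike the .dump output, the result is a binary database file that can be opened directly with sqlite3.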
Another option yet would be to use the VACUUM command with an INTO clause, but it does not fsync the result on its own!
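If you go this way, the missing flush can be added by hand. The sketch below assumes SQLite 3.27 or later for VACUUM INTO and a GNU coreutils sync that accepts file arguments. The function name is hypothetical.

```shell
# Write a compacted copy with VACUUM INTO, then force it onto stable storage.
vacuum_copy() {
    src="$1"
    dst="$2"    # target must not exist yet (or be an empty file)
    sqlite3 "$src" "VACUUM INTO '$dst';" || return 1
    # Flush the new file and the directory holding it.
    sync "$dst" "$(dirname "$dst")"
}
```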