Last active: April 24, 2023 14:25
# Sui Validator Recovery Options
## Background
There's been discussion about the best way to recover a Sui validator in the event of a primary server failure. The options discussed involve running a secondary server to reduce downtime due to resyncing, since the time required to resync from scratch is expected to cause intolerable downtime.
Discussion has focused on running a secondary server in parallel with the primary validator. An original assumption was that the database from the secondary server could be used to quickly resync the primary validator.
## Open Questions
However, a number of questions emerged about this assumption. Through the discussion, several possible scenarios emerged, along with a number of open questions.
## Situations
### Situation A - Validator can be stopped cleanly
This situation assumes a validator can be gracefully shut down, e.g. for planned maintenance.
### Situation B - Validator can't be stopped cleanly
This situation assumes a validator cannot be gracefully shut down, e.g. the validator goes offline unexpectedly.
## Scenarios
### Scenario 1 - Run secondary server as an RPC fullnode
In Scenario 1, the secondary server runs as an RPC fullnode in parallel with the primary validator. If the primary validator fails, there are two recovery paths: transfer the fullnode's database to the primary server to resync the primary validator, or transfer the validator keys to the RPC fullnode and restart it in validator mode using ``validator.toml``.
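The second recovery path can be sketched as a dry-run shell script. Everything here is an assumption for illustration (the ``sui-node`` service name, the key filename, and the backup paths); only ``validator.toml`` comes from the discussion above.

```shell
#!/usr/bin/env sh
# Dry-run sketch of the Scenario 1 failover: promote the RPC fullnode
# host to a validator. Every path, filename, and service name below is
# an assumption for illustration, not an official Sui procedure.
set -eu

SUI_HOME="${SUI_HOME:-/opt/sui}"           # assumed install prefix
KEY_BACKUP="${KEY_BACKUP:-/backup/keys}"   # assumed off-host key backup

scenario1_failover() {
  # Emit the commands instead of running them (dry run).
  echo "systemctl stop sui-node"                               # stop the fullnode
  echo "cp $KEY_BACKUP/protocol.key $SUI_HOME/protocol.key"    # restore validator keys
  echo "sui-node --config-path $SUI_HOME/validator.toml"       # restart in validator mode
}

scenario1_failover
```

Printing the plan first makes it easy to review the steps before wiring them into real service management.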
However, through analysis it became clear that the RPC fullnode database is different from the validator database. Most notably, the RPC fullnode database does not include the ``consensus_db`` and ``authorities_db`` databases that the validator requires.
As shown by [denysk | Stardust Staking](https://discord.com/channels/916379725201563759/1093849389778350110/1096090811793428500):
```
fullnode: | |
.......... | |
352K /opt/sui/db/suidb/live/store/epoch_611 | |
314G /opt/sui/db/suidb/live/store | |
29M /opt/sui/db/suidb/live/epochs | |
15G /opt/sui/db/suidb/live/checkpoints | |
435G /opt/sui/db/suidb/live | |
435G /opt/sui/db/suidb | |
435G /opt/sui/db/ | |
validator node: | |
54G /opt/sui/db/consensus_db/756 | |
74G /opt/sui/db/consensus_db/755 | |
128G /opt/sui/db/consensus_db | |
12G /opt/sui/db/authorities_db/live/store/epoch_755 | |
8.9G /opt/sui/db/authorities_db/live/store/epoch_756 | |
11G /opt/sui/db/authorities_db/live/store/epoch_754 | |
321G /opt/sui/db/authorities_db/live/store/perpetual | |
351G /opt/sui/db/authorities_db/live/store | |
31M /opt/sui/db/authorities_db/live/epochs | |
15G /opt/sui/db/authorities_db/live/checkpoints | |
366G /opt/sui/db/authorities_db/live | |
366G /opt/sui/db/authorities_db | |
493G /opt/sui/db/ | |
```
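The directory listings above suggest a simple way to tell the two database shapes apart. The sketch below is an assumption, not an official check; the directory names ``consensus_db`` and ``authorities_db`` are taken from the listing.

```shell
#!/usr/bin/env sh
# Classify a Sui db directory: the validator listing above carries
# consensus_db and authorities_db, while the RPC fullnode layout lacks both.
classify_db() {
  db_root="$1"
  if [ -d "$db_root/consensus_db" ] && [ -d "$db_root/authorities_db" ]; then
    echo "validator"
  else
    echo "fullnode"
  fi
}

# Example against mock directories:
demo=$(mktemp -d)
mkdir -p "$demo/validator/consensus_db" "$demo/validator/authorities_db"
mkdir -p "$demo/fullnode/suidb/live/store"
classify_db "$demo/validator"   # prints: validator
classify_db "$demo/fullnode"    # prints: fullnode
rm -rf "$demo"
```

A check like this makes it obvious before a failover that a fullnode database cannot simply be dropped onto a validator host.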
### Scenario 2 - Run a second validator
In Scenario 2, the secondary server runs as a second validator, which of course uses different keys than the primary validator. If the primary validator fails, the keys from the primary validator would be transferred to the secondary validator.
However, a point was raised that this approach causes unnecessary and possibly disruptive p2p traffic. It is also unclear whether the validator's consensus database is unique to each validator; if it is, the secondary validator's database could not be used by the primary validator.
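The Scenario 2 key swap can be sketched the same way, with the unresolved ``consensus_db`` question captured as a comment. As before, the service name, paths, and key filename are assumptions for illustration.

```shell
#!/usr/bin/env sh
# Dry-run sketch of Scenario 2: move the primary validator's keys onto
# the already-running secondary validator. Paths and service name are
# assumptions, not official Sui defaults.
set -eu

SUI_HOME="${SUI_HOME:-/opt/sui}"
KEY_BACKUP="${KEY_BACKUP:-/backup/keys}"

scenario2_keyswap() {
  echo "systemctl stop sui-node"                               # stop the secondary validator
  echo "cp $KEY_BACKUP/protocol.key $SUI_HOME/protocol.key"    # swap in the primary's keys
  # Open question: if consensus_db is unique per validator, it may need
  # to be wiped here before restarting -- unverified.
  echo "systemctl start sui-node"
}

scenario2_keyswap
```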
### Appendix
It appears that ``sui-node`` has a [built-in snapshot capability](https://docs.sui.io/build/snapshot). However, it's unclear if the snapshot capability is useful for validators or only fullnodes. The documentation states:
> "While validators can enable snapshots, they are typically most valuable for Full node operators. Snapshots of the Sui network enable Full node operators a way to bootstrap a Full node without having to execute all the transactions that occurred after genesis."
### Open Questions
- Which of these options is viable and preferred?
- If neither option is viable, e.g. if ``consensus_db`` is unique to each validator, what is the best risk mitigation plan available for validators to minimize downtime?
- Is the built-in snapshot capability useful for validators or only fullnodes?
### Contributors
- Bethose | SDL
- chris / chainflow
- denysk | Stardust Staking
- Ivan Merín Rodríguez