New and Simple Consensus Algorithms for ThunderCore’s MainNet

Speaker: Elaine Shi

Synchronous, with a change of partition tolerance
State machine replication (e.g., blockchain consensus)
- Requires safety (nothing bad ever happens), and liveness (eventually something good happens)
Thunderella
- If you don't know Thunderella, go look it up (check out the BPASE '18 talk)
Wait, there's a flaw in Thunderella?!
- A confirmed transaction can be undone, even in a somewhat benign setting
- wtf? they wrote 74 pages of proofs, so how can a provably correct protocol be flawed?
  - The proof is fine... the problem is the synchronous network model
Thunderella gives two properties:
- With honest majority, you get consistency
- With 3/4 honest and an honest leader/accelerator, you get liveness & good performance
- Can also get liveness with an honest majority by...
  - Falling back to a slow chain (blockchain-based) whenever the fast chain halts
How does fallback happen?
- Committee on fast path posts periodic heartbeats to the slow chain
  - Posts notarized (3/4 signed) hashes of fast path log
  - Both serves as heartbeat & checkpoint
- If long number of blocks without heartbeat (k blocks), then all agree to fallback to slow chain
  - However, not everyone agrees on the number of transactions before the fast chain failed!
  - Everyone post notarized but UNCHECKPOINTED transactions onto the slowchain to achieve consensus on where the fastchain halted
    - Basically, everyone gets together and shares their notes on what was confirmed on the fast chain
  - Now you can achieve both consistency and liveness—you get optimistic liveness!
- Thunderella two sentences: single round of voting when things are good, slow blockchain for fallback when things are bad
So what's the flaw??
- Leader makes a proposal for a block
- Everyone signs, leader collects the signatures and produces a notarized version of the block
- Leader starts broadcasting the notarization to one committee member
  - But then leader crashes and goes offline before broadcasting to anyone else!
- Committee member considers the block finalized and for example, pays out an end user
- But that committee member then ALSO goes offline
- So the only people who knew/saw the notarized block are now offline
- The rest of the committee members go back to the slow chain, but because no one has those transactions...
  - They continue moving forward as though the transactions never happened
  - Thus, the transactions are REVERTED despite having been earlier confirmed.
- This is bad! This should not happen!
Why is such a flaw possible in a provably correct protocol?
- Classical synchronous models permit flawed protocols
What is the synchronous model?
- We assume that when honest nodes send messages, they arrive in a bounded amount of time
  - (And the protocol is aware of this bound)
- Message delay = 1 round
- In the synchronous model, having a temporary outage means you're faulty
  - In a synchronous model, a "faulty" node gets no guarantees about correctness!
- In the partially synchronous model, the message delay can be arbitrarily long
  - Having a temporary outage is isomorphic to having a long message delay
- We say that partially synchronous models are arbitrarily partition-tolerant
- But ANY partially synchronous model cannot tolerate more than 1/3 corruptions
- Can we achieve the best of both worlds?
  - We want to tolerate 1/2 corruptions (which we can do in the synchronous model, a la Nakamoto consensus)
  - We want to let temporary outages = long message delays
- The answer should be no, given classical consensus protocols...
What if we are more subtle in how we model how "synchronous" a network is?
- Imagine we have some nodes that are honest and online
- But other nodes that are honest, with unstable connections
- Then we have corrupt nodes who we don't care about
- We want to achieve consensus among the first two groups: is this possible?
  - Turns out, you cannot do this unless the honest and online group is LARGE
    - greater than or equal to 1/2 of all nodes
    - This makes some synchrony assumptions about the network, since you assume at least half of the nodes have fast connectivity
      - But a minority of nodes can be unstable
      - "Best-possible partition tolerance"
  - But it's commonly assumed that these honest and online nodes are persistent...
    - But Bitcoin has been around for 10 years—those nodes churn over long enough timespans
  - More realistically: every node is offline at some point!
- You just need at least 1/2 nodes at all times to be continuously online from round to round
How to fix it? — Making Thunderella Best-Possible Partition Tolerant
- For Fast Path:
  - Before, you confirmed a transaction whenever you saw a notarized block
    - But you don't know that you aren't the first person to see this and then go offline
  - So instead, you only confirm transactions in notarized blocks ONE BLOCK BACK
    - This ensures that all notaries must have seen the previous block
    - And thus should be able to write it to the slow chain if you go offline
New theoretical and practical results around "best-possible partition tolerance"
- Can be considered a practical network model
  - Guarantees liveness only during "periods of synchrony"
- Some philosophical observations:
  - This class of protocols with best-possible partition tolerance is a subset of "Synchronous honest majority protocols"
  - None of the classical synchronous honest majority protocols are best-possible partition tolerant
- Classical corrupt majority protocols cannot be best-partition tolerant (e.g., Dolev Strong)
  - Classical synchronous model is a mismatch for what we need in real-world deployments
Thunder is launching in Q1 2019
- Whoo

Haseeb-Qureshi/thunder_mainnet.md

New and Simple Consensus Algorithms for ThunderCore’s MainNet

Speaker: Elaine Shi