Practical Byzantine Fault Tolerance
Suppose you have N replicas, f of which might crash (non-Byzantine failure)
What quorum size Q do you need to guarantee liveness and safety?
- Liveness: (or pseudo-liveness, i.e., avoiding stuck states)
There must be a non-failed quorum (quorum availability)
Hence: Q <= N - f
- Safety: Any two quorums must intersect at one or more nodes
Otherwise, two quorums could independently accept operations, diverge
This property is often known as the quorum intersection property
Hence: 2Q - N > 0
So: N < 2Q <= 2(N - f)
Note highest possible f: N < 2N-2f; f < N/2
And if N = 2f + 1, smallest Q is 2Q > 2f + 1; Q = f + 1
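A quick sanity check of this arithmetic (a minimal Python sketch; the function name and sample values are mine, not part of any protocol):
    def min_crash_quorum(N, f):
        """Smallest Q with 2Q > N (quorum intersection), checked against
        Q <= N - f (a quorum of non-failed nodes must exist)."""
        Q = N // 2 + 1
        assert Q <= N - f, "need f < N/2"
        return Q

    # With N = 2f + 1 the bound is tight: Q = f + 1.
    assert min_crash_quorum(5, 2) == 3
    assert min_crash_quorum(7, 3) == 4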
Now say we throw in Byzantine failures. One way to view the problem...
Say you have N nodes, f of which might experience Byzantine failure.
First, how can Byzantine failures be worse than non-Byzantine?
Byzantine nodes can vote for both a statement and its contradiction
Make different statements to different nodes
Consequences
Risks driving non-failed nodes into divergent states
Risks driving non-failed nodes into "stuck states"
E.g., cause split vote on seemingly irrefutable statement
Paxos example: You think a majority aborted some ballot b with value v
You vote to commit ballot b' with value v' (where b' > b, v' != v)
Can't convince other nodes it is safe to vote for b'
What quorum size Q do we need in Byzantine setting?
- Liveness: Q <= N - f
As in non-Byzantine case, failed nodes might not reply
- Safety: Quorum intersection must contain at least one non-faulty node
Idea: out of any f+1 nodes, at most f can be faulty, so at least one is non-faulty
Hence: 2Q - N > f (since f could be malicious)
So: N + f < 2Q <= 2(N - f)
Highest f: N+f < 2N-2f; 3f < N; f < N/3
And if N = 3f + 1, the smallest Q is:
N + f < 2Q; 3f + 1 + f < 2Q; 2f + 1/2 < Q; Q_min = 2f + 1
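The same check for the Byzantine bound (again a sketch; helper name and sample values are mine):
    def min_bft_quorum(N, f):
        """Smallest Q with 2Q - N > f (any two quorums share a non-faulty node),
        checked against Q <= N - f (liveness)."""
        Q = (N + f) // 2 + 1
        assert Q <= N - f, "need f < N/3"
        return Q

    # With N = 3f + 1 the bound is tight: Q = 2f + 1.
    assert min_bft_quorum(4, 1) == 3
    assert min_bft_quorum(7, 2) == 5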
So how does PBFT protocol work?
Number replica cohorts 1, 2, 3, ..., 3f+1
Number requests with consecutive sequence numbers (not viewstamps)
System goes through a series of views
In view v, replica number v mod (3f+1) is designated the primary
Primary is responsible for selecting the order of operations
Assigns an increasing sequence number to each operation
In normal-case operation, use two-round protocol for request r:
Round 1 (pre-prepare, prepare) goal:
Ensure at least f+1 honest replicas agree that
If request r executes in view v, will execute with sequence no. n
Round 2 (commit) goal:
Ensure at least f+1 honest replicas agree that
Request r has executed in view v with sequence no. n
Protocol for normal-case operation
Let c be client
r_i be replica i, or p primary, b_i backup i
R set of all replicas
c -> p: m = {REQUEST, o, t, c}_Kc
p -> R: {PRE-PREPARE, v, n, d}_Kp, m (note d = H(m))
b_i -> R: {PREPARE, v, n, d, i}_K{r_i}
[Note all messages signed, so will omit signatures and use < > henceforth.]
replica r_i now waits for PRE-PREPARE + 2f matching PREPARE messages
puts these messages in its log
then we say prepared(m, v, n, i) is TRUE
Note: If prepared(m, v, n, i) is TRUE for honest replica r_i
then prepared(m', v, n, j) where m' != m is FALSE for any honest r_j
(any two sets of 2f+1 replicas share at least one honest replica, and an
honest replica sends at most one PRE-PREPARE/PREPARE for a given v and n)
So no other operation can execute with view v and sequence number n
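A rough sketch of the prepared check a replica might run over its message log (the tuple layout and field names are simplified stand-ins for the signed messages above):
    # Log entries: (type, view, seqno, digest, sender) collected by replica i.
    def prepared(log, digest, v, n, f):
        """prepared(m, v, n, i): a matching PRE-PREPARE plus 2f matching
        PREPAREs from distinct replicas, all for the same (v, n, digest)."""
        have_preprepare = any(t == "PRE-PREPARE" and (vw, sq, d) == (v, n, digest)
                              for (t, vw, sq, d, sender) in log)
        prepare_senders = {sender for (t, vw, sq, d, sender) in log
                           if t == "PREPARE" and (vw, sq, d) == (v, n, digest)}
        return have_preprepare and len(prepare_senders) >= 2 * f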
Are we done? Just reply to client? No
Just because some other m' won't execute at (v,n) doesn't mean m will
Suppose r_i is compromised right after prepared(m, v, n, i)
Suppose no other replica received r_i's prepare message
Suppose f replicas are slow and never even received the PRE-PREPARE
No other honest replica will know the request prepared!
Particularly if p fails, request might not get executed!
So we say operation doesn't execute until
prepared(m, v, n, i) is TRUE for f+1 non-faulty replicas r_i
We say committed(m, v, n) is TRUE when this property holds
So how does a replica know committed(m, v, n) holds?
Add one more message:
r_i -> R: <COMMIT, v, n, d, i> (sent only after prepared(m,v,n,i))
replica r_i waits for 2f+1 identical COMMIT messages (including its own)
committed-local(m, v, n, i) is TRUE when:
prepared(m, v, n, i) is TRUE, and
r_i has 2f+1 matching commits in its log
Note: If committed-local(m, v, n, i) is TRUE for any non-faulty r_i
Then means committed(m, v, n) is TRUE.
r_i knows when committed-local is TRUE
So committed-local is a replica's way of knowing that committed is TRUE
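Continuing the same sketch, committed-local just layers a count of matching COMMITs on top of the prepared check:
    def committed_local(log, digest, v, n, f, is_prepared):
        """committed-local(m, v, n, i): prepared(m, v, n, i) holds and the log
        has 2f+1 matching COMMITs from distinct replicas (own COMMIT included)."""
        commit_senders = {sender for (t, vw, sq, d, sender) in log
                          if t == "COMMIT" and (vw, sq, d) == (v, n, digest)}
        return is_prepared and len(commit_senders) >= 2 * f + 1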
r_i replies to client when committed-local(m, v, n, i) is TRUE
Client waits for f+1 matching replies, then returns the result to the application
Why f+1 and not 2f+1?
Because out of those f+1 replies, at least one is from a non-faulty replica r_i
So client knows committed-local(m, v, n, i)
Which in turn implies committed(m, v, n)
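A sketch of the client-side wait (simplified; replies are (replica_id, result) pairs and signature checking is assumed to have happened already):
    from collections import Counter

    def accept_result(replies, f):
        """Return a result once f+1 distinct replicas report the same value,
        else None (keep waiting, or retransmit the request)."""
        counts, seen = Counter(), set()
        for replica_id, result in replies:
            if replica_id in seen:
                continue              # count each replica at most once
            seen.add(replica_id)
            counts[result] += 1
            if counts[result] >= f + 1:
                return result
        return None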
Note tentative reply optimization:
r_i can send tentative reply to client after prepared(m, v, n, i)
Client can accept result after 2f+1 matching tentative replies. Why?
f+1 of those replies must be from honest nodes
And at least 1 of those f+1 will be among the 2f+1 replicas whose VIEW-CHANGE messages form any new view
So that 1 node will make sure the operation carries over into the new view
Garbage collecting the message log
make periodic checkpoints
Broadcast <CHECKPOINT, n, d, i>, where d = digest of state
When 2f+1 matching signed CHECKPOINTs received, the checkpoint is stable
Restrict accepted sequence numbers to be between h and H
h = sequence number of last stable checkpoint
H = h + k (e.g., k might be 2 * checkpoint interval of 100)
delete all messages below sequence number of stable checkpoint
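A sketch of the watermark bookkeeping (the dict layout and the value of k are illustrative; log entries reuse the tuple layout from the earlier sketch):
    def on_stable_checkpoint(state, n_stable, k=200):
        """After 2f+1 matching CHECKPOINTs for n_stable, advance the watermarks
        and drop all logged messages at or below the stable point."""
        state["h"] = n_stable                 # low watermark = last stable checkpoint
        state["H"] = n_stable + k             # high watermark, e.g. k = 2 * checkpoint interval
        state["log"] = [e for e in state["log"] if e[2] > n_stable]   # e[2] = seqno

    def in_watermarks(state, n):
        return state["h"] < n <= state["H"]   # only accept seqnos within the window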
View changes
When the client doesn't get an answer, it broadcasts the request to all replicas
If a backup notices primary is slow/unresponsive:
- broadcast <VIEW-CHANGE, v+1, n, C, P, i>
C is 2f+1 signed checkpoint messages for last stable checkpoint
P = {P_m} where each P_m is signed PRE-PREPARE + 2f signed PREPARES
i.e., P is set of all PREPAREd messages since checkpoint
+ proof that the messages really are prepared
When primary of view v+1 sees 2f signed VIEW-CHANGE messages from others
- New primary broadcasts <NEW-VIEW, v+1, V, O>
V is the set of at least 2f+1 VIEW-CHANGE messages (including the new primary's own)
O is a set of pre-prepare messages, for operations that are:
- after last stable checkpoint
- appear in the set P of one of the VIEW-CHANGE messages
O also contains dummy messages to fill in sequence number gaps
Replicas may obtain any missing state from each other
(e.g., stable checkpoint data, or missing operation, since
reissued pre-prepare messages only contain digest of request)
What happens if primary creates incorrect O in NEW-VIEW message?
E.g., might send null requests in place of operations that actually prepared
Other replicas can compute O from V, and can reject NEW-VIEW message
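A sketch of how O could be computed from V (any replica can redo this to check the primary; the field names and the None-as-null-request convention are my own shorthand):
    def compute_O(V):
        """Reissue a PRE-PREPARE for every sequence number that prepared after
        the last stable checkpoint reported in V, and a null request for gaps."""
        h = max(vc["n"] for vc in V)          # latest stable checkpoint in V
        prepared = {}                         # seqno -> request digest
        for vc in V:
            for (n, digest) in vc["P"]:       # P: prepared requests with proofs
                if n > h:
                    prepared[n] = digest
        max_n = max(prepared, default=h)
        return [(n, prepared.get(n))          # None stands for a dummy/null request
                for n in range(h + 1, max_n + 1)]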
What happens if primary sends different V's to different backups?
Still okay: a committed operation has prepared at f+1 or more non-faulty replicas,
and any set of 2f+1 VIEW-CHANGE messages must include one from such a replica,
so at least one message in V will carry the operation in its P set
So new primary cannot cause committed operations to be dropped
Only operations for which the client has not yet seen an answer can be lost
Discussion
what problem does BFS solve?
- is IS going to run BFS to deal with Byzantine failures?
- what failures are we talking about?
compromised servers - what about compromised clients?
authentication and authorization
how can we extend the system to allow for more than (n-1)/3
failures over its lifetime?
- deal with failed replicas using proactive recovery
- recover the system periodically, no matter what
- makes bad nodes good again
- tricky stuff
- an attacker might steal a compromised replica's keys
- with how many replicas will BFS work reasonably well?