Practical Byzantine Fault Tolerance
Suppose you have N replicas, f of which might crash (non-Byzantine failure)
What quorum size Q do you need to guarantee liveness and safety?
- Liveness: (or pseudo-liveness, i.e., avoiding stuck states)
There must be a non-failed quorum (quorum availability)
Hence: Q <= N - f
- Safety: Any two quorums must intersect at one or more nodes
Otherwise, two quorums could independently accept operations, diverge
This property is often known as the quorum intersection property
Hence: 2Q - N > 0
So: N < 2Q <= 2(N - f)
Note highest possible f: N < 2N-2f; f < N/2
And if N = 2f + 1, smallest Q is 2Q > 2f + 1; Q = f + 1
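A quick sanity check of this arithmetic (a minimal Python sketch; the function name and sample values are mine, not part of any protocol):
    def min_crash_quorum(N, f):
        """Smallest Q with 2Q > N (quorum intersection), checked against
        Q <= N - f (a quorum of non-failed nodes must exist)."""
        Q = N // 2 + 1
        assert Q <= N - f, "need f < N/2"
        return Q

    # With N = 2f + 1 the bound is tight: Q = f + 1.
    assert min_crash_quorum(5, 2) == 3
    assert min_crash_quorum(7, 3) == 4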
Now say we throw in Byzantine failures. One way to view the problem...
Say you have N nodes, f of which might experience Byzantine failure.
First, how can Byzantine failures be worse than non-Byzantine?
Byzantine nodes can vote for both a statement and its contradiction
Make different statements to different nodes
Consequences
Risks driving non-failed nodes into divergent states
Risks driving non-failed nodes into "stuck states"
E.g., cause split vote on seemingly irrefutable statement
Paxos example: You think a majority aborted some ballot b with value v
You vote to commit ballot b' with value v' (where b' > b, v' != v)
Can't convince other nodes it is safe to vote for b'
What quorum size Q do we need in Byzantine setting?
- Liveness: Q <= N - f
As in non-Byzantine case, failed nodes might not reply
- Safety: Quorum intersection must contain at least one non-faulty node
Idea: out of any f+1 nodes, at most f can be faulty, so at least one is non-faulty
Hence: 2Q - N > f (since f could be malicious)
So: N + f < 2Q <= 2(N - f)
Highest f: N+f < 2N-2f; 3f < N; f < N/3
And if N = 3f + 1, the smallest Q is:
N + f < 2Q; 3f + 1 + f < 2Q; 2f + 1/2 < Q; Q_min = 2f + 1
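The same check for the Byzantine bound (again a sketch; helper name and sample values are mine):
    def min_bft_quorum(N, f):
        """Smallest Q with 2Q - N > f (any two quorums share a non-faulty node),
        checked against Q <= N - f (liveness)."""
        Q = (N + f) // 2 + 1
        assert Q <= N - f, "need f < N/3"
        return Q

    # With N = 3f + 1 the bound is tight: Q = 2f + 1.
    assert min_bft_quorum(4, 1) == 3
    assert min_bft_quorum(7, 2) == 5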
So how does PBFT protocol work?
Number replica cohorts 1, 2, 3, ..., 3f+1
Number requests with consecutive sequence numbers (not viewstamps)
System goes through a series of views
In view v, replica number v mod (3f+1) is designated the primary
Primary is responsible for selecting the order of operations
Assigns an increasing sequence number to each operation
In normal-case operation, use two-round protocol for request r:
Round 1 (pre-prepare, prepare) goal:
Ensure at least f+1 honest replicas agree that
If request r executes in view v, will execute with sequence no. n
Round 2 (commit) goal:
Ensure at least f+1 honest replicas agree that
Request r has executed in view v with sequence no. n
Protocol for normal-case operation
Let c be client
r_i be replica i, or p primary, b_i backup i
R set of all replicas
c -> p: m = {REQUEST, o, t, c}_Kc
p -> R: {PRE-PREPARE, v, n, d}_Kp, m (note d = H(m))
b_i -> R: {PREPARE, v, n, d, i}_K{r_i}
[Note all messages signed, so will omit signatures and use < > henceforth.]
replica r_i now waits for PRE-PREPARE + 2f matching PREPARE messages
puts these messages in its log
then we say prepared(m, v, n, i) is TRUE
Note: If prepared(m, v, n, i) is TRUE for honest replica r_i
then prepared(m', v, n, j) where m' != m is FALSE for any honest r_j
(any two sets of 2f+1 replicas share at least one honest replica, and an
honest replica sends at most one PRE-PREPARE/PREPARE for a given v and n)
So no other operation can execute with view v and sequence number n
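A rough sketch of the prepared check a replica might run over its message log (the tuple layout and field names are simplified stand-ins for the signed messages above):
    # Log entries: (type, view, seqno, digest, sender) collected by replica i.
    def prepared(log, digest, v, n, f):
        """prepared(m, v, n, i): a matching PRE-PREPARE plus 2f matching
        PREPAREs from distinct replicas, all for the same (v, n, digest)."""
        have_preprepare = any(t == "PRE-PREPARE" and (vw, sq, d) == (v, n, digest)
                              for (t, vw, sq, d, sender) in log)
        prepare_senders = {sender for (t, vw, sq, d, sender) in log
                           if t == "PREPARE" and (vw, sq, d) == (v, n, digest)}
        return have_preprepare and len(prepare_senders) >= 2 * f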
Are we done? Just reply to client? No
Just because some other m' won't execute at (v,n) doesn't mean m will
Suppose r_i is compromised right after prepared(m, v, n, i)
Suppose no other replica received r_i's prepare message
Suppose f replicas are slow and never even received the PRE-PREPARE
No other honest replica will know the request prepared!
Particularly if p fails, request might not get executed!
So we say operation doesn't execute until
prepared(m, v, n, i) is TRUE for f+1 non-faulty replicas r_i
We say committed(m, v, n) is TRUE when this property holds
So how does a replica know committed(m, v, n) holds?
Add one more message:
r_i -> R: <COMMIT, v, n, d, i> (sent only after prepared(m,v,n,i))
replica r_i waits for 2f+1 identical COMMIT messages (including its own)
committed-local(m, v, n, i) is TRUE when:
prepared(m, v, n, i) is TRUE, and
r_i has 2f+1 matching commits in its log
Note: If committed-local(m, v, n, i) is TRUE for any non-faulty r_i
Then means committed(m, v, n) is TRUE.
r_i knows when committed-local is TRUE
So committed-local is a replica's way of knowing that committed is TRUE
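Continuing the same sketch, committed-local just layers a count of matching COMMITs on top of the prepared check:
    def committed_local(log, digest, v, n, f, is_prepared):
        """committed-local(m, v, n, i): prepared(m, v, n, i) holds and the log
        has 2f+1 matching COMMITs from distinct replicas (own COMMIT included)."""
        commit_senders = {sender for (t, vw, sq, d, sender) in log
                          if t == "COMMIT" and (vw, sq, d) == (v, n, digest)}
        return is_prepared and len(commit_senders) >= 2 * f + 1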
r_i replies to client when committed-local(m, v, n, i) is TRUE
Client waits for f+1 matching replies, then returns the result to the application
Why f+1 and not 2f+1?
Because out of those f+1 replies, at least one is from a non-faulty replica r_i
So client knows committed-local(m, v, n, i)
Which in turn implies committed(m, v, n)
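A sketch of the client-side wait (simplified; replies are (replica_id, result) pairs and signature checking is assumed to have happened already):
    from collections import Counter

    def accept_result(replies, f):
        """Return a result once f+1 distinct replicas report the same value,
        else None (keep waiting, or retransmit the request)."""
        counts, seen = Counter(), set()
        for replica_id, result in replies:
            if replica_id in seen:
                continue              # count each replica at most once
            seen.add(replica_id)
            counts[result] += 1
            if counts[result] >= f + 1:
                return result
        return None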
Note tentative reply optimization:
r_i can send tentative reply to client after prepared(m, v, n, i)
Client can accept result after 2f+1 matching tentative replies. Why?
f+1 of those replies must be from honest nodes
And at least 1 of those f+1 will be among the 2f+1 replicas whose VIEW-CHANGE messages form any new view
So that 1 node will make sure the operation carries over into the new view
Garbage collecting the message log
make periodic checkpoints
Broadcast <CHECKPOINT, n, d, i>, where d = digest of state
When 2f+1 matching signed CHECKPOINTs received, the checkpoint is stable
Restrict accepted sequence numbers to be between h and H
h = sequence number of last stable checkpoint
H = h + k (e.g., k might be 2 * checkpoint interval of 100)
delete all messages below sequence number of stable checkpoint
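A sketch of the watermark bookkeeping (the dict layout and the value of k are illustrative; log entries reuse the tuple layout from the earlier sketch):
    def on_stable_checkpoint(state, n_stable, k=200):
        """After 2f+1 matching CHECKPOINTs for n_stable, advance the watermarks
        and drop all logged messages at or below the stable point."""
        state["h"] = n_stable                 # low watermark = last stable checkpoint
        state["H"] = n_stable + k             # high watermark, e.g. k = 2 * checkpoint interval
        state["log"] = [e for e in state["log"] if e[2] > n_stable]   # e[2] = seqno

    def in_watermarks(state, n):
        return state["h"] < n <= state["H"]   # only accept seqnos within the window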
View changes
When the client doesn't get an answer, it broadcasts the request to all replicas
If a backup notices primary is slow/unresponsive:
- broadcast <VIEW-CHANGE, v+1, n, C, P, i>
C is 2f+1 signed checkpoint messages for last stable checkpoint
P = {P_m} where each P_m is signed PRE-PREPARE + 2f signed PREPARES
i.e., P is set of all PREPAREd messages since checkpoint
+ proof that the messages really are prepared
When primary of view v+1 sees 2f signed VIEW-CHANGE messages from others
- New primary broadcasts <NEW-VIEW, v+1, V, O>
V is the set of at least 2f+1 VIEW-CHANGE messages (including the new primary's own)
O is a set of pre-prepare messages, for operations that are:
- after last stable checkpoint
- appear in the set P of one of the VIEW-CHANGE messages
O also contains dummy messages to fill in sequence number gaps
Replicas may obtain any missing state from each other
(e.g., stable checkpoint data, or missing operation, since
reissued pre-prepare messages only contain digest of request)
What happens if primary creates incorrect O in NEW-VIEW message?
E.g., might send null requests in place of operations that actually prepared
Other replicas can compute O from V, and can reject NEW-VIEW message
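A sketch of how O could be computed from V (any replica can redo this to check the primary; the field names and the None-as-null-request convention are my own shorthand):
    def compute_O(V):
        """Reissue a PRE-PREPARE for every sequence number that prepared after
        the last stable checkpoint reported in V, and a null request for gaps."""
        h = max(vc["n"] for vc in V)          # latest stable checkpoint in V
        prepared = {}                         # seqno -> request digest
        for vc in V:
            for (n, digest) in vc["P"]:       # P: prepared requests with proofs
                if n > h:
                    prepared[n] = digest
        max_n = max(prepared, default=h)
        return [(n, prepared.get(n))          # None stands for a dummy/null request
                for n in range(h + 1, max_n + 1)]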
What happens if primary sends different V's to different backups?
Still okay: a committed operation has prepared at f+1 or more non-faulty replicas,
and any set of 2f+1 VIEW-CHANGE messages must include one from such a replica,
so at least one message in V will carry the operation in its P set
So new primary cannot cause committed operations to be dropped
Only operations for which the client has not yet seen an answer can be lost
Discussion
what problem does BFS solve?
- is IS going to run BFS to deal with Byzantine failures?
- what failures are we talking about?
compromised servers - what about compromised clients?
authentication and authorization
how can we extend the system to allow for more than (n-1)/3
failures over its lifetime?
- deal with failed replicas using proactive recovery
- recover the system periodically, no matter what
- makes bad nodes good again
- tricky stuff
- an attacker might steal a compromised replica's keys
- with how many replicas will BFS work reasonably well?