Partially Ordered Chains
Introduction
There is a tension between each node’s subjective view of the network (who is connected to me at the moment?) and the “objective” eventual agreement the network needs to reach about its own structure (and, later, data). This is an attempt to get from the former to the latter without a multi-round consensus algorithm, by instead voting on state transitions and accepting that during churn, more than one state could be valid at the same time. That means that even without splits and merges, a block can have more than one predecessor or successor, and the chain segment for any given address is not necessarily totally ordered.
Specifically, votes are signed edges in a directed graph, whose vertices, or blocks, are section states: prefixes and member lists of particular sections. An edge accumulates once a majority of its source’s members have voted for it, and it then makes its destination valid, too, given that the source was valid.
Another kind of edge points from every state to its potential successors. The valid blocks from which no such edge points to another valid block represent the current state of the network, and whenever a node observes a new state, it signs votes pointing from current states to the observed one.
Votes and blocks
The core of this algorithm is essentially a function that assigns to a set (unordered!) of Vote
s a set of valid and a subset of current Block
s that represent prefixes and member lists that were valid at some point in the past, and prefixes and member lists that are considered to be currently valid under the assumption that no Vote
s are missing. In particular, adding new Vote
s to the set will only cause more blocks to become valid.
/// A record of the state of a section at some point in time.
struct Block {
/// The section is responsible for all addresses that match this prefix.
prefix: Prefix,
/// If the same prefix and members occur twice, the version number
/// will distinguish the blocks.
version: u64,
/// Section members and their vote weights.
members: Map<Name, u64>,
}
/// A testimonial by the `signatory`: It votes as part of the block `from`
/// for the block `to`.
struct Vote {
/// The block that the `signatory` is confirming.
to: Block,
/// A block that was valid at the time of this vote.
from: Block,
/// The signature of the above.
signature: Signature,
/// The name of the signing node.
signatory: Name,
}
A vote
can be understood as a testimonial by a member of vote.from
that vote.to
has been observed. If a quorum of from
agree, these signed votes are a proof that to
is valid.
A quorum with respect to members: Map<Name, u64>
is a set q: Set<Name>
such that
q.len() * 2 > members.len() &&
members.iter().map(|(m, w)| if q.contains(m) { w } else { -w }).sum() > 0
i.e. q
has a simple majority both in terms of vote weights and node count.
A set of votes: Set<Vote>
has a quorum with respect to members
, if the set of signatories does.
A block b1
is admissible after block b0
if b1.version > b0.version
and
- either (the split case)
b0.prefix == b1.prefix.popped()
andb1.members
contains exactly those entries ofb0.members
that matchb1.prefix
,
- or (the merge case)
b0.prefix.popped() == b1.prefix
andb0.members
contains exactly those entries ofb1.members
that matchb0.prefix
,
- or (the add and remove cases)
b0.prefix == b1.prefix
andb0.members
andb1.members
differ by exactly one entry.
A block b0
witnesses block b1
with votes votes: Set<Vote>
if all of the following conditions are satisfied:
- for every
v
invotes
,v.from == b0 && v.to == b1
- either:
b1
is admissible afterb0
, orb1.prefix
andb0.prefix
are neighbours
- the
votes
have a quorum with respect tob0.members
, or, in the case thatb1
removes a node fromb0
, thevotes
have a quorum with respect tob1.members
From the perspective of a single node, we say that block b0
witnesses block b1
if there is such a set of votes received by that node.
By requiring a quorum of b1.members
when b1
is a remove block, it is slightly more likely that node removals accumulate during periods of rapid node loss. For example, if there are 10 nodes in b0.members
, 6 would ordinarily be required for a majority quorum, whereas b1.members
will contain 9 nodes, of which only 5 are required for a majority quorum.
Valid and current blocks
Given a set of initial known-valid blocks, e.g. the genesis block or a set of blocks from another trusted node, we use the received votes to validate blocks:
The set of valid blocks is the smallest superset of the initial blocks such that every block witnessed by a valid block is also valid.
A set of blocks C
buries a block b
if b.prefix
is covered by the prefixes in C
with a higher version, i.e.: for every name n
with b.prefix.matches(n)
, there is a c
in C
with c.version > b.version && c.prefix.matches(n)
.
A prefix p
is an ancestor of a prefix q
if p
is compatible with and shorter than q
. Then q
is a descendant of p
.
A block is a candidate if it is not buried by any valid block. A block b
is current if it is a candidate and there is no other candidate c
such that:
- either
c.prefix
is an ancestor ofb.prefix
, - or
c.prefix == b.prefix
andc.members.len() > b.members.len()
- or
c.prefix == b.prefix
andc.members.len() == b.members.len()
andc.members > b.members
.
That last inequality is lexicographic comparison of the iterators but it does not really matter: It is just a tie-breaker to ensure that there is always only one current block per section. In fact, if the valid blocks cover the namespace (e.g. if there is a genesis block with the empty prefix), it is guaranteed that for every name n
, there is always exactly one current block whose prefix matches n
: The blocks with the highest version whose prefix matches n
are guaranteed to be candidates. The candidates with the shortest prefix matching n
are not ruled out by point 1. Among those, the ones with the most members are not ruled out by point 2. And among those, the one with the lexicographically greatest member list is current.
The set valid_blocks: Set<Block>
can be computed by recursively adding every block b1
to it for which there exists a block b0
in valid_blocks
and b0
witnesses b1
with some subset votes
of the received votes.
What to vote for
In general, a node votes for blocks that are admissible after current ones, and that are closer to what the node itself is observing. E.g. if it lost a connection, it will remove that node from current blocks and vote for the results.
At the same time, it votes for neighbouring blocks that it has already verified as valid: the horizontal witnessing of neighbours is complementary to the vertical witnessing of admissible successors, so that together they define a web of trust linking the blocks of the whole network together in a way that can be used both for proving facts about the network’s history and validating messages.
Adding nodes
For every current block b0
containing our own name but missing the name
of a node that should be added, vote with from == b0
for a block with b0.version + 1
and with name
in addition to b0.members
. Nodes that should be added are candidates that passed our resource proof within the last 60 seconds (Add rule).
The weight of name
is according to Node Ageing, i.e. 0
for a newly added node and a + 1
for a relocated node that previously had age a
.
Removing nodes
For every node name
that should be removed, and for every current block b0
containing both name
and ourselves, vote for a block with b0.version + 1
and b0.members
without name
, using from == b0
. Nodes that should be removed are:
- nodes that have disconnected from us and we haven’t managed to reconnect or establish a tunnel to yet, and (RmDc rule), and
- nodes that sent us malformed or inherently invalid messages or where we strongly suspect malicious behaviour (RmFail rule).
The details about what constitutes malicious behaviour need to be worked out in detail and are only partly related to this proposal.
Splitting
For every current block b
containing our name, let b0
and b1
be blocks with b.version + 1
, the children of b.prefix
and the corresponding subsets of s.members
.
If b0
, b1
and all current blocks whose prefixes are siblings of ancestors of b0.prefix
have at least MIN_SECTION_SIZE + SPLIT_BUFFER
members, vote for b0
and b1
with from == s
(Split rule).
Merging
For every current block b0
containing our name, and every current block b1
such that b0.prefix
and b1.prefix
are siblings, if at least one of the following conditions is satisfied, vote for b
with from == b0
, where b.prefix == b0.prefix.popped()
, b.members
is the union of b0.members
and b1.members
and b.version
is max(b0.version, b1.version) + 1
:
b0
,b1
or any current sibling of any ancestor ofb0.prefix
has fewer thanMIN_SECTION_SIZE
members (Merge rule),- or we disconnected from too many members of
b0
,b1
or any current sibling of any ancestor ofb0.prefix
, so that the ones we are still connected to don’t have a quorum (ForceMerge rule).
Neighbouring sections
For every current block b1
whose prefix doesn’t match our own name, and every current block b0
containing our own name, vote for b1
with from == b0
(Neighbour rule).
This is to witness all neighbouring sections, and it replaces the section list cache.
Admissible valid blocks
Vote for b1
with from == b0
if (Admissible rule):
b0
andb1
are both valid,b0
contains our name,b1
is admissible afterb0
,b0
does not yet witnessb1
and- there are no blocks
c
in betweenb0
andb1
such thatc
is admissible afterb0
andb1
is admissible afterc
.
This is essentially a shortcut to prove b1
from b0
to another node without taking a detour via other blocks. In particular, if our sibling section wants to merge with us and has successfully made the merged section valid, this rule will cause us to sign the merge from our side of the section, too.
Relevant blocks
Every node has to know the current blocks for its own section and its own section’s neighbouring sections. Therefore, when a node observes a split, it should drop all non-neighbouring blocks from the set of current blocks. Any existing votes and valid blocks pertaining to a non-neighbouring section should be retained – the node should just stop trying to stay up to date with that part of the network.
Conversely, when a node observes a merge, it might gain new neighbours from a hitherto unknown section. The message flow is such that nodes receive merge votes from their new neighbours, and can then use the Proof request algorithm to fetch the rest of the votes. See the Relaying votes section below for details.
Relaying votes
Conceptually, there are two types of messages that we send in order to inform other nodes about our current state:
- Vote Messages: a message containing our vote on some change.
- Agreement Messages: a message containing at least a quorum of votes on a change, indicative of our agreement to the validity of a new block.
In the implementation, both such messages can be represented by a Vote
and a collection of signatures, like so:
struct SignedVote {
vote: Vote,
signatures: BTreeMap<PublicId, Signature>,
}
Whenever we wish to vote on something, we broadcast a vote message to the members of the from
and to
blocks. In most cases, this means sending votes only to nodes in our own section. When witnessing, the to
block is that of a neighbour, so we broadcast the vote to them as well. New candidates receive the vote that makes them a member, and can use the Proof request algorithm to retrieve the rest of the history.
Whenever a vote accumulates, and we are in either or both of the from
or to
blocks, we broadcast the vote to any current neighbour section N
such that:
N
is a neighbour of thefrom
orto
prefix, and- Our current prefix is the closest (by XOR distance) to
N
’s prefix, amongst the set of current sections which have prefixes compatible with either thefrom
prefix or theto
prefix.
When taking the XOR distance between two prefixes, we use the lower bounds of both prefixes, so that for example Prefix(01)
⊕ Prefix(00)
= XorName(010000...)
⊕ XorName(00000...)
= 0100000...
.
Message Complexity
Let n be the number of nodes in a given section S, and m be the number of nodes in the neighbourhood of S (consisting of S and its neighbours).
Each new block that is agreed requires O(n) nodes to vote, and each node broadcasts their vote to O(n) other nodes. Hence the message complexity is at least O(n2).
Further, broadcasting the agreement message to the neighbourhood requires that O(n) nodes broadcast to O(m) other nodes, for a contribution of O(nm). We have m ≫ n, so O(nm) ≫ O(n2), and hence the total message complexity is approximately O(nm) per block agreed.
Proof request algorithm
The proof request algorithm can be invoked by a node when it receives a message about an unknown block. We’re fairly sure that there exists an algorithm which works most of the time, but we are still working out the details.
The current algorithm
The algorithm that is currently implemented in Ewok works the following way:
- If we receive a vote or agreement with a
from
block we don’t consider valid, we reply with aRequestProof
message to the sender. This message contains the block in question, as well as a set of blocks we consider current at the given moment. - Upon receiving a
RequestProof
message, we check whether the requested block is valid according to us. If not, we reply withNoProof
. If yes, we proceed to search for a proof. - We conduct the search for a proof, which is essentially a Breadth-First Search in the graph of blocks that are connected by votes:
- Start with a single “path” in the graph, containing just the requested block.
- For each path: let
b
be the last block in that path. Find all blocksx
such that there are accumulated votes fromx
tob
. For every suchx
that we haven’t yet visited, generate a new path by appendingx
to the initial path. - If there is a path among the generated ones, such that its last block is either among the current blocks from the
RequestProof
message, or is for a compatible prefix and a lower version than a block among the current blocks, stop. - If there is no such path, and we haven’t extended any of the paths in the previous step, stop.
- Reply with all the votes along the path satisfying the condition above, or with the immediate predecessors of the requested block if no such path exists.
Changes in the current blocks
With each new signature, the set of valid and current blocks may change. The rule for the routing table is that whenever there exist current blocks b0
and b1
which are compatible or neighbours, all members of b0
and b1
need to try and directly connect to each other, and need to have proofs for each other’s current blocks.
Whenever we learn about new votes in a way such that we now have two current blocks b0
and b1
that are neighbouring or compatible, we inform both of them about each other (see Relaying votes), so they will know whom to connect to.
After each change, disconnect from all nodes that no current block implies we need to be connected to anymore.
Secure hop messages
Nodes that relay a message need to insert Vote
s into it in such a way that every member of the receiving group or section can verify the message, i.e. such that the votes create a chain of witnesses relations from the receiver’s block to a block of the sender a quorum of which has signed the message itself. In detail:
If a node sends a section message on a route
(i.e. after route
failures to receive an Ack
), it sends its signature of the message to the accumulating node, i.e. the route
-th member, sorted by distance from the destination, of every current block containing the node’s own name.
If the node is the accumulating node itself, it waits until it has collected signatures from a quorum of at least one valid block b0
. Then it sends it on to the route
-th member of every current neighbouring block that lies in the right direction (i.e. has a longer prefix in common with the destination address, and agrees with the destination address in all bits that don’t exist in our prefix). Since we immediately broadcast votes that make blocks valid, that node will already know the block b0
. We append b0
, too.
To relay a message, insert votes that (over one or more steps) validate the previous hop’s block from one of our own current blocks. Append that block.
Repairing broken sections
If too many members of a section go offline, the sibling will vote for a merge, and since the sibling’s votes alone are already enough, this automatically force-merges the section.
Optimisations
Nodes occasionally need to apply the block validation algorithm to blocks that are not neighbours, to validate messages. After validating a message, they can drop all blocks again that are not compatible or neighbouring to them.
Nodes can even drop all blocks that have not been current for some period, as it is unlikely that they will ever be used as a source (from
) again. However, if a client or node rejoins after being offline for longer than that period, the network won’t be able to prove to that client or node anymore that it is the successor of the previous network.
If n
events happen roughly at the same time it is not unlikely with the current rules that a section will go through 2
n
versions of itself before reaching agreement again (all combinations of those events). This could be avoided by introducing short delays (maybe just one or two seconds) between our own votes in the cases where we would currently send multiple votes instantaneously: If we always start with the lexicographically lowest vote, then it is likely that some branches accumulate before we consider voting on others, and we wouldn’t exhaust all the possible combinations.
Broadcasting votes
Broadcasting all votes that we received is wasteful since most recipients will already have them. We could be more optimistic here and e.g. just send the now valid blocks. If a node doesn’t agree that they are valid, it can request the missing votes:
So whenever a new block becomes current for a node n
, it sends that block b
in a message CurrentBlock(b)
to all known nodes that are neighbouring or in that block. When a node m
receives that, there are three different cases:
- If
b
is current inm
’s view, too (the normal case), it doesn’t do anything. - If
b
is valid but not current form
,m
sends the votes fromb
to its own current block(s) with compatible prefix back ton
:n
must still be missing some of those votes. - If
b
is not valid form
,m
sends aCurrentBlock(c)
with its own current blocks back ton
.n
will then respond with the missing votes.
Examples
Let’s assume we have a constructor:
fn new<I>(prefix: &str, version: u64, nodes: I) -> Block
where I: IntoIterator<Item = &Name>
Add and remove
Block b0
with nodes 0, …, 4
agrees that it is valid and current. Now node 5
joins and at roughly the same time, node 4
leaves. Consider the following section states:
let b0 = Block::new("", 5, &[0, 1, 2, 3, 4 ]);
let br = Block::new("", 6, &[0, 1, 2, 3 ]);
let ba = Block::new("", 6, &[0, 1, 2, 3, 4, 5]);
let b1 = Block::new("", 7, &[0, 1, 2, 3, 5]);
Nodes 0, 1
first vote for ba
with from == b0
. Everyone receives their votes.
Next, nodes 2, 3
see 4
leaving and vote for br
with from == b0
. Everyone receives their votes. So far, nothing has accumulated.
Now nodes 0, 1
see 4
leaving, too, and vote br
with from == b0
. They now have proof that br
succeeds b0
. The former is now current, the latter isn’t anymore.
Before receiving those votes, however, nodes 2, 3
see node 5
passing the resource proof, too, and vote for ba
with from == b0
.
Everyone exchanges their votes and has now two candidate blocks ba
and br
. Since ba
has more members, it is current and br
isn’t (although it remains valid).
Since they all are still disconnected from 4
, they vote for b1
with from == ba
.
As soon as that accumulates, b1
becomes the only current block.
Merge
The block b00
with prefix 00
wants to merge with block b01
with prefix 01
, because block b01
has too few members. The members of b00
sign the block b0
with prefix 0
, with the union of the members of b00
and b01
, and from == b00
.
Having a valid block b0
now, all members of b01
need to learn about and connect to b10
. Since b0
is valid and has a higher version number than b01
, b01
is now not current anymore, regardless of whether b01
also voted for it.