proj-plbook-plChConsensus

Table of Contents for Programming Languages: a survey

Chapter: Consensus

Eventual consistency

also called best-effort consistency (cite Unify: A scalable, loosely-coupled, distributed shared memory multicomputer; that article reserves the term 'eventual consistency' when there is a maximum time duration until consistency)

Last write wins

Logical clocks consistency schemes

Vector clocks

Consider the following example (slightly modified from http://basho.com/why-vector-clocks-are-easy/)

Alice, Ben, Cathy, and Dave are planning to meet next week for dinner. Dave Alice emails everyone and suggests they meet on Wednesday. Ben emails Dave and says he can do any day, but he'd prefer Tuesday or Thursday. Dave replies privately to Ben that he'd prefer Tuesday, so unless someone can't make it, it'll be Tuesday. Then Cathy emails Dave and says she absolutely can't make any day but Thursday, so Dave replies privately to Cathy that it'll be Thursday. Then Dave gets sick and stops checking his email. Alice emails everyone again to find out what day was decided, but she gets mixed messages. Cathy claims to have settled on Thursday with Dave, and Ben claims to have settled on Tuesday with Dave. Dave can't be reached, and Ben and Cathy's computers' clocks are unreliable, so their email timestamps can't be used to determine which email from Dave was sent latest. So none of Alice, Ben, and Cathy know which day has been chosen (assume for the purpose of the example that they can't just work it out themselves).

Consider the decision about which day it is to be a document with different versions. We want to know which version is most recent. The vector clock solution is this: a list of clocks is appended to each message. A 'clock' is just an integer. There is one clock for each person who has seen any previous version of the message. Whenever a person is about to send a message, they increase their clock.

So in the examples, the messages are:

Dave (to all): "Dinner? Vector clock: {Dave: 1}" Alice (to all): "Yeah how about Wednesday? Vector clock: {Alice: 1, Dave: 1}" Ben (to Dave and Alice): "I can do Wednesday, but Tuesday or Thursday would be a bit better. Vector clock: {Alice: 1, Ben: 1, Dave: 1}" Dave (to Ben): "I like Tuesday too if no one else minds. Vector clock: {Alice: 1, Ben: 1, Dave: 2}" Cathy (to Dave): "I can only do Thursday. Is that okay? Vector clock: {Alice: 1, Cathy: 1, Dave: 1}" Dave (to Cathy): "OK. Vector clock: {Alice: 1, Ben: 1, Cathy: 1, Dave: 3}"

Now even if Dave goes silent, Ben and Cathy can compare the latest messages they've seen, which would be their replies from Dave, and determine that the latest version, that is, the version which is a descendent of every earlier version, is Dave's message to Cathy confirming Thursday. They can agree that since Dave has seen everything that everyone had said before replying to Cathy that Thursday would be okay, that Thursday is reasonable.

Vector clocks can also distinguish the case in which there is no single 'latest version' (multiple conflicting versions). For example, if instead of emailing Dave about Tuesday and Thursday, Ben had emailed only Alice, and she had replied, we would have had:

Dave (to all): "Dinner? Vector clock: {Dave: 1}" Alice (to all): "Yeah how about Wednesday? Vector clock: {Alice: 1, Dave: 1}" Ben (to Alice): "I can do Wednesday, but Tuesday or Thursday would be a bit better. Vector clock: {Alice: 1, Ben: 1, Dave: 1}" Alice (to Ben): "I like Tuesday too if no one else minds. Vector clock: {Alice: 2, Ben: 1, Dave: 1}" Cathy (to Dave): "I can only do Thursday. Is that okay? Vector clock: {Alice: 1, Cathy: 1, Dave: 1}" Dave (to Cathy): "OK. Vector clock: {Alice: 1, Cathy: 1, Dave: 2}"

Now if Ben and Cathy compare the latest messages they've seen, they see:

"I like Tuesday too if no one else minds. Vector clock: {Alice: 2, Ben: 1, Dave: 1}"

and

"OK. Vector clock: {Alice: 1, Cathy: 1, Dave: 2}"

Neither of these has the property that the clocks in its vector clock are all greater than or equal to the corresponding clocks in the other. So neither is a descendent of the other; we have a version conflict.

One problem with vector clocks is that they are not scalable, because each message must be padded with metadata whose length is worst-case proportional to the total number of processes in the system. Charron-Bost proved that there is no scalable system (one using clocks with less entries than the total number of processes in the system) that can correctly determine whether or not one message is a descendent of the other in all cases (cite Bernadette Charron-Bost. Concerning the Size of Logical Clocks in Distributed Systems. Information Processing Letters 01/1991; 39:11-16).

Another problem is that if someone is also interacting with some of the same people on a different document, e must remember the last clock number that he sent out for each conversation.

(note: technically speaking i think vector clocks are supposed to be incremented by the receiver, not the sender but in the examples above i had the sender increment)

Lamport clocks

Lamport clocks are like vector clocks where there is only one integer attached to each message. Like vector clocks, each process also keeps track of one integer. They work like this:

In the previous example, this would be

Dave (to all): "When shall we meet? Lamport clock: 0" Alice (to all): "Let's meet on Wednesday. Lamport clock: 1" Ben (to Dave): "I can do Wednesday, but Tuesday or Thursday would be a bit better. Lamport clock: 2" Dave (to Ben): "I like Tuesday too if no one else minds. Lamport clock: 3" Cathy (to Dave): "I can only do Thursday. Is that okay? Lamport clock: 2" Dave (to Cathy): "OK Thursday it is then. Lamport clock: 4"

Lamport clocks have the property that if one message is a descendent of another. the descendent will always have a greater clock value than its ancestor. But if the messages are incomparable (neither is a descendent of another), the Lamport clock might be equal for both of them, or it might be greater for either one of them.

Lamport clocks are scalable, unlike vector clocks. The functionality that vector clocks have that this doesn't is that you can't always tell if there is no single latest version (if there is a conflict). If there is no version which is a descendent of all other versions, it is still possible for one message to have a greater Lamport clock value.

Other clock-ish schemes

"Plausible clocks" are clock schemes in which, if one message is a descendent of the other, the clock scheme says so; but it might also say that even if the two messages are incomparable (incomparable here is sometimes called "concurrent").

"Matrix clocks" are when each process remembers, for each other process that it receives messages from, the latest vector clock that it saw in any message from that other process. In other words, it keeps track not only of what it has seen, but also of what other processes have seen.

"Version vectors" are like vector clocks but for synchronization of documents; when two parties synch, they set the clocks in the document that they synched to the max of their individual clocks.

You can 'prune' the vectors in vector clocks; in the vector, for each clock, have a 'last modified' timestamp (it's okay that the processes' clocks aren't synchronized, this timestamp is only an optimization, if it's wrong it won't affect correctness). Now when the vector clock gets 'too big', delete the oldest counters from the vector. If you compare the resulting document to a very old ancestor, the new document won't appear to be an ancestor because it is missing some clocks that the old one has; so this system will sometimes mistakenly tell you that there is no single latest version even when there is. But it will never tell you that there is a latest version when there isn't one.

Interval tree clocks

Links:

The hard part is conflict resolution

If one is using vector clocks as part of providing some sort of consistency, all the clocks do is to detect conflicts. You still have to find some way to resolve those conflicts, and how conflicts should be resolved is an application specific question.

CRDTs

https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type

" if any two versions of the document can be merged associatively, commutatively, and idempotently; e.g. they form a CRDT " -- [1]

Some CRDTs:

Links:

Gossip protocols

https://en.wikipedia.org/wiki/Gossip_protocol

consensus

paxos

raft

Byzantine faults

what is consensus good for?

a) " ongardie 19 hours ago

You end up needing consensus for a lot of fault-tolerant systems that need to provide consistent results. For example, if your system allows users to choose their own usernames, and you're trying to guarantee that all usernames are unique, and your system needs to automatically deal with server failures, then I think you also need consensus.

Another way to think about it is that consensus gets you the equivalent of a compare-and-swap operation in a distributed setting. Just as compare-and-swap is useful for building synchronization primitives with shared memory, consensus is useful for building synchronization primitives across a network.

[1] https://en.wikipedia.org/wiki/Compare-and-swap

reply "

b) "Consensus protocols are the basis for the state machine replication approach to distributed computing" -- [3] "a replicated log for consensus." -- [4]

" Consensus involves multiple servers agreeing on values. Once they reach a decision on a value, that decision is final. Typical consensus algorithms make progress when any majority of their servers are available; for example, a cluster of 5 servers can continue to operate even if 2 servers fail. If more servers fail, they stop making progress (but will never return an incorrect result).

Consensus typically arises in the context of replicated state machines, a general approach to building fault-tolerant systems. Each server has a state machine and a log. The state machine is the component that we want to make fault-tolerant, such as a hash table. It will appear to clients that they are interacting with a single, reliable state machine, even if a minority of the servers in the cluster fail. Each state machine takes as input commands from its log. In our hash table example, the log would include commands like set x to 3. A consensus algorithm is used to agree on the commands in the servers' logs. The consensus algorithm must ensure that if any state machine applies set x to 3 as the nth command, no other state machine will ever apply a different nth command. As a result, each state machine processes the same series of commands and thus produces the same series of results and arrives at the same series of states. " -- [5]