Thomas Wang's blog

From journeyman to master.

DDIA Chapter 9. Consistency and Consensus

Buy the book

The main idea of this chapter is:

Consistency Guarantees

The deeply linked topics this chapter discusses:


What Makes a System Linearizable?

Figure 9-2 If a read request is concurrent with a write request, it may return either the old or the new value

Figure 9-3 After any one read has returned the new value, all following reads (on the same or other clients) must also return the new value

Figure 9-4 Visualizing the points in time at which the reads and writes appear to have taken effect. The final read by B is not linearizable

Linearizability vs. Serializability?

Relying on Linearizability

Let's look at a few examples where linearizability is a requirement.

Locking and leader election

One way of electing a leader is to use a lock: every node that starts up tries to acquire the lock, and the one that succeeds becomes the leader. Acquiring the lock must be linearizable: all nodes must agree on which node owns it.

Apache ZooKeeper and etcd can be used to implement distributed locks and leader election.

Strictly speaking, ZooKeeper and etcd provide linearizable writes, but reads may be stale, since by default they can be served by any one of the replicas. You can optionally request a linearizable read: etcd calls this a quorum read, and in ZooKeeper you need to call sync() before the read.
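The lock-based election above can be sketched as a compare-and-set on a linearizable store. This is a toy in-process model (all names hypothetical) — a `threading.Lock` stands in for the linearizability that ZooKeeper/etcd would provide:

```python
import threading

class LeaderLock:
    """Toy leader-election lock. The mutex stands in for the
    linearizability of the backing store (ZooKeeper/etcd writes)."""
    def __init__(self):
        self._owner = None
        self._mu = threading.Lock()

    def try_acquire(self, node_id):
        # atomic compare-and-set: claim the lock only if it is free
        with self._mu:
            if self._owner is None:
                self._owner = node_id
                return True
            return False

lock = LeaderLock()
winners = [n for n in ["node-a", "node-b", "node-c"] if lock.try_acquire(n)]
# exactly one node wins the election, all others see the lock as taken
```

Because the compare-and-set is linearizable, no two nodes can both observe the lock as free, so at most one leader is elected.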

Constraints and uniqueness guarantees


Compared to relational databases:

Cross-channel timing dependencies

Figure 9-5. The web server and image resizer communicate both through file storage and a message queue, opening the potential for race conditions

Implementing Linearizable Systems

How do we implement linearizability (behaving like a single copy of the data) while tolerating faults (by holding more than one copy of the data)?

Single-leader replication (potentially linearizable)

Consensus algorithms (linearizable)

Multi-leader replication (not linearizable)

Leaderless replication (probably not linearizable)

Linearizability and quorums

Example race conditions in a Dynamo-style strict quorum.

Figure 9-6. A nonlinearizable execution, despite using a strict quorum

The Cost of Linearizability

The CAP theorem

The Unhelpful CAP Theorem

CAP only considers network partitions; it does not consider network delays, dead nodes, or other trade-offs. There are many more impossibility results in distributed systems: see A Hundred Impossibility Proofs for Distributed Computing (1988).

CAP has now been superseded by more precise results:

Linearizability and network delays

The reason for dropping linearizability is performance, not fault tolerance. Sequential Consistency Versus Linearizability (1994) proves that if you want linearizability, the response time of read and write requests is at least proportional to the uncertainty of delays in the network.

Ordering Guarantees

Ordering is an important fundamental idea:

Ordering and Causality

The causal order is not a total order

Linearizability is stronger than causal consistency

Capturing causal dependencies

Key ideas are:

Sequence Number Ordering

Noncausal sequence number generators

Multi-leader and leaderless databases cannot generate sequence numbers that are consistent with causality.

In practice, these methods can generate sequence numbers for operations:

These are more performant than a single node, but note that the sequence numbers they generate are not consistent with causality.
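One classic noncausal generator is to let each node draw from an independent range, e.g. one node produces odd numbers and the other even numbers. A tiny sketch (names hypothetical) shows how a causally later operation can still receive a smaller number:

```python
class RangeGenerator:
    """Each node independently generates numbers from its own range
    (here: node 1 odd, node 2 even). Fast, but noncausal."""
    def __init__(self, start, step=2):
        self.n = start - step
        self.step = step

    def next(self):
        self.n += self.step
        return self.n

node1 = RangeGenerator(1)   # yields 1, 3, 5, ...
node2 = RangeGenerator(2)   # yields 2, 4, 6, ...

a = node1.next()   # operation A on node 1
b = node1.next()   # operation B on node 1, causally after A
c = node2.next()   # operation C on node 2, issued after B
# c < b even though C causally follows B: ordering violates causality
```

Nothing in the generators carries information between nodes, so the numbering cannot reflect which operation happened before which.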

Note that it is possible to make physical clock timestamps consistent with causality:
in [[DDIA-8 The Trouble with Distributed Systems#Synchronized clocks for global snapshots]] we discussed Google’s Spanner, which estimates the expected clock skew and waits out the uncertainty interval before committing a write.
This method ensures that a causally later transaction is given a greater timestamp.
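Spanner's commit-wait idea can be sketched in a few lines. This is a simplified model, not Spanner's actual TrueTime API — the uncertainty constant is an assumption for illustration:

```python
import time

CLOCK_UNCERTAINTY = 0.01  # assumed max clock skew across nodes (seconds)

def commit_wait(commit_timestamp):
    """Wait until the commit timestamp is definitely in the past on
    every node, i.e. past the whole clock-uncertainty interval."""
    while time.time() < commit_timestamp + CLOCK_UNCERTAINTY:
        time.sleep(0.001)

start = time.time()
commit_wait(start)
# any transaction that starts after this wait reads a strictly greater
# clock value, so causally later transactions get greater timestamps
```

The cost is latency: every write pays the uncertainty interval, which is why Spanner invests so much in keeping clock skew small.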

Lamport timestamps

Leslie Lamport's "Time, Clocks, and the Ordering of Events in a Distributed System" (1978) is the most cited paper in the field of distributed systems. The Lamport timestamp is a simple method for generating causally consistent sequence numbers.

It works like this:

Figure 9-8. Lamport timestamps provide a total ordering consistent with causality

Notice that when client A writes max=1 to node 2, it receives the timestamp (5, 2); the next write to node 1 then increases node 1's counter to 6.

What ensures the ordering is that every causal dependency results in an increased timestamp.
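The mechanism above can be sketched directly: every node keeps a counter, and every client and node carries the maximum counter it has seen and attaches it to every request. A minimal model (names hypothetical):

```python
class LamportClock:
    """Lamport timestamp: (counter, node_id). Ties on the counter are
    broken by node ID, giving a total order consistent with causality."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        # local event, or a write handled by this node
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, ts):
        # on receiving a timestamp, fast-forward past it
        counter, _ = ts
        self.counter = max(self.counter, counter)

node1, node2 = LamportClock(1), LamportClock(2)

# client A performs five writes against node 2, ending with (5, 2)
for _ in range(5):
    ts_a = node2.tick()

# client A includes the max timestamp it has seen in its next write,
# so node 1 jumps its counter past 5 and issues (6, 1)
node1.observe(ts_a)
ts_next = node1.tick()
```

Because `observe` fast-forwards the counter, a causally later operation always gets a greater timestamp, even across nodes.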

Difference between Lamport timestamp and [[DDIA-5 Replication#Version vectors]]:

Timestamp ordering is not sufficient

Total Order Broadcast

How to scale the throughput beyond single leader and handle failover?

Required safety properties:

Using total order broadcast

Implementing linearizable storage using total order broadcast

Close links between linearizability and total order broadcast:

In a formal sense, a linearizable read-write register is an “easier” problem. Total order broadcast is equivalent to consensus, which has no deterministic solution in the asynchronous crash-stop model, whereas a linearizable read-write register can be implemented in the same system model.

However, supporting atomic operations such as compare-and-set or increment-and-get in a register makes it equivalent to consensus. Thus, the problems of consensus and a linearizable register are closely related.

Implementing a linearizable compare-and-set operation by using total order broadcast as an append-only log:

  1. Append a message to the log, tentatively indicating the new value you want to claim.

  2. Read the log, and wait for the message you appended to be delivered back to you. (If you don't wait, it will be neither linearizable nor sequentially consistent)

  3. Check for any earlier messages claiming the value you want. If the first such message is your own, commit the claim and acknowledge it to the client; otherwise, abort the operation.

The above procedure provides sequential consistency (also known as timeline consistency), which is a weaker guarantee than linearizability. To make reads linearizable in addition to writes, the options are:
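The three-step procedure can be sketched against a toy log. This single-threaded model (all names hypothetical) replaces real total order broadcast with a shared in-memory list, which trivially delivers messages to everyone in the same order:

```python
class TotalOrderLog:
    """Toy total order broadcast: an append-only log that every
    consumer reads back in the same order."""
    def __init__(self):
        self.entries = []

    def append(self, msg):
        self.entries.append(msg)
        return len(self.entries) - 1   # position of our message

def compare_and_set(log, expected, new, initial=None):
    """Linearizable CAS via the log: (1) append our claim, (2) read the
    log back up to our own message, (3) commit only if no earlier claim
    already changed the register out from under us."""
    pos = log.append((expected, new))        # step 1
    value = initial
    for exp, nv in log.entries[:pos]:        # step 2: replay earlier claims
        if exp == value:
            value = nv                       # an earlier CAS committed
    exp, nv = log.entries[pos]               # step 3: decide our own claim
    if exp == value:
        return True, nv                      # commit
    return False, value                      # abort: someone beat us

log = TotalOrderLog()
ok1, v1 = compare_and_set(log, expected=None, new="A")
ok2, v2 = compare_and_set(log, expected=None, new="B")
```

Here the first claim commits and the second aborts, because replaying the log shows that "A" already took the register before "B"'s claim was delivered.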

Implementing total order broadcast using linearizable storage

The inverse of building a linearizable compare-and-set operation from total order broadcast.

The key difference between total order broadcast and timestamp ordering is that total order broadcast has no gaps in its sequence numbers.
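The construction can be sketched with a linearizable increment-and-get register handing out gap-free sequence numbers (a toy model; the mutex stands in for the register's linearizability):

```python
import threading

class AtomicCounter:
    """Stands in for a linearizable increment-and-get register."""
    def __init__(self):
        self._n = 0
        self._mu = threading.Lock()

    def increment_and_get(self):
        with self._mu:
            self._n += 1
            return self._n

counter = AtomicCounter()

def broadcast(msg):
    """Attach a gap-free sequence number to each message. Recipients
    deliver strictly in sequence-number order: on seeing number n+1
    before n, they wait for n rather than skipping it."""
    seq = counter.increment_and_get()
    return (seq, msg)

msgs = [broadcast(m) for m in ["a", "b", "c"]]
# consecutive numbers 1, 2, 3 — unlike Lamport timestamps, no gaps,
# so a recipient always knows whether it has missed a message
```

With Lamport timestamps a node can never be sure no smaller timestamp is still in flight; gap-free numbers remove that uncertainty, which is exactly why this construction needs a linearizable register.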

In fact, linearizable compare-and-set (or increment-and-get) register and total order broadcast are both equivalent to consensus.

Distributed Transactions and Consensus

The simple yet fundamental goal: get several nodes to agree on something.

[[DDIA-5 Replication]], [[DDIA-7 Transactions]], system models ([[DDIA-8 The Trouble with Distributed Systems]]), [[#Linearizability]] and [[#Total Order Broadcast]] are prerequisite knowledge to tackle the consensus problem.

What are situations in which consensus is required?

The algorithms for consensus:

Atomic Commit and Two-Phase Commit (2PC)

From single-node to distributed atomic commit

What happens when multiple nodes are involved in a transaction?


Introduction to two-phase commit

Figure 9-9. A successful execution of two-phase commit (2PC)

A system of promises

Two crucial "points of no return" that ensure the atomicity:

  1. when a participant votes "yes", it promises that it will definitely be able to commit later
  2. once the coordinator decides, that decision is irrevocable.

The coordinator retries the commit phase indefinitely until it hears back from all the participants. A participant that crashes after voting "yes" must, on recovery, commit the transaction.
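The two phases and the two promises can be sketched as follows. This toy model (all names hypothetical) omits the coordinator's transaction log and retry loop, which a real implementation needs for crash recovery:

```python
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # voting "yes" is a promise: we must be able to commit later,
        # so a real participant durably writes the transaction data first
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Phase 1: collect votes. Phase 2: the decision is irrevocable; a
    real coordinator logs it and retries until every participant acks."""
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

ps = [Participant(), Participant(can_commit=False)]
outcome = two_phase_commit(ps)   # a single "no" vote aborts everyone
```

A single "no" vote (or an unreachable participant) forces a global abort; only a unanimous "yes" lets the coordinator decide to commit.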

Coordinator failure

If the coordinator crashes or the network fails at this point, the participant can do nothing but wait (in doubt or uncertain).

In principle, the participants could communicate among themselves to find out how each participant voted and come to some agreement, but that is not part of the 2PC protocol.

The coordinator must recover its decision by reading its transaction log. Note that 2PC is a blocking protocol.

Distributed Transactions in Practice

2PC is very slow: in MySQL it is reported to be 10x slower than a single-node transaction (Distributed Transactions in MySQL (2013)).

Much of the performance cost inherent in two-phase commit is due to the additional disk forcing (fsync) that is required for crash recovery, and the additional network round-trips.

Cloud vendors like Azure do not provide distributed transactions:

What do we mean by "distributed transactions"?

One example of a heterogeneous distributed transaction use case is exactly-once message processing (message delivery + database writes).

XA transactions

X/Open XA (short for eXtended Architecture) is a standard for implementing two-phase commit across heterogeneous technologies. Widely supported by relational databases and message brokers.

XA assumes that your application uses a network driver or client library to communicate with the participant databases or messaging services. The library is responsible for tracking participants, crash recovery, and communication with the database server.

Holding locks while in doubt

The locks prevent any writes (and potentially reads) to the data. If the coordinator does not recover, the locks block other transactions forever.

Recovering from coordinator failure

If orphaned in-doubt transactions occur, the only way out is for an administrator to manually decide whether to commit or roll back each in-doubt transaction. XA implementations usually have an emergency escape hatch that allows a participant to unilaterally decide to abort or commit an in-doubt transaction without a definitive decision from the coordinator.

Limitations of distributed transactions

[[DDIA-11 Stream Processing]] talks about alternatives to achieve heterogeneous distributed transactions.

Fault-Tolerant Consensus

The consensus algorithm must satisfy properties: uniform agreement, integrity, validity, termination.

Notes on termination:

Consensus algorithms and total order broadcast

Best-known fault-tolerant consensus algorithms: Viewstamped Replication (VSR), Paxos, Raft, and Zab.

They decide on a sequence of values; in other words, they implement total order broadcast, which is equivalent to performing several rounds of consensus.

Single-leader replication and consensus

Why [[DDIA-5 Replication#Leaders and Followers]] did not worry about consensus?

Epoch numbering and quorums

Leader uniqueness guarantee in consensus protocols:

How does a leader make a decision:

Key insight:

This looks similar to 2PC; the differences are:

Limitations of consensus

Pros of consensus algorithms, summarized:

What are the costs?

More notes on network impact:

Coracle: Evaluating Consensus at the Internet Edge: "Our insight is that current consensus algorithms have significant availability issues when deployed outside the well-defined context of the data-center."

Figure 3 shows the reachability between 4 hosts, one of which is behind a poorly configured NAT, allowing it to transmit messages to all hosts but not hear responses.
This host will never be able to hear the leader's heartbeats and thus will continually time out, increment its term, and start a new election.
When contacting the other hosts, it will depose the current leader and continuously force leader elections, rendering the system unavailable.

Membership and Coordination Services

How a service like ZooKeeper or etcd is used?

ZooKeeper features

Combining these features, ZooKeeper can be very useful for distributed coordination.

Allocating work to nodes


Using atomic operations, ephemeral nodes, and notifications in ZooKeeper, a system can recover from faults automatically, without human intervention.

ZooKeeper usually runs on a small, fixed number of nodes (three or five) to support a potentially large number of clients. It provides a way of "outsourcing" some of the work of coordinating nodes to an external service.

Data managed by ZooKeeper is slow-changing (on a timescale of minutes or hours). For replicating fast-changing runtime state, use a tool like Apache BookKeeper.

Service discovery

This solves the problem that virtual machines continually come and go in cloud environments, so you often don't know the IP addresses of your services.

The solution is to configure your services so that, on startup, they register their network endpoints in a service registry, where other services can then find them.

Leader election requires consensus, and a consensus system can help other services learn who the leader is. Asynchronous replicas do not vote, but they can serve read requests that do not need to be linearizable.

Note: using ZooKeeper for service discovery means that during a network partition, other services can't get a response. If availability is preferred, Eureka might be a better option.

Membership services

A membership service determines which nodes are currently active and live members of a cluster. By coupling failure detection with consensus, nodes can come to an agreement about which nodes should be considered alive or not.