DDIA Chapter 5. Replication
Buy the book https://dataintensive.net/
Intro
- Definition: Replication means keeping a copy of the same data on multiple machines that are connected via a network
- Why?
- keep data geographically closer to users (latency)
- availability when some parts have failed
- increase read throughput
- All of the difficulty in replication lies in handling changes to replicated data
- Popular algorithms: single-leader, multi-leader, and leaderless replication.
- Trade-offs: sync/async, handle failed replicas
Leaders and Followers
- Definition: Each node that stores a copy of the database is called a replica
- Fundamental question: how do we ensure that all the data ends up on all the replicas?
- Widely used:
- Used in relational DBs PostgreSQL (since version 9.0), MySQL, Oracle Data Guard, and SQL Server’s AlwaysOn Availability Groups
- Used in some non-relational DBs MongoDB, RethinkDB, and Espresso
- Distributed message brokers such as Kafka and RabbitMQ (for highly available queues) also use it
Synchronous Versus Asynchronous Replication
- Examples where followers fall behind: if a follower is recovering from a failure, if the system is operating near maximum capacity, or if there are network problems between the nodes.
- Often, leader-based replication is configured to be completely asynchronous.
- In this case, if the leader fails and is not recoverable, any writes that have not yet been replicated to followers are lost.
- This means that a write is not guaranteed to be durable, even if it has been confirmed to the client.
- However, a fully asynchronous configuration has the advantage that the leader can continue processing writes, even if all of its followers have fallen behind.
- Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless widely used, especially if there are many followers or if they are geographically distributed. More at [[#Problems with Replication Lag]]
- [[DDIA-9 Consistency and Consensus]] talks about consistency of replication and consensus
Setting Up New Followers
- Goal: to increase the number of replicas, or to replace failed nodes
- Question: how to ensure the new follower has an accurate copy of the leader's data
- Process: copy a consistent snapshot tied to a position in the replication log, then request all data changes since that position until caught up (see the sketch below)
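A minimal in-memory sketch of that process; the Leader/Follower classes and log positions here are illustrative, not any particular database's API:
```python
# Illustrative sketch: copy a snapshot tagged with a log position, then replay
# every change made after that position until the follower has caught up.

class Leader:
    def __init__(self):
        self.data = {}
        self.log = []                    # ordered list of (key, value) changes

    def write(self, key, value):
        self.log.append((key, value))
        self.data[key] = value

    def snapshot(self):
        # The snapshot is associated with an exact position in the replication log
        return dict(self.data), len(self.log)

    def changes_since(self, position):
        return self.log[position:]

class Follower:
    def __init__(self, snapshot, position):
        self.data = snapshot
        self.position = position

    def catch_up(self, leader):
        for key, value in leader.changes_since(self.position):
            self.data[key] = value
        self.position = len(leader.log)

leader = Leader()
leader.write("user:1", "alice")
snapshot, pos = leader.snapshot()
leader.write("user:2", "bob")            # write that happens during the copy
follower = Follower(snapshot, pos)
follower.catch_up(leader)                # replay changes made since the snapshot
assert follower.data == leader.data
```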
Handling Node Outages
- Follower failure: Catch-up recovery
- Leader failure: Failover
Failover Process
- Determining that the leader has failed: timeout
- Choosing a new leader: election or appointed by previous controller node (more in [[DDIA-9 Consistency and Consensus]])
- Reconfiguring the system to use the new leader
- Discuss in [[DDIA-6 Partitioning#Request Routing]]
- The system needs to ensure that the old leader becomes a follower and recognizes the new leader.
Things can go wrong during failover:
- the new leader may not have received all the writes from the old leader before it failed
- Discarding writes is especially dangerous if other storage systems outside of the database need to be coordinated with the database contents (GitHub outage example)
- Split brain: two nodes both believe that they are the leader; be careful that the mechanism for resolving it does not end up shutting down both nodes
- Choosing the right timeout
- More in [[DDIA-8 The Trouble with Distributed Systems]] and [[DDIA-9 Consistency and Consensus]]
Implementation of Replication Logs
Statement-based replication
Write-ahead log (WAL) shipping
- Storage engines usually already maintain an append-only log of writes
- [[DDIA-3 Storage and Retrieval#SSTables and LSM-Trees]] log segments
- [[DDIA-3 Storage and Retrieval#B-Trees]]: WAL written before modifying disk blocks
- Send same log to replicas
- Used in PostgreSQL, Oracle and others
- Disadvantage: the log is very low level and coupled to the storage engine, so leader and followers must run compatible versions, which complicates zero-downtime upgrades
Logical (row-based) log replication
- Allows the replication log to be decoupled from the storage engine internals
- Also easier for external applications to parse
- Data warehouse offline analysis
- Custom indexes and caches
- [[DDIA-11 Stream Processing#Change Data Capture]]
Trigger-based replication
- Relational DB features: triggers and stored procedures.
- Triggers
- register custom application code ->
- automatically executed when a write transaction occurs ->
- log this change into a separate table ->
- read by an external process ->
- apply any necessary application logic and replicate the data change to another system (see the sketch after this list)
- Examples: Databus for Oracle and Bucardo for Postgres
- Con: greater overheads, prone to bugs; Pro: flexible
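A rough sketch of the trigger-based flow using SQLite from the Python standard library; the table names, trigger, and replicate() poller are hypothetical illustrations, not how Databus or Bucardo actually work:
```python
# Hypothetical sketch of trigger-based replication: a trigger records every
# write into a change_log table, and an external process polls that table and
# applies the changes to another database.
import sqlite3

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")

# Trigger: log each insert into a separate table for the replication process
source.executescript("""
CREATE TABLE change_log (seq INTEGER PRIMARY KEY AUTOINCREMENT, id INTEGER, balance INTEGER);
CREATE TRIGGER log_insert AFTER INSERT ON accounts
BEGIN
    INSERT INTO change_log (id, balance) VALUES (NEW.id, NEW.balance);
END;
""")

source.execute("INSERT INTO accounts VALUES (1, 100)")
source.execute("INSERT INTO accounts VALUES (2, 250)")

# External process: read the change log and replicate into the target system
def replicate(last_seq=0):
    rows = source.execute(
        "SELECT seq, id, balance FROM change_log WHERE seq > ?", (last_seq,)
    ).fetchall()
    for seq, id_, balance in rows:
        target.execute("INSERT OR REPLACE INTO accounts VALUES (?, ?)", (id_, balance))
        last_seq = seq
    return last_seq

replicate()
print(target.execute("SELECT * FROM accounts").fetchall())   # [(1, 100), (2, 250)]
```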
Problems with Replication Lag
- Read-scaling architecture: create many followers, and distribute the read requests across those followers.
- Realistically only works with asynchronous followers -> eventual consistency
Reading Your Own Writes
- Possible techniques
- read from leader for potentially modified data (e.g., user's own profile)
- track the time of the last update and, for a period afterwards, make all reads from the leader
- the client remembers the timestamp of its most recent write, and the system ensures reads are served by a replica that has caught up to that timestamp (sketch below)
- multiple datacenter complexity
- cross-device read-after-write consistency
- metadata will need to be centralized
- may first need to route requests from all of a user’s devices to the same datacenter.
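A minimal sketch of the remember-your-own-write-timestamp technique; the Cluster/Replica/Client classes and the applied_up_to field are made-up illustrations, not a real driver API:
```python
# Read-after-write routing: the client remembers the timestamp of its last
# write, and a read is served by a follower only if that follower has
# replicated at least up to that timestamp; otherwise it goes to the leader.
import time

class Replica:
    def __init__(self):
        self.data = {}
        self.applied_up_to = 0.0        # timestamp of the last replicated write

class Client:
    last_write_ts = 0.0

class Cluster:
    def __init__(self):
        self.leader = Replica()
        self.follower = Replica()       # may lag behind the leader

    def write(self, client, key, value):
        ts = time.time()
        self.leader.data[key] = value
        self.leader.applied_up_to = ts
        client.last_write_ts = ts       # client remembers its own write

    def read(self, client, key):
        # Route to the follower only if it has caught up to the client's last write
        replica = (self.follower
                   if self.follower.applied_up_to >= client.last_write_ts
                   else self.leader)
        return replica.data.get(key)

cluster, me = Cluster(), Client()
cluster.write(me, "profile:me", "new bio")
print(cluster.read(me, "profile:me"))   # follower lags -> served by the leader: "new bio"
```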
Monotonic Reads
- it’s possible for a user to see things moving backward in time.
- make sure that each user always makes their reads from the same replica, e.g., chosen by a hash of the user ID (sketch below)
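A minimal sketch of pinning a user to one replica by hashing the user ID (replica names are placeholders):
```python
# Monotonic reads: pick the replica deterministically from a hash of the user
# ID, so successive reads by the same user hit the same replica (unless it fails).
import hashlib

REPLICAS = ["replica-1", "replica-2", "replica-3"]

def replica_for(user_id: str) -> str:
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return REPLICAS[int(digest, 16) % len(REPLICAS)]

assert replica_for("user-42") == replica_for("user-42")   # always the same replica
```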
Consistent Prefix Reads
- Definition: If a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.
- In partitioned distributed databases, partitions operate independently, so there is no global ordering of writes
- One solution is to make sure that any writes that are causally related to each other are written to the same partition
- Algorithms that explicitly keep track of causal dependencies ([[#The “happens-before” relationship and concurrency]])
Solutions for Replication Lag
- Does the user experience need to avoid replication lag?
- Ways for the app to provide a stronger guarantee
- Underlying database guarantees via transactions
- Distributed database transactions are hard
- Discuss in [[DDIA-7 Transactions]] and [[DDIA-9 Consistency and Consensus]]
- Alternative mechanisms or Derived Data: Batch and Streaming
Multi-Leader Replication
- Definition
- Allow more than one node to accept writes
- Each node that processes a write must forward that data change to all the other nodes
- Each leader simultaneously acts as a follower to the other leaders
Use Cases for Multi-Leader Replication
Use in multi-datacenter deployments
- Advantages of multi-leader configurations in multi-datacenter deployment:
- Performance
- Tolerance of datacenter outages
- Tolerance of network problems
- Downside
- Write conflicts (Discuss at [[#Handling Write Conflicts]])
- Dangerous from subtle configuration pitfalls and surprising interactions with other database features (https://scale-out-blog.blogspot.com/2012/04/if-you-must-deploy-multi-master.html)
Clients with offline operation
E.g., CouchDB is designed for this mode of operation
Collaborative editing
- automatic conflict resolution
- either acquire a lock before writing, or make the unit of change very small and avoid locking
Handling Write Conflicts
Synchronous versus asynchronous conflict detection
- Conflict detection must be asynchronous; if you wanted synchronous conflict detection, you might as well just use single-leader replication
Conflict avoidance
Converging toward a consistent state
- last write wins (LWW) more at [[#Detecting Concurrent Writes]] (implies data loss)
- Give each replica a unique ID; writes from the higher-numbered replica take precedence (implies data loss)
- Merge writes in certain way
- Record conflict explicitly and write app code that resolves later
Custom conflict resolution logic
- On write
- On read
- Automatic conflict resolution
- Conflict-free replicated datatypes (CRDTs)
- Mergeable persistent data structures
- Operational transformation
Multi-Leader Replication Topologies
- A replication topology describes the communication paths along which writes are propagated from one node to another
- Most general: all-to-all
- More restricted: circular topology (MySQL default)
- Star topology
- A problem with circular and star topologies is that if just one node fails, it can interrupt the flow of replication messages between other nodes, causing them to be unable to communicate until the node is fixed
- all-to-all can have issue of concurrent write, need version vectors discussed in [[#Detecting Concurrent Writes]]
- Be careful about multi-leader replication and check the database guarantees
Leaderless Replication
- Riak, Cassandra, and Voldemort are open source datastores with leaderless replication models inspired by Dynamo
- Therefore also known as Dynamo-style
- In some leaderless implementations, the client directly sends its writes to several replicas
- while in others, a coordinator node does this on behalf of the client; unlike a leader, that coordinator does not enforce a particular ordering of writes
Writing to the Database When a Node Is Down
- In leader-based configuration: failover
- In a leaderless configuration: no failover; the client writes to several replicas in parallel, and read requests are also sent to several nodes in parallel
- Version numbers are used to determine which value is newer
Read repair and anti-entropy
- To make nodes with stale values catch up
- Read repair: a client that detects a stale response writes the newer value back to that replica (sketch below)
- Anti-entropy process: a background process that constantly looks for differences between replicas; without it, values that are rarely read may be missing from some replicas
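A minimal sketch of read repair, assuming each replica stores (value, version) pairs; the structures are illustrative, not a real client API:
```python
# Read repair: read from several replicas, pick the value with the highest
# version number, and write it back to any replica that returned a stale version.
replicas = [
    {"key": ("v2", 2)},   # up to date: value "v2" at version 2
    {"key": ("v2", 2)},
    {"key": ("v1", 1)},   # stale replica
]

def read_with_repair(key):
    responses = [(r, r.get(key)) for r in replicas]
    value, version = max((resp for _, resp in responses if resp), key=lambda x: x[1])
    for replica, resp in responses:
        if resp is None or resp[1] < version:
            replica[key] = (value, version)   # repair the stale replica
    return value

print(read_with_repair("key"))   # "v2"
print(replicas[2]["key"])        # ("v2", 2) -- repaired
```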
Quorums for reading and writing
- If there are n replicas
- every write must be confirmed by w nodes to be considered successful
- must query at least r nodes for each read.
- As long as w + r > n, we expect to get an up-to-date value when reading (sketch after this list)
- n, w, r are typically configurable
- There may be more than n nodes in the cluster, to allow partitioning
- If fewer than the required w or r nodes are available, writes or reads return an error.
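A minimal quorum sketch with n = 3, w = 2, r = 2; the in-memory replicas and version counter are illustrative assumptions, not how any specific Dynamo-style store implements it:
```python
# Quorum reads/writes: a write is acknowledged once w replicas accept it; a read
# queries r replicas and takes the value with the highest version number.
N, W, R = 3, 2, 2
replicas = [{} for _ in range(N)]   # each replica maps key -> (value, version)
version = 0

def write(key, value, reachable):
    global version
    version += 1
    acks = 0
    for i in reachable:
        replicas[i][key] = (value, version)
        acks += 1
    if acks < W:
        raise IOError("write failed: quorum not reached")

def read(key, reachable):
    if len(reachable) < R:
        raise IOError("read failed: quorum not reached")
    responses = [replicas[i][key] for i in reachable[:R] if key in replicas[i]]
    return max(responses, key=lambda vv: vv[1])[0]   # newest version wins

write("k", "hello", reachable=[0, 1])   # replica 2 is down, but w = 2 is met
print(read("k", reachable=[1, 2]))      # read set overlaps the write set -> "hello"
```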
Limitations of Quorum Consistency
However, even with w + r > n, there are likely to be edge cases where stale values are returned:
- if a sloppy quorum is used
- two writes occur concurrently: one safe solution is to merge the writes
- write happens concurrently with a read
- a write succeeded on some replicas but failed overall
- a node carrying the new value fails and is restored from a replica carrying an old value
- edge cases from timing (see in [[DDIA-9 Consistency and Consensus#Linearizability and quorums]])
Thus:
- Dynamo-style databases are generally optimized for use cases that can tolerate eventual consistency.
- The parameters w and r allow you to adjust the probability of stale values being read
- Stronger guarantees require transactions or consensus: [[DDIA-7 Transactions]] and [[DDIA-9 Consistency and Consensus]]
Monitoring staleness
- Possible in leader-based replication
- writes are applied to the leader and to followers in the same order, and each node has a position in the replication log
- Subtract a follower's current position from the leader's current position to measure the lag (sketch below)
- No fixed write order in leaderless replication
- Also, a database that only does read repair (no anti-entropy) has no limit on how stale a value may be
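A trivial sketch of the lag calculation; the numbers and names are made up, real databases expose log positions through their own monitoring views:
```python
# Staleness monitoring in leader-based replication: each node exposes its
# position in the replication log, and lag is simply the difference.
leader_position = 1_042_337      # sequence number / offset on the leader
follower_position = 1_042_120    # how far this follower has applied

replication_lag = leader_position - follower_position
print(f"follower is {replication_lag} log units behind")
```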
Sloppy Quorums and Hinted Handoff
- Leaderless high availability and low latency:
- appropriately configured quorums can tolerate the failure of individual nodes without the need for failover
- can also tolerate individual nodes going slow
- But not very fault-tolerant when large numbers of nodes are cut off
- Trade-offs:
- return errors for all requests for which a quorum of nodes cannot be reached
- accept writes to some reachable nodes (sloppy quorum)
- hinted handoff: Once the network interruption is fixed, any writes that one node temporarily accepted on behalf of another node are sent to the appropriate "home" nodes (sketch below)
- Increase write availability: as long as any w nodes are available, the database can accept writes
- There is no guarantee that a read of r nodes will see it until the hinted handoff has completed.
- Optional in all Dynamo implementations; default in Riak; disabled by default in Cassandra and Voldemort
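A minimal sketch of a sloppy quorum with hinted handoff; the node names and hint list are illustrative only:
```python
# Sloppy quorum + hinted handoff: if a "home" node for a key is unreachable,
# another node accepts the write and stores a hint; when the home node is
# reachable again, the hinted write is handed off to it.
home_nodes = {"node-a": {}, "node-b": {}}
stand_in = {"data": {}, "hints": []}     # node outside the key's usual home set

def write(key, value, home, reachable):
    if home in reachable:
        home_nodes[home][key] = value
    else:
        # Sloppy quorum: accept the write anyway, remember whom it belongs to
        stand_in["data"][key] = value
        stand_in["hints"].append((home, key, value))

def handoff(reachable):
    remaining = []
    for home, key, value in stand_in["hints"]:
        if home in reachable:
            home_nodes[home][key] = value    # deliver to the proper home node
        else:
            remaining.append((home, key, value))
    stand_in["hints"] = remaining

write("cart:42", ["milk"], home="node-a", reachable=[])   # node-a unreachable
handoff(reachable=["node-a"])                             # network is back
print(home_nodes["node-a"])                               # {'cart:42': ['milk']}
```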
Multi-datacenter operation
- Cross-DC replication as a use case for [[#Multi-Leader Replication]]
- Leaderless is also suitable: designed to tolerate conflicting concurrent writes, network interruptions, and latency spikes.
- Examples
- Cassandra and Voldemort: n includes nodes in all DCs; writes are sent to all replicas, but the client only waits for acks from its own DC; higher-latency cross-DC writes happen asynchronously in the background
- Riak: keeps n within one DC
Detecting Concurrent Writes
- Dynamo-style DBs allow concurrent writes to the same key
- Similar to multi-leader replication ([[#Handling Write Conflicts]])
- Dynamo-style also has conflicts during read repair or hinted handoff
- Fundamental problem is that events may arrive in a different order at different nodes, due to variable network delays and partial failures
- Goal: achieve eventual consistency
- Poor implementations: the app developer needs to know the DB's conflict handling internals
Explore more details about [[#Handling Write Conflicts]] below:
Last write wins (discarding concurrent writes)
- writes are concurrent -> order is undefined
- Force arbitrary order: use timestamp
- Picking the write with the largest timestamp (last write wins, LWW) is the only supported conflict resolution method in Cassandra, and an optional feature in Riak (sketch below)
- The cost of LWW is durability: only one write survives
- May drop non-concurrent writes because [[DDIA-8 The Trouble with Distributed Systems#Timestamps for ordering events]]
- Only safe way of using LWW is to ensure unique key ([[202112252056 Cassandra does not need vector clocks]])
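A minimal LWW sketch, just to show how a concurrent write silently disappears (timestamps are arbitrary):
```python
# Last write wins: every write carries a timestamp; on conflict, only the value
# with the largest timestamp survives, and the other concurrent write is dropped.
def lww_merge(*versions):
    # versions are (timestamp, value) pairs received by different replicas
    return max(versions, key=lambda tv: tv[0])

a = (1700000001.20, "write from client A")
b = (1700000001.25, "write from client B")   # concurrent with A
print(lww_merge(a, b))   # B's timestamp is larger, so A's write is lost
```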
The “happens-before” relationship and concurrency
- concurrent means the two operations do not know about each other
- if not concurrent, later operation can overwrite prior
- otherwise needs conflict resolution
Capturing the happens-before relationship
Note that the server can determine whether two operations are concurrent by looking at the version numbers—it does not need to interpret the value itself (so the value could be any data structure). The algorithm works as follows (see the sketch after this list):
- The server maintains a version number for every key, increments the version number every time that key is written, and stores the new version number along with the value written.
- When a client reads a key, the server returns all values that have not been overwritten, as well as the latest version number. A client must read a key before writing.
- When a client writes a key, it must include the version number from the prior read, and it must merge together all values that it received in the prior read. (The response from a write request can be like a read, returning all current values, which allows us to chain several writes like in the shopping cart example.)
- When the server receives a write with a particular version number, it can overwrite all values with that version number or below (since it knows that they have been merged into the new value), but it must keep all values with a higher version number (because those values are concurrent with the incoming write).
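A sketch of the algorithm above for a single server and a single key, following the shopping cart example; the VersionedKey class is an illustration, not Riak's or Dynamo's actual implementation:
```python
# Single-key version-number algorithm: values with a version <= the one the
# writer based its write on have been merged by the client and can be
# overwritten; higher-versioned values are concurrent siblings and must be kept.
class VersionedKey:
    def __init__(self):
        self.version = 0
        self.values = {}                 # version -> value, the surviving siblings

    def read(self):
        return self.version, list(self.values.values())

    def write(self, value, based_on_version):
        self.version += 1
        # Drop everything the client has already merged in...
        self.values = {v: val for v, val in self.values.items() if v > based_on_version}
        # ...and keep the new value alongside any concurrent siblings.
        self.values[self.version] = value
        return self.read()

cart = VersionedKey()
cart.write(["milk"], based_on_version=0)                 # client 1, had read nothing
cart.write(["eggs"], based_on_version=0)                 # client 2, concurrent with client 1
version, siblings = cart.read()
print(siblings)                                          # [['milk'], ['eggs']] -- both kept
cart.write(["milk", "eggs"], based_on_version=version)   # merged write overwrites both
print(cart.read())                                       # (3, [['milk', 'eggs']])
```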
Merging concurrently written values
- If several operations happen concurrently, clients have to clean up afterward by merging the concurrently written values. Riak calls these concurrent values siblings.
- Merging sibling values == conflict resolution in multi-leader (sketch below)
- Mark removals with a tombstone (previously seen at [[DDIA-3 Storage and Retrieval#Hash Indexes]])
- Riak supports CRDTs
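A minimal sketch of application-side sibling merging with tombstones, using the shopping cart example; the sibling structure is an assumption for illustration:
```python
# Merge siblings by taking the union of all concurrently written carts, but
# honor deletions via tombstones so a removed item does not reappear.
def merge_siblings(siblings):
    # Each sibling is {"items": set, "tombstones": set}
    items = set().union(*(s["items"] for s in siblings))
    tombstones = set().union(*(s["tombstones"] for s in siblings))
    return {"items": items - tombstones, "tombstones": tombstones}

sibling_a = {"items": {"milk", "flour"}, "tombstones": set()}
sibling_b = {"items": {"eggs"}, "tombstones": {"flour"}}   # concurrently removed flour
print(merge_siblings([sibling_a, sibling_b]))
# items: {'milk', 'eggs'} -- flour stays removed thanks to its tombstone
```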
Version vectors
- Single version number is not sufficient when there are multiple replicas accepting writes concurrently
- Use a version number per replica as well as per key
- Each replica increments its own version number when processing a write, and also keeps track of the version numbers it has seen from each of the other replicas
- The collection of version numbers from all the replicas is called a version vector; it allows the database to distinguish between overwrites and concurrent writes (sketch below)
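A minimal sketch of comparing two version vectors to decide overwrite versus concurrency; replica names and counters are illustrative:
```python
# Version vectors: one counter per replica, per key. If one vector is >= the
# other on every replica it supersedes it; otherwise the two writes are
# concurrent and both values must be kept as siblings.
def compare(vv_a, vv_b):
    replicas = set(vv_a) | set(vv_b)
    a_ge_b = all(vv_a.get(r, 0) >= vv_b.get(r, 0) for r in replicas)
    b_ge_a = all(vv_b.get(r, 0) >= vv_a.get(r, 0) for r in replicas)
    if a_ge_b and b_ge_a:
        return "identical"
    if a_ge_b:
        return "a overwrites b"
    if b_ge_a:
        return "b overwrites a"
    return "concurrent -> keep both as siblings"

print(compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 1}))   # a overwrites b
print(compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 3}))   # concurrent -> keep both
```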