DDIA Chapter 6. Partitioning

Buy the book https://dataintensive.net/

Intro

Definition: A partition is a division of a logical database or its constituent elements into distinct independent parts.
Main reason: scalability - the query load can be distributed across many processors.

Partitioning is usually combined with replication so that copies of each partition are stored on multiple nodes.
A node may store more than one partition.
[[DDIA-5 Replication]] applies to replication of partitions

If the partitioning is unfair, so that some partitions have more data or queries than others, we call it skewed
A partition with disproportionately high load is called a hot spot.
Key-value data model: you always access a record by its primary key

A good hash function takes skewed data and makes it uniformly distributed.
hash function need not be cryptographically strong (MD5, Murmur3, etc.)
Assign each partition a range of hashes
Consistent hashing, Rendezvous hashing
Cannot do efficient range queries
- Cassandra compromise: compound primary key only the first part is hashed
- Cassandra one-to-many data model example: (user_id, update_timestamp)

Hashing a key cannot avoid hot spots
Example: 3% of Twitter's Servers Dedicated to Justin Bieber
Application has to reduce the skew by further partitioning and do additional work at read

The situation becomes more complicated if secondary indexes are involved (see also [[DDIA-3 Storage and Retrieval#Other Indexing Structures]]
Secondary indexes are common in both relational and document DBs, and is the most important part of search servers (Solr, Elasticsearch)
Main problem: secondary indexes don't map neatly to partitions
Two main approaches:
- document-based partitioning
- term-based partitioning

Database can perform indexing automatically
- Note: in KV-only DB, application level secondary indexing is prone to inconsistency
Each partition maintains its own secondary indexes
A.K.A. local index
Need to send the query to all partitions, and combine all the results you get back (scatter/gather)
- Read queries are expensive, and prone to latency amplification
- But widely used
- Recommend structure the partitioning scheme so that secondary index queries can be served from a single partition

The process of moving load from one node in the cluster to another is called rebalancing.
Minimum requirements:
- After rebalancing, the load (data storage, read and write requests) should be shared fairly between the nodes in the cluster.
- While rebalancing is happening, the database should continue accepting reads and writes.
- No more data than necessary should be moved between nodes, to make rebalancing fast and to minimize the network and disk I/O load.

Problem: move data more than necessary

Create many more partitions than there are nodes, and assign several partitions to each node
The new node can steal a few partitions from every existing node
The number of partitions is usually fixed when the database is first set up and not changed afterward
If partitions are very large, rebalancing and recovery from node failures become expensive. But if partitions are too small, they incur too much overhead
- Why 480 shards?

Fully automatic rebalancing: convenient but dangerous
Fully manual
Couchbase, Riak, and Voldemort generate a suggested partition assignment automatically, but require an administrator to commit it before it takes effect.

An instance of more generalized problem: service discovery
Open source solutions
On a high level, there're three approaches
Hard to implement correctly the protocols for achieving consensus in a distributed system
Separate coordination service such as ZooKeeper
- LinkedIn’s Espresso uses Helix -> ZooKeeper
- HBase, SolrCloud, and Kafka also use ZooKeeper to track partition assignment.
- MongoDB's config server and mongos daemons
Cassandra and Riak use gossip protocol
- Requests can be sent to any node, and that node forwards them to the appropriate node for the requested partition
- Puts more complexity in the database nodes but avoids the dependency on an external coordination service such as ZooKeeper
Couchbase simplifies the design by not rebalancing automatically
Use DNS to find IP of a node

The massively parallel processing (MPP) query optimizer breaks this complex query into a number of execution stages and partitions, many of which can be executed in parallel on different nodes of the database cluster.
Discuss more at [[DDIA-10 Batch Processing]]