How LMDB works
A gentle introduction to LMDB's design and internals
These are largely notes from the talk: Howard Chu - LMDB [The Databaseology Lectures - CMU Fall 2015]. I also dug deeper into some of the areas.
LMDB Links
- Homepage: https://www.symas.com/lmdb
- Documentation: http://www.lmdb.tech/doc/
- Original paper: https://www.openldap.org/pub/hyc/mdb-paper.pdf
Motivation - problems dealing with BDB
BDB slapd backend always required careful, complex tuning
- Data comes through 3 separate layers of caches
- Each layer has different size and speed traits
- Balancing the 3 layers against each other can be a difficult juggling act
- Performance without the backend caches is unacceptably slow - over an order of magnitude slower
Backend caching significantly increased the overall complexity of the backend code
- Two levels of locking required, since BDB database locks are too slow
- Deadlocks occurring routinely in normal operation, requiring additional backoff/retry logic
The caches were not always beneficial, and were sometimes detrimental
- Data could exist in 3 places at once - filesystem, DB, and backend cache - wasting memory
- Searches with result sets that exceeded the configured cache size would reduce the cache effectiveness to zero
- malloc/free churn from adding and removing entries in the cache could trigger pathological heap fragmentation in lib malloc
Obvious Solutions
- Cache management is a hassle, so don't do any caching
- The filesystem already caches data; there's no reason to duplicate the effort
- Lock management is a hassle, so don't do any locking
- Use Multi-Version Concurrency Control (MVCC)
- MVCC makes it possible to perform reads with no locking
- BDB supports MVCC, but still requires complex caching and locking
- To get the desired results, we need to abandon BDB
- Surveying the landscape revealed no other DB libraries with the desired characteristics
- Thus LMDB was created in 2011
- "Lightning Memory-Mapped Database"
- BDB is now deprecated in OpenLDAP
Design Approach
- Based on the "Single-Level Store" concept[0]
- Not new, first implemented in Multics in 1964
- Access a database by mapping the entire DB into memory
- Data fetches are satisfied by direct reference to the memory map; there is no intermediate page or buffer cache
- Single-Level Store
- Only viable if process address spaces are larger than the expected data volumes
- For 32 bit processors, the practical limit on data size is under 2GB
- A 32-bit address space means you have 2^32 possible addresses. This equates to 4GB. However, in many systems, the address space is split, with half (2GB) available for user space and the other half reserved for the system. Hence, the practical limit for user processes is 2GB.
- For common 64 bit processors which only implement 48 bit address spaces, the limit is 47 bits or 128 terabytes
- Not all 64-bit processors use the full 64 bits for addressing. If only 48 bits are used, you have 2^48 possible addresses. This equals 256 terabytes. However, the negative address space (used for system-level tasks) reduces this by half, giving you 128 terabytes for user processes.
- The upper bound at 63 bits is 8 exabytes
- If you use 63 bits for addressing (ignoring the sign bit), you have 2^63 possible addresses. This equates to 8 exabytes.
- Uses a read-only memory map
- Protects the DB structure from corruption due to stray writes in memory
- Any attempts to write to the map will cause a SEGV, allowing immediate identification of software bugs
- Can optionally use a read-write map (a minimal open-flags sketch appears at the end of this section)
- Slight performance gain for fully in-memory data sets
- When the dataset fits entirely in memory, using a read-write map can offer better performance because writes go directly to memory without the need for intermediate buffering.
- Should only be used on fully-debugged application code
- Using a read-write map introduces a risk. If there's a software bug that writes to incorrect memory locations, it can corrupt the database.
- Keith Bostic (BerkeleyDB author, personal email, 2008)
- "The most significant problem with building an mmap'd back-end is implementing write-ahead-logging (WAL). (You probably know this, but just in case: the way databases usually guarantee consistency is by ensuring that log records describing each change are written to disk before their transaction commits, and before the database page that was changed. In other words, log record X must hit disk before the database page containing the change described by log record X.)
- In Berkeley DB WAL is done by maintaining a relationship between the database pages and the log records. If a database page is being written to disk, there's a look-aside into the logging system to make sure the right log records have already been written. In a memory-mapped system, you would do this by locking modified pages into memory (mlock), and flushing them at specific times (msync), otherwise the VM might just push a database page with modifications to disk before its log record is written, and if you crash at that point it's all over but the screaming."
- Implement MVCC using copy-on-write
- In-use data is never overwritten, modifications are performed by copying the data and modifying the copy
- Since updates never alter existing data, the DB structure can never be corrupted by incomplete modifications
- Write-ahead transaction logs are unnecessary
- Readers always see a consistent snapshot of the DB, they are fully isolated from writers
- Read accesses require no locks
- MVCC Details
- "Full" MVCC can be extremely resource intensive
- DBs typically store complete histories reaching far back into time
- The volume of data grows extremely fast, and grows without bound unless explicit pruning is done
- Pruning the data using garbage collection or compaction requires more CPU and I/O resources than the normal update workload
- Either the server must be heavily over-provisioned, or updates must be stopped while pruning is done
- Pruning requires tracking of in-use status, which typically involves reference counters, which require locking
- LMDB nominally maintains only two versions of the DB
- Rolling back to a historical version is not interesting for OpenLDAP
- Older versions can be held open longer by reader transactions
- LMDB maintains a free list tracking the IDs of unused pages
- Old pages are reused as soon as possible, so data volumes don't grow without bound
- LMDB tracks in-use status without locks
- "Full" MVCC can be extremely resource intensive
Internals
- Btree Operation
- Write-Ahead Logging
- Append-Only
- Copy-on-Write, LMDB-style
- Free Space Management
- Avoiding Compaction/Garbage Collection
- Transaction Handling
- Avoiding Locking
Btree Operation
How Append-Only/Copy-On-Write Works
- Updates are always performed bottom up
- Every branch node from the leaf to the root must be copied/modified for any leaf update
- Any node not on the path from the leaf to the root is unaltered
- The root node is always written last
In the Append-Only tree, new pages are always appended sequentially to the DB file
- While there's significant overhead for making complete copies of modified pages, the actual I/O is linear and relatively fast
- The root node is always the last page of the file, unless there was a crash
- Any root node can be found by seeking backward from the end of the file, and checking the page's header
- Recovery from a crash is relatively easy
- Everything from the last valid root to the beginning of the file is always pristine
- Anything between the end of the file and the last valid root is discarded
Append-Only disk usage is very inefficient
- Disk space usage grows without bound
- 99+% of the space will be occupied by old versions of the data
- The old versions are usually not interesting
- Reclaiming the old space requires a very expensive compaction phase
- New updates must be throttled until compaction completes
The LMDB Approach
- Still Copy-on-Write, but using two fixed root nodes
- Page 0 and Page 1 of the file, used in double-buffer fashion
- Even faster cold-start than Append-Only, no searching needed to find the last valid root node
- Any app always reads both pages and uses the one with the greater Transaction ID stamp in its header
- Consequently, only 2 outstanding versions of the DB exist, so this is not fully "multi-version"
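As a rough illustration of the double-buffered meta pages, the reader-side choice boils down to comparing transaction IDs. The struct below is simplified for illustration only; LMDB's real meta page carries more fields.

```c
/* Illustrative sketch only - not LMDB's actual meta-page layout.
 * Read both fixed meta pages and use the one stamped with the greater
 * transaction ID; the other page is the target of the next commit. */
struct meta { unsigned long txnid; unsigned long root_pgno; };

static const struct meta *live_meta(const struct meta *meta0, const struct meta *meta1)
{
    return (meta1->txnid > meta0->txnid) ? meta1 : meta0;
}
```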
Free Space Management
LMDB maintains two B+trees per root node
- One storing the user data, as illustrated above
- One storing lists of IDs of pages that have been freed in a given transaction
- Old, freed pages are used in preference to new pages, so the DB file size remains relatively static over time
- No compaction or garbage collection phase is ever needed
- Caveat: If a read transaction is open on a particular version of the DB, that version and every version after it are excluded from page reclaiming.
- Thus, long-lived read transactions should be avoided, otherwise the DB file size may grow rapidly, devolving into Append-Only behavior until the transactions are closed
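One practical way to observe this (a sketch, not from the talk) is to poll the environment info: the last used page number keeps climbing while a long-lived reader pins an old snapshot, and levels off once freed pages become reusable again.

```c
#include <stdio.h>
#include <lmdb.h>

/* Sketch: report how much of the map the DB file currently uses.
 * If a long-lived read transaction pins an old snapshot, me_last_pgno keeps
 * growing across write transactions instead of staying roughly flat. */
void report_usage(MDB_env *env)
{
    MDB_envinfo info;
    mdb_env_info(env, &info);
    printf("last page used: %zu, map size: %zu bytes\n",
           (size_t)info.me_last_pgno, (size_t)info.me_mapsize);
}
```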
Transaction Handling
- LMDB supports a single writer concurrent with many readers
- A single mutex serializes all write transactions
- The mutex is shared/multiprocess
- Readers run lockless and never block
- But for page reclamation purposes, readers are tracked
- Transactions are stamped with an ID which is a monotonically increasing integer
- The ID is only incremented for Write transactions that actually modify data
- If a Write transaction is aborted, or committed with no changes, the same ID will be reused for the next Write transaction
- Transactions take a snapshot of the currently valid meta page at the beginning of the transaction
- No matter what write transactions follow, a read transaction's snapshot will always point to a valid version of the DB
- The snapshot is totally isolated from subsequent writes
- This provides the Consistency and Isolation in ACID semantics
- The currently valid meta page is chosen based on the greatest transaction ID in each meta page
- The meta pages are page and CPU cache aligned
- The transaction ID is a single machine word
- The update of the transaction ID is atomic
- Thus, the Atomicity semantics of transactions are guaranteed
- During Commit, the data pages are written and then synchronously flushed before the meta page is updated
- Then the meta page is written synchronously
- Thus, when a commit returns "success", it is guaranteed that the transaction has been written intact
- This provides the Durability semantics
- If the system crashes before the meta page is updated, then the data updates are irrelevant
- For tracking purposes, Readers must acquire a slot in the readers table
- The readers table is also in a shared memory map, but separate from the main data map[1]
- This is a simple array recording the Process ID, Thread ID, and Transaction ID of the reader
- The array elements are CPU cache aligned, so each reader can access its own slot without interfering with any other reader[2]
- The first time a thread opens a read transaction, it must acquire a mutex to reserve a slot in the table
- The slot ID is stored in Thread Local Storage; subsequent read transactions performed by the thread need no further locking
- Write transactions use pages from the free list before allocating new disk pages
- Pages in the free list are used in order, oldest transaction first
- The readers table must be scanned to see if any reader is referencing an old transaction
- The writer doesn't need to lock the reader table when performing this scan - readers never block writers
- The only consequence of scanning with no locks is that the writer may see stale data
- This is irrelevant, newer readers are of no concern; only the oldest readers matter
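A minimal sketch of this reader/writer model using the public API (key and value contents are placeholders): the write transaction serializes on the writer mutex, while the read-only transaction just claims a reader slot and works against its snapshot.

```c
#include <lmdb.h>

/* Sketch: one write transaction (serialized by the writer mutex) followed by
 * a read-only transaction (lockless, snapshot-isolated). */
int put_and_get(MDB_env *env)
{
    MDB_txn *txn;
    MDB_dbi dbi;
    MDB_val key = { 3, (void *)"foo" }, val = { 3, (void *)"bar" }, out;
    int rc;

    /* Writer: copies the pages it touches, then commits by flipping the meta page. */
    if ((rc = mdb_txn_begin(env, NULL, 0, &txn))) return rc;
    if ((rc = mdb_dbi_open(txn, NULL, 0, &dbi)) ||
        (rc = mdb_put(txn, dbi, &key, &val, 0))) { mdb_txn_abort(txn); return rc; }
    if ((rc = mdb_txn_commit(txn))) return rc;

    /* Reader: sees a consistent snapshot and never blocks the writer. */
    if ((rc = mdb_txn_begin(env, NULL, MDB_RDONLY, &txn))) return rc;
    rc = mdb_get(txn, dbi, &key, &out);   /* out.mv_data points directly into the map */
    mdb_txn_abort(txn);                   /* read-only txns are simply aborted when done */
    return rc;
}
```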
- Dark Underside:
- Complications that textbook B+tree descriptions never mention
- Textbook examples assume uniform record sizes
- In real life, records are variable size
- Textbook examples use parent and sibling pointers in each page
- Supports fast iteration, range lookups
- With Copy-On-Write, only child pointers are allowed. Pages no longer have unique parents or siblings
- Page Splitting
- Where do we split? "It depends"
- For pages with "small" number of records, we tally up the sizes of each node up to the insertion point and split wherever things fit
- For pages with more records, we assume they are all small enough that just taking the N/2 slot will work
- Page Pointers
- Naive code mallocs pointer structs to point to pages and their parents
- Messy, requires extensive malloc/free activity while navigating up/down tree
- Refcounting overhead, NULL checks, etc...
- LMDB uses a cursor struct with a fixed-size[3] stack to record position in tree
- No malloc/free needed for navigation
- Cursor Tracking (a basic cursor sketch follows this list)
- When multiple cursors are open in the same transaction on the same DB, changes made by one may affect all the others
- Obvious example - Deleting a record that others point to
- BerkeleyDB sets a tombstone
- Other cursors need no special help, but you must schedule periodic compactions to reclaim space
- SQLite3 invalidates all open cursors
- Other cursors must seek to their old position again on next use
- LMDB updates all open cursors
- Other cursors are always immediately usable
- Freespace DB
- Using an additional B+tree to track the free pages in the B+tree presents special challenges
- The act of recording a free page in the tree may consume a free page
- Updating the freespace DB requires careful thought
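As referenced above, here is a basic cursor sketch (keys are placeholders) showing the kind of navigation the fixed-size page stack supports: position with a range lookup, then iterate forward without parent or sibling pointers in the pages themselves.

```c
#include <stdio.h>
#include <string.h>
#include <lmdb.h>

/* Sketch: range scan with a cursor. The cursor's fixed-size stack records the
 * path from root to leaf, so iteration needs no malloc/free and no sibling links. */
int scan_from(MDB_txn *txn, MDB_dbi dbi, const char *start)
{
    MDB_cursor *cur;
    MDB_val key = { strlen(start), (void *)start }, data;
    int rc = mdb_cursor_open(txn, dbi, &cur);
    if (rc) return rc;

    /* Position at the first key >= start, then walk forward in key order. */
    for (rc = mdb_cursor_get(cur, &key, &data, MDB_SET_RANGE);
         rc == 0;
         rc = mdb_cursor_get(cur, &key, &data, MDB_NEXT))
        printf("%.*s\n", (int)key.mv_size, (char *)key.mv_data);

    mdb_cursor_close(cur);
    return (rc == MDB_NOTFOUND) ? 0 : rc;   /* MDB_NOTFOUND just means end of range */
}
```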
Special Features
Reserve Mode
- Allocates space in write buffer for data of user-specified size, returns address
- Useful for data that is generated dynamically instead of statically copied
- Allows generated data to be written directly to DB, avoiding unnecessary memcpy
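A small sketch of how reserve mode is used through mdb_put with the MDB_RESERVE flag; the record length and the fill step are placeholders.

```c
#include <string.h>
#include <lmdb.h>

/* Sketch: reserve space for a value generated on the fly, then fill it in
 * place inside the write buffer - no temporary buffer, no extra memcpy. */
int put_generated(MDB_txn *txn, MDB_dbi dbi, MDB_val *key, size_t len)
{
    MDB_val data;
    data.mv_size = len;                       /* tell LMDB how much space to reserve */
    int rc = mdb_put(txn, dbi, key, &data, MDB_RESERVE);
    if (rc) return rc;
    /* data.mv_data now points at the reserved space; generate the record
     * directly into it before the transaction commits. */
    memset(data.mv_data, 0, len);             /* placeholder for the real generator */
    return 0;
}
```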
Fixed Mapping
- Uses a fixed address for the memory map
- Allows complex pointer-based data structures to be stored directly with minimal serialization
- Objects using persistent addresses can thus be read back and used directly, with no deserialization
Sub-Databases
- Store multiple independent named B+trees in a single LMDB environment
- A Sub-DB is simply a key/data pair in the main DB, where the data item is the root node of another tree
- Allows many related databases to be managed easily
- Transactions may span all of the Sub-DBs
- Used in back-mdb for the main data and all of the indices
- Used in SQLightning for multiple tables and indices
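A sketch of the API side of this: open two named sub-DBs and update both inside one transaction. The sub-DB names are placeholders, and mdb_env_set_maxdbs() must have been called before mdb_env_open() to allow named databases.

```c
#include <lmdb.h>

/* Sketch: open two named sub-databases in a single write transaction.
 * One commit covers changes to both - transactions span all sub-DBs. */
int open_subdbs(MDB_env *env, MDB_dbi *data_dbi, MDB_dbi *index_dbi)
{
    MDB_txn *txn;
    int rc;

    if ((rc = mdb_txn_begin(env, NULL, 0, &txn))) return rc;
    if ((rc = mdb_dbi_open(txn, "data", MDB_CREATE, data_dbi)) ||
        (rc = mdb_dbi_open(txn, "index", MDB_CREATE, index_dbi))) {
        mdb_txn_abort(txn);
        return rc;
    }
    return mdb_txn_commit(txn);
}
```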
Sorted Duplicates
- Allows multiple data values for a single key
- Values are stored in sorted order, with customizable comparison functions
- When the data values are all of a fixed size, the values are stored contiguously, with no extra headers
- maximizes storage efficiency and performance
- Implemented by the same code as SubDB support
- maximum coding efficiency
- Can be used to efficiently implement inverted indices
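A sketch of an inverted-index use of sorted duplicates: one key (a word) maps to many sorted, fixed-size values (document IDs). The sub-DB name and the fixed-size assumption are placeholders; opening with MDB_CREATE requires a write transaction.

```c
#include <lmdb.h>

/* Sketch: count the postings stored as sorted duplicates under one key.
 * MDB_DUPFIXED packs the fixed-size values contiguously, with no per-value header. */
int count_postings(MDB_txn *txn, MDB_val *word, size_t *count)
{
    MDB_dbi dbi;
    MDB_cursor *cur;
    MDB_val data;
    int rc;

    if ((rc = mdb_dbi_open(txn, "postings",
                           MDB_CREATE | MDB_DUPSORT | MDB_DUPFIXED, &dbi)))
        return rc;
    if ((rc = mdb_cursor_open(txn, dbi, &cur))) return rc;
    rc = mdb_cursor_get(cur, word, &data, MDB_SET);   /* position on the key */
    if (rc == 0)
        rc = mdb_cursor_count(cur, count);            /* number of duplicate values */
    mdb_cursor_close(cur);
    return rc;
}
```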
Atomic Hot Backup
- The entire database can be backed up live
- No need to stop updates while backups run
- The backup runs at the maximum speed of the target storage medium
- Essentially: write(outfd, map, mapsize);
- No memcpy's in or out of user space
- Pure DMA (Direct Memory Access) from the database to the backup
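A sketch of the backup API (destination path is a placeholder): mdb_env_copy() takes a consistent snapshot while writers keep running; mdb_env_copy2() with MDB_CP_COMPACT additionally omits free pages at the cost of more CPU.

```c
#include <lmdb.h>

/* Sketch: live, consistent backup of an open environment. */
int backup(MDB_env *env)
{
    return mdb_env_copy(env, "/backups/lmdb-snapshot");   /* sequential copy of the data file */
}
```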
Benchmarks at https://www.symas.com/symas-lmdb-tech-info
Conclusions
The combination of memory-mapped operation with MVCC is extremely potent
- Reduced administrative overhead
- no periodic cleanup / maintenance required
- no particular tuning required
- Reduced developer overhead
- code size and complexity drastically reduced
- Enhanced efficiency
- minimal CPU and I/O use
- allows for longer battery life on mobile devices
- allows for lower electricity/cooling costs in data centers
- allows more work to be done with less hardware
Futures
Work in progress
- Incremental backup
- Add a txnID stamp to the page header
- Copy all pages with txnID newer than N
- Write-Ahead Log
- Despite misgivings from BDB experience
- Prototype shows 30x improvement on synchronous write speed
- fsync[4] on a small log file is much faster than on a large DB file
Notes
[0] Single-level store explanation
From Howard Chu's 2013 MDB Paper:
One fundamental concept behind the MDB approach is known as "Single-Level
Store". The basic idea is to treat all of computer memory as a single address
space. Pages of storage may reside in primary storage (RAM) or in secondary
storage (disk) but the actual location is unimportant to the application. If
a referenced page is currently in primary storage the application can use it
immediately, if not a page fault occurs and the operating system brings the
page into primary storage. The concept was introduced in 1964 in the Multics
operating system but was generally abandoned by the early 1990s as data
volumes surpassed the capacity of 32 bit address spaces. (We last knew of it
in the Apollo DOMAIN operating system, though many other Multics-influenced
designs carried it on.) With the ubiquity of 64 bit processors today this
concept can again be put to good use. (Given a virtual address space limit of
63 bits that puts the upper bound of database size at 8 exabytes. Commonly
available processors today only implement 48 bit address spaces, limiting us
to 47 bits or 128 terabytes.)
Another operating system requirement for this approach to be viable is a
Unified BufferCache. While most POSIX-based operating systems have supported
an mmap() system call for many years, their initial implementations kept
memory managed by the VM subsystem separate from memory managed by the
filesystem cache. This was not only wasteful (again, keeping data cached in
two places at once) but also led to coherency problems - data modified
through a memory map was not visible using filesystem read() calls, or data
modified through a filesystem write() was not visible in the memory map. Most
modern operating systems now have filesystem and VM paging unified, so this
should not be a concern in most deployments.
There is also a deep dive, "LMDB Database File Sizes and Memory Utilization", that answers the following questions:
- Why does an LMDB database file grow to sizes that are sometimes substantially larger than expected?
- Why doesn’t the database file shrink when entries are deleted from the database?
- Why does my LDAP server’s VM usage skyrocket when an LMDB database is in use?
- Why does apparent memory usage of an LMDB process go so high?
- Why do LMDB database files in multi-master or replica servers often have different sizes if they are supposed to be identical copies of each other?
[1] Shared memory map versus the main data map
In the context of LMDB and similar systems, the terms "shared memory map" and "main data map" refer to different components of the database's memory-mapped architecture. Let's clarify the distinctions:
- Main Data Map:
- This refers to the memory-mapped representation of the actual database content.
- When LMDB loads, it memory-maps the database file, meaning it creates a direct byte-for-byte association between a region of memory and the file's content. This allows the database to be accessed and manipulated as if it were an in-memory data structure, even though the data is stored on disk.
- Read and write operations on the database translate to accesses on this memory-mapped region.
- Shared Memory Map:
- This refers to a separate memory-mapped region used for meta-information and coordination between processes or threads accessing the database.
- In LMDB, the readers table (which tracks active readers) is situated in the shared memory map. This table is essential for LMDB's MVCC (Multi-Version Concurrency Control) mechanism, ensuring that readers see a consistent snapshot of the database.
- The shared memory map is crucial for synchronization and coordination, especially in scenarios where multiple processes or threads want to access the database concurrently.
- As the name suggests, this memory region is "shared" across processes, enabling inter-process communication and coordination.
In essence, while the "main data map" is concerned with the actual data and its organization within the database, the "shared memory map" deals with meta-information, synchronization, and coordination mechanisms. Both are fundamental components of LMDB's architecture, but they serve distinct roles.
[2] CPU Cache Lines
A CPU cache line is the unit of data transfer between the main memory and the CPU cache. It's essentially a block of memory that gets loaded into the CPU cache whenever a particular address is accessed. The rationale behind this is spatial locality: if one memory location is accessed, nearby memory locations are likely to be accessed soon after.
Details about CPU Cache Lines:
- Size: The size of a cache line varies by CPU architecture but is typically between 32 and 256 bytes, with 64 bytes being a common size for many modern processors.
- Locality Principle: Due to the principle of spatial locality, when a specific byte is fetched, the entire cache line (e.g., 64 bytes) containing that byte is loaded into the cache. This ensures that subsequent accesses to nearby memory locations can be served directly from the cache rather than fetching from the slower main memory.
- Cache Line Bouncing: This occurs in multi-core systems where multiple cores are trying to write to the same cache line. Since each core has its own cache, the cache line can "bounce" between caches, causing performance degradation.
LMDB and CPU Cache Alignment:
To optimize for CPU cache performance, LMDB designs its transaction slot data structure to be cache-aligned. This means:
- Size of Slot: The size of each slot in the readers table is designed to match the size of a CPU cache line. For instance, if the cache line is 64 bytes, the slot size would be 64 bytes.
- Avoiding Cache Contention: By ensuring each slot matches the cache line size, LMDB ensures that when one reader accesses its slot, it doesn't inadvertently pull in data associated with another reader. This avoids cache contention and ensures that one reader's activity doesn't cause cache misses for another reader.
- Direct Mapping: Cache alignment also minimizes the possibility of two different slots mapping to the same cache line, a scenario that could lead to cache line bouncing in multi-core systems.
- Struct Layout: To achieve cache alignment, LMDB would define its struct for the readers table in a way that the combined size of the Process ID, Thread ID, and Transaction ID, along with any other necessary metadata, would sum up to the size of a cache line.
By keeping the transaction slot data structure cache-aligned, LMDB can efficiently utilize the CPU cache, minimize contention, and enhance the performance of concurrent read transactions.
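As a rough illustration (a hypothetical layout, not LMDB's actual MDB_reader definition), a cache-aligned reader slot might be padded out to a full 64-byte line:

```c
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size; real hardware varies */

/* Hypothetical reader-table slot padded to one cache line, so two readers
 * updating adjacent slots never share (or bounce) a cache line. */
struct reader_slot {
    uint64_t txn_id;    /* transaction ID the reader has open */
    uint32_t pid;       /* process ID */
    uint32_t tid;       /* thread ID */
    char     pad[CACHE_LINE - sizeof(uint64_t) - 2 * sizeof(uint32_t)];
};

_Static_assert(sizeof(struct reader_slot) == CACHE_LINE,
               "slot must fill exactly one cache line");
```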
[3] LMDB cursor stack limit
Aside - why does QuickSort have a max stack size of 50? The size of the stack you need is log2 of the number of elements you have. So with a max of 50, the qsort algorithm is good for 2^25 elements.
In LMDB, the cursor stack is limited to 32 elements. For any tree with at minimum 2 children per interior page, this can accommodate about 4 billion (2^32) records.
[4] When does LMDB fsync
In LMDB, fsync() (or its equivalent depending on the platform) is used to ensure that all modified data and metadata are flushed from the operating system's buffers to the disk, guaranteeing durability. LMDB calls fsync() in the following scenarios:
- Committing Transactions: When a write transaction is committed, LMDB ensures that all changes are flushed to disk. This is done by calling fsync() on the main database file.
- Environment Open: When the LMDB environment is opened, fsync() is called to ensure any pending writes from previous sessions are committed to disk.
- Environment Close: When closing the LMDB environment, fsync() is invoked to make sure all data is safely written to disk.
Optimizations by the OS:
- When LMDB issues an fsync on the entire database file, the OS doesn't necessarily write every single page of the file to disk. Instead, the OS writes only the "dirty" pages (those that have been modified since the last write to disk) to the storage medium. This is a significant optimization because it minimizes the amount of actual I/O performed.
- Modern file systems and operating systems often employ additional optimizations to group or order disk writes in a way that minimizes disk seek times and maximizes throughput.
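A sketch of the related knobs (the flag choice is an illustrative trade-off, not a recommendation): MDB_NOMETASYNC defers only the meta-page flush, preserving structural integrity while risking the loss of the last commit on a crash, and mdb_env_sync() forces a flush explicitly.

```c
#include <lmdb.h>

/* Sketch: trade a little durability for commit speed. With MDB_NOMETASYNC the
 * meta-page flush is deferred; a crash may undo the most recent commit, but
 * the DB structure stays intact. (MDB_NOSYNC defers everything and is riskier.) */
int open_deferred_sync(MDB_env *env, const char *path)
{
    return mdb_env_open(env, path, MDB_NOMETASYNC, 0664);
}

int checkpoint(MDB_env *env)
{
    return mdb_env_sync(env, 1);   /* force an fsync of the data file now */
}
```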