
How LMDB works

A gentle introduction to LMDB's design and internals

These are largely my notes from the talk Howard Chu - LMDB [The Databaseology Lectures - CMU Fall 2015]. I also dug deeper into some of the areas.

Motivation - problems dealing with BDB

BDB slapd backend always required careful, complex tuning

Backend caching significantly increased the overall complexity of the backend code

The caches were not always beneficial, and were sometimes detrimental

Obvious Solutions

Design Approach

Internals

Btree Operation

How Append-Only/Copy-On-Write Works

In the Append-Only tree, new pages are always appended sequentially to the DB file

Append-Only disk usage is very inefficient
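
To make the append-only picture concrete, here is a toy sketch of a copy-on-write update (illustrative only, not LMDB's code; read_page and append_page are assumed helpers): changing a leaf means appending a new copy of that leaf and of every ancestor up to the root, while the old pages are left untouched for readers.

    #include <stdint.h>

    typedef uint64_t pgno_t;

    typedef struct page {
        pgno_t children[16];   /* child page numbers (interior) or record refs (leaf) */
        int    nkeys;
    } page;

    /* Hypothetical I/O helpers: read an existing page; append a page at
     * the current end of the file and return its new page number. */
    extern page  *read_page(pgno_t no);
    extern pgno_t append_page(const page *p);

    /* Copy-on-write update along a root-to-leaf path. path[i] is the page
     * number at depth i, slots[i] the child slot taken at that level. */
    pgno_t cow_update(const pgno_t *path, const int *slots, int depth,
                      const page *new_leaf)
    {
        pgno_t child = append_page(new_leaf);      /* new leaf goes at EOF */
        for (int i = depth - 1; i >= 0; i--) {
            page copy = *read_page(path[i]);       /* copy the old ancestor */
            copy.children[slots[i]] = child;       /* point it at the new child */
            child = append_page(&copy);            /* append the copy as well */
        }
        return child;                              /* page number of the new root */
    }

Every update therefore appends a full leaf-to-root chain of new pages and never reclaims the old ones, which is exactly why the file only grows.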

The LMDB Approach

Free Space Management

LMDB maintains two B+trees per root node: one holding the application data, and a second one tracking the pages freed by earlier transactions
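
A rough mental model of that second tree (illustrative, not LMDB's internal declarations): it maps the ID of a transaction to the list of page numbers that transaction freed, and a freed page becomes reusable only once no active reader could still be looking at it.

    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t txnid_t;
    typedef uint64_t pgno_t;

    /* Hypothetical shape of one entry in the free-pages B+tree:
     * key = the transaction that retired the pages, value = their numbers. */
    typedef struct freelist_entry {
        txnid_t txnid;      /* transaction that freed these pages */
        size_t  npages;     /* how many page numbers follow       */
        pgno_t  pages[];    /* the freed page numbers             */
    } freelist_entry;

    /* Assumed helper: snapshot (transaction ID) of the oldest active reader. */
    extern txnid_t oldest_reader_txn(void);

    /* Pages freed by transaction T can be handed to a new writer only when
     * every reader's snapshot is newer than T. */
    static int reusable(const freelist_entry *e)
    {
        return e->txnid < oldest_reader_txn();
    }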

Transaction Handling

Special Features

Reserve Mode

Fixed Mapping

Sub-Databases

Sorted Duplicates

Atomic Hot Backup
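
The feature names above map onto LMDB's public C API roughly as sketched below (error handling omitted; the sub-database names and backup path are illustrative). Fixed Mapping is an environment-open flag (MDB_FIXEDMAP) and is not shown.

    #include <lmdb.h>
    #include <string.h>

    void special_features_demo(MDB_env *env)
    {
        MDB_txn *txn;
        MDB_dbi logs, blobs;
        MDB_val key, data;

        mdb_txn_begin(env, NULL, 0, &txn);

        /* Sub-databases: named B+trees inside the same environment
         * (needs mdb_env_set_maxdbs() before mdb_env_open()). */
        mdb_dbi_open(txn, "logs",  MDB_CREATE | MDB_DUPSORT, &logs);
        mdb_dbi_open(txn, "blobs", MDB_CREATE, &blobs);

        /* Sorted duplicates: MDB_DUPSORT lets one key hold many values,
         * kept in sorted order. */
        key.mv_size = 4;  key.mv_data = "user";
        data.mv_size = 5; data.mv_data = "login";
        mdb_put(txn, logs, &key, &data, 0);

        /* Reserve mode: MDB_RESERVE returns a pointer into the map and the
         * caller fills the value in place, skipping one memcpy. (Not
         * allowed on MDB_DUPSORT databases, hence the separate dbi.) */
        key.mv_size = 3;  key.mv_data = "img";
        data.mv_size = 64;
        mdb_put(txn, blobs, &key, &data, MDB_RESERVE);
        memset(data.mv_data, 0, 64);

        mdb_txn_commit(txn);

        /* Atomic hot backup: copy a consistent snapshot of the environment
         * while it remains live. */
        mdb_env_copy(env, "/tmp/lmdb-backup");
    }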

Benchmarks at https://www.symas.com/symas-lmdb-tech-info

Conclusions

The combination of memory-mapped operation with MVCC is extremely potent

Futures

Work in progress

Notes

[0] Single-level store explanation

From Howard Chu's 2013 MDB Paper:

One fundamental concept behind the MDB approach is known as "Single-Level
Store". The basic idea is to treat all of computer memory as a single address
space. Pages of storage may reside in primary storage (RAM) or in secondary
storage (disk) but the actual location is unimportant to the application. If
a referenced page is currently in primary storage the application can use it
immediately, if not a page fault occurs and the operating system brings the
page into primary storage. The concept was introduced in 1964 in the Multics
operating system but was generally abandoned by the early 1990s as data
volumes surpassed the capacity of 32 bit address spaces. (We last knew of it
in the Apollo DOMAIN operating system, though many other Multics-influenced
designs carried it on.) With the ubiquity of 64 bit processors today this
concept can again be put to good use. (Given a virtual address space limit of
63 bits that puts the upper bound of database size at 8 exabytes. Commonly
available processors today only implement 48 bit address spaces, limiting us
to 47 bits or 128 terabytes.)

Another operating system requirement for this approach to be viable is a
Unified BufferCache. While most POSIX-based operating systems have supported
an mmap() system call for many years, their initial implementations kept
memory managed by the VM subsystem separate from memory managed by the
filesystem cache. This was not only wasteful (again, keeping data cached in
two places at once) but also led to coherency problems - data modified
through a memory map was not visible using filesystem read() calls, or data
modified through a filesystem write() was not visible in the memory map. Most
modern operating systems now have filesystem and VM paging unified, so this
should not be a concern in most deployments.

A deep dive into "LMDB Database File Sizes and Memory Utilization" answers the following questions:

  • Why does an LMDB database file grow to sizes that are sometimes substantially larger than expected?
  • Why doesn’t the database file shrink when entries are deleted from the database?
  • Why does my LDAP server’s VM usage skyrocket when an LMDB database is in use?
  • Why does apparent memory usage of an LMDB process go so high?
  • Why do LMDB database files in multi-master or replica servers often have different sizes if they are supposed to be identical copies of each other?

[1] Shared memory map versus the main data map

In the context of LMDB and similar systems, the terms "shared memory map" and "main data map" refer to different components of the database's memory-mapped architecture. Let's clarify the distinctions:

  1. Main Data Map:
    • This refers to the memory-mapped representation of the actual database content.
    • When LMDB loads, it memory-maps the database file, meaning it creates a direct byte-for-byte association between a region of memory and the file's content. This allows the database to be accessed and manipulated as if it were an in-memory data structure, even though the data is stored on disk.
    • Read and write operations on the database translate to accesses on this memory-mapped region.
  2. Shared Memory Map:
    • This refers to a separate memory-mapped region used for meta-information and coordination between processes or threads accessing the database.
    • In LMDB, the readers table (which tracks active readers) is situated in the shared memory map. This table is essential for LMDB's MVCC (Multi-Version Concurrency Control) mechanism, ensuring that readers see a consistent snapshot of the database.
    • The shared memory map is crucial for synchronization and coordination, especially in scenarios where multiple processes or threads want to access the database concurrently.
    • As the name suggests, this memory region is "shared" across processes, enabling inter-process communication and coordination.

In essence, while the "main data map" is concerned with the actual data and its organization within the database, the "shared memory map" deals with meta-information, synchronization, and coordination mechanisms. Both are fundamental components of LMDB's architecture, but they serve distinct roles.
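
In API terms, both maps come into existence when the environment is opened: the data file (data.mdb) becomes the main data map, and the lock file (lock.mdb) holds the shared reader table. A minimal sketch of a reader, assuming a database directory ./db already populated with a "foo" key (error handling omitted):

    #include <lmdb.h>
    #include <stdio.h>

    int main(void)
    {
        MDB_env *env;
        MDB_txn *txn;
        MDB_dbi dbi;
        MDB_val key, data;

        mdb_env_create(&env);
        mdb_env_set_mapsize(env, 1UL << 30);   /* size of the main data map */

        /* Maps ./db/data.mdb (main data map) and ./db/lock.mdb
         * (shared memory map containing the reader table). */
        mdb_env_open(env, "./db", 0, 0664);

        /* Beginning a read-only transaction claims a slot in the reader
         * table, pinning the snapshot this reader will see. */
        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);

        key.mv_size = 3;
        key.mv_data = "foo";
        if (mdb_get(txn, dbi, &key, &data) == 0)
            printf("foo -> %.*s\n", (int)data.mv_size, (char *)data.mv_data);

        mdb_txn_abort(txn);    /* ends this reader's snapshot */
        mdb_env_close(env);
        return 0;
    }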

[2] CPU Cache Lines
A CPU cache line is the unit of data transfer between the main memory and the CPU cache. It's essentially a block of memory that gets loaded into the CPU cache whenever a particular address is accessed. The rationale behind this is spatial locality: if one memory location is accessed, nearby memory locations are likely to be accessed soon after.

Details about CPU Cache Lines:

  1. Size: The size of a cache line varies by CPU architecture but is typically between 32 and 256 bytes, with 64 bytes being a common size for many modern processors.
  2. Locality Principle: Due to the principle of spatial locality, when a specific byte is fetched, the entire cache line (e.g., 64 bytes) containing that byte is loaded into the cache. This ensures that subsequent accesses to nearby memory locations can be served directly from the cache rather than fetching from the slower main memory.
  3. Cache Line Bouncing: This occurs in multi-core systems where multiple cores are trying to write to the same cache line. Since each core has its own cache, the cache line can "bounce" between caches, causing performance degradation.

LMDB and CPU Cache Alignment:

To optimize for CPU cache performance, LMDB designs its transaction slot data structure to be cache-aligned. This means:

  1. Size of Slot: The size of each slot in the readers table is designed to match the size of a CPU cache line. For instance, if the cache line is 64 bytes, the slot size would be 64 bytes.
  2. Avoiding Cache Contention: By ensuring each slot matches the cache line size, LMDB ensures that when one reader accesses its slot, it doesn't inadvertently pull in data associated with another reader. This avoids cache contention and ensures that one reader's activity doesn't cause cache misses for another reader.
  3. Direct Mapping: Cache alignment also minimizes the possibility of two different slots mapping to the same cache line, a scenario that could lead to cache line bouncing in multi-core systems.
  4. Struct Layout: To achieve cache alignment, LMDB defines the reader-table slot struct so that the Process ID, Thread ID, and Transaction ID, plus padding, occupy exactly one cache line (see the sketch after this note).

By keeping the transaction slot data structure cache-aligned, LMDB can efficiently utilize the CPU cache, minimize contention, and enhance the performance of concurrent read transactions.
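
A rough sketch of what a cache-line-aligned reader slot can look like (illustrative, not LMDB's actual declaration): the fields a reader updates are padded out to a whole cache line, so two readers never share a line.

    #include <stdint.h>

    #define CACHE_LINE 64   /* assumed cache line size in bytes */

    /* What one reader actually records: the transaction snapshot it is
     * reading plus the process and thread that own the slot. */
    struct reader_body {
        uint64_t txnid;   /* snapshot (transaction ID) this reader sees */
        uint32_t pid;     /* owning process */
        uint64_t tid;     /* owning thread  */
    };

    /* The slot stored in the shared reader table: the union pads the body
     * up to a multiple of the cache line, so updating one slot never
     * dirties the cache line holding a neighbouring reader's slot. */
    union reader_slot {
        struct reader_body body;
        char pad[(sizeof(struct reader_body) + CACHE_LINE - 1)
                 & ~(CACHE_LINE - 1)];
    };

    /* With a 64-byte cache line, sizeof(union reader_slot) == 64, so an
     * array of slots lines up one reader per cache line. */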

[3] LMDB cursor stack limit

Aside - why does QuickSort have a max stack size of 50? The stack depth you need grows with log2 of the number of elements, and each pending partition takes two stack slots (its low and high bounds), so with a max of 50 the qsort algorithm is good for roughly 2^25 elements.

In LMDB, the cursor stack is limited to 32 entries. For any tree with a minimum of 2 leaf nodes per interior page, that depth can accommodate about 4B (2^32) records.
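
A quick back-of-the-envelope check of that claim (plain arithmetic, not LMDB code): a worst-case fan-out of 2 per level and a stack depth of 32 gives 2^32 ≈ 4.3 billion reachable entries, and real pages hold far more than 2 entries each, so the practical ceiling is much higher.

    #include <stdio.h>
    #include <stdint.h>

    /* Minimum number of leaf entries reachable with `depth` levels when
     * every interior page has at least `fanout` children. */
    static uint64_t min_entries(uint64_t fanout, unsigned depth)
    {
        uint64_t n = 1;
        while (depth--)
            n *= fanout;      /* assumes the result fits in 64 bits */
        return n;
    }

    int main(void)
    {
        /* LMDB's cursor stack depth of 32 with the degenerate fan-out of 2. */
        printf("%llu\n", (unsigned long long)min_entries(2, 32));  /* 4294967296 */
        return 0;
    }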

[4] When does LMDB fsync

In LMDB, fsync() (or its equivalent depending on the platform) is used to ensure that all modified data and metadata are flushed from the operating system's buffers to the disk, guaranteeing durability. LMDB calls fsync() in the following scenarios:

  1. Committing Transactions: When a write transaction is committed, LMDB ensures that all changes are flushed to disk. This is done by calling fsync() on the main database file.
  2. Environment Open: When the LMDB environment is opened, fsync() is called to ensure any pending writes from previous sessions are committed to disk.
  3. Environment Close: When closing the LMDB environment, fsync() is invoked to make sure all data is safely written to disk.
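
How aggressively LMDB flushes can be tuned when the environment is opened; a small sketch of the relevant knobs, assuming the trade-off between durability and throughput is acceptable to the application:

    #include <lmdb.h>

    /* Durability knobs around fsync():
     *  - default:        flush data and meta page on every committed write txn
     *  - MDB_NOMETASYNC: skip the separate flush of the meta page at commit
     *  - MDB_NOSYNC:     skip flushing at commit entirely; call
     *                    mdb_env_sync() yourself at explicit durability points
     */
    void open_relaxed(MDB_env **envp, const char *path)
    {
        mdb_env_create(envp);
        mdb_env_open(*envp, path, MDB_NOSYNC, 0664);
    }

    void checkpoint(MDB_env *env)
    {
        /* Force an explicit flush of the data file to disk now. */
        mdb_env_sync(env, 1);
    }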

Optimizations by the OS: