Introduction
Main-memory databases (MMDBs) are database systems where the primary copy of the data resides in RAM. This contrasts with traditional databases, which primarily store data on disk and use RAM as a buffer. MMDBs are designed for applications requiring very high transaction throughput and low latency, which disk-based systems often cannot achieve.
Disk I/O Bottlenecks
The Problem with Disk Access
Traditional disk-based databases suffer from a significant performance bottleneck: disk I/O.
- Latency: Disk access latency (the time it takes to retrieve data from a disk) is relatively high, typically around 10 milliseconds. This is orders of magnitude slower than accessing data in RAM.
- Throughput Limitation: Latency directly impacts throughput (the number of transactions per second). Each disk I/O introduces a delay, limiting the overall transaction rate.
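A back-of-the-envelope calculation makes the gap concrete. The sketch below uses illustrative latency figures (roughly 10 ms per random disk I/O, roughly 100 ns per RAM access), not measurements of any particular system:

```python
# Illustrative latencies (assumptions, not measurements):
DISK_LATENCY_S = 10e-3   # ~10 ms per random magnetic-disk I/O
RAM_LATENCY_S = 100e-9   # ~100 ns per main-memory access

# If every transaction requires one serial disk I/O, throughput is capped at:
max_tps_disk = 1 / DISK_LATENCY_S   # ~100 transactions/second
max_tps_ram = 1 / RAM_LATENCY_S     # ~10,000,000 accesses/second

print(f"Disk-bound: {max_tps_disk:,.0f} tx/s")
print(f"RAM-bound:  {max_tps_ram:,.0f} accesses/s")
print(f"Ratio: {max_tps_ram / max_tps_disk:,.0f}x")
```

Even with generous assumptions, a serial disk I/O per transaction limits the system to the order of a hundred transactions per second, several orders of magnitude below what memory-resident data permits.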
Why Disk I/O is a Bottleneck
- Reads: When data is not in the buffer, a read from disk is necessary, incurring the latency penalty.
- Transaction Commits: Before a transaction can commit, log records must be written to stable storage (usually disk). This write operation is another source of I/O delay.
Optimizations in Main-Memory Databases
MMDBs leverage the speed of RAM and employ specific design choices to minimize overhead and maximize performance.
Data Structures
- Pointers: MMDBs can extensively use pointers. Unlike disk-based systems where traversing multiple disk pages is costly, RAM allows for efficient pointer chasing across memory locations.
- B+-trees and Other Structures: Although main memory is much faster than disk, there is still a speed gap between cache memory and main memory; data is transferred from main memory to cache in units of a cache line (about 64 bytes). B+-trees therefore remain valuable, especially when designed with small node sizes that fit within a cache line.
- Other specialized data structures optimized for in-memory access can also be used. Because no disk access is involved, these can be deeper than typical B+-trees.
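To see why cache-line-sized nodes imply small fanout and deeper trees, the sketch below estimates the fanout of an internal node sized to one 64-byte cache line. The 8-byte key and pointer sizes are assumptions for illustration:

```python
import math

# Assumed sizes (illustrative): 8-byte keys, 8-byte child pointers.
CACHE_LINE_BYTES = 64
KEY_BYTES = 8
PTR_BYTES = 8

# An internal node with n keys holds n+1 child pointers:
#   n*KEY_BYTES + (n+1)*PTR_BYTES <= CACHE_LINE_BYTES
max_keys = (CACHE_LINE_BYTES - PTR_BYTES) // (KEY_BYTES + PTR_BYTES)
print(f"Keys per cache-line node: {max_keys}")   # 3 keys -> 4 children

# With fanout 4, the tree is far deeper than a disk-oriented B+-tree
# (whose fanout is in the hundreds), but each node visit touches only
# one cache line, which is cheap in RAM.
n_records = 10_000_000
depth = math.ceil(math.log(n_records, max_keys + 1))
print(f"Approx. depth for {n_records:,} records: {depth}")
```

A deeper tree would be ruinous on disk (one I/O per level) but is acceptable in memory, where each level costs only a cache-line fetch.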
No Buffer Pinning
- Traditional Databases: In disk-based systems, buffer pages must be “pinned” in memory to prevent them from being swapped out before data is accessed. This adds overhead.
- MMDBs: Since data is always in RAM, pinning is unnecessary. There’s no risk of a page fault requiring a disk access. This eliminates the overhead of pinning and unpinning buffer pages.
Query Processing
- Space Optimization: Query processing techniques in MMDBs are designed to minimize space overhead. Excessive memory usage could lead to swapping to disk (which would defeat the purpose of an MMDB), slowing down query processing.
- Avoid Paging: Strategies are employed to prevent the operating system from paging parts of the database to disk (swap space). This is crucial for maintaining high performance.
Locking/Latching
- New Bottlenecks: With the disk I/O bottleneck reduced, other operations can become performance limitations. Locking and latching (mechanisms for concurrency control) are prime examples.
- Optimized Implementations: MMDBs require highly optimized implementations of locking and latching to avoid becoming new bottlenecks. This might involve using specialized, lightweight synchronization primitives.
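One common lightweight primitive is a spin latch: because latches in an MMDB are held only for a few instructions, busy-waiting is cheaper than descheduling the thread. The sketch below is a minimal illustration built on Python's non-blocking lock acquire; it is a toy model, not how a production engine would implement latching:

```python
import threading

class SpinLatch:
    """Busy-wait latch for very short critical sections (toy sketch)."""
    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self):
        # Spin: repeatedly try to grab the latch instead of blocking.
        while not self._lock.acquire(blocking=False):
            pass  # in C/C++ this loop would issue a CPU 'pause' hint

    def release(self):
        self._lock.release()

counter = 0
latch = SpinLatch()

def bump(n):
    global counter
    for _ in range(n):
        latch.acquire()
        counter += 1        # critical section: a handful of instructions
        latch.release()

threads = [threading.Thread(target=bump, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 4000: no updates lost despite concurrency
```

The design point is the trade-off: spinning wastes CPU while waiting, so it only pays off when the expected hold time is shorter than the cost of a context switch, which is exactly the situation for in-memory latches.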
Recovery
- Reduced Page Writes: In MMDBs, pages are rarely written out to make space for other pages (because all data fits in RAM). This significantly changes the recovery process.
- Log-Based Recovery Still Needed: Although data resides primarily in main memory, log records must still be written to stable storage (usually disk).
- Log Bottleneck: The process of writing log records to stable storage before a transaction commits can become a significant bottleneck, even in an MMDB.
- Non-Volatile RAM (NVRAM): Using NVRAM (e.g., battery-backed RAM) for the log buffer can significantly reduce commit time. Writes to NVRAM are much faster than disk writes.
- Stable Storage: In MMDBs, log records must be written to stable storage before a transaction commits.
Group Commit
Addressing the Log Bottleneck
Group commit is a technique to reduce the overhead of writing log records to stable storage.
The Mechanism
- Delayed Commit: Instead of committing a transaction immediately when it completes, the system waits.
- Accumulation: The system accumulates multiple completed transactions into a “group.”
- Batch Write: The log records for all transactions in the group are written to stable storage in a single (or fewer) I/O operation(s). This often results in writing nearly full blocks to the log.
- Bounded Waiting: A maximum group size and a maximum waiting time ensure that blocks are written to stable storage without excessive delay.
Benefits
- Reduced I/O Operations: Fewer I/O operations per committed transaction, significantly improving throughput.
- Fuller Log Blocks: More efficient use of log disk space, as blocks are typically written when they are nearly full.
Trade-offs
- Increased Commit Latency: Transactions experience a slight delay before they are fully committed (while waiting for the group to form). This delay is usually small (e.g., milliseconds) and acceptable for many applications.
- Eliminating the Delay: This delay can be eliminated if the system supports nonvolatile RAM buffers for log records.
- Broader Applicability: Group commit is also useful in database systems with disk-resident data.
Example
Imagine transactions T1, T2, and T3 complete within a short time window. Instead of writing log records for each separately, group commit might:
- Wait for a small time interval or until a certain number of transactions have completed.
- Combine the log records for T1, T2, and T3 into a single block.
- Write that block to the log disk in one I/O operation.
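The steps above can be sketched as a small simulation. This is a simplified, single-threaded model (a real implementation would also flush on a timer and handle partially filled blocks); class and method names are illustrative:

```python
class GroupCommitLog:
    """Toy model of group commit: batch log records into one stable write."""
    def __init__(self, group_size=3):
        self.group_size = group_size
        self.pending = []   # log records awaiting a batch write
        self.flushes = 0    # number of stable-storage writes performed

    def commit(self, txn_log_record):
        self.pending.append(txn_log_record)
        if len(self.pending) >= self.group_size:
            self.flush()    # in practice, also triggered by a timeout

    def flush(self):
        if self.pending:
            self.flushes += 1   # one I/O writes the whole group's records
            self.pending.clear()

log = GroupCommitLog(group_size=3)
for t in ["T1", "T2", "T3", "T4", "T5", "T6"]:
    log.commit(f"<{t} commit>")
print(log.flushes)  # 2 flushes for 6 transactions, instead of 6
```

With a group size of three, six transactions cost two log I/Os rather than six, which is precisely the throughput gain group commit targets.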
Flash storage
- If flash storage is used instead of magnetic disk for storing log records, the commit delay is significantly reduced.
- Group commit can still be useful since it minimizes the number of pages written.
- Flash storage systems remap a logical page to a pre-erased physical page, avoiding delay at the time the page is written; however, the erase operation must eventually be performed as part of garbage collection of old versions of pages.
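The remapping idea can be sketched with a toy flash translation layer. This is a deliberately simplified model (real FTLs operate on erase blocks and handle wear leveling); all names here are hypothetical:

```python
class ToyFTL:
    """Toy flash translation layer: writes remap to pre-erased pages."""
    def __init__(self, num_physical_pages):
        self.mapping = {}                            # logical -> physical page
        self.free = list(range(num_physical_pages))  # pre-erased pages
        self.stale = []                              # old versions awaiting erase

    def write(self, logical_page):
        new_phys = self.free.pop(0)      # fast path: no erase on write
        old_phys = self.mapping.get(logical_page)
        if old_phys is not None:
            self.stale.append(old_phys)  # old version marked stale, erase deferred
        self.mapping[logical_page] = new_phys

    def garbage_collect(self):
        # Background step: erase stale pages, returning them to the free pool.
        erased = len(self.stale)
        self.free.extend(self.stale)
        self.stale.clear()
        return erased

ftl = ToyFTL(num_physical_pages=8)
ftl.write(0)                    # first write of logical page 0
ftl.write(0)                    # rewrite: remapped, old page becomes stale
print(len(ftl.stale))           # 1 stale page awaiting erase
print(ftl.garbage_collect())    # 1 page erased in the background
```

The write path never waits for an erase; the expensive operation is deferred to garbage collection, which is why flash commit latency stays low until the free pool runs short.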