Storage

Storage engine internals: LSM trees, B-trees, SSTables, WAL, memtables, compaction, and transactions.

Provides components modeling the internal data structures of storage engines: WAL, Memtable, SSTable, LSM Tree, B-Tree, and Transaction Manager. These enable simulation of compaction storms, read/write amplification tradeoffs, and transaction contention.

BTree

BTree(
    name: str,
    *,
    order: int = 128,
    disk: Any | None = None,
    page_read_latency: float = 0.001,
    page_write_latency: float = 0.002,
)

Bases: Entity

B-tree index with page-level I/O cost simulation.

Each tree traversal costs depth * page_read_latency in simulated I/O time. Writes additionally pay page_write_latency per modified page. Node splits create extra write overhead.

Parameters:

Name Type Description Default
name str

Entity name.

required
order int

Maximum number of children per internal node. Leaf nodes hold at most (order - 1) keys.

128
disk Any | None

Optional Resource for disk I/O contention.

None
page_read_latency float

Seconds per page read.

0.001
page_write_latency float

Seconds per page write.

0.002

Example::

btree = BTree("index", order=128)
sim = Simulation(entities=[btree], ...)

depth property

depth: int

Current depth of the B-tree.

size property

size: int

Total number of keys in the tree.

stats property

stats: BTreeStats

Frozen snapshot of B-tree statistics.

get

get(key: str) -> Generator[float, None, Any | None]

Look up a key, yielding page read latency for each tree level.

get_sync

get_sync(key: str) -> Any | None

Look up without yielding latency.

put

put(key: str, value: Any) -> Generator[float]

Insert or update a key-value pair, yielding I/O latency.

Traverses from root to leaf (page reads), then writes the leaf page. May split nodes, adding extra page writes.
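The cost model above is simple enough to write down. This arithmetic sketch assumes the default latencies and one extra page write per split, a simplification of the real accounting::

```python
PAGE_READ_LATENCY = 0.001   # seconds, the documented default
PAGE_WRITE_LATENCY = 0.002

def get_cost(depth: int) -> float:
    # one page read per level, root to leaf
    return depth * PAGE_READ_LATENCY

def put_cost(depth: int, splits: int = 0) -> float:
    # root-to-leaf reads, one leaf write, plus one write per node split
    return depth * PAGE_READ_LATENCY + (1 + splits) * PAGE_WRITE_LATENCY

assert abs(get_cost(3) - 0.003) < 1e-12
assert abs(put_cost(3, splits=1) - 0.007) < 1e-12
```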

put_sync

put_sync(key: str, value: Any) -> None

Insert without yielding latency.

delete

delete(key: str) -> Generator[float, None, bool]

Delete a key, yielding I/O latency.

Returns True if the key was found and deleted, False otherwise. Uses lazy deletion (mark as None) for simplicity.

scan

scan(
    start_key: str, end_key: str
) -> Generator[float, None, list[tuple[str, Any]]]

Range scan, yielding I/O latency.

Traverses to the start leaf, then scans sequentially.

handle_event

handle_event(event: Event) -> None

BTree does not process events directly.

BTreeStats dataclass

BTreeStats(
    reads: int = 0,
    writes: int = 0,
    page_reads: int = 0,
    page_writes: int = 0,
    node_splits: int = 0,
    tree_depth: int = 0,
    total_keys: int = 0,
)

Frozen snapshot of B-tree statistics.

Attributes:

Name Type Description
reads int

Total get operations.

writes int

Total put/delete operations.

page_reads int

Total disk page reads.

page_writes int

Total disk page writes.

node_splits int

Total node split operations.

tree_depth int

Current depth of the tree.

total_keys int

Total number of keys stored.

CompactionStrategy

Bases: ABC

Strategy for deciding when and what to compact.

should_compact abstractmethod

should_compact(levels: list[list[SSTable]]) -> bool

Return True if compaction should be triggered.

select_compaction abstractmethod

select_compaction(
    levels: list[list[SSTable]],
) -> tuple[int, list[SSTable]]

Select which level and SSTables to compact.

Returns:

Type Description
tuple[int, list[SSTable]]

(source_level, sstables_to_compact) tuple.

FIFOCompaction

FIFOCompaction(max_total_sstables: int = 100)

Bases: CompactionStrategy

Drop oldest SSTables when total count exceeds threshold.

Simple TTL-like compaction for time-series data where old data can be discarded.

Parameters:

Name Type Description Default
max_total_sstables int

Maximum total SSTables across all levels.

100

LeveledCompaction

LeveledCompaction(
    level_0_max: int = 4,
    size_ratio: int = 10,
    base_size_keys: int = 1000,
)

Bases: CompactionStrategy

Level-based compaction with size ratio between levels.

L0 compacts when it has too many SSTables. Other levels compact when total keys exceed level_size_limit = base_size * ratio^level.

Parameters:

Name Type Description Default
level_0_max int

Max SSTables in L0 before compaction.

4
size_ratio int

Size multiplier between adjacent levels.

10
base_size_keys int

Target key count for L1.

1000
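The size-limit rule can be made concrete. One plausible reading, with base_size_keys as the L1 target and each deeper level size_ratio times larger (the implementation's exponent convention may differ)::

```python
def level_size_limit(level: int, base_size_keys: int = 1000,
                     size_ratio: int = 10) -> int:
    # L1 holds base_size_keys; each deeper level is size_ratio times larger
    return base_size_keys * size_ratio ** (level - 1)

assert level_size_limit(1) == 1_000
assert level_size_limit(3) == 100_000
```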

LSMTree

LSMTree(
    name: str,
    *,
    memtable_size: int = 1000,
    compaction_strategy: CompactionStrategy | None = None,
    wal: WriteAheadLog | None = None,
    disk: Any | None = None,
    sstable_read_latency: float = 0.001,
    sstable_write_latency: float = 0.002,
    max_levels: int = 7,
)

Bases: Entity

Log-Structured Merge Tree storage engine.

Composes WAL, Memtable, and SSTable levels. Write path appends to WAL then memtable, flushing to L0 SSTables when full. Reads check memtable first, then each level using bloom filters to skip irrelevant SSTables.

Parameters:

Name Type Description Default
name str

Entity name.

required
memtable_size int

Max entries in the memtable before flush.

1000
compaction_strategy CompactionStrategy | None

Strategy for compaction triggers and selection.

None
wal WriteAheadLog | None

Optional WriteAheadLog for durability. Created internally if None.

None
disk Any | None

Optional Resource for disk I/O contention.

None
sstable_read_latency float

Seconds per SSTable page read.

0.001
sstable_write_latency float

Seconds per SSTable page write.

0.002
max_levels int

Maximum number of levels.

7

Example::

lsm = LSMTree("db", memtable_size=1000,
              compaction_strategy=LeveledCompaction())
sim = Simulation(entities=[lsm], ...)

stats property

stats: LSMTreeStats

Frozen snapshot of LSM tree statistics.

level_summary property

level_summary: list[dict]

Summary of each level: count, total keys, total bytes.

set_clock

set_clock(clock) -> None

Inject clock into this entity and internal components.

put

put(key: str, value: Any) -> Generator[float]

Write a key-value pair through WAL -> memtable -> (flush).

Yields I/O latency for WAL and potential flush operations.

put_sync

put_sync(key: str, value: Any) -> None

Write without yielding latency.

get

get(key: str) -> Generator[float, None, Any | None]

Read a key, checking memtable then SSTable levels.

Uses bloom filters to skip SSTables that don't contain the key. Yields I/O latency for SSTable reads.
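The read path can be sketched with plain dicts and sets, a set standing in for each SSTable's bloom filter; none of these names are library APIs::

```python
def lsm_get(memtable: dict, levels: list, key: str):
    # Newest data first: memtable, then L0, L1, ... Each level holds
    # (bloom, table) pairs; the set plays the bloom filter's role.
    if key in memtable:
        return memtable[key]
    for level in levels:
        for bloom, table in level:
            if key not in bloom:     # definitely absent: skip this table's I/O
                continue
            if key in table:
                return table[key]
    return None

levels = [[({"b"}, {"b": 2})], [({"a", "b"}, {"a": 0, "b": 1})]]
assert lsm_get({"a": 1}, levels, "a") == 1   # memtable shadows older levels
assert lsm_get({}, levels, "b") == 2         # L0 shadows L1
assert lsm_get({}, levels, "z") is None
```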

get_sync

get_sync(key: str) -> Any | None

Read without yielding latency.

delete

delete(key: str) -> Generator[float]

Delete a key by writing a tombstone marker.

scan

scan(
    start_key: str, end_key: str
) -> Generator[float, None, list[tuple[str, Any]]]

Range scan across memtable and all SSTable levels.

Returns merged, deduplicated results with most recent values. Yields I/O latency for SSTable reads.

crash

crash() -> dict

Simulate power loss: lose memtable and unsynced WAL entries.

SSTables survive (already flushed to disk). The memtable and any immutable memtables awaiting flush are lost (volatile memory). WAL entries beyond the last sync point are also lost.

Returns:

Type Description
dict

Summary with counts of lost entries.

recover_from_crash

recover_from_crash() -> dict

Recover durable state after a crash.

Replays surviving WAL entries into a fresh memtable. SSTables are already intact on disk and need no recovery.

Returns:

Type Description
dict

Summary with counts of recovered data.

handle_event

handle_event(event: Event) -> Generator[float] | None

Handle CompactionTrigger events.

LSMTreeStats dataclass

LSMTreeStats(
    writes: int = 0,
    reads: int = 0,
    read_hits: int = 0,
    read_misses: int = 0,
    wal_writes: int = 0,
    memtable_flushes: int = 0,
    compactions: int = 0,
    total_sstables: int = 0,
    levels: int = 0,
    read_amplification: float = 0.0,
    write_amplification: float = 0.0,
    space_amplification: float = 0.0,
    bloom_filter_saves: int = 0,
)

Frozen snapshot of LSM tree statistics.

Attributes:

Name Type Description
writes int

Total put/delete operations.

reads int

Total get operations.

read_hits int

Gets that found a value.

read_misses int

Gets that returned None.

wal_writes int

Total WAL append operations.

memtable_flushes int

Number of memtable flushes.

compactions int

Number of compaction operations.

total_sstables int

Current SSTable count across all levels.

levels int

Number of occupied levels.

read_amplification float

Average SSTables checked per read.

write_amplification float

Total SSTable bytes / user write bytes.

space_amplification float

Total stored bytes / logical data bytes.

bloom_filter_saves int

Reads avoided by bloom filter.
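The three amplification figures follow directly from their descriptions. These toy formulas mirror that wording; the library's exact accounting may differ::

```python
def read_amplification(sstables_checked: int, reads: int) -> float:
    # average SSTables consulted per get
    return sstables_checked / reads if reads else 0.0

def write_amplification(sstable_bytes: int, user_bytes: int) -> float:
    # bytes rewritten by flush/compaction relative to user writes
    return sstable_bytes / user_bytes if user_bytes else 0.0

def space_amplification(stored_bytes: int, logical_bytes: int) -> float:
    # on-disk footprint relative to live logical data
    return stored_bytes / logical_bytes if logical_bytes else 0.0

# 1 MB of user writes rewritten across levels into 3 MB of SSTables:
assert write_amplification(3_000_000, 1_000_000) == 3.0
```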

SizeTieredCompaction

SizeTieredCompaction(min_sstables: int = 4)

Bases: CompactionStrategy

Compact when a level accumulates enough similarly-sized SSTables.

Triggers when any level has >= min_sstables. Merges all SSTables in the most populated level into the next level.

Parameters:

Name Type Description Default
min_sstables int

Minimum SSTables in a level to trigger compaction.

4

Memtable

Memtable(
    name: str,
    *,
    size_threshold: int = 1000,
    write_latency: float = 1e-05,
    read_latency: float = 5e-06,
    rwlock: Any | None = None,
)

Bases: Entity

In-memory sorted write buffer that flushes to SSTables.

Writes are accumulated in memory until the size threshold is reached. On flush, the contents are frozen into an immutable SSTable and the memtable is cleared.

Parameters:

Name Type Description Default
name str

Entity name.

required
size_threshold int

Max number of entries before the memtable is full.

1000
write_latency float

Seconds per put operation.

1e-05
read_latency float

Seconds per get operation.

5e-06
rwlock Any | None

Optional RWLock for concurrent read/write modeling.

None

Example::

memtable = Memtable("mem", size_threshold=1000)
is_full = yield from memtable.put("key1", "value1")
if is_full:
    sstable = memtable.flush()

is_full property

is_full: bool

Whether the memtable has reached its size threshold.

size property

size: int

Current number of entries in the memtable.

stats property

stats: MemtableStats

Frozen snapshot of memtable statistics.

put

put(key: str, value: Any) -> Generator[float, None, bool]

Write a key-value pair, yielding write latency.

Returns True if the memtable is now full and should be flushed.

put_sync

put_sync(key: str, value: Any) -> bool

Write without yielding latency. For testing or fast paths.

Returns True if the memtable is now full.

get

get(key: str) -> Generator[float, None, Any | None]

Look up a key, yielding read latency.

Returns the value if found, None otherwise.

get_sync

get_sync(key: str) -> Any | None

Look up without yielding latency.

contains

contains(key: str) -> bool

Check if a key is in the memtable (no I/O cost).

flush

flush() -> SSTable

Freeze contents into an SSTable and clear the memtable.

Returns the new SSTable containing all current entries.

handle_event

handle_event(event: Event) -> None

Memtable does not process events directly.

MemtableStats dataclass

MemtableStats(
    writes: int = 0,
    reads: int = 0,
    hits: int = 0,
    misses: int = 0,
    flushes: int = 0,
    current_size: int = 0,
    total_bytes_written: int = 0,
)

Frozen snapshot of Memtable statistics.

Attributes:

Name Type Description
writes int

Total put operations.

reads int

Total get operations.

hits int

Get operations that found the key.

misses int

Get operations that did not find the key.

flushes int

Number of times the memtable was flushed to SSTable.

current_size int

Current number of entries.

total_bytes_written int

Estimated total bytes written.

SSTable

SSTable(
    data: list[tuple[str, Any]],
    *,
    index_interval: int = 16,
    bloom_fp_rate: float = 0.01,
    level: int = 0,
    sequence: int = 0,
)

Immutable sorted-string table with bloom filter and sparse index.

Stores a sorted collection of key-value pairs. Once created, data cannot be modified. Provides efficient point lookups via bloom filter + sparse index, and range scans via sorted iteration.

Parameters:

Name Type Description Default
data list[tuple[str, Any]]

List of (key, value) tuples. Will be sorted by key.

required
index_interval int

Build one index entry per this many keys. Smaller values use more memory but enable faster lookups.

16
bloom_fp_rate float

Target false positive rate for the bloom filter.

0.01
level int

LSM level this SSTable belongs to (0 = most recent).

0
sequence int

Sequence number for ordering within a level.

0

Example::

sstable = SSTable([("a", 1), ("b", 2), ("c", 3)])
assert sstable.get("b") == 2
assert sstable.contains("a")
assert not sstable.contains("z")

key_count property

key_count: int

Number of key-value pairs in this SSTable.

size_bytes property

size_bytes: int

Estimated size in bytes.

level property

level: int

LSM level this SSTable belongs to.

sequence property

sequence: int

Sequence number for ordering within a level.

min_key property

min_key: str | None

Smallest key in this SSTable, or None if empty.

max_key property

max_key: str | None

Largest key in this SSTable, or None if empty.

bloom_filter property

bloom_filter: BloomFilter

The bloom filter for this SSTable.

stats property

stats: SSTableStats

Frozen snapshot of SSTable statistics.

contains

contains(key: str) -> bool

Check if this SSTable might contain the key (bloom filter).

Returns False only if the key is definitely not present. May return True for keys not actually in the table (false positive).
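This is the standard bloom-filter guarantee. A toy filter (not the library's BloomFilter class) makes the asymmetry concrete::

```python
import hashlib

class ToyBloom:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # bit array packed into one int

    def _positions(self, key: str):
        # derive num_hashes positions from salted SHA-256 digests
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = ToyBloom()
bf.add("a")
assert bf.might_contain("a")   # added keys are never reported absent
# might_contain("z") is usually False; a True here would be a false
# positive, never an error.
```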

get

get(key: str) -> Any | None

Look up a key using sparse index + binary search.

Returns the value if found, None otherwise.
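The sparse-index mechanics fit in a few lines: binary-search the sampled keys to find the one block that could hold the key, then scan only that block. Illustrative code, not the implementation::

```python
import bisect

def sparse_get(pairs, key, interval=4):
    # pairs is sorted by key; the index samples every `interval`-th key
    index = [(pairs[i][0], i) for i in range(0, len(pairs), interval)]
    sampled_keys = [k for k, _ in index]
    slot = bisect.bisect_right(sampled_keys, key) - 1
    if slot < 0:
        return None              # key sorts before the first entry
    start = index[slot][1]
    for k, v in pairs[start:start + interval]:
        if k == key:
            return v
    return None

pairs = [(f"k{i:02d}", i) for i in range(10)]
assert sparse_get(pairs, "k07") == 7
assert sparse_get(pairs, "k99") is None
```

A smaller interval means more sampled keys (more memory) but a shorter block scan, which is the tradeoff index_interval controls.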

scan

scan(
    start_key: str | None = None, end_key: str | None = None
) -> list[tuple[str, Any]]

Return all key-value pairs in [start_key, end_key) range.

Parameters:

Name Type Description Default
start_key str | None

Inclusive lower bound. None means from the beginning.

None
end_key str | None

Exclusive upper bound. None means to the end.

None

Returns:

Type Description
list[tuple[str, Any]]

Sorted list of (key, value) tuples in the range.

page_reads_for_get

page_reads_for_get(key: str) -> int

Estimate page reads needed for a point lookup.

Returns 0 if the bloom filter says the key is absent. Otherwise returns 1 (index page) + 1 (data page) = 2.

page_reads_for_scan

page_reads_for_scan(
    start_key: str | None = None, end_key: str | None = None
) -> int

Estimate page reads for a range scan.

Returns the number of data pages that would be read, treating each run of index_interval keys as one page.

overlaps

overlaps(other: SSTable) -> bool

Check if this SSTable's key range overlaps with another's.

SSTableStats dataclass

SSTableStats(
    key_count: int = 0,
    size_bytes: int = 0,
    index_entries: int = 0,
    bloom_filter_fp_rate: float = 0.0,
    bloom_filter_size_bits: int = 0,
)

Frozen snapshot of SSTable statistics.

Attributes:

Name Type Description
key_count int

Number of key-value pairs stored.

size_bytes int

Estimated size in bytes.

index_entries int

Number of sparse index entries.

bloom_filter_fp_rate float

Current bloom filter false positive rate.

bloom_filter_size_bits int

Size of bloom filter bit array.

IsolationLevel

Bases: Enum

Transaction isolation level.

StorageEngine

Bases: Protocol

Structural typing for any key-value store.

StorageTransaction

StorageTransaction(
    tx_id: int,
    manager: TransactionManager,
    isolation: IsolationLevel,
    snapshot_version: int,
)

A single transaction against a storage engine.

Buffers reads and writes locally. On commit, checks for conflicts based on the isolation level and applies buffered writes if no conflicts are detected.

Not an Entity — created and managed by TransactionManager.

tx_id property

tx_id: int

Transaction identifier.

is_active property

is_active: bool

Whether this transaction is still active.

read

read(key: str) -> Generator[float, None, Any | None]

Read a key, tracking it in the read set.

For SNAPSHOT_ISOLATION and SERIALIZABLE, reads use the snapshot. For READ_COMMITTED, reads see the latest committed value.

write

write(key: str, value: Any) -> Generator[float]

Buffer a write in the local write set.

The write is not applied to the store until commit.

commit

commit() -> Generator[float, None, bool]

Attempt to commit the transaction.

Checks for conflicts based on isolation level. If no conflicts, applies buffered writes to the store.

Returns True if committed successfully, False if aborted due to conflict.
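One common conflict rule, first-committer-wins (the classic snapshot-isolation check), is easy to sketch. Names here are illustrative; the manager's actual checks vary by isolation level::

```python
last_commit_version: dict[str, int] = {}   # key -> version of last committed write

def try_commit(write_set: dict, snapshot_version: int,
               commit_version: int) -> bool:
    # Abort if any key we wrote was committed by someone after our snapshot.
    if any(last_commit_version.get(k, 0) > snapshot_version for k in write_set):
        return False
    for k in write_set:
        last_commit_version[k] = commit_version
    return True

assert try_commit({"x": 1}, snapshot_version=0, commit_version=1)       # first wins
assert not try_commit({"x": 2}, snapshot_version=0, commit_version=2)   # stale snapshot
```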

abort

abort() -> None

Abort this transaction, discarding all buffered writes.

TransactionManager

TransactionManager(
    name: str,
    store: StorageEngine,
    isolation: IsolationLevel = IsolationLevel.SNAPSHOT_ISOLATION,
    deadlock_detection: bool = True,
)

Bases: Entity

Transaction manager with configurable isolation levels.

Wraps a storage engine and provides transactional semantics with conflict detection. Each transaction sees a consistent snapshot (for SI/Serializable) and buffers writes until commit.

Parameters:

Name Type Description Default
name str

Entity name.

required
store StorageEngine

The underlying storage engine (LSMTree, BTree, etc.).

required
isolation IsolationLevel

Default isolation level for new transactions.

SNAPSHOT_ISOLATION
deadlock_detection bool

Whether to detect deadlocks (placeholder).

True

Example::

lsm = LSMTree("db")
tm = TransactionManager("txm", store=lsm,
                        isolation=IsolationLevel.SNAPSHOT_ISOLATION)
sim = Simulation(entities=[lsm, tm], ...)

stats property

stats: TransactionStats

Frozen snapshot of transaction manager statistics.

active_transactions property

active_transactions: int

Number of currently active transactions.

begin

begin(
    isolation: IsolationLevel | None = None,
) -> Generator[float, None, StorageTransaction]

Begin a new transaction.

Parameters:

Name Type Description Default
isolation IsolationLevel | None

Override the default isolation level for this transaction.

None

Returns the new StorageTransaction after yielding a small latency.

begin_sync

begin_sync(
    isolation: IsolationLevel | None = None,
) -> StorageTransaction

Begin a transaction without yielding latency.

handle_event

handle_event(event: Event) -> None

TransactionManager does not process events directly.

TransactionStats dataclass

TransactionStats(
    transactions_started: int = 0,
    transactions_committed: int = 0,
    transactions_aborted: int = 0,
    conflicts_detected: int = 0,
    deadlocks_detected: int = 0,
    reads: int = 0,
    writes: int = 0,
    avg_transaction_duration_s: float = 0.0,
)

Frozen snapshot of transaction manager statistics.

Attributes:

Name Type Description
transactions_started int

Total transactions begun.

transactions_committed int

Total successful commits.

transactions_aborted int

Total aborted transactions.

conflicts_detected int

Total conflict detections (commit failures).

deadlocks_detected int

Total deadlock detections.

reads int

Total read operations across all transactions.

writes int

Total write operations across all transactions.

avg_transaction_duration_s float

Average transaction duration.

SyncEveryWrite

Bases: SyncPolicy

Sync after every write — maximum durability.

SyncOnBatch

SyncOnBatch(batch_size: int)

Bases: SyncPolicy

Sync after a fixed number of writes accumulate.

Parameters:

Name Type Description Default
batch_size int

Number of writes before syncing.

required

SyncPeriodic

SyncPeriodic(interval_s: float)

Bases: SyncPolicy

Sync when a time interval has elapsed since the last sync.

Parameters:

Name Type Description Default
interval_s float

Seconds between syncs.

required

SyncPolicy

Bases: ABC

Strategy for when to fsync the WAL to disk.

should_sync abstractmethod

should_sync(
    writes_since_sync: int, time_since_sync_s: float
) -> bool

Return True if the WAL should be synced now.
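Given this signature, each of the three bundled policies reduces to one comparison. These sketches mirror the documented behavior under toy names; the real classes may carry extra state::

```python
class EveryWrite:
    def should_sync(self, writes_since_sync: int, time_since_sync_s: float) -> bool:
        # any unsynced write triggers an fsync
        return writes_since_sync > 0

class OnBatch:
    def __init__(self, batch_size: int):
        self.batch_size = batch_size
    def should_sync(self, writes_since_sync: int, time_since_sync_s: float) -> bool:
        return writes_since_sync >= self.batch_size

class Periodic:
    def __init__(self, interval_s: float):
        self.interval_s = interval_s
    def should_sync(self, writes_since_sync: int, time_since_sync_s: float) -> bool:
        return time_since_sync_s >= self.interval_s

assert OnBatch(10).should_sync(10, 0.0)
assert not OnBatch(10).should_sync(9, 0.0)
assert Periodic(0.5).should_sync(1, 0.6)
```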

WALEntry dataclass

WALEntry(
    sequence_number: int,
    key: str,
    value: Any,
    timestamp_s: float,
)

A single entry in the write-ahead log.

Attributes:

Name Type Description
sequence_number int

Monotonically increasing ID for this entry.

key str

The key being written.

value Any

The value being written.

timestamp_s float

Simulation time when the entry was written.

WALStats dataclass

WALStats(
    writes: int = 0,
    bytes_written: int = 0,
    syncs: int = 0,
    total_sync_latency_s: float = 0.0,
    entries_recovered: int = 0,
)

Frozen snapshot of WAL statistics.

Attributes:

Name Type Description
writes int

Total append operations.

bytes_written int

Estimated bytes written.

syncs int

Number of fsync operations performed.

total_sync_latency_s float

Cumulative sync latency.

entries_recovered int

Number of entries returned by last recover().

WriteAheadLog

WriteAheadLog(
    name: str,
    *,
    sync_policy: SyncPolicy | None = None,
    disk: Any | None = None,
    write_latency: float = 0.0001,
    sync_latency: float = 0.001,
)

Bases: Entity

Append-only durability log with pluggable sync policies.

Records key-value writes for crash recovery. The sync policy controls when data is flushed to persistent storage. Optional disk Resource models I/O contention.

Parameters:

Name Type Description Default
name str

Entity name.

required
sync_policy SyncPolicy | None

When to fsync. Defaults to SyncEveryWrite.

None
disk Any | None

Optional Resource for disk I/O contention modeling.

None
write_latency float

Seconds per append operation.

0.0001
sync_latency float

Seconds per fsync operation.

0.001

Example::

wal = WriteAheadLog("wal", sync_policy=SyncOnBatch(batch_size=10))
sim = Simulation(entities=[wal], ...)

# In another entity's handle_event:
seq = yield from wal.append("key1", "value1")

synced_up_to property

synced_up_to: int

Sequence number of the last entry synced to durable storage.

size property

size: int

Number of entries currently in the log.

stats property

stats: WALStats

Frozen snapshot of WAL statistics.

append

append(key: str, value: Any) -> Generator[float, None, int]

Append a key-value pair to the log, yielding I/O latency.

Returns the sequence number assigned to this entry.

append_sync

append_sync(key: str, value: Any) -> int

Append without yielding I/O latency. For testing or fast paths.

Returns the sequence number assigned to this entry.

recover

recover() -> list[WALEntry]

Return all entries in the log for crash recovery.

Returns entries in sequence order.

truncate

truncate(up_to_sequence: int) -> None

Remove entries with sequence_number <= up_to_sequence.

Called after a checkpoint to reclaim space.

crash

crash() -> int

Simulate power loss: discard entries not yet synced to disk.

Entries with sequence_number > synced_up_to are lost — they were in volatile memory only. Returns the number of entries lost.
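The durability boundary is exactly the synced_up_to watermark. A toy model of the cut, not the library's internals::

```python
def crash(entries: list[int], synced_up_to: int) -> tuple[list[int], int]:
    # entries are sequence numbers; only entries synced to disk survive
    survivors = [seq for seq in entries if seq <= synced_up_to]
    return survivors, len(entries) - len(survivors)

survivors, lost = crash([1, 2, 3, 4, 5], synced_up_to=3)
assert survivors == [1, 2, 3] and lost == 2
```

This is why the sync policy is the durability/throughput dial: SyncEveryWrite makes the watermark track the tail of the log, while batched or periodic syncing leaves a window of losable entries.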

handle_event

handle_event(event: Event) -> None

WAL does not process events directly.