Storage¶
Storage engine internals: LSM trees, B-trees, SSTables, WAL, memtables, compaction, and transactions.
Provides components modeling the internal data structures of storage engines: WAL, Memtable, SSTable, LSM Tree, B-Tree, and Transaction Manager. These enable simulation of compaction storms, read/write amplification tradeoffs, and transaction contention.
BTree ¶
BTree(
name: str,
*,
order: int = 128,
disk: Any | None = None,
page_read_latency: float = 0.001,
page_write_latency: float = 0.002,
)
Bases: Entity
B-tree index with page-level I/O cost simulation.
Each tree traversal costs depth * page_read_latency in simulated I/O time. Writes additionally pay page_write_latency per modified page. Node splits create extra write overhead.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Entity name. | *required* |
| `order` | `int` | Maximum number of children per internal node. Leaf nodes hold at most `order - 1` keys. | `128` |
| `disk` | `Any \| None` | Optional Resource for disk I/O contention. | `None` |
| `page_read_latency` | `float` | Seconds per page read. | `0.001` |
| `page_write_latency` | `float` | Seconds per page write. | `0.002` |
Example::
btree = BTree("index", order=128)
sim = Simulation(entities=[btree], ...)
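The page-level cost model above can be made concrete with a small, self-contained sketch (the `btree_get_latency` helper and its depth estimate are illustrative, not part of this API):

```python
import math

def btree_get_latency(total_keys: int, order: int = 128,
                      page_read_latency: float = 0.001) -> float:
    """Estimate the simulated latency of one point lookup:
    depth * page_read_latency."""
    # Depth estimate: each level fans out by ~order, so the tree is
    # roughly log_order(n) levels deep (at least 1 for the root).
    depth = max(1, math.ceil(math.log(max(total_keys, 2), order)))
    return depth * page_read_latency

print(btree_get_latency(1_000_000))  # ~0.003 s: a depth-3 tree, 3 page reads
print(btree_get_latency(100))        # ~0.001 s: everything fits in one level
```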
get ¶
Look up a key, yielding page read latency for each tree level.
put ¶
Insert or update a key-value pair, yielding I/O latency.
Traverses from root to leaf (page reads), then writes the leaf page. May split nodes, adding extra page writes.
delete ¶
Delete a key, yielding I/O latency.
Returns True if the key was found and deleted, False otherwise. Uses lazy deletion (mark as None) for simplicity.
scan ¶
Range scan, yielding I/O latency.
Traverses to the start leaf, then scans sequentially.
BTreeStats dataclass ¶
BTreeStats(
reads: int = 0,
writes: int = 0,
page_reads: int = 0,
page_writes: int = 0,
node_splits: int = 0,
tree_depth: int = 0,
total_keys: int = 0,
)
Frozen snapshot of B-tree statistics.
Attributes:

| Name | Type | Description |
|---|---|---|
| `reads` | `int` | Total get operations. |
| `writes` | `int` | Total put/delete operations. |
| `page_reads` | `int` | Total disk page reads. |
| `page_writes` | `int` | Total disk page writes. |
| `node_splits` | `int` | Total node split operations. |
| `tree_depth` | `int` | Current depth of the tree. |
| `total_keys` | `int` | Total number of keys stored. |
CompactionStrategy ¶
Bases: ABC
Strategy for deciding when and what to compact.
should_compact abstractmethod ¶
Return True if compaction should be triggered.
FIFOCompaction ¶
Bases: CompactionStrategy
Drop oldest SSTables when total count exceeds threshold.
Simple TTL-like compaction for time-series data where old data can be discarded.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `max_total_sstables` | `int` | Maximum total SSTables across all levels. | `100` |
LeveledCompaction ¶
Bases: CompactionStrategy
Level-based compaction with size ratio between levels.
L0 compacts when it has too many SSTables. Other levels compact when total keys exceed level_size_limit = base_size * ratio^level.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `level_0_max` | `int` | Max SSTables in L0 before compaction. | `4` |
| `size_ratio` | `int` | Size multiplier between adjacent levels. | `10` |
| `base_size_keys` | `int` | Target key count for L1. | `1000` |
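The level-limit formula above can be transcribed directly as a helper (illustrative only; how the library maps level numbers into the exponent is an internal detail):

```python
def level_size_limit(level: int, base_size: int = 1000, ratio: int = 10) -> int:
    """Key-count limit for a level: level_size_limit = base_size * ratio**level."""
    return base_size * ratio ** level

# With the defaults (base_size_keys=1000, size_ratio=10), limits grow
# geometrically by size_ratio per level:
print([level_size_limit(lv) for lv in range(1, 4)])  # [10000, 100000, 1000000]
```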
LSMTree ¶
LSMTree(
name: str,
*,
memtable_size: int = 1000,
compaction_strategy: CompactionStrategy | None = None,
wal: WriteAheadLog | None = None,
disk: Any | None = None,
sstable_read_latency: float = 0.001,
sstable_write_latency: float = 0.002,
max_levels: int = 7,
)
Bases: Entity
Log-Structured Merge Tree storage engine.
Composes WAL, Memtable, and SSTable levels. Write path appends to WAL then memtable, flushing to L0 SSTables when full. Reads check memtable first, then each level using bloom filters to skip irrelevant SSTables.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Entity name. | *required* |
| `memtable_size` | `int` | Max entries in the memtable before flush. | `1000` |
| `compaction_strategy` | `CompactionStrategy \| None` | Strategy for compaction triggers and selection. | `None` |
| `wal` | `WriteAheadLog \| None` | Optional WriteAheadLog for durability. Created internally if None. | `None` |
| `disk` | `Any \| None` | Optional Resource for disk I/O contention. | `None` |
| `sstable_read_latency` | `float` | Seconds per SSTable page read. | `0.001` |
| `sstable_write_latency` | `float` | Seconds per SSTable page write. | `0.002` |
| `max_levels` | `int` | Maximum number of levels. | `7` |
Example::
lsm = LSMTree("db", memtable_size=1000,
compaction_strategy=LeveledCompaction())
sim = Simulation(entities=[lsm], ...)
level_summary property ¶
Summary of each level: count, total keys, total bytes.
put ¶
Write a key-value pair through WAL -> memtable -> (flush).
Yields I/O latency for WAL and potential flush operations.
get ¶
Read a key, checking memtable then SSTable levels.
Uses bloom filters to skip SSTables that don't contain the key. Yields I/O latency for SSTable reads.
scan ¶
Range scan across memtable and all SSTable levels.
Returns merged, deduplicated results with most recent values. Yields I/O latency for SSTable reads.
crash ¶
Simulate power loss: lose memtable and unsynced WAL entries.
SSTables survive (already flushed to disk). The memtable and any immutable memtables awaiting flush are lost (volatile memory). WAL entries beyond the last sync point are also lost.
Returns:

| Type | Description |
|---|---|
| `dict` | Summary with counts of lost entries. |
recover_from_crash ¶
Recover durable state after a crash.
Replays surviving WAL entries into a fresh memtable. SSTables are already intact on disk and need no recovery.
Returns:

| Type | Description |
|---|---|
| `dict` | Summary with counts of recovered data. |
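The crash/recover contract can be illustrated with plain data, independent of the classes above (the helper and entry layout here are illustrative): entries synced to the WAL survive, unsynced ones are lost, and recovery replays the survivors into a fresh memtable.

```python
def simulate_crash_recovery(wal_entries, synced_up_to):
    """After power loss: WAL entries with seq <= synced_up_to survive;
    the rest, plus all memtable contents, are lost."""
    survivors = [e for e in wal_entries if e["seq"] <= synced_up_to]
    lost = len(wal_entries) - len(survivors)
    # Recovery replays surviving entries in order (last write wins per key).
    recovered = {}
    for e in survivors:
        recovered[e["key"]] = e["value"]
    return {"lost_wal_entries": lost, "recovered_keys": len(recovered)}

# Five appends, but the last fsync only covered sequence numbers 1..3:
entries = [{"seq": i, "key": f"k{i % 3}", "value": i} for i in range(1, 6)]
print(simulate_crash_recovery(entries, synced_up_to=3))
# {'lost_wal_entries': 2, 'recovered_keys': 3}
```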
handle_event ¶
Handle CompactionTrigger events.
LSMTreeStats dataclass ¶
LSMTreeStats(
writes: int = 0,
reads: int = 0,
read_hits: int = 0,
read_misses: int = 0,
wal_writes: int = 0,
memtable_flushes: int = 0,
compactions: int = 0,
total_sstables: int = 0,
levels: int = 0,
read_amplification: float = 0.0,
write_amplification: float = 0.0,
space_amplification: float = 0.0,
bloom_filter_saves: int = 0,
)
Frozen snapshot of LSM tree statistics.
Attributes:

| Name | Type | Description |
|---|---|---|
| `writes` | `int` | Total put/delete operations. |
| `reads` | `int` | Total get operations. |
| `read_hits` | `int` | Gets that found a value. |
| `read_misses` | `int` | Gets that returned None. |
| `wal_writes` | `int` | Total WAL append operations. |
| `memtable_flushes` | `int` | Number of memtable flushes. |
| `compactions` | `int` | Number of compaction operations. |
| `total_sstables` | `int` | Current SSTable count across all levels. |
| `levels` | `int` | Number of occupied levels. |
| `read_amplification` | `float` | Average SSTables checked per read. |
| `write_amplification` | `float` | Total SSTable bytes / user write bytes. |
| `space_amplification` | `float` | Total stored bytes / logical data bytes. |
| `bloom_filter_saves` | `int` | Reads avoided by bloom filter. |
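The three amplification ratios reported above are plain divisions over the counters; a quick illustrative sketch (not the library's code):

```python
def amplification(user_bytes_written: int, sstable_bytes_written: int,
                  stored_bytes: int, logical_bytes: int,
                  sstables_checked: int, reads: int) -> dict:
    """Compute the three classic LSM amplification factors."""
    return {
        # Every user byte is rewritten by the flush plus each compaction pass.
        "write": sstable_bytes_written / user_bytes_written,
        # Overlapping SSTables hold stale versions of the same keys.
        "space": stored_bytes / logical_bytes,
        # A read may probe several SSTables before finding the key.
        "read": sstables_checked / reads,
    }

print(amplification(user_bytes_written=100, sstable_bytes_written=350,
                    stored_bytes=180, logical_bytes=100,
                    sstables_checked=25, reads=10))
# {'write': 3.5, 'space': 1.8, 'read': 2.5}
```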
SizeTieredCompaction ¶
Bases: CompactionStrategy
Compact when a level accumulates enough similarly-sized SSTables.
Triggers when any level has >= min_sstables. Merges all SSTables in the most populated level into the next level.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `min_sstables` | `int` | Minimum SSTables in a level to trigger compaction. | `4` |
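The trigger rule can be sketched without the library's `CompactionStrategy` interface (whose exact `should_compact` signature is not shown here); this standalone helper is illustrative only:

```python
def size_tiered_pick_level(sstables_per_level: dict[int, int],
                           min_sstables: int = 4):
    """Return the level to compact, or None if nothing qualifies.
    Triggers when any level reaches min_sstables, picking the most
    populated level to merge into the next one."""
    eligible = {lv: n for lv, n in sstables_per_level.items()
                if n >= min_sstables}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)

print(size_tiered_pick_level({0: 2, 1: 5, 2: 3}))  # 1: only L1 has >= 4
print(size_tiered_pick_level({0: 2, 1: 3}))        # None: no trigger yet
```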
Memtable ¶
Memtable(
name: str,
*,
size_threshold: int = 1000,
write_latency: float = 1e-05,
read_latency: float = 5e-06,
rwlock: Any | None = None,
)
Bases: Entity
In-memory sorted write buffer that flushes to SSTables.
Writes are accumulated in memory until the size threshold is reached. On flush, the contents are frozen into an immutable SSTable and the memtable is cleared.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Entity name. | *required* |
| `size_threshold` | `int` | Max number of entries before the memtable is full. | `1000` |
| `write_latency` | `float` | Seconds per put operation. | `1e-05` |
| `read_latency` | `float` | Seconds per get operation. | `5e-06` |
| `rwlock` | `Any \| None` | Optional RWLock for concurrent read/write modeling. | `None` |
Example::
memtable = Memtable("mem", size_threshold=1000)
is_full = yield from memtable.put("key1", "value1")
if is_full:
sstable = memtable.flush()
put ¶
Write a key-value pair, yielding write latency.
Returns True if the memtable is now full and should be flushed.
put_sync ¶
Write without yielding latency. For testing or fast paths.
Returns True if the memtable is now full.
get ¶
Look up a key, yielding read latency.
Returns the value if found, None otherwise.
flush ¶
Freeze contents into an SSTable and clear the memtable.
Returns the new SSTable containing all current entries.
MemtableStats dataclass ¶
MemtableStats(
writes: int = 0,
reads: int = 0,
hits: int = 0,
misses: int = 0,
flushes: int = 0,
current_size: int = 0,
total_bytes_written: int = 0,
)
Frozen snapshot of Memtable statistics.
Attributes:

| Name | Type | Description |
|---|---|---|
| `writes` | `int` | Total put operations. |
| `reads` | `int` | Total get operations. |
| `hits` | `int` | Get operations that found the key. |
| `misses` | `int` | Get operations that did not find the key. |
| `flushes` | `int` | Number of times the memtable was flushed to SSTable. |
| `current_size` | `int` | Current number of entries. |
| `total_bytes_written` | `int` | Estimated total bytes written. |
SSTable ¶
SSTable(
data: list[tuple[str, Any]],
*,
index_interval: int = 16,
bloom_fp_rate: float = 0.01,
level: int = 0,
sequence: int = 0,
)
Immutable sorted-string table with bloom filter and sparse index.
Stores a sorted collection of key-value pairs. Once created, data cannot be modified. Provides efficient point lookups via bloom filter + sparse index, and range scans via sorted iteration.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `list[tuple[str, Any]]` | List of (key, value) tuples. Will be sorted by key. | *required* |
| `index_interval` | `int` | Build one index entry per this many keys. Smaller values use more memory but enable faster lookups. | `16` |
| `bloom_fp_rate` | `float` | Target false positive rate for the bloom filter. | `0.01` |
| `level` | `int` | LSM level this SSTable belongs to (0 = most recent). | `0` |
| `sequence` | `int` | Sequence number for ordering within a level. | `0` |
Example::
sstable = SSTable([("a", 1), ("b", 2), ("c", 3)])
assert sstable.get("b") == 2
assert sstable.contains("a")
assert not sstable.contains("z")
contains ¶
Check if this SSTable might contain the key (bloom filter).
Returns False only if the key is definitely not present. May return True for keys not actually in the table (false positive).
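This no-false-negative guarantee is the defining bloom-filter property. A minimal, self-contained filter demonstrating it (the real SSTable sizes its filter from `bloom_fp_rate`; the hashing scheme here is illustrative):

```python
import hashlib
import math

class TinyBloom:
    """Minimal bloom filter: k hash probes into an m-bit array."""

    def __init__(self, n_items: int, fp_rate: float = 0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2, k = (m/n) ln 2.
        self.m = max(8, math.ceil(-n_items * math.log(fp_rate)
                                  / math.log(2) ** 2))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = 0

    def _probes(self, key: str):
        # Derive k probe positions from slices of one SHA-256 digest.
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[i * 4:i * 4 + 4], "big") % self.m

    def add(self, key: str):
        for p in self._probes(key):
            self.bits |= 1 << p

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True may be a false positive.
        return all(self.bits >> p & 1 for p in self._probes(key))

bloom = TinyBloom(n_items=3)
for key in ("a", "b", "c"):
    bloom.add(key)
# Inserted keys are never reported absent:
assert all(bloom.might_contain(k) for k in ("a", "b", "c"))
```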
get ¶
Look up a key using sparse index + binary search.
Returns the value if found, None otherwise.
scan ¶
Return all key-value pairs in [start_key, end_key) range.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `start_key` | `str \| None` | Inclusive lower bound. None means from the beginning. | `None` |
| `end_key` | `str \| None` | Exclusive upper bound. None means to the end. | `None` |

Returns:

| Type | Description |
|---|---|
| `list[tuple[str, Any]]` | Sorted list of (key, value) tuples in the range. |
page_reads_for_get ¶
Estimate page reads needed for a point lookup.
Returns 0 if the bloom filter says the key is absent. Otherwise returns 1 (index page) + 1 (data page) = 2.
page_reads_for_scan ¶
Estimate page reads for a range scan.
Returns the number of data pages that would be read, based on index_interval as page size.
overlaps ¶
Check if this SSTable's key range overlaps with another's.
SSTableStats dataclass ¶
SSTableStats(
key_count: int = 0,
size_bytes: int = 0,
index_entries: int = 0,
bloom_filter_fp_rate: float = 0.0,
bloom_filter_size_bits: int = 0,
)
Frozen snapshot of SSTable statistics.
Attributes:

| Name | Type | Description |
|---|---|---|
| `key_count` | `int` | Number of key-value pairs stored. |
| `size_bytes` | `int` | Estimated size in bytes. |
| `index_entries` | `int` | Number of sparse index entries. |
| `bloom_filter_fp_rate` | `float` | Current bloom filter false positive rate. |
| `bloom_filter_size_bits` | `int` | Size of the bloom filter bit array. |
IsolationLevel ¶
Bases: Enum
Transaction isolation level.
StorageEngine ¶
Bases: Protocol
Structural typing for any key-value store.
StorageTransaction ¶
StorageTransaction(
tx_id: int,
manager: TransactionManager,
isolation: IsolationLevel,
snapshot_version: int,
)
A single transaction against a storage engine.
Buffers reads and writes locally. On commit, checks for conflicts based on the isolation level and applies buffered writes if no conflicts are detected.
Not an Entity — created and managed by TransactionManager.
read ¶
Read a key, tracking it in the read set.
For SNAPSHOT_ISOLATION and SERIALIZABLE, reads use the snapshot. For READ_COMMITTED, reads see the latest committed value.
write ¶
Buffer a write in the local write set.
The write is not applied to the store until commit.
commit ¶
Attempt to commit the transaction.
Checks for conflicts based on isolation level. If no conflicts, applies buffered writes to the store.
Returns True if committed successfully, False if aborted due to conflict.
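The snapshot-isolation commit check amounts to first-committer-wins: abort if any key in the write set was committed by another transaction after this one's snapshot was taken. An illustrative standalone sketch (not the library's internals):

```python
def si_commit(store_versions: dict[str, int], snapshot_version: int,
              write_set: dict, commit_version: int) -> bool:
    """First-committer-wins: abort on a write-write conflict since the
    snapshot; otherwise apply buffered writes at commit_version."""
    for key in write_set:
        # Conflict: someone committed this key after our snapshot was taken.
        if store_versions.get(key, 0) > snapshot_version:
            return False
    for key in write_set:
        store_versions[key] = commit_version
    return True

versions = {"x": 5, "y": 3}
# y was last committed at version 3 <= our snapshot 5: commit succeeds.
assert si_commit(versions, snapshot_version=5, write_set={"y": "new"},
                 commit_version=6)
# x was committed at version 5 > our snapshot 4: conflict, abort.
assert not si_commit(versions, snapshot_version=4, write_set={"x": "new"},
                     commit_version=7)
```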
TransactionManager ¶
TransactionManager(
name: str,
store: StorageEngine,
isolation: IsolationLevel = IsolationLevel.SNAPSHOT_ISOLATION,
deadlock_detection: bool = True,
)
Bases: Entity
Transaction manager with configurable isolation levels.
Wraps a storage engine and provides transactional semantics with conflict detection. Each transaction sees a consistent snapshot (for SI/Serializable) and buffers writes until commit.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Entity name. | *required* |
| `store` | `StorageEngine` | The underlying storage engine (LSMTree, BTree, etc.). | *required* |
| `isolation` | `IsolationLevel` | Default isolation level for new transactions. | `SNAPSHOT_ISOLATION` |
| `deadlock_detection` | `bool` | Whether to detect deadlocks (placeholder). | `True` |
Example::
lsm = LSMTree("db")
tm = TransactionManager("txm", store=lsm,
isolation=IsolationLevel.SNAPSHOT_ISOLATION)
sim = Simulation(entities=[lsm, tm], ...)
begin ¶
Begin a new transaction.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `isolation` | `IsolationLevel \| None` | Override the default isolation level for this transaction. | `None` |
Returns the new StorageTransaction after yielding a small latency.
begin_sync ¶
Begin a transaction without yielding latency.
handle_event ¶
TransactionManager does not process events directly.
TransactionStats dataclass ¶
TransactionStats(
transactions_started: int = 0,
transactions_committed: int = 0,
transactions_aborted: int = 0,
conflicts_detected: int = 0,
deadlocks_detected: int = 0,
reads: int = 0,
writes: int = 0,
avg_transaction_duration_s: float = 0.0,
)
Frozen snapshot of transaction manager statistics.
Attributes:

| Name | Type | Description |
|---|---|---|
| `transactions_started` | `int` | Total transactions begun. |
| `transactions_committed` | `int` | Total successful commits. |
| `transactions_aborted` | `int` | Total aborted transactions. |
| `conflicts_detected` | `int` | Total conflict detections (commit failures). |
| `deadlocks_detected` | `int` | Total deadlock detections. |
| `reads` | `int` | Total read operations across all transactions. |
| `writes` | `int` | Total write operations across all transactions. |
| `avg_transaction_duration_s` | `float` | Average transaction duration in seconds. |
SyncEveryWrite ¶
Bases: SyncPolicy
Sync to disk after every append: maximum durability at the cost of one fsync per write.
SyncOnBatch ¶
Bases: SyncPolicy
Sync after a fixed number of writes accumulate.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch_size` | `int` | Number of writes before syncing. | *required* |
SyncPeriodic ¶
Bases: SyncPolicy
Sync when a time interval has elapsed since the last sync.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `interval_s` | `float` | Seconds between syncs. | *required* |
SyncPolicy ¶
Bases: ABC
Strategy for when to fsync the WAL to disk.
should_sync abstractmethod ¶
Return True if the WAL should be synced now.
WALEntry dataclass ¶
A single entry in the write-ahead log.
Attributes:

| Name | Type | Description |
|---|---|---|
| `sequence_number` | `int` | Monotonically increasing ID for this entry. |
| `key` | `str` | The key being written. |
| `value` | `Any` | The value being written. |
| `timestamp_s` | `float` | Simulation time when the entry was written. |
WALStats dataclass ¶
WALStats(
writes: int = 0,
bytes_written: int = 0,
syncs: int = 0,
total_sync_latency_s: float = 0.0,
entries_recovered: int = 0,
)
Frozen snapshot of WAL statistics.
Attributes:

| Name | Type | Description |
|---|---|---|
| `writes` | `int` | Total append operations. |
| `bytes_written` | `int` | Estimated bytes written. |
| `syncs` | `int` | Number of fsync operations performed. |
| `total_sync_latency_s` | `float` | Cumulative sync latency in seconds. |
| `entries_recovered` | `int` | Number of entries returned by the last recover(). |
WriteAheadLog ¶
WriteAheadLog(
name: str,
*,
sync_policy: SyncPolicy | None = None,
disk: Any | None = None,
write_latency: float = 0.0001,
sync_latency: float = 0.001,
)
Bases: Entity
Append-only durability log with pluggable sync policies.
Records key-value writes for crash recovery. The sync policy controls when data is flushed to persistent storage. Optional disk Resource models I/O contention.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Entity name. | *required* |
| `sync_policy` | `SyncPolicy \| None` | When to fsync. Defaults to SyncEveryWrite. | `None` |
| `disk` | `Any \| None` | Optional Resource for disk I/O contention modeling. | `None` |
| `write_latency` | `float` | Seconds per append operation. | `0.0001` |
| `sync_latency` | `float` | Seconds per fsync operation. | `0.001` |
Example::
wal = WriteAheadLog("wal", sync_policy=SyncOnBatch(batch_size=10))
sim = Simulation(entities=[wal], ...)
# In another entity's handle_event:
seq = yield from wal.append("key1", "value1")
synced_up_to property ¶
Sequence number of the last entry synced to durable storage.
append ¶
Append a key-value pair to the log, yielding I/O latency.
Returns the sequence number assigned to this entry.
append_sync ¶
Append without yielding I/O latency. For testing or fast paths.
Returns the sequence number assigned to this entry.
recover ¶
Return all entries in the log for crash recovery.
Returns entries in sequence order.
truncate ¶
Remove entries with sequence_number <= up_to_sequence.
Called after a checkpoint to reclaim space.
crash ¶
Simulate power loss: discard entries not yet synced to disk.
Entries with sequence_number > synced_up_to are lost — they were in volatile memory only. Returns the number of entries lost.
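The sync policy bounds how much a crash can lose: with `SyncOnBatch(batch_size=10)`, up to nine unsynced entries are volatile at any moment. An illustrative back-of-envelope helper, assuming the policy syncs exactly on every `batch_size`-th append:

```python
def entries_lost_on_crash(total_appends: int, batch_size: int) -> int:
    """With batch syncing, everything after the last full batch is volatile."""
    # The last fsync covered the highest multiple of batch_size.
    synced_up_to = (total_appends // batch_size) * batch_size
    return total_appends - synced_up_to

print(entries_lost_on_crash(47, batch_size=10))  # 7 entries lost
print(entries_lost_on_crash(47, batch_size=1))   # 0: SyncEveryWrite loses nothing
```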