Faults ¶
Fault injection framework for declarative fault scheduling, used to test system resilience.
Provides fault types for nodes, networks, and resources, plus a FaultSchedule entity that generates activation/deactivation events during simulation bootstrap.
Fault ¶
Bases: Protocol
Protocol that all fault types implement.
generate_events ¶
Generate activation/deactivation events for this fault.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ctx | FaultContext | Resolution context with entity/network/resource lookups. | required |
Returns:
| Type | Description |
|---|---|
| list[Event] | Events to schedule for fault activation and deactivation. |
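As a rough illustration of the protocol, the sketch below defines a custom fault that satisfies it. It is self-contained: `Event`, `FaultContext`, and `Fault` here are simplified stand-ins rather than the library's classes, and `FlagFault` with its `action` and `target` fields is hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class Event:
    """Stand-in for the library's Event; these fields are hypothetical."""
    time: float
    action: str
    target: str


@dataclass
class FaultContext:
    """Stand-in context that carries only the entity lookup."""
    entities: dict[str, object]


@runtime_checkable
class Fault(Protocol):
    """Stand-in for the protocol described above."""
    def generate_events(self, ctx: FaultContext) -> list[Event]: ...


@dataclass
class FlagFault:
    """Hypothetical fault that is active on one entity for a time window."""
    entity_name: str
    start: float
    end: float

    def generate_events(self, ctx: FaultContext) -> list[Event]:
        # Resolve the name through the context so typos fail early.
        if self.entity_name not in ctx.entities:
            raise KeyError(f"unknown entity: {self.entity_name}")
        return [
            Event(self.start, "activate", self.entity_name),
            Event(self.end, "deactivate", self.entity_name),
        ]


ctx = FaultContext(entities={"server": object()})
events = FlagFault("server", start=10.0, end=20.0).generate_events(ctx)
```

Because the protocol is structural, `FlagFault` does not need to inherit from `Fault`; providing a matching `generate_events` method is enough.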
FaultContext dataclass ¶
FaultContext(
    entities: dict[str, Entity],
    networks: dict[str, Network],
    resources: dict[str, Resource],
    start_time: Instant,
)
Resolution context passed to faults during event generation.
Built by FaultSchedule.start() from the simulation's registered
entities, networks, and resources.
Attributes:
| Name | Type | Description |
|---|---|---|
| entities | dict[str, Entity] | Name-to-Entity lookup (all registered entities). |
| networks | dict[str, Network] | Name-to-Network lookup. |
| resources | dict[str, Resource] | Name-to-Resource lookup. |
| start_time | Instant | Simulation start time. |
FaultHandle ¶
Handle returned by FaultSchedule.add() for manual cancellation.
Cancelling a handle marks all its pending fault events as cancelled so they are skipped by the simulation loop.
Attributes:
| Name | Type | Description |
|---|---|---|
| fault | | The fault this handle controls. |
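The cancellation behavior described above can be sketched with a plain heap. This is a minimal stand-in, not the library's implementation: the `Event` fields, the `Handle` class, and the `run` loop are all hypothetical names chosen for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Event:
    """Stand-in event, ordered by time only."""
    time: float
    name: str = field(compare=False)
    cancelled: bool = field(default=False, compare=False)


class Handle:
    """Hypothetical handle that remembers the events of one fault."""

    def __init__(self, events):
        self.events = events

    def cancel(self):
        # Mark every pending event; nothing is removed from the heap.
        for event in self.events:
            event.cancelled = True


def run(heap):
    """Pop events in time order, skipping the cancelled ones."""
    fired = []
    while heap:
        event = heapq.heappop(heap)
        if event.cancelled:
            continue
        fired.append(event.name)
    return fired


crash = [Event(30.0, "crash"), Event(45.0, "restart")]
pause = [Event(10.0, "pause"), Event(12.0, "resume")]
heap = crash + pause
heapq.heapify(heap)

Handle(crash).cancel()
fired = run(heap)
```

Marking events instead of removing them avoids searching the heap; the cost is that cancelled events stay queued until they are popped.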
FaultStats dataclass ¶
FaultStats(
    faults_scheduled: int,
    faults_activated: int,
    faults_deactivated: int,
    faults_cancelled: int,
)
Summary of fault injection activity.
Attributes:
| Name | Type | Description |
|---|---|---|
| faults_scheduled | int | Number of faults added to the schedule. |
| faults_activated | int | Number of fault activations that fired. |
| faults_deactivated | int | Number of fault deactivations that fired. |
| faults_cancelled | int | Number of faults cancelled before activation. |
InjectLatency dataclass ¶
InjectLatency(
    source_name: str,
    dest_name: str,
    extra_ms: float,
    start: float,
    end: float,
    network_name: str | None = None,
)
Add extra latency to a network link for a time window.
At start, replaces the link's latency with a compound distribution
that adds extra_ms milliseconds. At end, restores the original.
Attributes:
| Name | Type | Description |
|---|---|---|
| source_name | str | Source entity name for the link. |
| dest_name | str | Destination entity name for the link. |
| extra_ms | float | Extra latency to add in milliseconds. |
| start | float | Fault activation time in seconds. |
| end | float | Fault deactivation time in seconds. |
| network_name | str \| None | Network to target. None = use first found. |
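The replace-then-restore pattern can be sketched as follows. This assumes, purely for illustration, that a link's latency is a zero-argument callable returning seconds; the `Link` class and `with_extra_latency` helper are hypothetical and do not reflect the library's distribution types.

```python
from typing import Callable

LatencySampler = Callable[[], float]  # returns a latency in seconds


def with_extra_latency(base: LatencySampler, extra_ms: float) -> LatencySampler:
    """Wrap a sampler so every sample carries a fixed extra delay."""
    extra_s = extra_ms / 1000.0  # milliseconds to seconds

    def sample() -> float:
        return base() + extra_s

    return sample


class Link:
    """Hypothetical link holding a swappable latency sampler."""

    def __init__(self, latency: LatencySampler):
        self.latency = latency


link = Link(latency=lambda: 0.010)  # constant 10 ms base latency

# Activation: keep the original and install the compound sampler.
original = link.latency
link.latency = with_extra_latency(original, extra_ms=50.0)
during = link.latency()

# Deactivation: put the original sampler back.
link.latency = original
after = link.latency()
```

Wrapping the original sampler, rather than replacing it with a constant, preserves whatever variability the base latency already had.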
InjectPacketLoss dataclass ¶
InjectPacketLoss(
    source_name: str,
    dest_name: str,
    loss_rate: float,
    start: float,
    end: float,
    network_name: str | None = None,
)
Inject additional packet loss on a link for a time window.
At start, increases the link's packet_loss_rate. At end,
restores the original rate.
Attributes:
| Name | Type | Description |
|---|---|---|
| source_name | str | Source entity name for the link. |
| dest_name | str | Destination entity name for the link. |
| loss_rate | float | Additional loss rate to add [0, 1]. |
| start | float | Fault activation time in seconds. |
| end | float | Fault deactivation time in seconds. |
| network_name | str \| None | Network to target. None = use first found. |
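One plausible way to combine the rates is plain addition clamped to 1.0. That combination rule is an assumption made for this sketch, not something the text above specifies, and `degraded_loss_rate` is a hypothetical helper.

```python
def degraded_loss_rate(original: float, extra: float) -> float:
    """Return the loss rate during the fault window.

    Assumes the extra rate is added to the original and clamped to 1.0.
    """
    if not 0.0 <= extra <= 1.0:
        raise ValueError("extra loss rate must be within [0, 1]")
    return min(1.0, original + extra)


baseline = 0.25
during = degraded_loss_rate(baseline, 0.5)     # inside the fault window
saturated = degraded_loss_rate(0.8, 0.5)       # clamped at total loss
after = baseline                               # original rate is restored
```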
NetworkPartition dataclass ¶
NetworkPartition(
    group_a: list[str],
    group_b: list[str],
    start: float,
    end: float,
    asymmetric: bool = False,
    network_name: str | None = None,
)
Create a network partition between two groups for a time window.
At start, calls network.partition() to block traffic between
groups. At end, heals the partition.
Attributes:
| Name | Type | Description |
|---|---|---|
| group_a | list[str] | Entity names for group A. |
| group_b | list[str] | Entity names for group B. |
| start | float | Partition start time in seconds. |
| end | float | Partition end time in seconds. |
| asymmetric | bool | If True, only block A -> B traffic. |
| network_name | str \| None | Network to target. None = use first found. |
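The effect of the asymmetric flag can be made concrete by listing which directed (source, destination) pairs end up blocked. The `blocked_pairs` helper below is hypothetical and only models the semantics described in the table; it does not call network.partition().

```python
def blocked_pairs(group_a, group_b, asymmetric=False):
    """Return the set of directed (source, dest) pairs that are blocked."""
    # A -> B traffic is blocked in both modes.
    pairs = {(a, b) for a in group_a for b in group_b}
    if not asymmetric:
        # A symmetric partition also blocks B -> A.
        pairs |= {(b, a) for a in group_a for b in group_b}
    return pairs


symmetric = blocked_pairs(["n1"], ["n2", "n3"])
one_way = blocked_pairs(["n1"], ["n2", "n3"], asymmetric=True)
```

With `asymmetric=True`, nodes in group B can still reach group A, which is useful for modeling one-way reachability failures.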
RandomPartition dataclass ¶
RandomPartition(
    nodes: list[str],
    mtbf: float,
    mttr: float,
    seed: int | None = None,
    network_name: str | None = None,
)
Jepsen-style random partition injection (recurring).
Schedules fault/heal cycles using exponentially distributed intervals. Each cycle randomly splits nodes into two groups, creates a partition, then heals after a random repair time.
The self-scheduling chain (like Source's self-perpetuation) uses
Event.once() callbacks that schedule the next event.
Attributes:
| Name | Type | Description |
|---|---|---|
| nodes | list[str] | Entity names that can be partitioned. |
| mtbf | float | Mean time between failures in seconds. |
| mttr | float | Mean time to repair in seconds. |
| seed | int \| None | Random seed for reproducibility. |
| network_name | str \| None | Network to target. None = use first found. |
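The fault/heal cycle can be approximated offline with the standard library. This sketch precomputes the cycles up to a hypothetical `horizon` instead of chaining callbacks, and it assumes the time to the next failure is measured from the previous heal; both choices are simplifications, and `partition_cycles` is not a library function.

```python
import random


def partition_cycles(nodes, mtbf, mttr, horizon, seed=None):
    """Return (start, heal, group_a, group_b) tuples up to the horizon.

    Requires at least two nodes so both groups are non-empty.
    """
    rng = random.Random(seed)  # private RNG keeps runs reproducible
    cycles = []
    t = 0.0
    while True:
        # Exponential gap with mean mtbf until the next partition.
        t += rng.expovariate(1.0 / mtbf)
        if t >= horizon:
            break
        shuffled = list(nodes)
        rng.shuffle(shuffled)
        cut = rng.randint(1, len(shuffled) - 1)  # both sides non-empty
        # Exponential repair time with mean mttr.
        heal = t + rng.expovariate(1.0 / mttr)
        cycles.append((t, heal, shuffled[:cut], shuffled[cut:]))
        t = heal
    return cycles


cycles = partition_cycles(["a", "b", "c"], mtbf=10.0, mttr=2.0,
                          horizon=200.0, seed=42)
```

Passing the same seed yields the same sequence of partitions, which is what makes a failing run reproducible.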
CrashNode dataclass ¶
Crash a node at a specific time, optionally restart later.
Sets entity._crashed = True at crash time, causing all events
targeting the entity to be silently dropped. If restart_at is
provided, clears the flag at that time.
Attributes:
| Name | Type | Description |
|---|---|---|
| entity_name | str | Name of the entity to crash. |
| at | float | Crash time in seconds. |
| restart_at | float \| None | Optional restart time in seconds. None = permanent crash. |
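The drop-while-crashed behavior can be sketched with a toy timeline. The `Entity` class and `run` loop below are stand-ins written for this example; only the `_crashed` flag name comes from the description above.

```python
class Entity:
    """Stand-in entity with the crash flag described above."""

    def __init__(self, name):
        self.name = name
        self._crashed = False
        self.received = []


def run(entity, timeline):
    """Process (time, kind, payload) items in time order.

    Returns the number of requests dropped while the entity was crashed.
    """
    dropped = 0
    for _time, kind, payload in sorted(timeline, key=lambda item: item[0]):
        if kind == "crash":
            entity._crashed = True
        elif kind == "restart":
            entity._crashed = False
        elif entity._crashed:
            dropped += 1  # silently dropped, no error raised
        else:
            entity.received.append(payload)
    return dropped


server = Entity("server")
timeline = [
    (10.0, "request", "r1"),
    (30.0, "crash", None),
    (35.0, "request", "r2"),
    (45.0, "restart", None),
    (50.0, "request", "r3"),
]
dropped = run(server, timeline)
```

Requests that arrive between the crash and the restart are lost rather than queued, so callers that need delivery have to retry.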
PauseNode dataclass ¶
Pause a node (freeze processing) for a time window, then resume.
Semantically identical to CrashNode but uses start/end naming
to emphasize the temporary nature of the fault.
Attributes:
| Name | Type | Description |
|---|---|---|
| entity_name | str | Name of the entity to pause. |
| start | float | Pause start time in seconds. |
| end | float | Resume time in seconds. |
ReduceCapacity dataclass ¶
Temporarily reduce a resource's capacity.
At start, multiplies the resource's capacity by factor
(e.g., 0.5 = halve). At end, restores the original capacity.
Attributes:
| Name | Type | Description |
|---|---|---|
| resource_name | str | Name of the resource to degrade. |
| factor | float | Capacity multiplier (0 < factor < 1 to reduce). |
| start | float | Fault activation time in seconds. |
| end | float | Fault deactivation time in seconds. |
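The multiply-then-restore step can be sketched with a closure that remembers the original value. `Resource` here is a stand-in with a single `capacity` attribute, and `reduce_capacity` is a hypothetical helper rather than a library call.

```python
class Resource:
    """Stand-in resource with a mutable capacity."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity


def reduce_capacity(resource, factor):
    """Scale the capacity by factor and return a function that undoes it."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    original = resource.capacity
    resource.capacity = original * factor

    def restore():
        # Restore the saved value instead of dividing by factor,
        # which avoids accumulating floating-point error.
        resource.capacity = original

    return restore


pool = Resource("db_pool", capacity=8)
restore = reduce_capacity(pool, 0.5)  # activation at start
during = pool.capacity
restore()                             # deactivation at end
after = pool.capacity
```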
FaultSchedule ¶
Bases: Entity
Orchestrates fault injection during simulation.
Collects faults via add() and generates their events during
start(), which is called by the Simulation during initialization.
Example::

    schedule = FaultSchedule()
    schedule.add(CrashNode("server", at=30.0, restart_at=45.0))
    sim = Simulation(sources=[...], entities=[...], fault_schedule=schedule)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Identifier for logging. | 'FaultSchedule' |
add ¶
Register a fault for injection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fault | Fault | The fault to schedule. | required |
Returns:
| Type | Description |
|---|---|
| FaultHandle | A handle that can be used to cancel the fault before activation. |
start ¶
Generate fault events by resolving entity/network/resource references.
Called by Simulation.__init__() during bootstrap.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_time | Instant | The simulation's start time. | required |
| sim | Simulation | The simulation instance (used to resolve names). | required |
Returns:
| Type | Description |
|---|---|
| list[Event] | All fault events to push onto the heap. |
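The collect-then-generate flow of add() and start() can be sketched with a small stand-in. `MiniSchedule`, `WindowFault`, and the dict-based context are hypothetical simplifications: the real start() takes `start_time` and `sim`, and the real add() returns a FaultHandle rather than the fault itself.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Stand-in event with a time and a descriptive label."""
    time: float
    label: str


@dataclass
class WindowFault:
    """Hypothetical fault active on one entity between start and end."""
    target: str
    start: float
    end: float

    def generate_events(self, ctx):
        if self.target not in ctx["entities"]:
            raise KeyError(f"unknown entity: {self.target}")
        return [
            Event(self.start, f"activate:{self.target}"),
            Event(self.end, f"deactivate:{self.target}"),
        ]


class MiniSchedule:
    """Collects faults, then turns them into events in one pass."""

    def __init__(self, name="FaultSchedule"):
        self.name = name
        self._faults = []

    def add(self, fault):
        self._faults.append(fault)
        return fault  # simplification: the real method returns a handle

    def start(self, entities):
        # Build the lookup context once, then ask each fault for its events.
        ctx = {"entities": entities}
        events = []
        for fault in self._faults:
            events.extend(fault.generate_events(ctx))
        return sorted(events, key=lambda e: e.time)


schedule = MiniSchedule()
schedule.add(WindowFault("server", start=30.0, end=45.0))
schedule.add(WindowFault("cache", start=10.0, end=40.0))
events = schedule.start(entities={"server": object(), "cache": object()})
labels = [e.label for e in events]
```

Deferring name resolution to start() means faults can be declared before the entities they refer to exist.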