Faults

A declarative fault injection framework for scheduling faults and testing system resilience.

Provides fault types for nodes, networks, and resources, plus a FaultSchedule entity that generates activation/deactivation events during simulation bootstrap.

Fault

Bases: Protocol

Protocol that all fault types implement.

generate_events

generate_events(ctx: FaultContext) -> list[Event]

Generate activation/deactivation events for this fault.

Parameters:

Name Type Description Default
ctx FaultContext

Resolution context with entity/network/resource lookups.

required

Returns:

Type Description
list[Event]

Events to schedule for fault activation and deactivation.
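As a rough illustration of the protocol, the sketch below defines a custom fault that satisfies it structurally. `Event` and `FaultContext` here are simplified stand-ins rather than the library's classes, `FlagWindow` is a hypothetical fault, `start_time` is modelled as a plain float instead of an `Instant`, and treating `start`/`end` as offsets from `start_time` is an assumption.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Event:  # simplified stand-in for the library's Event
    time: float
    name: str


@dataclass
class FaultContext:  # stand-in carrying only the fields used below
    entities: dict[str, object] = field(default_factory=dict)
    start_time: float = 0.0  # assumption: a float instead of Instant


class Fault(Protocol):
    def generate_events(self, ctx: FaultContext) -> list[Event]: ...


@dataclass
class FlagWindow:
    """Hypothetical fault: one activation event and one deactivation event."""

    entity_name: str
    start: float
    end: float

    def generate_events(self, ctx: FaultContext) -> list[Event]:
        # Resolve the name through the context before scheduling anything.
        if self.entity_name not in ctx.entities:
            raise KeyError(f"unknown entity: {self.entity_name}")
        return [
            Event(ctx.start_time + self.start, f"activate:{self.entity_name}"),
            Event(ctx.start_time + self.end, f"deactivate:{self.entity_name}"),
        ]


ctx = FaultContext(entities={"server": object()}, start_time=0.0)
fault: Fault = FlagWindow("server", start=10.0, end=20.0)
events = fault.generate_events(ctx)
```

Because `Fault` is a `Protocol`, `FlagWindow` does not need to inherit from it; providing a matching `generate_events` method is enough.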

FaultContext dataclass

FaultContext(
    entities: dict[str, Entity],
    networks: dict[str, Network],
    resources: dict[str, Resource],
    start_time: Instant,
)

Resolution context passed to faults during event generation.

Built by FaultSchedule.start() from the simulation's registered entities, networks, and resources.

Attributes:

Name Type Description
entities dict[str, Entity]

Name-to-Entity lookup (all registered entities).

networks dict[str, Network]

Name-to-Network lookup.

resources dict[str, Resource]

Name-to-Resource lookup.

start_time Instant

Simulation start time.

FaultHandle

FaultHandle(fault: Fault)

Handle returned by FaultSchedule.add() for manual cancellation.

Cancelling a handle marks all its pending fault events as cancelled so they are skipped by the simulation loop.

Attributes:

Name Type Description
fault Fault

The fault this handle controls.

cancelled property

cancelled: bool

Whether this fault has been cancelled.

cancel

cancel() -> None

Cancel all pending events for this fault.
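The cancellation behaviour described above can be pictured with a small model. `PendingEvent`, `Handle`, and `run` are hypothetical stand-ins, not the library's classes; the sketch only assumes that each event carries a cancelled flag that the loop checks before firing.

```python
from dataclasses import dataclass, field


@dataclass
class PendingEvent:  # hypothetical stand-in for a scheduled fault event
    time: float
    label: str
    cancelled: bool = False


@dataclass
class Handle:  # simplified stand-in for FaultHandle
    events: list[PendingEvent] = field(default_factory=list)
    _cancelled: bool = False

    @property
    def cancelled(self) -> bool:
        return self._cancelled

    def cancel(self) -> None:
        # Mark the handle and every event it owns as cancelled.
        self._cancelled = True
        for ev in self.events:
            ev.cancelled = True


def run(queue: list[PendingEvent]) -> list[str]:
    """Process events in time order, skipping cancelled ones."""
    fired = []
    for ev in sorted(queue, key=lambda e: e.time):
        if ev.cancelled:
            continue
        fired.append(ev.label)
    return fired


crash = Handle([PendingEvent(30.0, "crash"), PendingEvent(45.0, "restart")])
pause = Handle([PendingEvent(10.0, "pause"), PendingEvent(12.0, "resume")])
crash.cancel()
fired = run(crash.events + pause.events)
```

Cancelled events stay in the queue and are skipped when reached, which avoids having to remove them from the middle of a heap.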

FaultStats dataclass

FaultStats(
    faults_scheduled: int,
    faults_activated: int,
    faults_deactivated: int,
    faults_cancelled: int,
)

Summary of fault injection activity.

Attributes:

Name Type Description
faults_scheduled int

Number of faults added to the schedule.

faults_activated int

Number of fault activations that fired.

faults_deactivated int

Number of fault deactivations that fired.

faults_cancelled int

Number of faults cancelled before activation.

InjectLatency dataclass

InjectLatency(
    source_name: str,
    dest_name: str,
    extra_ms: float,
    start: float,
    end: float,
    network_name: str | None = None,
)

Add extra latency to a network link for a time window.

At start, replaces the link's latency with a compound distribution that adds extra_ms milliseconds to every sample. At end, restores the original latency distribution.

Attributes:

Name Type Description
source_name str

Source entity name for the link.

dest_name str

Destination entity name for the link.

extra_ms float

Extra latency to add in milliseconds.

start float

Fault activation time in seconds.

end float

Fault deactivation time in seconds.

network_name str | None

Network to target. None = use first found.
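One way to picture the compound distribution is as a wrapper around the link's existing sampler. This is a minimal sketch under assumptions: `Link`, `Sampler`, and `with_extra_latency` are hypothetical names, and latencies are assumed to be sampled in seconds.

```python
import random
from typing import Callable

Sampler = Callable[[], float]  # assumption: returns a latency in seconds


def with_extra_latency(base: Sampler, extra_ms: float) -> Sampler:
    """Wrap a latency sampler so every sample gains extra_ms milliseconds."""
    extra_s = extra_ms / 1000.0
    return lambda: base() + extra_s


class Link:  # hypothetical stand-in for a network link
    def __init__(self, latency: Sampler) -> None:
        self.latency = latency


rng = random.Random(0)
link = Link(latency=lambda: rng.uniform(0.010, 0.020))

original = link.latency  # saved when the fault activates
link.latency = with_extra_latency(original, extra_ms=50.0)
during = link.latency()  # base sample plus 50 ms
link.latency = original  # restored when the fault deactivates
after = link.latency()
```

Wrapping the original sampler instead of replacing it keeps the base distribution's shape and makes restoration a simple reassignment.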

InjectPacketLoss dataclass

InjectPacketLoss(
    source_name: str,
    dest_name: str,
    loss_rate: float,
    start: float,
    end: float,
    network_name: str | None = None,
)

Inject additional packet loss on a link for a time window.

At start, increases the link's packet_loss_rate. At end, restores the original rate.

Attributes:

Name Type Description
source_name str

Source entity name for the link.

dest_name str

Destination entity name for the link.

loss_rate float

Additional loss rate to add [0, 1].

start float

Fault activation time in seconds.

end float

Fault deactivation time in seconds.

network_name str | None

Network to target. None = use first found.
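The arithmetic can be sketched as follows. `Link` and `degraded_loss_rate` are hypothetical, and clamping the sum to 1.0 is an assumption; the source only states that the rate is increased and later restored.

```python
def degraded_loss_rate(base_rate: float, loss_rate: float) -> float:
    """Add loss_rate to base_rate, clamped so the result stays within [0, 1]."""
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be in [0, 1]")
    return min(1.0, base_rate + loss_rate)


class Link:  # hypothetical stand-in for a network link
    def __init__(self, packet_loss_rate: float) -> None:
        self.packet_loss_rate = packet_loss_rate


link = Link(packet_loss_rate=0.25)

original = link.packet_loss_rate  # saved when the fault activates
link.packet_loss_rate = degraded_loss_rate(original, 0.5)
during = link.packet_loss_rate
link.packet_loss_rate = original  # restored when the fault deactivates
```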

NetworkPartition dataclass

NetworkPartition(
    group_a: list[str],
    group_b: list[str],
    start: float,
    end: float,
    asymmetric: bool = False,
    network_name: str | None = None,
)

Create a network partition between two groups for a time window.

At start, calls network.partition() to block traffic between groups. At end, heals the partition.

Attributes:

Name Type Description
group_a list[str]

Entity names for group A.

group_b list[str]

Entity names for group B.

start float

Partition start time in seconds.

end float

Partition end time in seconds.

asymmetric bool

If True, only block A -> B traffic.

network_name str | None

Network to target. None = use first found.
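The effect of the asymmetric flag can be illustrated by listing which directed (sender, receiver) pairs would be blocked. `blocked_pairs` is a hypothetical helper written for this illustration, not part of the library.

```python
def blocked_pairs(
    group_a: list[str], group_b: list[str], asymmetric: bool = False
) -> set[tuple[str, str]]:
    """Return the directed (sender, receiver) pairs a partition would block."""
    pairs = {(a, b) for a in group_a for b in group_b}  # A -> B
    if not asymmetric:
        pairs |= {(b, a) for a in group_a for b in group_b}  # B -> A as well
    return pairs


symmetric = blocked_pairs(["n1", "n2"], ["n3"])
one_way = blocked_pairs(["n1", "n2"], ["n3"], asymmetric=True)
```

With `asymmetric=True`, `n3` can still send to `n1` and `n2`, which models one-way failures such as a misconfigured firewall rule.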

RandomPartition dataclass

RandomPartition(
    nodes: list[str],
    mtbf: float,
    mttr: float,
    seed: int | None = None,
    network_name: str | None = None,
)

Jepsen-style random partition injection (recurring).

Schedules fault/heal cycles using exponentially distributed intervals. Each cycle randomly splits nodes into two groups, creates a partition, then heals after a random repair time.

The cycles form a self-scheduling chain, similar to Source's self-perpetuation: each Event.once() callback schedules the next event.

Attributes:

Name Type Description
nodes list[str]

Entity names that can be partitioned.

mtbf float

Mean time between failures in seconds.

mttr float

Mean time to repair in seconds.

seed int | None

Random seed for reproducibility.

network_name str | None

Network to target. None = use first found.

CrashNode dataclass

CrashNode(
    entity_name: str,
    at: float,
    restart_at: float | None = None,
)

Crash a node at a specific time, optionally restart later.

Sets entity._crashed = True at crash time, causing all events targeting the entity to be silently dropped. If restart_at is provided, clears the flag at that time.

Attributes:

Name Type Description
entity_name str

Name of the entity to crash.

at float

Crash time in seconds.

restart_at float | None

Optional restart time in seconds. None = permanent crash.
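The drop-while-crashed behaviour can be modelled in a few lines. `Node` and `deliver` are hypothetical stand-ins for the entity and the simulation loop's dispatch step; only the `_crashed` flag comes from the description above.

```python
class Node:  # hypothetical stand-in for an entity
    def __init__(self, name: str) -> None:
        self.name = name
        self._crashed = False
        self.handled: list[tuple[float, str]] = []

    def handle(self, t: float, payload: str) -> None:
        self.handled.append((t, payload))


def deliver(node: Node, t: float, payload: str) -> bool:
    """Dispatch an event unless the target is crashed; report whether it ran."""
    if node._crashed:
        return False  # silently dropped
    node.handle(t, payload)
    return True


server = Node("server")
timeline = [
    (10.0, "req-1"),
    (30.0, "crash"),  # corresponds to at=30.0
    (35.0, "req-2"),
    (45.0, "restart"),  # corresponds to restart_at=45.0
    (50.0, "req-3"),
]
for t, what in timeline:
    if what == "crash":
        server._crashed = True
    elif what == "restart":
        server._crashed = False
    else:
        deliver(server, t, what)
```

Events dropped during the crash window are not replayed after restart, so `req-2` is lost.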

PauseNode dataclass

PauseNode(entity_name: str, start: float, end: float)

Pause a node (freeze processing) for a time window, then resume.

Semantically identical to CrashNode but uses start/end naming to emphasize the temporary nature of the fault.

Attributes:

Name Type Description
entity_name str

Name of the entity to pause.

start float

Pause start time in seconds.

end float

Resume time in seconds.

ReduceCapacity dataclass

ReduceCapacity(
    resource_name: str,
    factor: float,
    start: float,
    end: float,
)

Temporarily reduce a resource's capacity.

At start, multiplies the resource's capacity by factor (e.g., 0.5 = halve). At end, restores the original capacity.

Attributes:

Name Type Description
resource_name str

Name of the resource to degrade.

factor float

Capacity multiplier (0 < factor < 1 to reduce).

start float

Fault activation time in seconds.

end float

Fault deactivation time in seconds.
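The capacity change amounts to a multiply on activation and a restore on deactivation. In this sketch `Resource` and `reduced_capacity` are hypothetical, and rejecting factors outside (0, 1) is an assumption based on the attribute description.

```python
class Resource:  # hypothetical stand-in for a simulation resource
    def __init__(self, capacity: float) -> None:
        self.capacity = capacity


def reduced_capacity(capacity: float, factor: float) -> float:
    """Scale capacity by factor; values outside (0, 1) are rejected here."""
    if not 0.0 < factor < 1.0:
        raise ValueError("factor must satisfy 0 < factor < 1")
    return capacity * factor


cpu = Resource(capacity=8.0)

original = cpu.capacity  # saved when the fault activates
cpu.capacity = reduced_capacity(original, factor=0.5)
during = cpu.capacity
cpu.capacity = original  # restored when the fault deactivates
```

Restoring the saved value instead of dividing by `factor` avoids accumulating floating-point error across repeated faults.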

FaultSchedule

FaultSchedule(name: str = 'FaultSchedule')

Bases: Entity

Orchestrates fault injection during simulation.

Collects faults via add() and generates their events during start(), which is called by the Simulation during initialization.

Example:

schedule = FaultSchedule()
schedule.add(CrashNode("server", at=30.0, restart_at=45.0))
sim = Simulation(sources=[...], entities=[...], fault_schedule=schedule)

Parameters:

Name Type Description Default
name str

Identifier for logging. Defaults to "FaultSchedule".

'FaultSchedule'

stats property

stats: FaultStats

Frozen snapshot of fault injection statistics.

add

add(fault: Fault) -> FaultHandle

Register a fault for injection.

Parameters:

Name Type Description Default
fault Fault

The fault to schedule.

required

Returns:

Type Description
FaultHandle

A handle that can be used to cancel the fault before activation.

start

start(start_time: Instant, sim: Simulation) -> list[Event]

Generate fault events by resolving entity/network/resource references.

Called by Simulation.__init__() during bootstrap.

Parameters:

Name Type Description Default
start_time Instant

The simulation's start time.

required
sim Simulation

The simulation instance (used to resolve names).

required

Returns:

Type Description
list[Event]

All fault events to push onto the heap.
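The overall flow of start() (build a lookup context, ask each fault for its events, hand the result to the event heap) might look roughly like this. `MiniSchedule`, `Window`, and the dict-based context are hypothetical simplifications, and the heap push stands in for what the simulation does with the returned list.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Event:  # simplified stand-in for the library's Event
    time: float
    label: str


@dataclass
class Window:  # hypothetical fault with an activation and a deactivation
    target: str
    start: float
    end: float

    def generate_events(self, ctx: dict) -> list[Event]:
        ctx["entities"][self.target]  # assumption: unknown names raise KeyError
        return [
            Event(self.start, f"on:{self.target}"),
            Event(self.end, f"off:{self.target}"),
        ]


@dataclass
class MiniSchedule:  # simplified stand-in for FaultSchedule
    faults: list[Window] = field(default_factory=list)

    def add(self, fault: Window) -> Window:
        self.faults.append(fault)
        return fault

    def start(self, entity_names: list[str]) -> list[Event]:
        # Build the name lookup, then collect events from every fault.
        ctx = {"entities": {name: object() for name in entity_names}}
        return [ev for f in self.faults for ev in f.generate_events(ctx)]


schedule = MiniSchedule()
schedule.add(Window("db", start=20.0, end=40.0))
schedule.add(Window("server", start=10.0, end=30.0))

heap: list[tuple[float, int, str]] = []
for seq, ev in enumerate(schedule.start(["db", "server"])):
    heapq.heappush(heap, (ev.time, seq, ev.label))  # seq breaks time ties
order = [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

The faults are added in one order, but the heap releases their events by time, so the order in which add() is called does not affect when faults fire.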

handle_event

handle_event(event: Event) -> None

FaultSchedule does not process events itself.