Resilience

Circuit breakers, bulkheads, and retry policies for fault-tolerant systems.

Resilience patterns for fault tolerance.

This module provides components for building resilient systems that handle failures gracefully.

Components

- CircuitBreaker: Fails fast when downstream is unhealthy
- Bulkhead: Isolates resources to prevent cascade failures
- TimeoutWrapper: Wraps services with timeout handling
- Fallback: Provides fallback behavior on failure
- Hedge: Sends redundant requests to reduce tail latency

Example

from happysimulator.components.resilience import (
    CircuitBreaker,
    Bulkhead,
    TimeoutWrapper,
    Fallback,
    Hedge,
)

# Compose multiple resilience patterns
hedge = Hedge("hedge", backend, hedge_delay=0.050)
timeout = TimeoutWrapper("timeout", hedge, timeout=5.0)
breaker = CircuitBreaker("breaker", timeout, failure_threshold=5)
bulkhead = Bulkhead("bulkhead", breaker, max_concurrent=10)
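In the composition above, each wrapper takes the next one as its target, so the last wrapper constructed (the bulkhead) is the outermost layer and sees a request first. The following is a minimal standalone sketch of that nesting order. `Layer` and `request_path` are hypothetical stand-ins introduced for illustration; they are not part of happysimulator.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical stand-in for a wrapper: a name plus the entity it wraps."""
    name: str
    target: Layer | None = None


def request_path(outermost: Layer) -> list[str]:
    """Return layer names in the order a request would traverse them."""
    path = []
    layer = outermost
    while layer is not None:
        path.append(layer.name)
        layer = layer.target
    return path


# Same nesting as the example: each layer wraps the previous one.
backend = Layer("backend")
hedge = Layer("hedge", backend)
timeout = Layer("timeout", hedge)
breaker = Layer("breaker", timeout)
bulkhead = Layer("bulkhead", breaker)

print(request_path(bulkhead))
```

Read outermost to innermost, the path is bulkhead, breaker, timeout, hedge, backend.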

Bulkhead

Bulkhead(
    name: str,
    target: Entity,
    max_concurrent: int,
    max_wait_queue: int = 0,
    max_wait_time: float | None = None,
)

Bases: Entity

Isolates resources by limiting concurrent access.

The bulkhead limits the number of concurrent requests to a target service. Additional requests can optionally wait in a queue with a timeout. When both concurrent and queue limits are reached, requests are rejected immediately.

This prevents a slow or failing service from consuming all resources and affecting other parts of the system.
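The admission rule described above can be sketched without the simulator. This is a simplified model under stated assumptions, not the library's implementation: `BulkheadSketch`, `try_enter`, and `leave` are hypothetical names, and the `max_wait_time` queue timeout is left out.

```python
from __future__ import annotations

from collections import deque


class BulkheadSketch:
    """Hypothetical model of the admission rule (not the library class)."""

    def __init__(self, max_concurrent: int, max_wait_queue: int = 0) -> None:
        if max_concurrent < 1 or max_wait_queue < 0:
            raise ValueError("max_concurrent must be >= 1 and max_wait_queue >= 0")
        self.max_concurrent = max_concurrent
        self.max_wait_queue = max_wait_queue
        self.active = 0
        self.waiting: deque[str] = deque()

    def try_enter(self, request_id: str) -> str:
        """Admit, queue, or reject a request."""
        if self.active < self.max_concurrent:
            self.active += 1
            return "accepted"
        if len(self.waiting) < self.max_wait_queue:
            self.waiting.append(request_id)
            return "queued"
        return "rejected"

    def leave(self) -> str | None:
        """Release one slot; hand it to the oldest waiter if there is one."""
        if self.waiting:
            # The freed slot passes directly to the waiter, so `active` is unchanged.
            return self.waiting.popleft()
        self.active -= 1
        return None


bh = BulkheadSketch(max_concurrent=2, max_wait_queue=1)
print([bh.try_enter(f"r{i}") for i in range(4)])
```

With two slots and a queue of one, the four requests are accepted, accepted, queued, and rejected, in that order.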

Attributes:

- name: Bulkhead identifier.
- target (Entity): The service being protected.
- max_concurrent (int): Maximum concurrent requests allowed.

Initialize the bulkhead.

Parameters:

- name (str, required): Bulkhead identifier.
- target (Entity, required): The downstream entity to protect.
- max_concurrent (int, required): Maximum concurrent requests to target.
- max_wait_queue (int, default 0): Maximum requests that can wait in queue. 0 means no queuing (immediate reject when full).
- max_wait_time (float | None, default None): Maximum time a request can wait in queue. None means no timeout (wait indefinitely).

Raises:

- ValueError: If parameters are invalid.

target property

target: Entity

The protected target entity.

max_concurrent property

max_concurrent: int

Maximum concurrent requests allowed.

max_wait_queue property

max_wait_queue: int

Maximum requests that can wait in queue.

max_wait_time property

max_wait_time: float | None

Maximum wait time in queue.

active_count property

active_count: int

Number of currently active requests.

queue_depth property

queue_depth: int

Number of requests waiting in queue.

available_permits property

available_permits: int

Number of available concurrent slots.

stats property

stats: BulkheadStats

Frozen snapshot of bulkhead statistics.

set_clock

set_clock(clock: Clock) -> None

Inject clock and propagate to target.
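Because every wrapper forwards the clock to its target, setting the clock on the outermost wrapper is enough for a whole chain. A minimal sketch of that propagation, assuming nothing about the real `Clock` or `Entity` classes (`ClockSketch` and `NodeSketch` are hypothetical):

```python
class ClockSketch:
    """Hypothetical simulation clock holding the current time in seconds."""

    def __init__(self) -> None:
        self.now = 0.0


class NodeSketch:
    """Hypothetical entity that stores a clock and forwards it to its target."""

    def __init__(self, name, target=None):
        self.name = name
        self.target = target
        self.clock = None

    def set_clock(self, clock):
        self.clock = clock
        if self.target is not None:
            # Propagate so the whole chain shares one notion of time.
            self.target.set_clock(clock)


backend = NodeSketch("backend")
bulkhead = NodeSketch("bulkhead", backend)
clock = ClockSketch()
bulkhead.set_clock(clock)
print(backend.clock is clock)
```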

handle_event

handle_event(
    event: Event,
) -> (
    Generator[float, None, list[Event] | Event | None]
    | list[Event]
    | Event
    | None
)

Handle incoming events.

Routes requests through the bulkhead logic and handles responses from the target.
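The return annotation allows `handle_event` to be either a plain function or a generator that yields floats and finally returns events. The sketch below shows how such a generator can be driven and its return value recovered from `StopIteration`. It assumes the yielded floats are delays in seconds, which the signature suggests but this page does not state; `EventSketch`, `handler`, and `drive` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Generator


@dataclass
class EventSketch:
    """Hypothetical event: just a label for illustration."""
    label: str


def handler(event: EventSketch) -> Generator[float, None, list[EventSketch]]:
    """Yield a delay (assumed to be seconds), then return follow-up events."""
    yield 0.25  # simulated processing time
    return [EventSketch(f"{event.label}:done")]


def drive(gen: Generator[float, None, list[EventSketch]]) -> tuple[float, list[EventSketch]]:
    """Exhaust a handler generator, summing yielded delays and capturing its return value."""
    elapsed = 0.0
    while True:
        try:
            elapsed += next(gen)
        except StopIteration as stop:
            # A generator's `return` value travels on StopIteration.value.
            return elapsed, stop.value


elapsed, events = drive(handler(EventSketch("req")))
print(elapsed, events)
```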

Parameters:

- event (Event, required): The event to handle.

Returns:

- Generator[float, None, list[Event] | Event | None] | list[Event] | Event | None: Events to schedule, or None if rejected.

BulkheadStats dataclass

BulkheadStats(
    total_requests: int = 0,
    accepted_requests: int = 0,
    rejected_requests: int = 0,
    timed_out_requests: int = 0,
    queued_requests: int = 0,
    peak_concurrent: int = 0,
    peak_queue_depth: int = 0,
)

Frozen snapshot of Bulkhead statistics.
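A snapshot like this is convenient for computing derived metrics after a run. The sketch below uses a local mirror of the documented fields so that it runs on its own. `BulkheadStatsSketch` and `rejection_rate` are hypothetical, and treating the real dataclass as `frozen=True` is an assumption based on the word "frozen" above.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class BulkheadStatsSketch:
    """Local mirror of the documented fields, for illustration only."""
    total_requests: int = 0
    accepted_requests: int = 0
    rejected_requests: int = 0
    timed_out_requests: int = 0
    queued_requests: int = 0
    peak_concurrent: int = 0
    peak_queue_depth: int = 0


def rejection_rate(stats: BulkheadStatsSketch) -> float:
    """Fraction of requests rejected; 0.0 when no requests were seen."""
    if stats.total_requests == 0:
        return 0.0
    return stats.rejected_requests / stats.total_requests


snapshot = BulkheadStatsSketch(
    total_requests=200, accepted_requests=150, rejected_requests=50
)
print(rejection_rate(snapshot))  # 0.25

try:
    snapshot.total_requests = 0  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("snapshot is read-only")
```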

CircuitBreaker

CircuitBreaker(
    name: str,
    target: Entity,
    failure_threshold: int = 5,
    success_threshold: int = 2,
    timeout: float = 30.0,
    half_open_max_requests: int = 1,
    failure_predicate: Callable[[Event], bool]
    | None = None,
    on_state_change: Callable[
        [CircuitState, CircuitState], None
    ]
    | None = None,
)

Bases: Entity

Implements the circuit breaker pattern.

The circuit breaker monitors requests to a target service and tracks failures. When consecutive failures reach failure_threshold, the circuit opens and subsequent requests fail fast without calling the target. After timeout seconds in the open state, the circuit enters the half-open state to test whether the target has recovered.

States

- CLOSED: Normal operation. Requests forwarded to target. Failures tracked. Opens after failure_threshold failures.
- OPEN: Failing fast. Requests rejected immediately. Transitions to HALF_OPEN after timeout expires.
- HALF_OPEN: Testing recovery. Limited requests allowed through. success_threshold successes -> CLOSED. Any failure -> OPEN.
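The transitions listed above can be sketched as a small state machine. This is a simplified model under stated assumptions, not the library's implementation: `State` and `BreakerSketch` are hypothetical names, time is passed in explicitly as `now`, the circuit opens when the consecutive failure count reaches the threshold, and the half-open concurrency limit is omitted.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class BreakerSketch:
    """Hypothetical model of the documented transitions (not the library class)."""

    def __init__(self, failure_threshold=5, success_threshold=2, timeout=30.0):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout = timeout
        self.state = State.CLOSED
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a request may pass at time `now` (seconds)."""
        if self.state is State.OPEN and now - self.opened_at >= self.timeout:
            self.state = State.HALF_OPEN  # timeout expired: probe the target
            self.successes = 0
        return self.state is not State.OPEN

    def record_success(self) -> None:
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0  # a success breaks the consecutive-failure streak

    def record_failure(self, now: float) -> None:
        if self.state is State.HALF_OPEN:
            self._open(now)  # any failure while probing reopens the circuit
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open(now)

    def _open(self, now: float) -> None:
        self.state = State.OPEN
        self.opened_at = now
        self.failures = 0


b = BreakerSketch(failure_threshold=2, success_threshold=1, timeout=10.0)
b.record_failure(now=0.0)
b.record_failure(now=1.0)   # second consecutive failure opens the circuit
print(b.allow(now=5.0))     # still inside the open window
print(b.allow(now=11.0))    # window elapsed, circuit moves to half-open
b.record_success()
print(b.state)
```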

Attributes:

- name: Circuit breaker identifier.
- target (Entity): The service being protected.
- state (CircuitState): Current circuit state.

Initialize the circuit breaker.

Parameters:

- name (str, required): Circuit breaker identifier.
- target (Entity, required): The downstream entity to protect.
- failure_threshold (int, default 5): Consecutive failures before opening circuit.
- success_threshold (int, default 2): Consecutive successes in half-open to close.
- timeout (float, default 30.0): Seconds in open state before transitioning to half-open.
- half_open_max_requests (int, default 1): Max concurrent requests in half-open state.
- failure_predicate (Callable[[Event], bool] | None, default None): Optional function to determine if response is failure. If None, only exceptions count as failures.
- on_state_change (Callable[[CircuitState, CircuitState], None] | None, default None): Optional callback when state changes.

Raises:

- ValueError: If thresholds or timeout are invalid.

target property

target: Entity

The protected target entity.

state property

state: CircuitState

Current circuit state.

failure_threshold property

failure_threshold: int

Number of failures before opening.

success_threshold property

success_threshold: int

Number of successes to close from half-open.

timeout property

timeout: float

Seconds before transitioning from open to half-open.

failure_count property

failure_count: int

Current consecutive failure count.

success_count property

success_count: int

Current consecutive success count (in half-open).

stats property

stats: CircuitBreakerStats

Frozen snapshot of circuit breaker statistics.

set_clock

set_clock(clock: Clock) -> None

Inject clock and propagate to target.

handle_event

handle_event(
    event: Event,
) -> (
    Generator[float, None, list[Event] | Event | None]
    | list[Event]
    | Event
    | None
)

Handle incoming events.

Routes requests through the circuit breaker logic and handles responses from the target.

Parameters:

- event (Event, required): The event to handle.

Returns:

- Generator[float, None, list[Event] | Event | None] | list[Event] | Event | None: Events to schedule, or None if rejected.

record_success

record_success() -> None

Manually record a success (for external failure detection).

record_failure

record_failure() -> None

Manually record a failure (for external failure detection).

force_open

force_open() -> None

Force the circuit to open.

force_close

force_close() -> None

Force the circuit to close.

reset

reset() -> None

Reset the circuit breaker to initial state.

CircuitBreakerStats dataclass

CircuitBreakerStats(
    total_requests: int = 0,
    successful_requests: int = 0,
    failed_requests: int = 0,
    rejected_requests: int = 0,
    state_changes: int = 0,
    times_opened: int = 0,
    times_closed: int = 0,
)

Frozen snapshot of CircuitBreaker statistics.

CircuitState

Bases: Enum

Circuit breaker states.

Fallback

Fallback(
    name: str,
    primary: Entity,
    fallback: Entity | Callable[[Event], Event | None],
    failure_predicate: Callable[[Event], bool]
    | None = None,
    timeout: float | None = None,
)

Bases: Entity

Provides fallback behavior on primary failure.

Forwards requests to the primary service and monitors for failures. When a failure is detected (via timeout, exception, or predicate), the request is retried with the fallback service.
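The primary-then-fallback flow can be sketched with plain callables. This is an illustration under stated assumptions, not the library's implementation: `call_with_fallback` is a hypothetical helper, requests and responses are plain strings, and the timeout trigger is omitted.

```python
from __future__ import annotations

from typing import Callable, Optional


def call_with_fallback(
    request: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    failure_predicate: Optional[Callable[[str], bool]] = None,
) -> tuple[str, str]:
    """Return (source, response), where source is "primary" or "fallback"."""
    try:
        response = primary(request)
    except Exception:
        # An exception from the primary counts as a failure.
        return "fallback", fallback(request)
    if failure_predicate is not None and failure_predicate(response):
        # The primary answered, but the predicate flags the response as a failure.
        return "fallback", fallback(request)
    return "primary", response


def flaky(request: str) -> str:
    raise RuntimeError("primary down")


def cache(request: str) -> str:
    return f"cached:{request}"


print(call_with_fallback("get /user/1", flaky, cache))
```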

Attributes:

- name: Fallback wrapper identifier.
- primary (Entity): The primary service to try first.
- fallback (Entity | Callable[[Event], Event | None]): The fallback service or function.

Initialize the fallback wrapper.

Parameters:

- name (str, required): Fallback wrapper identifier.
- primary (Entity, required): The primary entity to forward requests to.
- fallback (Entity | Callable[[Event], Event | None], required): The fallback entity or callable. If callable, receives the original event and returns a fallback event to schedule (or None).
- failure_predicate (Callable[[Event], bool] | None, default None): Optional function to detect failures. Returns True if the response indicates failure.
- timeout (float | None, default None): Optional timeout before triggering fallback. If None, only failure_predicate triggers fallback.

Raises:

- ValueError: If parameters are invalid.

primary property

primary: Entity

The primary entity.

fallback property

fallback: Entity | Callable[[Event], Event | None]

The fallback entity or callable.

timeout property

timeout: float | None

Timeout before triggering fallback.

stats property

stats: FallbackStats

Frozen snapshot of fallback statistics.

set_clock

set_clock(clock: Clock) -> None

Inject clock and propagate to primary and fallback.

handle_event

handle_event(
    event: Event,
) -> (
    Generator[float, None, list[Event] | Event | None]
    | list[Event]
    | Event
    | None
)

Handle incoming events.

Forwards requests to primary with fallback handling.

Parameters:

- event (Event, required): The event to handle.

Returns:

- Generator[float, None, list[Event] | Event | None] | list[Event] | Event | None: Events to schedule.

FallbackStats dataclass

FallbackStats(
    total_requests: int = 0,
    primary_successes: int = 0,
    primary_failures: int = 0,
    fallback_invocations: int = 0,
    fallback_successes: int = 0,
    fallback_failures: int = 0,
)

Frozen snapshot of Fallback statistics.

Hedge

Hedge(
    name: str,
    target: Entity,
    hedge_delay: float,
    max_hedges: int = 1,
)

Bases: Entity

Sends redundant requests to reduce tail latency.

When a request doesn't complete within the hedge delay, a redundant (hedge) request is sent to the same target, up to max_hedges per original request. Whichever response arrives first is used, and the remaining requests are effectively cancelled.

This is useful when:

- Tail latency is significantly higher than median latency
- The cost of extra requests is acceptable
- The target service is idempotent
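To see why hedging helps with tail latency, consider a single request with at most one hedge. The sketch below computes the latency the caller would observe. `hedged_latency` is a hypothetical helper, not part of the library, and the latencies are given as inputs instead of being simulated.

```python
from __future__ import annotations


def hedged_latency(
    primary_latency: float, hedge_latency: float, hedge_delay: float
) -> tuple[float, str]:
    """Return (observed latency, winner) for one request with at most one hedge."""
    if primary_latency <= hedge_delay:
        # The primary finished before the hedge would have been sent.
        return primary_latency, "primary"
    hedge_finish = hedge_delay + hedge_latency  # the hedge starts after the delay
    if hedge_finish < primary_latency:
        return hedge_finish, "hedge"
    return primary_latency, "primary"


# Fast primary: no hedge is needed.
print(hedged_latency(primary_latency=0.125, hedge_latency=0.25, hedge_delay=0.5))
# Slow primary: the hedge answers at 0.5 + 0.25 seconds instead of 2.0.
print(hedged_latency(primary_latency=2.0, hedge_latency=0.25, hedge_delay=0.5))
```

A shorter `hedge_delay` trims more of the tail but sends more duplicate requests, which is why the extra load has to be acceptable and the target idempotent.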

Attributes:

- name: Hedge wrapper identifier.
- target (Entity): The service to send requests to.
- hedge_delay (float): Time to wait before sending hedge request.
- max_hedges (int): Maximum number of hedge requests per original.

Initialize the hedge wrapper.

Parameters:

- name (str, required): Hedge wrapper identifier.
- target (Entity, required): The downstream entity to send requests to.
- hedge_delay (float, required): Seconds to wait before sending hedge request.
- max_hedges (int, default 1): Maximum number of hedge requests per original.

Raises:

- ValueError: If parameters are invalid.

target property

target: Entity

The target entity.

hedge_delay property

hedge_delay: float

Delay before sending hedge request.

max_hedges property

max_hedges: int

Maximum number of hedge requests.

in_flight_count property

in_flight_count: int

Number of requests currently in flight.

stats property

stats: HedgeStats

Frozen snapshot of hedge statistics.

set_clock

set_clock(clock: Clock) -> None

Inject clock and propagate to target.

handle_event

handle_event(
    event: Event,
) -> (
    Generator[float, None, list[Event] | Event | None]
    | list[Event]
    | Event
    | None
)

Handle incoming events.

Forwards requests with hedge scheduling.

Parameters:

- event (Event, required): The event to handle.

Returns:

- Generator[float, None, list[Event] | Event | None] | list[Event] | Event | None: Events to schedule.

HedgeStats dataclass

HedgeStats(
    total_requests: int = 0,
    primary_wins: int = 0,
    hedge_wins: int = 0,
    hedges_sent: int = 0,
    hedges_cancelled: int = 0,
)

Frozen snapshot of Hedge statistics.

TimeoutStats dataclass

TimeoutStats(
    total_requests: int = 0,
    successful_requests: int = 0,
    timed_out_requests: int = 0,
)

Frozen snapshot of TimeoutWrapper statistics.

TimeoutWrapper

TimeoutWrapper(
    name: str,
    target: Entity,
    timeout: float,
    on_timeout: Callable[[Event], Event | None]
    | None = None,
)

Bases: Entity

Wraps a target entity with timeout handling.

Forwards requests to the target and tracks their completion. If a request doesn't complete within the timeout period, it is considered failed and an optional callback is invoked.
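The timeout decision can be sketched as a pure function. This is a simplified model, not the library's implementation: `classify` is a hypothetical helper, requests are plain strings, the response latency is supplied as an input, and a response arriving exactly at the timeout is assumed to count as completed.

```python
from __future__ import annotations

from typing import Callable, Optional


def classify(
    request: str,
    response_latency: float,
    timeout: float,
    on_timeout: Optional[Callable[[str], Optional[str]]] = None,
) -> tuple[str, Optional[str]]:
    """Return (outcome, follow_up) for one request."""
    if timeout <= 0:
        raise ValueError("timeout must be positive")
    if response_latency <= timeout:
        return "completed", None
    # The request is considered failed; the optional callback receives the
    # original request and may return something to schedule next.
    follow_up = on_timeout(request) if on_timeout is not None else None
    return "timed_out", follow_up


print(classify("r1", response_latency=1.0, timeout=5.0))
print(classify("r2", response_latency=8.0, timeout=5.0,
               on_timeout=lambda req: f"fallback:{req}"))
```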

Attributes:

- name: Timeout wrapper identifier.
- target (Entity): The service being wrapped.
- timeout (float): Maximum time to wait for response.

Initialize the timeout wrapper.

Parameters:

- name (str, required): Timeout wrapper identifier.
- target (Entity, required): The downstream entity to wrap.
- timeout (float, required): Maximum time in seconds to wait for response.
- on_timeout (Callable[[Event], Event | None] | None, default None): Optional callback when timeout occurs. Receives the original request event. Can return an event to schedule (e.g., fallback).

Raises:

- ValueError: If timeout is invalid.

target property

target: Entity

The wrapped target entity.

timeout property

timeout: float

Timeout in seconds.

in_flight_count property

in_flight_count: int

Number of requests currently in flight.

stats property

stats: TimeoutStats

Frozen snapshot of timeout wrapper statistics.

set_clock

set_clock(clock: Clock) -> None

Inject clock and propagate to target.

handle_event

handle_event(
    event: Event,
) -> (
    Generator[float, None, list[Event] | Event | None]
    | list[Event]
    | Event
    | None
)

Handle incoming events.

Forwards requests to target with timeout tracking.

Parameters:

- event (Event, required): The event to handle.

Returns:

- Generator[float, None, list[Event] | Event | None] | list[Event] | Event | None: Events to schedule.