Resilience
Circuit breakers, bulkheads, and retry policies for fault-tolerant systems.
Resilience patterns for fault tolerance.
This module provides components for building resilient systems that handle failures gracefully.
Components
- CircuitBreaker: Fails fast when downstream is unhealthy
- Bulkhead: Isolates resources to prevent cascade failures
- TimeoutWrapper: Wraps services with timeout handling
- Fallback: Provides fallback behavior on failure
- Hedge: Sends redundant requests to reduce tail latency
Example
    from happysimulator.components.resilience import (
        CircuitBreaker,
        Bulkhead,
        TimeoutWrapper,
        Fallback,
        Hedge,
    )

    # Compose multiple resilience patterns.
    # `backend` is an existing Entity acting as the downstream service.
    hedge = Hedge("hedge", backend, hedge_delay=0.050)
    timeout = TimeoutWrapper("timeout", hedge, timeout=5.0)
    breaker = CircuitBreaker("breaker", timeout, failure_threshold=5)
    bulkhead = Bulkhead("bulkhead", breaker, max_concurrent=10)
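The composition in the example nests each wrapper around the previous one, so a request enters at the bulkhead and reaches the backend last. The sketch below illustrates that ordering with plain functions instead of happysimulator entities; `layer`, `backend`, and `trace` are hypothetical names used only for this illustration.

```python
from __future__ import annotations

from typing import Callable

Handler = Callable[[str], str]


def layer(name: str, inner: Handler, trace: list[str]) -> Handler:
    """Wrap `inner` so that each call records `name` before delegating."""

    def handle(request: str) -> str:
        trace.append(name)
        return inner(request)

    return handle


def backend(request: str) -> str:
    """Hypothetical downstream service that always succeeds."""
    return f"ok:{request}"


trace: list[str] = []

# Same nesting order as the example: the last wrapper built is the outermost.
hedge = layer("hedge", backend, trace)
timeout = layer("timeout", hedge, trace)
breaker = layer("breaker", timeout, trace)
bulkhead = layer("bulkhead", breaker, trace)

result = bulkhead("r1")
print(trace)
```

Under these assumptions the request passes through the bulkhead first and the hedge last, so a rejection at an outer layer never reaches the inner ones.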
Bulkhead
Bulkhead(
name: str,
target: Entity,
max_concurrent: int,
max_wait_queue: int = 0,
max_wait_time: float | None = None,
)
Bases: Entity
Isolates resources by limiting concurrent access.
The bulkhead limits the number of concurrent requests to a target service. Additional requests can optionally wait in a queue with a timeout. When both concurrent and queue limits are reached, requests are rejected immediately.
This prevents a slow or failing service from consuming all resources and affecting other parts of the system.
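The admission rule described above (accept while under the concurrency limit, queue while the wait queue has room, otherwise reject) can be sketched as follows. This is a simplified stand-in, not happysimulator's implementation: `BulkheadSketch`, `Admission`, `submit`, and `complete` are hypothetical names, and the wait timeout (`max_wait_time`) is left out.

```python
from __future__ import annotations

from collections import deque
from enum import Enum


class Admission(Enum):
    ACCEPTED = "accepted"
    QUEUED = "queued"
    REJECTED = "rejected"


class BulkheadSketch:
    """Illustrative stand-in; not the library's Bulkhead."""

    def __init__(self, max_concurrent: int, max_wait_queue: int = 0) -> None:
        self.max_concurrent = max_concurrent
        self.max_wait_queue = max_wait_queue
        self.active = 0
        self.waiting: deque[str] = deque()

    def submit(self, request_id: str) -> Admission:
        """Decide what happens to a newly arrived request."""
        if self.active < self.max_concurrent:
            self.active += 1
            return Admission.ACCEPTED
        if len(self.waiting) < self.max_wait_queue:
            self.waiting.append(request_id)
            return Admission.QUEUED
        return Admission.REJECTED

    def complete(self) -> str | None:
        """Finish one active request and return the promoted waiter, if any."""
        if self.waiting:
            # The freed slot passes straight to the oldest waiter,
            # so the active count is unchanged.
            return self.waiting.popleft()
        if self.active > 0:
            self.active -= 1
        return None


bulkhead = BulkheadSketch(max_concurrent=1, max_wait_queue=1)
decisions = [bulkhead.submit(request_id) for request_id in ("a", "b", "c")]
print([decision.value for decision in decisions])
```

With `max_wait_queue=0` (the documented default) the queue branch is never taken, so a full bulkhead rejects immediately.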
Attributes:
| Name | Type | Description |
|---|---|---|
| `name` | `str` | Bulkhead identifier. |
| `target` | `Entity` | The service being protected. |
| `max_concurrent` | `int` | Maximum concurrent requests allowed. |
Initialize the bulkhead.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Bulkhead identifier. | required |
| `target` | `Entity` | The downstream entity to protect. | required |
| `max_concurrent` | `int` | Maximum concurrent requests to target. | required |
| `max_wait_queue` | `int` | Maximum requests that can wait in queue. 0 means no queuing (immediate reject when full). | `0` |
| `max_wait_time` | `float \| None` | Maximum time a request can wait in queue. None means no timeout (wait indefinitely). | `None` |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If parameters are invalid. |
handle_event
handle_event(
event: Event,
) -> (
Generator[float, None, list[Event] | Event | None]
| list[Event]
| Event
| None
)
Handle incoming events.
Routes requests through the bulkhead logic and handles responses from the target.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `event` | `Event` | The event to handle. | required |

Returns:
| Type | Description |
|---|---|
| `Generator[float, None, list[Event] \| Event \| None] \| list[Event] \| Event \| None` | Events to schedule, or None if rejected. |
BulkheadStats dataclass
BulkheadStats(
total_requests: int = 0,
accepted_requests: int = 0,
rejected_requests: int = 0,
timed_out_requests: int = 0,
queued_requests: int = 0,
peak_concurrent: int = 0,
peak_queue_depth: int = 0,
)
Frozen snapshot of Bulkhead statistics.
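As an illustration of how a frozen snapshot like this might be consumed, the sketch below defines a stand-in dataclass with the same fields and derives a rejection rate from it. `BulkheadStatsSketch` and `rejection_rate` are hypothetical names, not part of happysimulator, and how a snapshot is obtained from a `Bulkhead` is not shown here.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class BulkheadStatsSketch:
    """Stand-in mirroring the documented fields; not the library class."""

    total_requests: int = 0
    accepted_requests: int = 0
    rejected_requests: int = 0
    timed_out_requests: int = 0
    queued_requests: int = 0
    peak_concurrent: int = 0
    peak_queue_depth: int = 0


def rejection_rate(stats: BulkheadStatsSketch) -> float:
    """Fraction of requests rejected; 0.0 when nothing was recorded."""
    if stats.total_requests == 0:
        return 0.0
    return stats.rejected_requests / stats.total_requests


snapshot = BulkheadStatsSketch(
    total_requests=8, accepted_requests=6, rejected_requests=2
)
print(rejection_rate(snapshot))

mutation_blocked = False
try:
    snapshot.total_requests = 9  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    mutation_blocked = True
print(mutation_blocked)
```

Because the snapshot is frozen, values read from it cannot drift while a report is being computed.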
CircuitBreaker
CircuitBreaker(
name: str,
target: Entity,
failure_threshold: int = 5,
success_threshold: int = 2,
timeout: float = 30.0,
half_open_max_requests: int = 1,
failure_predicate: Callable[[Event], bool] | None = None,
on_state_change: Callable[[CircuitState, CircuitState], None] | None = None,
)
Bases: Entity
Implements the circuit breaker pattern.
The circuit breaker monitors requests to a target service and tracks failures. When consecutive failures reach failure_threshold, the circuit opens and subsequent requests fail fast without calling the target. After timeout seconds in the open state, the circuit enters the half-open state to test whether the target has recovered.
States
- CLOSED: Normal operation. Requests forwarded to target. Failures tracked. Opens after failure_threshold failures.
- OPEN: Failing fast. Requests rejected immediately. Transitions to HALF_OPEN after timeout expires.
- HALF_OPEN: Testing recovery. Limited requests allowed through.
    - success_threshold successes -> CLOSED
    - Any failure -> OPEN
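The state transitions listed above can be sketched as a small state machine driven by an explicit clock value. This is an illustrative stand-in under stated assumptions, not the library's code: `BreakerSketch`, `State`, and `allow` are hypothetical names, the method signatures here are chosen for the sketch, and `half_open_max_requests` and `failure_predicate` are left out.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class BreakerSketch:
    """Illustrative stand-in; not the library's CircuitBreaker."""

    def __init__(self, failure_threshold=5, success_threshold=2, timeout=30.0):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout = timeout
        self.state = State.CLOSED
        self.failures = 0   # consecutive failures while CLOSED
        self.successes = 0  # consecutive successes while HALF_OPEN
        self.opened_at = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` may pass."""
        if self.state is State.OPEN and now - self.opened_at >= self.timeout:
            self.state = State.HALF_OPEN
            self.successes = 0
        return self.state is not State.OPEN

    def record_success(self) -> None:
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = State.CLOSED
                self.failures = 0
        elif self.state is State.CLOSED:
            self.failures = 0  # a success breaks the consecutive-failure run

    def record_failure(self, now: float) -> None:
        if self.state is State.HALF_OPEN:
            self._open(now)  # any failure during recovery reopens the circuit
        elif self.state is State.CLOSED:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._open(now)

    def _open(self, now: float) -> None:
        self.state = State.OPEN
        self.opened_at = now
        self.failures = 0


breaker = BreakerSketch(failure_threshold=2, success_threshold=2, timeout=10.0)
breaker.record_failure(now=1.0)
breaker.record_failure(now=2.0)
print(breaker.state, breaker.allow(now=5.0), breaker.allow(now=12.0))
```

Passing the time in explicitly keeps the sketch deterministic, which mirrors how a simulator would supply its own clock rather than reading wall time.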
Attributes:
| Name | Type | Description |
|---|---|---|
| `name` | `str` | Circuit breaker identifier. |
| `target` | `Entity` | The service being protected. |
| `state` | `CircuitState` | Current circuit state. |
Initialize the circuit breaker.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Circuit breaker identifier. | required |
| `target` | `Entity` | The downstream entity to protect. | required |
| `failure_threshold` | `int` | Consecutive failures before opening circuit. | `5` |
| `success_threshold` | `int` | Consecutive successes in half-open to close. | `2` |
| `timeout` | `float` | Seconds in open state before transitioning to half-open. | `30.0` |
| `half_open_max_requests` | `int` | Max concurrent requests in half-open state. | `1` |
| `failure_predicate` | `Callable[[Event], bool] \| None` | Optional function to determine if response is failure. If None, only exceptions count as failures. | `None` |
| `on_state_change` | `Callable[[CircuitState, CircuitState], None] \| None` | Optional callback when state changes. | `None` |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If thresholds or timeout are invalid. |
handle_event
handle_event(
event: Event,
) -> (
Generator[float, None, list[Event] | Event | None]
| list[Event]
| Event
| None
)
Handle incoming events.
Routes requests through the circuit breaker logic and handles responses from the target.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `event` | `Event` | The event to handle. | required |

Returns:
| Type | Description |
|---|---|
| `Generator[float, None, list[Event] \| Event \| None] \| list[Event] \| Event \| None` | Events to schedule, or None if rejected. |
record_success
Manually record a success (for external failure detection).
record_failure
Manually record a failure (for external failure detection).
CircuitBreakerStats dataclass
CircuitBreakerStats(
total_requests: int = 0,
successful_requests: int = 0,
failed_requests: int = 0,
rejected_requests: int = 0,
state_changes: int = 0,
times_opened: int = 0,
times_closed: int = 0,
)
Frozen snapshot of CircuitBreaker statistics.
CircuitState
Bases: Enum
Circuit breaker states.
Fallback
Fallback(
name: str,
primary: Entity,
fallback: Entity | Callable[[Event], Event | None],
failure_predicate: Callable[[Event], bool] | None = None,
timeout: float | None = None,
)
Bases: Entity
Provides fallback behavior on primary failure.
Forwards requests to the primary service and monitors for failures. When a failure is detected (via timeout, exception, or predicate), the request is retried with the fallback service.
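The failure-detection flow described above can be sketched with plain callables. This is a minimal illustration rather than the library's implementation: `call_with_fallback` is a hypothetical helper, requests and responses are plain strings instead of `Event` objects, and the timeout trigger is omitted.

```python
from __future__ import annotations

from typing import Callable, Optional

Service = Callable[[str], str]


def call_with_fallback(
    request: str,
    primary: Service,
    fallback: Service,
    failure_predicate: Optional[Callable[[str], bool]] = None,
) -> tuple[str, str]:
    """Return (response, source), where source is 'primary' or 'fallback'."""
    try:
        response = primary(request)
    except Exception:
        # An exception from the primary counts as a failure.
        return fallback(request), "fallback"
    if failure_predicate is not None and failure_predicate(response):
        # The primary answered, but the predicate marked the answer as failed.
        return fallback(request), "fallback"
    return response, "primary"


def healthy(request: str) -> str:
    return f"live:{request}"


def broken(request: str) -> str:
    raise RuntimeError("primary unavailable")


def cached(request: str) -> str:
    return f"cached:{request}"


print(call_with_fallback("r1", healthy, cached))
print(call_with_fallback("r2", broken, cached))
```

The predicate path matters for services that report errors in the response body instead of raising, which the exception branch alone would miss.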
Attributes:
| Name | Type | Description |
|---|---|---|
| `name` | `str` | Fallback wrapper identifier. |
| `primary` | `Entity` | The primary service to try first. |
| `fallback` | `Entity \| Callable[[Event], Event \| None]` | The fallback service or function. |
Initialize the fallback wrapper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Fallback wrapper identifier. | required |
| `primary` | `Entity` | The primary entity to forward requests to. | required |
| `fallback` | `Entity \| Callable[[Event], Event \| None]` | The fallback entity or callable. If callable, receives the original event and returns a fallback event to schedule (or None). | required |
| `failure_predicate` | `Callable[[Event], bool] \| None` | Optional function to detect failures. Returns True if the response indicates failure. | `None` |
| `timeout` | `float \| None` | Optional timeout before triggering fallback. If None, only failure_predicate triggers fallback. | `None` |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If parameters are invalid. |
FallbackStats dataclass
FallbackStats(
total_requests: int = 0,
primary_successes: int = 0,
primary_failures: int = 0,
fallback_invocations: int = 0,
fallback_successes: int = 0,
fallback_failures: int = 0,
)
Frozen snapshot of Fallback statistics.
Hedge
Bases: Entity
Sends redundant requests to reduce tail latency.
When a request doesn't complete within the hedge delay, a second (hedge) request is sent to the same target. The first response is used and the other request is effectively cancelled.
This is useful when:
- Tail latency is significantly higher than median latency
- The cost of extra requests is acceptable
- The target service is idempotent
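The timing behind hedging can be worked through with a toy model: the hedge is sent only if the primary has not finished by the hedge delay, and whichever response arrives first wins. The sketch assumes a single hedge (`max_hedges=1`) and latencies known up front, which only holds in a simulation; `hedged_latency` is a hypothetical helper, not part of happysimulator, and its tie-breaking in favor of the primary is an assumption.

```python
from __future__ import annotations


def hedged_latency(
    primary_latency: float, hedge_latency: float, hedge_delay: float
) -> tuple[float, str]:
    """Return (observed latency, winner) for one request with at most one hedge."""
    if primary_latency <= hedge_delay:
        # The primary finished before the hedge would have been sent.
        return primary_latency, "primary"
    hedge_finish = hedge_delay + hedge_latency
    if hedge_finish < primary_latency:
        return hedge_finish, "hedge"
    return primary_latency, "primary"


# A slow primary (2.0 s) is rescued by a hedge sent at 0.5 s that takes 0.25 s.
print(hedged_latency(primary_latency=2.0, hedge_latency=0.25, hedge_delay=0.5))
# A fast primary (0.1 s) never triggers the hedge.
print(hedged_latency(primary_latency=0.1, hedge_latency=0.25, hedge_delay=0.5))
```

The model makes the trade-off visible: a shorter hedge delay trims more of the tail but sends extra requests more often.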
Attributes:
| Name | Type | Description |
|---|---|---|
| `name` | `str` | Hedge wrapper identifier. |
| `target` | `Entity` | The service to send requests to. |
| `hedge_delay` | `float` | Time to wait before sending hedge request. |
| `max_hedges` | `int` | Maximum number of hedge requests per original. |
Initialize the hedge wrapper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Hedge wrapper identifier. | required |
| `target` | `Entity` | The downstream entity to send requests to. | required |
| `hedge_delay` | `float` | Seconds to wait before sending hedge request. | required |
| `max_hedges` | `int` | Maximum number of hedge requests per original. | `1` |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If parameters are invalid. |
handle_event
HedgeStats dataclass
HedgeStats(
total_requests: int = 0,
primary_wins: int = 0,
hedge_wins: int = 0,
hedges_sent: int = 0,
hedges_cancelled: int = 0,
)
Frozen snapshot of Hedge statistics.
TimeoutStats dataclass
Frozen snapshot of TimeoutWrapper statistics.
TimeoutWrapper
TimeoutWrapper(
name: str,
target: Entity,
timeout: float,
on_timeout: Callable[[Event], Event | None] | None = None,
)
Bases: Entity
Wraps a target entity with timeout handling.
Forwards requests to the target and tracks their completion. If a request doesn't complete within the timeout period, it is considered failed and an optional callback is invoked.
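The timeout rule described above can be sketched as a pure function over a known latency. This is a toy model, not the library's implementation: `resolve_with_timeout` is a hypothetical helper, requests are plain strings instead of `Event` objects, and treating a response that arrives exactly at the deadline as completed is an assumption.

```python
from __future__ import annotations

from typing import Callable, Optional


def resolve_with_timeout(
    request: str,
    latency: float,
    timeout: float,
    on_timeout: Optional[Callable[[str], Optional[str]]] = None,
) -> tuple[str, Optional[str]]:
    """Return ('completed', request) or ('timed_out', follow-up or None)."""
    if latency <= timeout:
        return "completed", request
    # The request is considered failed; the callback may supply a follow-up.
    follow_up = on_timeout(request) if on_timeout is not None else None
    return "timed_out", follow_up


def to_fallback(request: str) -> str:
    """Hypothetical callback that turns a timed-out request into a fallback."""
    return f"fallback:{request}"


print(resolve_with_timeout("r1", latency=1.0, timeout=5.0))
print(resolve_with_timeout("r2", latency=9.0, timeout=5.0, on_timeout=to_fallback))
```

Returning the follow-up from the callback mirrors the documented `on_timeout` contract, where the callback receives the original request and may return something to schedule.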
Attributes:
| Name | Type | Description |
|---|---|---|
| `name` | `str` | Timeout wrapper identifier. |
| `target` | `Entity` | The service being wrapped. |
| `timeout` | `float` | Maximum time to wait for response. |
Initialize the timeout wrapper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Timeout wrapper identifier. | required |
| `target` | `Entity` | The downstream entity to wrap. | required |
| `timeout` | `float` | Maximum time in seconds to wait for response. | required |
| `on_timeout` | `Callable[[Event], Event \| None] \| None` | Optional callback when timeout occurs. Receives the original request event. Can return an event to schedule (e.g., fallback). | `None` |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If timeout is invalid. |