API Playbook: Resilience Patterns
Intermediate · 5 min

Circuit Breakers

When a dependency is down, stop hammering it

In a nutshell

When a service your API depends on stops responding, your API can get stuck waiting and eventually go down too. A circuit breaker monitors those downstream calls and, when failures pile up, stops making them entirely, returning an error immediately instead of waiting 30 seconds for a timeout. Once the struggling service recovers, the circuit breaker lets traffic flow again.

The situation

Your checkout service calls the inventory service to reserve stock. The inventory service's database is overloaded — every request takes 30 seconds and then times out.

Your checkout service has 200 request threads. Each one is stuck waiting on inventory. Within a minute, all 200 threads are consumed. The checkout service can't handle any requests — not even the ones that don't need inventory. Cart browsing, order history, account pages — everything is down.

The inventory service took down the checkout service. And checkout is about to take down the API gateway. This is a cascading failure, and it's the most common way a microservices architecture collapses.

What a circuit breaker does

A circuit breaker sits between your service and a dependency. It monitors failures and, when the failure rate crosses a threshold, it stops making calls entirely — failing immediately instead of waiting for a timeout.

The concept comes from electrical engineering. When the current exceeds safe levels, the breaker trips to prevent a fire. Same idea: when error rates exceed safe levels, stop sending traffic to prevent a system-wide outage.

The three states

A circuit breaker has three states, and the transitions between them are the key to understanding the pattern:

Closed (normal operation)

All requests pass through to the dependency. The breaker tracks success and failure counts.

POST /internal/inventory/reserve HTTP/1.1
Host: inventory-service
Content-Type: application/json

{
  "sku": "WIDGET-001",
  "quantity": 2
}

HTTP/1.1 200 OK
Content-Type: application/json

{
  "reservation_id": "res_abc123",
  "sku": "WIDGET-001",
  "quantity": 2,
  "expires_at": "2026-04-13T16:00:00Z"
}

The breaker sees: success. Counter stays healthy.

Open (failing fast)

The failure threshold has been crossed. The breaker rejects all requests immediately without calling the dependency. This is the protective state.

POST /internal/inventory/reserve HTTP/1.1
Host: inventory-service
Content-Type: application/json

{
  "sku": "WIDGET-001",
  "quantity": 2
}

The request never leaves your service. The circuit breaker returns an error immediately:

HTTP/1.1 503 Service Unavailable
Content-Type: application/json

{
  "error": {
    "code": "circuit_open",
    "message": "Inventory service is currently unavailable",
    "retry_after": 30
  }
}

Response time: 1 millisecond instead of 30 seconds. Your threads are free. Your service stays alive.

Half-open (testing recovery)

After a cooldown period, the breaker allows a limited number of probe requests through to test if the dependency has recovered.

  • If the probe succeeds, the breaker transitions back to closed
  • If the probe fails, the breaker returns to open and resets the cooldown timer

Why half-open matters

Without the half-open state, you'd need to manually reset the breaker or wait for a fixed timer. Half-open automates recovery detection: the breaker heals itself when the dependency comes back, and re-trips if it hasn't.
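The three states and the transitions between them can be sketched as a small wrapper class. This is a minimal illustration, not a production library; the names, defaults, and single-error type are assumptions for the sketch:

```typescript
type State = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: State = "closed";
  private failures = 0;
  private openedAt = 0;
  private probesRemaining = 0;

  constructor(
    private failureThreshold = 5,
    private cooldownMs = 30_000,
    private probeCount = 3,
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        // Fail fast: the dependency is never called
        throw new Error("circuit_open");
      }
      // Cooldown elapsed: allow a limited number of probes through
      this.state = "half-open";
      this.probesRemaining = this.probeCount;
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  private onSuccess(): void {
    if (this.state === "half-open") {
      this.probesRemaining -= 1;
      if (this.probesRemaining <= 0) this.state = "closed";
    }
    this.failures = 0;
  }

  private onFailure(): void {
    if (this.state === "half-open") {
      this.trip(); // a failed probe re-opens immediately
      return;
    }
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.trip();
  }

  private trip(): void {
    this.state = "open";
    this.openedAt = Date.now();
    this.failures = 0;
  }
}
```

A real implementation would add a sliding failure window and concurrency limits on probes, but the state transitions are the whole pattern.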

What happens without a circuit breaker

Here's the cascading failure in detail: inventory slows down; checkout's request threads pile up waiting on it; checkout stops responding to everything, including requests that never touch inventory; then the gateway's threads pile up waiting on checkout. Each layer drags down the one above it.

With a circuit breaker on the checkout → inventory call, the failure is contained: once the threshold trips, checkout fails inventory calls in about a millisecond, its threads stay free, and everything that doesn't need inventory keeps working.

Configuration thresholds

A circuit breaker needs three main configuration values:

Parameter         | What it controls                      | Typical value
Failure threshold | How many failures before tripping     | 5 failures in 10 seconds, or 50% failure rate
Cooldown period   | How long to stay open before testing  | 30-60 seconds
Probe count       | How many test requests in half-open   | 1-3 requests

const circuitBreakerConfig = {
  // Trip after 5 failures in a 10-second window
  failureThreshold: 5,
  failureWindowMs: 10_000,

  // Wait 30 seconds before probing
  cooldownMs: 30_000,

  // Allow 3 probe requests in half-open state
  probeCount: 3,

  // Only count these as failures (not 400s)
  failureStatusCodes: [500, 502, 503, 504],

  // Timeouts count as failures
  timeoutMs: 5_000,
};

Tune your thresholds

A threshold that's too low trips on normal transient errors. A threshold that's too high lets the cascading failure happen before the breaker kicks in. Start with a 50% failure rate over a 10-second window and adjust based on your traffic patterns.
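A rate-based threshold needs a rolling window and a minimum sample count, or a single failure out of one call reads as a 100% failure rate. A minimal sketch of that bookkeeping, with illustrative names and defaults:

```typescript
// Rolling-window failure-rate tracker: reports a trip when the failure
// rate over the last windowMs crosses rateThreshold. The minSamples
// floor prevents tripping on normal transient errors at low traffic.
class FailureWindow {
  private samples: { at: number; failed: boolean }[] = [];

  constructor(
    private windowMs = 10_000,
    private rateThreshold = 0.5,
    private minSamples = 10,
  ) {}

  record(failed: boolean, now = Date.now()): void {
    this.samples.push({ at: now, failed });
  }

  shouldTrip(now = Date.now()): boolean {
    // Drop samples that have aged out of the window
    this.samples = this.samples.filter((s) => now - s.at <= this.windowMs);
    if (this.samples.length < this.minSamples) return false;
    const failures = this.samples.filter((s) => s.failed).length;
    return failures / this.samples.length >= this.rateThreshold;
  }
}
```

The `minSamples` floor is the knob that separates "transient blip" from "real outage" when traffic is light.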

Fallback strategies

When the circuit is open, you need to decide what to return. Options:

  1. Return an error — honest and simple, let the caller handle it
  2. Return cached data — stale data is better than no data for read operations
  3. Return a default — "inventory unknown" instead of crashing the checkout flow
  4. Degrade gracefully — skip the optional feature (e.g., show products without stock counts)

// Assumes: a `circuitBreaker` wrapper that throws CircuitOpenError while
// the circuit is open, and a `cache` client. Shapes are illustrative.
interface InventoryResult {
  sku: string;
  available: number | null;
  source?: "live" | "cache" | "fallback";
  stale?: boolean;
}

async function getInventory(sku: string): Promise<InventoryResult> {
  try {
    return await circuitBreaker.call(async () => {
      const res = await fetch(`/internal/inventory/${sku}`);
      if (!res.ok) throw new Error(`inventory returned ${res.status}`);
      return (await res.json()) as InventoryResult;
    });
  } catch (error) {
    if (error instanceof CircuitOpenError) {
      // Fallback: return cached data or a degraded response
      const cached = await cache.get(`inventory:${sku}`);
      if (cached) return { ...cached, source: "cache", stale: true };

      return { sku, available: null, source: "fallback" };
    }
    throw error;
  }
}

Health check integration

Your health endpoint should reflect circuit breaker states so load balancers and monitoring can act:

GET /health HTTP/1.1
Host: checkout-service

HTTP/1.1 200 OK
Content-Type: application/json

{
  "status": "degraded",
  "checks": {
    "database": { "status": "healthy", "latency_ms": 12 },
    "inventory-service": {
      "status": "unhealthy",
      "circuit": "open",
      "last_failure": "2026-04-13T14:32:10Z",
      "cooldown_remaining_s": 22
    },
    "payment-service": { "status": "healthy", "latency_ms": 85 }
  }
}

This gives your operations team immediate visibility into which dependencies are struggling without digging through logs.
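A health payload like the one above can be assembled directly from breaker state. The function below is a sketch; `buildHealth`, the breaker map, and the field shapes are illustrative, not a specific library's API:

```typescript
type CircuitState = "closed" | "open" | "half-open";

interface BreakerInfo {
  state: CircuitState;
  lastFailure?: string; // ISO timestamp of the most recent failure
}

// Derive the /health response body from per-dependency circuit state.
function buildHealth(
  breakers: Record<string, BreakerInfo>,
): { status: "healthy" | "degraded"; checks: Record<string, object> } {
  const checks: Record<string, object> = {};
  let degraded = false;

  for (const [name, b] of Object.entries(breakers)) {
    if (b.state === "open") {
      degraded = true;
      checks[name] = {
        status: "unhealthy",
        circuit: "open",
        last_failure: b.lastFailure,
      };
    } else {
      checks[name] = { status: "healthy", circuit: b.state };
    }
  }
  return { status: degraded ? "degraded" : "healthy", checks };
}
```

Reporting "degraded" rather than "unhealthy" matters: the service is still up, and you usually don't want the load balancer to eject it just because one dependency's circuit is open.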

Checklist: circuit breaker implementation

  • Identify all external dependencies that could fail or slow down
  • Add circuit breakers on each outbound call
  • Configure failure thresholds based on your traffic patterns
  • Implement a fallback strategy for each dependency (error, cache, default, or degrade)
  • Expose circuit breaker state in your health endpoint
  • Set up alerts for circuit state changes (closed → open)
  • Test by injecting failures — your breaker should trip and recover cleanly
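The last checklist item can be exercised with a small fault-injection harness. The stub breaker below stands in for whatever wrapper you actually use; names, thresholds, and timings are illustrative:

```typescript
// Stub breaker: trips after `threshold` consecutive failures and fails
// fast for `cooldownMs` before letting a probe through.
function makeBreaker(threshold: number, cooldownMs: number) {
  let failures = 0;
  let openedAt = -Infinity;
  return async <T>(fn: () => Promise<T>): Promise<T> => {
    if (Date.now() - openedAt < cooldownMs) throw new Error("circuit_open");
    try {
      const out = await fn();
      failures = 0;
      return out;
    } catch (err) {
      if (++failures >= threshold) openedAt = Date.now();
      throw err;
    }
  };
}

// Inject failures, verify fast-fail while open, then verify recovery.
async function injectFailures() {
  const call = makeBreaker(3, 50);
  const down = () => Promise.reject(new Error("dependency down"));
  const up = () => Promise.resolve("ok");

  // 1. Drive the breaker past its threshold
  for (let i = 0; i < 3; i++) await call(down).catch(() => {});

  // 2. While open, calls fail fast without touching the dependency
  const fastFail = await call(up).catch((e: Error) => e.message);

  // 3. After the cooldown, a successful probe closes the circuit
  await new Promise((r) => setTimeout(r, 60));
  const recovered = await call(up);
  return { fastFail, recovered };
}
```

The same sequence, run against your real breaker with a fault-injecting fake dependency, is the "trip and recover cleanly" test the checklist asks for.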

Next up: rate limiting — protecting your API not from failing dependencies, but from being overwhelmed by your own consumers.