Resilience

Auto-reconnect, exponential backoff, relay watchdog.

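Subway handles network failures automatically, so your application does not need its own retry logic for transient outages. If every reconnect attempt is exhausted, the agent stops retrying and logs an error.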

Mechanisms

| Mechanism | Behavior |
|---|---|
| Name renewal | Re-registers with relay every 30s |
| Renewal failure tracking | After 5 consecutive failures, triggers full reconnect |
| Relay watchdog | Monitors RelayDisconnected events, initiates reconnect |
| Exponential backoff | 1s → 2s → 4s → ... → 30s max |
| Max attempts | 10 reconnect attempts before giving up |
| Graceful shutdown | Drop on AgentNode notifies all background tasks |
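
The backoff schedule listed above (1s, doubling, capped at 30s) can be sketched as a small pure function. This is an illustration only, not Subway's actual code: `backoff_delay` is a hypothetical helper, and the doubling factor is inferred from the 1s → 2s → 4s sequence.

```rust
use std::time::Duration;

// Start and cap taken from the schedule above (1s → ... → 30s max).
const RECONNECT_INITIAL_DELAY: Duration = Duration::from_secs(1);
const RECONNECT_MAX_DELAY: Duration = Duration::from_secs(30);

/// Hypothetical helper: delay before reconnect attempt `attempt` (0-based).
/// The delay doubles on every attempt and is capped at the maximum.
fn backoff_delay(attempt: u32) -> Duration {
    // 2^attempt, saturating instead of overflowing for very large attempts.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    RECONNECT_INITIAL_DELAY
        .checked_mul(factor)
        .unwrap_or(RECONNECT_MAX_DELAY)
        .min(RECONNECT_MAX_DELAY)
}
```

Under these assumptions, attempts 0 through 4 wait 1s, 2s, 4s, 8s, and 16s, and every later attempt waits the 30s cap.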

Failure scenarios

Relay restarts

  1. Relay goes down
  2. Name renewal fails 5 times in a row, reaching the failure threshold
  3. Watchdog detects RelayDisconnected
  4. Backoff reconnect loop starts: 1s, 2s, 4s, 8s...
  5. Relay comes back → agent reconnects, re-registers name
  6. All peers can reach the agent again
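
Step 2 of the relay-restart sequence depends on counting consecutive renewal failures. A minimal sketch of such a counter follows. `RenewalTracker` is a hypothetical type, not part of Subway's public API, and resetting the counter after a trigger is an assumption.

```rust
/// Hypothetical tracker for consecutive name-renewal failures.
struct RenewalTracker {
    consecutive_failures: u32,
    max_failures: u32,
}

impl RenewalTracker {
    fn new(max_failures: u32) -> Self {
        Self { consecutive_failures: 0, max_failures }
    }

    /// Records one renewal result.
    /// Returns true when a full reconnect should be triggered.
    fn record(&mut self, renewal_ok: bool) -> bool {
        if renewal_ok {
            // Any success breaks the streak.
            self.consecutive_failures = 0;
            return false;
        }
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.max_failures {
            // Assumption: the streak starts over once a reconnect is triggered.
            self.consecutive_failures = 0;
            true
        } else {
            false
        }
    }
}
```

With a threshold of 5, four failures followed by one success never trigger a reconnect; only five failures in a row do.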

Network partition

  1. Agent loses internet connectivity
  2. Name renewal fails
  3. Watchdog triggers reconnect attempts
  4. If connectivity returns within the backoff window → automatic recovery
  5. If all 10 attempts fail → agent stops retrying, logs error
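
The two outcomes in steps 4 and 5 can be sketched as a bounded retry loop. This is a simplified model rather than Subway's implementation: `reconnect_with_backoff` and `ReconnectOutcome` are hypothetical names, the connect and sleep operations are passed in as closures so the loop can run without a network, and waiting before each attempt (rather than after) is an assumption.

```rust
use std::time::Duration;

const RECONNECT_MAX_ATTEMPTS: u32 = 10;

/// Hypothetical result type for the reconnect loop.
#[derive(Debug, PartialEq)]
enum ReconnectOutcome {
    Reconnected { attempts: u32 },
    GaveUp,
}

/// Tries to reconnect up to RECONNECT_MAX_ATTEMPTS times.
/// `try_connect` returns true on success; `sleep` waits for the given delay.
fn reconnect_with_backoff<C, S>(mut try_connect: C, mut sleep: S) -> ReconnectOutcome
where
    C: FnMut() -> bool,
    S: FnMut(Duration),
{
    let max_delay = Duration::from_secs(30);
    let mut delay = Duration::from_secs(1);
    for attempt in 1..=RECONNECT_MAX_ATTEMPTS {
        // Assumption: wait first, then attempt the connection.
        sleep(delay);
        if try_connect() {
            return ReconnectOutcome::Reconnected { attempts: attempt };
        }
        // Double the delay for the next attempt, capped at 30s.
        delay = (delay * 2).min(max_delay);
    }
    ReconnectOutcome::GaveUp
}
```

In production the `sleep` closure would wrap a real timer such as `std::thread::sleep`. A `GaveUp` result corresponds to step 5, where the agent stops retrying and logs an error.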

Agent crash

  1. Agent process dies
  2. Name expires on relay (no renewal within 30s)
  3. Other agents get NameNotFound when trying to reach it
  4. Agent restarts → new connection, same PeerId (if key persists), name re-registered
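
Steps 2 and 3 describe a lease: a name stays resolvable only while it keeps being renewed. The sketch below models that behavior with a hypothetical `NameRegistry`. The relay's real data structures are not documented here, so the types and method names are assumptions; the 30s window comes from step 2.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Error name taken from step 3; the enum itself is hypothetical.
#[derive(Debug, PartialEq)]
enum LookupError {
    NameNotFound,
}

/// Hypothetical relay-side registry where each name expires unless renewed.
struct NameRegistry {
    ttl: Duration,
    last_renewed: HashMap<String, Instant>,
}

impl NameRegistry {
    fn new(ttl: Duration) -> Self {
        Self { ttl, last_renewed: HashMap::new() }
    }

    /// Registers the name or refreshes its lease.
    fn renew(&mut self, name: &str, now: Instant) {
        self.last_renewed.insert(name.to_string(), now);
    }

    /// Succeeds only if the name was renewed within the last `ttl`.
    fn lookup(&self, name: &str, now: Instant) -> Result<(), LookupError> {
        match self.last_renewed.get(name) {
            Some(&t) if now.saturating_duration_since(t) <= self.ttl => Ok(()),
            _ => Err(LookupError::NameNotFound),
        }
    }
}
```

Passing `now` explicitly keeps the model deterministic: a lookup 10s after the last renewal succeeds, a lookup 31s after it returns `NameNotFound`, and a fresh `renew` after the restart makes the name reachable again.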

Key constants

| Constant | Value |
|---|---|
| RELAY_CONNECT_TIMEOUT | 10s |
| RPC_DEFAULT_TIMEOUT | 30s |
| NAME_RENEWAL_INTERVAL | 30s |
| RECONNECT_INITIAL_DELAY | 1s |
| RECONNECT_MAX_DELAY | 30s |
| RECONNECT_MAX_ATTEMPTS | 10 |
| NAME_RENEWAL_MAX_FAILURES | 5 |
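
To see what these values imply together, the sketch below restates them as Rust constants and derives two rough figures. The constant names and values come from the list above; the Rust types and the two helper functions are assumptions, and the figures ignore the time spent inside each connection or renewal attempt.

```rust
use std::time::Duration;

#[allow(dead_code)]
const RELAY_CONNECT_TIMEOUT: Duration = Duration::from_secs(10);
#[allow(dead_code)]
const RPC_DEFAULT_TIMEOUT: Duration = Duration::from_secs(30);
const NAME_RENEWAL_INTERVAL: Duration = Duration::from_secs(30);
const RECONNECT_INITIAL_DELAY: Duration = Duration::from_secs(1);
const RECONNECT_MAX_DELAY: Duration = Duration::from_secs(30);
const RECONNECT_MAX_ATTEMPTS: u32 = 10;
const NAME_RENEWAL_MAX_FAILURES: u32 = 5;

/// Hypothetical helper: sum of all backoff delays across the full retry budget.
fn total_backoff_wait() -> Duration {
    let mut delay = RECONNECT_INITIAL_DELAY;
    let mut total = Duration::ZERO;
    for _ in 0..RECONNECT_MAX_ATTEMPTS {
        total += delay;
        delay = (delay * 2).min(RECONNECT_MAX_DELAY);
    }
    total
}

/// Hypothetical helper: time spanned by the renewal-failure threshold,
/// assuming one renewal per interval.
fn renewal_failure_window() -> Duration {
    NAME_RENEWAL_INTERVAL * NAME_RENEWAL_MAX_FAILURES
}
```

Under these assumptions the backoff delays add up to 1 + 2 + 4 + 8 + 16 + 5 × 30 = 181s before the agent gives up, and five missed renewals span 5 × 30s = 150s.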