Troubleshooting

Common problems and how to fix them.

Quick fixes for the most common Subway issues — connection failures, name resolution, message delivery, bridge problems, and performance.

Connection issues

Agent won't connect to relay

Symptoms: Hangs on startup, ConnectionFailed error, or timeout.

Error: ConnectionFailed("dial relay.subway.dev:9000 timeout")

Check:

  1. Relay is reachable — the public relay at relay.subway.dev runs on port 9000 (QUIC/UDP):

    # Health check (HTTP on port 9001)
    curl https://relay.subway.dev/v1/health
  2. Firewall allows outbound UDP — QUIC uses UDP, not TCP. Some corporate networks and cloud providers block outbound UDP:

    # Quick connectivity test
    nc -u -z relay.subway.dev 9000
  3. DNS resolves — confirm the hostname resolves:

    dig relay.subway.dev
  4. Not a port conflict — if self-hosting a relay, ensure nothing else uses your QUIC port.
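Steps 2 and 3 can also be checked from Rust with only the standard library. The sketch below is a rough equivalent of the dig and nc commands above; the helper names resolve_relay and send_udp_probe are hypothetical and not part of Subway. Because UDP is connectionless, a successful send only shows that the local network stack accepted the datagram, not that the relay received it.

```rust
use std::io;
use std::net::{SocketAddr, ToSocketAddrs, UdpSocket};

/// Resolve a "host:port" string to socket addresses.
/// Returns an error if the DNS lookup fails.
/// (Hypothetical helper, not a Subway API.)
fn resolve_relay(host_port: &str) -> io::Result<Vec<SocketAddr>> {
    Ok(host_port.to_socket_addrs()?.collect())
}

/// Send a single one-byte UDP datagram to `target` and return the
/// number of bytes handed to the OS.
/// (Hypothetical helper, not a Subway API.)
fn send_udp_probe(target: SocketAddr) -> io::Result<usize> {
    // Bind an ephemeral local port in the same address family as the target.
    let local = if target.is_ipv4() { "0.0.0.0:0" } else { "[::]:0" };
    let socket = UdpSocket::bind(local)?;
    socket.connect(target)?;
    socket.send(&[0u8])
}
```

To try it against the public relay, pass "relay.subway.dev:9000" to resolve_relay and call send_udp_probe on the first address it returns. An error from either call points at DNS or local egress rules.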

Warning

Subway uses QUIC (UDP), not TCP. If your network only allows TCP egress, you'll need WebTransport support or the REST/WebSocket bridge as a fallback.

Agent connects but immediately disconnects

Symptoms: Brief connected log then RelayDisconnected event.

Common causes:

  • Name collision — another agent is already registered with the same name. Names are unique per relay. Pick a different name or shut down the other instance.
  • Relay at capacity — the relay has connection limits (max_connections defaults to 1,000,000 but self-hosted relays may be lower).

Debug:

RUST_LOG=subway_core=debug subway agent --name test.relay

Look for RelayDisconnected events or NameAlreadyRegistered errors in the debug output.
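If you capture the debug output to a file, a small filter can tell the two cases apart. This is a minimal sketch that assumes the identifiers NameAlreadyRegistered and RelayDisconnected appear verbatim in the log lines; the exact log format is not specified here, and DisconnectHint and classify_debug_log are hypothetical names.

```rust
/// Likely cause of an immediate disconnect, as far as the log shows.
/// (Hypothetical type for illustration.)
#[derive(Debug, PartialEq)]
enum DisconnectHint {
    NameCollision,
    RelayDisconnected,
    Unknown,
}

/// Scan captured debug output for the two markers named in this guide.
/// Assumes the identifiers appear verbatim somewhere in a log line.
fn classify_debug_log(log: &str) -> DisconnectHint {
    // A name collision is the more specific signal, so check it first.
    if log.lines().any(|line| line.contains("NameAlreadyRegistered")) {
        DisconnectHint::NameCollision
    } else if log.lines().any(|line| line.contains("RelayDisconnected")) {
        DisconnectHint::RelayDisconnected
    } else {
        DisconnectHint::Unknown
    }
}
```

A NameCollision result suggests picking a different name; RelayDisconnected on its own suggests checking relay capacity or health.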

Reconnection isn't working

Subway auto-reconnects with exponential backoff: 1s → 2s → 4s → ... → 30s max, up to 10 attempts.

If reconnection fails:

  • The relay may still be down — check /v1/health
  • All 10 attempts were exhausted — restart the agent process
  • Network changed (e.g., WiFi → cellular) — the QUIC connection can't recover from IP changes; restart required
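To estimate how long an agent keeps retrying before it gives up, you can reproduce the schedule yourself. The sketch below assumes plain doubling with a cap; the intermediate delays (8s, 16s) are inferred from the pattern above, and any jitter Subway may add is ignored. The function name backoff_schedule is hypothetical.

```rust
/// Delay in seconds before each reconnection attempt: starts at
/// `base_secs`, doubles each time, and never exceeds `cap_secs`.
/// (Hypothetical helper; assumes doubling with no jitter.)
fn backoff_schedule(base_secs: u64, cap_secs: u64, attempts: usize) -> Vec<u64> {
    let mut delays = Vec::with_capacity(attempts);
    let mut delay = base_secs.min(cap_secs);
    for _ in 0..attempts {
        delays.push(delay);
        // Double for the next attempt, clamped to the cap.
        delay = delay.saturating_mul(2).min(cap_secs);
    }
    delays
}

/// Total time spent waiting across all attempts.
fn total_wait_secs(delays: &[u64]) -> u64 {
    delays.iter().sum()
}
```

With a 1s base, a 30s cap, and 10 attempts this gives 1, 2, 4, 8, 16, 30, 30, 30, 30, 30, or 181 seconds of waiting in total if every attempt is preceded by its delay. A relay outage longer than roughly three minutes would therefore exhaust the retries.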

Name resolution

NameNotFound when sending to an agent

Error: NameNotFound("worker.relay")

The target agent isn't registered. Check:

  1. Is the target running? Names only exist while the agent process is alive.
  2. Exact name match — names are case-sensitive. Worker.relay and worker.relay are different names.
  3. Same relay — both agents must connect to the same relay. The default is relay.subway.dev:9000.
  4. Name expired — if the target agent lost connectivity, its name expires after ~30s (one missed renewal cycle).

Verify with resolve:

# CLI
subway resolve worker.relay
 
# REST
curl http://localhost:9002/v1/resolve/worker.relay
 
# Returns 404 if not found

Name registration fails

If your agent can't register its name:

  • Name taken — another agent (or a stale session) already has it. Names are first-come-first-served.
  • Invalid format — names must be non-empty strings. Convention is name.relay but any string works.
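You can rule out the format case before connecting. The sketch below encodes the two rules above, reading the name.relay convention as "a non-empty prefix followed by .relay"; that reading is an assumption, and NameCheck and check_name are hypothetical names.

```rust
/// Result of a local name check. (Hypothetical type for illustration.)
#[derive(Debug, PartialEq)]
enum NameCheck {
    /// Empty names are invalid.
    Empty,
    /// Valid, but does not follow the name.relay convention.
    Unconventional,
    /// Valid and follows the name.relay convention.
    Conventional,
}

/// Check a name locally before attempting registration.
/// Assumes the convention means "<non-empty prefix>.relay".
fn check_name(name: &str) -> NameCheck {
    if name.is_empty() {
        return NameCheck::Empty;
    }
    match name.strip_suffix(".relay") {
        Some(prefix) if !prefix.is_empty() => NameCheck::Conventional,
        _ => NameCheck::Unconventional,
    }
}
```

Only Empty indicates a format problem. If the name passes this check and registration still fails, the name is most likely taken.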

Message delivery

Messages sent but not received

  1. Name resolved to wrong peer — if you restarted an agent without the same keypair, it has a new PeerId. Other agents with cached routes may try the old peer. Wait for name re-registration (~30s) or resolve again.

  2. No message handler — if the target agent hasn't set up a message handler, messages are delivered to the agent but silently dropped at the application level.

  3. Payload too large — messages travel through relay circuits. Extremely large payloads may be rejected by the transport layer. Keep payloads under 1MB.
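For the third case, a size check on the sending side turns a silent transport rejection into an explicit error. This is a minimal sketch: it treats 1MB as 1,000,000 bytes, which is an assumption since the exact transport limit is not stated, and check_payload_size is a hypothetical helper.

```rust
/// Recommended upper bound from this guide.
/// Assumption: "1MB" is read as 1,000,000 bytes.
const MAX_PAYLOAD_BYTES: usize = 1_000_000;

/// Reject payloads that are not strictly under the recommended limit.
/// (Hypothetical helper, not a Subway API.)
fn check_payload_size(payload: &[u8]) -> Result<(), String> {
    if payload.len() < MAX_PAYLOAD_BYTES {
        Ok(())
    } else {
        Err(format!(
            "payload is {} bytes; keep it under {} bytes",
            payload.len(),
            MAX_PAYLOAD_BYTES
        ))
    }
}
```

Call it just before sending, and log the error rather than dropping the message so oversized payloads show up during debugging.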

RPC calls time out

Error: Timeout

The default RPC timeout is 30 seconds. Common causes:

| Cause | Fix |
| --- | --- |
| Target agent has no RPC handler | Register a handler with node.handle_rpc() |
| Handler is slow | Optimize or offload heavy work |
| Target agent is offline | Check with resolve first |
| Network partition | Check connectivity to relay |

Pattern: Check before calling

// Resolve first to get a fast failure
match node.resolve("target.relay").await {
    Ok(_peer_id) => {
        // Agent is online, safe to call
        let resp = node.call("target.relay", request).await?;
    }
    Err(SubwayError::NameNotFound(_)) => {
        eprintln!("target is offline");
    }
    Err(e) => eprintln!("resolve failed: {}", e),
}

Broadcasts not received

  • Not subscribed — the subscriber must call subscribe before the broadcast is sent. There's no message replay.
  • Topic mismatch — wildcards only work at one level. metrics.* matches metrics.cpu but not metrics.cpu.avg.
  • Subscriber disconnected — pub/sub is best-effort. If the subscriber was temporarily offline, they miss broadcasts.
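To check a pattern against a topic offline, the one-level wildcard rule can be sketched as below. It assumes topics are dot-separated and that * stands for exactly one segment, which is consistent with the metrics.* example but is not taken from Subway's implementation; topic_matches is a hypothetical name.

```rust
/// Return true if `topic` matches `pattern`, where `*` in the pattern
/// stands for exactly one dot-separated segment.
/// (Hypothetical sketch of the one-level wildcard rule.)
fn topic_matches(pattern: &str, topic: &str) -> bool {
    let pattern_segments: Vec<&str> = pattern.split('.').collect();
    let topic_segments: Vec<&str> = topic.split('.').collect();

    // One-level wildcards never change the number of segments.
    pattern_segments.len() == topic_segments.len()
        && pattern_segments
            .iter()
            .zip(&topic_segments)
            .all(|(p, t)| *p == "*" || p == t)
}
```

Under this sketch, a subscriber that needs both metrics.cpu and metrics.cpu.avg would subscribe to metrics.* and metrics.*.* separately.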

Bridge issues

REST API returns 404

curl http://localhost:9002/v1/send
# 404 Not Found

Check:

  • Use POST for send, call, broadcast. Only resolve, health, stats, and subscribe are GET.
  • Include /v1/ prefix — all endpoints are under /v1/.
  • Correct port — relay REST is on port 9001 (e.g., https://relay.subway.dev/). Standalone bridge defaults to 9002.
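The method rule is easy to get wrong in client code, so it can help to keep it in one place. The sketch below only restates the list above as a lookup; expected_method and endpoint_path are hypothetical helpers.

```rust
/// HTTP method the bridge expects for a /v1/ endpoint, per this guide.
/// Returns None for endpoint names the guide does not list.
/// (Hypothetical helper, not a Subway API.)
fn expected_method(endpoint: &str) -> Option<&'static str> {
    match endpoint {
        "send" | "call" | "broadcast" => Some("POST"),
        "resolve" | "health" | "stats" | "subscribe" => Some("GET"),
        _ => None,
    }
}

/// Build the request path with the required /v1/ prefix.
fn endpoint_path(endpoint: &str) -> String {
    format!("/v1/{}", endpoint)
}
```

The curl command at the top of this section fails the first check: it issues a GET against send, which expects POST.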

WebSocket connection rejected

WebSocket connection to 'ws://localhost:9002/ws' failed

  • Use the correct path — WebSocket endpoint is /ws, not /v1/ws.
  • Register immediately — send a register message right after connecting. The bridge expects registration within the first few seconds.
  • One name per connection — you can't register the same name from two WebSocket sessions.

WebSocket error messages

The bridge returns structured errors:

{"type": "error", "code": "name_not_found", "message": "agent 'ghost.relay' is not registered"}

| Code | Meaning |
| --- | --- |
| name_not_found | Target agent not registered |
| delivery_failed | Message couldn't be delivered |
| timeout | RPC call timed out (30s) |
| invalid_message | Malformed JSON or missing required fields |
| registration_failed | Name already taken or invalid |
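On the client side, mapping the code string to an enum keeps error handling exhaustive. This is a minimal sketch that assumes you have already extracted the code field from the JSON message; BridgeErrorCode is a hypothetical type, not one exported by Subway.

```rust
/// Error codes the bridge returns, per the table in this guide.
/// (Hypothetical client-side type.)
#[derive(Debug, PartialEq)]
enum BridgeErrorCode {
    NameNotFound,
    DeliveryFailed,
    Timeout,
    InvalidMessage,
    RegistrationFailed,
}

impl BridgeErrorCode {
    /// Parse the "code" field of an error message.
    /// Returns None for codes this guide does not list.
    fn from_code(code: &str) -> Option<Self> {
        match code {
            "name_not_found" => Some(Self::NameNotFound),
            "delivery_failed" => Some(Self::DeliveryFailed),
            "timeout" => Some(Self::Timeout),
            "invalid_message" => Some(Self::InvalidMessage),
            "registration_failed" => Some(Self::RegistrationFailed),
            _ => None,
        }
    }

    /// Human-readable meaning, copied from the table.
    fn meaning(&self) -> &'static str {
        match self {
            Self::NameNotFound => "Target agent not registered",
            Self::DeliveryFailed => "Message couldn't be delivered",
            Self::Timeout => "RPC call timed out (30s)",
            Self::InvalidMessage => "Malformed JSON or missing required fields",
            Self::RegistrationFailed => "Name already taken or invalid",
        }
    }
}
```

Treat a None from from_code as an unknown error and log the raw message, since the bridge may add codes that this guide does not cover.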

SSE subscribe stream closes immediately

  • Topic required: /v1/subscribe requires a ?topic= query parameter.
  • No keep-alive — if your HTTP client has a short timeout, it may close the SSE stream. Set a long or infinite timeout.
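Building the subscribe URL in one place avoids the missing-topic case. The sketch below rejects an empty topic and percent-encodes characters outside a conservative safe set; whether the bridge itself rejects empty topics or needs this encoding is an assumption, and subscribe_url is a hypothetical helper.

```rust
/// Build the SSE subscribe URL for `topic` on the bridge at `base`.
/// Returns an error for an empty topic, since ?topic= is required.
/// (Hypothetical helper, not a Subway API.)
fn subscribe_url(base: &str, topic: &str) -> Result<String, String> {
    if topic.is_empty() {
        return Err("topic must not be empty".to_string());
    }

    // Percent-encode every byte outside a conservative safe set.
    let mut encoded = String::new();
    for byte in topic.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' | b'*' => {
                encoded.push(byte as char)
            }
            _ => encoded.push_str(&format!("%{:02X}", byte)),
        }
    }

    Ok(format!(
        "{}/v1/subscribe?topic={}",
        base.trim_end_matches('/'),
        encoded
    ))
}
```

Pair the resulting URL with an HTTP client configured for a long or unlimited read timeout so the stream stays open.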

Build & installation

subway command not found

After installing:

# Check if it's in PATH
which subway
 
# Default install location
ls ~/.local/bin/subway
 
# Add to PATH if needed
export PATH="$HOME/.local/bin:$PATH"

Build from source fails

cargo build --release

Common issues:

| Error | Fix |
| --- | --- |
| protoc not found | Install Protocol Buffers: brew install protobuf or apt install protobuf-compiler |
| openssl not found | Install OpenSSL dev headers: apt install libssl-dev or brew install openssl |
| Git dependency fetch fails | Check SSH keys or use CARGO_NET_GIT_FETCH_WITH_CLI=true |
| Compile OOM on small VMs | Use cargo build --release -j 2 to limit parallelism |

Performance

High CPU usage

  • Debug logging is expensive: RUST_LOG=debug generates massive output. Use info in production.
  • High message throughput — Subway is designed for coordination messages, not bulk data transfer. If you're sending thousands of messages per second, consider batching.
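Batching can be as simple as grouping queued messages before sending. The sketch below only shows the grouping step; how a batch is encoded on the wire and unpacked by the receiver is application-specific and not covered here, and batch_messages is a hypothetical helper.

```rust
/// Group `messages` into batches of at most `max_batch` items,
/// preserving order. A `max_batch` of 0 is treated as 1.
/// (Hypothetical helper, not a Subway API.)
fn batch_messages<T: Clone>(messages: &[T], max_batch: usize) -> Vec<Vec<T>> {
    messages
        .chunks(max_batch.max(1)) // chunks() panics on 0, so clamp to 1
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

Each inner vector then becomes one Subway message, which cuts the message rate by up to a factor of max_batch. Keep the encoded batch under the 1MB payload guidance from the message delivery section.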

Memory grows over time

  • Subscription accumulation — if you subscribe to many topics without unsubscribing, each topic maintains a listener. Unsubscribe when done.
  • WebSocket sessions — each WebSocket connection spawns a full AgentNode. Many concurrent connections consume proportional memory.

Debugging checklist

When something isn't working, run through this:

Check relay health

# Public relay
curl https://relay.subway.dev/v1/health
 
# Local bridge
curl http://localhost:9002/v1/health

Should return {"status": "ok", ...}.

Enable debug logs

RUST_LOG=subway_core=debug subway agent --name test.relay

Verify name registration

curl http://localhost:9002/v1/resolve/your-agent.relay

Should return a PeerId. 404 means the agent isn't connected.

Test with a fresh agent

Start a clean agent with a unique name and try the simplest possible operation (send a message to yourself via another terminal).

Check your network

Confirm outbound UDP is allowed. Try from a different network if possible.

Getting help

If you're stuck after working through this guide: