Troubleshooting
Common problems and how to fix them.
Quick fixes for the most common Subway issues — connection failures, name resolution, message delivery, bridge problems, and performance.
Connection issues
Agent won't connect to relay
Symptoms: Hangs on startup, ConnectionFailed error, or timeout.
Error: ConnectionFailed("dial relay.subway.dev:9000 timeout")
Check:
- Relay is reachable: the public relay at relay.subway.dev runs on port 9000 (QUIC/UDP).
  # Health check (HTTP on port 9001)
  curl https://relay.subway.dev/v1/health
- Firewall allows outbound UDP: QUIC uses UDP, not TCP. Some corporate networks and cloud providers block outbound UDP.
  # Quick connectivity test
  nc -u -z relay.subway.dev 9000
- DNS resolves: confirm the hostname resolves.
  dig relay.subway.dev
- Not a port conflict: if self-hosting a relay, ensure nothing else uses your QUIC port.
Subway uses QUIC (UDP), not TCP. If your network only allows TCP egress, you'll need WebTransport support or the REST/WebSocket bridge as a fallback.
Agent connects but immediately disconnects
Symptoms: Brief connected log then RelayDisconnected event.
Common causes:
- Name collision: another agent is already registered with the same name. Names are unique per relay. Pick a different name or shut down the other instance.
- Relay at capacity: the relay has connection limits (max_connections defaults to 1,000,000, but self-hosted relays may be lower).
Debug:
RUST_LOG=subway_core=debug subway agent --name test.relay
Look for RelayDisconnected events or NameAlreadyRegistered errors in the debug output.
Reconnection isn't working
Subway auto-reconnects with exponential backoff: 1s → 2s → 4s → ... → 30s max, up to 10 attempts.
If reconnection fails:
- The relay may still be down: check /v1/health
- All 10 attempts were exhausted: restart the agent process
- Network changed (e.g., WiFi → cellular): the QUIC connection can't recover from IP changes; restart required
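The backoff schedule described above can be sketched as a small function. This is a minimal sketch, assuming plain doubling with a hard cap and no jitter; `backoff_schedule` is a hypothetical helper for illustration, not part of Subway's API, and the real client may differ in detail.

```rust
use std::time::Duration;

/// Computes the delay before each reconnection attempt.
/// Assumption: delays double from `base` and are clamped to `cap`, with no jitter.
fn backoff_schedule(base: Duration, cap: Duration, max_attempts: u32) -> Vec<Duration> {
    let mut delays = Vec::with_capacity(max_attempts as usize);
    let mut next = base;
    for _ in 0..max_attempts {
        // Never wait longer than the cap.
        delays.push(next.min(cap));
        // Double for the following attempt, saturating instead of overflowing.
        next = next.saturating_mul(2).min(cap);
    }
    delays
}
```

With a 1s base, a 30s cap, and 10 attempts, this yields 1, 2, 4, 8, 16, and then 30 five times. Under these assumptions an agent spends roughly three minutes retrying before it gives up, which is a useful bound when deciding how long to wait before restarting the process.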
Name resolution
NameNotFound when sending to an agent
Error: NameNotFound("worker.relay")
The target agent isn't registered. Check:
- Is the target running? Names only exist while the agent process is alive.
- Exact name match: names are case-sensitive. Worker.relay ≠ worker.relay.
- Same relay: both agents must connect to the same relay. The default is relay.subway.dev:9000.
- Name expired: if the target agent lost connectivity, its name expires after ~30s (one missed renewal cycle).
Verify with resolve:
# CLI
subway resolve worker.relay
# REST
curl http://localhost:9002/v1/resolve/worker.relay
# Returns 404 if not found
Name registration fails
If your agent can't register its name:
- Name taken: another agent (or a stale session) already has it. Names are first-come-first-served.
- Invalid format: names must be non-empty strings. The convention is name.relay, but any string works.
Message delivery
Messages sent but not received
- Name resolved to wrong peer: if you restarted an agent without the same keypair, it has a new PeerId. Other agents with cached routes may try the old peer. Wait for name re-registration (~30s) or resolve again.
- No message handler: if the target agent hasn't set up a message handler, messages are delivered to the agent but silently dropped at the application level.
- Payload too large: messages travel through relay circuits. Extremely large payloads may be rejected by the transport layer. Keep payloads under 1 MB.
RPC calls time out
Error: Timeout
The default RPC timeout is 30 seconds. Common causes:
| Cause | Fix |
|---|---|
| Target agent has no RPC handler | Register a handler with node.handle_rpc() |
| Handler is slow | Optimize or offload heavy work |
| Target agent is offline | Check with resolve first |
| Network partition | Check connectivity to relay |
Pattern: Check before calling
// Resolve first to get a fast failure
match node.resolve("target.relay").await {
    Ok(_peer_id) => {
        // Agent is online, safe to call
        let resp = node.call("target.relay", request).await?;
    }
    Err(SubwayError::NameNotFound(_)) => {
        eprintln!("target is offline");
    }
    Err(e) => eprintln!("resolve failed: {}", e),
}
Broadcasts not received
- Not subscribed: the subscriber must call subscribe before the broadcast is sent. There's no message replay.
- Topic mismatch: wildcards only work at one level. metrics.* matches metrics.cpu but not metrics.cpu.avg.
- Subscriber disconnected: pub/sub is best-effort. If the subscriber was temporarily offline, they miss broadcasts.
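The one-level wildcard rule is easy to get wrong, so it can help to model it locally. The sketch below assumes topics are dot-separated and that `*` stands for exactly one segment; `topic_matches` is a hypothetical function written for illustration and is not Subway's actual matcher.

```rust
/// Returns true when `topic` matches `pattern`.
/// Assumption: segments are separated by '.', and `*` matches exactly one segment.
fn topic_matches(pattern: &str, topic: &str) -> bool {
    let pattern_parts: Vec<&str> = pattern.split('.').collect();
    let topic_parts: Vec<&str> = topic.split('.').collect();
    // A one-level wildcard cannot absorb extra segments, so the lengths must agree.
    pattern_parts.len() == topic_parts.len()
        && pattern_parts
            .iter()
            .zip(topic_parts.iter())
            .all(|(p, t)| *p == "*" || p == t)
}
```

Under this model, `topic_matches("metrics.*", "metrics.cpu")` is true and `topic_matches("metrics.*", "metrics.cpu.avg")` is false, matching the behavior described in the list above. To receive the deeper topic, a subscriber would need a pattern with three segments, such as `metrics.cpu.*`.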
Bridge issues
REST API returns 404
curl http://localhost:9002/v1/send
# 404 Not Found
Check:
- Use POST for send, call, broadcast. Only resolve, health, stats, and subscribe are GET.
- Include /v1/ prefix: all endpoints are under /v1/.
- Correct port: relay REST is on port 9001 (e.g., https://relay.subway.dev/). Standalone bridge defaults to 9002.
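The method and prefix rules above can be captured in a couple of small helpers that a client might call before issuing a request. This is only a sketch: `expected_method` and `endpoint_url` are hypothetical names, and the endpoint list is limited to the seven names mentioned in this guide.

```rust
/// HTTP method the bridge expects for an endpoint, per the list above.
/// Returns None for endpoint names this guide does not mention.
fn expected_method(endpoint: &str) -> Option<&'static str> {
    match endpoint {
        "send" | "call" | "broadcast" => Some("POST"),
        "resolve" | "health" | "stats" | "subscribe" => Some("GET"),
        _ => None,
    }
}

/// Builds an endpoint URL with the required /v1/ prefix.
/// A trailing slash on `base` is tolerated.
fn endpoint_url(base: &str, endpoint: &str) -> String {
    format!("{}/v1/{}", base.trim_end_matches('/'), endpoint)
}
```

For example, `endpoint_url("http://localhost:9002", "send")` gives `http://localhost:9002/v1/send`, and `expected_method("send")` gives `Some("POST")`, so the plain GET in the failing curl call above is the likely cause of the 404.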
WebSocket connection rejected
WebSocket connection to 'ws://localhost:9002/ws' failed
- Use the correct path: the WebSocket endpoint is /ws, not /v1/ws.
- Register immediately: send a register message right after connecting. The bridge expects registration within the first few seconds.
- One name per connection: you can't register the same name from two WebSocket sessions.
WebSocket error messages
The bridge returns structured errors:
{"type": "error", "code": "name_not_found", "message": "agent 'ghost.relay' is not registered"}
| Code | Meaning |
|---|---|
| name_not_found | Target agent not registered |
| delivery_failed | Message couldn't be delivered |
| timeout | RPC call timed out (30s) |
| invalid_message | Malformed JSON or missing required fields |
| registration_failed | Name already taken or invalid |
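A client that handles these errors programmatically may prefer an enum over raw strings. The sketch below maps the `code` values from the table onto a hypothetical `BridgeErrorCode` type; the type and its variant names are assumptions made for illustration, and extracting the `code` field from the JSON is left to whatever JSON library the client already uses.

```rust
use std::str::FromStr;

/// Error codes listed in the table above (hypothetical client-side type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BridgeErrorCode {
    NameNotFound,
    DeliveryFailed,
    Timeout,
    InvalidMessage,
    RegistrationFailed,
}

impl FromStr for BridgeErrorCode {
    type Err = String;

    /// Parses the `code` string from a bridge error message.
    fn from_str(code: &str) -> Result<Self, Self::Err> {
        match code {
            "name_not_found" => Ok(Self::NameNotFound),
            "delivery_failed" => Ok(Self::DeliveryFailed),
            "timeout" => Ok(Self::Timeout),
            "invalid_message" => Ok(Self::InvalidMessage),
            "registration_failed" => Ok(Self::RegistrationFailed),
            // Keep unknown codes visible rather than guessing at their meaning.
            other => Err(format!("unknown bridge error code: {}", other)),
        }
    }
}
```

Returning an error for unrecognized codes means that a code added to the bridge later shows up in logs instead of being silently folded into an existing case.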
SSE subscribe stream closes immediately
- Topic required: /v1/subscribe requires a ?topic= query parameter.
- No keep-alive: if your HTTP client has a short timeout, it may close the SSE stream. Set a long or infinite timeout.
Build & installation
subway command not found
After installing:
# Check if it's in PATH
which subway
# Default install location
ls ~/.local/bin/subway
# Add to PATH if needed
export PATH="$HOME/.local/bin:$PATH"
Build from source fails
cargo build --release
Common issues:
| Error | Fix |
|---|---|
| protoc not found | Install Protocol Buffers: brew install protobuf or apt install protobuf-compiler |
| openssl not found | Install OpenSSL dev headers: apt install libssl-dev or brew install openssl |
| Git dependency fetch fails | Check SSH keys or use CARGO_NET_GIT_FETCH_WITH_CLI=true |
| Compile OOM on small VMs | Use cargo build --release -j 2 to limit parallelism |
Performance
High CPU usage
- Debug logging is expensive: RUST_LOG=debug generates massive output. Use info in production.
- High message throughput: Subway is designed for coordination messages, not bulk data transfer. If you're sending thousands of messages per second, consider batching.
Memory grows over time
- Subscription accumulation: if you subscribe to many topics without unsubscribing, each topic maintains a listener. Unsubscribe when done.
- WebSocket sessions: each WebSocket connection spawns a full AgentNode. Many concurrent connections consume proportional memory.
Debugging checklist
When something isn't working, run through this:
# Public relay
curl https://relay.subway.dev/v1/health
# Local bridge
curl http://localhost:9002/v1/health
Should return {"status": "ok", ...}.
# Run the agent with debug logging
RUST_LOG=subway_core=debug subway agent --name test.relay
# Resolve your agent's name through the local bridge
curl http://localhost:9002/v1/resolve/your-agent.relay
Should return a PeerId. 404 means the agent isn't connected.
Start a clean agent with a unique name and try the simplest possible operation (send a message to yourself via another terminal).
Confirm outbound UDP is allowed. Try from a different network if possible.
Getting help
If you're stuck after working through this guide:
- Check the error reference for specific error types
- Review the resilience docs for auto-recovery behavior
- Open an issue on GitHub