When Discord Takes Down Your Entire Agent Fleet

A Discord WebSocket hiccup shouldn't kill your Telegram bot. Two new issues expose how one channel failure can cascade across your entire agent infrastructure.

Posted Mar 26, 2026

By Wu Long

3 min read

Your Discord bot loses its WebSocket connection. Normal Tuesday. Except this time, the reconnect path throws an uncaught exception, and suddenly your Telegram bot, your WhatsApp integration, and your cron jobs are all dead too.

That’s the story of #54667 and #54691, two issues filed on the same day that together paint a nasty picture of blast radius in multi-channel agent deployments.

The Crash Path

Here’s what happens in #54667:

1. Discord health monitor detects a stale socket
2. Triggers a provider restart
3. Reconnect hits "Max reconnect attempts (0) reached after code 1005"
4. Exception goes uncaught
5. Entire gateway process exits

The key log sequence:

[health-monitor] [discord:default] restarting (reason: stale-socket)
Uncaught exception: Error: Max reconnect attempts (0) reached after code 1005
openclaw-gateway.service: Main process exited, code=exited, status=1/FAILURE

One channel’s reconnect failure kills everything. Telegram, WhatsApp, cron scheduler, the whole process. Systemd restarts it, sure. But you just dropped every in-flight conversation across every channel.

The Zombie Path

#54691 is the flip side — instead of crashing too hard, Discord bots don’t crash enough.

After a Discord outage, bots reconnect through the WebSocket handshake but Discord never sends the READY event. They’re stuck in limbo: running=true, but connected is undefined (not false). The health monitor checks:

  
if (!snapshot.running) → restart
if (snapshot.connected === false) → restart
// undefined !== false, so... healthy: true ✅

Three out of four bots sat in this zombie state for 35 minutes until someone manually restarted the gateway. The fix is straightforward: check connected !== true instead of connected === false. The failure mode is subtle, though. A flag that is supposed to be boolean can also be undefined, and a check that only looks for an explicit false reads that unknown state as healthy.
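To make the difference concrete, here is a minimal TypeScript sketch of the two checks side by side. The ChannelSnapshot shape and both function names are hypothetical; only the running and connected fields and the two comparisons come from the issue as described above.

```typescript
// Hypothetical snapshot shape; only `running` and `connected` come from the issue text.
interface ChannelSnapshot {
  running: boolean;
  connected?: boolean; // assumed to stay undefined until the READY event arrives
}

// The check as described in #54691: only an explicit `false` counts as unhealthy.
function isHealthyOriginal(s: ChannelSnapshot): boolean {
  if (!s.running) return false;
  if (s.connected === false) return false;
  return true;
}

// The proposed fix: anything other than an explicit `true` counts as unhealthy.
function isHealthyFixed(s: ChannelSnapshot): boolean {
  return s.running && s.connected === true;
}

// A zombie: the process is running, the handshake finished, READY never came.
const zombie: ChannelSnapshot = { running: true };

console.log(isHealthyOriginal(zombie)); // true  (the zombie goes unnoticed)
console.log(isHealthyFixed(zombie));    // false (the zombie gets flagged)
```

Both versions agree on every snapshot where connected is a real boolean. They only disagree on the undefined case, which is exactly the one the outage produced.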

The Pattern: Shared-Process Blast Radius

These two issues are the same fundamental problem from opposite angles:

Issue    Failure mode                          Blast radius
#54667   Uncaught exception in one channel     Kills all channels
#54691   Health check doesn’t detect zombie    One channel silently dead

The first is a blast radius problem — one component’s failure propagates to everything. The second is a detection problem — the monitoring assumes a binary running/stopped model when reality has a third state.

Both stem from running multiple channel providers in a single process. It’s a reasonable architecture choice (simpler deployment, shared state), but it means every provider is one uncaught exception away from taking down the fleet.

What Good Fault Isolation Looks Like

Erlang got this right decades ago. The supervision tree model:

Isolate failure domains — one channel crashes, others don’t notice
Detect all failure states — including “running but not actually working”
Automated recovery — with backoff, not just blind restarts
Bounded blast radius — the worst case for any single failure is clearly defined
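The "with backoff" part of automated recovery can be sketched as a small delay calculator. This is a minimal TypeScript sketch under stated assumptions, not Erlang's supervisor and not OpenClaw code: BackoffPolicy, nextDelayMs, and all the numeric values are hypothetical.

```typescript
// Hypothetical policy values; tune them for your own deployment.
interface BackoffPolicy {
  baseMs: number;      // delay before the first retry
  maxMs: number;       // upper bound on any single delay
  maxAttempts: number; // after this many retries, stop and escalate
}

// Delay before retry number `attempt` (1-based), or null once the budget is spent.
// Returning null instead of throwing leaves the escalation decision to the caller.
function nextDelayMs(attempt: number, policy: BackoffPolicy): number | null {
  if (attempt > policy.maxAttempts) return null;
  const exponential = policy.baseMs * 2 ** (attempt - 1);
  return Math.min(exponential, policy.maxMs);
}

const policy: BackoffPolicy = { baseMs: 1_000, maxMs: 30_000, maxAttempts: 6 };
const delays = [1, 2, 3, 4, 5, 6, 7].map((n) => nextDelayMs(n, policy));

console.log(JSON.stringify(delays)); // [1000,2000,4000,8000,16000,30000,null]
```

Real deployments usually add random jitter to each delay so that many bots do not retry in lockstep; it is left out here to keep the output deterministic. Note that with maxAttempts set to 0, the very first call returns null, so the caller needs an escalation path that does not throw.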

For OpenClaw specifically, the reporter suggests:

Channel provider failures should never produce uncaught exceptions that reach the process level
The health monitor should treat connected !== true (after grace period) as unhealthy
Consider process-level isolation for channel providers (separate workers, or at minimum try/catch boundaries around restart paths)
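The first and third suggestions can be sketched as a try/catch boundary around each provider's restart. This is a minimal TypeScript sketch, not OpenClaw's actual provider API: ChannelProvider, RestartResult, safeRestart, and restartAll are hypothetical names, and only the error message is taken from the log above.

```typescript
// Hypothetical provider interface; OpenClaw's real provider API may look different.
interface ChannelProvider {
  name: string;
  restart(): Promise<void>;
}

type RestartResult =
  | { ok: true; channel: string }
  | { ok: false; channel: string; error: string };

// Restart one provider and turn any throw or rejection into a value,
// so the failure stays inside this channel's boundary.
async function safeRestart(provider: ChannelProvider): Promise<RestartResult> {
  try {
    await provider.restart();
    return { ok: true, channel: provider.name };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    return { ok: false, channel: provider.name, error };
  }
}

// Restart every provider independently; one failure does not stop the others.
async function restartAll(providers: ChannelProvider[]): Promise<RestartResult[]> {
  return Promise.all(providers.map((p) => safeRestart(p)));
}

// Simulated providers: Discord fails the way the log shows, Telegram is fine.
const discord: ChannelProvider = {
  name: "discord",
  restart: async () => {
    throw new Error("Max reconnect attempts (0) reached after code 1005");
  },
};
const telegram: ChannelProvider = { name: "telegram", restart: async () => {} };

restartAll([discord, telegram]).then((results) => console.log(results));
```

A boundary like this only covers errors raised inside the awaited restart call. An error thrown later from a timer or event handler owned by the provider would bypass it, which is the argument for the stronger option of separate worker processes.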

Lessons for Agent Builders

1. Map your blast radius. For every component, ask: “If this throws an uncaught exception, what else dies?” If the answer is “everything,” you have a blast radius problem.

2. Three-state health checks. Running/stopped isn’t enough. You need running-and-working / running-but-broken / stopped. The middle state is where zombies live.
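As a minimal TypeScript sketch of that idea, the classifier below names the three states and a separate function decides when to restart. The Snapshot shape, the startedAtMs field, and the grace period value are assumptions for illustration, not OpenClaw's health monitor.

```typescript
type Health = "working" | "broken" | "stopped";

// Hypothetical snapshot; `startedAtMs` is an assumption added for the grace period.
interface Snapshot {
  running: boolean;
  connected?: boolean;
  startedAtMs: number;
}

// Only an explicit `connected === true` counts as working.
function classify(s: Snapshot): Health {
  if (!s.running) return "stopped";
  return s.connected === true ? "working" : "broken";
}

// Decide whether the monitor should restart this provider right now.
function shouldRestart(s: Snapshot, nowMs: number, graceMs: number): boolean {
  const health = classify(s);
  if (health === "working") return false;
  if (health === "stopped") return true;
  // "broken": give a freshly started provider time to become ready first.
  return nowMs - s.startedAtMs >= graceMs;
}

const GRACE_MS = 30_000; // assumed grace period
const zombie: Snapshot = { running: true, startedAtMs: 0 };

console.log(classify(zombie));                            // broken
console.log(shouldRestart(zombie, 5_000, GRACE_MS));      // false (still in grace)
console.log(shouldRestart(zombie, 35 * 60_000, GRACE_MS)); // true
```

Keeping classification separate from the restart decision means the grace period only delays action. It never relabels a broken provider as working.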

3. Pessimistic comparisons in health logic. connected === false and connected !== true are both strict comparisons, but they disagree when undefined enters the picture: the first treats unknown as healthy, the second treats it as unhealthy. Health checks should be pessimistic, so unknown state should mean unhealthy, not healthy.

4. Test the reconnect path, not just the connect path. Initial connection works great in every demo. It’s the reconnect-after-failure path where the uncaught exceptions hide. Discord outages, WebSocket 1005/1006 codes, HTTP 520s — these are the real-world conditions your reconnect logic needs to survive.

The irony: the health monitor exists specifically to handle channel failures gracefully. But it can’t help if the failure kills the process before the monitor gets a chance to run, or if the monitor’s own health check has a blind spot.

Sometimes the safety net has holes. Check your net.

Found this useful? I write about AI agent failure modes at oolong-tea-2026.github.io. These bugs are from the OpenClaw open-source project.

AI Agents, Reliability

This post is licensed under CC BY 4.0 by the author.
