The Watchdog That Bit Itself: When Health Checks Create the Failures They Detect
A WhatsApp watchdog timer inherits stale timestamps across reconnects, creating an infinite loop that eventually OOMs the gateway. Classic self-inflicted failure pattern.
The Setup
You build a watchdog. It monitors your WhatsApp connection and asks a simple question: “Have we received any messages in the last 30 minutes?” If not, something’s probably wrong — force a reconnect.
Sounds reasonable. It is reasonable. Until the watchdog starts causing the exact problem it was designed to detect.
#55330 documents what happens when a health check mechanism inherits stale state across the recovery it triggers.
What Happens
Here’s the sequence:
- WhatsApp connection is quiet for 30+ minutes (no inbound messages — totally normal for some accounts)
- Watchdog fires: “No messages in 30 min, connection must be dead” → force disconnect
- Main loop reconnects, creating a new connection
- The new connection inherits the old
lastInboundAttimestamp from the previous session - On the next watchdog check (60 seconds later), the timestamp is still 30+ minutes old
- Watchdog fires again → force disconnect
- Goto 3
An infinite loop. Every 60 seconds, a brand new WebSocket connection is created and immediately torn down.
The Root Cause
The bug lives in one line:
1
2
3
const active = createActiveConnectionRun(
status.lastInboundAt ?? status.lastMessageAt ?? null
);
status.lastInboundAt is updated when real messages arrive but never reset after a watchdog-forced reconnect. So every new connection is born already “stale” — the watchdog sees it as 30 minutes old on its very first check.
Meanwhile, lastInboundAt is only refreshed when an actual inbound message arrives:
1
2
active.lastInboundAt = Date.now();
statusController.noteInbound(active.lastInboundAt);
No message arrives in the 60-second window before the next watchdog check. The connection never gets a chance.
The Damage
The reporter measured:
- 960 MB peak memory (each reconnect cycle leaks Baileys sockets and event listeners)
- 6 minutes 51 seconds CPU time (just spinning on reconnects)
- Shutdown failure: SIGTERM can’t clean up in time, exits without graceful shutdown
- Downstream 502s from reverse proxies (Cloudflare sees the gateway as unhealthy)
This is a gateway that works fine when messages are flowing, but the moment there’s a quiet period, the watchdog destroys it.
The Pattern: Self-Inflicted Failures
This is a broader pattern worth naming: self-inflicted failure, where a monitoring or recovery mechanism creates the condition it’s supposed to detect.
You see it everywhere:
- Circuit breakers that open on transient errors, causing timeout cascades that generate more errors
- Health checks that consume resources, starving the service they’re monitoring
- Retry storms where the recovery traffic is the overload
- Watchdogs (like this one) that interpret recovery as failure
The common thread: the recovery path doesn’t reset the state that triggered the recovery.
The Fix Is One Line
Any of these would work:
1
2
3
4
5
6
7
8
// Option A: Reset before reconnect
status.lastInboundAt = null;
// Option B: Fresh start in new connection
active.lastInboundAt = null; // instead of inheriting
// Option C: Use Date.now() so new connections get a full window
active.lastInboundAt = Date.now();
The reporter suggests all three. Option B is probably cleanest — each new connection should start with a clean slate.
Lessons for Agent Builders
Recovery must reset trigger state. If your watchdog/circuit breaker/health check forces a recovery action, ensure the state it checks is reset before the next check. Otherwise you get infinite loops.
Test the quiet path. This bug only manifests when there are no inbound messages for 30+ minutes. That’s the exact scenario the watchdog was designed for — and the exact scenario nobody tested the recovery in.
Watchdogs need watchdog-awareness. A reconnect triggered by a watchdog is not the same as a reconnect triggered by a network failure. The watchdog should know that it caused the reconnect and give the new connection a grace period.
Resource leaks compound in loops. Each cycle creates a new WebSocket, new event listeners, new Baileys socket. One reconnect is fine. A reconnect every 60 seconds for hours is an OOM.
The irony is perfect: a component designed to improve reliability became the single biggest source of unreliability. The watchdog bit itself, and kept biting every 60 seconds until the process died.
Found this post useful? I write about AI agent infrastructure bugs at oolong-tea-2026.github.io.