Why Your WebSocket Connections Drop Under Load
It’s 9:47 PM on a Friday, and your real-time multiplayer game has 847 concurrent users. The chat is flying, scores are updating, everything hums. Then, at 9:48, 300 people drop. The logs show a cascade of WebSocket close code 1006 errors. You scramble, restart the server, and traffic stabilizes—but you have no idea why it happened. You are not alone.
Every indie dev who builds anything real-time—a live auction site, a collaborative editor, a betting dashboard—hits this wall. WebSockets feel magical when they work, but under load, they break in predictable, infuriating ways. The good news? The causes are finite, measurable, and fixable. Let’s trace the fault lines.
The Architecture Trap: One Process to Rule Them All
The most common mistake is treating a WebSocket server like a REST API server. You spin up a single Node.js process, attach a ws library, and assume it handles concurrency the same way an Express route does. It doesn’t.
The Event Loop Isn’t Infinite
Node.js runs your JavaScript on a single thread. That thread handles incoming HTTP requests, I/O callbacks, and WebSocket frames. Under light load, that’s fine. Under load (say, 2,000 concurrent connections each sending a heartbeat every 10 seconds) the event loop starts to lag.
Here’s what happens: The ws.send() calls queue up. The garbage collector runs. A third-party logging library blocks the loop for 15 milliseconds. During that 15 ms, the OS’s TCP buffer fills. The client’s keepalive timeout fires. The client closes the socket. The server never gets a chance to send a ping frame in time.
I once watched a production server drop 40% of its connections every hour because of a single console.log inside a broadcast loop that emitted 50,000 messages per second. The console.log was synchronous. The event loop choked. The sockets died.
The fix: Never do synchronous I/O inside a WebSocket message handler. Offload logging to a background worker or a dedicated stream. Use pino or bunyan instead of console.log. And measure your event loop lag with process.hrtime() or a tool like clinic.
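To put a number on that lag, here is a minimal sketch of a drift-based monitor built on process.hrtime.bigint(). The names startLagMonitor and computeLagMs, the 500 ms sampling interval, and the 100 ms warning threshold are my own assumptions, not part of any library; tune them to your workload.

```javascript
// Pure helper: how late did a timer fire, in milliseconds?
// Both arguments are BigInt nanosecond timestamps from process.hrtime.bigint().
function computeLagMs(expectedNs, actualNs) {
  const lateNs = actualNs - expectedNs;
  return lateNs > 0n ? Number(lateNs) / 1e6 : 0;
}

// Schedules a repeating timer and reports how far behind schedule it fires.
// A busy event loop delays the timer, so the delay approximates loop lag.
function startLagMonitor({
  intervalMs = 500,
  warnAtMs = 100,
  // Default reporter is for illustration only; swap in your async logger.
  onLag = (ms) => console.warn(`event loop lag: ${ms.toFixed(1)} ms`),
} = {}) {
  const intervalNs = BigInt(intervalMs) * 1000000n;
  let expected = process.hrtime.bigint() + intervalNs;

  const timer = setInterval(() => {
    const now = process.hrtime.bigint();
    const lagMs = computeLagMs(expected, now);
    expected = now + intervalNs;
    if (lagMs >= warnAtMs) onLag(lagMs);
  }, intervalMs);

  timer.unref(); // monitoring alone should not keep the process alive
  return () => clearInterval(timer); // call this to stop monitoring
}
```

Start it once at boot with startLagMonitor() and watch whether warnings line up with your connection drops. If they do, the event loop is the place to look first.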
The Single-Process Ceiling
Even with perfect event loop hygiene, one Node process can handle roughly 10,000 to 20,000 concurrent WebSocket connections on typical hardware. That sounds like a lot until you factor in broadcast storms—sending the same message to every connected client.
If you have 15,000 users and a game event fires a broadcast, that’s 15,000 ws.send() calls in a tight loop. Each call allocates a buffer, writes to the socket, and triggers a system call. The CPU spikes. The event loop stalls. Connections drop.
The fix: Shard your connections across multiple processes or machines. Use a Redis pub/sub layer so each process only sends to its own clients. Or switch to a language with better concurrency primitives—Go or Rust—where each connection runs on its own goroutine or task.
The Proxy Problem: When Infrastructure Fights You
Your WebSocket server might be perfect, but the infrastructure between the browser and your server is a minefield. Load balancers, reverse proxies, and CDNs all have opinions about long-lived connections.
Load Balancer Timeouts
Most load balancers and reverse proxies (AWS ALB, Nginx, HAProxy) enforce an idle timeout. For HTTP, the default is usually 60 seconds. For WebSockets, it should be effectively unlimited, or at least higher than your longest heartbeat interval.
But here’s the trap: The proxy doesn’t tell you it’s closing the socket. It just tears down the TCP connection, often with a bare RST packet. The client sees a sudden disconnect. The server never sees a close frame. You’re left staring at close code 1006 (abnormal closure) in your logs.
I debugged this for three days on a project. The proxy’s idle timeout was set to 120 seconds. Our WebSocket heartbeat interval was 30 seconds. The client sent pings every 30 seconds, but the server’s pong responses were delayed by 90 seconds due to a slow database query. The proxy saw 120 seconds of silence and killed the connection.
The fix: Set your load balancer’s WebSocket idle timeout to at least 300 seconds. Or better, disable it entirely if your provider allows. Then ensure your server responds to pings within a few hundred milliseconds. Profile your message handlers.
Head-of-Line Blocking at the Proxy
Some proxies buffer WebSocket frames. If a proxy buffers a large frame—say, a 1 MB binary blob—before forwarding it, that frame blocks all subsequent frames for that connection. The client’s receive buffer fills. The client’s flow control kicks in. The connection stalls.
The fix: Use a proxy that supports WebSocket passthrough without buffering. Nginx with proxy_http_version 1.1, proxy_set_header Upgrade $http_upgrade, and proxy_set_header Connection "upgrade" works. HAProxy in TCP mode works. Avoid any proxy that inspects WebSocket payloads.
The Client-Side Blind Spot: Mobile Networks and Browser Limits
You can tune your server until it sings, but the client is often the real culprit. Mobile networks, corporate firewalls, and browser resource limits all conspire to drop connections.
Mobile Network Handovers
When a user moves between WiFi and cellular, or sometimes from one cell tower to another, their IP address can change. The TCP connection is tied to that IP. The socket dies. The WebSocket reconnection logic kicks in, if you wrote any.
Many mobile carriers also run NATs or transparent proxies that drop idle TCP connections, sometimes after as little as 30–60 seconds. On a 4G network, you can lose 10% of your connections every minute if your heartbeat interval is too long.
The fix: Set your client-side heartbeat interval to 15 seconds. Where the navigator.connection API is available (it is not supported in every browser, Safari among them), use it alongside the online and offline events to detect when the user switches from WiFi to cellular and force a reconnect. Implement exponential backoff with jitter so reconnection storms don’t crash your server.
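Exponential backoff with jitter is easy to get subtly wrong, so here is a minimal sketch of the "full jitter" variant: pick a random delay between zero and an exponentially growing ceiling. The function name reconnectDelayMs, the 1-second base, and the 30-second cap are assumptions for illustration.

```javascript
// Returns how long to wait before reconnect attempt number `attempt` (0-based).
// The ceiling doubles each attempt up to `capMs`; the actual delay is a random
// point below that ceiling, so clients that dropped together spread out.
function reconnectDelayMs(attempt, { baseMs = 1000, capMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Hypothetical usage in a browser client:
//
//   let attempt = 0;
//   function connect() {
//     const socket = new WebSocket(url);          // `url` is yours to supply
//     socket.onopen = () => { attempt = 0; };     // reset after a good connect
//     socket.onclose = () => setTimeout(connect, reconnectDelayMs(attempt++));
//   }
```

The random parameter exists only so the function can be tested deterministically. Without the jitter, 300 clients that dropped at the same instant would all come back at the same instant too.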
Browser Tab Throttling
Modern browsers throttle JavaScript execution in background tabs. Chrome, for example, limits setTimeout to once per second in background tabs, and can throttle far more aggressively once a tab has been hidden for several minutes. If your client heartbeat relies on setTimeout or setInterval, it fires late. The server closes the connection because it hasn’t seen a heartbeat in 60 seconds.
Even worse, some browsers (Safari, Firefox on Android) completely freeze WebSocket message processing in background tabs after a few minutes. The connection stays open at the TCP level, but no application-level messages flow.
The fix: Send ping frames from the server side (ws.ping() in the ws library) instead of relying on client heartbeats. The browser WebSocket API has no ping method, but browsers answer incoming ping frames with a pong automatically, even when the tab is backgrounded, because control frames are handled by the browser’s network stack, not the JavaScript event loop. Alternatively, use a Web Worker to keep the connection alive.
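A server-driven heartbeat can be sketched as follows. It assumes a ws server object: wss.clients, socket.ping(), socket.terminate(), and the pong event come from that library, while the function names startHeartbeat and sweep and the isAlive flag are illustrative choices of mine.

```javascript
// One pass over the connected sockets. A socket that has not answered the
// previous ping is terminated; every other socket is marked and pinged again.
function sweep(clients) {
  for (const socket of clients) {
    if (socket.isAlive === false) {
      socket.terminate(); // no pong since the last sweep: assume it is dead
      continue;
    }
    socket.isAlive = false; // flipped back to true by the pong handler below
    socket.ping();
  }
}

// Wires the sweep into a ws-style server. `wss` is expected to emit
// 'connection' and 'close' and to expose a `clients` set.
function startHeartbeat(wss, intervalMs = 15000) {
  wss.on('connection', (socket) => {
    socket.isAlive = true;
    socket.on('pong', () => { socket.isAlive = true; });
  });

  const timer = setInterval(() => sweep(wss.clients), intervalMs);
  wss.on('close', () => clearInterval(timer));
  return timer;
}
```

With a 15-second interval, a dead connection is detected and cleaned up within about 30 seconds, and the regular pings keep intermediate proxies from seeing the connection as idle.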
The Protocol Pitfall: Framing, Fragmentation, and Flow Control
WebSocket is a framed protocol, but the details matter under load. A single mistake in your framing logic can cause data corruption, buffer bloat, and dropped connections.
Message Fragmentation and Memory Pressure
WebSocket messages can be fragmented into multiple frames. The ws library in Node.js reassembles them in memory. If a client sends a 50 MB message in fragments, your server holds 50 MB in a buffer until the final fragment arrives.
Under load, a few malicious or buggy clients can exhaust your server’s memory. The OS kills the process. Every connection drops.
The fix: Set a maximum message size. The ws library supports maxPayload in the server options; its default is 100 MiB, which is far more than most applications need. Reject any message larger than 1 MB (or whatever your application needs). Also set a timeout for fragmented messages: if the final fragment doesn’t arrive within 10 seconds, close the connection.
Backpressure Ignorance
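When a server sends data faster than the client can consume it, the client’s TCP receive buffer fills. The client advertises a zero window. The server’s kernel send buffer fills. TCP itself will wait patiently; the damage happens in your process, where every message you try to send after that point piles up in memory.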
This is called backpressure. Most WebSocket libraries ignore it. ws.send() returns immediately, even if the underlying socket is saturated. You keep queuing messages, memory grows, and the connection eventually dies.
The fix: Check the buffered amount before sending. In ws, use socket.bufferedAmount. If it exceeds a threshold (say, 64 KB), pause sending and resume when it drains. Or switch to a library that supports backpressure natively, like uWebSockets.js or WebSocket-Node.
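Here is one way the check-then-pause idea might look. The sketch only relies on the socket exposing bufferedAmount and a send(data, callback) method, as ws does. The class name ThrottledSender, the 64 KB high-water mark, the 100-message queue limit, and the drop-oldest policy are all assumptions; a chat app and a price feed would choose differently.

```javascript
// Wraps a socket and holds messages back while its send buffer is saturated.
class ThrottledSender {
  constructor(socket, { highWaterMark = 64 * 1024, maxQueued = 100 } = {}) {
    this.socket = socket;
    this.highWaterMark = highWaterMark;
    this.maxQueued = maxQueued;
    this.queue = [];
  }

  send(data) {
    this.queue.push(data);
    // Bound memory: when the client is hopelessly behind, drop the oldest.
    if (this.queue.length > this.maxQueued) this.queue.shift();
    this.flush();
  }

  // Sends queued messages until the socket's buffer passes the high-water mark.
  // Each completed write triggers another flush, which resumes a paused queue.
  flush() {
    while (this.queue.length > 0 && this.socket.bufferedAmount <= this.highWaterMark) {
      this.socket.send(this.queue.shift(), () => this.flush());
    }
  }
}
```

For state that supersedes itself, such as scores or odds, dropping stale messages is usually fine. For messages that must arrive, disconnecting the slow client and letting it resync is often safer than queuing without limit.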
The Scaling Solution: From 100 to 100,000 Connections
If you’ve fixed all the above and still drop connections at scale, it’s time to change your architecture. Here’s a realistic path from indie dev to production-grade.
Step 1: Vertical Scaling with a Fast Runtime
Before adding processes, max out a single instance. Switch from the ws library to uWebSockets.js, a C++ server exposed to Node.js as a native addon, which can handle roughly 10x more connections per core. Or use Go with gorilla/websocket; I’ve seen a single Go process handle 50,000 connections with 2% CPU.
Step 2: Horizontal Scaling with a Pub/Sub Bus
When one process isn’t enough, run multiple instances behind a load balancer. Use Redis pub/sub to broadcast messages across instances. Each instance subscribes to a channel and only sends to its own clients.
But beware: Redis pub/sub is fire-and-forget. If a subscriber is disconnected, or falls so far behind that Redis drops it, it never sees the messages published in the meantime. For critical messages (like a bet confirmation), use Redis streams or a persistent queue.
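The fan-out in Step 2 has a simple shape, sketched here with Node’s built-in EventEmitter standing in for Redis so that it runs without a server. That substitution is an assumption made for illustration: in production the bus would be a Redis client, with separate connections for subscribing and publishing. The function names attachInstance and publish are hypothetical.

```javascript
const { EventEmitter } = require('node:events');

// Each server instance registers its own locally connected sockets.
// When a message arrives on the bus, it sends only to those sockets.
function attachInstance(bus, channel, localClients) {
  const deliver = (payload) => {
    for (const socket of localClients) socket.send(payload);
  };
  bus.on(channel, deliver);
  return () => bus.off(channel, deliver); // call on shutdown
}

// Any instance can publish. The message is serialized once, before fan-out,
// so subscribers forward the same string without re-encoding it.
function publish(bus, channel, message) {
  bus.emit(channel, JSON.stringify(message));
}

// Hypothetical wiring, with an in-process bus shared by two "instances":
//
//   const bus = new EventEmitter();
//   attachInstance(bus, 'room:1', instanceAClients);
//   attachInstance(bus, 'room:1', instanceBClients);
//   publish(bus, 'room:1', { score: 3 });
```

The point of the shape is that no instance ever needs to know about another instance’s sockets. Each one only owns the connections that landed on it.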
Step 3: Stateful Sockets with a Session Store
WebSocket connections carry state—user ID, room, session token. If your server restarts or a client reconnects to a different instance, you lose that state. Use Redis to store session data. On reconnect, the client sends a token, and the new instance loads the state.
Step 4: Graceful Degradation
At very high scale (100k+ connections), WebSockets aren’t the only option. Consider fallback to Server-Sent Events (SSE) for one-way updates, or long-polling as a last resort. The browser’s EventSource API handles reconnection automatically, and because SSE is plain HTTP it needs no Upgrade handling at the proxy. Idle timeouts still apply, so keep sending periodic comment lines as a heartbeat.
One Anecdote That Changed My Approach
I was building a real-time odds feed for a sports betting platform. We had 12,000 concurrent connections. Every 10 seconds, we broadcast a JSON blob of 200 KB to every client. The server ran on a single 8-core machine with Node.js.
Connections dropped every 5 minutes. The pattern was always the same: CPU spikes to 100%, event loop lag hits 500 ms, then a cascade of 1006 errors. I spent a week tuning: reducing JSON size, switching to MessagePack, using the cluster module for multi-processing. Nothing fixed it.
Finally, I profiled the broadcast loop. The bottleneck wasn’t the serialization format or network I/O. It was that we serialized the same 200 KB object with JSON.stringify 12,000 times per cycle, once per client. Each call allocated a new string. The garbage collector ran every 2 seconds, freezing the event loop for 40 ms.
The fix: Serialize once. Cache the serialized string. Send the same buffer to every client. CPU dropped from 100% to 30%. Connection drops vanished.
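In code, the change is small. This sketch assumes ws-style sockets that expose readyState and send(); the function name broadcast and the OPEN constant are mine (1 is the value of WebSocket.OPEN).

```javascript
const OPEN = 1; // same value as WebSocket.OPEN

// Serializes `update` exactly once, then hands the same string to every
// open socket. Returns how many sockets the payload was sent to.
function broadcast(clients, update) {
  const payload = JSON.stringify(update); // one allocation per cycle, not one per client
  let delivered = 0;
  for (const socket of clients) {
    if (socket.readyState !== OPEN) continue; // skip closing or closed sockets
    socket.send(payload);
    delivered++;
  }
  return delivered;
}
```

The slow version had JSON.stringify inside the loop. Moving one line above the loop is the entire fix.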
That lesson stuck: Under load, the simplest operation becomes the bottleneck. Profile before you optimize.
The Forward-Looking Takeaway
WebSocket stability under load isn’t a feature—it’s a discipline. Every dropped connection tells you something specific about your system. The proxy timeout. The event loop lag. The buffer bloat. The garbage collection pause.
Start with a single process and a heartbeat interval of 15 seconds. Add a load balancer with a high timeout. Profile your broadcast loop. Cache your serialized payloads. And when you think you’re done, test with 10x your expected load using a tool that speaks WebSocket, like k6 or Artillery.
The next time you see 300 users drop at 9:48 PM, you won’t scramble. You’ll check the logs, identify the pattern, and fix it before the next spike. That’s the difference between a weekend project and a production system.