Why Your Socket.io Handshake Fails Behind an AWS Load Balancer
You’ve built a real-time feature—maybe a live leaderboard, a chat widget, or a multiplayer game lobby. It works perfectly on localhost. You deploy to EC2, put an Application Load Balancer in front, and suddenly your Socket.io clients can’t connect. The browser console shows repeated polling attempts, then a timeout. The server logs show no handshake ever completed.
This is one of the most common deployment traps for indie devs scaling their first Node.js WebSocket service. The socket handshake fails not because your code is wrong, but because your load balancer is silently eating the upgrade request. Here’s exactly why it happens and how to fix it without ripping your hair out.
The Anatomy of a Socket.io Handshake
Socket.io starts every connection as a long-polling HTTP request. This is deliberate: it gives the library a fallback when WebSockets can’t negotiate. The client sends a standard GET to the server’s /socket.io/ endpoint with EIO=4 and transport=polling query parameters. No upgrade headers are involved yet.
The server responds with a 200 OK and a session ID (sid). Then, if both sides support WebSockets, the client sends a second request that carries the Connection: Upgrade and Upgrade: websocket headers along with that session ID. The server answers with 101 Switching Protocols, the protocol switches from HTTP to WebSocket, and the persistent connection is born.
Behind a load balancer, that second step is where everything breaks. The ALB has to route the upgrade request to the instance that issued the session ID and then keep the upgraded connection open in both directions. If it doesn’t, your Node.js server never gets the chance to complete the handshake.
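The two requests described above can be sketched as plain URL and packet handling. This is an illustration of the Engine.IO v4 wire format, not code from the Socket.io library; the function names and the example.com host are made up for the example.

```javascript
// Build the URL for the initial long-polling handshake (Engine.IO protocol v4).
function handshakeUrl(baseUrl) {
  const url = new URL("/socket.io/", baseUrl);
  url.searchParams.set("EIO", "4");
  url.searchParams.set("transport", "polling");
  return url.toString();
}

// Parse the handshake response body: packet type "0" (open) followed by JSON
// that contains the session ID and the transports the server can upgrade to.
function parseOpenPacket(body) {
  if (body[0] !== "0") {
    throw new Error("not an Engine.IO open packet");
  }
  return JSON.parse(body.slice(1));
}

// Build the URL for the WebSocket upgrade, reusing the session ID from step one.
function upgradeUrl(baseUrl, sid) {
  const url = new URL("/socket.io/", baseUrl);
  url.protocol = url.protocol === "https:" ? "wss:" : "ws:";
  url.searchParams.set("EIO", "4");
  url.searchParams.set("transport", "websocket");
  url.searchParams.set("sid", sid);
  return url.toString();
}

// Example with a hypothetical host and session ID.
const open = parseOpenPacket('0{"sid":"abc123","upgrades":["websocket"]}');
console.log(handshakeUrl("https://example.com"));
console.log(upgradeUrl("https://example.com", open.sid));
```

The sid query parameter in the second URL is the detail to keep in mind: it only means something to the process that issued it.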
The Classic ALB Configuration Trap
Settings That Kill WebSockets
AWS Application Load Balancers use HTTP and HTTPS listeners, and a target group’s protocol is HTTP or HTTPS as well. The ALB terminates the client’s TCP connection, inspects each request, and opens its own connection to one of your instances.
A WebSocket upgrade is an ordinary HTTP/1.1 GET request that asks for a protocol switch, and the ALB forwards it natively. What it cannot do is carry that upgrade over a target group whose protocol version is HTTP2 or gRPC, and it cannot help if a proxy between the ALB and Node strips the Upgrade and Connection headers. In either case your Socket.io client, waiting for a 101 Switching Protocols response, times out or stays stuck on polling.
The fix is deceptively simple: your target group must use protocol version HTTP1 with stickiness enabled, and nothing between the ALB and Node may drop the upgrade headers. HTTP1 is the default protocol version, so in practice the setting that goes wrong most often is stickiness.
Sticky Sessions Are Non-Negotiable
Here’s where a lot of indie devs get burned. The ALB forwards the upgrade, the first request succeeds, and then your client starts getting HTTP 400 errors (“Session ID unknown”) on subsequent requests. The problem is that Socket.io assigns a session ID in the initial HTTP response and keeps that session in the memory of the process that issued it. Every later polling request, and the WebSocket upgrade itself, must land on the same EC2 instance that issued that session.
Without sticky sessions (also called session affinity), the ALB may route the upgrade request to a different instance than the one that handled the initial poll. That second instance has no memory of the session, so it rejects the upgrade. The handshake fails silently.
You enable stickiness on the target group using a load balancer-generated cookie. The duration has to cover the polling phase and any later reconnects; 60 to 120 seconds is a workable starting point. Once the WebSocket is established it is a single long-lived connection, so it stays on the instance that accepted it.
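The routing problem is easy to reproduce in a toy model. The sketch below is a simulation only, with invented names; it stands in for two Node processes and a load balancer and does not talk to AWS or Socket.io.

```javascript
// Each "instance" only knows the session IDs it issued itself.
function createInstance(name) {
  const sessions = new Set();
  return {
    name,
    handshake(sid) { sessions.add(sid); },
    upgrade(sid) { return sessions.has(sid); }, // false stands for "Session ID unknown"
  };
}

// Round-robin router with no stickiness: every request goes to the next instance.
function createRoundRobin(instances) {
  let next = 0;
  return () => instances[next++ % instances.length];
}

// Sticky router: a client (identified by its cookie) always maps to one instance.
function createSticky(instances) {
  const pinned = new Map();
  const pick = createRoundRobin(instances);
  return (clientId) => {
    if (!pinned.has(clientId)) pinned.set(clientId, pick());
    return pinned.get(clientId);
  };
}

// Run handshake + upgrade for each client and count the upgrades that succeed.
function simulate(route, clients) {
  let ok = 0;
  for (let i = 0; i < clients; i++) {
    const sid = `sid-${i}`;
    route(i).handshake(sid);
    if (route(i).upgrade(sid)) ok++;
  }
  return ok;
}

const plain = simulate(createRoundRobin([createInstance("a"), createInstance("b")]), 4);
const sticky = simulate(createSticky([createInstance("a"), createInstance("b")]), 4);
console.log({ plain, sticky });
```

In this strict alternation every upgrade lands on the wrong instance. With real, interleaved traffic the misses are intermittent, which is what makes the bug hard to spot.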
The Proxy Headers Problem
Even after you fix the ALB settings, you might see your socket connection drop after a few minutes. The JavaScript console shows a “transport error” or “xhr poll error,” and the server logs show a disconnect with no client message.
This is often a proxy issue. Your Node.js server receives the connection from the ALB’s IP, not the client’s IP; the real client address arrives in the x-forwarded-for header. Socket.io itself does not compare IPs between requests, but anything you build on the address (rate limiting, allowRequest checks, logging) will see every client as the load balancer unless you read that header. Drops after a fixed quiet period also point at the ALB’s idle timeout, which defaults to 60 seconds and closes connections that carry no traffic.
You need to enable proxy trust in your Node.js HTTP server. In Express, that’s app.set('trust proxy', true), which makes req.ip reflect x-forwarded-for. In a raw Node HTTP server, or inside Socket.io handlers, you read the header yourself from socket.handshake.headers['x-forwarded-for']. Without this, IP-based checks will misfire intermittently as clients reconnect.
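Reading the header by hand can look like the sketch below. It assumes the ALB is the only proxy in front of Node and that it appends the address it saw to the end of x-forwarded-for; the function name and the trustedProxies parameter are illustrative, not part of Socket.io or Express.

```javascript
// Return the client IP from an X-Forwarded-For value.
// trustedProxies is the number of proxies you control in front of Node
// (1 when the ALB is the only hop). Entries further left may be client-supplied.
function clientIpFromForwardedFor(headerValue, trustedProxies = 1) {
  if (!headerValue) return null;
  const hops = headerValue
    .split(",")
    .map((part) => part.trim())
    .filter(Boolean);
  const index = hops.length - trustedProxies;
  return index >= 0 ? hops[index] : null;
}

// Example values; the addresses come from documentation ranges.
console.log(clientIpFromForwardedFor("203.0.113.7"));
console.log(clientIpFromForwardedFor("198.51.100.1, 203.0.113.7"));
```

Inside a connection handler you would pass in socket.handshake.headers["x-forwarded-for"]. Taking the entry counted from the right, rather than the first one, avoids trusting a value the client may have set itself.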
Real-World Failure: The 2 AM Deploy
I hit this wall last year while deploying the real-time matchmaking system for a small iGaming platform. The game lobby worked flawlessly on a single t3.micro instance. We added a second instance behind an ALB for redundancy, and suddenly every other player failed to connect.
The pattern was unmistakable: odd-numbered connection attempts succeeded, even-numbered ones failed. The ALB was round-robining between two instances, and without stickiness, every other handshake landed on the wrong server. The fix took ten seconds in the AWS console—enable stickiness on the target group—but finding the root cause cost three hours of packet captures and frantic Googling.
The lesson: never assume your load balancer “just works” with WebSockets. Test with two instances behind the ALB before you go to production.
The Config Checklist for AWS ALB + Socket.io
Listener and Target Group Settings
Your ALB listener should be HTTPS (or HTTP for dev) on port 443 or 80. The target group must use protocol HTTP on port 3000 (or wherever your Node server runs). Set the target group protocol version to HTTP1—HTTP2 does not support the WebSocket upgrade mechanism.
Enable stickiness on the target group. Set the stickiness duration to 120 seconds minimum. Socket.io’s default ping interval is 25 seconds, so 120 seconds gives you a generous buffer. If your app has idle connections, push this to 300 seconds.
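If you would rather script these settings than click through the console, the stickiness attributes can be expressed as the input to the ELBv2 ModifyTargetGroupAttributes API. The sketch below only builds that input; the function name and the ARN are placeholders, and sending it (for example with the AWS SDK or CLI) is left out.

```javascript
// Build the input for the ELBv2 ModifyTargetGroupAttributes call.
// The allowed duration range for lb_cookie stickiness is 1 second to 7 days.
function stickinessAttributes(targetGroupArn, durationSeconds) {
  if (!Number.isInteger(durationSeconds) || durationSeconds < 1 || durationSeconds > 604800) {
    throw new RangeError("duration must be between 1 and 604800 seconds");
  }
  return {
    TargetGroupArn: targetGroupArn,
    Attributes: [
      { Key: "stickiness.enabled", Value: "true" },
      { Key: "stickiness.type", Value: "lb_cookie" },
      { Key: "stickiness.lb_cookie.duration_seconds", Value: String(durationSeconds) },
    ],
  };
}

// Placeholder ARN, for illustration only.
const input = stickinessAttributes("arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/NAME/ID", 120);
console.log(JSON.stringify(input, null, 2));
```

Keeping the values in code makes it harder for a rebuilt target group to lose stickiness without anyone noticing.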
Security Group and Health Checks
The ALB security group must allow inbound on port 443 from 0.0.0.0/0 (or your CloudFront IP range). The instance security group must allow inbound on your Node port from the ALB security group only—never from the public internet directly.
Health checks on the target group should point to a simple endpoint like /health that returns 200. Do not point health checks at the Socket.io endpoint. Health checks are plain HTTP requests with no upgrade headers: aimed at the bare socket path they get an error response and mark the target unhealthy, and aimed at it with handshake query parameters they open a session that nobody ever uses. A dedicated health route keeps your server logs clean and your auto-scaling group happy.
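A dedicated health route needs very little code. The handler below is a minimal sketch for a plain Node HTTP server; the handleHealth name is invented here, and the /health path simply matches the example in the text.

```javascript
// Answer the load balancer's health check and report whether the request was handled.
function handleHealth(req, res) {
  if (req.method === "GET" && req.url === "/health") {
    res.writeHead(200, { "Content-Type": "text/plain" });
    res.end("ok");
    return true;
  }
  return false;
}

// Possible wiring in a plain Node server (not executed in this sketch):
//   const server = require("http").createServer((req, res) => {
//     if (!handleHealth(req, res)) {
//       res.writeHead(404);
//       res.end();
//     }
//   });
```

Because the route touches nothing but the HTTP layer, a failing check tells you the process is down, not that a socket session misbehaved.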
Socket.io Server Configuration
On the server side, set the cors option to your frontend domain. If you need to vet incoming handshakes, supply an allowRequest function; behind an ALB it can inspect the x-forwarded-for header. Set pingTimeout to 60000 and pingInterval to 25000: a ping every 25 seconds keeps traffic flowing well inside the ALB’s default 60-second idle timeout.
Here’s a minimal config that works behind an AWS ALB:
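    const { createServer } = require("http");
    const { Server } = require("socket.io");

    const server = createServer();
    const io = new Server(server, {
      cors: {
        origin: process.env.CLIENT_URL, // your frontend origin
        credentials: true // allow cookies, including the ALB stickiness cookie
      },
      pingTimeout: 60000, // ms to wait for a pong before closing the connection
      pingInterval: 25000, // ms between pings; keeps the ALB connection from going idle
      transports: ['websocket', 'polling'] // transports this server accepts
    });

    server.listen(3000);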
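Note the transports list. On the server it only declares which transports are accepted, so the order has no effect there. The order matters on the client: a client created with transports: ['websocket', 'polling'] opens the WebSocket directly instead of starting with a polling request. Whether it then falls back to polling after a failed WebSocket attempt depends on the client version and options, so test that path. Polling works over plain HTTP and sidesteps the upgrade issue entirely, at the cost of higher latency.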
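The allowRequest option takes a function of the form (req, callback), where the callback receives an error and a boolean. The sketch below is one possible check, not a recommended policy: it assumes every request that came through the ALB carries x-forwarded-for, and makeAllowRequest is a name invented for this example.

```javascript
// Build an allowRequest hook that accepts only proxied requests from one origin.
function makeAllowRequest(allowedOrigin) {
  return (req, callback) => {
    const forwardedFor = req.headers["x-forwarded-for"];
    const origin = req.headers.origin;
    // Assumption: requests that bypassed the ALB have no X-Forwarded-For header.
    const cameThroughProxy = typeof forwardedFor === "string" && forwardedFor.length > 0;
    // Non-browser clients may omit Origin; reject only a present but wrong one.
    const originOk = origin === undefined || origin === allowedOrigin;
    callback(null, cameThroughProxy && originOk);
  };
}

// Possible wiring (not executed in this sketch):
//   new Server(server, { allowRequest: makeAllowRequest(process.env.CLIENT_URL) });
```

A header check like this is a sanity filter, not a security boundary; the security group rules remain the real barrier against direct access.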
When You Should Ditch the ALB
For indie devs running a single instance or a small cluster, an ALB adds complexity and cost (roughly $20/month for the ALB plus data processing fees). If your traffic is under 100 concurrent sockets, you can skip the load balancer entirely. Point a DNS A record directly at your EC2 instance, use Elastic IP for stability, and handle SSL with Certbot or Caddy.
You only need an ALB when you have multiple instances behind an auto-scaling group, or when you require SSL termination at the edge. If you’re at that scale, the ALB is the right tool—but you now know exactly where the handshake breaks and how to fix it.
One alternative worth considering: a Network Load Balancer. NLB works at the connection level and passes traffic through to your instances without inspecting HTTP headers. This means WebSocket upgrades work without any special configuration. The trade-offs are that you lose HTTP-level features such as cookie-based stickiness and path routing, and that you have to decide where TLS ends: on the NLB with a TLS listener, or on your instances with a reverse proxy like Nginx or Caddy in front of your Node server.
Forward-Looking: The CloudFront and API Gateway Path
AWS now supports WebSocket connections through CloudFront (since late 2018) and API Gateway WebSocket APIs. If you’re building a new real-time feature, consider whether you really need to manage WebSocket servers at all.
API Gateway WebSocket API handles the connection lifecycle, scaling, and authentication for you. You pay per message and per connection minute. For a leaderboard or chat app, this often works out cheaper than an EC2 instance plus ALB. The downsides: it speaks plain WebSocket rather than the Socket.io protocol, so you would use a plain WebSocket client instead of the Socket.io client; you lose control over the transport layer; and debugging is harder because you can’t SSH into the server to inspect socket state.
CloudFront with WebSocket support is a better fit if you already use CloudFront for your static assets and API. It terminates SSL at the edge, forwards WebSocket traffic to your origin, and handles DDoS protection. The ALB can sit behind CloudFront as the origin, giving you edge caching for static content and WebSocket passthrough for real-time traffic.
Your next step should be to write a simple integration test that connects two Socket.io clients through your ALB. If the handshake succeeds, the test passes. Run it in your CI pipeline before every deploy. That single test would have saved me three hours of debugging at 2 AM, and it will save you the same.
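The core of such a test is a helper that turns “did this client connect?” into a promise with a deadline. The sketch below assumes a Socket.io client socket, which emits connect and connect_error events; waitForConnect and the timeout value are names and numbers chosen for this example.

```javascript
// Resolve when the socket connects, reject on a connection error or timeout.
function waitForConnect(socket, timeoutMs) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      reject(new Error(`no connect within ${timeoutMs} ms`));
    }, timeoutMs);
    socket.once("connect", () => {
      clearTimeout(timer);
      resolve(socket);
    });
    socket.once("connect_error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
  });
}

// Possible use in a test, with ALB_URL as a placeholder for your load balancer:
//   const { io } = require("socket.io-client");
//   await Promise.all([
//     waitForConnect(io(ALB_URL), 5000),
//     waitForConnect(io(ALB_URL), 5000),
//   ]);
```

Connecting two clients at once matters: a single client can succeed by luck, while two concurrent handshakes against two instances expose missing stickiness quickly.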