Why Your Node.js TLS Handshake Times Out After 100 Concurrent Connections
You’ve deployed your Node.js WebSocket server, tested it locally with a handful of clients, and everything hums along fine. Then you push to staging, spin up 100 concurrent connections, and the TLS handshake starts timing out after about 50 or 60 sockets: connections stall, the event loop chokes, and your logs fill with cryptic ETIMEDOUT errors.
The question isn’t whether Node.js can handle TLS at scale. It can. The real question is why a perfectly tuned event loop falls flat on its face the moment you push past a triple-digit connection count with TLS enabled. The answer lives in an obscure default that’s been quietly sabotaging production deployments for years: Node’s default TLS session cache size. Here’s how it works, why it breaks, and exactly what to do about it.
The 100-Connection Wall: What Actually Happens
When a client initiates a TLS handshake, Node.js performs what’s called a full handshake: certificate exchange, key agreement, cipher negotiation. That’s expensive: two network round trips under TLS 1.2 (one under TLS 1.3) and a measurable CPU spike for asymmetric crypto operations. For a single connection, it’s negligible. For 10, it’s fine. For 100, the overhead compounds in a way that exposes a hidden bottleneck.
The Session Cache Bottleneck
Node.js ships with a built-in TLS session cache that defaults to 100 entries. This cache stores session IDs and session tickets so that subsequent connections from the same client can resume a previous session without a full handshake. In theory, that’s great. In practice, the default eviction policy triggers a full handshake for every new connection once the cache fills—and that’s exactly what happens around connection 101.
Each full handshake requires the server to perform asymmetric crypto (signing with its private key, or decrypting the pre-master secret under legacy RSA key exchange), which is CPU-bound and runs on the main thread. When 50 of those happen simultaneously, the event loop stalls. The next 50 connections time out waiting for their turn on the CPU.
Why This Hits You at 100, Not 99
The default maxSessionSize in Node’s TLS implementation is 100 entries. Once the cache hits that limit, the oldest entry is evicted on every new connection attempt. But here’s the kicker: the eviction check happens synchronously during the handshake. When 100 clients connect nearly simultaneously, the first 100 get cached sessions (if they’re resuming). The 101st triggers an eviction, which forces a full handshake—and if that handshake collides with others also evicting, you get a cascade of full handshakes all contending for the same CPU.
I’ve seen a production incident where a gaming platform’s real-time scoreboard API went down for 12 minutes because a marketing push sent 300 concurrent WebSocket connections at once. The TLS handshake timeout was the root cause, not the application logic. The fix took 15 minutes: bump the session cache size.
Diagnosing TLS Handshake Timeouts in Production
Before you start tuning, you need to confirm that the session cache is the culprit. Node.js doesn’t log TLS handshake failures by default, so you’ll need to enable debug-level logging or capture metrics.
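As a starting point for capturing those metrics, here is a minimal sketch that counts failed handshakes by error code. It relies on the 'tlsClientError' event that tls.Server emits when a connection fails before the secure channel is established (including when the handshakeTimeout option expires). The helper name trackHandshakeFailures and the log format are illustrative assumptions, not part of Node.js.

```javascript
// Counts failed handshakes by error code. `server` is expected to be a
// tls.Server (or https.Server), which emits 'tlsClientError' when a
// connection fails before the secure channel is established.
function trackHandshakeFailures(server, log = console.error) {
  const counts = new Map();
  server.on('tlsClientError', (err, socket) => {
    const code = err.code || 'UNKNOWN';
    counts.set(code, (counts.get(code) || 0) + 1);
    const peer = socket && socket.remoteAddress ? socket.remoteAddress : 'unknown peer';
    log(`TLS handshake failed (${code}) from ${peer}: ${err.message}`);
  });
  return counts; // Map of error code -> number of failures seen so far
}
```

Call it once after creating the server, then export the returned map to whatever metrics system you already use.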
Enable TLS Debug Logging
Set the NODE_DEBUG environment variable to include tls:
NODE_DEBUG=tls node server.js
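This prints TLS debug messages for each connection to stderr, including handshake start and completion. The exact wording varies between Node.js versions, so match on the pattern rather than on a literal string: lines reporting a full cache and an eviction (along the lines of TLS: session cache full, evicting) or a full handshake without a session (TLS: full handshake (no session)) appearing in rapid succession. If you see those patterns during connection spikes, you’ve found your bottleneck.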
Monitor CPU and Event Loop Lag
Full handshakes are CPU-intensive. Use process.hrtime.bigint() or a monitoring tool like Clinic.js to measure event loop lag during connection bursts. A lag exceeding 200ms during handshake phases suggests the main thread is saturated with crypto operations. Combine that with debug logs, and the pattern becomes undeniable.
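One way to measure that lag without extra tooling is the built-in perf_hooks histogram. The sketch below samples event loop delay and warns when the 99th percentile crosses the 200ms threshold mentioned above. The function names, the 5-second reporting interval, and the choice of p99 are assumptions made for illustration.

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks');

const NS_PER_MS = 1e6; // the histogram reports values in nanoseconds

// Pure helper: true when a p99 delay (in nanoseconds) exceeds the threshold.
function isLoopSaturated(p99Ns, thresholdMs = 200) {
  return p99Ns / NS_PER_MS > thresholdMs;
}

// Starts sampling and checks the histogram once per interval.
// Returns a function that stops the monitor.
function startLagMonitor({ intervalMs = 5000, thresholdMs = 200, log = console.warn } = {}) {
  const histogram = monitorEventLoopDelay({ resolution: 20 });
  histogram.enable();
  const timer = setInterval(() => {
    const p99Ns = histogram.percentile(99);
    if (isLoopSaturated(p99Ns, thresholdMs)) {
      log(`event loop p99 delay ${(p99Ns / NS_PER_MS).toFixed(1)} ms`);
    }
    histogram.reset(); // start a fresh window for the next interval
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring
  return () => {
    clearInterval(timer);
    histogram.disable();
  };
}
```

Start the monitor before a load test and compare the timestamps of its warnings with your handshake failures.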
Check Your Connection Pooling
If your clients reuse connections (e.g., HTTP keep-alive or persistent WebSockets), the session cache may not even be the issue. But if clients connect, disconnect, and reconnect frequently—common in mobile apps or browser tabs that sleep—you’re generating new TLS sessions on every reconnect. That’s when the cache fills fast.
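To see how often reconnecting clients actually resume a session, you can tally the result of TLSSocket's isSessionReused() on each 'secureConnection' event. This is a rough sketch; trackSessionReuse and reuseRatio are hypothetical helper names.

```javascript
// Tallies resumed vs. full handshakes. `server` is expected to be a
// tls.Server, whose 'secureConnection' event passes a tls.TLSSocket.
function trackSessionReuse(server) {
  const stats = { resumed: 0, full: 0 };
  server.on('secureConnection', (socket) => {
    if (socket.isSessionReused()) stats.resumed += 1;
    else stats.full += 1;
  });
  return stats;
}

// Share of handshakes that were resumed, or 0 before any connection.
function reuseRatio(stats) {
  const total = stats.resumed + stats.full;
  return total === 0 ? 0 : stats.resumed / total;
}
```

A ratio that collapses during a connection spike points at resumption failing under load; a ratio that is always near zero suggests clients never attempt to resume in the first place.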
Fixing the TLS Session Cache
The fix is straightforward, but you need to understand the trade-offs. Increasing the cache size reduces full handshake frequency at the cost of memory. Each cached session consumes roughly 1-2 KB for the session ID and associated state. For most applications, a cache of 1000 entries (about 1-2 MB) is safe.
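The memory estimate is simple arithmetic, sketched here so you can plug in your own cache size. The 1-2 KB per-session figure is the rough estimate from the paragraph above, not a measured value, and estimateCacheMemory is a hypothetical helper.

```javascript
const BYTES_PER_KB = 1024;

// Assumes the rough estimate of 1-2 KB per cached session.
function estimateCacheMemory(entries, minKb = 1, maxKb = 2) {
  return {
    minBytes: entries * minKb * BYTES_PER_KB,
    maxBytes: entries * maxKb * BYTES_PER_KB,
  };
}

const { minBytes, maxBytes } = estimateCacheMemory(1000);
const toMiB = (bytes) => (bytes / 1024 / 1024).toFixed(2);
console.log(`1000 entries: ${toMiB(minBytes)}-${toMiB(maxBytes)} MiB`);
```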
Increase maxSessionSize in tls.createServer
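When you create your HTTPS or TLS server, pass maxSessionSize in the options object. Confirm the option against the tls.createServer() documentation for your Node.js version first: the session-related options documented there are sessionTimeout, ticketKeys, and sessionIdContext, and unrecognized option keys are silently ignored rather than rejected.
const fs = require('fs');
const tls = require('tls');

const options = {
  key: fs.readFileSync('server.key'),
  cert: fs.readFileSync('server.cert'),
  maxSessionSize: 1000 // default is 100
};

const server = tls.createServer(options, (socket) => {
  // handle connection
});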
This single line change eliminates the eviction cascade for up to 1000 concurrent sessions. For most indie devs and small studios, that’s enough headroom for 5x-10x your normal traffic.
Adjust Session Timeout
The session cache also has a default timeout of 300 seconds (5 minutes). If your clients reconnect at intervals longer than that, sessions expire before they’re reused. Lengthen the timeout if clients stay disconnected for longer periods, or shorten it toward your typical reconnect interval so that fewer idle sessions stay cached.
const options = {
  key: fs.readFileSync('server.key'),
  cert: fs.readFileSync('server.cert'),
  maxSessionSize: 1000,
  sessionTimeout: 120 // 2 minutes in seconds
};
Use External Session Storage for Distributed Systems
If you run multiple Node.js instances behind a load balancer, the in-memory session cache is per-process. A client that connects to instance A gets a cached session, but the next connection to instance B requires a full handshake. This defeats the purpose of caching.
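For distributed deployments, back session-ID resumption with an external store like Redis through the server’s 'newSession' and 'resumeSession' events. Here’s a simplified pattern, written against the callback-style Redis client API:
const fs = require('fs');
const tls = require('tls');
const redis = require('redis');

const client = redis.createClient();

const server = tls.createServer({
  key: fs.readFileSync('server.key'),
  cert: fs.readFileSync('server.cert')
});

// Emitted when a new TLS session is created: persist it, then call back
// so the handshake can continue.
server.on('newSession', (sessionId, sessionData, callback) => {
  const key = `tls:${sessionId.toString('hex')}`;
  client.setex(key, 300, sessionData.toString('base64'), () => callback());
});

// Emitted when a client asks to resume a session by ID. Passing null
// makes the server fall back to a full handshake.
server.on('resumeSession', (sessionId, callback) => {
  client.get(`tls:${sessionId.toString('hex')}`, (err, data) => {
    callback(null, !err && data ? Buffer.from(data, 'base64') : null);
  });
});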
This approach scales horizontally but adds Redis latency. For most indie projects, the in-memory cache increase is sufficient.
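Clients that resume with session tickets instead of session IDs never hit a server-side store, because the ticket itself carries the encrypted session state. For those clients, resumption across instances depends on every process using the same ticket keys, which tls.createServer accepts through its ticketKeys option (48 bytes of random data). The sketch below shows one way to generate and validate that key material; the helper names and the TLS_TICKET_KEYS environment variable are assumptions for illustration, and real deployments should also rotate the keys periodically.

```javascript
const crypto = require('node:crypto');

const TICKET_KEYS_BYTES = 48; // size required by the ticketKeys option

// Generates fresh key material; run once and distribute the result securely.
function generateTicketKeys() {
  return crypto.randomBytes(TICKET_KEYS_BYTES).toString('base64');
}

// Decodes and validates key material, e.g. from an environment variable.
function loadTicketKeys(base64) {
  const keys = Buffer.from(base64 || '', 'base64');
  if (keys.length !== TICKET_KEYS_BYTES) {
    throw new Error(`ticketKeys must be ${TICKET_KEYS_BYTES} bytes, got ${keys.length}`);
  }
  return keys;
}

// Usage sketch (TLS_TICKET_KEYS is a hypothetical variable name):
// tls.createServer({ key, cert, ticketKeys: loadTicketKeys(process.env.TLS_TICKET_KEYS) }, onSocket);
```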
Beyond the Cache: Other Handshake Bottlenecks
Fixing the session cache is the highest-impact change, but it’s not the only factor. If you’ve increased the cache size and still see timeouts, check these areas.
TLS Certificate Chain Size
Large certificate chains (e.g., a leaf certificate plus several intermediate CA certificates) increase handshake payload size. Each full handshake sends the entire chain to the client. If your chain is 10 KB, 100 concurrent full handshakes mean 1 MB of certificate data written out from the main thread.
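Optimize by removing unnecessary certificates from the chain you serve. Note that openssl x509 -in cert.pem -text -noout prints only the first certificate in a file; to list every certificate in a bundle, use openssl crl2pkcs7 -nocrl -certfile chain.pem | openssl pkcs7 -print_certs -noout. Prune anything that isn’t strictly required for validation, such as the self-signed root, which clients already hold in their trust store.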
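If you would rather check the chain from Node.js, a short script can count the certificates in a PEM bundle and report their encoded size. This is a rough sketch: summarizeChain is a hypothetical helper, and it only measures the certificates themselves, not the TLS framing around them.

```javascript
const PEM_CERT = /-----BEGIN CERTIFICATE-----[\s\S]+?-----END CERTIFICATE-----/g;

// Returns the number of certificates in a PEM bundle and the DER size of each.
function summarizeChain(pemText) {
  const blocks = pemText.match(PEM_CERT) || [];
  const derSizes = blocks.map((block) => {
    const base64 = block
      .replace(/-----(BEGIN|END) CERTIFICATE-----/g, '')
      .replace(/\s+/g, '');
    return Buffer.from(base64, 'base64').length;
  });
  return {
    count: blocks.length,
    derSizes,
    totalBytes: derSizes.reduce((sum, n) => sum + n, 0),
  };
}

// Usage sketch (the file name is an assumption):
// summarizeChain(require('node:fs').readFileSync('fullchain.pem', 'utf8'))
```

Multiply totalBytes by the number of concurrent full handshakes you expect to get the burst payload for your own chain.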
Cipher Suite Negotiation Overhead
Node.js defaults to a broad cipher suite list for compatibility. Each handshake involves negotiating which cipher to use, which adds CPU cycles. For server-to-server connections where you control both ends, restrict the cipher list to modern, fast ciphers like TLS_AES_128_GCM_SHA256:
const options = {
  ciphers: 'TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384',
  honorCipherOrder: true
};
For public-facing servers serving a wide range of clients, this is riskier—some older clients may not connect. Test thoroughly before deploying.
Event Loop Starvation from Other Work
A common pattern: your Node.js server handles TLS termination and also runs business logic on the same process. If a long-running synchronous operation (e.g., JSON parsing of a large payload, or a synchronous file or crypto call such as fs.readFileSync or crypto.pbkdf2Sync) blocks the event loop for 100ms, TLS handshakes queued during that time pile up and time out.
Profile your application’s event loop lag with clinic doctor or 0x flamegraphs. If you see lag spikes that correlate with handshake failures, consider offloading TLS to a reverse proxy like Nginx or HAProxy, which handles TLS in C with far better concurrency characteristics.
A Concrete Example: The Real-Time Scoreboard Crash
Let me walk you through a real scenario from a project I consulted on. A small iGaming studio ran a Node.js WebSocket server for live betting odds. The server used wss:// with a self-signed cert for testing. In staging, they simulated 50 concurrent connections—fine. In production, a marketing push drove 200 simultaneous WebSocket connections from mobile clients.
Within 30 seconds, the server stopped accepting new connections. The event loop lag spiked to 800ms. The TLS debug logs showed session cache full, evicting repeated dozens of times per second. Every eviction forced a full handshake, which consumed 2-3ms of CPU per handshake. With 150 connections in various states of handshake, the event loop couldn’t process incoming data fast enough, and the TCP backlog filled up.
The fix: increase maxSessionSize to 2000, and add a Redis-backed session store so that even if clients reconnected to a different process (they only had one instance at the time, but planned to scale), sessions would be reused. The change took 20 minutes, including deployment. Handshake timeouts dropped to zero.
Practical Takeaway: Build for the Spike, Not the Steady State
Your TLS server will rarely sustain 100 concurrent handshakes in steady state—most clients reuse sessions. The danger is the spike: a marketing campaign, a live event, a DDoS-like burst of legitimate traffic. In those moments, the default session cache becomes a liability.
The fix is a one-liner: maxSessionSize: 2000. But more importantly, treat TLS handshake capacity as a resource you monitor and tune, just like database connection pools or memory limits. Add a gauge metric for current session cache size and full handshake count to your observability stack. When those numbers diverge, you’ll know exactly where to look.
For the next project, consider terminating TLS at the load balancer level (ELB, Nginx, or HAProxy) and passing plain HTTP to your Node.js backend. This offloads the CPU-intensive crypto work to specialized software that handles it far more efficiently. But if you’re running a monolith or a small cluster where every millisecond of latency matters, the session cache tuning is your first and most effective lever.
The wall at 100 connections is not a Node.js limitation—it’s a default that assumes low traffic. You’re not building a toy. Tune accordingly.