Why Your Node.js CPU Spikes from One Misconfigured Process.pid File


There’s a pattern in Node.js deployments that looks innocent enough on day one: write a process ID to a .pid file so your process manager can track it. It is a reliable, time-tested pattern—unless you write the file in the wrong place, at the wrong time, or with the wrong permissions. I’ve seen a single misconfigured process.pid file turn a stable Node.js API server into a CPU-spinning, event-loop-blocking nightmare that took a 12-person engineering team three days to debug.

The kicker? The team wasn’t even using the file for process management. They’d inherited an old deployment script that wrote a PID file to /var/run/app.pid on startup, and the file itself had become a silent, exploding landmine.

The Anatomy of a PID File Gone Wrong

A .pid file is just a text file containing the numeric process ID of a running application. On Linux, init scripts and process managers use them to find the right process when sending signals (stop, restart, reload): systemd reads one when a unit sets PIDFile=, and pm2 writes one per managed app. The pattern is so standard that Node.js developers often copy it from boilerplate without thinking about what happens after the file is created.
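To make the format concrete, here is a minimal sketch of how a consumer might parse a PID file defensively. The helper name parsePidFile is hypothetical and not part of any library; the point is only that an empty or garbled file should never turn into a PID you later send signals to.

```javascript
// Hypothetical helper: turn the raw contents of a PID file into a PID,
// or null when the contents are empty or malformed.
function parsePidFile(contents) {
  const text = String(contents).trim();
  // A valid PID is a positive decimal integer with nothing else around it.
  if (!/^[0-9]+$/.test(text)) return null;
  const pid = Number(text);
  return Number.isSafeInteger(pid) && pid > 0 ? pid : null;
}

console.log(parsePidFile('4242\n')); // 4242
console.log(parsePidFile(''));       // null
```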

The problem begins when the write interacts poorly with Node’s single-threaded event loop. Node.js offers asynchronous file APIs, but many PID-file implementations reach for fs.writeFileSync, which blocks the event loop until the kernel returns. On a healthy local disk that costs a fraction of a millisecond during startup and is harmless. The trouble starts when the PID file path is misconfigured. A missing directory or wrong permissions make the call throw (ENOENT, EACCES), which kills startup if nobody catches it. A path on a network filesystem is worse: the synchronous write can hang for as long as the kernel keeps retrying the mount.
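On a local filesystem, a bad path normally surfaces as a thrown error, so the first line of defense is simply to catch it. The sketch below assumes a hypothetical helper named tryWritePidFile and uses a temp-directory path purely for demonstration; it does nothing about the hanging network-mount case.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: attempt the PID file write and report the failure
// code instead of letting an exception escape during startup.
function tryWritePidFile(pidPath) {
  try {
    fs.writeFileSync(pidPath, `${process.pid}\n`, { mode: 0o644 });
    return { ok: true };
  } catch (err) {
    // Typical codes: ENOENT (missing directory), EACCES (permissions),
    // EROFS (read-only filesystem).
    return { ok: false, code: err.code };
  }
}

// Demo against a directory that does not exist.
const missing = path.join(os.tmpdir(), `no-such-dir-${process.pid}`, 'app.pid');
console.log(tryWritePidFile(missing)); // { ok: false, code: 'ENOENT' }
```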

I once watched a production server hit 100 percent CPU because startup tried to create /var/run/app.pid after /var/run had been remounted as a read-only tmpfs during a security update. No error ever reached our logs. The process just spun, retrying the write and consuming an entire CPU core while the rest of the server became unresponsive. A write to a read-only filesystem normally fails fast with EROFS, so the retrying most likely lived in the code wrapped around fs.writeFileSync rather than in the call itself.

How a Stale PID File Corrupts Your Event Loop

The Node.js event loop has six phases: timers, pending callbacks, idle/prepare, poll, check, and close callbacks. The callback of an asynchronous PID file write runs in the poll phase, alongside other I/O callbacks. A synchronous write belongs to no phase at all: fs.writeFileSync stops the loop wherever it happens to be until the call returns. No timers fire. No incoming HTTP requests get parsed. No database queries complete.
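The effect is easy to reproduce without a broken mount. The sketch below uses a busy-wait as a stand-in for a slow synchronous call (an assumption made so the demo is repeatable; a real hung write would sleep in the kernel rather than spin) and measures how late a 0 ms timer fires. The function names are hypothetical.

```javascript
// Busy-wait stand-in for a slow synchronous call such as fs.writeFileSync
// on an unresponsive mount.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* spin */ }
}

// Resolves with how late a 0 ms timer fired while the loop was blocked.
function timerDelayWhileBlocked(blockMs) {
  return new Promise((resolve) => {
    const scheduled = Date.now();
    setTimeout(() => resolve(Date.now() - scheduled), 0);
    blockFor(blockMs); // nothing else can run until this returns
  });
}

timerDelayWhileBlocked(200).then((delay) => {
  // The delay is at least the 200 ms spent blocking.
  console.log(`0 ms timer fired after ${delay} ms`);
});
```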

This is where the spike gets its fuel. Node.js, stuck in a synchronous write, can’t drain its event queue. The kernel keeps accepting incoming connections into the listen backlog until that fills up. The process manager, seeing the server unresponsive, sends SIGTERM. But JavaScript signal handlers run on the event loop, and the PID file write hasn’t returned yet, so the handler never runs. The process hangs on, ignoring polite signals and doing zero useful work, while the restart and retry machinery around it burns CPU.

I’ve debugged this exact scenario on a platform handling real-time game state synchronization. The server would start, write the PID file, and then sit at 99 percent CPU for exactly sixty seconds before the OS finally killed it. The root cause? A network-attached storage mount that had dropped offline. The PID file write was trying to reach a path on that mount, and the kernel’s NFS retry logic kept the write call alive far longer than any Node.js timeout.

The Permission Mask That Kills Performance

The Umask Interaction Nobody Talks About

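When you write a file in Node.js, the kernel applies the process’s umask to the permissions you request. Most deployment scripts use fs.writeFileSync(path, String(process.pid), { mode: 0o644 }). That looks correct. But the mode only applies when the file is created, and the umask can strip bits from it: under a umask of 027, the requested 0o644 becomes 0o640, and a process manager running as a different user can no longer read the file. And if the parent directory has a sticky bit or restrictive ACLs (common in shared hosting or containerized environments), the kernel may need to perform additional metadata lookups before completing the write.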

Those lookups are synchronous and blocking. In a container orchestration environment where thousands of PID files are created and destroyed per minute, these permission checks can stack. I’ve seen a Kubernetes pod spend 200 milliseconds on a single fs.writeFileSync call because the underlying volume was an NFS export with complex ACL rules. Two hundred milliseconds of blocked event loop might not crash a server, but it introduces jitter that wrecks latency-sensitive WebSocket connections.
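The umask half of this is easy to check for yourself. The following sketch, which assumes a POSIX system and uses a hypothetical helper named effectiveMode, creates a file under a chosen umask and reports the permission bits that actually landed on disk. It says nothing about ACL lookup cost, which depends on the filesystem.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Create a file with the requested mode under a given umask and return
// the permission bits the kernel actually applied. POSIX only.
function effectiveMode(requestedMode, umask) {
  const file = path.join(os.tmpdir(), `umask-demo-${process.pid}.pid`);
  const previous = process.umask(umask); // returns the old mask
  try {
    fs.rmSync(file, { force: true }); // mode only applies on creation
    fs.writeFileSync(file, `${process.pid}\n`, { mode: requestedMode });
    return fs.statSync(file).mode & 0o777;
  } finally {
    process.umask(previous); // restore the original mask
    fs.rmSync(file, { force: true });
  }
}

console.log(effectiveMode(0o644, 0o027).toString(8)); // 640
```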

The Race Condition That Spins Your CPU

The worst PID file bug I’ve encountered involved a race condition between two Node.js processes launched by the same process manager. Both processes tried to write to /var/run/app.pid. The first process wrote its PID and held the file descriptor. The second process opened the same file for writing, which truncated the existing content. The first process, still holding the old file descriptor, now had a file descriptor pointing to a zero-length file.

When the process manager later tried to read /var/run/app.pid to send a signal, it got the second process’s PID. It sent SIGTERM to the wrong process. The first process, now orphaned, entered a loop: it tried to read its own PID file to confirm it was still the leader, found an empty file, assumed it had been killed, and then tried to restart itself by spawning a child process. That child process immediately tried to write the PID file, triggering the same race condition. The CPU spiked as both processes fought over the file, each restart consuming CPU cycles for spawning and cleanup.
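One common way to close this kind of race is to create the PID file exclusively, so the second writer fails instead of truncating the first writer’s file. A minimal sketch, assuming a hypothetical helper named claimPidFile and a local filesystem:

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Returns true if this process created the PID file, false if another
// process already holds it. The 'wx' flag maps to O_CREAT | O_EXCL, so
// the existence check and the creation happen as one atomic step.
function claimPidFile(pidPath) {
  try {
    fs.writeFileSync(pidPath, `${process.pid}\n`, { flag: 'wx', mode: 0o644 });
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false;
    throw err; // anything else is a real configuration problem
  }
}

const demoPath = path.join(os.tmpdir(), `claim-demo-${process.pid}.pid`);
fs.rmSync(demoPath, { force: true });
console.log(claimPidFile(demoPath)); // true: first claimant wins
console.log(claimPidFile(demoPath)); // false: file already exists
fs.unlinkSync(demoPath);
```

The losing process gets a clear answer and can exit or back off, rather than overwriting a file it does not own.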

The Network Filesystem Trap

When Your PID File Lives on NFS or FUSE

Cloud-native deployments often mount persistent volumes for logs, uploads, or database files. It is tempting to put the PID file in the same volume for simplicity. This is a catastrophic mistake. Network filesystems like NFS, EFS, or FUSE-based mounts have variable latency that can spike unpredictably. A PID file write that takes one millisecond locally can take five seconds over NFS during a network blip.

During those five seconds, your Node.js event loop is blocked. Incoming requests pile up. The process manager can’t communicate with the server. The health check endpoint, which is supposed to return in under 100 milliseconds, times out. The orchestrator marks the pod as unhealthy and kills it. But the NFS write hasn’t returned yet, so Node.js doesn’t process the SIGTERM. The orchestrator escalates to SIGKILL after 30 seconds. By then, the CPU has been pegged for half a minute, and the logs show nothing because the event loop never got a chance to flush.

I’ve seen this pattern take down an entire game lobby server during a tournament. The PID file was on an EFS volume shared across 20 instances. When AWS performed a routine storage maintenance window, the EFS latency jumped to 10 seconds. Every instance’s PID file write blocked the event loop, and the lobby became unresponsive for seven minutes before the auto-scaler finally terminated all instances.

The Hidden Cost of fs.realpathSync

Some PID file implementations try to be smart by resolving symlinks before writing. They call fs.realpathSync on the target directory to ensure the path is canonical. This is another synchronous call that can trigger filesystem metadata operations. On a network mount, realpathSync may need to contact the metadata server, which can fail silently or hang.

The stall comes from retries below Node.js, in the kernel’s filesystem client. When realpathSync fails cleanly, it throws and you can handle the error. The dangerous case is the lookup that neither succeeds nor fails: the synchronous call sits inside the kernel while the metadata request is retried, and the event loop waits with it. The process doesn’t crash; it just stalls, and whatever CPU the retries and lock contention consume is spent waiting for a result that may never come.

The Silent Crash That Isn’t Silent

Why process.on('exit') Makes It Worse

A common pattern is to clean up the PID file on process exit:

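const fs = require('node:fs');

process.on('exit', () => {
  // Best-effort cleanup: the file may already be gone.
  try { fs.unlinkSync('/var/run/app.pid'); } catch {}
});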

This looks responsible. In practice, it creates a feedback loop. The 'exit' handler only runs on an orderly shutdown. If the process is stuck in a blocking PID file write and gets SIGKILLed, or is terminated by a signal it has no handler for, unlinkSync never runs. On the next startup, the stale PID file is still there. The new process reads it, assumes the old process is still running, and refuses to start. The deployment script panics and tries to kill the old PID, which no longer exists. The process manager restarts the new process, which again sees the stale file, and the cycle repeats.

Each restart spawns a new Node.js process that immediately enters a blocking PID file check. The CPU spikes as the OS context-switches between these short-lived processes, each one spinning on a synchronous file operation that can never succeed. I’ve seen this cascade cause a 40-instance deployment to consume 800 percent CPU in under two minutes.
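The restart loop depends on the new process trusting a stale file. A cheap liveness probe breaks that assumption: sending signal 0 asks the kernel whether the PID exists without delivering anything. The helper name isProcessAlive is hypothetical, and a live PID is not proof that the process is yours, since PIDs get reused.

```javascript
// Signal 0 performs the kernel's existence and permission checks
// without delivering a signal.
function isProcessAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    // ESRCH: no such process, so a PID file naming it is stale.
    return err.code === 'EPERM';
  }
}

console.log(isProcessAlive(process.pid)); // true
```

A startup script could use this to delete a PID file whose process is gone instead of refusing to start.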

The Process Manager Blind Spot

Most tooling that consumes PID files, from init scripts to process managers such as PM2 and forever, assumes the file is reliable. Few tools verify that the PID in the file actually belongs to a process with the expected command name. When the PID file contains a stale or incorrect PID, the process manager may send signals to an unrelated process, potentially crashing other services on the same machine.

The Node.js process that wrote the stale PID file is now in an undefined state. It can’t receive signals because the process manager thinks the PID belongs to a different service. The only way to kill it is a manual SIGKILL. Meanwhile, the CPU is pegged because the process is still trying to write or read the PID file in a tight loop, blocked by filesystem contention.
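If you control the script that sends the signal, you can add the verification yourself. The sketch below is Linux-only and reads the target’s command line from procfs before anything is signalled. Both helper names and the matching policy (the argv must mention an expected script name) are assumptions for illustration.

```javascript
const fs = require('node:fs');

// Linux-only sketch: read the argv of a PID from procfs. Returns null
// when the process is gone or /proc is unavailable.
function readCommandLine(pid) {
  try {
    const raw = fs.readFileSync(`/proc/${pid}/cmdline`, 'utf8');
    // Arguments are NUL-separated; empty entries are dropped here.
    return raw.split('\0').filter(Boolean);
  } catch {
    return null;
  }
}

// Hypothetical policy: only treat the PID as ours if its argv mentions
// the script we expect.
function pidMatchesScript(pid, scriptName) {
  const argv = readCommandLine(pid);
  return argv !== null && argv.some((arg) => arg.endsWith(scriptName));
}
```

A caller would check pidMatchesScript(pid, 'server.js') (with whatever script name applies) before calling process.kill, and treat a mismatch as a stale file.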

The Fix: PID-Less Node.js

Use File Descriptor Passing Instead

The modern approach to process tracking in Node.js avoids PID files entirely. Use a Unix domain socket or a file descriptor passed from the parent process. Systemd supports Type=notify and socket activation, which lets the process signal readiness without writing a file. PM2 can track processes by their in-memory identifier, not by a filesystem PID file.
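As a rough illustration of the socket-activation side, systemd hands inherited sockets to the service starting at file descriptor 3 and describes them through the LISTEN_FDS and LISTEN_PID environment variables. The sketch below picks the inherited socket when one is present and falls back to a port otherwise. The function names and the fallback port 3000 are assumptions, and it only handles the single-socket case.

```javascript
const http = require('node:http');

// systemd's socket-activation protocol: inherited sockets start at fd 3,
// LISTEN_FDS holds how many were passed, and LISTEN_PID names the
// process they are intended for.
const SD_LISTEN_FDS_START = 3;

// Choose what to listen on: the inherited socket if there is one,
// otherwise a fallback port (3000 is an arbitrary choice).
function listenTarget(env = process.env, pid = process.pid, fallbackPort = 3000) {
  const count = Number.parseInt(env.LISTEN_FDS ?? '', 10);
  const intendedPid = Number.parseInt(env.LISTEN_PID ?? '', 10);
  if (count >= 1 && intendedPid === pid) {
    return { fd: SD_LISTEN_FDS_START };
  }
  return { port: fallbackPort };
}

// Call this from your entry point; it is not invoked here.
function startServer() {
  const server = http.createServer((req, res) => res.end('ok\n'));
  server.listen(listenTarget());
  return server;
}
```

Because systemd owns the listening socket and the service’s lifecycle, nothing in this setup needs to write or read a PID file.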

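If you must use a PID file, write it asynchronously with fs.promises.writeFile and handle errors explicitly. Never use writeFileSync in production. Bound the write with a timeout by passing an AbortController signal, so your startup code can give up and move on if the filesystem is slow; note that aborting stops Node’s own buffering of the write, not an operating-system request already in flight. And never write the PID file to a network mount. Use a local tmpfs such as /run, /tmp, or a dedicated ramdisk.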
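Put together, that advice might look like the sketch below. The helper name writePidFile, the one-second default, and the temp-directory demo path are all assumptions. The timeout is best-effort: it lets the caller move on, but it cannot cancel a write the kernel has already started.

```javascript
const fsp = require('node:fs/promises');
const os = require('node:os');
const path = require('node:path');

// Write the PID file without blocking the event loop, giving up after
// timeoutMs. Resolves to true on success and false on any failure.
async function writePidFile(pidPath, timeoutMs = 1000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    await fsp.writeFile(pidPath, `${process.pid}\n`, {
      mode: 0o644,
      signal: controller.signal,
    });
    return true;
  } catch (err) {
    // An abort means the timeout fired; codes such as ENOENT, EACCES,
    // or EROFS point at a misconfigured path.
    console.error(`PID file not written: ${err.code ?? err.name}`);
    return false;
  } finally {
    clearTimeout(timer);
  }
}

const demoPath = path.join(os.tmpdir(), `app-${process.pid}.pid`);
writePidFile(demoPath)
  .then((ok) => console.log(ok ? 'PID file written' : 'continuing without a PID file'))
  .finally(() => fsp.rm(demoPath, { force: true }));
```

Treating a failed write as a warning rather than a fatal error keeps a broken PID path from taking the server down with it.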

The One-Second Health Check Rule

The simplest practical fix is to make your health check endpoint independent of the PID file. The health check should return a 200 status based on the event loop’s responsiveness, not on whether a file exists on disk. Measure the time between setImmediate calls. If the lag exceeds one second, return a 503. This catches PID-file-induced blocking before the CPU spike becomes critical.

I’ve used this technique in production for the past two years. The health check endpoint calls setImmediate in a loop and measures the callback latency. If the latency rises above 100 milliseconds, the monitoring system gets an early warning; past the one-second mark, the orchestrator gets a 503. The PID file becomes irrelevant for health monitoring, and the CPU spike from a misconfigured write is caught before it takes down the entire server.
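Here is a minimal sketch of that kind of lag probe. It swaps the tight setImmediate loop for a periodic timer and measures how late each tick fires, so the probe itself stays cheap. The names createLagMonitor and healthStatus are hypothetical, and the thresholds mirror the ones described above.

```javascript
const { performance } = require('node:perf_hooks');

// Track how late a periodic timer fires; lateness approximates how long
// the event loop was blocked by synchronous work.
function createLagMonitor(intervalMs = 100) {
  let lagMs = 0;
  let expected = performance.now() + intervalMs;
  const timer = setInterval(() => {
    const now = performance.now();
    lagMs = Math.max(0, now - expected);
    expected = now + intervalMs;
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring
  return { lagMs: () => lagMs, stop: () => clearInterval(timer) };
}

// Map the measured lag to the status code a health endpoint would return.
function healthStatus(lagMs, thresholdMs = 1000) {
  return lagMs > thresholdMs ? 503 : 200;
}

const monitor = createLagMonitor();
console.log(healthStatus(monitor.lagMs())); // 200: nothing has blocked yet
monitor.stop();
```

Node.js also ships perf_hooks.monitorEventLoopDelay, which provides a histogram of loop delay if you would rather not maintain your own probe.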

The Forward-Looking Alternative: cgroup Tracking

Linux control groups (cgroups) provide a PID-file-free way to track processes. Under systemd, every service’s processes are tracked through the service’s cgroup. Container runtimes such as Docker place each container in its own cgroup, alongside the namespaces that isolate it. If you’re deploying Node.js in containers, you don’t need a PID file at all, because the container runtime tracks process lifecycles for you.

The trend in Node.js deployment is moving away from filesystem-based process tracking. process.pid itself is harmless to read; what is worth retiring is the habit of persisting it to disk. For new projects, design your deployment without PID files from day one. Your future self, and your CPU, will thank you.