Why Your Redis Cache Hits Drop to Zero After Deployment
You push your latest build to production. You watch the metrics dashboard like a hawk. Everything looks clean—CPU is stable, memory is flat, error rates are flatlining. Then you see it: Redis cache hits drop from 95% to 3% in under two minutes. The cache isn't warming. Every request is slamming your primary database. Your page load times double, then triple. You scramble to check your cache key generation logic, your TTL settings, your eviction policy. Nothing changed. The code is identical to staging. But the cache is cold, and it’s not getting hot again.
This isn’t a bug in your caching library or a misconfiguration of Redis itself. More often than not, the culprit is something far more mundane—and far more insidious. It’s a silent deployment mistake that invalidates every single cache entry the moment your new containers spin up. Let’s walk through exactly what causes this, how to detect it before it hits production, and how to architect your way out of it for good.
The Most Common Culprit: Cache Key Prefixes That Change Between Builds
The first place to look is your cache key generation. A static prefix string baked into your source code, something like "user_profile:" + userId, is not the problem by itself. That prefix stays the same across deployments, so the entries written by the previous deployment’s processes are still sitting in Redis, and your new deployment should be able to read them just fine.
So why doesn’t it? The answer is almost always a mismatch between the cache key your app generates and the key that was stored. This happens when your prefix includes something that changes between builds—like a build timestamp, a git commit hash, or an environment variable that your CI pipeline injects at deploy time.
How Build-Time Variables Break Your Cache
Imagine you have a config file that looks like this:
const CACHE_PREFIX = process.env.CACHE_VERSION || 'v1';
In your staging environment, CACHE_VERSION is unset, so the prefix is v1. You deploy, the cache warms, everything works. Then you push to production. Your deployment pipeline sets CACHE_VERSION to the current git SHA, say a3f2b1c. Suddenly, every cache key starts with a3f2b1c:user_profile:. The old keys under v1:user_profile: are still in Redis. They’re perfectly valid. But your app will never look them up. Cache hit rate: zero. And because the SHA changes with every commit, the same thing happens again on the next deploy.
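To make the mismatch concrete, here is a minimal sketch. The buildKey helper is hypothetical (it is not from any library); it simply mirrors the CACHE_PREFIX fallback shown above.

```javascript
// Hypothetical key builder that mirrors the CACHE_PREFIX fallback above.
function buildKey(cacheVersion, userId) {
  const prefix = cacheVersion || 'v1';
  return `${prefix}:user_profile:${userId}`;
}

// Key written while CACHE_VERSION was unset.
const previousDeploy = buildKey(undefined, 42);
// Key requested once the pipeline injects the git SHA.
const currentDeploy = buildKey('a3f2b1c', 42);

console.log(previousDeploy);                   // v1:user_profile:42
console.log(currentDeploy);                    // a3f2b1c:user_profile:42
console.log(previousDeploy === currentDeploy); // false
```

Same user, same data, two different keys. Redis treats them as unrelated entries, so the new build never sees what the old one cached.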
This pattern is shockingly common. Teams add a version string to the cache prefix to “bust the cache on deploy”—and they do, but they bust everything. All at once. The database gets hammered. The site slows to a crawl. And nobody remembers who added that environment variable or why.
The Deployment Orchestration Trap
Even if your cache keys are perfectly stable, your deployment strategy can still nuke the cache. The issue here isn’t key generation—it’s the order in which services restart.
Rolling Updates and Shared Redis
Say you run three instances of your Node.js API behind a load balancer. You perform a rolling update: instance A stops, the new code starts, instance B stops, new code starts, and so on. During this window, instance A (now running the new code) and instance B (still running the old code) share the same Redis instance. This is fine—they read and write the same keys.
But what if your deployment process includes a database migration that changes the schema of cached objects? Or what if the new code serializes user data differently—maybe it adds a field that the old code doesn’t expect? In that case, when the new instance writes a cache entry, the old instance might read it and crash because the structure doesn’t match. To “fix” this, teams sometimes add a FLUSHALL command to the deployment script.
That’s the nuclear option. It works—for about five minutes. Then the new instances start serving traffic, the cache is empty, and you’re back to zero hit rates until the cache slowly warms up again.
The Blue-Green Deployment Blindspot
Blue-green deployments feel safer. You spin up a full new fleet (green) alongside the old one (blue). You test green, then flip the load balancer. The problem? Both fleets point to the same Redis. When green starts writing cache entries, it uses the same keys as blue. That’s fine for read-heavy workloads. But if green’s code changes the cache key format—even slightly—green will never find blue’s entries. And blue is still writing to Redis until you tear it down. You’ve now got a mixed state where some keys are in the old format, some in the new, and your hit rate oscillates wildly until blue is fully decommissioned.
Cache Invalidation at Scale: The Silent Killer
Cache invalidation is famously one of the two hard things in computer science. But the problem isn’t invalidation logic itself—it’s the scope of invalidation that most teams get wrong.
The “Invalidate Everything” Anti-Pattern
I once consulted for a small iGaming studio that ran a live casino lobby. Every time a new game was added or removed, their backend would call FLUSHDB on the Redis instance. That cleared every cache entry—player sessions, game state, leaderboard snapshots, everything. The lobby would reload instantly, but every other feature would suffer for the next 20 minutes while caches rebuilt. Their hit rate chart looked like a sawtooth wave.
The fix was simple: use key namespaces and invalidate only the affected namespace. Instead of FLUSHDB, they’d delete keys matching lobby:games:*. But they didn’t know they could iterate over matching keys with SCAN and remove them with DEL. They assumed a full flush was the only option.
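A targeted deletion along those lines might look like the sketch below. It assumes a client object that exposes ioredis-style scan() and unlink() methods; the deleteByPattern name and the batch size are illustrative choices, not something the studio actually ran.

```javascript
// Deletes every key matching `pattern` by walking the keyspace with SCAN.
// Assumption: `client` exposes ioredis-style scan() and unlink() methods.
async function deleteByPattern(client, pattern, batchSize = 100) {
  let cursor = '0';
  let deleted = 0;
  do {
    // SCAN returns [nextCursor, keys]; a returned cursor of '0' means the walk is complete.
    const [nextCursor, keys] = await client.scan(
      cursor, 'MATCH', pattern, 'COUNT', batchSize
    );
    if (keys.length > 0) {
      // UNLINK frees memory in the background; DEL would also work here.
      deleted += await client.unlink(...keys);
    }
    cursor = nextCursor;
  } while (cursor !== '0');
  return deleted;
}
```

Calling deleteByPattern(client, 'lobby:games:*') would clear the lobby entries and leave sessions, game state, and leaderboards alone. A single SCAN call can return an empty page before the walk is finished, which is why the loop keys off the cursor rather than the page size.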
Time-To-Live (TTL) Mismatch Between Environments
Another subtle killer: the TTL your app sets in development is 24 hours. In staging it’s 1 hour. In production it’s 5 minutes. You test everything in staging and see a 90% hit rate. You deploy to production, and the hit rate is 10%. The difference? In production, most keys are requested again less often than once every 5 minutes, so they expire before a second request can reuse them. Your cache is effectively a short-term buffer, not a cache.
This is especially nasty because it doesn’t look like a bug. Your code is correct. Redis is working. The keys exist—they just don’t live long enough to serve repeated requests. The solution is to profile your production access patterns and set TTLs based on actual request intervals, not arbitrary guesses.
The Connection Pooling and Serialization Gotcha
Sometimes the cache isn’t cold—your application just can’t reach it.
Redis Connection Limits
Most Redis-as-a-service providers cap the number of concurrent connections. If your deployment doubles the number of application instances (say, during a blue-green deployment), you might exceed that limit. New connections are refused. Your app falls back to the database. Cache hit rate drops to zero because your app can’t even ask Redis if the key exists.
This looks like a cache miss in your metrics, but it’s actually a connection failure. The fix is to monitor your Redis connection count during deployments and either increase the limit or stagger your instance spin-up.
Serialization Format Changes
If you switch your serialization format—say from JSON to MessagePack, or from a plain object to a compressed string—old cache entries become unreadable. Your app will try to deserialize them, throw an error, and treat the result as a cache miss. Worse, some error-handling code will silently swallow the exception and return null, making it look like the key simply wasn’t found.
This happened to a team I worked with that migrated from JSON.stringify to superjson for Date object support. They forgot to invalidate the old cache entries. Every user who had a cached session from the previous deployment got an error on their first request. The app fell back to the database, but it logged a warning that nobody read. For three days, their hit rate hovered at 15% because half the keys were in the old format and half in the new.
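Both gotchas in this section hide behind the same symptom, so it helps to count them separately. The sketch below assumes a client with an ioredis-style async get() that resolves to null for a missing key; the cacheGet wrapper and the counters object are hypothetical stand-ins for whatever metrics library you use.

```javascript
// Reads a cached JSON value and records why a lookup did not produce a hit.
// Assumptions: `client.get` is ioredis-style (resolves to null when the key
// is absent) and `counters` is a hypothetical in-process metrics object.
async function cacheGet(client, key, counters) {
  let raw;
  try {
    raw = await client.get(key);
  } catch (err) {
    counters.connectionError += 1; // Redis unreachable or connection refused
    return { status: 'connection_error', value: null };
  }
  if (raw === null) {
    counters.miss += 1; // the key is genuinely absent
    return { status: 'miss', value: null };
  }
  try {
    const value = JSON.parse(raw);
    counters.hit += 1;
    return { status: 'hit', value };
  } catch (err) {
    counters.decodeError += 1; // entry exists but is in an unreadable format
    return { status: 'decode_error', value: null };
  }
}
```

With separate counters, a deployment that exhausts the connection limit shows up as a spike in connectionError, and a serialization change shows up as decodeError, instead of both disappearing into a falling hit rate.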
How to Diagnose a Zero-Hit-Rate Deployment
You can’t fix what you can’t see. Here’s a quick diagnostic checklist to run the moment your hit rate drops after a deployment.
Check the Cache Key Prefix Live
SSH into a running instance and manually generate a cache key. Use redis-cli to check if that key exists. If it doesn’t, generate the same key from the previous deployment’s code (you should have the artifact or image tagged). If the keys differ, you’ve found a prefix mismatch.
Inspect the Deployment Script
Look for any FLUSHALL or FLUSHDB commands, or broad DEL calls, in your CI/CD pipeline. If you see one, that’s your smoking gun. Replace it with targeted deletion: iterate over the affected namespace with SCAN and remove only those keys with DEL.
Monitor Redis Connection Count
Set up an alert for Redis connection count exceeding 80% of your plan’s limit. During deployments, if connections spike, your new instances are probably failing to connect and falling back to the database.
Compare TTLs Across Environments
Run a script that samples cache keys from staging and production, then reports the average remaining TTL. If production’s average TTL is significantly lower, your production TTL is too short for your traffic pattern.
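One possible shape for that script is sketched here, assuming a client with ioredis-style scan() and ttl() methods. The averageTtl name and the default sample size are illustrative.

```javascript
// Samples keys matching `pattern` and returns their average remaining TTL in seconds.
// Assumption: `client` exposes ioredis-style scan() and ttl() methods.
async function averageTtl(client, pattern, sampleSize = 1000) {
  let cursor = '0';
  const ttls = [];
  do {
    const [nextCursor, keys] = await client.scan(cursor, 'MATCH', pattern, 'COUNT', 100);
    for (const key of keys) {
      const ttl = await client.ttl(key);
      // TTL returns -1 for keys with no expiry and -2 for keys that no longer exist.
      if (ttl >= 0) ttls.push(ttl);
      if (ttls.length >= sampleSize) break;
    }
    cursor = nextCursor;
  } while (cursor !== '0' && ttls.length < sampleSize);
  if (ttls.length === 0) return null; // nothing with an expiry was found
  return ttls.reduce((sum, t) => sum + t, 0) / ttls.length;
}
```

Run it against each environment with the same pattern and compare the numbers. Remaining TTL is always lower than the TTL the key was written with, so treat the result as a relative signal between environments, not an exact setting.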
Building a Deployment-Safe Caching Strategy
You don’t have to live with cache cold starts. Here’s a practical architecture that survives deployments without a dip.
Use a Stable, Environment-Scoped Prefix
Don’t embed build metadata in your cache keys. Use a prefix that includes only the environment name and the logical cache namespace, like prod:user_profile:. If you need to invalidate the cache after a deployment, do it selectively. Iterate with SCAN and MATCH prod:user_profile:*, starting at cursor 0 and repeating until the returned cursor is 0 again, and delete only the keys it returns. This keeps your session cache, game state cache, and other namespaces intact.
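A key builder in this spirit could look like the following sketch. The makeKeyBuilder helper and the list of allowed environment names are assumptions for illustration. The one deliberate design choice is that an unknown or missing environment throws instead of silently falling back to a default.

```javascript
// Hypothetical list of environment names this service is allowed to run in.
const ALLOWED_ENVS = new Set(['dev', 'staging', 'prod']);

// Returns a function that builds keys of the form env:namespace:id.
// Fails loudly on a missing or unknown environment rather than defaulting.
function makeKeyBuilder(env) {
  if (!ALLOWED_ENVS.has(env)) {
    throw new Error(`Refusing to build cache keys for unknown environment: ${String(env)}`);
  }
  return (namespace, id) => `${env}:${namespace}:${id}`;
}

const key = makeKeyBuilder('prod');
console.log(key('user_profile', 42)); // prod:user_profile:42
```

Nothing in the key depends on the build, so two consecutive deployments in the same environment always agree on where an entry lives.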
Implement a “Cache Priming” Endpoint
Before you flip the load balancer in a blue-green deployment, hit a warm-up endpoint on your new instances. This endpoint runs through your most popular queries—top 100 user profiles, the lobby game list, the leaderboard—and writes them to Redis. By the time real traffic arrives, the cache is already populated. Your hit rate starts at 70% instead of 0%.
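The core of such an endpoint can be quite small. In this sketch, the client is assumed to expose an ioredis-style set(key, value, 'EX', seconds), and entries is a hypothetical list of { key, load } pairs where load() fetches the fresh value from your database.

```javascript
// Writes a list of popular entries into the cache ahead of real traffic.
// Assumptions: `client.set` is ioredis-style, and each entry is a hypothetical
// { key, load } pair where load() returns the value to cache.
async function primeCache(client, entries, ttlSeconds) {
  let primed = 0;
  for (const { key, load } of entries) {
    try {
      const value = await load();
      await client.set(key, JSON.stringify(value), 'EX', ttlSeconds);
      primed += 1;
    } catch (err) {
      // One failed entry should not abort the rest of the warm-up.
      console.warn(`Priming failed for ${key}: ${err.message}`);
    }
  }
  return primed;
}
```

The loop is sequential on purpose: warming the cache should not itself hammer the database. If the list grows large, a small concurrency limit is a reasonable next step.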
Separate Cache by Data Sensitivity
Not all cached data is equal. Your user session data should have a long TTL and be invalidated only on logout or password change. Your game lobby data can have a short TTL and be invalidated on any content update. Keep them in separate Redis namespaces or even separate Redis instances. This way, a deployment that updates the lobby doesn’t nuke user sessions.
Use a Two-Phase Deployment for Schema Changes
If your deployment includes a change to the serialization format of cached objects, deploy in two phases. Phase one: deploy code that writes the new format but can read both the old and new formats. Let it run for one full TTL cycle so all old entries expire naturally. Phase two: deploy code that only reads the new format. This avoids the “half the keys are unreadable” problem.
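A phase-one reader might look like the sketch below. The versioned envelope ({ v, data }) is an assumption made for illustration; any marker that lets the reader tell the two formats apart would do.

```javascript
// Hypothetical version tag for the new on-the-wire format.
const CURRENT_FORMAT = 2;

// Phase one writes only the new format: the value wrapped in a versioned envelope.
function encode(value) {
  return JSON.stringify({ v: CURRENT_FORMAT, data: value });
}

// Phase one reads both formats: new envelopes and legacy bare JSON values.
function decode(raw) {
  const parsed = JSON.parse(raw);
  const isEnvelope =
    parsed !== null &&
    typeof parsed === 'object' &&
    parsed.v === CURRENT_FORMAT &&
    'data' in parsed;
  return isEnvelope ? parsed.data : parsed;
}

console.log(decode('{"name":"Ada"}'));         // legacy entry, returned as-is
console.log(decode(encode({ name: 'Ada' })));  // new entry, unwrapped
```

One caveat: a legacy object that happens to carry both a v field equal to 2 and a data field would be misread as an envelope, so pick a marker your old payloads cannot contain. In phase two, the legacy branch of decode can be deleted.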
The Real Cost of Cold Caches
Every cache miss that hits your primary database adds latency. For a typical PostgreSQL instance serving a read-heavy workload, a cache miss can add 10–50 milliseconds per request. If you have 10,000 requests per second and your hit rate drops to zero, that’s 100 to 500 seconds of extra database time per second of wall clock time, or the equivalent of 100 to 500 queries in flight at every moment. Your database will saturate its connection pool, queries will queue, and your pager will go off.
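The arithmetic is easy to reproduce for both ends of the 10–50 ms range. The function below is a back-of-the-envelope sketch (the name is made up): it multiplies misses per second by the time each miss spends in the database, which by Little’s law is also the average number of extra queries in flight.

```javascript
// Extra database work caused by cache misses, in seconds of query time per
// wall-clock second. By Little's law this equals the average number of
// additional queries in flight.
function extraDbConcurrency(requestsPerSecond, hitRate, missLatencyMs) {
  const missesPerSecond = requestsPerSecond * (1 - hitRate);
  return (missesPerSecond * missLatencyMs) / 1000;
}

console.log(extraDbConcurrency(10000, 0, 50)); // 500
console.log(extraDbConcurrency(10000, 0, 10)); // 100
console.log(extraDbConcurrency(10000, 0.95, 50)); // about 25 at a healthy 95% hit rate
```

Going from roughly 25 concurrent queries to several hundred is the difference between a connection pool with headroom and one that is exhausted.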
I’ve seen this take down a mid-size gaming platform for 45 minutes during a major tournament. The dev team had added a CACHE_VERSION environment variable to “make cache invalidation explicit.” They forgot to set it in the production deployment configuration. The default value was dev, which didn’t match the prod prefix from the previous deployment. Every player session was fetched from the database. The PostgreSQL replica fell over. The primary followed shortly after. The tournament had to be paused.
What You Can Do Right Now
Before your next deployment, audit your cache key generation. Search your codebase for any string that’s prepended to cache keys. If you find a variable that could change between builds—a git hash, a timestamp, an environment variable—remove it. Replace it with a stable, environment-scoped prefix that only changes when you explicitly want to invalidate that namespace.
Then, add a Redis hit-rate alert to your monitoring system. Set the threshold at 50%. If it drops below that within five minutes of a deployment, you’ll know something is wrong before your users do.
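If your monitoring system doesn’t compute hit rate for you, it can be derived from the keyspace_hits and keyspace_misses counters in the output of Redis’s INFO stats command. Those counters are cumulative since the server started, so the sketch below compares two samples. The function names and the sample text are illustrative.

```javascript
// Extracts the cumulative hit/miss counters from the text returned by INFO stats.
function parseStats(infoText) {
  const read = (field) => {
    const match = infoText.match(new RegExp(`^${field}:(\\d+)`, 'm'));
    return match ? Number(match[1]) : 0;
  };
  return { hits: read('keyspace_hits'), misses: read('keyspace_misses') };
}

// Hit rate for the window between two samples, or null if there was no traffic.
function windowHitRate(previous, current) {
  const hits = current.hits - previous.hits;
  const misses = current.misses - previous.misses;
  const total = hits + misses;
  return total > 0 ? hits / total : null;
}

// Illustrative samples taken before and after a deployment.
const before = parseStats('# Stats\r\nkeyspace_hits:1000\r\nkeyspace_misses:100\r\n');
const after = parseStats('# Stats\r\nkeyspace_hits:1030\r\nkeyspace_misses:1070\r\n');
const rate = windowHitRate(before, after);
console.log(rate);       // 0.03
console.log(rate < 0.5); // true, so the alert should fire
```

Using the lifetime ratio instead of a window would hide the problem: a server with weeks of healthy history still reports a high overall hit rate minutes after the cache has gone cold.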
You don’t need a PhD in distributed systems to keep your cache warm. You just need to stop shooting yourself in the foot with deployment-time cache key changes. The next time you deploy, your cache will thank you.