Why Your Python Async Tasks Stall After 30 Simultaneous Uploads
You’re running a small studio, building a Python backend that handles file uploads. Testing locally, everything flies: ten users, twenty, maybe twenty-five. Then the thirty-first upload hits, and the whole thing freezes. Your async tasks stall, connections drop, and you’re staring at a log file full of “Task was destroyed but it is pending!” errors.
This isn’t a bug. It’s a design failure, and it’s shockingly common in Python async backends built on asyncio or FastAPI. The culprit isn’t your code’s logic—it’s your runtime’s default concurrency model and how it handles blocking I/O under load. Let’s break down exactly why that happens and how to fix it before your users rage-quit your platform.
The Hidden Limit in Python’s Async Event Loop
Python’s asyncio event loop is single-threaded by design. It uses cooperative multitasking: each task yields control voluntarily when it hits an await point. That works beautifully for I/O-bound work like HTTP requests or database queries, where the task spends most of its time waiting on the network.
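To make the cooperative part concrete, here is a minimal sketch using only the standard library. The names ticker and longest_gap are invented for this example. It runs a 0.2-second job next to a task that ticks every 10 ms and reports the longest pause between ticks, once with a job that awaits and once with a job that blocks.

```python
import asyncio
import time


async def ticker(ticks: list, interval: float = 0.01) -> None:
    """Record a timestamp every `interval` seconds for as long as the loop lets us."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def longest_gap(blocking: bool) -> float:
    """Run a 0.2 s job next to the ticker and return the ticker's longest pause."""
    ticks: list = []
    task = asyncio.create_task(ticker(ticks))
    await asyncio.sleep(0.05)       # let the ticker record a few timestamps
    if blocking:
        time.sleep(0.2)             # never yields: every other task is frozen
    else:
        await asyncio.sleep(0.2)    # yields: the ticker keeps running
    await asyncio.sleep(0.05)       # let the ticker record a few more
    task.cancel()
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


if __name__ == "__main__":
    print(f"awaiting job: {asyncio.run(longest_gap(False)) * 1000:.0f} ms max gap")
    print(f"blocking job: {asyncio.run(longest_gap(True)) * 1000:.0f} ms max gap")
```

With the blocking job, the longest gap is at least the full 0.2 seconds, because the ticker cannot run until the blocking call returns. With the awaiting job it should stay close to the tick interval.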
But file uploads are different. They involve reading raw bytes from a socket, parsing multipart form data, and often writing to disk, all of which can block the event loop if not handled correctly. The problem isn’t that Python is slow. It’s that blocking operations get handed off to a bounded pool of worker threads, and the default bound is easy to exhaust with concurrent uploads.
The Default ThreadPoolExecutor Trap
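When you write an async def handler in FastAPI or Starlette, blocking file I/O cannot run on the event loop itself, so calls such as UploadFile.read() and UploadFile.write() are dispatched to worker threads. Which pool they land in depends on how the call is made. Anything sent through loop.run_in_executor(None, ...) or asyncio.to_thread() uses asyncio’s default ThreadPoolExecutor, whose max_workers defaults to min(32, os.cpu_count() + 4). On a typical 8-core cloud instance, that gives you 12 workers. Recent Starlette releases route their own thread offloading through AnyIO’s worker-thread limiter instead, which defaults to 40 threads. Either way the pool is bounded, and the bound is lower than most people assume.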
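As a quick sanity check of the sizing formula for asyncio’s default executor, here is a minimal sketch. The helper name default_max_workers is made up for this example; it mirrors the documented default for Python 3.8 and later (Python 3.13 counts CPUs with os.process_cpu_count() rather than os.cpu_count()).

```python
import os
from typing import Optional


def default_max_workers(cpu_count: Optional[int] = None) -> int:
    """Default ThreadPoolExecutor size on Python 3.8+: min(32, cpus + 4)."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return min(32, cpus + 4)


if __name__ == "__main__":
    for cores in (2, 4, 8, 64):
        print(f"{cores} cores -> {default_max_workers(cores)} workers")
```

An 8-core machine gets 12 workers and a 4-core machine gets 8. The cap of 32 only applies from 28 cores upward.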
Here’s the kicker: each blocking call ties up one of those workers until it returns. If 30 uploads are reading or writing at once, you’re asking for 30 threads. The pool queues the excess work, and every coroutine awaiting a queued job sits idle until a worker frees up. Anything else that shares the pool (DNS lookups, synchronous route handlers, other file I/O) waits in the same line, behind threads that are blocked on disk writes.
From the outside this looks like a deadlock, although strictly it is starvation: nothing new can start because all workers are busy, and the workers stay busy because the disk can’t keep up. It becomes a genuine freeze the moment any blocking call runs directly on the event loop instead of in the pool, because then nothing at all gets scheduled. The thirty-first upload never even starts.
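The queueing effect is easy to reproduce with the standard library alone. In this sketch, blocking_write and run_uploads are hypothetical names, and time.sleep stands in for a slow disk write. Jobs beyond the pool size simply wait for a free worker, so wall time grows in waves.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_write(seconds: float) -> None:
    """Stand-in for a blocking disk write or file read."""
    time.sleep(seconds)


async def run_uploads(n_uploads: int, max_workers: int, seconds: float = 0.1) -> float:
    """Submit n_uploads blocking jobs to a pool of max_workers threads; return wall time."""
    loop = asyncio.get_running_loop()
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        jobs = [loop.run_in_executor(pool, blocking_write, seconds) for _ in range(n_uploads)]
        await asyncio.gather(*jobs)
    return time.monotonic() - start


if __name__ == "__main__":
    print(f"4 uploads, 4 workers: {asyncio.run(run_uploads(4, 4)):.2f} s")
    print(f"8 uploads, 4 workers: {asyncio.run(run_uploads(8, 4)):.2f} s")
```

Eight jobs on four workers need at least two waves, so the second run takes roughly twice as long as the first even though the event loop itself is never blocked.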
How Multipart Parsing Magnifies the Problem
The real bottleneck isn’t just the thread pool. It’s how the framework parses incoming multipart data. When a client uploads a file, the HTTP server (like uvicorn or gunicorn) receives the raw bytes and passes them to your framework. FastAPI, for example, uses python-multipart to parse the stream.
That parser is synchronous, callback-driven code. Starlette feeds it the request body chunk by chunk and spools every file part to a temporary file (in memory while the part is small, on disk once it grows) before your handler even sees the UploadFile object. The parsing itself runs on the event loop thread and the disk writes go through the worker pool, so 30 concurrent uploads compete for both for the entire parse duration, which can be seconds for large files.
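FastAPI’s documentation describes UploadFile as a spooled file: held in memory up to a size limit, then moved to disk. The standard library’s tempfile.SpooledTemporaryFile behaves the same way, and this sketch shows the rollover. The 1 MB threshold and the function name are assumptions for illustration, and _rolled is a private CPython attribute read here only to make the switch visible.

```python
import tempfile

SPOOL_MAX_SIZE = 1024 * 1024  # assumed 1 MB threshold, for illustration only


def spooled_to_disk(payload_size: int, max_size: int = SPOOL_MAX_SIZE) -> bool:
    """Write payload_size bytes to a spooled file and report whether it moved to disk."""
    with tempfile.SpooledTemporaryFile(max_size=max_size) as spool:
        spool.write(b"\0" * payload_size)
        # Private CPython attribute; do not rely on it in production code.
        return spool._rolled


if __name__ == "__main__":
    print("10 KB part on disk:", spooled_to_disk(10 * 1024))
    print("2 MB part on disk: ", spooled_to_disk(2 * 1024 * 1024))
```

Small form fields never touch the disk. A large video part does, which is why upload parsing starts to consume blocking disk I/O well before your handler runs.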
A Concrete Example: 50 MB Video Uploads on a 4-Core Server
I ran into this exact problem at a previous studio. We were building a real-time highlight reel platform for amateur sports. Users uploaded 50 MB video clips directly to our FastAPI backend. On a 4-core DigitalOcean droplet, the default thread pool had 8 workers. The first eight uploads parsed and processed fine. The ninth through sixteenth queued but eventually finished. The seventeenth upload stalled indefinitely.
The logs showed asyncio.exceptions.CancelledError on tasks that had been pending for over 60 seconds. The client-side JavaScript retried the upload, which only made things worse—each retry spawned a new task that competed for the same exhausted pool. We had a cascading failure: the event loop was so saturated with stalled tasks that health check endpoints timed out.
The fix wasn’t more CPU or RAM. It was rethinking how we handled the upload pipeline.
Architecting for True Concurrency: Three Fixes That Actually Work
You don’t need to rewrite your entire stack. You need to match your concurrency model to the workload. Here are three concrete strategies, ordered from simplest to most robust.
Increase the Thread Pool—But Know the Cost
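The easiest band-aid is to raise max_workers. For work that goes through asyncio’s default executor, you can replace that executor from code running inside the server’s event loop, for example a startup hook. (Calling asyncio.get_event_loop() at import time may hand you a different loop from the one the server ends up running. Starlette’s own thread offloading is governed by AnyIO’s limiter, which is configured separately.)
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def configure_executor() -> None:
    # Call once from code running inside the server's event loop, e.g. a startup hook.
    executor = ThreadPoolExecutor(max_workers=64)
    asyncio.get_running_loop().set_default_executor(executor)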
This works up to a point. More threads mean more concurrent blocking operations, but each thread adds scheduling overhead and reserves stack space (8 MB of virtual address space per thread by default on most Linux systems). At 64 threads, that is a 512 MB reservation. Only the pages a thread actually touches become resident, so the real footprint is usually far smaller, but on a small instance it still adds up alongside the buffers each upload holds.
This approach also doesn’t solve the fundamental problem: synchronous multipart parsing still blocks a thread. You’re just buying headroom. For a production iGaming or media platform, you need a better solution.
Offload Uploads to a Background Task Queue
The smarter move is to decouple the upload acceptance from the upload processing. Accept the file bytes quickly in your async handler, write them to a temporary location (preferably a fast, ephemeral filesystem like /dev/shm or a RAM disk), and then enqueue a background task to process the file.
Here’s a minimal pattern using FastAPI and asyncio.create_task:
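import asyncio
import uuid
from pathlib import Path

import aiofiles
from fastapi import FastAPI, UploadFile

app = FastAPI()
UPLOAD_DIR = Path("/tmp")
# Strong references, so background tasks are not garbage-collected before they finish
background_tasks: set[asyncio.Task] = set()

async def process_upload(file_path: Path) -> None:
    # Heavy processing like transcoding or virus scanning
    await asyncio.sleep(0)  # Yield control
    # ... processing logic ...

@app.post("/upload", status_code=202)
async def upload_file(file: UploadFile):
    # Use a server-generated name: file.filename is client-controlled and may
    # contain path separators or collide with another upload
    dest = UPLOAD_DIR / uuid.uuid4().hex
    # Write to temp location quickly
    async with aiofiles.open(dest, "wb") as f:
        while chunk := await file.read(1024 * 1024):  # 1 MB chunks
            await f.write(chunk)
    # Offload processing without blocking the response
    task = asyncio.create_task(process_upload(dest))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return {"status": "accepted", "file": file.filename}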
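Notice the await file.read() loop. By reading in chunks and yielding to the event loop after each chunk, you prevent any single upload from monopolizing the loop or a worker thread. (aiofiles still performs each write in a worker thread, but only for the duration of one chunk.) asyncio.create_task then schedules the processing to run concurrently. Hold on to the returned task, because the event loop keeps only a weak reference to it and an unreferenced task can be garbage-collected mid-run. The client gets its response as soon as the bytes are on disk instead of waiting for processing to finish; set status_code=202 on the route so that response is a 202 Accepted. The event loop stays responsive as long as the processing itself awaits or runs in a worker thread or process.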
This pattern scales far better because the bottleneck moves from thread count to disk I/O throughput and memory. You can handle hundreds of simultaneous uploads on a modest instance if your disk subsystem can keep up.
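One caveat about background processing: a task created with asyncio.create_task still runs on the event loop, so CPU-heavy or blocking work inside it has to be pushed to a thread or process. The sketch below assumes Python 3.9+ for asyncio.to_thread and uses a SHA-256 checksum as a stand-in for transcoding or scanning. The names checksum, process_upload and schedule_processing are illustrative, not part of any framework.

```python
import asyncio
import hashlib
from pathlib import Path

# Strong references: the event loop only keeps weak references to tasks.
background_tasks: set = set()


def checksum(path: Path) -> str:
    """Blocking, disk-heavy work that stands in for transcoding or scanning."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(1024 * 1024):
            digest.update(chunk)
    return digest.hexdigest()


async def process_upload(path: Path) -> str:
    # Runs checksum() in the default executor so the event loop stays free.
    return await asyncio.to_thread(checksum, path)


def schedule_processing(path: Path) -> asyncio.Task:
    """Start processing in the background and keep the task alive until it is done."""
    task = asyncio.create_task(process_upload(path))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return task
```

A handler would call schedule_processing(dest) just before returning. For work that is pure-Python and CPU-bound, a ProcessPoolExecutor passed to loop.run_in_executor is usually the better fit, since threads share the GIL.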
Use a Dedicated Upload Service with Streaming
For high-availability systems—think iGaming platforms processing player ID verifications or payout documents—you shouldn’t handle file uploads in your application server at all. Use a dedicated upload service or a reverse proxy that streams directly to object storage.
Nginx, for example, can stream uploads to a backend via proxy_request_buffering off. Combined with a service like MinIO or S3, your Python backend never touches the raw bytes. The upload service handles the blocking I/O, and your app receives a simple URL once the file is stored.
location /upload {
    # Placeholder upstream: point this at the upload service, not the main Python app
    proxy_pass http://your-upload-service;
    proxy_request_buffering off;
    client_max_body_size 500M;
    proxy_http_version 1.1;
}
On the backend side, you just accept a metadata payload:
@app.post("/upload-complete")
async def upload_complete(payload: dict):
    file_url = payload["url"]
    # Process the file URL asynchronously
    asyncio.create_task(process_file_from_url(file_url))
    return {"status": "ok"}
Large file-hosting platforms commonly use variants of this pattern, with clients uploading straight to a storage tier. Your Python backend stays pure async, your thread pool handles only lightweight tasks, and the file I/O happens in a layer designed for it.
The Real-Time Dimension: Why This Matters for Live Systems
If you’re building anything with real-time features—WebSocket-based chat, live game state sync, or streaming analytics—the stall problem is existential. A blocked event loop doesn’t just affect uploads. It kills your WebSocket connections, delays your game state broadcasts, and breaks your heartbeat signals.
I’ve seen this happen in an online casino platform where players were uploading KYC documents (passports, utility bills) for identity verification. The upload handler would stall the event loop, and the WebSocket connection that streamed the live roulette wheel state would drop. Players saw frozen screens and assumed the game was rigged. The support tickets flooded in within seconds.
The fix was moving all file uploads to a separate microservice behind an Nginx proxy, as described above. The main game server never touched a file. Its event loop stayed responsive, WebSocket pings went through, and the KYC service could scale independently based on upload volume.
Monitoring for Event Loop Health
You can’t fix what you don’t measure. Add event loop latency monitoring to your production stack. Tools like aiomonitor or custom middleware that logs the delta between scheduled and actual execution time can catch stalls before they cascade.
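A custom monitor can be as small as a background task that sleeps for a fixed interval and records how late each wake-up was. This is a minimal sketch, not a production tool. The function names are invented here, and the time.sleep call in demo simulates a handler that blocks the loop.

```python
import asyncio
import time


async def measure_loop_lag(interval: float, samples: int) -> list:
    """Sleep `interval` seconds `samples` times; return how late each wake-up was."""
    lags = []
    for _ in range(samples):
        start = time.monotonic()
        await asyncio.sleep(interval)
        lags.append(max(0.0, time.monotonic() - start - interval))
    return lags


async def demo() -> float:
    """Run the monitor while something blocks the loop; return the worst lag seen."""
    monitor = asyncio.create_task(measure_loop_lag(0.01, 10))
    await asyncio.sleep(0.02)
    time.sleep(0.2)  # simulated blocking call on the event loop
    return max(await monitor)


if __name__ == "__main__":
    print(f"worst loop lag: {asyncio.run(demo()) * 1000:.0f} ms")
```

In a real service the loop would run indefinitely and export each lag value to your metrics system rather than collecting a list.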
A simple health check endpoint that measures how long an await asyncio.sleep(0) takes to return gives you a real-time view of loop saturation:
import time
import asyncio
from fastapi import FastAPI

app = FastAPI()

@app.get("/health/loop")
async def health_loop():
    start = time.monotonic()
    await asyncio.sleep(0)
    elapsed = time.monotonic() - start
    if elapsed > 0.1:
        # Event loop is stalling
        return {"status": "degraded", "loop_latency_ms": elapsed * 1000}
    return {"status": "ok", "loop_latency_ms": elapsed * 1000}
If you see latencies above 50 milliseconds, you have a problem. Above 100 milliseconds, your system is in trouble.
What You Should Do Next
Don’t just throw more hardware at the problem. Profile your upload path with asyncio debug mode enabled (PYTHONASYNCIODEBUG=1). In debug mode the loop logs every callback or task step that holds it for longer than loop.slow_callback_duration, which defaults to 100 milliseconds. Use those warnings to identify where the blocking actually happens: is it multipart parsing, disk writes, or thread pool exhaustion?
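Debug mode can also be switched on in code, which is handy for a quick local reproduction. In this sketch, blocking_handler is a made-up stand-in for a handler that calls synchronous code, and the 50 ms threshold is an arbitrary choice for the example.

```python
import asyncio
import logging
import time


async def blocking_handler() -> None:
    time.sleep(0.2)  # stand-in for a synchronous call inside an async handler


async def main() -> None:
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.05  # seconds; the default is 0.1
    await asyncio.create_task(blocking_handler())


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    asyncio.run(main(), debug=True)  # same effect as PYTHONASYNCIODEBUG=1
```

The asyncio logger then emits a warning naming the slow task and how long it held the loop, which points you at the handler to fix.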
Then pick one of the three strategies above and implement it. Start with the chunked read loop and background task offload—it’s the lowest effort and gives you the most immediate relief. If your upload volume grows beyond a few hundred concurrent requests, graduate to the dedicated upload service pattern.
Your users don’t care about your event loop architecture. They care that their upload finishes in under five seconds and that the app stays responsive while it happens. Give them that, and they’ll never know the difference between a threaded and an async backend.