Your Python/FastAPI service consumes RAM during request processing and doesn't release it back to the OS, even after the request completes. Over time (hours or days), memory usage grows from 500MB → 1GB → 2GB until the service eventually crashes.
- ✅ Memory increases during processing (expected)
- ❌ Memory stays high after processing (problem)
- ❌ Each request adds a bit more memory (accumulation)
- ❌ Eventually: OOM (Out of Memory) errors or service crash
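Before reaching for fixes, it helps to confirm this pattern by logging the process's resident memory (RSS) after each request. A minimal sketch using psutil (an extra dependency, not part of the original setup; logger name is illustrative):

```python
import logging
import os

import psutil
from fastapi import FastAPI, Request

logger = logging.getLogger("memory-watch")
app = FastAPI()
process = psutil.Process(os.getpid())

@app.middleware("http")
async def log_rss(request: Request, call_next):
    response = await call_next(request)
    rss_mb = process.memory_info().rss / 1024 / 1024
    # If this number only ever goes up across many requests, you have accumulation
    logger.info("RSS after %s: %.1f MB", request.url.path, rss_mb)
    return response
```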
Python doesn't immediately return memory to the operating system, even after garbage collection:
# Your code does this:
large_data = fetch_1gb_of_data() # Memory: 1 GB
process(large_data)
del large_data # Python frees the object
gc.collect() # Garbage collector runs
# OS still sees: 1 GB used
# Python thinks: "I might need this memory again soon, I'll keep it"

This is BY DESIGN, not a bug! Python's memory allocator (pymalloc) holds onto freed memory to reuse for future allocations, avoiding expensive OS calls.
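You can see this for yourself with a short standalone script that allocates many small objects, frees them, forces a collection, and reads the process RSS at each step. A minimal sketch (Linux-only, since it reads /proc/self/status; sizes are illustrative and exact numbers vary by platform and allocator):

```python
import gc

def rss_mb() -> float:
    """Current resident set size in MB, read from /proc (Linux-only)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024  # value is reported in kB
    return 0.0

print(f"baseline:                 {rss_mb():.0f} MB")

large_data = [b"x" * 200 for _ in range(5_000_000)]  # ~1 GB of small objects
print(f"after allocation:         {rss_mb():.0f} MB")

del large_data
gc.collect()
# RSS often stays well above baseline: pymalloc keeps freed blocks for reuse
print(f"after del + gc.collect(): {rss_mb():.0f} MB")
```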
Beyond Python's normal behavior, these cause actual leaks:
# 1. Global state accumulation
request_cache = [] # Grows forever
@app.post("/api")
async def endpoint(data):
    request_cache.append(data)  # Never cleared!
    return {"status": "ok"}
# 2. Circular references
class Request:
    def __init__(self):
        self.response = Response(self)

class Response:
    def __init__(self, request):
        self.request = request  # Circular!
# 3. Large objects held in memory
processed_data = {} # Global dict
@app.post("/api")
async def endpoint(request: LargeModel):
    processed_data[request.id] = request  # Keeps the entire model in memory
    return {"status": "ok"}

Periodically kill and restart workers to release memory back to the OS.
- 2+ workers for zero downtime
- With 1 worker: expect brief downtime during each restart
#!/bin/bash
# entrypoint.sh
gunicorn src.main:app \
--workers 4 \
--worker-class uvicorn.workers.UvicornWorker \
--bind 0.0.0.0:8000 \
--max-requests 1000 \
--max-requests-jitter 100 \
--timeout 30 \
--graceful-timeout 30 \
--keep-alive 5

Key parameters:
- --max-requests 1000: Restart worker after handling 1000 requests
- --max-requests-jitter 100: Add randomness (900-1100) so all workers don't restart simultaneously
- --graceful-timeout 30: Give worker 30 seconds to finish current requests before killing
Worker 1: Request 1-1043 → Finishes current request → Shuts down
New Worker 1 spawns while old one shutting down
Worker 2: Request 1-967 → Finishes current request → Shuts down
New Worker 2 spawns while old one shutting down
Worker 3: Request 1-1089 → Finishes current request → Shuts down
New Worker 3 spawns while old one shutting down
Zero downtime: New workers spawn before old ones die
Memory released: OS reclaims memory when worker process exits
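To verify recycling is actually happening, you can log each worker's PID at startup and watch new PIDs appear in the logs as workers are replaced. A minimal sketch (the logger name and message are illustrative; on_event is used for brevity, newer FastAPI versions prefer lifespan handlers):

```python
import logging
import os

from fastapi import FastAPI

logger = logging.getLogger("worker-lifecycle")
app = FastAPI()

@app.on_event("startup")
async def log_worker_start() -> None:
    # Each recycled worker is a new process, so a new PID shows up here
    logger.info("Worker started with PID %s", os.getpid())
```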
#!/bin/bash
# entrypoint.sh
uvicorn src.main:app \
--host 0.0.0.0 \
--port 8000 \
--workers 4 \
--limit-max-requests 1000  # uvicorn's equivalent of gunicorn's --max-requests (uvicorn has no jitter option)

Note
You need a slightly more sophisticated setup if you only run 1 worker
Pros:
- ✅ Industry standard solution
- ✅ Handles ALL memory issues (Python's behavior + real leaks)
- ✅ No code changes required
- ✅ Zero downtime with 2+ workers
Cons:
- ⚠️ Requires 2+ workers for zero downtime
- ⚠️ Doesn't fix root cause (mitigates it)
Industry Standard • Used as safety net by most production systems
Set hard memory limits and let Docker/Kubernetes restart the container when exceeded.
# docker-compose.yml
services:
  your-service:
    build: .
    deploy:
      resources:
        limits:
          memory: 4G          # Kill if exceeds 4GB
        reservations:
          memory: 2G          # Guaranteed minimum
    restart: unless-stopped   # Auto-restart after kill
Memory usage: 2GB → 3GB → 3.9GB → 4GB
Docker: SIGKILL (container killed immediately)
Docker: Restarts container (based on restart policy)
Downtime: 5-10 seconds
Memory: Back to ~500MB baseline
Pros:
- ✅ Prevents one service from killing entire server
- ✅ Automatic recovery
- ✅ No code changes
- ✅ Works as failsafe with Solution 1
Cons:
- ❌ Brief service interruption when limit hit
- ❌ Doesn't fix root cause
Best practice: Set the limit well above normal usage so it only catches runaway cases.
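To confirm from inside the container which limit actually applies, you can read it from the cgroup filesystem. A minimal sketch (assumes cgroup v2 paths; cgroup v1 uses /sys/fs/cgroup/memory/memory.limit_in_bytes instead):

```python
from pathlib import Path

def read_cgroup_value(name: str) -> str:
    """Read a cgroup v2 memory file, e.g. memory.max or memory.current."""
    return Path(f"/sys/fs/cgroup/{name}").read_text().strip()

limit = read_cgroup_value("memory.max")        # "4294967296" for a 4G limit, or "max" if unlimited
current = read_cgroup_value("memory.current")  # current usage in bytes
print(f"limit={limit} current={current}")
```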
Common Practice • Good for: Low-traffic services or known memory patterns
Restart the container during low-traffic periods.
# Restart every day at 3 AM
0 3 * * * docker restart your-service-container

Pros:
- ✅ Simple to implement
- ✅ Predictable restart time
- ✅ No code changes
Cons:
- ⚠️ Requires planned downtime window
- ⚠️ Only works for services with predictable traffic patterns
- ⚠️ May restart too frequently or not frequently enough
Warning
Not really usable in the real world
Note
This would be the last option for me
Force Python's garbage collector to run more frequently.
Use only for heavy usage endpoints.
import gc
from functools import wraps
def with_gc(func):
    """Force GC after endpoint execution"""
    @wraps(func)
    async def wrapper(*args, **kwargs):
        try:
            return await func(*args, **kwargs)
        finally:
            gc.collect()
    return wrapper

# Only use on known heavy endpoints
@app.post("/api/heavy-processing")
@with_gc
async def heavy_processing(data: LargeModel):
    result = process_large_data(data)
    return {"result": result}

Pros:
- ✅ Can help with circular references
- ✅ No infrastructure changes
- ✅ Easy to add/remove
Cons:
- ❌ Blocks the event loop (causes latency spikes)
- ❌ May not release memory to OS anyway
- ❌ Can hurt throughput if overused
- ❌ Controversial - many senior engineers avoid this
When to consider:
- You have specific heavy endpoints with circular references
When to avoid:
- As a first solution (use worker recycling instead)
- On every request (too much overhead)
- Without measuring the impact
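If you do reach for this, measure what a forced collection actually costs and returns before keeping it. A minimal sketch (the logger name is illustrative; gc.collect() returns the number of unreachable objects it found):

```python
import gc
import logging
import time

logger = logging.getLogger("gc-metrics")

def timed_collect() -> None:
    # Measure how long the collection blocks and how much it actually reclaims
    start = time.perf_counter()
    unreachable = gc.collect()
    duration_ms = (time.perf_counter() - start) * 1000
    logger.info("gc.collect() found %d unreachable objects in %.1f ms", unreachable, duration_ms)
```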
Note
This helps us find the actual root cause
Use profiling to identify and fix actual memory leaks in your code.
pip install memory-profiler

from memory_profiler import profile

# @profile sits closest to the function so FastAPI registers the profiled version
@app.post("/api/suspected-leak")
@profile
async def suspected_leak(data: LargeModel):
    parsed = data.model_dump()
    result = await process_data(parsed)
    cache[data.id] = result
    return {"status": "ok"}

# Run your app with profiling
python -m memory_profiler your_app.py
# Or send output to file
python -m memory_profiler your_app.py > memory.log 2>&1

Line #    Mem usage    Increment   Line Contents
================================================
     3     50.2 MiB     50.2 MiB   @profile
                                   async def suspected_leak(data):
     5     50.5 MiB      0.3 MiB       parsed = data.model_dump()
     8    150.8 MiB    100.3 MiB       result = await process_data(parsed)
    11    251.0 MiB    100.2 MiB       cache[data.id] = result   # ← LEAK!
    14    251.0 MiB      0.0 MiB       return {"status": "ok"}
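If you'd rather avoid a third-party dependency, the standard library's tracemalloc can point at the same kind of leak by diffing allocation snapshots taken before and after a batch of requests. A minimal sketch (how you trigger the snapshots, e.g. from a debug-only endpoint, is up to you):

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# ... let the service handle a batch of requests ...

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:10]:
    # Lines whose totals keep growing between snapshots are leak candidates
    print(stat)
```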
# Issue 1: Unbounded cache
cache = {}  # Grows forever

# Fix: Add size limit
from collections import OrderedDict

cache = OrderedDict()
MAX_CACHE = 1000

def add_to_cache(key, value):
    if len(cache) >= MAX_CACHE:
        cache.popitem(last=False)  # Remove oldest
    cache[key] = value
# Issue 2: Holding references to large objects
requests_log = []  # Keeps all requests

# Fix: Only keep what you need
requests_log = []
MAX_LOG = 100

def log_request(request_id):
    if len(requests_log) >= MAX_LOG:
        requests_log.pop(0)
    requests_log.append(request_id)  # Just the ID, not entire object
# Issue 3: Circular references
class Handler:
    def __init__(self):
        self.processor = Processor(self)  # Circular!

# Fix: Use weak references
import weakref

class Handler:
    def __init__(self):
        # Processor stores a weak reference; calling it returns the Handler (or None once collected)
        self.processor = Processor(weakref.ref(self))

Pros:
- ✅ Fixes actual root cause
- ✅ Improves overall code quality
- ✅ One-time effort (fix once, solved forever)
Cons:
- ⚠️ Time-consuming (profiling and analysis)
- ⚠️ Performance overhead during profiling
- ⚠️ Requires understanding of the output