Memory Growth in Python FastAPI Services: A Complete Guide

The Problem

Your Python/FastAPI service consumes RAM during request processing and doesn't release it back to the OS, even after the request completes. Over time (hours to days), memory usage grows from 500 MB → 1 GB → 2 GB until the service eventually crashes.

Symptoms:

  • ✅ Memory increases during processing (expected)
  • ❌ Memory stays high after processing (problem)
  • ❌ Each request adds a bit more memory (accumulation)
  • ❌ Eventually: OOM (Out of Memory) errors or service crash

Why This Happens

Python's Memory Management

Python doesn't immediately return memory to the operating system, even after garbage collection:

# Your code does this:
large_data = fetch_1gb_of_data()  # Memory: 1 GB
process(large_data)
del large_data                     # Python frees the object
gc.collect()                       # Garbage collector runs

# OS still sees: 1 GB used
# Python thinks: "I might need this memory again soon, I'll keep it"

This is BY DESIGN, not a bug! Python's memory allocator (pymalloc) holds onto freed memory to reuse for future allocations, avoiding expensive OS calls.
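You can watch this happen by printing the process RSS (resident set size) around an allocation. A minimal sketch, assuming the third-party psutil package is installed and reusing the hypothetical fetch_1gb_of_data/process helpers from the example above:

import gc

import psutil

proc = psutil.Process()

def rss_mb() -> float:
    # Resident set size as reported by the OS, in MiB
    return proc.memory_info().rss / (1024 * 1024)

print(f"baseline RSS:      {rss_mb():.0f} MiB")

large_data = fetch_1gb_of_data()   # hypothetical helper from the example above
process(large_data)
print(f"after processing:  {rss_mb():.0f} MiB")

del large_data
gc.collect()
# RSS frequently stays well above the baseline here: the allocator keeps freed
# memory around for reuse instead of returning it to the OS. How much is
# retained depends on object sizes and fragmentation.
print(f"after del + gc:    {rss_mb():.0f} MiB")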

Common Real Memory Leaks

Beyond Python's normal behavior, these cause actual leaks:

# 1. Global state accumulation
request_cache = []  # Grows forever

@app.post("/api")
async def endpoint(data: dict):
    request_cache.append(data)  # Never cleared!
    return {"status": "ok"}

# 2. Circular references
class Request:
    def __init__(self):
        self.response = Response(self)

class Response:
    def __init__(self, request):
        self.request = request  # Circular!

# 3. Large objects held in memory
processed_data = {}  # Global dict

@app.post("/api")
async def endpoint(request: LargeModel):
    processed_data[request.id] = request  # Keeps entire model
    return {"status": "ok"}

Solutions

1️⃣ Worker Recycling

Periodically kill and restart workers to release memory back to OS.

Requirements:

  • 2+ workers for zero downtime
  • With 1 worker: accepts brief downtime during restart

Implementation with Gunicorn:

#!/bin/bash
# entrypoint.sh

gunicorn src.main:app \
  --workers 4 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000 \
  --max-requests 1000 \
  --max-requests-jitter 100 \
  --timeout 30 \
  --graceful-timeout 30 \
  --keep-alive 5

Key parameters:

  • --max-requests 1000: Restart worker after handling 1000 requests
  • --max-requests-jitter 100: Adds a random value between 0 and 100 to each worker's limit, so workers restart somewhere between 1000 and 1100 requests instead of all at once
  • --graceful-timeout 30: Give worker 30 seconds to finish current requests before killing
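The same settings can also live in a gunicorn.conf.py, which Gunicorn picks up from the working directory (or via -c gunicorn.conf.py); config files are plain Python. A sketch mirroring the flags above:

# gunicorn.conf.py
bind = "0.0.0.0:8000"
workers = 4
worker_class = "uvicorn.workers.UvicornWorker"

# Recycle each worker after roughly 1000 requests; the jitter staggers restarts
max_requests = 1000
max_requests_jitter = 100

timeout = 30
graceful_timeout = 30
keepalive = 5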

How it works:

Worker 1: Request 1-1043 → Finishes current request → Shuts down
          New Worker 1 spawns while old one shutting down
Worker 2: Request 1-1012 → Finishes current request → Shuts down
          New Worker 2 spawns while old one shutting down
Worker 3: Request 1-1089 → Finishes current request → Shuts down
          New Worker 3 spawns while old one shutting down

Zero downtime: the remaining workers keep serving requests while each one is replaced
Memory released: the OS reclaims the memory when the old worker process exits

Alternative with Uvicorn:

#!/bin/bash
# entrypoint.sh

uvicorn src.main:app \
  --host 0.0.0.0 \
  --port 8000 \
  --workers 4 \
  --limit-max-requests 1000

Uvicorn's equivalent flag is --limit-max-requests; there is no jitter option, so if staggered restarts matter, prefer the Gunicorn setup above.

Note

With a single worker, recycling means a brief window with no process serving requests, so you need a slightly more involved approach; one option is sketched below.
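One way to handle the single-worker case: a small in-process watchdog that asks the server to shut down gracefully once RSS crosses a threshold, relying on the container's restart policy to bring it back (this accepts the brief downtime). A sketch assuming psutil and a FastAPI lifespan hook; the 3 GiB threshold and 60-second interval are arbitrary:

import asyncio
import os
import signal
from contextlib import asynccontextmanager

import psutil
from fastapi import FastAPI

RSS_LIMIT_BYTES = 3 * 1024**3    # arbitrary threshold; tune for your service
CHECK_INTERVAL_S = 60

async def memory_watchdog() -> None:
    proc = psutil.Process()
    while True:
        await asyncio.sleep(CHECK_INTERVAL_S)
        if proc.memory_info().rss > RSS_LIMIT_BYTES:
            # SIGTERM triggers a graceful shutdown; the container's restart
            # policy (e.g. restart: unless-stopped) starts a fresh process.
            os.kill(os.getpid(), signal.SIGTERM)
            return

@asynccontextmanager
async def lifespan(app: FastAPI):
    task = asyncio.create_task(memory_watchdog())
    yield
    task.cancel()

app = FastAPI(lifespan=lifespan)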

Pros:

  • ✅ Industry standard solution
  • ✅ Handles ALL memory issues (Python's behavior + real leaks)
  • ✅ No code changes required
  • ✅ Zero downtime with 2+ workers

Cons:

  • ⚠️ Requires 2+ workers for zero downtime
  • ⚠️ Doesn't fix root cause (mitigates it)

2️⃣ Memory Limits + Auto-Restart

Industry Standard • Used as a safety net by most production systems

Set hard memory limits and let Docker/Kubernetes restart the container when exceeded.

# docker-compose.yml
services:
  your-service:
    build: .
    deploy:
      resources:
        limits:
          memory: 4G              # Kill if exceeds 4GB
        reservations:
          memory: 2G              # Guaranteed minimum
    restart: unless-stopped       # Auto-restart after kill
    

What happens:

Memory usage: 2GB → 3GB → 3.9GB → 4GB
Kernel OOM killer: sends SIGKILL (the container is killed immediately)
Docker: restarts the container (per the restart policy)
Downtime: 5-10 seconds
Memory: back to the ~500MB baseline

Pros:

  • ✅ Prevents one service from killing entire server
  • ✅ Automatic recovery
  • ✅ No code changes
  • ✅ Works as failsafe with Solution 1

Cons:

  • ❌ Brief service interruption when limit hit
  • ❌ Doesn't fix root cause

Best practice: Set the limit well above normal usage so it only catches runaway cases, not ordinary spikes; the sketch below helps establish what normal usage looks like.
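A small debug endpoint that reports the current RSS is an easy thing to chart or alert on while picking that limit. A sketch, again assuming psutil (with multiple workers, each call reports whichever worker happened to serve it):

import psutil
from fastapi import FastAPI

app = FastAPI()

@app.get("/debug/memory")
async def memory_stats():
    # Resident set size of this worker process, in MiB
    rss_mib = psutil.Process().memory_info().rss / (1024 * 1024)
    return {"rss_mib": round(rss_mib, 1)}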


3️⃣ Scheduled Container Restarts

Common Practice • Good for: Low-traffic services or known memory patterns

Restart the container during low-traffic periods.

With Cron:

# Restart every day at 3 AM
0 3 * * * docker restart your-service-container

Pros:

  • ✅ Simple to implement
  • ✅ Predictable restart time
  • ✅ No code changes

Cons:

  • ⚠️ Requires planned downtime window
  • ⚠️ Only works for services with predictable traffic patterns
  • ⚠️ May restart too frequently or not frequently enough

Warning

Rarely practical for real-world, always-on services; treat it as a stopgap at best.


4️⃣ Manual Garbage Collection

Note

This would be the last option for me

Force Python's garbage collector to run more frequently.

Per-endpoint

Use only for heavy usage endpoints.

import gc
from functools import wraps

def with_gc(func):
    """Force GC after endpoint execution"""
    @wraps(func)
    async def wrapper(*args, **kwargs):
        try:
            return await func(*args, **kwargs)
        finally:
            gc.collect()
    return wrapper

# Only use on known heavy endpoints
@app.post("/api/heavy-processing")
@with_gc
async def heavy_processing(data: LargeModel):
    result = process_large_data(data)
    return {"result": result}

Pros:

  • ✅ Can help with circular references
  • ✅ No infrastructure changes
  • ✅ Easy to add/remove

Cons:

  • ❌ Blocks the event loop (causes latency spikes)
  • ❌ May not release memory to OS anyway
  • ❌ Can hurt throughput if overused
  • ❌ Controversial - many senior engineers avoid this

When to consider:

  • You have specific heavy endpoints with circular references

When to avoid:

  • As a first solution (use worker recycling instead)
  • On every request (too much overhead)
  • Without first measuring the impact (see the measurement sketch below)
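A minimal way to measure that impact before committing: time a forced collection and compare RSS around it (psutil assumed again):

import gc
import time

import psutil

proc = psutil.Process()

rss_before = proc.memory_info().rss
start = time.perf_counter()
unreachable = gc.collect()    # returns the number of unreachable objects found
pause_ms = (time.perf_counter() - start) * 1000
rss_after = proc.memory_info().rss

print(f"gc.collect(): {unreachable} unreachable objects, {pause_ms:.1f} ms pause, "
      f"RSS {rss_before / 2**20:.0f} MiB -> {rss_after / 2**20:.0f} MiB")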

5️⃣ Profile and Fix Root Cause

Note

This helps us find the actual root cause

Use profiling to identify and fix actual memory leaks in your code.

Setup:

pip install memory-profiler

Profile specific endpoint:

from memory_profiler import profile

cache = {}  # illustrative global cache (the suspected leak)

@app.post("/api/suspected-leak")
@profile  # keep @profile below the route decorator so the profiled wrapper is what FastAPI registers
async def suspected_leak(data: LargeModel):
    parsed = data.model_dump()

    result = await process_data(parsed)

    cache[data.id] = result

    return {"status": "ok"}

Run and analyze:

# Run your app with profiling (works when your_app.py starts the server itself,
# e.g. by calling uvicorn.run(app))
python -m memory_profiler your_app.py

# Or send output to a file
python -m memory_profiler your_app.py > memory.log 2>&1

Output:

Line #    Mem usage    Increment   Line Contents
================================================
     3     50.2 MiB     50.2 MiB   @profile
                                   async def suspected_leak(data):
     5     50.5 MiB      0.3 MiB       parsed = data.model_dump()
     8    150.8 MiB    100.3 MiB       result = await process_data(parsed)
    11    251.0 MiB    100.2 MiB       cache[data.id] = result  # ← LEAK!
    14    251.0 MiB      0.0 MiB       return {"status": "ok"}
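If adding a dependency is not an option, the stdlib tracemalloc module gives a similar view by diffing snapshots taken before and after the suspect work. A minimal sketch:

import tracemalloc

tracemalloc.start()

before = tracemalloc.take_snapshot()
# ... exercise the suspect endpoint / function here ...
after = tracemalloc.take_snapshot()

# Source lines whose allocations grew the most between the two snapshots
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)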

Common issues found:

# Issue 1: Unbounded cache
cache = {}  # Grows forever

# Fix: Add size limit
from collections import OrderedDict

cache = OrderedDict()
MAX_CACHE = 1000

def add_to_cache(key, value):
    if len(cache) >= MAX_CACHE:
        cache.popitem(last=False)  # Remove oldest
    cache[key] = value

# Issue 2: Holding references to large objects
requests_log = []  # Keeps all requests

# Fix: Only keep what you need
requests_log = []
MAX_LOG = 100

def log_request(request_id):
    if len(requests_log) >= MAX_LOG:
        requests_log.pop(0)
    requests_log.append(request_id)  # Just the ID, not entire object

# Issue 3: Circular references
class Handler:
    def __init__(self):
        self.processor = Processor(self)  # Circular!

# Fix: Use weak references
import weakref

class Handler:
    def __init__(self):
        self.processor = Processor(weakref.ref(self))
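For completeness, the consumer side has to dereference the weak reference by calling it and handle the case where the handler has already been collected. A sketch of what the Processor half might look like (Processor and on_event are hypothetical names, not part of the example above):

import weakref

class Processor:
    def __init__(self, handler_ref: weakref.ref):
        self._handler_ref = handler_ref   # does not keep the Handler alive

    def notify(self, event):
        handler = self._handler_ref()     # call the ref to get the object, or None
        if handler is None:
            return                        # Handler was garbage collected
        handler.on_event(event)           # hypothetical callback on Handler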

Pros:

  • ✅ Fixes actual root cause
  • ✅ Improves overall code quality
  • ✅ One-time effort (fix once, solved forever)

Cons:

  • ⚠️ Time-consuming (profiling and analysis)
  • ⚠️ Performance overhead during profiling
  • ⚠️ Requires understanding of output