Rate Limiting Implementation Guide

Overview

This document outlines rate limiting strategies for LangServe applications to prevent abuse, manage resources, and control costs.

Implementation Options

Option 1: External Rate Limiting

Using a Reverse Proxy

Client → Nginx/Traefik → LangServe

Nginx Configuration Example:

http {
    # Shared zone keyed by client IP: 10 MB of state, sustained 10 requests/sec
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

    # Upstream assumed to be the LangServe app on localhost:8000; adjust to match
    upstream langserve_app {
        server 127.0.0.1:8000;
    }

    server {
        location /api/ {
            # Allow bursts of up to 20 requests without queuing delay
            limit_req zone=api_limit burst=20 nodelay;
            proxy_pass http://langserve_app;
        }
    }
}

Using API Gateway Services

  • AWS API Gateway: Offers throttling limits at the route level (see the boto3 sketch after this list)
  • Azure API Management: Provides rate limiting policies
  • Google Cloud Endpoints/Apigee: Advanced quota and spike arrest policies
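
For example, stage-level throttling on AWS API Gateway can be set from code with boto3. A minimal sketch, assuming an existing REST API (the API ID and stage name below are placeholders):

import boto3

apigw = boto3.client("apigateway")

# "/*/*" applies the setting to every resource and method in the stage
apigw.update_stage(
    restApiId="abc123",   # placeholder API ID
    stageName="prod",     # placeholder stage name
    patchOperations=[
        {"op": "replace", "path": "/*/*/throttling/rateLimit", "value": "10"},
        {"op": "replace", "path": "/*/*/throttling/burstLimit", "value": "20"},
    ],
)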

Using CDN/Edge Services

  • Cloudflare: Rate limiting rules based on client IP
  • Fastly: Rate limiting via VCL
  • Akamai: API gateway with quota management

Option 2: Application-Level Rate Limiting

FastAPI + Slowapi

from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
from slowapi.middleware import SlowAPIMiddleware
from langserve import add_routes

app = FastAPI()

# Key clients by IP; default_limits covers routes with no explicit limit
limiter = Limiter(key_func=get_remote_address, default_limits=["2/minute"])
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

# LangServe generates its route handlers inside add_routes, so they cannot
# carry the @limiter.limit decorator; the middleware enforces the default
# limits on them instead
app.add_middleware(SlowAPIMiddleware)

@app.get("/")
@limiter.limit("5/minute")
async def root(request: Request):
    return {"message": "Hello World"}

# my_chain is your Runnable; its routes inherit the 2/minute default limit
add_routes(app, my_chain, path="/my-chain")

FastAPI + Redis-based Rate Limiting

For distributed setups:

from functools import wraps

from fastapi import FastAPI, Request, HTTPException
import redis

app = FastAPI()
redis_client = redis.Redis(host='localhost', port=6379, db=0)

def rate_limit(requests_per_minute: int = 10):
    def decorator(func):
        @wraps(func)  # preserve the signature so FastAPI sees the real endpoint
        async def wrapper(request: Request, *args, **kwargs):
            client_ip = request.client.host
            key = f"rate_limit:{client_ip}"

            # INCR is atomic, so concurrent requests cannot race past the limit
            current = redis_client.incr(key)
            if current == 1:
                redis_client.expire(key, 60)  # start a fresh 60-second window
            if current > requests_per_minute:
                raise HTTPException(status_code=429, detail="Rate limit exceeded")

            return await func(request, *args, **kwargs)
        return wrapper
    return decorator

@app.get("/")
@rate_limit(requests_per_minute=5)
async def root(request: Request):
    return {"message": "Hello World"}

Option 3: Platform-Specific Rate Limiting

Render

Configure rate limiting in the Render dashboard for your service:

  • Navigate to your service
  • Go to Settings → Rate Limiting
  • Configure requests per second/minute

Heroku

Use a rate-limiting add-on such as "Threshold", or implement rate limiting in the application backed by a Redis add-on (as in the Redis example above).

Best Practices

  1. Tiered Rate Limiting:
    • API key based (higher limits for authenticated users)
    • IP-based (for anonymous users)
    • Endpoint-specific limits (higher for non-intensive operations)
  2. Response Headers: include rate limit headers to inform clients (a middleware sketch follows this list):

    X-RateLimit-Limit: 100
    X-RateLimit-Remaining: 95
    X-RateLimit-Reset: 1620046800

  3. Graceful Degradation:
    • Return a 429 status code with retry information
    • Include a "Retry-After" header
  4. Logging & Monitoring:
    • Track rate limit hits and near-misses
    • Set up alerts for abuse patterns
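
A minimal sketch of practices 2 and 3 as a FastAPI middleware, using an in-memory counter as a stand-in for a shared store such as Redis (the limit and window values are assumptions):

import time

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
LIMIT = 100   # assumed requests-per-window budget
WINDOW = 60   # window length in seconds
counters: dict[str, tuple[int, float]] = {}  # {client_ip: (count, window_reset_time)}

@app.middleware("http")
async def rate_limit_headers(request: Request, call_next):
    ip = request.client.host
    count, reset = counters.get(ip, (0, time.time() + WINDOW))
    if time.time() > reset:  # window expired: start a new one
        count, reset = 0, time.time() + WINDOW
    count += 1
    counters[ip] = (count, reset)

    if count > LIMIT:
        # Graceful degradation: 429 plus a Retry-After hint
        return JSONResponse(
            status_code=429,
            content={"detail": "Rate limit exceeded"},
            headers={"Retry-After": str(int(reset - time.time()))},
        )

    response = await call_next(request)
    response.headers["X-RateLimit-Limit"] = str(LIMIT)
    response.headers["X-RateLimit-Remaining"] = str(max(0, LIMIT - count))
    response.headers["X-RateLimit-Reset"] = str(int(reset))
    return response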

Implementation Decision Matrix

Factor         | External Solution     | Application Solution
---------------|-----------------------|-----------------------------
Team Size      | Small teams (simpler) | Larger teams (more control)
Traffic Volume | High                  | Low to Medium
Deployment     | Multiple services     | Single service
Complexity     | Lower                 | Higher
Budget         | Higher                | Lower

Rate Limit Considerations for LLM Applications

  1. Token-based Limits: Consider limiting by token count rather than just request count (a sketch follows this list)
  2. Streaming Considerations: Special handling for SSE/streaming responses
  3. Cost Management: Tighter limits on expensive model endpoints
  4. Queue vs Reject: For expensive operations, consider queuing rather than rejecting
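
A minimal sketch of idea 1, charging each client a per-minute token budget in Redis. The estimate_tokens helper is a crude placeholder; a real implementation would use the model's tokenizer (e.g., tiktoken):

import redis

r = redis.Redis(host="localhost", port=6379, db=0)
TOKEN_BUDGET_PER_MINUTE = 10_000  # assumed per-client budget

def estimate_tokens(prompt: str) -> int:
    # Rough heuristic: ~4 characters per token for English text
    return max(1, len(prompt) // 4)

def charge_tokens(client_id: str, prompt: str) -> bool:
    """Deduct the request's estimated token cost from a fixed 60-second
    window; return False when the budget is exhausted."""
    key = f"token_budget:{client_id}"
    cost = estimate_tokens(prompt)
    used = r.incrby(key, cost)  # atomic, so concurrent requests stay consistent
    if used == cost:            # first charge in this window
        r.expire(key, 60)
    return used <= TOKEN_BUDGET_PER_MINUTE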

Example Implementation Plan

  1. Start with platform-level rate limiting (e.g., Render limits)
  2. Add application-level authentication with tiered limits
  3. Progress to a dedicated API gateway as usage grows
  4. Implement token-based tracking for fine-grained control