This document outlines rate limiting strategies for LangServe applications to prevent abuse, manage resources, and control costs.
Client → Nginx/Traefik → LangServe
Nginx Configuration Example:

```nginx
http {
    # Track clients by IP in a 10 MB shared zone at a steady 10 requests/second
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

    server {
        location /api/ {
            # Allow short spikes of up to 20 extra requests without queueing delay
            limit_req zone=api_limit burst=20 nodelay;
            proxy_pass http://langserve_app;
        }
    }
}
```
- AWS API Gateway: Offers throttling limits at the route level
- Azure API Management: Provides rate limiting policies
- Google Cloud Endpoints/Apigee: Advanced quota and spike arrest policies
- Cloudflare: Rate limiting rules based on client IP
- Fastly: Rate limiting via VCL
- Akamai: API gateway with quota management
```python
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.middleware import SlowAPIMiddleware
from slowapi.util import get_remote_address
from langserve import add_routes

app = FastAPI()

# SlowAPIMiddleware enforces default_limits across the app's routes,
# including the ones that add_routes registers below.
limiter = Limiter(key_func=get_remote_address, default_limits=["2/minute"])
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
app.add_middleware(SlowAPIMiddleware)

@app.get("/")
@limiter.limit("5/minute")
async def root(request: Request):
    return {"message": "Hello World"}

# add_routes has no rate-limiting parameter of its own; the LangServe
# routes under /my-chain are covered by the default limit above.
add_routes(
    app,
    my_chain,  # my_chain: any Runnable defined elsewhere
    path="/my-chain",
)
```
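By default slowapi keeps its counters in process memory, so each worker enforces limits independently. In multi-worker deployments the limiter can point at shared storage through its `storage_uri` option (a sketch, assuming a Redis instance is reachable at the URI shown):

```python
limiter = Limiter(
    key_func=get_remote_address,
    storage_uri="redis://localhost:6379",  # share counters across workers
)
```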
For distributed setups:
```python
import functools

import redis
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
redis_client = redis.Redis(host="localhost", port=6379, db=0)

def rate_limit(requests_per_minute: int = 10):
    """Fixed-window limiter keyed on client IP, with counters shared in Redis."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(request: Request, *args, **kwargs):
            client_ip = request.client.host
            key = f"rate_limit:{client_ip}"
            # Get current count for this client's window
            current = redis_client.get(key)
            if current is None:
                # First request from this client: open a 60-second window
                redis_client.set(key, 1, ex=60)
            elif int(current) >= requests_per_minute:
                raise HTTPException(status_code=429, detail="Rate limit exceeded")
            else:
                redis_client.incr(key)
            return await func(request, *args, **kwargs)
        return wrapper
    return decorator

@app.get("/")
@rate_limit(requests_per_minute=5)
async def root(request: Request):
    return {"message": "Hello World"}
```
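Note that the read-then-write sequence above is not atomic: two concurrent requests can both see `None` and each start a fresh window. A tighter variant with the same fixed-window semantics increments first and sets the expiry only on the first hit (`check_rate_limit` is an illustrative helper, not a library API):

```python
def check_rate_limit(client_ip: str, requests_per_minute: int = 10) -> None:
    """Atomic fixed-window check: INCR first, expire on the first hit."""
    key = f"rate_limit:{client_ip}"
    count = redis_client.incr(key)  # atomic; creates the key at 1 if absent
    if count == 1:
        redis_client.expire(key, 60)  # open the 60-second window
    if count > requests_per_minute:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
```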
Configure rate limiting in the Render dashboard for your service:
- Navigate to your service
- Go to Settings → Rate Limiting
- Configure requests per second/minute
Use an add-on such as "Threshold", or implement rate limiting yourself with a Redis add-on (as in the Redis-backed example above).
- Tiered Rate Limiting (see the key-function sketch after this list):
  - API key based (higher limits for authenticated users)
  - IP-based (for anonymous users)
  - Endpoint-specific limits (higher for non-intensive operations)
- Response Headers:
  - Include rate limit headers to inform clients:

    ```
    X-RateLimit-Limit: 100
    X-RateLimit-Remaining: 95
    X-RateLimit-Reset: 1620046800
    ```
- Graceful Degradation:
  - Return a 429 status code with retry information
  - Include a "Retry-After" header
- Logging & Monitoring:
  - Track rate limit hits and near-misses
  - Set up alerts for abuse patterns
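The tiered-key and header practices above can be sketched with slowapi: key on an API key when one is present, fall back to the client IP otherwise, and let `headers_enabled` emit the `X-RateLimit-*` headers (plus `Retry-After` on 429s). The `X-API-Key` header name and the key prefixes are illustrative choices, not a fixed convention:

```python
from fastapi import Request
from slowapi import Limiter

def tiered_key(request: Request) -> str:
    # Authenticated clients are tracked per API key; anonymous clients per IP.
    api_key = request.headers.get("X-API-Key")
    return f"key:{api_key}" if api_key else f"ip:{request.client.host}"

# headers_enabled=True asks slowapi to attach X-RateLimit-Limit/-Remaining/
# -Reset to responses and a Retry-After header to 429s.
limiter = Limiter(key_func=tiered_key, headers_enabled=True)
```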
| Factor | External Solution | Application Solution |
|---|---|---|
| Team Size | Small teams (simpler) | Larger teams (more control) |
| Traffic Volume | High | Low to Medium |
| Deployment | Multiple services | Single service |
| Complexity | Lower | Higher |
| Budget | Higher | Lower |
- Token-based Limits: Consider limiting by token count rather than just request count (see the sketch after this list)
- Streaming Considerations: Special handling for SSE/streaming responses
- Cost Management: Tighter limits on expensive model endpoints
- Queue vs Reject: For expensive operations, consider queuing rather than rejecting
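A sketch of token-based tracking: debit a per-client token budget in Redis instead of counting requests. This reuses `redis_client` from the example above; `count_tokens`, `charge_tokens`, and the budget size are illustrative assumptions, not part of LangServe:

```python
def charge_tokens(client_id: str, text: str, budget_per_minute: int = 10_000) -> None:
    """Debit a per-client token budget; raise 429 once it is exhausted."""
    tokens = count_tokens(text)  # hypothetical tokenizer helper (e.g. via tiktoken)
    key = f"token_budget:{client_id}"
    used = redis_client.incrby(key, tokens)  # atomic add
    if used == tokens:
        redis_client.expire(key, 60)  # first charge opens a one-minute window
    if used > budget_per_minute:
        raise HTTPException(status_code=429, detail="Token budget exceeded")
```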
- Start with platform-level rate limiting (e.g., Render limits)
- Add application-level authentication with tiered limits
- Progress to dedicated API gateway as usage grows
- Implement token-based tracking for fine-grained control