This document outlines rate limiting strategies for LangServe applications to prevent abuse, manage resources, and control costs.
Client → Nginx/Traefik → LangServe
Nginx Configuration Example:

```nginx
http {
    # Track clients by IP in a 10 MB shared zone at a steady 10 requests/second
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

    server {
        location /api/ {
            # Allow short spikes of up to 20 extra requests without queueing delay
            limit_req zone=api_limit burst=20 nodelay;
            proxy_pass http://langserve_app;
        }
    }
}
```
- AWS API Gateway: Offers throttling limits at the route level
- Azure API Management: Provides rate limiting policies
- Google Cloud Endpoints/Apigee: Advanced quota and spike arrest policies
- Cloudflare: Rate limiting rules based on client IP
- Fastly: Rate limiting via VCL
- Akamai: API gateway with quota management
```python
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.middleware import SlowAPIMiddleware
from slowapi.util import get_remote_address
from langserve import add_routes

app = FastAPI()

# SlowAPIMiddleware enforces default_limits across the app's routes,
# including the ones that add_routes registers below.
limiter = Limiter(key_func=get_remote_address, default_limits=["2/minute"])
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
app.add_middleware(SlowAPIMiddleware)

@app.get("/")
@limiter.limit("5/minute")
async def root(request: Request):
    return {"message": "Hello World"}

# add_routes has no rate-limiting parameter of its own; the LangServe
# routes under /my-chain are covered by the default limit above.
add_routes(
    app,
    my_chain,  # my_chain: any Runnable defined elsewhere
    path="/my-chain",
)
```
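By default slowapi keeps its counters in process memory, so each worker enforces limits independently. In multi-worker deployments the limiter can point at shared storage through its `storage_uri` option (a sketch, assuming a Redis instance is reachable at the URI shown):

```python
limiter = Limiter(
    key_func=get_remote_address,
    storage_uri="redis://localhost:6379",  # share counters across workers
)
```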
For distributed setups:
```python
import functools

import redis
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
redis_client = redis.Redis(host="localhost", port=6379, db=0)

def rate_limit(requests_per_minute: int = 10):
    """Fixed-window limiter keyed on client IP, with counters shared in Redis."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(request: Request, *args, **kwargs):
            client_ip = request.client.host
            key = f"rate_limit:{client_ip}"
            # Get current count for this client's window
            current = redis_client.get(key)
            if current is None:
                # First request from this client: open a 60-second window
                redis_client.set(key, 1, ex=60)
            elif int(current) >= requests_per_minute:
                raise HTTPException(status_code=429, detail="Rate limit exceeded")
            else:
                redis_client.incr(key)
            return await func(request, *args, **kwargs)
        return wrapper
    return decorator

@app.get("/")
@rate_limit(requests_per_minute=5)
async def root(request: Request):
    return {"message": "Hello World"}
```
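Note that the read-then-write sequence above is not atomic: two concurrent requests can both see `None` and each start a fresh window. A tighter variant with the same fixed-window semantics increments first and sets the expiry only on the first hit (`check_rate_limit` is an illustrative helper, not a library API):

```python
def check_rate_limit(client_ip: str, requests_per_minute: int = 10) -> None:
    """Atomic fixed-window check: INCR first, expire on the first hit."""
    key = f"rate_limit:{client_ip}"
    count = redis_client.incr(key)  # atomic; creates the key at 1 if absent
    if count == 1:
        redis_client.expire(key, 60)  # open the 60-second window
    if count > requests_per_minute:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
```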
Configure rate limiting in the Render dashboard for your service:
- Navigate to your service
- Go to Settings → Rate Limiting
- Configure requests per second/minute
Use an add-on such as "Threshold", or implement rate limiting yourself with a Redis add-on (as in the Redis-backed example above).
- Tiered Rate Limiting (see the key-function sketch after this list):
  - API key based (higher limits for authenticated users)
  - IP-based (for anonymous users)
  - Endpoint-specific limits (higher for non-intensive operations)
- Response Headers:
  - Include rate limit headers to inform clients:

    ```
    X-RateLimit-Limit: 100
    X-RateLimit-Remaining: 95
    X-RateLimit-Reset: 1620046800
    ```
- Graceful Degradation:
  - Return a 429 status code with retry information
  - Include a "Retry-After" header
- Logging & Monitoring:
  - Track rate limit hits and near-misses
  - Set up alerts for abuse patterns
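The tiered-key and header practices above can be sketched with slowapi: key on an API key when one is present, fall back to the client IP otherwise, and let `headers_enabled` emit the `X-RateLimit-*` headers (plus `Retry-After` on 429s). The `X-API-Key` header name and the key prefixes are illustrative choices, not a fixed convention:

```python
from fastapi import Request
from slowapi import Limiter

def tiered_key(request: Request) -> str:
    # Authenticated clients are tracked per API key; anonymous clients per IP.
    api_key = request.headers.get("X-API-Key")
    return f"key:{api_key}" if api_key else f"ip:{request.client.host}"

# headers_enabled=True asks slowapi to attach X-RateLimit-Limit/-Remaining/
# -Reset to responses and a Retry-After header to 429s.
limiter = Limiter(key_func=tiered_key, headers_enabled=True)
```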
| Factor | External Solution | Application Solution |
|---|---|---|
| Team Size | Small teams (simpler) | Larger teams (more control) |
| Traffic Volume | High | Low to Medium |
| Deployment | Multiple services | Single service |
| Complexity | Lower | Higher |
| Budget | Higher | Lower |
- Token-based Limits: Consider limiting by token count rather than just request count (see the sketch after this list)
- Streaming Considerations: Special handling for SSE/streaming responses
- Cost Management: Tighter limits on expensive model endpoints
- Queue vs Reject: For expensive operations, consider queuing rather than rejecting
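A sketch of token-based tracking: debit a per-client token budget in Redis instead of counting requests. This reuses `redis_client` from the example above; `count_tokens`, `charge_tokens`, and the budget size are illustrative assumptions, not part of LangServe:

```python
def charge_tokens(client_id: str, text: str, budget_per_minute: int = 10_000) -> None:
    """Debit a per-client token budget; raise 429 once it is exhausted."""
    tokens = count_tokens(text)  # hypothetical tokenizer helper (e.g. via tiktoken)
    key = f"token_budget:{client_id}"
    used = redis_client.incrby(key, tokens)  # atomic add
    if used == tokens:
        redis_client.expire(key, 60)  # first charge opens a one-minute window
    if used > budget_per_minute:
        raise HTTPException(status_code=429, detail="Token budget exceeded")
```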
- Start with platform-level rate limiting (e.g., Render limits)
- Add application-level authentication with tiered limits
- Progress to dedicated API gateway as usage grows
- Implement token-based tracking for fine-grained control