Ticket: MVP-761 — Lock down API to prevent IP theft
Date: 2026-04-25
Status: In Progress
Stakeholders: Engineering, Product, SEO
The public posts API (/v1/posts) is fully unauthenticated with a limit parameter
accepting up to 10,000 posts per request. A single IP can continuously dump the
entire content catalogue with no friction. Cloudflare sits in front of production,
giving us a partial shield, but no deliberate strategy exists for:
- Distinguishing legitimate SEO crawlers (Googlebot, Bingbot) from scrapers
- Enforcing per-IP and per-session rate limits
- Publishing a robots.txt policy
- Validating crawler identity (reverse-DNS verification)
| Class | Examples | Intent | Should we allow? |
|---|---|---|---|
| SEO crawlers | Googlebot, Bingbot, DuckDuckBot, Applebot, Baiduspider | Index content for search | Yes — full access to public pages |
| AI training crawlers | GPTBot, CCBot, ClaudeBot, Bytespider, Diffbot | Harvest training data | Conditional — see policy below |
| Monitoring / uptime | UptimeRobot, Pingdom, StatusCake | Health checks | Yes — rate-limited |
| Legitimate aggregators | Feedly, Pocket, Flipboard | Content syndication | Yes — with API key (future) |
| Content scrapers | Unidentified bots, rotating proxies | IP theft / bulk harvest | Block |
| Malicious bots | Spam bots, credential stuffers | Attack surface | Block |
| Endpoint | Auth | limit max | Risk |
|---|---|---|---|
| GET /v1/posts | None | 10,000 | Critical |
| GET /v1/next/posts | None | 10,000 | Critical |
| GET /v1/posts/{slug} | None | N/A | Low |
| GET /sitemap.xml | None | N/A | Intentional |
| GET /sitemap-{page}.xml | None | N/A | Intentional |
- Production: Cloudflare (Flexible SSL) → ALB → EKS (sh-production)
- Dev/Staging: ACM cert → ALB → EKS (no Cloudflare)
- Rate limiting: None currently
- robots.txt: Does not exist
Cloudflare sits in front of production and provides several free/low-cost knobs:
Enable under Security → Bots → Bot Fight Mode. Automatically challenges known bad bots using JS challenges. Zero code changes.
Configure under Security → WAF → Rate Limiting:
Rule: API Rate Limit
Match: http.request.uri.path matches "^/v1/"
Rate: 60 requests / 1 minute / IP
Action: Block (429)
Duration: 10 minutes
Block known scraper user-agents while explicitly allowing verified crawlers:
# Allow verified SEO crawlers (validate via reverse DNS — see Layer 3)
if user_agent contains "Googlebot" → allow
if user_agent contains "Bingbot" → allow
# Challenge unknown high-volume bots
if http.request.uri.path matches "^/v1/" AND
not cf.client.bot → JS Challenge
Cloudflare maintains a Verified Bots list it updates automatically. Under Security → Bots, enable "Allow Verified Bots" to whitelist Googlebot et al. without manual UA matching.
The team discussed Gubernator — a stateless, cloud-native distributed rate limiter (no Redis/Memcached required).
# k8s/gubernator-deployment.yaml (sh-production namespace)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gubernator
  namespace: sh-production
spec:
  replicas: 2
  ...

The FastAPI app calls Gubernator via HTTP/gRPC before processing /v1/posts
requests. Gubernator is stateless: each rate-limit request carries its own config,
so there is no coordination overhead.
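For intuition about what a token-bucket limit with a burst allowance does, here is a minimal in-process sketch. It is not Gubernator's implementation: the `TokenBucket` class, its parameter names, and the caller-supplied clock are illustrative assumptions. The figures (a burst of 20, refilling at 1 token per second, i.e. 60 per minute) are borrowed from the limits proposed in this document.

```python
class TokenBucket:
    """Illustrative token bucket (hypothetical helper, not Gubernator's code)."""

    def __init__(self, capacity: float, refill_per_sec: float) -> None:
        self.capacity = capacity            # maximum burst size
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity              # start with a full bucket
        self.last = 0.0                     # time of the previous call

    def allow(self, now: float, hits: int = 1) -> bool:
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= hits:
            self.tokens -= hits
            return True
        return False


bucket = TokenBucket(capacity=20, refill_per_sec=1.0)

# 25 requests arriving at the same instant: only the burst allowance gets through.
burst_results = [bucket.allow(now=0.0) for _ in range(25)]
allowed_in_burst = burst_results.count(True)

# One second later a single token has been refilled.
allowed_after_1s = bucket.allow(now=1.0)

print(allowed_in_burst, allowed_after_1s)  # 20 True
```

The clock is passed in rather than read from `time.monotonic()` so the behaviour is easy to reason about and test; a real limiter would read the clock itself.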
| Parameter | Current | Proposed |
|---|---|---|
| limit max per request | 10,000 posts | 50 posts (unauthenticated) / 200 (API key) |
| Requests per IP per minute | Unlimited | 60 req/min |
| Requests per IP per hour | Unlimited | 500 req/hour |
| Burst allowance | None | 20 req burst |
Wajahat's math: 50 posts × 60 req/min = 3,000 posts/min max for a persistent scraper, which is still high. Adding an hourly cap of 500 requests brings the worst case to 500 × 50 = 25,000 posts/hour from a single IP, and realistic usage by medical professionals is far below that.
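The arithmetic above can be checked with a few lines of Python. The rate figures come from the proposed limits; the catalogue size is a hypothetical placeholder, since this document does not state how many posts exist.

```python
import math

POSTS_PER_REQUEST = 50      # proposed unauthenticated `limit` ceiling
REQ_PER_MINUTE = 60         # proposed per-IP minute cap
REQ_PER_HOUR = 500          # proposed per-IP hourly cap
CURRENT_LIMIT_MAX = 10_000  # current `limit` ceiling
HYPOTHETICAL_CATALOGUE_SIZE = 100_000  # assumption, for illustration only

# Worst case with only the per-minute cap in place.
posts_per_minute = POSTS_PER_REQUEST * REQ_PER_MINUTE

# Worst case once the hourly cap applies: the smaller request budget wins.
requests_per_hour = min(REQ_PER_HOUR, REQ_PER_MINUTE * 60)
posts_per_hour = POSTS_PER_REQUEST * requests_per_hour

# A scraper running flat out exhausts the hourly budget after this many minutes.
minutes_to_hit_hourly_cap = REQ_PER_HOUR / REQ_PER_MINUTE

# Time for a single IP to dump the hypothetical catalogue, before and after.
requests_needed_today = math.ceil(HYPOTHETICAL_CATALOGUE_SIZE / CURRENT_LIMIT_MAX)
hours_needed_after = HYPOTHETICAL_CATALOGUE_SIZE / posts_per_hour

print(posts_per_minute, posts_per_hour)           # 3000 25000
print(round(minutes_to_hit_hourly_cap, 2))        # 8.33
print(requests_needed_today, hours_needed_after)  # 10 4.0
```

Under these assumptions the caps turn a ten-request dump into a multi-hour job per IP. They slow bulk harvesting rather than prevent it.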
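# app/middleware/rate_limit.py
import time

import httpx
from fastapi import HTTPException, Request

GUBERNATOR_URL = "http://gubernator.sh-production.svc.cluster.local:8080"


async def check_rate_limit(request: Request, key: str, limit: int, duration: int):
    """Raise HTTP 429 if `key` exceeds `limit` hits per `duration` (milliseconds)."""
    payload = {
        "requests": [{
            "name": "api_posts",
            "unique_key": key,
            "hits": 1,
            "limit": limit,
            "duration": duration,
            "algorithm": 0,  # TOKEN_BUCKET
            "behavior": 0,
        }]
    }
    # Use the async client so the check does not block the event loop.
    # Confirm this path against the deployed Gubernator version; the upstream
    # README documents POST /v1/GetRateLimits.
    async with httpx.AsyncClient(timeout=0.5) as client:
        resp = await client.post(f"{GUBERNATOR_URL}/v1/RateLimits/Check", json=payload)
    rl = resp.json()["responses"][0]
    # The JSON gateway may encode the status enum as a name or a number.
    if rl.get("status") in (1, "OVER_LIMIT"):
        # Assumption: reset_time is a Unix timestamp in milliseconds, whereas
        # Retry-After expects whole seconds from now.
        reset_ms = int(rl.get("reset_time", 0))
        retry_after = max(1, int(reset_ms / 1000 - time.time())) if reset_ms else 60
        raise HTTPException(status_code=429, detail="Rate limit exceeded", headers={
            "Retry-After": str(retry_after),
            "X-RateLimit-Limit": str(limit),
            "X-RateLimit-Remaining": "0",
        })

User-Agent strings are trivially spoofable. Legitimate crawlers publish their IP ranges and support reverse-DNS verification: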
nslookup <crawler-IP>
# Returns: crawl-66-249-64-1.googlebot.com
nslookup crawl-66-249-64-1.googlebot.com
# Must resolve back to same IP — if not, it's a fake Googlebot
| Crawler | Verification | Published IP ranges |
|---|---|---|
| Googlebot | Reverse DNS → *.googlebot.com | Google IP ranges JSON |
| Bingbot | Reverse DNS → *.search.msn.com | Bing crawler IPs |
| DuckDuckBot | Reverse DNS → *.duckduckgo.com | No published range; DNS verify only |
| Applebot | Reverse DNS → applebot.apple.com | Apple subnet list |
| Baiduspider | Reverse DNS → *.baidu.com | DNS verify only |
Implementation: A lightweight FastAPI middleware or Cloudflare Worker can perform async reverse-DNS checks for known bot UAs and either allow or downgrade them to a lower rate limit tier.
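A minimal sketch of the forward-confirmed reverse-DNS check described above, using only the standard library. The function names, the `CRAWLER_SUFFIXES` mapping, and the injectable `reverse`/`forward` resolvers are assumptions made for illustration; the hostname suffixes are copied from the table above and cover only two crawlers.

```python
import socket
from typing import Callable, Optional, Set

# Hostname suffixes taken from the verification table; extend as needed.
CRAWLER_SUFFIXES = {
    "googlebot": (".googlebot.com",),
    "bingbot": (".search.msn.com",),
}


def claimed_crawler(user_agent: str) -> Optional[str]:
    """Return the crawler name the User-Agent claims to be, if any."""
    ua = user_agent.lower()
    for name in CRAWLER_SUFFIXES:
        if name in ua:
            return name
    return None


def _reverse_lookup(ip: str) -> str:
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(host: str) -> Set[str]:
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def is_verified_crawler(
    ip: str,
    user_agent: str,
    reverse: Callable[[str], str] = _reverse_lookup,
    forward: Callable[[str], Set[str]] = _forward_lookup,
) -> bool:
    """True only if the UA claims a known crawler AND DNS confirms it."""
    name = claimed_crawler(user_agent)
    if name is None:
        return False
    try:
        host = reverse(ip).rstrip(".").lower()
        # Step 1: the PTR record must sit under the crawler's domain.
        if not host.endswith(CRAWLER_SUFFIXES[name]):
            return False
        # Step 2: the hostname must resolve back to the same IP.
        return ip in forward(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

The default resolvers block, so in an async FastAPI handler the call would need to run off the event loop (for example `await asyncio.to_thread(is_verified_crawler, ip, ua)`), ideally with the result cached per IP. A request that fails the check can then be dropped to the lower rate-limit tier rather than blocked outright.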
No robots.txt currently exists. Proposed policy:
# sherpahealthy.com robots.txt
# Last updated: 2026-04-25
# --- Legitimate SEO crawlers: full access ---
User-agent: Googlebot
Allow: /
Crawl-delay: 1
User-agent: Bingbot
Allow: /
Crawl-delay: 2
User-agent: DuckDuckBot
Allow: /
Crawl-delay: 2
User-agent: Applebot
Allow: /
Crawl-delay: 2
User-agent: Baiduspider
Allow: /
Crawl-delay: 5
# --- AI training crawlers: disallow (opt-out) ---
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: omgili
Disallow: /
User-agent: omgilibot
Disallow: /
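# --- Block admin and API internals for crawlers not named above ---
# A crawler that matches a named group follows only that group, so repeat
# these Disallow lines in the groups above if they must also bind Googlebot etc.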
User-agent: *
Disallow: /admin/
Disallow: /v1/admin/
Disallow: /v1/posts/next/
Disallow: /docs
Disallow: /redoc
Disallow: /openapi.json
Allow: /v1/posts
Allow: /sitemap.xml
Allow: /sitemap-*.xml
Sitemap: https://sherpahealthy.com/sitemap.xml

Note: robots.txt is advisory only; malicious bots ignore it. It signals legal intent (CFAA protection) and prevents accidental indexing of admin pages. Enforcement happens in Cloudflare + Gubernator.
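Before publishing, it may be worth checking how a standards-following parser reads the policy. Under the Robots Exclusion Protocol (RFC 9309) a crawler obeys only the group that best matches its user agent and falls back to `User-agent: *` only when no named group matches. The sketch below feeds an abbreviated copy of the policy to Python's standard-library `urllib.robotparser`; the shortened policy text and the `SomeOtherBot` agent name are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated version of the proposed policy, one group per crawler class.
POLICY = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
Allow: /v1/posts
"""

parser = RobotFileParser()
parser.parse(POLICY.splitlines())

# Googlebot matches its own group, so the "*" Disallow lines do not apply to it.
googlebot_admin = parser.can_fetch("Googlebot", "/admin/")

# An unnamed crawler falls through to the "*" group.
other_admin = parser.can_fetch("SomeOtherBot", "/admin/")
other_posts = parser.can_fetch("SomeOtherBot", "/v1/posts")

# GPTBot is denied everything by its own group.
gptbot_posts = parser.can_fetch("GPTBot", "/v1/posts")

print(googlebot_admin, other_admin, other_posts, gptbot_posts)
# True False True False
```

The first result is the notable one: with this layout, `/admin/` is not disallowed for Googlebot. If that is not intended, the `Disallow` lines would need to be repeated inside each named group.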
Add a robots.txt endpoint in FastAPI (environment-aware):
# app/sherpahealthy/robots.py
from fastapi import APIRouter
from fastapi.responses import PlainTextResponse

from app.config import settings

robots_router = APIRouter()

ROBOTS_PRODUCTION = """..."""  # full policy above
ROBOTS_NONPROD = """User-agent: *\nDisallow: /\n"""  # block all in dev/staging


@robots_router.get("/robots.txt", response_class=PlainTextResponse)
async def robots_txt():
    if settings.ENV == "production":
        return PlainTextResponse(ROBOTS_PRODUCTION, media_type="text/plain")
    return PlainTextResponse(ROBOTS_NONPROD, media_type="text/plain")

Dev/staging must return Disallow: / for all agents to prevent test content
from being indexed by search engines.
Immediate change: reduce the hard ceiling in feeds_http.py:
# BEFORE
limit: int = Query(100, description="Number of posts to return", ge=1, le=10000)
# AFTER (unauthenticated)
limit: int = Query(20, description="Number of posts to return", ge=1, le=50)

Future: Allow higher limits (le=500) for requests presenting a valid API key,
enabling legitimate partners/aggregators without exposing the full 10k ceiling.
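Because a static `le=` bound on `Query` cannot vary per caller, a tiered ceiling would probably have to be enforced inside the handler. A minimal sketch of that logic follows; the `effective_limit` function and its constants are hypothetical. The 500 figure is the `le=500` mentioned above, while the limits table proposes 200 for API keys, so treat the constant as a placeholder until that is settled.

```python
from typing import Optional

DEFAULT_LIMIT = 20              # default from the proposed Query(...)
MAX_LIMIT_UNAUTHENTICATED = 50  # proposed ceiling without an API key
MAX_LIMIT_API_KEY = 500         # placeholder; see note in the lead-in


def effective_limit(requested: Optional[int], has_valid_api_key: bool) -> int:
    """Clamp the requested page size to the ceiling for the caller's tier."""
    ceiling = MAX_LIMIT_API_KEY if has_valid_api_key else MAX_LIMIT_UNAUTHENTICATED
    if requested is None:
        return DEFAULT_LIMIT
    # Clamp to [1, ceiling] instead of rejecting the request.
    return max(1, min(requested, ceiling))


print(effective_limit(10_000, has_valid_api_key=False))  # 50
print(effective_limit(10_000, has_valid_api_key=True))   # 500
print(effective_limit(None, has_valid_api_key=False))    # 20
```

Clamping silently is one design choice; returning a 422 for out-of-range values, as FastAPI's `le=` validation does, is the stricter alternative and makes the cap visible to clients.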
| Option | Effort | Cost | Blocks scrapers | SEO safe | Complexity |
|---|---|---|---|---|---|
| Cloudflare Bot Fight Mode (free) | Very Low | $0 | Partial | Yes | None |
| Cloudflare Rate Limiting | Low | Cloudflare Pro ($20/mo) | Good | Yes | Low |
| robots.txt | Low | $0 | Advisory only | Yes | None |
| Cap limit parameter to 50 | Very Low | $0 | Reduces blast radius | N/A | None |
| Gubernator in-cluster | Medium | $0 (infra only) | Strong | Yes | Medium |
| API Key auth for high-limit | High | $0 | Strong | Yes | High |
| Full JWT auth on /v1/posts | Very High | $0 | Complete | Risk (breaks SEO) | Very High |
Recommended phase order:
- Now (hours): Enable Cloudflare Bot Fight Mode + cap limit to 50 + add robots.txt
- Sprint (days): Deploy Gubernator sidecar + add rate-limit middleware + Cloudflare rate limit rules
- Next quarter: API key tier for partners; revisit AI crawler opt-out policy
Tightening the API must not break Googlebot or Bing indexing:
- Sitemap (/sitemap.xml, /sitemap-*.xml) must remain fully accessible
- /v1/posts/{slug} individual post endpoints should be allowed (low abuse risk)
- Rate limits should be generous enough for crawl budgets (Googlebot typically crawls 1–10 req/sec, i.e. 60–600 req/min; the proposed 60 req/min sits at the bottom of that range, so verified crawlers should be exempted from it or given a higher tier)
- X-Robots-Tag: noindex headers on /v1/ JSON responses prevent double-indexing (content rendered by the frontend is already indexed via HTML pages)
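One way to attach the `X-Robots-Tag: noindex` header to every `/v1/` response is a small ASGI middleware, sketched below with no framework imports. The class name `NoIndexMiddleware`, the `prefix` parameter, and the demo helpers are assumptions for illustration, not existing code in the repository.

```python
import asyncio


class NoIndexMiddleware:
    """Append X-Robots-Tag: noindex to HTTP responses under `prefix`."""

    def __init__(self, app, prefix: str = "/v1/") -> None:
        self.app = app
        self.prefix = prefix

    async def __call__(self, scope, receive, send):
        # Pass through anything that is not an HTTP request under the prefix.
        if scope["type"] != "http" or not scope["path"].startswith(self.prefix):
            await self.app(scope, receive, send)
            return

        async def send_with_header(message):
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                headers.append((b"x-robots-tag", b"noindex"))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_header)


async def _demo_app(scope, receive, send):
    """Stand-in ASGI app that returns an empty JSON object."""
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"application/json")]})
    await send({"type": "http.response.body", "body": b"{}"})


def response_headers(path: str) -> dict:
    """Run the wrapped demo app once and return the response headers."""
    sent = []

    async def capture(message):
        sent.append(message)

    async def receive():
        return {"type": "http.request", "body": b""}

    scope = {"type": "http", "path": path}
    asyncio.run(NoIndexMiddleware(_demo_app)(scope, receive, capture))
    return dict(sent[0]["headers"])


api_headers = response_headers("/v1/posts")
page_headers = response_headers("/sitemap.xml")
print(api_headers.get(b"x-robots-tag"), page_headers.get(b"x-robots-tag"))
# b'noindex' None
```

With FastAPI this would be registered as `app.add_middleware(NoIndexMiddleware, prefix="/v1/")`. Sitemaps and HTML pages fall outside the prefix and stay indexable.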
- AI crawler policy: Should we disallow all AI training crawlers (GPTBot etc.)? This is a legal/product decision, not just technical.
- API key rollout: Which partners need limit > 50? Do we have existing integrations?
- Dev/staging parity: Gubernator should also run in sh-dev/sh-staging for testing.
- Crawl-delay: Google ignores Crawl-delay in robots.txt; use Search Console to adjust Googlebot crawl rate instead.
- Tor / VPN exit nodes: Cloudflare's bot score accounts for these. No extra work needed.