Bot, Crawler & SEO Protection Strategy

Ticket: MVP-761 — Lock down API to prevent IP theft
Date: 2026-04-25
Status: In Progress
Stakeholders: Engineering, Product, SEO

Problem Statement

The public posts API (/v1/posts) is fully unauthenticated with a limit parameter accepting up to 10,000 posts per request. A single IP can continuously dump the entire content catalogue with no friction. Cloudflare sits in front of production, giving us a partial shield, but no deliberate strategy exists for:

Distinguishing legitimate SEO crawlers (Googlebot, Bingbot) from scrapers
Enforcing per-IP and per-session rate limits
Publishing a robots.txt policy
Validating crawler identity (reverse-DNS verification)

Taxonomy: Crawlers vs Bots vs Scrapers

Class	Examples	Intent	Should we allow?
SEO crawlers	Googlebot, Bingbot, DuckDuckBot, Applebot, Baiduspider	Index content for search	Yes — full access to public pages
AI training crawlers	GPTBot, CCBot, ClaudeBot, Bytespider, Diffbot	Harvest training data	Conditional — see policy below
Monitoring / uptime	UptimeRobot, Pingdom, StatusCake	Health checks	Yes — rate-limited
Legitimate aggregators	Feedly, Pocket, Flipboard	Content syndication	Yes — with API key (future)
Content scrapers	Unidentified bots, rotating proxies	IP theft / bulk harvest	Block
Malicious bots	Spam bots, credential stuffers	Attack surface	Block

Current State Audit

API Exposure

Endpoint	Auth	`limit` max	Risk
`GET /v1/posts`	None	10,000	Critical
`GET /v1/next/posts`	None	10,000	Critical
`GET /v1/posts/{slug}`	None	N/A	Low
`GET /sitemap.xml`	None	N/A	Intentional
`GET /sitemap-{page}.xml`	None	N/A	Intentional

Infrastructure

Production: Cloudflare (Flexible SSL) → ALB → EKS (sh-production)
Dev/Staging: ACM cert → ALB → EKS (no Cloudflare)
Rate limiting: None currently
robots.txt: Does not exist

Proposed Strategy

Layer 1 — Cloudflare (Production, immediate wins)

Cloudflare sits in front of production and provides several free/low-cost knobs:

1a. Bot Fight Mode (Free tier)

Enable under Security → Bots → Bot Fight Mode. Automatically challenges known bad bots using JS challenges. Zero code changes.

1b. Rate Limiting Rules (Pro+)

Configure under Security → WAF → Rate Limiting:

Rule: API Rate Limit
  Match: starts_with(http.request.uri.path, "/v1/")   # the regex "matches" operator needs a Business or Enterprise plan
  Rate: 60 requests / 1 minute / IP
  Action: Block (429)
  Duration: 10 minutes
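
To make the rule's behavior concrete, here is a minimal in-memory model of "N requests per window per IP, then block for a cooldown". It only illustrates what clients should experience and is not how Cloudflare implements it; the class name `WindowLimiter` and its parameters are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WindowLimiter:
    """Toy model: allow `rate` requests per `period` seconds per IP,
    then block that IP for `block_for` seconds (hypothetical helper)."""
    rate: int = 60
    period: float = 60.0
    block_for: float = 600.0
    _windows: dict = field(default_factory=dict)   # ip -> (window_start, count)
    _blocked: dict = field(default_factory=dict)   # ip -> blocked_until

    def allow(self, ip: str, now: float) -> bool:
        # Still inside a block period: reject without counting.
        if now < self._blocked.get(ip, 0.0):
            return False
        start, count = self._windows.get(ip, (now, 0))
        if now - start >= self.period:
            start, count = now, 0           # start a fresh counting window
        count += 1
        self._windows[ip] = (start, count)
        if count > self.rate:
            self._blocked[ip] = now + self.block_for
            return False
        return True


limiter = WindowLimiter()
# 61 requests within one second: the first 60 pass, the 61st trips the block.
results = [limiter.allow("203.0.113.7", now=i / 61) for i in range(61)]
```

Under this model a client that exceeds the rate stays blocked for the full 10 minutes even if it goes quiet, which is the behavior the rule above asks for.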

1c. User-Agent Firewall Rules

Block known scraper user-agents while explicitly allowing verified crawlers:

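# Allow SEO crawlers only when Cloudflare has verified them (cf.client.bot);
# a bare User-Agent match is spoofable (see Layer 3)
if cf.client.bot AND http.user_agent contains "Googlebot" → allow
if cf.client.bot AND http.user_agent contains "bingbot"   → allow   # "contains" is case-sensitive; Bing sends lowercase "bingbot"

# Challenge every client on API paths that is not a verified bot
if starts_with(http.request.uri.path, "/v1/") AND
   not cf.client.bot → JS Challenge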

1d. Verified Bot List (Cloudflare)

Cloudflare maintains a Verified Bots list it updates automatically. Under Security → Bots, enable "Allow Verified Bots" to whitelist Googlebot et al. without manual UA matching.

Layer 2 — Application-Level Rate Limiting (Gubernator)

The team discussed Gubernator — a stateless, cloud-native distributed rate limiter (no Redis/Memcached required).

Deployment option: Sidecar or same-namespace service

# k8s/gubernator-deployment.yaml (sh-production namespace)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gubernator
  namespace: sh-production
spec:
  replicas: 2
  ...

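The FastAPI app calls Gubernator over HTTP or gRPC before processing /v1/posts requests. Gubernator stores no configuration: each rate-limit request carries its own limit and duration, and counters live in memory on the peer that owns the key, so there is no datastore to provision or keep in sync.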

Recommended limits (from team discussion)

Parameter	Current	Proposed
`limit` max per request	10,000 posts	50 posts (unauthenticated) / 200 (API key)
Requests per IP per minute	Unlimited	60 req/min
Requests per IP per hour	Unlimited	500 req/hour
Burst allowance	None	20 req burst

Wajahat's math: at 50 posts × 60 req/min, a persistent scraper can still pull 3,000 posts/min, which is high. An hourly cap of 500 requests bounds the worst case at 500 × 50 = 25,000 posts/hour from a single IP, and real users (medical professionals) stay far below that.
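
The arithmetic above can be checked with a few lines of Python. This is only a worked calculation; the function names are made up for illustration, and the 100,000-post catalogue size is a hypothetical figure, not a measurement.

```python
import math


def worst_case_posts(posts_per_request: int, requests_per_window: int) -> int:
    """Upper bound on posts one IP can pull in a window when every request is maxed out."""
    return posts_per_request * requests_per_window


def hours_to_dump(catalogue_size: int, posts_per_hour: int) -> int:
    """Whole hours a single IP needs to page through the full catalogue."""
    return math.ceil(catalogue_size / posts_per_hour)


per_minute = worst_case_posts(50, 60)           # 3000 posts/min with only the per-minute cap
per_hour_minute_cap_only = per_minute * 60      # 180000 posts/hour if no hourly cap existed
per_hour = worst_case_posts(50, 500)            # 25000 posts/hour with the hourly cap

# Hypothetical catalogue of 100,000 posts: 4 hours from one IP under the proposed caps.
hours_needed = hours_to_dump(100_000, per_hour)
```

The hourly cap is what does most of the work: without it, the per-minute limit alone would still allow 180,000 posts/hour.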

FastAPI integration sketch

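# app/middleware/rate_limit.py
import math
import time

import httpx
from fastapi import HTTPException, Request

GUBERNATOR_URL = "http://gubernator.sh-production.svc.cluster.local:8080"

async def check_rate_limit(request: Request, key: str, limit: int, duration: int):
    """Ask Gubernator whether `key` may make one more request; raise 429 if not."""
    payload = {
        "requests": [{
            "name": "api_posts",
            "unique_key": key,
            "hits": 1,
            "limit": limit,
            "duration": duration,  # milliseconds
            "algorithm": 0,  # TOKEN_BUCKET
            "behavior": 0,   # BATCHING (default)
        }]
    }
    # Use the async client: a blocking httpx.post() here would stall the event loop.
    # Endpoint path assumed from Gubernator's HTTP gateway; verify against the deployed version.
    async with httpx.AsyncClient(timeout=0.5) as client:
        resp = await client.post(f"{GUBERNATOR_URL}/v1/GetRateLimits", json=payload)
    resp.raise_for_status()
    rl = resp.json()["responses"][0]
    # The JSON gateway may encode enums as names and int64 values as strings.
    if rl.get("status") in (1, "OVER_LIMIT"):
        # reset_time is a Unix timestamp in milliseconds; Retry-After wants seconds from now.
        reset_ms = int(rl.get("reset_time") or 0)
        retry_after = max(1, math.ceil(reset_ms / 1000 - time.time())) if reset_ms else 60
        raise HTTPException(status_code=429, headers={
            "Retry-After": str(retry_after),
            "X-RateLimit-Limit": str(limit),
            "X-RateLimit-Remaining": "0",
        })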
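
The sketch's `key` argument has to identify the real client. Behind Cloudflare and the ALB the socket peer is a proxy, so one possible approach is to derive the key from the `CF-Connecting-IP` header that Cloudflare adds. The helper below is a sketch under the assumption that the origin only accepts traffic from Cloudflare (otherwise the header can be forged); `rate_limit_key` and the `ip:` prefix are hypothetical names.

```python
import ipaddress
from typing import Mapping, Optional


def _parse_ip(value: Optional[str]):
    """Return a parsed IP address, or None when the value is missing or malformed."""
    try:
        return ipaddress.ip_address((value or "").strip())
    except ValueError:
        return None


def rate_limit_key(headers: Mapping[str, str], peer_ip: str) -> str:
    """Build the per-client key passed to the rate limiter (hypothetical helper).

    Prefers Cloudflare's CF-Connecting-IP header, then the first hop of
    X-Forwarded-For, then the socket peer. Header names are matched
    case-insensitively; invalid values are skipped.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    candidates = [
        lowered.get("cf-connecting-ip"),
        (lowered.get("x-forwarded-for") or "").split(",")[0],
        peer_ip,
    ]
    for candidate in candidates:
        ip = _parse_ip(candidate)
        if ip is not None:
            return f"ip:{ip}"
    return "ip:unknown"
```

In the FastAPI app this could be called as `rate_limit_key(request.headers, request.client.host)`. In dev/staging, where Cloudflare is absent, the function falls back to `X-Forwarded-For` from the ALB.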

Layer 3 — Crawler Validation (Reverse DNS)

User-Agent strings are trivially spoofable. Legitimate crawlers publish their IP ranges and support reverse-DNS verification:

Googlebot verification

nslookup <crawler-IP>
# Returns: crawl-66-249-64-1.googlebot.com
nslookup crawl-66-249-64-1.googlebot.com
# Must resolve back to same IP — if not, it's a fake Googlebot

Verified crawler IP ranges (published by vendors)

Crawler	Verification	Published IP ranges
Googlebot	Reverse DNS → `*.googlebot.com` or `*.google.com`	Google IP ranges JSON
Bingbot	Reverse DNS → `*.search.msn.com`	Bing crawler IPs
DuckDuckBot	Match source IP against the published list	IP list on DuckDuckGo's help pages
Applebot	Reverse DNS → `*.applebot.apple.com`	Apple subnet list
Baiduspider	Reverse DNS → `*.baidu.com`	DNS verify only
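
For vendors that publish IP ranges, membership can be checked with the standard library alone. The following is a minimal sketch: `ip_in_ranges` is a hypothetical helper, and the CIDR blocks are documentation-only placeholders, not real crawler ranges. A real deployment would load the list from the vendor's published file and refresh it periodically.

```python
import ipaddress
from typing import Iterable


def ip_in_ranges(ip: str, cidrs: Iterable[str]) -> bool:
    """Return True if `ip` falls inside any of the CIDR blocks."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False  # malformed input never counts as a verified crawler
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr, strict=False)
        # Only compare like with like: IPv4 against IPv4, IPv6 against IPv6.
        if addr.version == network.version and addr in network:
            return True
    return False


# Placeholder ranges from the RFC 5737 / RFC 3849 documentation blocks, NOT real crawler ranges.
EXAMPLE_CRAWLER_RANGES = ["192.0.2.0/24", "2001:db8::/32"]
```

A range check is cheaper than a DNS round trip, so it can run first, with reverse DNS as the fallback for crawlers that publish no ranges.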

Implementation: A lightweight FastAPI middleware or Cloudflare Worker can perform async reverse-DNS checks for known bot UAs and either allow or downgrade them to a lower rate limit tier.
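
The two-step check (reverse lookup, then forward confirmation) might look like the sketch below. It is an illustration rather than a finished middleware: `verify_crawler_ip` is a hypothetical name, the resolver functions are injectable so the logic can be tested without network access, and the `socket` calls block, so an async app would need to run them in a thread (for example via `asyncio.to_thread`) and cache the results.

```python
import socket
from typing import Callable, Iterable, List


def _reverse(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    """Forward lookup: hostname to the list of addresses it resolves to."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def verify_crawler_ip(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Forward-confirmed reverse DNS check (hypothetical helper).

    Suffixes should start with a dot (".googlebot.com") so that a host such
    as "evilgooglebot.com" does not pass.
    """
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:          # covers socket.herror / socket.gaierror
        return False
    if not any(host.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        # The hostname must resolve back to the IP that made the request.
        return ip in forward(host)
    except OSError:
        return False
```

A failed check does not have to mean a hard block; as noted above, the request can instead be dropped to the default rate-limit tier.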

Layer 4 — `robots.txt`

No robots.txt currently exists. Proposed policy:

# sherpahealthy.com robots.txt
# Last updated: 2026-04-25

# --- Legitimate SEO crawlers: full access ---
User-agent: Googlebot
Allow: /
Crawl-delay: 1

User-agent: Bingbot
Allow: /
Crawl-delay: 2

User-agent: DuckDuckBot
Allow: /
Crawl-delay: 2

User-agent: Applebot
Allow: /
Crawl-delay: 2

User-agent: Baiduspider
Allow: /
Crawl-delay: 5

# --- AI training crawlers: disallow (opt-out) ---
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: omgili
Disallow: /

User-agent: omgilibot
Disallow: /

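# --- Block admin and API internals from all crawlers ---
# A crawler obeys only the group that names it, so the SEO crawlers listed
# earlier would skip a bare "*" group. They are repeated here; rules for the
# same agent are merged (RFC 9309).
User-agent: *
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
User-agent: Applebot
User-agent: Baiduspider
Disallow: /admin/
Disallow: /v1/admin/
Disallow: /v1/posts/next/
Disallow: /docs
Disallow: /redoc
Disallow: /openapi.json
Allow: /v1/posts
Allow: /sitemap.xml
Allow: /sitemap-*.xml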

Sitemap: https://sherpahealthy.com/sitemap.xml

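Note: robots.txt is advisory only; malicious bots ignore it. Its value is twofold: it puts our access policy on record, which may help in a later legal dispute (for example under the CFAA; confirm with counsel), and it keeps compliant crawlers out of admin pages. Enforcement happens in Cloudflare and Gubernator.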
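
Before publishing, the policy can be sanity-checked with Python's standard `urllib.robotparser`. The sketch below parses a shortened excerpt of the proposed file; `load_policy` and `POLICY_EXCERPT` are hypothetical names. One caveat: the standard-library parser applies rules in file order, whereas Google uses longest-match precedence, so this is a smoke test and not a substitute for checking against each crawler's own tooling.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the proposed policy, for illustration only.
POLICY_EXCERPT = """\
User-agent: Bingbot
Allow: /
Crawl-delay: 2

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /docs
Allow: /v1/posts
"""


def load_policy(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


policy = load_policy(POLICY_EXCERPT)

gptbot_blocked = not policy.can_fetch("GPTBot", "/v1/posts")
admin_blocked = not policy.can_fetch("SomeBot", "/admin/users")
posts_allowed = policy.can_fetch("SomeBot", "/v1/posts")
bing_delay = policy.crawl_delay("Bingbot")
```

A check like this fits in CI, so an edit to the policy that accidentally opens `/admin/` or blocks the sitemap fails before it ships.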

Backend implementation

Add a robots.txt endpoint in FastAPI (environment-aware):

# app/sherpahealthy/robots.py
from fastapi import APIRouter
from fastapi.responses import PlainTextResponse
from app.config import settings

robots_router = APIRouter()

ROBOTS_PRODUCTION = """..."""  # full policy above
ROBOTS_NONPROD = """User-agent: *\nDisallow: /\n"""  # block all in dev/staging

@robots_router.get("/robots.txt", response_class=PlainTextResponse)
async def robots_txt():
    if settings.ENV == "production":
        return PlainTextResponse(ROBOTS_PRODUCTION, media_type="text/plain")
    return PlainTextResponse(ROBOTS_NONPROD, media_type="text/plain")

Dev/staging must return Disallow: / for all agents to prevent test content from being indexed by search engines.

Layer 5 — API `limit` Parameter Cap

Immediate change: reduce the hard ceiling in feeds_http.py:

# BEFORE
limit: int = Query(100, description="Number of posts to return", ge=1, le=10000)

# AFTER (unauthenticated)
limit: int = Query(20, description="Number of posts to return", ge=1, le=50)

Future: Allow higher limits (le=200, matching the recommended limits table) for requests presenting a valid API key, enabling legitimate partners/aggregators without exposing the full 10k ceiling.
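
One way to express the two tiers is a small helper that picks the ceiling from the caller's authentication state. This is a sketch only: `effective_limit` is a hypothetical name, the ceilings of 50 and 200 are taken from the recommended-limits table, and how an API key is validated is left open.

```python
from typing import Optional

UNAUTHENTICATED_MAX = 50   # ceiling for anonymous callers (recommended-limits table)
API_KEY_MAX = 200          # ceiling for callers with a valid API key (same table)
DEFAULT_LIMIT = 20         # default used when the caller sends no limit


def effective_limit(requested: Optional[int], has_valid_api_key: bool) -> int:
    """Clamp the requested page size to the ceiling for the caller's tier."""
    ceiling = API_KEY_MAX if has_valid_api_key else UNAUTHENTICATED_MAX
    if requested is None:
        return DEFAULT_LIMIT
    return max(1, min(requested, ceiling))
```

Note the design difference from `Query(..., le=50)`: the validator rejects an oversized `limit` with a 422, while this helper silently clamps it. Clamping is friendlier to existing clients that still send large values; rejecting makes the new contract explicit.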

Decision Matrix for Stakeholders

Option	Effort	Cost	Blocks scrapers	SEO safe	Complexity
Cloudflare Bot Fight Mode (free)	Very Low	$0	Partial	Yes	None
Cloudflare Rate Limiting	Low	Cloudflare Pro ($20/mo)	Good	Yes	Low
`robots.txt`	Low	$0	Advisory only	Yes	None
Cap `limit` parameter to 50	Very Low	$0	Reduces blast radius	N/A	None
Gubernator in-cluster	Medium	$0 (infra only)	Strong	Yes	Medium
API Key auth for high-limit	High	$0	Strong	Yes	High
Full JWT auth on `/v1/posts`	Very High	$0	Complete	Risk (breaks SEO)	Very High

Recommended phase order:

Now (hours): Enable Cloudflare Bot Fight Mode + cap limit to 50 + add robots.txt
Sprint (days): Deploy Gubernator sidecar + add rate-limit middleware + Cloudflare rate limit rules
Next quarter: API key tier for partners; revisit AI crawler opt-out policy

SEO Impact Assessment

Tightening the API must not break Googlebot or Bing indexing:

Sitemap (/sitemap.xml, /sitemap-*.xml) must remain fully accessible
/v1/posts/{slug} individual post endpoints should be allowed (low abuse risk)
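Rate limits must leave room for crawl budgets. Googlebot typically crawls at 1–10 req/sec, i.e. 60–600 req/min, which meets or exceeds the proposed 60 req/min per-IP cap on /v1/, so verified crawlers need an exemption or a higher rate-limit tier (see Layers 1d and 3)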
X-Robots-Tag: noindex headers on /v1/ JSON responses prevent double-indexing (content rendered by the frontend is already indexed via HTML pages)
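
The `X-Robots-Tag` header could be attached in one place with a small pure-ASGI middleware, sketched below using only the standard library. `NoIndexAPIMiddleware`, the stub app, and `response_headers_for` are hypothetical names; the stub exists only to exercise the middleware without a running server.

```python
import asyncio


class NoIndexAPIMiddleware:
    """Pure ASGI middleware: add `X-Robots-Tag: noindex` to responses under a path prefix."""

    def __init__(self, app, prefix: str = "/v1/"):
        self.app = app
        self.prefix = prefix

    async def __call__(self, scope, receive, send):
        # Pass through anything that is not an HTTP request under the prefix.
        if scope["type"] != "http" or not scope["path"].startswith(self.prefix):
            await self.app(scope, receive, send)
            return

        async def send_with_header(message):
            # Headers travel in the "http.response.start" message.
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                headers.append((b"x-robots-tag", b"noindex"))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_header)


async def _stub_app(scope, receive, send):
    """Stand-in for the real application: always answers 200 with an empty JSON body."""
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"application/json")]})
    await send({"type": "http.response.body", "body": b"{}"})


def response_headers_for(path: str) -> dict:
    """Run the wrapped stub app for one request and return its response headers."""
    sent = []

    async def send(message):
        sent.append(message)

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    wrapped = NoIndexAPIMiddleware(_stub_app)
    asyncio.run(wrapped({"type": "http", "path": path}, receive, send))
    return dict(sent[0]["headers"])
```

In the FastAPI app this would be registered with `app.add_middleware(NoIndexAPIMiddleware)`. Sitemap and HTML routes fall outside the `/v1/` prefix and stay indexable.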

Open Questions

AI crawler policy: Should we disallow all AI training crawlers (GPTBot etc.)? This is a legal/product decision, not just technical.
API key rollout: Which partners need limit > 50? Do we have existing integrations?
Dev/staging parity: Gubernator should also run in sh-dev/sh-staging for testing.
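Crawl-delay: Google ignores Crawl-delay in robots.txt, and the Search Console crawl-rate limiter was retired in early 2024. To slow Googlebot, temporarily return 429 or 503 responses; the Crawl-delay values in the policy only affect crawlers that honor them (e.g. Bingbot).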
Tor / VPN exit nodes: Cloudflare's bot score accounts for these. No extra work needed.
