@aldrinleal
Created April 25, 2026 18:39
MVP-761: Bot, Crawler & SEO Protection Strategy for sherpahealthy.com

Bot, Crawler & SEO Protection Strategy

Ticket: MVP-761 — Lock down API to prevent IP theft
Date: 2026-04-25
Status: In Progress
Stakeholders: Engineering, Product, SEO


Problem Statement

The public posts API (/v1/posts) is fully unauthenticated with a limit parameter accepting up to 10,000 posts per request. A single IP can continuously dump the entire content catalogue with no friction. Cloudflare sits in front of production, giving us a partial shield, but no deliberate strategy exists for:

  • Distinguishing legitimate SEO crawlers (Googlebot, Bingbot) from scrapers
  • Enforcing per-IP and per-session rate limits
  • Publishing a robots.txt policy
  • Validating crawler identity (reverse-DNS verification)

Taxonomy: Crawlers vs Bots vs Scrapers

Class | Examples | Intent | Should we allow?
------|----------|--------|-----------------
SEO crawlers | Googlebot, Bingbot, DuckDuckBot, Applebot, Baiduspider | Index content for search | Yes: full access to public pages
AI training crawlers | GPTBot, CCBot, ClaudeBot, Bytespider, Diffbot | Harvest training data | Conditional: see policy below
Monitoring / uptime | UptimeRobot, Pingdom, StatusCake | Health checks | Yes: rate-limited
Legitimate aggregators | Feedly, Pocket, Flipboard | Content syndication | Yes: with API key (future)
Content scrapers | Unidentified bots, rotating proxies | IP theft / bulk harvest | Block
Malicious bots | Spam bots, credential stuffers | Attack surface | Block

Current State Audit

API Exposure

Endpoint | Auth | limit max | Risk
---------|------|-----------|-----
GET /v1/posts | None | 10,000 | Critical
GET /v1/next/posts | None | 10,000 | Critical
GET /v1/posts/{slug} | None | N/A | Low
GET /sitemap.xml | None | N/A | Intentional
GET /sitemap-{page}.xml | None | N/A | Intentional

Infrastructure

  • Production: Cloudflare (Flexible SSL) → ALB → EKS (sh-production)
  • Dev/Staging: ACM cert → ALB → EKS (no Cloudflare)
  • Rate limiting: None currently
  • robots.txt: Does not exist

Proposed Strategy

Layer 1 — Cloudflare (Production, immediate wins)

Cloudflare sits in front of production and provides several free/low-cost knobs:

1a. Bot Fight Mode (Free tier)

Enable under Security → Bots → Bot Fight Mode. Automatically challenges known bad bots using JS challenges. Zero code changes.

1b. Rate Limiting Rules (Pro+)

Configure under Security → WAF → Rate Limiting:

Rule: API Rate Limit
  Match: starts_with(http.request.uri.path, "/v1/")   # the regex "matches" operator needs a Business or Enterprise plan
  Rate: 60 requests / 1 minute / IP
  Action: Block (429)
  Duration: 10 minutes
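
Cloudflare evaluates this rule at the edge, so there is nothing to deploy. To reason about how a client experiences it, or to unit-test client back-off logic, the rule can be approximated with a small in-memory model. This is a simplified sketch: `RateRule` and its field names are hypothetical, and the counting is a plain sliding window, not Cloudflare's actual algorithm.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RateRule:
    """Toy model of '60 requests / 1 minute / IP, block for 10 minutes'."""
    limit: int = 60            # requests allowed per period
    period: float = 60.0       # counting window, in seconds
    block_for: float = 600.0   # mitigation duration, in seconds
    _hits: Dict[str, List[float]] = field(default_factory=dict)
    _blocked_until: Dict[str, float] = field(default_factory=dict)

    def allow(self, ip: str, now: float) -> bool:
        """Return True if the request passes, False if it would get a 429."""
        if now < self._blocked_until.get(ip, 0.0):
            return False
        # Keep only the hits that still fall inside the counting window.
        window = [t for t in self._hits.get(ip, []) if now - t < self.period]
        if len(window) >= self.limit:
            self._blocked_until[ip] = now + self.block_for
            self._hits[ip] = []
            return False
        window.append(now)
        self._hits[ip] = window
        return True
```

Under this model, a client that sends its 61st request within a minute is rejected and stays rejected for the next ten minutes, while other IPs are unaffected.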

1c. User-Agent Firewall Rules

Block known scraper user-agents while explicitly allowing verified crawlers:

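# Allow verified SEO crawlers. A User-Agent match alone is spoofable, so also
# require cf.client.bot, which is true only for Cloudflare-verified bots.
# "contains" is case-sensitive, and Bing's UA string uses lowercase "bingbot".
if http.user_agent contains "Googlebot" and cf.client.bot → allow
if http.user_agent contains "bingbot"   and cf.client.bot → allow

# Challenge unknown high-volume bots
if starts_with(http.request.uri.path, "/v1/") and
   not cf.client.bot → JS Challenge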

1d. Verified Bot List (Cloudflare)

Cloudflare maintains a Verified Bots list that it updates automatically. Under Security → Bots, set Verified Bots to "Allow" (a Super Bot Fight Mode setting, available on Pro and above) to allowlist Googlebot et al. without manual UA matching.


Layer 2 — Application-Level Rate Limiting (Gubernator)

The team discussed Gubernator — a stateless, cloud-native distributed rate limiter (no Redis/Memcached required).

Deployment option: Sidecar or same-namespace service

# k8s/gubernator-deployment.yaml (sh-production namespace)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gubernator
  namespace: sh-production
spec:
  replicas: 2
  ...

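The FastAPI app calls Gubernator over HTTP or gRPC before processing /v1/posts requests. Gubernator needs no external datastore: each rate-limit request carries its own config, and the counters live in memory on whichever peer owns the key. With more than one replica, the peers must be able to discover each other so that all hits for a key reach the same owner.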

Recommended limits (from team discussion)

Parameter | Current | Proposed
----------|---------|---------
limit max per request | 10,000 posts | 50 posts (unauthenticated) / 200 (API key)
Requests per IP per minute | Unlimited | 60 req/min
Requests per IP per hour | Unlimited | 500 req/hour
Burst allowance | None | 20 req burst
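
To make the interaction between the per-minute rate and the burst allowance concrete, here is a minimal token-bucket sketch. It is a generic textbook version, not Gubernator's implementation. The `TokenBucket` class is hypothetical, and mapping "60 req/min with a 20 req burst" to a capacity of 20 and a refill of one token per second is an assumption about how the limits would be combined.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float          # maximum burst size
    refill_per_sec: float    # sustained request rate
    tokens: float = 0.0
    updated: float = 0.0     # timestamp of the last refill

    def __post_init__(self) -> None:
        # Start full so a new client can use its whole burst immediately.
        self.tokens = self.capacity

    def take(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then try to spend `cost` tokens."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Assumed mapping: 20-request burst, refilled at 60 requests per minute.
bucket = TokenBucket(capacity=20, refill_per_sec=1.0)
```

A client can fire 20 requests back to back, after which it is held to roughly one request per second until it pauses long enough for the bucket to refill.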

Wajahat's math: 50 posts × 60 req/min = 3,000 posts/min max for a persistent scraper, which is still high. Adding an hourly cap of 500 requests brings the worst case down to 500 × 50 = 25,000 posts/hour from a single IP, and realistic usage by medical professionals is far below that.
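
The same arithmetic can be written as a small helper, which makes it easy to re-run when the limits change. The function names are hypothetical, and the 100,000-post catalogue size is a made-up placeholder, not a figure from this document.

```python
import math


def worst_case_posts(page_size: int, per_minute: int, per_hour: int) -> tuple:
    """Return (posts per minute, posts per hour) for one IP at the limits."""
    posts_per_minute = page_size * per_minute
    # The hourly cap only matters if it is tighter than 60 minutes of the
    # per-minute limit.
    requests_per_hour = min(per_hour, per_minute * 60)
    return posts_per_minute, page_size * requests_per_hour


def hours_to_dump(catalogue_size: int, posts_per_hour: int) -> int:
    """Whole hours a single IP needs to fetch the entire catalogue."""
    return math.ceil(catalogue_size / posts_per_hour)


print(worst_case_posts(50, 60, 500))   # (3000, 25000)
# Hypothetical catalogue of 100,000 posts:
print(hours_to_dump(100_000, 25_000))  # 4
```

The limits are per IP, so a scraper rotating across N addresses multiplies these figures by N. That is why the per-IP caps are paired with bot detection in Layer 1.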

FastAPI integration sketch

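# app/middleware/rate_limit.py
import math
import time

import httpx
from fastapi import HTTPException, Request

GUBERNATOR_URL = "http://gubernator.sh-production.svc.cluster.local:8080"

# One shared async client: the check must not block the event loop, and
# connections are reused across requests.
_client = httpx.AsyncClient(base_url=GUBERNATOR_URL, timeout=0.5)


async def check_rate_limit(request: Request, key: str, limit: int, duration: int):
    """Raise 429 if `key` exceeds `limit` hits per `duration` (milliseconds)."""
    # Path and response shape follow Gubernator's HTTP gateway (GetRateLimits);
    # verify both against the deployed version.
    resp = await _client.post("/v1/GetRateLimits", json={
        "requests": [{
            "name": "api_posts",
            "unique_key": key,
            "hits": 1,
            "limit": limit,
            "duration": duration,  # milliseconds
            "algorithm": 0,  # TOKEN_BUCKET
            "behavior": 0,   # default (BATCHING)
        }]
    })
    resp.raise_for_status()
    rl = resp.json()["responses"][0]
    # The JSON gateway may encode the status enum by name or by number.
    if rl["status"] in ("OVER_LIMIT", 1):
        # reset_time is a Unix timestamp in milliseconds; Retry-After is in seconds.
        reset_ms = int(rl.get("reset_time", 0))
        retry_after = max(1, math.ceil(reset_ms / 1000 - time.time())) if reset_ms else 60
        raise HTTPException(status_code=429, detail="Rate limit exceeded", headers={
            "Retry-After": str(retry_after),
            "X-RateLimit-Limit": str(limit),
            "X-RateLimit-Remaining": "0",
        })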
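
One detail the sketch leaves open is what to pass as `key`. In production the TCP peer is Cloudflare or the ALB, not the visitor, so keying on the socket address would put every client into one bucket. The following sketch derives a key from Cloudflare's `CF-Connecting-IP` header. The helper names are hypothetical, and it assumes the origin only accepts traffic from Cloudflare; otherwise the header can be forged, as it could be in dev/staging, where Cloudflare is absent.

```python
from ipaddress import ip_address
from typing import Mapping, Optional


def client_ip(headers: Mapping[str, str], peer_ip: str) -> str:
    """Prefer CF-Connecting-IP; fall back to the socket peer address.

    `headers` is expected to be case-insensitive (as Starlette's are) or to
    use lowercase keys.
    """
    candidate = headers.get("cf-connecting-ip", "").strip()
    try:
        # Normalise and validate so garbage never becomes a bucket key.
        return str(ip_address(candidate))
    except ValueError:
        return peer_ip


def rate_limit_key(headers: Mapping[str, str], peer_ip: str,
                   api_key: Optional[str] = None) -> str:
    """Bucket by API key when one is presented, else by client IP."""
    if api_key:
        return f"key:{api_key}"
    return f"ip:{client_ip(headers, peer_ip)}"
```

Inside a FastAPI dependency this would be called along the lines of `rate_limit_key(request.headers, request.client.host)` before calling `check_rate_limit`.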

Layer 3 — Crawler Validation (Reverse DNS)

User-Agent strings are trivially spoofable. Legitimate crawlers publish their IP ranges and support reverse-DNS verification:

Googlebot verification

nslookup <crawler-IP>
# Returns: crawl-66-249-64-1.googlebot.com
nslookup crawl-66-249-64-1.googlebot.com
# Must resolve back to same IP — if not, it's a fake Googlebot

Verified crawler IP ranges (published by vendors)

Crawler | Verification | Published IP ranges
--------|--------------|--------------------
Googlebot | Reverse DNS → *.googlebot.com | Google IP ranges JSON
Bingbot | Reverse DNS → *.search.msn.com | Bing crawler IPs
DuckDuckBot | Reverse DNS → *.duckduckgo.com | No published range; DNS verify only
Applebot | Reverse DNS → applebot.apple.com | Apple subnet list
Baiduspider | Reverse DNS → *.baidu.com | DNS verify only

Implementation: A lightweight FastAPI middleware or Cloudflare Worker can perform async reverse-DNS checks for known bot UAs and either allow or downgrade them to a lower rate limit tier.
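
A minimal sketch of the forward-confirmed reverse-DNS check, using only the Python standard library. The function names and the `CRAWLER_SUFFIXES` mapping are hypothetical, and the suffixes are taken from the table above. The `socket` calls block, so in async middleware they would need to run in a thread pool, with results cached per IP.

```python
import socket
from typing import Callable, Iterable, Set

# Hostname suffixes from the verification table (illustrative subset).
CRAWLER_SUFFIXES = {
    "googlebot": (".googlebot.com",),
    "bingbot": (".search.msn.com",),
}


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> Set[str]:
    """A/AAAA lookup: hostname to the set of addresses it resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def verify_crawler(ip: str, suffixes: Iterable[str],
                   reverse: Callable[[str], str] = reverse_lookup,
                   forward: Callable[[str], Set[str]] = forward_lookup) -> bool:
    """True only if the PTR name has an allowed suffix AND resolves back to ip."""
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(tuple(suffixes)):
            return False
        return ip in forward(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

The resolvers are injectable so the logic can be tested without network access. A failed lookup is treated as "not verified", which suits the proposal to downgrade such clients to a lower rate-limit tier instead of blocking them outright.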


Layer 4 — robots.txt

No robots.txt currently exists. Proposed policy:

# sherpahealthy.com robots.txt
# Last updated: 2026-04-25

# --- Legitimate SEO crawlers: full access ---
User-agent: Googlebot
Allow: /
Crawl-delay: 1

User-agent: Bingbot
Allow: /
Crawl-delay: 2

User-agent: DuckDuckBot
Allow: /
Crawl-delay: 2

User-agent: Applebot
Allow: /
Crawl-delay: 2

User-agent: Baiduspider
Allow: /
Crawl-delay: 5

# --- AI training crawlers: disallow (opt-out) ---
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: omgili
Disallow: /

User-agent: omgilibot
Disallow: /

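# --- Block admin and API internals from all crawlers ---
# A crawler that matches a named group above does not fall back to
# "User-agent: *", so the allowed SEO crawlers are listed here as well.
# RFC 9309-compliant parsers merge groups that share a user-agent.
User-agent: *
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
User-agent: Applebot
User-agent: Baiduspider
Disallow: /admin/
Disallow: /v1/admin/
Disallow: /v1/posts/next/
Disallow: /docs
Disallow: /redoc
Disallow: /openapi.json
Allow: /v1/posts
Allow: /sitemap.xml
Allow: /sitemap-*.xml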

Sitemap: https://sherpahealthy.com/sitemap.xml

Note: robots.txt is advisory only; malicious bots ignore it. It documents our access policy, which may help support a legal claim against scrapers (for example under the CFAA), and it prevents accidental indexing of admin pages. Enforcement happens in Cloudflare + Gubernator.
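
Before publishing, the policy can be sanity-checked with Python's standard `urllib.robotparser`. The sketch below uses a reduced, hypothetical policy, not the full file, to show one behaviour worth testing: a crawler that matches a named group is judged by that group alone, so any Disallow meant for it has to appear there. Real crawlers use their own parsers, so this is a smoke test and not a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Reduced example policy (hypothetical; not the full production file).
POLICY = """\
User-agent: Googlebot
Disallow: /admin/
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /docs
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network fetch)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def allowed(parser: RobotFileParser, agent: str, path: str) -> bool:
    return parser.can_fetch(agent, "https://sherpahealthy.com" + path)


rp = build_parser(POLICY)
checks = [
    ("Googlebot", "/v1/posts"),
    ("Googlebot", "/admin/users"),
    ("GPTBot", "/v1/posts"),
    ("SomeOtherBot", "/docs"),
]
for agent, path in checks:
    print(agent, path, allowed(rp, agent, path))
```

Running the same checks against the real file in CI would catch a policy edit that accidentally blocks Googlebot or opens /admin/.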

Backend implementation

Add a robots.txt endpoint in FastAPI (environment-aware):

# app/sherpahealthy/robots.py
from fastapi import APIRouter
from fastapi.responses import PlainTextResponse
from app.config import settings

robots_router = APIRouter()

ROBOTS_PRODUCTION = """..."""  # full policy above
ROBOTS_NONPROD = """User-agent: *\nDisallow: /\n"""  # block all in dev/staging

@robots_router.get("/robots.txt", response_class=PlainTextResponse)
async def robots_txt():
    if settings.ENV == "production":
        return PlainTextResponse(ROBOTS_PRODUCTION, media_type="text/plain")
    return PlainTextResponse(ROBOTS_NONPROD, media_type="text/plain")

Dev/staging must return Disallow: / for all agents to prevent test content from being indexed by search engines.


Layer 5 — API limit Parameter Cap

Immediate change: reduce the hard ceiling in feeds_http.py:

# BEFORE
limit: int = Query(100, description="Number of posts to return", ge=1, le=10000)

# AFTER (unauthenticated)
limit: int = Query(20, description="Number of posts to return", ge=1, le=50)

Future: Allow higher limits (le=500) for requests presenting a valid API key, enabling legitimate partners/aggregators without exposing the full 10k ceiling.
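
A sketch of how the tiered ceiling might be resolved once API keys exist. `effective_limit` and `UNAUTH_MAX` are hypothetical names, and the partner ceiling is left as a parameter because it is still a product decision. Unlike the `Query(..., le=50)` validator, which rejects an oversized value with a 422, this version clamps it; which behaviour is preferable is an open design choice.

```python
from typing import Optional

UNAUTH_MAX = 50  # ceiling for requests without an API key


def effective_limit(requested: int, partner_max: Optional[int] = None) -> int:
    """Clamp the requested page size to the caller's ceiling.

    partner_max is the ceiling granted to a valid API key; None means the
    request is unauthenticated.
    """
    ceiling = partner_max if partner_max is not None else UNAUTH_MAX
    return max(1, min(requested, ceiling))
```

For example, an unauthenticated request for 10,000 posts resolves to 50, while the same request with a key whose ceiling is 500 resolves to 500.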


Decision Matrix for Stakeholders

Option | Effort | Cost | Blocks scrapers | SEO safe | Complexity
-------|--------|------|-----------------|----------|-----------
Cloudflare Bot Fight Mode (free) | Very Low | $0 | Partial | Yes | None
Cloudflare Rate Limiting | Low | Cloudflare Pro ($20/mo) | Good | Yes | Low
robots.txt | Low | $0 | Advisory only | Yes | None
Cap limit parameter to 50 | Very Low | $0 | Reduces blast radius | N/A | None
Gubernator in-cluster | Medium | $0 (infra only) | Strong | Yes | Medium
API Key auth for high-limit | High | $0 | Strong | Yes | High
Full JWT auth on /v1/posts | Very High | $0 | Complete | Risk (breaks SEO) | Very High

Recommended phase order:

  1. Now (hours): Enable Cloudflare Bot Fight Mode + cap limit to 50 + add robots.txt
  2. Sprint (days): Deploy Gubernator sidecar + add rate-limit middleware + Cloudflare rate limit rules
  3. Next quarter: API key tier for partners; revisit AI crawler opt-out policy

SEO Impact Assessment

Tightening the API must not break Googlebot or Bing indexing:

  • Sitemap (/sitemap.xml, /sitemap-*.xml) must remain fully accessible
  • /v1/posts/{slug} individual post endpoints should be allowed (low abuse risk)
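  • Rate limits must leave room for crawl budgets. At the quoted 1–10 req/sec, Googlebot would issue 60–600 req/min, which meets or exceeds the proposed 60 req/min per-IP limit, so verified crawlers (Layer 3) need an exemption or a higher rate-limit tier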
  • X-Robots-Tag: noindex headers on /v1/ JSON responses prevent double-indexing (content rendered by the frontend is already indexed via HTML pages)
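
The header rule in the last bullet is small enough to express as a pure function that a response middleware could call. This is a sketch: `seo_headers` is a hypothetical helper, and the assumption is that every path under /v1/ returns JSON that should stay out of search indexes.

```python
from typing import Dict


def seo_headers(path: str) -> Dict[str, str]:
    """Extra response headers for a request path.

    JSON API responses are marked noindex; sitemaps and HTML pages are
    left untouched so they stay indexable.
    """
    if path.startswith("/v1/"):
        return {"X-Robots-Tag": "noindex"}
    return {}
```

Keeping the rule in one function makes it straightforward to assert in tests that /sitemap.xml never receives a noindex header.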

Open Questions

  1. AI crawler policy: Should we disallow all AI training crawlers (GPTBot etc.)? This is a legal/product decision, not just technical.
  2. API key rollout: Which partners need limit > 50? Do we have existing integrations?
  3. Dev/staging parity: Gubernator should also run in sh-dev/sh-staging for testing.
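  4. Crawl-delay: Google ignores Crawl-delay in robots.txt, and the Search Console crawl-rate limiter was retired in January 2024. Googlebot backs off when it receives 429 or 503 responses, so rate limiting is the practical lever.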
  5. Tor / VPN exit nodes: Cloudflare's bot score accounts for these. No extra work needed.
