Skip to content

Instantly share code, notes, and snippets.

@dwillitzer
Created December 29, 2025 22:41
Show Gist options
  • Select an option

  • Save dwillitzer/88d41f26d46c17ee99493ba8441edf40 to your computer and use it in GitHub Desktop.

Select an option

Save dwillitzer/88d41f26d46c17ee99493ba8441edf40 to your computer and use it in GitHub Desktop.
Deepcrawl Research: Open-Source Firecrawl Alternative - Technical Analysis & TCO Comparison

Deepcrawl: Open-Source Firecrawl Alternative

Overview

Repository: github.com/lumpinif/deepcrawl Documentation: deepcrawl.dev/docs License: MIT (100% open source) Author: Felix Lu (@felixLu) Stars: 318+ | Forks: 35

Deepcrawl is a 100% free, open-source alternative to Firecrawl - an agents-oriented website data context extraction platform designed to make any website data AI-ready.


What It Does

Deepcrawl extracts:

  • Cleaned markdown of page content
    • Hierarchical links tree (agent-favored navigation structure)
    • Page metadata that LLMs can digest with minimal token cost

The goal is to reduce context switching and hallucination when feeding web content to AI agents.


Technical Architecture

Core Stack

Component Technology Purpose
API Workers Cloudflare Workers Edge-first fetching, normalization, link graph generation
Dashboard Next.js 16 App Router UI for crawls, API playground, logs, key management
Routing/Middleware Hono Lightweight edge-first framework
RPC Layer oRPC Contract-first RPC generating OpenAPI + SDK from single source
Auth better-auth Enterprise-grade session and API key management
Schemas Zod v4 Type safety across workers, SDKs, and docs
Database Drizzle ORM Supports both Neon PostgreSQL and Cloudflare D1 SQLite
Caching Upstash Redis + Cloudflare KV Layered caching for performance/cost balance
UI shadcn/ui + Tailwind CSS v4 Accessible component system
State Management nuqs URL state serialization for shareable configurations
Data Fetching React Query Server-side prefetching and cache sync
Build Turborepo (monorepo) Shared packages, database modules, scripts
Bundler tsdown (Rolldown-based) Rust-powered bundling
Linting Biome Unified formatter and linter

Language Breakdown

  • TypeScript: 94.5%
    • MDX: 4.8%
    • Other: 0.7%

Core API Endpoints

1. getMarkdown (GET /read)

Fast endpoint returning prompt-ready markdown. Ideal for caching snippets or feeding LLM prompts.

curl \
  -H "Authorization: Bearer $DEEPCRAWL_API_KEY" \
    "https://api.deepcrawl.dev/read?url=https://example.com"
    ```
    
    ### 2. `readUrl` (POST /read)
    Full operation with metadata, cleaned HTML, markdown, robots, sitemap data, and metrics in one payload.
    
    ### 3. `extractLinks` (POST /links)
    Configurable crawl endpoint that builds agent-navigable links tree, filters external/media links.
    
    ### 4. `getLinks` (GET /links)
    Quick fetch of extracted links for a page.
    
    ---
    
    ## SDK Usage
    
    ```typescript
    import { DeepcrawlApp } from 'deepcrawl';
    
    const deepcrawl = new DeepcrawlApp({
      apiKey: process.env.DEEPCRAWL_API_KEY as string,
      });
      
      // Simple markdown extraction
      const markdown = await deepcrawl.getMarkdown('https://example.com');
      
      // With options
      const markdown = await deepcrawl.getMarkdown('https://example.com', {
        cacheOptions: { expirationTtl: 3600 },
          cleaningProcessor: 'cheerio-reader', // or 'html-rewriter'
          });
          ```
          
          ---
          
          ## Total Cost of Ownership (TCO) Analysis
          
          ### Deepcrawl Self-Hosted Costs
          
          | Component | Free Tier | Paid Tier |
          |-----------|-----------|-----------|
          | **Software** | $0 (MIT license) | $0 |
          | **Cloudflare Workers** | $0/mo (100k req/day) | $5/mo + $0.30/M requests |
          | **Cloudflare KV** | Included | Included |
          | **Cloudflare R2** | Zero egress | Zero egress |
          | **Vercel (Dashboard)** | $0/mo (Hobby) | $20/mo (Pro) |
          | **Database (D1/Neon)** | Free tier available | Variable |
          
          **Typical TCO: $0/month** for most use cases
          
          ### Firecrawl Comparison (Commercial)
          
          | Plan | Monthly Cost | Credits | Cost/1K pages |
          |------|-------------|---------|---------------|
          | Free | $0 (one-time) | 500 | N/A |
          | Hobby | $16/mo | 3,000 | $5.33 |
          | Standard | $83/mo | 100,000 | $0.83 |
          | Growth | $333/mo | 500,000 | $0.67 |
          
          ### Cost Savings Summary
          
          | Usage Level | Firecrawl Cost | Deepcrawl Cost | Savings |
          |-------------|---------------|----------------|---------|
          | 3,000 pages/mo | $16/mo | $0 | $192/year |
          | 100,000 pages/mo | $83/mo | $0-5/mo | $936-996/year |
          | 500,000 pages/mo | $333/mo | $5-20/mo | $3,756-3,936/year |
          
          ---
          
          ## Key Features & Advantages
          
          ### Performance
          - **5-10x faster** than alternatives on standard HTML parsing
          - No headless browser overhead for simple content
          - Edge-native V8 Workers return responses in milliseconds
          - Dynamic cache controls reduce redundant crawls
          
          ### AI Optimization
          - **Markdown-first output** removes ads, scripts, boilerplate
          - **Token-efficient formatting** cuts prompt costs
          - **Link tree intelligence** maps site topology for agent navigation
          
          ### Developer Experience
          - Fully typed TypeScript SDK
          - Zod schemas plug directly into AI frameworks (ai-sdk, LangChain)
          - OpenAPI and oRPC endpoints work across any runtime
          - Consistent REST API across curl, Python, serverless functions
          
          ### Infrastructure
          - Cloudflare Workers run globally, minimizing latency
          - CDN-backed responses for consistent worldwide performance
          - Automatic retries for flaky sites
          
          ---
          
          ## Deployment Options
          
          ### Option 1: Use Hosted API (Free)
          - Sign up at deepcrawl.dev
          - Get API key from dashboard
          - Use SDK or REST API
          
          ### Option 2: Self-Host (Full Control)
          ```bash
          # Clone repo
          git clone https://github.com/lumpinif/deepcrawl.git
          
          # Deploy dashboard to Vercel
          # Deploy workers to Cloudflare
          # Configure database (D1 or Neon PostgreSQL)
          ```
          
          Full self-hosting guide: [deepcrawl.dev/docs/reference/self-hosting](https://deepcrawl.dev/docs/reference/self-hosting)
          
          ---
          
          ## Integration Examples
          
          ### With Vercel AI SDK
          ```typescript
          import { DeepcrawlApp } from 'deepcrawl';
          import { generateText } from 'ai';
          
          const deepcrawl = new DeepcrawlApp({ apiKey: process.env.DEEPCRAWL_API_KEY });
          
          const markdown = await deepcrawl.getMarkdown('https://example.com');
          const result = await generateText({
            model: yourModel,
              prompt: `Summarize this page:\n\n${markdown}`,
              });
              ```
              
              ### With LangChain
              ```typescript
              const markdown = await deepcrawl.getMarkdown(url);
              const docs = [new Document({ pageContent: markdown })];
              // Use with your LangChain pipeline
              ```
              
              ---
              
              ## Limitations & Considerations
              
               **Active Development Warning:**
               > DO NOT USE DEEPCRAWL IN PRODUCTION RIGHT NOW AS IT IS SUBJECT TO CHANGE AND STILL UNDER RAPID DEVELOPMENT. USE WITH YOUR OWN RISK!
               
               ### Current Limitations
               - APIs and SDKs subject to change
               - Many planned features still in development
               - Early-stage stability
               - Community support (no enterprise SLA)
               
               ### When to Choose Firecrawl Instead
               - Need production SLA guarantees
               - Require enterprise support
               - Don't want to manage infrastructure
               - Need features not yet in Deepcrawl
               
               ---
               
               ## Quick Start
               
               ```bash
               # Install SDK
               npm install deepcrawl
               
               # Set API key
               export DEEPCRAWL_API_KEY="dc_sk_live_your_key"
               ```
               
               ```typescript
               import { DeepcrawlApp } from 'deepcrawl';
               
               const dc = new DeepcrawlApp({
                 apiKey: process.env.DEEPCRAWL_API_KEY
                 });
                 
                 // Get markdown from any URL
                 const content = await dc.getMarkdown('https://news.ycombinator.com');
                 console.log(content);
                 ```
                 
                 ---
                 
                 ## Resources
                 
                 - **GitHub:** https://github.com/lumpinif/deepcrawl
                 - **Documentation:** https://deepcrawl.dev/docs
                 - **API Reference:** https://api.deepcrawl.dev/docs
                 - **Online Playground:** https://deepcrawl.dev/app
                 - **Author Twitter:** [@felixlu1018](https://twitter.com/felixlu1018)
                 
                 ---
                 
                 ## Conclusion
                 
                 Deepcrawl offers a compelling open-source alternative to Firecrawl with:
                 
                  **$0 TCO** for most use cases (vs $16-333/mo for Firecrawl)
                   **Full source code access** and self-hosting capability
                    **Modern tech stack** (Next.js 16, Cloudflare Workers, TypeScript)
                     **AI-optimized output** designed for LLM consumption
                      **Active development** with 318+ stars and growing community
                      
                      The main trade-off is stability - it's explicitly marked as not production-ready. For side projects, development, or cost-sensitive applications, it's an excellent choice. For mission-critical production workloads requiring SLA guarantees, evaluate carefully or wait for v1.0.
                      
                      ---
                      
                      *Research compiled: December 2024*
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment