Repository: github.com/lumpinif/deepcrawl Documentation: deepcrawl.dev/docs License: MIT (100% open source) Author: Felix Lu (@felixLu) Stars: 318+ | Forks: 35
Deepcrawl is a 100% free, open-source alternative to Firecrawl - an agents-oriented website data context extraction platform designed to make any website data AI-ready.
Deepcrawl extracts:
- Cleaned markdown of page content
-
- Hierarchical links tree (agent-favored navigation structure)
-
- Page metadata that LLMs can digest with minimal token cost
The goal is to reduce context switching and hallucination when feeding web content to AI agents.
| Component | Technology | Purpose |
|---|---|---|
| API Workers | Cloudflare Workers | Edge-first fetching, normalization, link graph generation |
| Dashboard | Next.js 16 App Router | UI for crawls, API playground, logs, key management |
| Routing/Middleware | Hono | Lightweight edge-first framework |
| RPC Layer | oRPC | Contract-first RPC generating OpenAPI + SDK from single source |
| Auth | better-auth | Enterprise-grade session and API key management |
| Schemas | Zod v4 | Type safety across workers, SDKs, and docs |
| Database | Drizzle ORM | Supports both Neon PostgreSQL and Cloudflare D1 SQLite |
| Caching | Upstash Redis + Cloudflare KV | Layered caching for performance/cost balance |
| UI | shadcn/ui + Tailwind CSS v4 | Accessible component system |
| State Management | nuqs | URL state serialization for shareable configurations |
| Data Fetching | React Query | Server-side prefetching and cache sync |
| Build | Turborepo (monorepo) | Shared packages, database modules, scripts |
| Bundler | tsdown (Rolldown-based) | Rust-powered bundling |
| Linting | Biome | Unified formatter and linter |
- TypeScript: 94.5%
-
- MDX: 4.8%
-
- Other: 0.7%
Fast endpoint returning prompt-ready markdown. Ideal for caching snippets or feeding LLM prompts.
curl \
-H "Authorization: Bearer $DEEPCRAWL_API_KEY" \
"https://api.deepcrawl.dev/read?url=https://example.com"
```
### 2. `readUrl` (POST /read)
Full operation with metadata, cleaned HTML, markdown, robots, sitemap data, and metrics in one payload.
### 3. `extractLinks` (POST /links)
Configurable crawl endpoint that builds agent-navigable links tree, filters external/media links.
### 4. `getLinks` (GET /links)
Quick fetch of extracted links for a page.
---
## SDK Usage
```typescript
import { DeepcrawlApp } from 'deepcrawl';
const deepcrawl = new DeepcrawlApp({
apiKey: process.env.DEEPCRAWL_API_KEY as string,
});
// Simple markdown extraction
const markdown = await deepcrawl.getMarkdown('https://example.com');
// With options
const markdown = await deepcrawl.getMarkdown('https://example.com', {
cacheOptions: { expirationTtl: 3600 },
cleaningProcessor: 'cheerio-reader', // or 'html-rewriter'
});
```
---
## Total Cost of Ownership (TCO) Analysis
### Deepcrawl Self-Hosted Costs
| Component | Free Tier | Paid Tier |
|-----------|-----------|-----------|
| **Software** | $0 (MIT license) | $0 |
| **Cloudflare Workers** | $0/mo (100k req/day) | $5/mo + $0.30/M requests |
| **Cloudflare KV** | Included | Included |
| **Cloudflare R2** | Zero egress | Zero egress |
| **Vercel (Dashboard)** | $0/mo (Hobby) | $20/mo (Pro) |
| **Database (D1/Neon)** | Free tier available | Variable |
**Typical TCO: $0/month** for most use cases
### Firecrawl Comparison (Commercial)
| Plan | Monthly Cost | Credits | Cost/1K pages |
|------|-------------|---------|---------------|
| Free | $0 (one-time) | 500 | N/A |
| Hobby | $16/mo | 3,000 | $5.33 |
| Standard | $83/mo | 100,000 | $0.83 |
| Growth | $333/mo | 500,000 | $0.67 |
### Cost Savings Summary
| Usage Level | Firecrawl Cost | Deepcrawl Cost | Savings |
|-------------|---------------|----------------|---------|
| 3,000 pages/mo | $16/mo | $0 | $192/year |
| 100,000 pages/mo | $83/mo | $0-5/mo | $936-996/year |
| 500,000 pages/mo | $333/mo | $5-20/mo | $3,756-3,936/year |
---
## Key Features & Advantages
### Performance
- **5-10x faster** than alternatives on standard HTML parsing
- No headless browser overhead for simple content
- Edge-native V8 Workers return responses in milliseconds
- Dynamic cache controls reduce redundant crawls
### AI Optimization
- **Markdown-first output** removes ads, scripts, boilerplate
- **Token-efficient formatting** cuts prompt costs
- **Link tree intelligence** maps site topology for agent navigation
### Developer Experience
- Fully typed TypeScript SDK
- Zod schemas plug directly into AI frameworks (ai-sdk, LangChain)
- OpenAPI and oRPC endpoints work across any runtime
- Consistent REST API across curl, Python, serverless functions
### Infrastructure
- Cloudflare Workers run globally, minimizing latency
- CDN-backed responses for consistent worldwide performance
- Automatic retries for flaky sites
---
## Deployment Options
### Option 1: Use Hosted API (Free)
- Sign up at deepcrawl.dev
- Get API key from dashboard
- Use SDK or REST API
### Option 2: Self-Host (Full Control)
```bash
# Clone repo
git clone https://github.com/lumpinif/deepcrawl.git
# Deploy dashboard to Vercel
# Deploy workers to Cloudflare
# Configure database (D1 or Neon PostgreSQL)
```
Full self-hosting guide: [deepcrawl.dev/docs/reference/self-hosting](https://deepcrawl.dev/docs/reference/self-hosting)
---
## Integration Examples
### With Vercel AI SDK
```typescript
import { DeepcrawlApp } from 'deepcrawl';
import { generateText } from 'ai';
const deepcrawl = new DeepcrawlApp({ apiKey: process.env.DEEPCRAWL_API_KEY });
const markdown = await deepcrawl.getMarkdown('https://example.com');
const result = await generateText({
model: yourModel,
prompt: `Summarize this page:\n\n${markdown}`,
});
```
### With LangChain
```typescript
const markdown = await deepcrawl.getMarkdown(url);
const docs = [new Document({ pageContent: markdown })];
// Use with your LangChain pipeline
```
---
## Limitations & Considerations
**Active Development Warning:**
> DO NOT USE DEEPCRAWL IN PRODUCTION RIGHT NOW AS IT IS SUBJECT TO CHANGE AND STILL UNDER RAPID DEVELOPMENT. USE WITH YOUR OWN RISK!
### Current Limitations
- APIs and SDKs subject to change
- Many planned features still in development
- Early-stage stability
- Community support (no enterprise SLA)
### When to Choose Firecrawl Instead
- Need production SLA guarantees
- Require enterprise support
- Don't want to manage infrastructure
- Need features not yet in Deepcrawl
---
## Quick Start
```bash
# Install SDK
npm install deepcrawl
# Set API key
export DEEPCRAWL_API_KEY="dc_sk_live_your_key"
```
```typescript
import { DeepcrawlApp } from 'deepcrawl';
const dc = new DeepcrawlApp({
apiKey: process.env.DEEPCRAWL_API_KEY
});
// Get markdown from any URL
const content = await dc.getMarkdown('https://news.ycombinator.com');
console.log(content);
```
---
## Resources
- **GitHub:** https://github.com/lumpinif/deepcrawl
- **Documentation:** https://deepcrawl.dev/docs
- **API Reference:** https://api.deepcrawl.dev/docs
- **Online Playground:** https://deepcrawl.dev/app
- **Author Twitter:** [@felixlu1018](https://twitter.com/felixlu1018)
---
## Conclusion
Deepcrawl offers a compelling open-source alternative to Firecrawl with:
**$0 TCO** for most use cases (vs $16-333/mo for Firecrawl)
**Full source code access** and self-hosting capability
**Modern tech stack** (Next.js 16, Cloudflare Workers, TypeScript)
**AI-optimized output** designed for LLM consumption
**Active development** with 318+ stars and growing community
The main trade-off is stability - it's explicitly marked as not production-ready. For side projects, development, or cost-sensitive applications, it's an excellent choice. For mission-critical production workloads requiring SLA guarantees, evaluate carefully or wait for v1.0.
---
*Research compiled: December 2024*