Crawl4AI (version 0.3.73) is a powerful, open-source Python library tailored for large-scale web crawling and data extraction. It simplifies integration with Large Language Models (LLMs) and AI applications through robust, efficient, and flexible extraction techniques.
- Introduction
- Key Features
- Installation
- Basic Usage
- Advanced Features
- Extraction & Processing
- Advanced Examples
- REST API Usage
- Error Handling & Debugging
- Performance Tips
- Support & Troubleshooting
Crawl4AI makes large-scale web crawling and data extraction efficient, especially for dynamic and LLM-oriented data processing. It offers multi-browser support (Chromium, Firefox, WebKit), high-performance extraction capabilities, and sophisticated content handling. This guide walks you through using Crawl4AI to maximize efficiency and data quality.
- High Performance: Built for speed, leveraging asynchronous architecture and optimized for multi-page, multi-URL extractions.
- Flexible Output: Generate JSON, cleaned HTML, and LLM-friendly markdown for seamless integration with AI models.
- Comprehensive Media Extraction: Extracts images, audio, videos, links, and more.
- Customizable Hooks: Modify headers, user-agents, JavaScript execution, and custom pre/post-processing scripts.
- Advanced Strategies: Use built-in chunking methods and clustering algorithms, including LLM-based extraction.
- Multi-Browser Support: Crawl with Chromium, Firefox, or WebKit for best-in-class web rendering.
- Enhanced Image Processing: Automatic detection of lazy-loaded images.
- Proxy & Security: Manage anonymity, security, and restricted web access seamlessly.
- Error Handling: Improved recovery for failed fetches, with screenshot and logging options.
- LLM Extraction: Integrate with LLMs using providers like OpenAI or Groq for semantic analysis.
- Magic Mode: Automated configuration for common crawling scenarios, making setups easier.
Install Crawl4AI via pip:
pip install crawl4ai
For advanced media and browser support, install additional dependencies:
pip install crawl4ai[full]
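The asynchronous crawler drives real browsers through Playwright, so depending on your environment you may also need to install the browser binaries once (skip this step if your setup already provides them):
playwright install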
Here’s a basic example of using Crawl4AI asynchronously:
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown[:500])  # Print the first 500 characters

asyncio.run(main())
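Beyond markdown, the returned result object exposes the other outputs mentioned above (cleaned HTML, media, links, metadata). A short sketch of inspecting them; the links field is assumed to follow the same dict-of-lists shape as media:
import asyncio
from crawl4ai import AsyncWebCrawler

async def inspect_result():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.success)                         # True if the crawl succeeded
        print(result.metadata.get("title"))           # Page metadata
        print(len(result.cleaned_html))               # Sanitized HTML
        print(len(result.markdown))                   # LLM-friendly markdown
        print(len(result.media.get("images", [])))    # Extracted images
        print(len(result.links.get("external", [])))  # Extracted links (assumed key)

asyncio.run(inspect_result())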
Crawl multiple URLs concurrently with ease:
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        urls = [
            "https://python.org",
            "https://github.com",
            "https://stackoverflow.com",
            "https://news.ycombinator.com"
        ]
        results = await crawler.arun_many(
            urls=urls,
            word_count_threshold=100,
            bypass_cache=True,
            verbose=True
        )
        for result in results:
            if result.success:
                print(f"Successfully crawled: {result.url}")
                print(f"Title: {result.metadata.get('title', 'N/A')}")
                print(f"Number of images: {len(result.media.get('images', []))}")
                print("---")
            else:
                print(f"Failed to crawl: {result.url}")
                print(f"Error: {result.error_message}")

asyncio.run(main())
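To drill into individual media items and links, each collection can be iterated directly; the entry keys below (src, alt, href) are assumptions that may differ slightly between versions:
import asyncio
from crawl4ai import AsyncWebCrawler

async def list_media_and_links():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business", bypass_cache=True)
        if result.success:
            # Assumed entry keys; adjust if your version reports different fields
            for image in result.media.get("images", [])[:5]:
                print(image.get("src"), image.get("alt"))
            for link in result.links.get("external", [])[:5]:
                print(link.get("href"))

asyncio.run(list_media_and_links())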
Emulate different browsers or access restricted pages:
async with AsyncWebCrawler(
    headers={"User-Agent": "CustomBot"},
    timeout=15
) as crawler:
    result = await crawler.arun(url="https://example.com")
Maintain sessions across multiple requests, ideal for sites requiring authentication:
async with AsyncWebCrawler() as crawler:
    # Reuse the same session_id so cookies and state persist across requests
    login_page = await crawler.arun(url="https://example.com/login", session_id="auth_session")
    # Subsequent arun calls with session_id="auth_session" share the same session
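As a rough sketch of how that might look end to end, the snippet below fills a hypothetical login form with js_code and then reuses the same session_id for an authenticated page; the selectors and URLs are placeholders:
import asyncio
from crawl4ai import AsyncWebCrawler

async def login_and_crawl():
    async with AsyncWebCrawler(verbose=True) as crawler:
        session_id = "auth_session"
        # Hypothetical login form; fill and submit it with js_code
        login_js = """
        document.querySelector('#username').value = 'user';
        document.querySelector('#password').value = 'pass';
        document.querySelector('form').submit();
        """
        await crawler.arun(
            url="https://example.com/login",
            session_id=session_id,
            js_code=login_js,
            bypass_cache=True
        )
        # The same session_id carries the authenticated cookies forward
        result = await crawler.arun(
            url="https://example.com/account",
            session_id=session_id,
            bypass_cache=True
        )
        print(result.markdown[:200])

asyncio.run(login_and_crawl())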
Utilize proxies for anonymity or to bypass restrictions:
async with AsyncWebCrawler(
    proxy="http://proxyserver:port",
    proxy_auth=("username", "password")
) as crawler:
    result = await crawler.arun(url="https://restricted-site.com")
Enable Magic Mode to simplify crawling with smart defaults:
async with AsyncWebCrawler() as crawler:
    # Magic Mode auto-configures settings for efficient extraction
    result = await crawler.arun(url="https://example.com", magic=True)
Execute JavaScript on dynamic pages:
async with AsyncWebCrawler() as crawler:
    js_code = [
        "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
    ]
    result = await crawler.arun(
        url="https://www.nbcnews.com/business",
        js_code=js_code,
        wait_for="article.tease-card:nth-child(10)",  # Wait for a specific element
        bypass_cache=True
    )
    print(result.markdown[:500])  # Print the first 500 characters
Customize hooks to authenticate or modify requests; for example, set a hook on the crawler strategy to attach headers before each navigation:
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy

async def add_auth_header(page, **kwargs):
    # Attach an Authorization header before each navigation
    await page.set_extra_http_headers({"Authorization": "Bearer TOKEN"})

crawler_strategy = AsyncPlaywrightCrawlerStrategy()
crawler_strategy.set_hook("before_goto", add_auth_header)
async with AsyncWebCrawler(crawler_strategy=crawler_strategy) as crawler:
    result = await crawler.arun(url="https://example.com")
Customize browser settings for optimal crawling:
async with AsyncWebCrawler(
    browser_type="firefox",  # or "chromium" / "webkit"
    headless=False
) as crawler:
    # Configure specific browser settings, then crawl
    result = await crawler.arun(url="https://example.com")
Extract elements using CSS selectors:
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://www.nbcnews.com/business",
        css_selector=".wide-tease-item__description",
        bypass_cache=True
    )
    print(result.markdown[:500])  # Print the first 500 characters
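For structured, repeatable fields you can also describe a schema and let a CSS-based strategy return JSON. The sketch below assumes JsonCssExtractionStrategy is available in your installed version; the teaser-card selectors are illustrative placeholders:
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_teasers():
    # Hypothetical schema; the selectors are not guaranteed to match the live page
    schema = {
        "name": "News Teasers",
        "baseSelector": ".wide-tease-item__wrapper",
        "fields": [
            {"name": "headline", "selector": ".wide-tease-item__headline", "type": "text"},
            {"name": "description", "selector": ".wide-tease-item__description", "type": "text"},
        ],
    }
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=JsonCssExtractionStrategy(schema),
            bypass_cache=True
        )
        if result.success and result.extracted_content:
            print(json.loads(result.extracted_content)[:3])  # First few extracted items

asyncio.run(extract_teasers())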
Organize content with the built-in chunking strategies, for example topic-based segmentation passed to arun:
from crawl4ai.chunking_strategy import TopicSegmentationChunking
result = await crawler.arun(url="https://example.com", chunking_strategy=TopicSegmentationChunking())
Group content semantically with the cosine-similarity clustering strategy:
from crawl4ai.extraction_strategy import CosineStrategy
result = await crawler.arun(url="https://example.com", extraction_strategy=CosineStrategy(semantic_filter="technology"))
Customize extractions using LLMs for context-aware output. The request-style payload below names the extraction strategy and its arguments:
import os

data = {
    "urls": ["https://www.nbcnews.com/business"],
    "extraction_strategy": "LLMExtractionStrategy",
    "extraction_strategy_args": {
        "provider": "groq/llama3-8b-8192",
        "api_token": os.getenv("GROQ_API_KEY"),
        "instruction": "Extract financial news and translate into French."
    },
}
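Equivalently, the strategy can be built in Python and passed straight to arun; a minimal sketch reusing the provider, token, and instruction shown above:
import os
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def extract_with_llm():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=LLMExtractionStrategy(
                provider="groq/llama3-8b-8192",
                api_token=os.getenv("GROQ_API_KEY"),
                instruction="Extract financial news and translate into French."
            ),
            bypass_cache=True
        )
        if result.success:
            print(result.extracted_content[:500])

asyncio.run(extract_with_llm())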
Use Pydantic models to structure extracted content precisely:
from pydantic import BaseModel, Field
import os
import json
from crawl4ai import WebCrawler, LLMExtractionStrategy

class PageSummary(BaseModel):
    title: str = Field(..., description="Page title")
    summary: str = Field(..., description="Detailed page summary")
    keywords: list[str] = Field(..., description="List of keywords")

crawler = WebCrawler()
result = crawler.run(
    url="https://example.com",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv("OPENAI_API_KEY"),
        schema=PageSummary.model_json_schema()
    )
)
page_summary = json.loads(result.extracted_content)
print(page_summary)
Crawl pages that load content dynamically:
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_dynamic_page():
    async with AsyncWebCrawler() as crawler:
        js_code = [
            "const loadMoreButton = document.querySelector('button.load-more'); loadMoreButton && loadMoreButton.click();"
        ]
        result = await crawler.arun(
            url="https://example.com",
            js_code=js_code,
            wait_for="document.querySelector('.loaded-content')"
        )
        print(result.markdown[:500])  # Print the first 500 characters

asyncio.run(crawl_dynamic_page())
Efficiently handle multiple pages of content using JavaScript:
import asyncio
from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler

async def crawl_multi_page_content():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []

        # JavaScript to click "Next" and load the following page
        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                js_code=js_next_page if page > 0 else None,
                bypass_cache=True,
                headless=True,
            )
            assert result.success, f"Failed to crawl page {page + 1}"

            soup = BeautifulSoup(result.cleaned_html, "html.parser")
            commits = soup.select("li")
            all_commits.extend(commits)
            print(f"Page {page + 1}: Found {len(commits)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Total commits found across 3 pages: {len(all_commits)}")

asyncio.run(crawl_multi_page_content())
Use Crawl4AI’s REST API for integration into other services:
import requests

data = {
    "urls": ["https://www.nbcnews.com/business"],
    "screenshot": True
}
response = requests.post("https://crawl4ai.com/crawl", json=data)
result = response.json()["results"][0]
print(result.keys())
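Continuing with the result above, the screenshot (when requested) typically comes back base64-encoded; the field names here are assumptions based on the crawl result fields used earlier in this guide:
import base64

# Inspect the markdown and persist the screenshot, if present
print(result.get("markdown", "")[:500])
screenshot_b64 = result.get("screenshot")
if screenshot_b64:
    with open("screenshot.png", "wb") as f:
        f.write(base64.b64decode(screenshot_b64))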
- Timeout Errors: Increase the timeout setting via the `timeout` parameter on `AsyncWebCrawler`.
- Verbose Logging: Use `verbose=True` for detailed logs and insight into what Crawl4AI is doing.
- Retries: Implement retries for resilience in unstable network conditions (see the sketch below).
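A minimal retry sketch using only parameters shown earlier in this guide; the attempt count and backoff are illustrative:
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_with_retries(url: str, max_attempts: int = 3):
    async with AsyncWebCrawler(verbose=True) as crawler:
        for attempt in range(1, max_attempts + 1):
            result = await crawler.arun(url=url, bypass_cache=True)
            if result.success:
                return result
            print(f"Attempt {attempt} failed: {result.error_message}")
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff
    return None

result = asyncio.run(crawl_with_retries("https://example.com"))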
- Use Asynchronous Crawling: Maximize efficiency with `AsyncWebCrawler`.
- Optimize Timeout Settings: Adjust `timeout` to balance speed and reliability.
- Lazy-Loading Images: Enable lazy-loading detection to ensure complete image extraction.
- Cache Management: Use `bypass_cache` to force fresh content fetching when necessary (see the caching sketch below).
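To see the cache at work, a small sketch comparing a cached run with a forced refresh; timings will vary by site and environment:
import time
import asyncio
from crawl4ai import AsyncWebCrawler

async def compare_cache_behaviour(url: str):
    async with AsyncWebCrawler() as crawler:
        # First run populates the local cache
        start = time.perf_counter()
        await crawler.arun(url=url)
        print(f"Initial fetch: {time.perf_counter() - start:.2f}s")

        # Second run can be served from cache
        start = time.perf_counter()
        await crawler.arun(url=url)
        print(f"Cached fetch: {time.perf_counter() - start:.2f}s")

        # Force a fresh fetch when the page may have changed
        start = time.perf_counter()
        await crawler.arun(url=url, bypass_cache=True)
        print(f"Bypassed cache: {time.perf_counter() - start:.2f}s")

asyncio.run(compare_cache_behaviour("https://example.com"))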
- Common Issues: Review connection errors, timeout settings, and ensure correct proxy configuration.
- Debugging: Use `verbose=True` for detailed logs and insights.
- Community Support: Post issues on GitHub or join our Twitter community at @unclecode.
Maximize Crawl4AI’s potential with advanced extraction and LLM integration for comprehensive, scalable web data processing.