Crawl4AI Enhanced User Guide

Crawl4AI (version 0.3.73) is a powerful, open-source Python library tailored for large-scale web crawling and data extraction. It simplifies integration with Large Language Models (LLMs) and AI applications through robust, efficient, and flexible extraction techniques.


Table of Contents

  • Introduction
  • Key Features
  • Installation
  • Basic Usage
  • Advanced Features
  • Extraction & Processing
  • Advanced Examples
  • REST API Usage
  • Error Handling & Debugging
  • Performance Tips
  • Support & Troubleshooting

Introduction

Crawl4AI makes large-scale web crawling and data extraction efficient, especially for dynamic and LLM-oriented data processing. It offers multi-browser support (Chromium, Firefox, WebKit), high-performance extraction capabilities, and sophisticated content handling. This guide walks you through using Crawl4AI to maximize efficiency and data quality.

Key Features

  • High Performance: Built for speed, leveraging asynchronous architecture and optimized for multi-page, multi-URL extractions.
  • Flexible Output: Generate JSON, cleaned HTML, and LLM-friendly markdown for seamless integration with AI models.
  • Comprehensive Media Extraction: Extracts images, audio, videos, links, and more.
  • Customizable Hooks: Modify headers and user agents, execute custom JavaScript, and run pre/post-processing logic.
  • Advanced Strategies: Use built-in chunking methods and clustering algorithms, including LLM-based extraction.
  • Multi-Browser Support: Crawl with Chromium, Firefox, or WebKit for best-in-class web rendering.
  • Enhanced Image Processing: Automatic detection of lazy-loaded images.
  • Proxy & Security: Manage anonymity, security, and restricted web access seamlessly.
  • Error Handling: Improved recovery for failed fetches, with screenshot and logging options.
  • LLM Extraction: Integrate with LLMs using providers like OpenAI or Groq for semantic analysis.
  • Magic Mode: Automated configuration for common crawling scenarios, making setups easier.

Installation

Install Crawl4AI via pip:

pip install crawl4ai

For advanced media and browser support, install additional dependencies:

pip install crawl4ai[full]
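
Crawl4AI drives real browsers through Playwright, so if the Playwright browser binaries are not yet present on your machine you will likely also need to fetch them (a general Playwright step, not a Crawl4AI-specific command):

playwright install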

Basic Usage

Quick Start

Here’s a basic example of using Crawl4AI asynchronously:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown[:500])  # Print first 500 characters

asyncio.run(main())
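
arun returns a CrawlResult object, and later examples in this guide read several of its fields. A quick illustration (these lines assume they run inside main() above, right after arun returns):

print(result.success)                        # Whether the crawl succeeded
print(result.cleaned_html[:200])             # Sanitized HTML
print(result.metadata.get("title", "N/A"))   # Page metadata such as the title
print(len(result.media.get("images", [])))   # Extracted media (images, videos, audio)
print(result.extracted_content)              # Structured output when an extraction strategy is set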

Crawling Multiple URLs

Crawl multiple URLs concurrently with arun_many:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        urls = [
            "https://python.org",
            "https://github.com",
            "https://stackoverflow.com",
            "https://news.ycombinator.com"
        ]
        results = await crawler.arun_many(
            urls=urls,
            word_count_threshold=100,
            bypass_cache=True,
            verbose=True
        )
        for result in results:
            if result.success:
                print(f"Successfully crawled: {result.url}")
                print(f"Title: {result.metadata.get('title', 'N/A')}")
                print(f"Number of images: {len(result.media.get('images', []))}")
                print("---")
            else:
                print(f"Failed to crawl: {result.url}")
                print(f"Error: {result.error_message}")

asyncio.run(main())

Custom Headers & User-Agent

Emulate different browsers or access restricted pages:

async with AsyncWebCrawler(
    headers={"User-Agent": "CustomBot"},
    timeout=15
) as crawler:
    result = await crawler.arun(url="https://example.com")

Advanced Features

Session Management

Maintain sessions across multiple requests, ideal for sites requiring authentication:

async with AsyncWebCrawler() as crawler:
    login_page = await crawler.arun(url="https://example.com/login", session_id="auth_session")
    # Reuse the same session_id on later arun calls to keep cookies and state

Proxy & Security

Utilize proxies for anonymity or to bypass restrictions:

async with AsyncWebCrawler(
    proxy="http://proxyserver:port",
    proxy_auth=("username", "password")
) as crawler:
    result = await crawler.arun(url="https://restricted-site.com")

Magic Mode

Enable Magic Mode to simplify crawling with smart defaults:

async with AsyncWebCrawler() as crawler:
    # Magic Mode auto-configures settings for efficient extraction
    result = await crawler.arun(url="https://example.com", magic=True)

JavaScript Execution

Execute JavaScript on dynamic pages:

async with AsyncWebCrawler() as crawler:
    js_code = [
        "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
    ]
    result = await crawler.arun(
        url="https://www.nbcnews.com/business",
        js_code=js_code,
        wait_for="article.tease-card:nth-child(10)",  # Wait for specific element
        bypass_cache=True
    )
    print(result.markdown[:500])  # Print first 500 characters

Hooks & Authentication

Use hooks to authenticate or modify requests before navigation, for example by attaching a header in a before_goto hook:

async def add_auth_header(page):
    # Inject an Authorization header before the page is loaded
    await page.set_extra_http_headers({"Authorization": "Bearer TOKEN"})

async with AsyncWebCrawler() as crawler:
    crawler.crawler_strategy.set_hook("before_goto", add_auth_header)
    result = await crawler.arun(url="https://example.com")

Browser Configuration

Customize browser settings for optimal crawling:

async with AsyncWebCrawler(
    browser_type="firefox",
    headless=False
) as crawler:
    # Runs a visible Firefox instance instead of headless Chromium
    result = await crawler.arun(url="https://example.com")

Extraction & Processing

CSS-Based Extraction

Extract elements using CSS selectors:

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://www.nbcnews.com/business",
        css_selector=".wide-tease-item__description",
        bypass_cache=True
    )
    print(result.markdown[:500])  # Print first 500 characters

Chunking & Clustering Strategies

Chunking and clustering are configured by passing strategy objects to arun rather than by calling separate crawler methods. For example, chunk the page with a regex-based chunker and group the chunks by cosine similarity:

from crawl4ai.chunking_strategy import RegexChunking
from crawl4ai.extraction_strategy import CosineStrategy

result = await crawler.arun(
    url="https://www.nbcnews.com/business",
    chunking_strategy=RegexChunking(),
    extraction_strategy=CosineStrategy(semantic_filter="business finance"),
    bypass_cache=True
)
print(result.extracted_content)

Advanced Strategies with LLMs

Customize extractions using LLMs for context-aware output:

import os
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=LLMExtractionStrategy(
                provider="groq/llama3-8b-8192",
                api_token=os.getenv("GROQ_API_KEY"),
                instruction="Extract financial news and translate it into French."
            ),
            bypass_cache=True
        )
        print(result.extracted_content)

asyncio.run(main())

Schema-Based Extraction

Use Pydantic models to structure extracted content precisely:

from pydantic import BaseModel, Field
import os
import json
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class PageSummary(BaseModel):
    title: str = Field(..., description="Page title")
    summary: str = Field(..., description="Detailed page summary")
    keywords: list[str] = Field(..., description="List of keywords")

crawler = WebCrawler()
crawler.warmup()  # Load models/config before the first run
result = crawler.run(
    url="https://example.com",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv("OPENAI_API_KEY"),
        schema=PageSummary.model_json_schema(),
        extraction_type="schema",
        instruction="Summarize the page and return its title, a detailed summary, and keywords."
    )
)

page_summary = json.loads(result.extracted_content)
print(page_summary)

Advanced Examples

Dynamic Content Crawling

Crawl pages that load content dynamically:

import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_dynamic_page():
    async with AsyncWebCrawler() as crawler:
        js_code = [
            "const loadMoreButton = document.querySelector('button.load-more'); loadMoreButton && loadMoreButton.click();"
        ]
        result = await crawler.arun(
            url="https://example.com",
            js_code=js_code,
            wait_for=".loaded-content"  # Wait for the dynamically loaded element
        )
        print(result.markdown[:500])  # Print first 500 characters

asyncio.run(crawl_dynamic_page())

Multi-Page Crawling with JavaScript

Efficiently handle multiple pages of content using JavaScript:

import re
from bs4 import BeautifulSoup
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_multi_page_content():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []

        # JavaScript to click the "Next" pagination button
        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """
        
        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                js_code=js_next_page if page > 0 else None,
                bypass_cache=True,
                headless=True,
            )
            
            assert result.success, f"Failed to crawl page {page + 1}"
            soup = BeautifulSoup(result.cleaned_html, "html.parser")
            commits = soup.select("li")
            all_commits.extend(commits)

            print(f"Page {page + 1}: Found {len(commits)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Total commits found across 3 pages: {len(all_commits)}")

asyncio.run(crawl_multi_page_content())

REST API Usage

Use Crawl4AI’s REST API for integration into other services:

import requests

data = {
    "urls": ["https://www.nbcnews.com/business"],
    "screenshot": True
}
response = requests.post("https://crawl4ai.com/crawl", json=data)
result = response.json()["results"][0]
print(result.keys())

Error Handling & Debugging

  • Timeout Errors: If pages load slowly, increase the timeout value passed to AsyncWebCrawler.
  • Verbose Logging: Use verbose=True for detailed logs and insight into what Crawl4AI is doing.
  • Retries: Implement retries for resilience on unstable networks, as sketched below.
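
Crawl4AI surfaces failures through result.success and result.error_message, so a thin retry wrapper around arun is often enough. A minimal sketch (arun_with_retries, its max_retries/delay parameters, and the linear backoff are illustrative assumptions, not library features):

import asyncio
from crawl4ai import AsyncWebCrawler

async def arun_with_retries(crawler, url, max_retries=3, delay=2.0, **kwargs):
    # Illustrative helper: retry a single URL a few times before giving up
    result = None
    for attempt in range(1, max_retries + 1):
        result = await crawler.arun(url=url, **kwargs)
        if result.success:
            return result
        print(f"Attempt {attempt} failed for {url}: {result.error_message}")
        await asyncio.sleep(delay * attempt)  # simple linear backoff
    return result

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await arun_with_retries(crawler, "https://example.com", bypass_cache=True)
        print("Success:", result.success)

asyncio.run(main())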

Performance Tips

  • Use Asynchronous Crawling: Maximize efficiency with AsyncWebCrawler.
  • Optimize Timeout Settings: Adjust timeout to balance speed and reliability.
  • Lazy-Loading Images: Enable lazy-loading detection to ensure complete image extraction.
  • Cache Management: Use bypass_cache to force fresh content fetching when necessary.

Support & Troubleshooting

  • Common Issues: Review connection errors, timeout settings, and ensure correct proxy configuration.
  • Debugging: Use verbose=True for detailed logs and insights.
  • Community Support: Post issues on GitHub or join our Twitter community at @unclecode.

Maximize Crawl4AI’s potential with advanced extraction and LLM integration for comprehensive, scalable web data processing.
