Objective: Develop an MCP-compatible service (mcp-doc-retriever) using FastAPI, packaged in a Docker container. This service will recursively download website content starting from a given URL, storing files in a mirrored site structure (like wget). It uses requests primarily, with automatic Playwright fallback for detected JS-heavy pages. It avoids re-downloading existing files by path unless forced (--force), handles common errors, and provides a two-phase search capability across the downloaded content: a fast keyword scan on decoded text, followed by precise selector-based text extraction on candidate pages identified via an index file.
Core Components:
- API Interface: FastAPI server (/download endpoint, /search endpoint, /health).
- Downloader Engine: Handles recursive fetching, requests/Playwright auto-fallback, `--no-clobber` (path-based) / `--force` logic, error handling, saving files in a mirror structure, and maintaining an index file per download job. Must be asynchronous with concurrency limits.
- Search Engine: Implements the two-phase search using the download job's index file:
- Fast keyword scan on decoded text content of candidate files.
- Precise text extraction using BeautifulSoup (CSS selectors) on candidate files.
- Input/Output Models: Pydantic models (DownloadRequest, DownloadStatus, SearchRequest, SearchResultItem, SearchResponse, IndexRecord).
- Containerization: Dockerfile (including playwright install) and docker-compose.yml with volume mapping.
- Packaging: Standard pyproject.toml and uv for dependency management.
- File Storage:
  - Content: `/app/downloads/content/{hostname}/{path}/(unknown).html` (mirrored structure).
  - Index: `/app/downloads/index/{download_id}.jsonl` (maps URLs to local paths, content MD5, status).
- Documentation: Sparse download of library docs, specific header comments, inline examples, lessons_learned.json.
- File Size Limit: <= 500 lines per Python file where feasible.
Recovery Plan (If Session Crashes):
- Identify Last Completed Task: Review this task plan document ([X] marker).
- Identify Next Task: Resume development at the next task marked [ ].
- Restart Process: Relaunch the development environment. Ensure Docker (`docker compose up -d`) is running if needed.
- Continue Development: Proceed with the next task.
Task Plan Visualization
gantt
dateFormat YYYY-MM-DD
title MCP Document Retriever Task Plan
section Phase 1: Setup
Initialize Project Structure & Packaging : P1T1, 2025-04-08, 1d
Add Basic Dependencies : P1T2, after P1T1, 1d
Define Pydantic Models : P1T3, after P1T2, 1d
Create Basic FastAPI App & URL Utils : P1T4, after P1T3, 1d
Phase 1 Verification & Demo : P1T5, after P1T4, 1d
section Phase 2: Downloader
Implement Core Downloading (Requests) : P2T1, after P1T5, 2d
Implement Recursion & Indexing : P2T2, after P2T1, 2d
Implement Playwright Fallback : P2T3, after P2T2, 1d
Implement Error Handling & Checks : P2T4, after P2T3, 1d
Implement Concurrency Control : P2T5, after P2T4, 1d
Create Downloader Test Script : P2T6, after P2T5, 1d
Phase 2 Verification & Demo : P2T7, after P2T6, 1d
section Phase 3: Searcher
Implement Fast Keyword Scan : P3T1, after P2T7, 1d
Implement Precise Selector Extraction : P3T2, after P3T1, 1d
Integrate Two-Phase Search Logic : P3T3, after P3T2, 1d
Create Searcher Test Script : P3T4, after P3T3, 1d
Phase 3 Verification & Demo : P3T5, after P3T4, 1d
section Phase 4: API & Containerization
Integrate Downloader with /download : P4T1, after P3T5, 1d
Integrate Searcher with /search : P4T2, after P4T1, 1d
Create Dockerfile : P4T3, after P4T2, 1d
Create docker-compose.yml : P4T4, after P4T3, 1d
Create API Test Script : P4T5, after P4T4, 1d
Phase 4 Verification & Demo : P4T6, after P4T5, 1d
section Phase 5: Finalization
Perform End-to-End Testing : P5T1, after P4T6, 1d
Create/Finalize README.md : P5T2, after P5T1, 1d
Define MCP Config Example : P5T3, after P5T2, 1d
Phase 5 Verification & Demo : P5T4, after P5T3, 1d
Final Task Plan: mcp-doc-retriever Development
Phase 1: Project Setup & Core Dependencies
- Task 1.1: Initialize Project Structure & Packaging
  - Action: Create directory structure (`src/mcp_doc_retriever`, `scripts`, `repo_docs`, `src/mcp_doc_retriever/docs`), `pyproject.toml`, `.gitignore`. Initialize `uv`. Create placeholder `src/mcp_doc_retriever/docs/lessons_learned.json`.
  - Deliverable: Project directory structure, `pyproject.toml`, `lessons_learned.json`.
- Task 1.2: Add Basic Dependencies
  - Action: Add FastAPI, Uvicorn, Pydantic, Requests, BeautifulSoup4, lxml, Playwright using `uv add`. (Playwright browser install is handled in the Dockerfile.)
  - Deliverable: Updated `pyproject.toml` and `uv.lock`. Confirmation of dependency installation locally (if tested).
- Task 1.3: Define Pydantic Models
  - Action: Create `src/mcp_doc_retriever/models.py`. Define API models (`DownloadRequest`, `DownloadStatus`, `SearchRequest`, `SearchResultItem`, `SearchResponse`). Define the internal `IndexRecord` model (fields: `original_url`, `canonical_url`, `local_path`, `content_md5`, `fetch_status`, `http_status`, `error_message`). Ensure `DownloadStatus` includes `download_id`.
  - Deliverable: `src/mcp_doc_retriever/models.py` with defined models.
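The models named above can be sketched roughly as follows. This is a minimal illustration, not the final `models.py`: defaults and optionality are assumptions beyond the field names listed in the task.

```python
# Sketch of src/mcp_doc_retriever/models.py (field names from the plan;
# defaults and Optional markers are assumptions).
from typing import List, Optional
from pydantic import BaseModel


class DownloadRequest(BaseModel):
    url: str
    depth: int = 1
    force: bool = False
    use_playwright: bool = False


class DownloadStatus(BaseModel):
    status: str
    download_id: str


class SearchRequest(BaseModel):
    download_id: str
    scan_keywords: List[str]
    selector: str
    extract_keywords: Optional[List[str]] = None


class SearchResultItem(BaseModel):
    original_url: str
    extracted_content: str


class SearchResponse(BaseModel):
    results: List[SearchResultItem]


class IndexRecord(BaseModel):
    original_url: str
    canonical_url: str
    local_path: str
    content_md5: Optional[str] = None
    fetch_status: str
    http_status: Optional[int] = None
    error_message: Optional[str] = None
```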
- Task 1.4: Create Basic FastAPI App & URL Utils
  - Action: Create `src/mcp_doc_retriever/main.py` with a basic app instance, placeholder endpoints (`/download`, `/search`), and `/health`. Create `src/mcp_doc_retriever/utils.py` with functions for: URL canonicalization (lowercase scheme/host, remove default ports/fragments), generating `download_id` (MD5 of the canonical start URL), and generating the local file path from a canonical URL.
  - Deliverable: `main.py` with basic app/placeholders; `utils.py` with helper functions.
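The `utils.py` helpers can be sketched as below. This is a minimal sketch under stated assumptions: the exact canonicalization rules and path-mapping scheme are not fixed by the plan, and `index.html` as the default filename is an assumption.

```python
# Sketch of src/mcp_doc_retriever/utils.py helpers (rules are assumptions).
import hashlib
import os
from urllib.parse import urlparse, urlunparse


def canonicalize_url(url: str) -> str:
    """Lowercase scheme/host, drop default ports and fragments."""
    p = urlparse(url)
    netloc = p.hostname.lower() if p.hostname else ""
    if p.port and not (
        (p.scheme.lower() == "http" and p.port == 80)
        or (p.scheme.lower() == "https" and p.port == 443)
    ):
        netloc += f":{p.port}"
    return urlunparse((p.scheme.lower(), netloc, p.path or "/", p.params, p.query, ""))


def generate_download_id(start_url: str) -> str:
    """MD5 of the canonical start URL; names the per-job index file."""
    return hashlib.md5(canonicalize_url(start_url).encode()).hexdigest()


def url_to_local_path(base_dir: str, canonical_url: str) -> str:
    """Map a canonical URL to a mirrored path under base_dir/content/."""
    p = urlparse(canonical_url)
    path = p.path if p.path and p.path != "/" else "/index.html"
    if path.endswith("/"):
        path += "index.html"  # assumed default filename for directory URLs
    return os.path.join(base_dir, "content", p.hostname or "unknown", path.lstrip("/"))
```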
- Task 1.5: Phase 1 Verification, Demo & Finalization
- Goal: Verify project setup, dependencies, models, basic FastAPI app, and utility functions. Demonstrate project structure, models, running basic FastAPI app showing endpoints, and testing utility functions (URL canonicalization, ID generation, path generation). Explain basic setup.
  - Actions: Review artifacts, perform demo, `git add .`, `git commit -m "Complete Phase 1: Project Setup & Basic FastAPI"`, `git tag v0.1-setup`, potentially update lessons learned.
  - Deliverable:
    - Project directory structure (`src/mcp_doc_retriever`, `scripts`, `repo_docs`, `src/mcp_doc_retriever/docs`) created.
    - `pyproject.toml` and `.gitignore` files created and configured. `uv` initialized and basic dependencies (FastAPI, Uvicorn, Pydantic, Requests, BeautifulSoup4, lxml, Playwright) added.
    - Placeholder `src/mcp_doc_retriever/docs/lessons_learned.json` created.
    - Basic FastAPI app in `main.py` runs and placeholder endpoints are accessible.
    - Demo showcasing project structure and basic FastAPI app running.
Phase 2: Downloader Implementation
- Task 2.1: Implement Core Downloading Logic (Requests) & Storage
  - Action: Create `src/mcp_doc_retriever/downloader.py`. Implement async function `fetch_single_url_requests`. Takes a canonical URL, target local path, and force flag. Uses `requests`. Handles basic connection/HTTP errors. Calculates the MD5 hash of the content. Saves content to the specified `local_path` (mirror structure) if the `--no-clobber` check (path existence) passes or `force=True`. Returns status, content MD5, and detected links.
  - Deliverable: `downloader.py` with `fetch_single_url_requests`, basic error handling, path-based clobber logic, mirror storage, and MD5 calculation.
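A hedged sketch of this fetch helper follows, shown synchronously for brevity where the plan calls for an async function. The return shape and the naive href regex are assumptions; the real implementation would extract links with a proper parser.

```python
# Sketch of fetch_single_url_requests: path-based no-clobber, MD5 of content,
# naive link detection. Return shape is an assumption.
import hashlib
import os
import re


def fetch_single_url_requests(url: str, local_path: str, force: bool = False) -> dict:
    """Fetch one URL; skip if local_path exists (no-clobber) unless force=True."""
    if os.path.exists(local_path) and not force:
        return {"status": "skipped", "content_md5": None, "links": []}
    try:
        import requests  # third-party dependency added in Task 1.2
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
    except Exception as exc:
        return {"status": "failed_request", "error": str(exc), "links": []}
    content = resp.content
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    with open(local_path, "wb") as f:
        f.write(content)  # mirror-structure path is computed by the caller
    # Naive link detection; a real implementation would parse the HTML.
    links = re.findall(r'href="([^"#]+)"', content.decode("utf-8", "replace"))
    return {
        "status": "success",
        "content_md5": hashlib.md5(content).hexdigest(),
        "links": links,
    }
```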
- Task 2.2: Implement Recursive Download Orchestration & Indexing
  - Action: Implement the main async orchestrator function `start_recursive_download`. Takes start URL, depth, force flag, and `download_id`. Creates/opens the index file (`/app/downloads/index/{download_id}.jsonl`). Manages a queue/set of visited canonical URLs. Calls the fetch function (initially `fetch_single_url_requests`). Extracts links, canonicalizes them, checks the domain/subdomain policy, checks depth, and checks `robots.txt` (a `utils.py` helper is needed). Appends an `IndexRecord` to the index file for each attempt (success/failure). Respects concurrency limits (see Task 2.5).
  - Deliverable: Updated `downloader.py` with recursive orchestration logic, domain/depth checking, a `robots.txt` checking helper call, and index file writing.
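The per-attempt index writing is a small, self-contained piece and can be sketched as below. The helper name `append_index_record` is an assumption; the one-JSON-object-per-line (`.jsonl`) format follows the plan.

```python
# Sketch of per-attempt index writing: one JSON line per fetch attempt,
# appended to {index_dir}/{download_id}.jsonl.
import json
import os


def append_index_record(index_dir: str, download_id: str, record: dict) -> str:
    """Append one IndexRecord as a JSON line; returns the index file path."""
    os.makedirs(index_dir, exist_ok=True)
    index_path = os.path.join(index_dir, f"{download_id}.jsonl")
    with open(index_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return index_path
```

Appending a line per attempt (rather than rewriting the file) keeps the index crash-safe: a partially completed job still leaves a usable record of everything fetched so far.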
- Task 2.3: Implement Playwright Fallback & Auto-Detection
  - Action: Implement async function `fetch_single_url_playwright`. Add logic to `start_recursive_download` (or a wrapper around the fetch functions) to: a) use Playwright if `use_playwright=True` from the API; b) automatically retry with Playwright if `fetch_single_url_requests` returns content matching the heuristic (e.g., <1024 chars AND contains `<div id="root">`/`<div id="app">` with little other body). Manage Playwright resources. Update the index record on retry/success/failure.
  - Deliverable: Updated `downloader.py` with `fetch_single_url_playwright` and automatic fallback logic based on the heuristic.
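The auto-fallback heuristic itself is simple enough to sketch. The threshold and marker strings follow the example in the task; both are tunable assumptions, not a definitive detector.

```python
# Sketch of the JS-heavy-page heuristic from Task 2.3: a small response body
# containing an SPA mount point suggests the page needs JS rendering.
def needs_playwright(html: str, min_chars: int = 1024) -> bool:
    """Return True if the requests result looks like an unrendered SPA shell."""
    if len(html) >= min_chars:
        return False
    lowered = html.lower()
    return '<div id="root">' in lowered or '<div id="app">' in lowered
```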
- Task 2.4: Implement Error Handling & Basic Checks
  - Action: Enhance the fetch functions and orchestration logic. Handle timeouts. Implement a basic login/paywall check heuristic (scan content for "Login", "Sign In", password fields) and update the index record status if detected. Log errors clearly. Ensure the fetch status (`success`, `failed_request`, `failed_robotstxt`, `failed_paywall`, etc.) is recorded in the index file.
  - Deliverable: Enhanced `downloader.py` with improved error handling and status reporting in the index file.
- Task 2.5: Implement Concurrency Control
  - Action: Integrate `asyncio.Semaphore` into the recursive download orchestration logic. Use one semaphore for `requests` calls (limit ~10) and a separate one for Playwright calls (limit ~2-4).
  - Deliverable: Updated `downloader.py` with semaphore-based concurrency limits.
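The two-semaphore pattern can be sketched as below (limits follow the plan's suggestions; the wrapper name is an assumption, and the sketch assumes Python 3.10+ so the semaphores can be created outside a running event loop).

```python
# Sketch of Task 2.5: separate concurrency limits for requests-based and
# Playwright-based fetches.
import asyncio

REQUESTS_SEMAPHORE = asyncio.Semaphore(10)
PLAYWRIGHT_SEMAPHORE = asyncio.Semaphore(3)


async def fetch_with_limit(url: str, use_playwright: bool = False) -> str:
    """Acquire the appropriate semaphore around the real fetch call."""
    sem = PLAYWRIGHT_SEMAPHORE if use_playwright else REQUESTS_SEMAPHORE
    async with sem:
        # Placeholder for the actual fetch from Tasks 2.1/2.3.
        await asyncio.sleep(0)
        return url
```

Keeping the Playwright limit much lower than the requests limit reflects the cost difference: each Playwright fetch holds a full browser context, while a requests fetch is a single socket.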
- Task 2.6: Create Downloader Test Script
  - Action: Create `scripts/test_download.py` to test the `start_recursive_download` function directly. Requires setting up mock responses or pointing to safe local test URLs. Test recursion depth, force/no-clobber, the basic Playwright trigger (manual flag), and index file creation/content.
  - Deliverable: `scripts/test_download.py`.
- Task 2.7: Phase 2 Verification, Demo & Finalization
  - Goal: Verify downloader logic, including recursion, indexing, fallback, concurrency, and error handling via code review and the test script. Demonstrate running `test_download.py`, showing mirror structure creation and index file content, and explaining the key logic paths (requests, Playwright fallback, concurrency limits).
  - Actions: Review code/tests, perform demo, `git add .`, `git commit -m "Complete Phase 2: Downloader Implementation"`, `git tag v0.2-downloader`, potentially update lessons learned.
  - Deliverable:
    - `downloader.py` created with `fetch_single_url_requests` implementing core downloading logic using `requests`.
    - Recursive download orchestration implemented in `start_recursive_download`, including domain/depth checking and `robots.txt` handling.
    - Playwright fallback logic implemented in `fetch_single_url_playwright` and integrated into the download process.
    - Basic error handling and checks (connection errors, paywall detection) implemented, with index record status updated.
    - Concurrency control using `asyncio.Semaphore` implemented for `requests` and Playwright.
    - `scripts/test_download.py` created to test downloader functionality.
    - Demo showcasing downloader functionality, including recursion, Playwright fallback (manual trigger), and index file creation.
Phase 3: Searcher Implementation
- Task 3.1: Implement Fast Keyword Scan
  - Action: Create `src/mcp_doc_retriever/searcher.py`. Implement function `scan_files_for_keywords`. Takes a list of local file paths and a list of scan keywords. For each path, opens the file, decodes the HTML content to text (handling encoding errors gracefully by logging/skipping), and performs a case-insensitive keyword search. Returns the list of paths that contain all keywords.
  - Deliverable: `searcher.py` with `scan_files_for_keywords` using decoded text.
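A minimal sketch of this fast-scan phase, assuming lenient UTF-8 decoding as the "graceful" encoding strategy (the real implementation might also try charset detection):

```python
# Sketch of Task 3.1: keep only paths whose decoded text contains ALL
# keywords, case-insensitively. Unreadable files are skipped.
from typing import List


def scan_files_for_keywords(paths: List[str], keywords: List[str]) -> List[str]:
    wanted = [k.lower() for k in keywords]
    hits = []
    for path in paths:
        try:
            with open(path, "r", encoding="utf-8", errors="replace") as f:
                text = f.read().lower()
        except OSError:
            continue  # unreadable file: log and skip in the real implementation
        if all(k in text for k in wanted):
            hits.append(path)
    return hits
```

This phase deliberately works on raw decoded text rather than parsed HTML: it only has to be cheap and high-recall, since the precise selector pass in Task 3.2 filters the candidates it produces.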
- Task 3.2: Implement Precise Selector Extraction (Text Only)
  - Action: Implement function `extract_text_with_selector`. Takes a single local file path, a CSS selector, and an optional list of extract keywords. Parses HTML with BeautifulSoup. Finds elements matching the selector. Extracts text content (`.get_text()`). If extract keywords are provided, filters results, keeping only those whose text contains all extract keywords (case-insensitive). Returns the list of extracted text snippets.
  - Deliverable: `searcher.py` with `extract_text_with_selector` returning text only.
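This extraction step can be sketched as follows (requires `beautifulsoup4`; the choice of `html.parser` over `lxml` and the `strip=True` call are assumptions):

```python
# Sketch of Task 3.2: CSS-selector text extraction with optional keyword filter.
from typing import List, Optional
from bs4 import BeautifulSoup


def extract_text_with_selector(
    path: str, selector: str, extract_keywords: Optional[List[str]] = None
) -> List[str]:
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    snippets = [el.get_text(strip=True) for el in soup.select(selector)]
    if extract_keywords:
        wanted = [k.lower() for k in extract_keywords]
        snippets = [s for s in snippets if all(k in s.lower() for k in wanted)]
    return snippets
```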
- Task 3.3: Integrate Two-Phase Search Logic & Index Lookup
  - Action: Create the main search function `perform_search`. Takes `download_id`, scan keywords, a selector, and extract keywords. Reads the index file (`/app/downloads/index/{download_id}.jsonl`). Filters records for `fetch_status='success'` to get the list of relevant `local_path` values. Calls `scan_files_for_keywords` with these paths. Calls `extract_text_with_selector` for each candidate path returned by the scan. Looks up `original_url` from the index file for each successful extraction. Structures the final results as a list of `SearchResultItem` (containing `original_url`, `extracted_content`).
  - Deliverable: Integrated two-phase `perform_search` logic in `searcher.py` using index file lookup.
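The orchestration of the two phases can be sketched as below. To keep the sketch self-contained, the Task 3.1/3.2 functions are passed in as `scan_fn`/`extract_fn`, and plain dicts stand in for `SearchResultItem`; the real `perform_search` would call the module-level functions and build Pydantic models.

```python
# Sketch of Task 3.3: index lookup + two-phase search.
import json


def perform_search(index_path, scan_keywords, selector, extract_keywords,
                   scan_fn, extract_fn):
    # Phase 0: read the job index, keeping only successful fetches.
    url_by_path, candidates = {}, []
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec.get("fetch_status") == "success":
                url_by_path[rec["local_path"]] = rec["original_url"]
                candidates.append(rec["local_path"])
    # Phase 1: fast keyword scan narrows the candidates.
    # Phase 2: precise selector extraction on the survivors.
    results = []
    for path in scan_fn(candidates, scan_keywords):
        for snippet in extract_fn(path, selector, extract_keywords):
            results.append({"original_url": url_by_path[path],
                            "extracted_content": snippet})
    return results
```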
- Task 3.4: Create Searcher Test Script
  - Action: Create `scripts/test_search.py`. Requires sample downloaded files and a corresponding sample index file (`.jsonl`). Test the `perform_search` function: ensure the index is read, the scan filters correctly, the selector extracts text, and results include the original URL.
  - Deliverable: `scripts/test_search.py`.
- Task 3.5: Phase 3 Verification, Demo & Finalization
  - Goal: Verify the search logic via code review and the test script. Demonstrate running `test_search.py` against sample data, showing index lookup, keyword scan filtering, selector-based text extraction, and correctly mapped original URLs in the results. Explain the two-phase flow using the index.
  - Actions: Review code/tests, perform demo, `git add .`, `git commit -m "Complete Phase 3: Searcher Implementation"`, `git tag v0.3-searcher`, potentially update lessons learned.
  - Deliverable:
    - `searcher.py` created with `scan_files_for_keywords` for fast keyword scanning.
    - `extract_text_with_selector` implemented for precise text extraction using CSS selectors and BeautifulSoup.
    - Two-phase search logic integrated in `perform_search`, utilizing index file lookup.
    - `scripts/test_search.py` created to test searcher functionality.
    - Demo showcasing search functionality, including keyword scan filtering, selector-based text extraction, and correct mapping of original URLs in results.
Phase 4: API Integration & Containerization
- Task 4.1: Integrate Downloader with /download Endpoint
  - Action: Update `main.py`. Implement the `/download` endpoint. Canonicalize the input URL and generate the `download_id`. Return `{"status": "started", "download_id": "..."}` immediately. Use FastAPI's `BackgroundTasks` (or a similar async pattern) to run `start_recursive_download` from `downloader.py` in the background, passing the necessary parameters (URL, depth, force, `download_id`). Handle immediate errors such as an invalid start URL format.
  - Deliverable: Functional `/download` endpoint triggering the background download task and returning immediate status.
- Task 4.2: Integrate Searcher with /search Endpoint
  - Action: Update `main.py`. Implement the `/search` endpoint. Takes a `SearchRequest`. Checks whether the index file for `download_id` exists. Calls `perform_search` from `searcher.py`. Returns a `SearchResponse` containing results or an appropriate error (e.g., 404 if the index is not found).
  - Deliverable: Functional `/search` endpoint calling the search logic.
- Task 4.3: Create Dockerfile
  - Action: Create the `Dockerfile`. Include Python setup, dependency install (`uv sync --frozen`), `RUN playwright install --with-deps` (installs browsers and needed OS libraries), copy the source code, set the workdir, define the volume mount point `/app/downloads`, expose port 8000, and set the entrypoint (`uvicorn`). Define `DOWNLOAD_BASE_DIR=/app/downloads`.
  - Deliverable: `Dockerfile`.
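A possible shape for this Dockerfile is sketched below. The base image, the `uv run` invocation style, and the Chromium-only browser install are all assumptions; only the elements named in the task (uv sync, playwright install, volume, port, env var, uvicorn entrypoint) come from the plan.

```dockerfile
# Sketch only; pin versions and adjust the uv invocation for the real build.
FROM python:3.11-slim
WORKDIR /app
ENV DOWNLOAD_BASE_DIR=/app/downloads

COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --frozen

# Installs browser binaries plus the OS libraries they need.
RUN uv run playwright install --with-deps chromium

COPY src/ ./src/
VOLUME /app/downloads
EXPOSE 8000
CMD ["uv", "run", "uvicorn", "mcp_doc_retriever.main:app", \
     "--host", "0.0.0.0", "--port", "8000"]
```

Ordering the dependency install before `COPY src/` keeps the expensive `uv sync` and Playwright layers cached across source-only changes.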
- Task 4.4: Create docker-compose.yml
  - Action: Create `docker-compose.yml`. Define the `mcp-doc-retriever` service. Build from context. Map ports (e.g., `8001:8000`). Define the volume mapping `download_data:/app/downloads` (using a named volume `download_data` is often better than a host path).
  - Deliverable: `docker-compose.yml`.
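A minimal compose file matching the task's description might look like this (sketch only; everything beyond the service name, port mapping, and named volume from the plan is an assumption):

```yaml
services:
  mcp-doc-retriever:
    build: .
    ports:
      - "8001:8000"
    volumes:
      - download_data:/app/downloads

volumes:
  download_data:
```

The named volume survives `docker compose down` and container rebuilds, so previously downloaded jobs remain searchable; a host bind mount would work too if direct file inspection from the host is preferred.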
- Task 4.5: Create API Test Script
  - Action: Create `scripts/test_api.py`. Make calls to `/download` (for a small, safe site), wait briefly (since a status endpoint is V2), then call `/search` using the returned `download_id`. Verify the expected structure of the search results. Test the `force` flag.
  - Deliverable: `scripts/test_api.py`.
- Task 4.6: Phase 4 Verification, Demo & Finalization
  - Goal: Verify API integration and containerization. Review the code, Dockerfile, compose file, and API test script results against the containerized service. Demonstrate `docker compose build`, `docker compose up`, and running `test_api.py` (calling `/download`, checking volume content via `docker exec` or the host mount, calling `/search`), explaining the container setup and volume persistence.
  - Actions: Review artifacts/tests, perform demo, `git add .`, `git commit -m "Complete Phase 4: API Integration & Containerization"`, `git tag v0.4-docker`, potentially update lessons learned.
  - Deliverable:
    - `/download` endpoint in `main.py` integrated with `downloader.py` to trigger background download tasks.
    - `/search` endpoint in `main.py` integrated with `searcher.py` to perform searches based on `download_id`.
    - `Dockerfile` created, including Python setup, dependency installation, Playwright browser installation, and volume mount point definition.
    - `docker-compose.yml` created to define the `mcp-doc-retriever` service with port mapping and volume definition.
    - `scripts/test_api.py` created to test API endpoints and basic containerized service functionality.
    - Demo showcasing the API endpoints working, the containerized service running via `docker compose`, and volume persistence.
Phase 5: Final Testing, Documentation & Finalization
- Task 5.1: Perform End-to-End Testing
  - Action: Enhance `scripts/test_api.py`. Test scenarios: download with depth, search misses, search with specific selectors, download failure (invalid URL), and `force=true` download.
  - Deliverable: Updated `test_api.py` and confirmation of successful execution.
- Task 5.2: Create/Finalize README.md
  - Action: Create/finalize `README.md` based on the updated structure/decisions. Detail features, setup, accurate API usage examples, key concepts (mirroring, index file, auto-fallback, two-phase search), MCP config, and Roomodes context. Include a Mermaid diagram.
  - Deliverable: Comprehensive `README.md`.
- Task 5.3: Define MCP Server Configuration Example
  - Action: Add a specific, accurate `mcp_settings.json` example to the README, showing the volume mapping (`download_data:/app/downloads`) and tool names (`doc_download`, `doc_search`).
  - Deliverable: MCP configuration example section in `README.md`.
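For orientation, an `mcp_settings.json` entry might take roughly the following shape. This is an illustrative assumption only: the exact schema depends on the MCP client in use, and the server name, command, and args below are placeholders, not the accurate example Task 5.3 is meant to produce.

```json
{
  "mcpServers": {
    "doc-retriever": {
      "command": "docker",
      "args": ["compose", "up", "mcp-doc-retriever"]
    }
  }
}
```

The final README example should additionally document the `download_data:/app/downloads` volume mapping and the exposed tool names `doc_download` and `doc_search`, matching whatever schema the target MCP client actually requires.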
- Task 5.4: Phase 5 Verification, Demo & Finalization
  - Goal: Review E2E tests and the README. Demonstrate running the enhanced `test_api.py`, walk through the finalized README explaining key sections and updated concepts, and showcase the final service functionality.
  - Actions: Review artifacts, perform demo, `git add .`, `git commit -m "Complete Phase 5: Final Testing & Documentation"`, `git tag v1.0-release`, potentially update lessons learned.
  - Deliverable:
    - Enhanced `scripts/test_api.py` with comprehensive end-to-end tests covering various scenarios (depth, search misses, selectors, failures, force flag).
    - Comprehensive `README.md` finalized, detailing features, setup, API usage, key concepts, MCP configuration, and Roomodes context, including a Mermaid diagram.
    - Accurate `mcp_settings.json` example added to `README.md` with volume mapping and tool definitions.
    - Demo showcasing end-to-end testing using `test_api.py` and the finalized documentation in `README.md`.