@decagondev
Created September 10, 2025 16:15

MCP Servers for PDF Interaction with Cursor for Knowledge Base

This document outlines several Model Context Protocol (MCP) servers designed to interact with PDF files and integrate with Cursor IDE to create or query a knowledge base from PDF content. Each server is detailed with its capabilities, setup instructions, and use cases.

1. PDF Reader MCP Server

  • Description: Uses PyPDF2 to extract text and retrieve information from PDF documents for knowledge base applications. Supports both local and URL-based PDFs and returns standardized JSON output that Cursor can consume.
  • Capabilities:
    • Extracts text from various PDF formats.
    • Handles both local and remote PDFs.
    • Provides structured output for AI-driven queries.
  • Setup in Cursor:
    1. Create or edit ~/.cursor/mcp.json for global access or .cursor/mcp.json in your project directory.
    2. Add the following configuration:
      {
        "mcpServers": {
          "pdf-reader": {
            "command": "docker",
            "args": ["run", "-i", "--rm", "-v", "/path/to/pdfs:/pdfs", "mcp/pdf-reader"],
            "disabled": false,
            "autoApprove": []
          }
        }
      }
    3. Replace /path/to/pdfs with the actual path to your PDF files directory.
    4. Restart Cursor or refresh MCP settings (Settings > MCP > Refresh).
    5. Use commands like read_local_pdf to extract text for knowledge base queries.
  • Use Case: Ideal for extracting text from documentation or research papers to build a searchable knowledge base.
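
Editing mcp.json by hand makes it easy to overwrite servers that are already configured. The following is a minimal sketch, not part of any of the servers listed here, that merges the pdf-reader entry from step 2 into an existing file. The helper names add_mcp_server and pdf_reader_entry are hypothetical.

```python
import json
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, entry: dict) -> dict:
    """Merge one server entry into an mcp.json file, keeping existing servers."""
    config = {}
    if config_path.exists():
        # Reuse whatever is already configured instead of overwriting it.
        config = json.loads(config_path.read_text(encoding="utf-8"))
    config.setdefault("mcpServers", {})[name] = entry
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


def pdf_reader_entry(pdf_dir: str) -> dict:
    """Build the pdf-reader entry from step 2 for a given host directory."""
    return {
        "command": "docker",
        "args": ["run", "-i", "--rm", "-v", f"{pdf_dir}:/pdfs", "mcp/pdf-reader"],
        "disabled": False,
        "autoApprove": [],
    }
```

For a project-level setup, this could be called as add_mcp_server(Path(".cursor/mcp.json"), "pdf-reader", pdf_reader_entry("/path/to/pdfs")), with the placeholder path replaced as in step 3.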

2. PDF RAG MCP Server

  • Description: A document knowledge base system that combines PDF processing, vector storage, and semantic search (available on GitHub: hyson666/pdf-rag-mcp-server). Supports uploading, processing, and querying PDFs through a web interface.
  • Capabilities:
    • Uploads and processes PDFs, extracting and chunking content for vectorization.
    • Supports semantic search using a FAISS index.
    • Provides a React/Chakra UI web interface and WebSocket updates.
  • Setup in Cursor:
    1. Clone the repository: git clone https://github.com/hyson666/pdf-rag-mcp-server.
    2. Install dependencies: uv pip install -r requirements.txt.
    3. Configure environment variables for FAISS index or knowledge base directory.
    4. Add to ~/.cursor/mcp.json:
      {
        "mcpServers": {
          "pdf-rag": {
            "command": "python",
            "args": ["/path/to/pdf-rag-mcp-server/main.py"],
            "env": {
              "KNOWLEDGE_BASES_ROOT_DIR": "/path/to/knowledge_bases",
              "FAISS_INDEX_PATH": "/path/to/knowledge_bases/.faiss"
            }
          }
        }
      }
    5. Replace paths with your local directories.
    6. Start the server and refresh Cursor’s MCP settings.
    7. Use tools like retrieve_knowledge for semantic search.
  • Use Case: Perfect for advanced knowledge bases requiring semantic search across large PDF collections, such as technical manuals or academic papers.
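
The "chunking content for vectorization" step can be illustrated with a small sketch. This is a generic fixed-size chunker with overlap, written under the assumption that chunks are measured in characters. It is not taken from the PDF RAG repository, and the function name and default sizes are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap their neighbours."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

For example, chunk_text("abcdefghij", chunk_size=4, overlap=1) returns ["abcd", "defg", "ghij"]. The overlap means a sentence cut at a chunk boundary still appears in full in at least one neighbouring chunk, as long as it is shorter than the overlap.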

3. PDF Extraction MCP Server

  • Description: Focuses on text extraction and OCR for PDFs, designed for document analysis and content indexing.
  • Capabilities:
    • Extracts text and supports OCR for scanned documents.
    • Provides tools for content indexing to build a structured knowledge base.
  • Setup in Cursor:
    1. Install: uv pip install pdf_extraction.
    2. Configure in ~/.cursor/mcp.json:
      {
        "mcpServers": {
          "pdf_extraction": {
            "command": "uvx",
            "args": ["pdf_extraction"]
          }
        }
      }
    3. Restart Cursor or refresh MCP settings.
    4. Use tools like extract_pdf_content to process PDFs for knowledge base integration.
  • Use Case: Useful for indexing and querying PDF content, especially scanned documents requiring OCR.
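
One way to decide which pages of a document should go through OCR is to check how much text plain extraction returned for each page. The sketch below assumes the per-page text is already available as a list of strings. The function name and the 20-character threshold are hypothetical and do not describe how the PDF Extraction server makes this decision.

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return zero-based indexes of pages whose extracted text is too short to trust."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars  # scanned pages usually yield little or no text
    ]
```

Pages flagged this way are candidates for OCR; pages with enough embedded text can skip it, which keeps processing time down on mixed documents.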

4. PDF Forms MCP Server

  • Description: Uses PyMuPDF to extract and visualize form field information from PDFs, suitable for structured data knowledge bases.
  • Capabilities:
    • Locates and extracts form field data.
    • Visualizes form fields for structured data retrieval.
  • Setup in Cursor:
    1. Add to ~/.cursor/mcp.json:
      {
        "mcpServers": {
          "pdf-forms": {
            "command": "python",
            "args": ["/path/to/pdf-forms-mcp-server/main.py"]
          }
        }
      }
    2. Replace the path with the actual server script location.
    3. Refresh Cursor’s MCP settings.
    4. Use tools like extract_form_fields for structured data integration.
  • Use Case: Best for knowledge bases with structured data from PDF forms, such as legal or application forms.

5. Knowledge Base MCP Server

  • Description: Provides semantic search and knowledge graph capabilities for structured repositories, including PDF-derived content (powered by txtai).
  • Capabilities:
    • Semantic search and knowledge graph creation.
    • Processes text extracted from PDFs for advanced querying.
  • Setup in Cursor:
    1. Install: uv pip install kb-mcp-server.
    2. Configure in ~/.cursor/mcp.json:
      {
        "mcpServers": {
          "kb-server": {
            "command": "kb-mcp-server",
            "args": ["--embeddings", "/path/to/knowledge_base.tar.gz"],
            "cwd": "/path/to/working/directory"
          }
        }
      }
    3. Replace paths with your directories or archives.
    4. Start the server and refresh Cursor.
    5. Use tools like retrieve_knowledge to query the knowledge base.
  • Use Case: Enhances PDF-based knowledge bases with semantic search and knowledge graphs.
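
To show the shape of a similarity search over extracted text, here is a toy sketch using only the standard library. It is an assumption-laden stand-in: word-count vectors replace the learned embeddings a real server would use, so it matches on shared words rather than meaning, and none of the function names come from kb-mcp-server or txtai.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, chunks: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Rank chunks by cosine similarity to the query, best first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

The ranking step (score every chunk, sort, keep the top k) has the same outline when real embeddings are used; only vectorize changes.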

Recommended Approach

Combine PDF RAG MCP Server with Knowledge Base MCP Server for a robust PDF-based knowledge base:

  • Why PDF RAG? Excels at processing and vectorizing PDFs for semantic search.
  • Why Knowledge Base? Adds knowledge graph capabilities alongside txtai-powered semantic search over the extracted text.
  • Workflow:
    1. Use PDF RAG to process PDFs into a FAISS index.
    2. Integrate Knowledge Base MCP Server for semantic searches and knowledge graphs.
    3. Configure both in Cursor’s mcp.json for AI-driven queries.

Setup Example for Combined Approach

  1. Install Dependencies:
    • Ensure Python 3.10+ is installed, then install uv: pip install -U uv.
    • Install PDF RAG: uv pip install -r requirements.txt (run from the cloned repository).
    • Install Knowledge Base: uv pip install kb-mcp-server.
  2. Configure in Cursor:
    {
      "mcpServers": {
        "pdf-rag": {
          "command": "python",
          "args": ["/path/to/pdf-rag-mcp-server/main.py"],
          "env": {
            "KNOWLEDGE_BASES_ROOT_DIR": "/path/to/knowledge_bases",
            "FAISS_INDEX_PATH": "/path/to/knowledge_bases/.faiss"
          }
        },
        "kb-server": {
          "command": "kb-mcp-server",
          "args": ["--embeddings", "/path/to/knowledge_base.tar.gz"],
          "cwd": "/path/to/working/directory"
        }
      }
    }
  3. Start Servers:
    • Start both servers and refresh Cursor’s MCP settings.
    • Query the knowledge base in Cursor’s Agent mode with prompts like: “Search my PDF knowledge base for [topic].”
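
Before refreshing Cursor, it can help to sanity-check the combined configuration. The sketch below is a hypothetical helper, not a Cursor feature: it checks only the fields used in this document (mcpServers, command, args) and flags any /path/to/ placeholder that was not replaced.

```python
import json


def validate_mcp_config(raw: str) -> list[str]:
    """Return a list of problems found in an mcp.json document (empty if none)."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    servers = config.get("mcpServers") if isinstance(config, dict) else None
    if not isinstance(servers, dict):
        return ["top-level 'mcpServers' object is missing"]
    problems = []
    for name, entry in servers.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: entry must be an object")
            continue
        if not isinstance(entry.get("command"), str) or not entry["command"]:
            problems.append(f"{name}: 'command' must be a non-empty string")
        args = entry.get("args", [])
        if not isinstance(args, list) or not all(isinstance(a, str) for a in args):
            problems.append(f"{name}: 'args' must be a list of strings")
        if "/path/to/" in json.dumps(entry):
            # The sample configs use /path/to/... as a stand-in for real paths.
            problems.append(f"{name}: still contains a '/path/to/' placeholder")
    return problems
```

Calling validate_mcp_config(Path("~/.cursor/mcp.json").expanduser().read_text()) (with pathlib imported) returns an empty list when nothing is flagged.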

Security Considerations

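  • API Keys: Keep API keys and other secrets in environment variables instead of hardcoding them in scripts.
  • Local Hosting: Run servers locally with stdio transport so that the MCP connection does not open a network port.
  • Review Tool Calls: Verify each tool call in Cursor before it executes.
  • Isolated Environments: Run servers in Docker containers for isolation, as in the PDF Reader configuration.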
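
As an illustration of the first point, a server script can read its secrets from the environment at startup and fail early when one is missing. This is a minimal sketch: the helper name require_env and the variable name EXAMPLE_API_KEY are hypothetical, and none of the servers above is known to use them.

```python
import os


def require_env(name: str) -> str:
    """Read a required secret from the environment, failing fast if it is unset."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before starting the server")
    return value
```

A server would call api_key = require_env("EXAMPLE_API_KEY") once at startup, so the key never appears in source control.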

Limitations

  • PDF Complexity: Servers that only extract embedded text may struggle with complex PDFs such as scanned documents. Use the PDF Extraction server, which supports OCR, for those files.
  • Performance: The FAISS index used by PDF RAG can require significant memory for large collections.
  • Cursor Integration: Manual refreshes may be needed after configuration changes.
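
A rough way to size the memory concern is to multiply the number of vectors by their dimension and the bytes per value. The sketch below assumes an uncompressed (flat) index of 32-bit floats and a 384-dimensional embedding. Both are assumptions for illustration, since this document does not state which index type or embedding model PDF RAG uses.

```python
def flat_index_bytes(num_vectors: int, dimension: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage for an uncompressed index (float32 by default)."""
    return num_vectors * dimension * bytes_per_value


# One million chunks with 384-dimensional float32 embeddings (assumed sizes).
size = flat_index_bytes(1_000_000, 384)
print(f"{size / 2**30:.2f} GiB")  # prints "1.43 GiB"
```

The estimate covers vector storage only. The chunk text, metadata, and the embedding model itself need memory on top of it.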

Conclusion

The PDF RAG MCP Server and Knowledge Base MCP Server combination offers a comprehensive solution for a PDF-based knowledge base in Cursor, with robust text extraction, semantic search, and knowledge graph capabilities. For simpler needs, use PDF Reader or PDF Extraction. For PDF forms, consider PDF Forms MCP Server.
