A simple CLI tool for extracting tables from PDF files and saving them as CSV files.
- Extract tables from PDF files using local LLM inference
- Save extracted tables as CSV files
- Works with both single PDF files and directories of PDFs
- Configurable model parameters
- Python 3.8+
- LLM model file (e.g., Llama 2/3 or Mistral in GGUF format)
```bash
# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install required packages
pip install llama-cpp-python pypdf pandas
```
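If you plan to use --n_gpu_layers (see the option list below), llama-cpp-python must be built with GPU support. The exact build flag depends on your llama-cpp-python version and backend, so treat the CUDA example below as a starting point rather than a definitive command:

```bash
# Reinstall llama-cpp-python with CUDA enabled (older releases use -DLLAMA_CUBLAS=on instead)
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```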
```bash
# Extract tables from a single PDF
python pdf_table_extractor.py extract --input sample.pdf --output tables/ --model_path models/llama-2-7b-chat.Q4_0.gguf

# Extract tables from every PDF in a directory
python pdf_table_extractor.py extract --input pdf_directory/ --output tables/ --model_path models/llama-2-7b-chat.Q4_0.gguf

# Offload 32 model layers to the GPU
python pdf_table_extractor.py extract --input sample.pdf --output tables/ --model_path models/llama-2-7b-chat.Q4_0.gguf --n_gpu_layers 32
```
- --input, -i: Path to the input PDF file or directory
- --output, -o: Directory to save CSV files
- --model_path, -m: Path to the LLM model file
- --n_ctx: Context window size (default: 2048)
- --n_gpu_layers: Number of layers to offload to the GPU (default: 0)
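For example, the same single-file extraction using the short flags and a larger context window (the 4096 value is purely illustrative):

```bash
python pdf_table_extractor.py extract -i sample.pdf -o tables/ -m models/llama-2-7b-chat.Q4_0.gguf --n_ctx 4096
```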
You can download compatible GGUF models from Hugging Face.
Recommended models:
- Llama-2-7B-Chat.Q4_0.gguf
- Mistral-7B-Instruct-v0.2.Q4_0.gguf
- Llama-3-8B-Instruct.Q4_0.gguf
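One convenient way to fetch a model is the huggingface-cli command from the huggingface_hub package; the repository and filename below are only examples, so substitute the model you actually want:

```bash
pip install huggingface_hub
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_0.gguf --local-dir models/
```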
- The tool extracts text from each page of the PDF
- It uses the LLM to detect if a page contains a table
- When a table is detected, the LLM parses the table structure
- The parsed table is saved as a CSV file (see the sketch below)
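The steps above roughly correspond to the pipeline sketched below. This is an illustrative, hypothetical re-implementation built on the installed packages (pypdf, llama-cpp-python, pandas), not the tool's actual source; the prompts, pipe-separated intermediate format, and output file naming are assumptions.

```python
from pathlib import Path

import pandas as pd
from llama_cpp import Llama
from pypdf import PdfReader

MODEL_PATH = "models/llama-2-7b-chat.Q4_0.gguf"  # assumed model location


def extract_tables(pdf_path: str, output_dir: str) -> None:
    """Extract one CSV per table-bearing page of a PDF (illustrative sketch)."""
    llm = Llama(model_path=MODEL_PATH, n_ctx=2048, n_gpu_layers=0, verbose=False)
    reader = PdfReader(pdf_path)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    for page_num, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if not text.strip():
            continue  # blank or image-only page; would need OCR first

        # Step 1: ask the model whether this page contains a table.
        answer = llm(
            "Does the following page contain a table? Answer YES or NO.\n\n"
            f"{text}\n\nAnswer:",
            max_tokens=3,
            temperature=0.0,
        )["choices"][0]["text"]
        if "YES" not in answer.upper():
            continue

        # Step 2: ask the model to re-emit the table as pipe-separated rows.
        parsed = llm(
            "Extract the table from the following page. Output one row per "
            "line with cells separated by '|' and no extra commentary.\n\n"
            f"{text}\n\nTable:",
            max_tokens=1024,
            temperature=0.0,
        )["choices"][0]["text"]

        rows = [
            [cell.strip() for cell in line.split("|")]
            for line in parsed.splitlines()
            if "|" in line
        ]
        if not rows:
            continue

        # Step 3: treat the first row as the header and save the page as CSV.
        header, *body = rows
        body = [r for r in body if len(r) == len(header)]  # drop malformed rows
        pd.DataFrame(body, columns=header).to_csv(
            out_dir / f"page_{page_num}.csv", index=False
        )


if __name__ == "__main__":
    extract_tables("sample.pdf", "tables/")
```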
- Complex tables with merged cells may not be parsed correctly
- Very large tables exceeding the LLM's context window may be truncated
- Image-based PDFs require OCR pre-processing (not included)
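For image-based (scanned) PDFs, one workaround is to run an OCR pass first, for example with the separate ocrmypdf tool (which also requires Tesseract and Ghostscript installed on the system), and then feed the resulting text-layer PDF to the extractor:

```bash
pip install ocrmypdf
ocrmypdf scanned.pdf scanned_ocr.pdf
python pdf_table_extractor.py extract --input scanned_ocr.pdf --output tables/ --model_path models/llama-2-7b-chat.Q4_0.gguf
```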
- Add OCR support for image-based PDFs
- Improve table detection accuracy
- Add support for more output formats
- Implement batch processing with progress tracking