This project provides a tool that processes PDF files to improve accessibility: it adds OCR-based text and document structure, and it generates a thumbnail for previews. The tool is packaged in a Docker container, which gives it a consistent, isolated execution environment.
- OCR Processing: Extracts text from scanned PDFs using Tesseract OCR and embeds it as invisible text in the PDF.
- Document Structure: Adds structural metadata such as language, title, and author to the PDF.
- Thumbnail Creation: Generates a thumbnail image of the first page of the PDF for easy preview.
- URL Support: The tool can download and process PDFs directly from a URL.
- Non-root Execution: The application runs within a Docker container using a non-root user for enhanced security.
- Output Directory: All generated files (processed PDFs, thumbnails) are saved in the output directory, while the original PDF file remains unchanged.
- Docker
To build the Docker image, navigate to the project directory and run:
docker build -t pdf-processor .
To process a PDF file, run the Docker container with the following command, which mounts both the current working directory and a specific PDF file:
docker run --rm \
-v "$(pwd):/home/appuser/app" \
-v "/path/to/your/pdf_file:/home/appuser/app/your_pdf_file.pdf" \
pdf-processor \
python /home/appuser/app/process_pdf_for_compliance.py /home/appuser/app/your_pdf_file.pdf [thumbnail_size]
- $(pwd):/home/appuser/app: Mounts your current working directory to /home/appuser/app inside the container.
- /path/to/your/pdf_file:/home/appuser/app/your_pdf_file.pdf: Mounts the specific PDF file from your host system into the container.
- your_pdf_file.pdf: The file name to be processed inside the container.
- [thumbnail_size]: Optional argument to specify the size of the thumbnail (e.g., 300 for a 300x300 thumbnail).
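The two positional arguments described above (a PDF path or URL, then an optional thumbnail size) could be parsed along the following lines. This is a minimal sketch using the standard library's argparse, not the actual code in process_pdf_for_compliance.py; the function name parse_args, the attribute names, and the None default for a missing size are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse `<source> [thumbnail_size]` (hypothetical helper, not from the project)."""
    parser = argparse.ArgumentParser(
        description="Process a PDF for accessibility and create a thumbnail."
    )
    # Either a path inside the container or an http(s) URL.
    parser.add_argument("source", help="Path to a PDF file or a URL")
    # Optional; the real script's default size is not documented, so None is assumed here.
    parser.add_argument(
        "thumbnail_size",
        nargs="?",
        type=int,
        default=None,
        help="Thumbnail size in pixels, e.g. 300 for a 300x300 thumbnail",
    )
    args = parser.parse_args(argv)
    if args.thumbnail_size is not None and args.thumbnail_size <= 0:
        parser.error("thumbnail_size must be a positive integer")
    return args


if __name__ == "__main__":
    parsed = parse_args()
    print(parsed.source, parsed.thumbnail_size)
```

Validating the size up front means a typo such as a negative number fails with a usage message before any OCR work starts.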
For example, to process a PDF file stored at /home/don/Downloads/Sample_PDF_file.pdf with a thumbnail size of 200:
docker run --rm \
-v "$(pwd):/home/appuser/app" \
-v "/home/don/Downloads/Sample_PDF_file.pdf:/home/appuser/app/Sample_PDF_file.pdf" \
pdf-processor \
python /home/appuser/app/process_pdf_for_compliance.py /home/appuser/app/Sample_PDF_file.pdf 200
If you want to process a PDF from a URL and specify a thumbnail size:
docker run --rm \
-v "$(pwd):/home/appuser/app" \
pdf-processor \
python /home/appuser/app/process_pdf_for_compliance.py https://example.com/document.pdf 300
- The processed PDF file and thumbnail will be saved in the output directory within the container.
- The original PDF file will remain unchanged in its original location.
- Dockerfile: Defines the environment for running the application.
- process_pdf_for_compliance.py: Main script that processes the PDF for accessibility and creates a thumbnail.
- README.md: Project documentation.
The script performs the following steps:
- Download File (if URL is provided): Downloads the PDF from a URL.
- File Existence Check: Confirms the PDF file exists before processing.
- Analyze PDF Structure: Logs the number of pages and the text length on each page.
- OCR Processing: Runs OCR on each page to extract text and embeds it into the PDF.
- Add Document Structure: Adds metadata such as document title, author, and language.
- Create Thumbnail: Generates a thumbnail image of the first page.
- Save Outputs: Saves the processed PDF and thumbnail in the output directory.
- Cleanup: Removes the downloaded file if the PDF was sourced from a URL.
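The input-handling steps in the list above (download, existence check, cleanup) can be sketched with the standard library alone. This is an illustrative outline under stated assumptions, not the script's actual implementation: the names is_url, resolve_input, and cleanup, the fallback file name downloaded.pdf, and the 30-second timeout are all hypothetical.

```python
import os
import shutil
import tempfile
import urllib.request
from urllib.parse import urlparse


def is_url(source):
    """Treat only http(s) sources as URLs; everything else is a local path."""
    return urlparse(source).scheme in ("http", "https")


def resolve_input(source, download_dir=None):
    """Return (local_path, was_downloaded) for a path or URL."""
    if is_url(source):
        download_dir = download_dir or tempfile.mkdtemp()
        # Fall back to a generic name when the URL path has no file component.
        name = os.path.basename(urlparse(source).path) or "downloaded.pdf"
        local_path = os.path.join(download_dir, name)
        with urllib.request.urlopen(source, timeout=30) as response:
            with open(local_path, "wb") as out:
                shutil.copyfileobj(response, out)
        return local_path, True
    # File existence check for local inputs.
    if not os.path.isfile(source):
        raise FileNotFoundError(f"PDF file not found: {source}")
    return source, False


def cleanup(local_path, was_downloaded):
    """Remove the file only if this run downloaded it; user files are left alone."""
    if was_downloaded and os.path.exists(local_path):
        os.remove(local_path)
```

Returning the was_downloaded flag alongside the path keeps the cleanup step from ever deleting a file the user supplied, which matches the promise that the original PDF remains unchanged.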
The application logs all significant actions and errors, making it easy to track the processing steps and troubleshoot any issues.
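As a rough illustration of this kind of logging, the sketch below configures a logger and records the page count and per-page text length mentioned in the Analyze PDF Structure step. The logger name, message wording, and the helper names get_logger and log_page_stats are assumptions; the real script may format its logs differently.

```python
import logging


def get_logger(name="pdf_processor"):
    """Return a logger that writes timestamped messages to stderr (hypothetical setup)."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def log_page_stats(logger, page_texts):
    """Log the number of pages and the text length found on each page."""
    logger.info("PDF has %d page(s)", len(page_texts))
    for number, text in enumerate(page_texts, start=1):
        logger.info("Page %d: %d characters of text", number, len(text))


if __name__ == "__main__":
    # Page texts would come from the PDF library in the real script.
    log_page_stats(get_logger(), ["Sample text on page one", ""])
```

A page that reports zero characters is a likely scanned image, which is the case where the OCR step adds the most value.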
- The application runs as a non-root user inside the Docker container, which limits the damage a compromised process could do.
- Ensure that any PDFs processed by the tool are from trusted sources to avoid processing potentially malicious files.
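A cheap sanity check before processing is to confirm that the input at least starts with the PDF header and is not unexpectedly large. The sketch below is an assumption-laden illustration, not part of the project: the function name looks_like_pdf and the 50 MB limit are hypothetical, and passing this check does not make an untrusted file safe.

```python
import os


def looks_like_pdf(path, max_bytes=50 * 1024 * 1024):
    """Return True if the file is non-empty, within the size limit, and starts with %PDF-."""
    size = os.path.getsize(path)
    if size == 0 or size > max_bytes:
        return False
    with open(path, "rb") as f:
        # PDF files begin with the marker "%PDF-" followed by a version number.
        return f.read(5) == b"%PDF-"
```

This only filters out obviously wrong inputs, such as an HTML error page saved from a failed download; it is no substitute for processing files from trusted sources.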