@m0o0scar
Last active July 31, 2024 10:01
VikParuchuri/marker. Continue this conversation at http://localhost:3000?gist=eebeb4250a5f0774f7717d9d234bd7d4

[github] VikParuchuri/marker

Source

Python / 2.9K lines of code. Convert PDF to markdown quickly with high accuracy

URL: https://github.com/VikParuchuri/marker

Conversation

▲ Explain the major function of the project and how it works by referencing the most important and relevant code snippets.

The project, marker, is a tool that converts PDF documents to markdown format. It aims to achieve this conversion quickly and accurately, especially for books and scientific papers.

Here's a breakdown of its key functionalities and how it works:

1. Text Extraction:

  • The project first extracts the text already embedded in the PDF using the pypdfium2 library, which gives fast access to the PDF's text layer without running OCR. The names in the snippet below (dictionary_output, PDFTEXT_CPU_WORKERS) suggest this goes through a pdftext helper layer.
  • The following excerpt from marker/pdf/extract_text.py demonstrates this process:
def get_text_blocks(doc, fname, max_pages: Optional[int] = None, start_page: Optional[int] = None) -> (List[Page], Dict):
    # ...
    char_blocks = dictionary_output(fname, page_range=page_range, keep_chars=True, workers=settings.PDFTEXT_CPU_WORKERS)
    marker_blocks = [pdftext_format_to_blocks(page, pnum) for pnum, page in enumerate(char_blocks)]
    # ...
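
The excerpt above elides how start_page and max_pages become the page_range passed to dictionary_output. As a rough illustration only, the sketch below shows one plausible way to do that. The helper name resolve_page_range and its exact behavior are assumptions, not marker's actual code.

```python
from typing import List, Optional


def resolve_page_range(total_pages: int,
                       max_pages: Optional[int] = None,
                       start_page: Optional[int] = None) -> List[int]:
    """Hypothetical helper: turn start_page/max_pages into 0-based page indices."""
    start = start_page or 0
    if start < 0 or start > total_pages:
        raise ValueError(f"start_page {start} is outside 0..{total_pages}")
    # Without a limit, read to the end; otherwise clamp to the document length.
    end = total_pages if max_pages is None else min(start + max_pages, total_pages)
    return list(range(start, end))
```

For a 10-page document, resolve_page_range(10, max_pages=3, start_page=8) yields only pages 8 and 9, because the range is clamped at the end of the document.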

2. OCR (Optional):

  • If the PDF's embedded text is insufficient or inaccurate, marker employs OCR (Optical Character Recognition) to extract text from images within the document.
  • It supports two OCR engines: surya and ocrmypdf (OCRmyPDF wraps Tesseract, which is why the ocrmypdf branch below calls tesseract_recognition).
  • The code snippet marker/ocr/recognition.py shows how OCR is triggered and executed:
def run_ocr(doc, pages: List[Page], langs: List[str], rec_model, batch_multiplier=1) -> (List[Page], Dict):
    # ...
    if ocr_method == "surya":
        new_pages = surya_recognition(doc, ocr_idxs, langs, rec_model, pages, batch_multiplier=batch_multiplier)
    elif ocr_method == "ocrmypdf":
        new_pages = tesseract_recognition(doc, ocr_idxs, langs)
    # ...
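
The excerpt does not show how the pages in ocr_idxs are chosen. The sketch below is one simple way such a decision could work, assuming a page is sent to OCR when its embedded text is very short or mostly unreadable. The function names, thresholds, and the engines mapping are all hypothetical and are not taken from marker.

```python
from typing import Callable, Dict, List


def needs_ocr(page_text: str, min_chars: int = 20, max_bad_ratio: float = 0.3) -> bool:
    """Hypothetical heuristic: OCR a page whose embedded text is short or garbled."""
    stripped = page_text.strip()
    if len(stripped) < min_chars:
        return True
    # Count replacement characters and other non-printable, non-space characters.
    bad = sum(1 for ch in stripped
              if ch == "\ufffd" or not (ch.isprintable() or ch.isspace()))
    return bad / len(stripped) > max_bad_ratio


def select_ocr_pages(page_texts: List[str]) -> List[int]:
    """Return the indices of pages that should be re-read with OCR."""
    return [idx for idx, text in enumerate(page_texts) if needs_ocr(text)]


def run_selected_ocr(page_texts: List[str], ocr_method: str,
                     engines: Dict[str, Callable[[int], str]]) -> List[str]:
    """Replace the text of flagged pages using the chosen engine (a stand-in callable)."""
    if ocr_method not in engines:
        raise ValueError(f"unknown OCR method: {ocr_method}")
    recognize = engines[ocr_method]
    result = list(page_texts)
    for idx in select_ocr_pages(page_texts):
        result[idx] = recognize(idx)
    return result
```

Passing the engines as plain callables keeps the dispatch separate from any particular OCR library, mirroring the if/elif on ocr_method in the excerpt.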

3. Layout Analysis:

  • marker analyzes the layout of the PDF to understand the structure of the content, such as headings, paragraphs, tables, and figures.
  • It uses the surya library for layout detection and reading order determination.
  • The code snippet marker/layout/layout.py illustrates this process:
def surya_layout(doc, pages: List[Page], layout_model, batch_multiplier=1):
    images = [render_image(doc[pnum], dpi=settings.SURYA_LAYOUT_DPI) for pnum in range(len(pages))]
    text_detection_results = [p.text_lines for p in pages]

    processor = layout_model.processor
    layout_results = batch_layout_detection(images, layout_model, processor, detection_results=text_detection_results, batch_size=int(get_batch_size() * batch_multiplier))
    for page, layout_result in zip(pages, layout_results):
        page.layout = layout_result
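
Once each page has layout regions, a natural next step is to decide which region each text line belongs to. The sketch below shows one common approach, bounding-box overlap, as an assumption about how such matching can be done. It is not marker's or surya's code, and the region labels in the usage note are illustrative.

```python
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def intersection_area(a: BBox, b: BBox) -> float:
    """Area of the overlap between two axis-aligned boxes (0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def label_for_line(line_bbox: BBox,
                   regions: List[Tuple[BBox, str]],
                   min_overlap: float = 0.5) -> Optional[str]:
    """Return the label of the region covering most of the line, if any."""
    line_area = (line_bbox[2] - line_bbox[0]) * (line_bbox[3] - line_bbox[1])
    if line_area <= 0:
        return None
    best_label, best_fraction = None, 0.0
    for region_bbox, label in regions:
        fraction = intersection_area(line_bbox, region_bbox) / line_area
        if fraction > best_fraction:
            best_label, best_fraction = label, fraction
    # Lines that barely touch a region are left unlabeled.
    return best_label if best_fraction >= min_overlap else None
```

With regions [((0, 0, 100, 20), "Title"), ((0, 30, 100, 200), "Text")], a line at (10, 40, 90, 50) is labeled "Text", while a line far below both regions gets no label.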

4. Content Cleaning and Formatting:

  • marker cleans and formats the extracted text to improve readability and consistency in the markdown output.
  • This includes tasks like:
    • Removing headers and footers
    • Identifying and formatting code blocks
    • Converting equations to LaTeX
    • Handling tables and lists
  • The code snippets marker/cleaners/headers.py, marker/cleaners/code.py, marker/equations/equations.py, and marker/tables/table.py demonstrate these cleaning and formatting operations.
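
No code is shown for the cleaning steps, so here is a minimal sketch of one of them, header and footer removal. It assumes a simple strategy: a first or last line that repeats on most pages is treated as a running header or footer. The function name and threshold are hypothetical, and marker's cleaners may work quite differently.

```python
import math
from collections import Counter
from typing import List


def strip_repeated_lines(pages: List[List[str]],
                         min_fraction: float = 0.6) -> List[List[str]]:
    """Drop first/last lines that repeat on at least min_fraction of pages."""
    counts: Counter = Counter()
    for lines in pages:
        if lines:
            # Use a set so a one-line page is not counted twice.
            counts.update({lines[0].strip(), lines[-1].strip()})
    # Require at least two occurrences so a single page is never stripped.
    threshold = max(2, math.ceil(len(pages) * min_fraction))
    repeated = {text for text, n in counts.items() if text and n >= threshold}

    cleaned = []
    for lines in pages:
        kept = list(lines)
        if kept and kept[0].strip() in repeated:
            kept = kept[1:]
        if kept and kept[-1].strip() in repeated:
            kept = kept[:-1]
        cleaned.append(kept)
    return cleaned
```

Only the edge lines of each page are checked, so a phrase that recurs in body text is left alone.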

5. Markdown Generation:

  • Finally, marker generates the markdown output based on the cleaned and formatted text, along with any extracted images.
  • The code snippet marker/postprocessors/markdown.py shows how the markdown output is constructed:
def get_full_text(text_blocks):
    full_text = ""
    prev_block = None
    for block in text_blocks:
        if prev_block:
            full_text += block_separator(prev_block.text, block.text, prev_block.block_type, block.block_type)
        else:
            full_text += block.text
        prev_block = block
    return full_text
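
The excerpt calls block_separator without showing it. To make the joining step concrete, the sketch below uses a stand-in rule: consecutive list items are joined with a single newline and everything else with a blank line. The Block class, simple_block_separator, and the block type names are assumptions for illustration. Unlike the call in the excerpt, this stand-in returns only the separator string.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    text: str
    block_type: str  # illustrative labels such as "Text" or "List-item"


def simple_block_separator(prev_type: str, new_type: str) -> str:
    """Hypothetical rule: keep list items tight, separate everything else."""
    if prev_type == "List-item" and new_type == "List-item":
        return "\n"
    return "\n\n"


def join_blocks(blocks: List[Block]) -> str:
    """Concatenate block texts, choosing a separator from adjacent block types."""
    parts: List[str] = []
    prev = None
    for block in blocks:
        if prev is not None:
            parts.append(simple_block_separator(prev.block_type, block.block_type))
        parts.append(block.text)
        prev = block
    return "".join(parts)
```

Collecting the pieces in a list and joining once avoids repeated string concatenation on long documents.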

In summary, marker is a PDF-to-markdown converter that combines embedded-text extraction, optional OCR, model-based layout analysis (via surya), content cleaning, and markdown generation to produce well-formatted, readable output quickly and accurately.

▲ Exactly how will it handle images and charts in pdf file? In what format would images and charts be output to the final markdown (as standalone image files with links in markdown?)

marker handles images and charts in PDF files by extracting them as standalone image files and then including links to these files within the markdown output.

Here's how it works:

  1. Image Detection:

    • marker uses the surya library to detect image regions within the PDF. It identifies areas labeled as "Figure" or "Picture" in the layout analysis.
    • The code snippet marker/images/extract.py demonstrates this process:
    def find_image_blocks(page):
        image_blocks = []
        image_regions = [l.bbox for l in page.layout.bboxes if l.label in ["Figure", "Picture"]]
        # ...
  2. Image Extraction:

    • Once image regions are identified, marker extracts the corresponding images from the PDF using the pypdfium2 library.
    • The code snippet marker/images/extract.py shows how images are extracted:
    def extract_page_images(page_obj, page):
        # ...
        image = render_bbox_image(page_obj, page, bbox)
        # ...
  3. Image Saving:

    • The extracted images are saved as PNG files in a subfolder named after the PDF file.
    • The code snippet marker/images/save.py demonstrates this process:
    def get_image_filename(page: Page, image_idx):
        return f"{page.pnum}_image_{image_idx}.png"
    
    def images_to_dict(pages: List[Page]):
        images = {}
        for page in pages:
            if page.images is None:
                continue
            for image_idx, image in enumerate(page.images):
                image_filename = get_image_filename(page, image_idx)
                images[image_filename] = image
        return images
  4. Markdown Linking:

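    • In the final markdown output, marker includes links to these saved image files using the ![alt text](image_path) syntax, with the image filename serving as both the alt text and the path.
    • The link is built inside extract_page_images (shown above as part of marker/images/extract.py), which wraps it in an image Span so that it flows into the markdown output: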
    def extract_page_images(page_obj, page):
        # ...
        image_markdown = f"\n\n![{image_filename}]({image_filename})\n\n"
        image_span = Span(
            bbox=bbox,
            text=image_markdown,
            font="Image",
            rotation=0,
            font_weight=0,
            font_size=0,
            image=True,
            span_id=f"image_{image_idx}"
        )
        # ...

Therefore, marker outputs images and charts as standalone PNG files within a subfolder of the output directory, and links to these files are included in the generated markdown using the standard markdown image syntax.
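
The naming, saving, and linking steps above can be sketched end to end as follows. This is a simplified illustration: the filename pattern and the link format are copied from the excerpts, but save_images and its use of raw PNG bytes are assumptions, since the excerpts never show how the image objects are written to disk.

```python
from pathlib import Path
from typing import Dict, List


def image_filename(page_number: int, image_idx: int) -> str:
    """Same naming pattern as get_image_filename in the excerpt."""
    return f"{page_number}_image_{image_idx}.png"


def image_markdown(filename: str) -> str:
    """Standard markdown image link; the filename doubles as alt text and path."""
    return f"\n\n![{filename}]({filename})\n\n"


def save_images(images: Dict[str, bytes], out_dir: Path) -> List[Path]:
    """Hypothetical saver: write already-encoded PNG bytes next to the markdown file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, data in images.items():
        path = out_dir / name
        path.write_bytes(data)
        written.append(path)
    return written
```

Because the link uses a bare filename, the markdown file and the images have to sit in the same folder for the links to resolve.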

Note: marker does not currently support converting charts directly to markdown. It treats charts as images and extracts them as such.

▲ Does it provide any online demo to play with?

The marker project itself doesn't have a public online demo, but a hosted API is available through DataLab.to.

The hosted API converts PDFs, Word documents, and PowerPoints to markdown using marker, so you don't need to install or run the software locally. Although it isn't an interactive demo, it lets you try marker on various document types and inspect the output.

You can find more information about the API and its pricing on the DataLab.to website.

{
"url": "https://github.com/VikParuchuri/marker",
"type": "github",
"title": "VikParuchuri/marker",
"picture": "https://avatars.githubusercontent.com/u/913340?v=4",
"description": "Python / 2.9K lines of code.\nConvert PDF to markdown quickly with high accuracy"
}