Python / 2.9K lines of code. Convert PDF to markdown quickly with high accuracy
URL: https://github.com/VikParuchuri/marker
▲ Explain the major function of the project and how it works by referencing the most important and relevant code snippets.
The project, marker, is a tool that converts PDF documents to markdown format. It aims to achieve this conversion quickly and accurately, especially for books and scientific papers.
Here's a breakdown of its key functionalities and how it works:
1. Text Extraction:
- The project first extracts text from the PDF using the pypdfium2 library, which provides fast access to the text content embedded in the PDF.
- The code snippet from marker/pdf/extract_text.py demonstrates this process:

    def get_text_blocks(doc, fname, max_pages: Optional[int] = None, start_page: Optional[int] = None) -> (List[Page], Dict):
        # ...
        char_blocks = dictionary_output(fname, page_range=page_range, keep_chars=True, workers=settings.PDFTEXT_CPU_WORKERS)
        marker_blocks = [pdftext_format_to_blocks(page, pnum) for pnum, page in enumerate(char_blocks)]
        # ...
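The snippet above passes a page_range that is derived from max_pages and start_page, but the derivation itself is elided. A minimal sketch of how such a range could be computed follows; resolve_page_range is a hypothetical helper written for illustration, not a function from marker.

```python
from typing import Optional


def resolve_page_range(page_count: int,
                       max_pages: Optional[int] = None,
                       start_page: Optional[int] = None) -> range:
    """Hypothetical helper: clamp the requested pages to the document length."""
    start = start_page or 0
    if not 0 <= start < page_count:
        raise ValueError(f"start_page {start} is outside a {page_count}-page document")
    remaining = page_count - start
    # Never read past the last page, even if max_pages asks for more.
    count = remaining if max_pages is None else min(max_pages, remaining)
    return range(start, start + count)
```

Clamping in one place keeps the extraction call free of bounds checks.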
2. OCR (Optional):
- If the PDF's embedded text is insufficient or inaccurate, marker employs OCR (Optical Character Recognition) to extract text from images within the document.
- It supports two OCR engines: surya and ocrmypdf.
- The code snippet from marker/ocr/recognition.py shows how OCR is triggered and executed:

    def run_ocr(doc, pages: List[Page], langs: List[str], rec_model, batch_multiplier=1) -> (List[Page], Dict):
        # ...
        if ocr_method == "surya":
            new_pages = surya_recognition(doc, ocr_idxs, langs, rec_model, pages, batch_multiplier=batch_multiplier)
        elif ocr_method == "ocrmypdf":
            new_pages = tesseract_recognition(doc, ocr_idxs, langs)
        # ...
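The snippet does not show how ocr_idxs is chosen. As a rough illustration of the idea "OCR only the pages whose embedded text is insufficient", the sketch below flags pages by character count and hands them to a pluggable engine. The threshold, the function names, and the engine callables are all assumptions made for this example; marker's own heuristics may differ.

```python
from typing import Callable, Dict, List

# An engine takes page indices and returns recognized text per index (assumed shape).
OcrEngine = Callable[[List[int]], Dict[int, str]]


def pages_needing_ocr(page_texts: List[str], min_chars: int = 50) -> List[int]:
    """Return indices of pages whose embedded text is too short to trust."""
    return [i for i, text in enumerate(page_texts) if len(text.strip()) < min_chars]


def run_ocr_sketch(page_texts: List[str], ocr_method: str,
                   engines: Dict[str, OcrEngine]) -> List[str]:
    """Replace the text of weak pages with OCR output; leave other pages alone."""
    ocr_idxs = pages_needing_ocr(page_texts)
    if not ocr_idxs:
        return list(page_texts)
    if ocr_method not in engines:
        raise ValueError(f"unknown OCR method: {ocr_method}")
    recognized = engines[ocr_method](ocr_idxs)
    return [recognized.get(i, text) for i, text in enumerate(page_texts)]
```

Passing the engines in as a dictionary mirrors the if/elif dispatch above while keeping the sketch testable without an OCR model.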
3. Layout Analysis:
- marker analyzes the layout of the PDF to understand the structure of the content, such as headings, paragraphs, tables, and figures.
- It uses the surya library for layout detection and reading order determination.
- The code snippet from marker/layout/layout.py illustrates this process:

    def surya_layout(doc, pages: List[Page], layout_model, batch_multiplier=1):
        images = [render_image(doc[pnum], dpi=settings.SURYA_LAYOUT_DPI) for pnum in range(len(pages))]
        text_detection_results = [p.text_lines for p in pages]

        processor = layout_model.processor
        layout_results = batch_layout_detection(images, layout_model, processor, detection_results=text_detection_results, batch_size=int(get_batch_size() * batch_multiplier))
        for page, layout_result in zip(pages, layout_results):
            page.layout = layout_result
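Once each page carries labeled layout boxes, a natural next step is to tag every text line with the region it falls in. The sketch below does this by picking the region with the largest overlap. It is a simplified illustration, not marker's code: the tuple-based box format and the "Text" label are assumptions, while "Figure" is one of the labels mentioned in this discussion.

```python
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def intersection_area(a: BBox, b: BBox) -> float:
    """Area shared by two axis-aligned boxes, or 0.0 if they do not overlap."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def label_for_line(line_bbox: BBox,
                   regions: List[Tuple[BBox, str]]) -> Optional[str]:
    """Return the label of the region overlapping the line most, if any."""
    best_label, best_area = None, 0.0
    for region_bbox, label in regions:
        area = intersection_area(line_bbox, region_bbox)
        if area > best_area:
            best_label, best_area = label, area
    return best_label
```

Lines that touch no region return None, which lets a caller fall back to a default block type.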
4. Content Cleaning and Formatting:
- marker cleans and formats the extracted text to improve readability and consistency in the markdown output.
- This includes tasks like:
  - Removing headers and footers
  - Identifying and formatting code blocks
  - Converting equations to LaTeX
  - Handling tables and lists
- The code snippets in marker/cleaners/headers.py, marker/cleaners/code.py, marker/equations/equations.py, and marker/tables/table.py demonstrate these cleaning and formatting operations.
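No code is quoted for the cleaning step, so here is a small illustration of one of the listed tasks, removing headers and footers. It uses a plain frequency heuristic (drop a first or last line that repeats on most pages), which is a stand-in technique chosen for brevity and not necessarily what marker/cleaners/headers.py does. The function name and the 0.6 threshold are assumptions.

```python
import math
from collections import Counter
from typing import List


def strip_repeated_lines(pages: List[List[str]],
                         min_fraction: float = 0.6) -> List[List[str]]:
    """Drop first/last lines that recur on at least min_fraction of pages."""
    counts: Counter = Counter()
    for lines in pages:
        if lines:
            # A set avoids double-counting when a page has a single line.
            counts.update({lines[0].strip(), lines[-1].strip()})

    threshold = max(2, math.ceil(len(pages) * min_fraction))
    repeated = {text for text, n in counts.items() if text and n >= threshold}

    cleaned = []
    for lines in pages:
        kept = list(lines)
        if kept and kept[0].strip() in repeated:
            kept = kept[1:]
        if kept and kept[-1].strip() in repeated:
            kept = kept[:-1]
        cleaned.append(kept)
    return cleaned
```

This heuristic misses running page numbers, since they change on every page; those would need a separate pattern check.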
5. Markdown Generation:
- Finally, marker generates the markdown output based on the cleaned and formatted text, along with any extracted images.
- The code snippet from marker/postprocessors/markdown.py shows how the markdown output is constructed:

    def get_full_text(text_blocks):
        full_text = ""
        prev_block = None
        for block in text_blocks:
            if prev_block:
                full_text += block_separator(prev_block.text, block.text, prev_block.block_type, block.block_type)
            else:
                full_text += block.text
            prev_block = block
        return full_text
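get_full_text relies on block_separator, whose body is not shown. Because its return value is appended directly to full_text, it presumably returns the separator together with the new block's text. The sketch below makes that assumption explicit with an invented rule (consecutive list items stay on adjacent lines, everything else gets a blank line). The Block class, the "List-item" label, and the rule itself are illustrative, not marker's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    text: str
    block_type: str


def block_separator(prev_text: str, text: str,
                    prev_type: str, block_type: str) -> str:
    """Assumed behavior: return the separator followed by the new block's text."""
    # prev_text is unused here; it is kept to match the call in get_full_text.
    if prev_type == "List-item" and block_type == "List-item":
        sep = "\n"    # keep list items tight
    else:
        sep = "\n\n"  # blank line between other blocks
    return sep + text


def get_full_text(text_blocks: List[Block]) -> str:
    full_text = ""
    prev_block = None
    for block in text_blocks:
        if prev_block:
            full_text += block_separator(prev_block.text, block.text,
                                         prev_block.block_type, block.block_type)
        else:
            full_text += block.text
        prev_block = block
    return full_text
```

Deciding the separator from the pair of block types, rather than from the text, is what lets the output keep lists compact while still separating paragraphs.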
In summary, marker is a sophisticated PDF-to-markdown converter that leverages deep learning models and advanced text processing techniques to achieve high accuracy and speed. It combines text extraction, OCR, layout analysis, content cleaning, and markdown generation to produce well-formatted and readable markdown output.
▲ Exactly how will it handle images and charts in a PDF file? In what format will images and charts be output in the final markdown (as standalone image files with links in the markdown)?
marker handles images and charts in PDF files by extracting them as standalone image files and then including links to these files within the markdown output.
Here's how it works:
- Image Detection:
  - marker uses the surya library to detect image regions within the PDF. It identifies areas labeled as "Figure" or "Picture" in the layout analysis.
  - The code snippet from marker/images/extract.py demonstrates this process:

    def find_image_blocks(page):
        image_blocks = []
        image_regions = [l.bbox for l in page.layout.bboxes if l.label in ["Figure", "Picture"]]
        # ...
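To make the label filter concrete, here is a self-contained version of the same idea that can be run without marker installed. LayoutBox is a hypothetical stand-in for the objects found in page.layout.bboxes; only the "Figure" and "Picture" labels come from the snippet above.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class LayoutBox:
    """Hypothetical stand-in for one detected layout region."""
    bbox: BBox
    label: str


IMAGE_LABELS = ("Figure", "Picture")


def find_image_regions(layout_boxes: List[LayoutBox]) -> List[BBox]:
    """Keep only the bounding boxes of regions labeled as images."""
    return [box.bbox for box in layout_boxes if box.label in IMAGE_LABELS]
```

Charts fall under the same labels as any other figure, which is why they end up being treated as plain images.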
- Image Extraction:
  - Once image regions are identified, marker extracts the corresponding images from the PDF using the pypdfium2 library.
  - The code snippet from marker/images/extract.py shows how images are extracted:

    def extract_page_images(page_obj, page):
        # ...
        image = render_bbox_image(page_obj, page, bbox)
        # ...
- Image Saving:
  - The extracted images are saved as PNG files in a subfolder named after the PDF file.
  - The code snippet from marker/images/save.py demonstrates this process:

    def get_image_filename(page: Page, image_idx):
        return f"{page.pnum}_image_{image_idx}.png"

    def images_to_dict(pages: List[Page]):
        images = {}
        for page in pages:
            if page.images is None:
                continue
            for image_idx, image in enumerate(page.images):
                image_filename = get_image_filename(page, image_idx)
                images[image_filename] = image
        return images
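images_to_dict only builds a filename-to-image mapping; writing the files is a separate step that is not quoted here. A minimal sketch of that step, assuming each image exposes a Pillow-style save(path, format) method, could look like this. save_images and its parameters are hypothetical names, not part of marker.

```python
import os
from typing import Dict, List


def save_images(images: Dict[str, object], out_dir: str, pdf_name: str) -> List[str]:
    """Write each image as PNG into a subfolder named after the PDF file."""
    base_name = os.path.splitext(os.path.basename(pdf_name))[0]
    folder = os.path.join(out_dir, base_name)
    os.makedirs(folder, exist_ok=True)

    paths = []
    for filename, image in images.items():
        path = os.path.join(folder, filename)
        # Assumes a Pillow-style image object: Image.save(fp, format).
        image.save(path, "PNG")
        paths.append(path)
    return paths
```

Because the markdown file is written to the same subfolder in this sketch, a link that contains only the bare filename resolves without any path rewriting.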
- Markdown Linking:
  - In the final markdown output, marker includes links to these saved image files using the standard markdown image syntax.
  - The code snippet from marker/images/extract.py shows how image links are added to the markdown:

    def extract_page_images(page_obj, page):
        # ...
        image_markdown = f"\n\n![{image_filename}]({image_filename})\n\n"
        image_span = Span(
            bbox=bbox,
            text=image_markdown,
            font="Image",
            rotation=0,
            font_weight=0,
            font_size=0,
            image=True,
            span_id=f"image_{image_idx}"
        )
        # ...
Therefore, marker outputs images and charts as standalone PNG files within a subfolder of the output directory, and links to these files are included in the generated markdown using the standard markdown image syntax.
Note: marker does not currently support converting charts directly to markdown. It treats charts as images and extracts them as such.
While the marker project itself doesn't have a public online demo, it does have a hosted API available through DataLab.to.
This hosted API allows you to convert PDFs, Word documents, and PowerPoints to markdown using marker without needing to install or run the software locally.
You can find more information about the API and its pricing on the DataLab.to website.
While it's not a direct online demo, the hosted API provides a way to experiment with marker's capabilities and see its output for various document types.