Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save williamtran29/841131c6fb6b16d4931e6bb2282b5884 to your computer and use it in GitHub Desktop.
Save williamtran29/841131c6fb6b16d4931e6bb2282b5884 to your computer and use it in GitHub Desktop.
zerox
Directory structure:
└── getomni-ai-zerox/
├── README.md
├── commitlint.config.js
├── jest.config.js
├── LICENSE
├── Makefile
├── MANIFEST.in
├── package.json
├── pyproject.toml
├── setup.cfg
├── setup.py
├── .editorconfig
├── .npmignore
├── .pre-commit-config.yaml
├── assets/
│ └── cs101.md
├── examples/
│ └── node/
│ ├── azure.ts
│ ├── bedrock.ts
│ ├── google.ts
│ └── openai.ts
├── node-zerox/
│ ├── tsconfig.json
│ ├── scripts/
│ │ └── install-dependencies.js
│ ├── src/
│ │ ├── constants.ts
│ │ ├── handleWarnings.ts
│ │ ├── index.ts
│ │ ├── types.ts
│ │ ├── models/
│ │ │ ├── azure.ts
│ │ │ ├── bedrock.ts
│ │ │ ├── google.ts
│ │ │ ├── index.ts
│ │ │ └── openAI.ts
│ │ └── utils/
│ │ ├── common.ts
│ │ ├── file.ts
│ │ ├── image.ts
│ │ ├── index.ts
│ │ ├── model.ts
│ │ └── tesseract.ts
│ └── tests/
│ ├── README.md
│ ├── index.ts
│ ├── performance.test.ts
│ ├── utils.ts
│ └── data/
├── py_zerox/
│ ├── pyzerox/
│ │ ├── __init__.py
│ │ ├── constants/
│ │ │ ├── __init__.py
│ │ │ ├── conversion.py
│ │ │ ├── messages.py
│ │ │ ├── patterns.py
│ │ │ └── prompts.py
│ │ ├── core/
│ │ │ ├── __init__.py
│ │ │ ├── types.py
│ │ │ └── zerox.py
│ │ ├── errors/
│ │ │ ├── __init__.py
│ │ │ ├── base.py
│ │ │ └── exceptions.py
│ │ ├── models/
│ │ │ ├── __init__.py
│ │ │ ├── base.py
│ │ │ ├── modellitellm.py
│ │ │ └── types.py
│ │ └── processor/
│ │ ├── __init__.py
│ │ ├── image.py
│ │ ├── pdf.py
│ │ ├── text.py
│ │ └── utils.py
│ ├── scripts/
│ │ ├── __init__.py
│ │ └── pre_install.py
│ └── tests/
│ └── test_noop.py
├── shared/
│ ├── systemPrompt.txt
│ ├── test.json
│ ├── inputs/
│ └── outputs/
│ ├── 0001.md
│ ├── 0002.md
│ ├── 0003.md
│ ├── 0004.md
│ ├── 0005.md
│ ├── 0006.md
│ ├── 0007.md
│ ├── 0008.md
│ ├── 0009.md
│ ├── 0010.md
│ ├── 0011.md
│ ├── 0012.md
│ ├── 0013.md
│ ├── 0014.md
│ ├── 0015.md
│ ├── 0016.md
│ ├── 0017.md
│ ├── 0018.md
│ ├── 0019.md
│ ├── 0020.md
│ ├── 0021.md
│ ├── 0022.md
│ ├── 0023.md
│ ├── 0024.md
│ ├── 0025.md
│ ├── 0026.md
│ ├── 0027.md
│ ├── 0028.md
│ ├── 0029.md
│ ├── 0030.md
│ ├── 0031.md
│ ├── 0032.md
│ ├── 0033.md
│ ├── 0034.md
│ ├── 0035.md
│ ├── 0036.md
│ ├── 0037.md
│ ├── 0038.md
│ ├── 0039.md
│ └── 0040.md
└── .github/
└── workflows/
└── python-publish.yml
================================================
FILE: README.md
================================================
![Hero Image](./assets/heroImage.png)
## Zerox OCR
<a href="https://discord.gg/smg2QfwtJ6">
<img src="https://github.com/user-attachments/assets/cccc0e9a-e3b2-425e-9b54-e5024681b129" alt="Join us on Discord" width="200px">
</a>
A dead simple way of OCR-ing a document for AI ingestion. Documents are meant to be a visual representation after all. With weird layouts, tables, charts, etc. The vision models just make sense!
The general logic:
- Pass in a file (PDF, DOCX, image, etc.)
- Convert that file into a series of images
- Pass each image to GPT and ask nicely for Markdown
- Aggregate the responses and return Markdown
Try out the hosted version here: <https://getomni.ai/ocr-demo>
Or visit our full documentation at: <https://docs.getomni.ai/zerox>
## Getting Started
Zerox is available as both a Node and Python package.
- [Node README](#node-zerox) - [npm package](https://www.npmjs.com/package/zerox)
- [Python README](#python-zerox) - [pip package](https://pypi.org/project/py-zerox/)
| Feature | Node.js | Python |
| ------------------------- | ---------------------------- | -------------------------- |
| PDF Processing | ✓ (requires graphicsmagick) | ✓ (requires poppler) |
| Image Processing | ✓ | ✓ |
| OpenAI Support | ✓ | ✓ |
| Azure OpenAI Support | ✓ | ✓ |
| AWS Bedrock Support | ✓ | ✓ |
| Google Gemini Support | ✓ | ✓ |
| Vertex AI Support | ✗ | ✓ |
| Data Extraction | ✓ (`schema`) | ✗ |
| Per-page Extraction | ✓ (`extractPerPage`) | ✗ |
| Custom System Prompts | ✗ | ✓ (`custom_system_prompt`) |
| Maintain Format Option | ✓ (`maintainFormat`) | ✓ (`maintain_format`) |
| Async API | ✓ | ✓ |
| Error Handling Modes | ✓ (`errorMode`) | ✗ |
| Concurrent Processing | ✓ (`concurrency`) | ✓ (`concurrency`) |
| Temp Directory Management | ✓ (`tempDir`) | ✓ (`temp_dir`) |
| Page Selection | ✓ (`pagesToConvertAsImages`) | ✓ (`select_pages`) |
| Orientation Correction | ✓ (`correctOrientation`) | ✗ |
| Edge Trimming | ✓ (`trimEdges`) | ✗ |
## Node Zerox
(Node.js SDK - supports vision models from different providers like OpenAI, Azure OpenAI, Anthropic, AWS Bedrock, Google Gemini, etc.)
### Installation
```sh
npm install zerox
```
Zerox uses `graphicsmagick` and `ghostscript` for the PDF => image processing step. These should be pulled automatically, but you may need to manually install.
On linux use:
```
sudo apt-get update
sudo apt-get install -y graphicsmagick
```
## Usage
**With file URL**
```ts
import { zerox } from "zerox";
const result = await zerox({
filePath: "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf",
credentials: {
apiKey: process.env.OPENAI_API_KEY,
},
});
```
**From local path**
```ts
import { zerox } from "zerox";
import path from "path";
const result = await zerox({
filePath: path.resolve(__dirname, "./cs101.pdf"),
credentials: {
apiKey: process.env.OPENAI_API_KEY,
},
});
```
### Parameters
```ts
const result = await zerox({
// Required
filePath: "path/to/file",
credentials: {
apiKey: "your-api-key",
// Additional provider-specific credentials as needed
},
// Optional
cleanup: true, // Clear images from tmp after run
concurrency: 10, // Number of pages to run at a time
correctOrientation: true, // True by default, attempts to identify and correct page orientation
directImageExtraction: false, // Extract data directly from document images instead of the markdown
errorMode: ErrorMode.IGNORE, // ErrorMode.THROW or ErrorMode.IGNORE, defaults to ErrorMode.IGNORE
extractionPrompt: "", // LLM instructions for extracting data from document
extractOnly: false, // Set to true to only extract structured data using a schema
extractPerPage, // Extract data per page instead of the entire document
imageDensity: 300, // DPI for image conversion
imageHeight: 2048, // Maximum height for converted images
llmParams: {}, // Additional parameters to pass to the LLM
maintainFormat: false, // Slower but helps maintain consistent formatting
maxImageSize: 15, // Maximum size of images to compress, defaults to 15MB
maxRetries: 1, // Number of retries to attempt on a failed page, defaults to 1
maxTesseractWorkers: -1, // Maximum number of Tesseract workers. Zerox will start with a lower number and only reach maxTesseractWorkers if needed
model: ModelOptions.OPENAI_GPT_4O, // Model to use (supports various models from different providers)
modelProvider: ModelProvider.OPENAI, // Choose from OPENAI, BEDROCK, GOOGLE, or AZURE
outputDir: undefined, // Save combined result.md to a file
pagesToConvertAsImages: -1, // Page numbers to convert to image as array (e.g. `[1, 2, 3]`) or a number (e.g. `1`). Set to -1 to convert all pages
prompt: "", // LLM instructions for processing the document
schema: undefined, // Schema for structured data extraction
tempDir: "/os/tmp", // Directory to use for temporary files (default: system temp directory)
trimEdges: true, // True by default, trims pixels from all edges that contain values similar to the given background color, which defaults to that of the top-left pixel
});
```
The `maintainFormat` option tries to return the markdown in a consistent format by passing the output of a prior page in as additional context for the next page. This requires the requests to run synchronously, so it's a lot slower. But valuable if your documents have a lot of tabular data, or frequently have tables that cross pages.
```
Request #1 => page_1_image
Request #2 => page_1_markdown + page_2_image
Request #3 => page_2_markdown + page_3_image
```
### Example Output
```js
{
completionTime: 10038,
fileName: 'invoice_36258',
inputTokens: 25543,
outputTokens: 210,
pages: [
{
page: 1,
content: '# INVOICE # 36258\n' +
'**Date:** Mar 06 2012 \n' +
'**Ship Mode:** First Class \n' +
'**Balance Due:** $50.10 \n' +
'## Bill To:\n' +
'Aaron Bergman \n' +
'98103, Seattle, \n' +
'Washington, United States \n' +
'## Ship To:\n' +
'Aaron Bergman \n' +
'98103, Seattle, \n' +
'Washington, United States \n' +
'\n' +
'| Item | Quantity | Rate | Amount |\n' +
'|--------------------------------------------|----------|--------|---------|\n' +
"| Global Push Button Manager's Chair, Indigo | 1 | $48.71 | $48.71 |\n" +
'| Chairs, Furniture, FUR-CH-4421 | | | |\n' +
'\n' +
'**Subtotal:** $48.71 \n' +
'**Discount (20%):** $9.74 \n' +
'**Shipping:** $11.13 \n' +
'**Total:** $50.10 \n' +
'---\n' +
'**Notes:** \n' +
'Thanks for your business! \n' +
'**Terms:** \n' +
'Order ID : CA-2012-AB10015140-40974 ',
contentLength: 747,
}
],
extracted: null,
summary: {
totalPages: 1,
ocr: {
failed: 0,
successful: 1,
},
extracted: null,
},
}
```
### Data Extraction
Zerox supports structured data extraction from documents using a schema. This allows you to pull specific information from documents in a structured format instead of getting the full markdown conversion.
Set `extractOnly: true` and provide a `schema` to extract structured data. The schema follows the [JSON Schema standard](https://json-schema.org/understanding-json-schema/).
Use `extractPerPage` to extract data per page instead of from the whole document at once.
You can also set `extractionModel`, `extractionModelProvider`, and `extractionCredentials` to use a different model for extraction than OCR. By default, the same model is used.
### Supported Models
Zerox supports a wide range of models across different providers:
- **Azure OpenAI**
- GPT-4 Vision (gpt-4o)
- GPT-4 Vision Mini (gpt-4o-mini)
- GPT-4.1 (gpt-4.1)
- GPT-4.1 Mini (gpt-4.1-mini)
- **OpenAI**
- GPT-4 Vision (gpt-4o)
- GPT-4 Vision Mini (gpt-4o-mini)
- GPT-4.1 (gpt-4.1)
- GPT-4.1 Mini (gpt-4.1-mini)
- **AWS Bedrock**
- Claude 3 Haiku (2024.03, 2024.10)
- Claude 3 Sonnet (2024.02, 2024.06, 2024.10)
- Claude 3 Opus (2024.02)
- **Google Gemini**
- Gemini 1.5 (Flash, Flash-8B, Pro)
- Gemini 2.0 (Flash, Flash-Lite)
```ts
import { zerox } from "zerox";
import { ModelOptions, ModelProvider } from "zerox/node-zerox/dist/types";
// OpenAI
const openaiResult = await zerox({
filePath: "path/to/file.pdf",
modelProvider: ModelProvider.OPENAI,
model: ModelOptions.OPENAI_GPT_4O,
credentials: {
apiKey: process.env.OPENAI_API_KEY,
},
});
// Azure OpenAI
const azureResult = await zerox({
filePath: "path/to/file.pdf",
modelProvider: ModelProvider.AZURE,
model: ModelOptions.OPENAI_GPT_4O,
credentials: {
apiKey: process.env.AZURE_API_KEY,
endpoint: process.env.AZURE_ENDPOINT,
},
});
// AWS Bedrock
const bedrockResult = await zerox({
filePath: "path/to/file.pdf",
modelProvider: ModelProvider.BEDROCK,
model: ModelOptions.BEDROCK_CLAUDE_3_SONNET_2024_10,
credentials: {
accessKeyId: process.env.AWS_ACCESS_KEY_ID,
secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
region: process.env.AWS_REGION,
},
});
// Google Gemini
const geminiResult = await zerox({
filePath: "path/to/file.pdf",
modelProvider: ModelProvider.GOOGLE,
model: ModelOptions.GOOGLE_GEMINI_1_5_PRO,
credentials: {
apiKey: process.env.GEMINI_API_KEY,
},
});
```
## Python Zerox
(Python SDK - supports vision models from different providers like OpenAI, Azure OpenAI, Anthropic, AWS Bedrock, etc.)
### Installation
- Install **poppler** on the system, it should be available in path variable. See the [pdf2image documentation](https://pdf2image.readthedocs.io/en/latest/installation.html) for instructions by platform.
- Install py-zerox:
```sh
pip install py-zerox
```
The `pyzerox.zerox` function is an asynchronous API that performs OCR (Optical Character Recognition) to markdown using vision models. It processes PDF files and converts them into markdown format. Make sure to set up the environment variables for the model and the model provider before using this API.
Refer to the [LiteLLM Documentation](https://docs.litellm.ai/docs/providers) for setting up the environment and passing the correct model name.
### Usage
```python
from pyzerox import zerox
import os
import json
import asyncio
### Model Setup (Use only Vision Models) Refer: https://docs.litellm.ai/docs/providers ###
## placeholder for additional model kwargs which might be required for some models
kwargs = {}
## system prompt to use for the vision model
custom_system_prompt = None
# to override
# custom_system_prompt = "For the below PDF page, do something..something..." ## example
###################### Example for OpenAI ######################
model = "gpt-4o-mini" ## openai model
os.environ["OPENAI_API_KEY"] = "" ## your-api-key
###################### Example for Azure OpenAI ######################
model = "azure/gpt-4o-mini" ## "azure/<your_deployment_name>" -> format <provider>/<model>
os.environ["AZURE_API_KEY"] = "" # "your-azure-api-key"
os.environ["AZURE_API_BASE"] = "" # "https://example-endpoint.openai.azure.com"
os.environ["AZURE_API_VERSION"] = "" # "2023-05-15"
###################### Example for Gemini ######################
model = "gemini/gpt-4o-mini" ## "gemini/<gemini_model>" -> format <provider>/<model>
os.environ['GEMINI_API_KEY'] = "" # your-gemini-api-key
###################### Example for Anthropic ######################
model="claude-3-opus-20240229"
os.environ["ANTHROPIC_API_KEY"] = "" # your-anthropic-api-key
###################### Vertex ai ######################
model = "vertex_ai/gemini-1.5-flash-001" ## "vertex_ai/<model_name>" -> format <provider>/<model>
## GET CREDENTIALS
## RUN ##
# !gcloud auth application-default login - run this to add vertex credentials to your env
## OR ##
file_path = 'path/to/vertex_ai_service_account.json'
# Load the JSON file
with open(file_path, 'r') as file:
vertex_credentials = json.load(file)
# Convert to JSON string
vertex_credentials_json = json.dumps(vertex_credentials)
vertex_credentials=vertex_credentials_json
## extra args
kwargs = {"vertex_credentials": vertex_credentials}
###################### For other providers refer: https://docs.litellm.ai/docs/providers ######################
# Define main async entrypoint
async def main():
file_path = "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf" ## local filepath and file URL supported
## process only some pages or all
select_pages = None ## None for all, but could be int or list(int) page numbers (1 indexed)
output_dir = "./output_test" ## directory to save the consolidated markdown file
result = await zerox(file_path=file_path, model=model, output_dir=output_dir,
custom_system_prompt=custom_system_prompt,select_pages=select_pages, **kwargs)
return result
# run the main function:
result = asyncio.run(main())
# print markdown result
print(result)
```
### Parameters
```python
async def zerox(
cleanup: bool = True,
concurrency: int = 10,
file_path: Optional[str] = "",
maintain_format: bool = False,
model: str = "gpt-4o-mini",
output_dir: Optional[str] = None,
temp_dir: Optional[str] = None,
custom_system_prompt: Optional[str] = None,
select_pages: Optional[Union[int, Iterable[int]]] = None,
**kwargs
) -> ZeroxOutput:
...
```
Parameters
- **cleanup** (bool, optional):
Whether to clean up temporary files after processing. Defaults to True.
- **concurrency** (int, optional):
The number of concurrent processes to run. Defaults to 10.
- **file_path** (Optional[str], optional):
The path to the PDF file to process. Defaults to an empty string.
- **maintain_format** (bool, optional):
Whether to maintain the format from the previous page. Defaults to False.
- **model** (str, optional):
The model to use for generating completions. Defaults to "gpt-4o-mini".
Refer to LiteLLM Providers for the correct model name, as it may differ depending on the provider.
- **output_dir** (Optional[str], optional):
The directory to save the markdown output. Defaults to None.
- **temp_dir** (str, optional):
The directory to store temporary files, defaults to some named folder in system's temp directory. If already exists, the contents will be deleted before Zerox uses it.
- **custom_system_prompt** (str, optional):
The system prompt to use for the model, this overrides the default system prompt of Zerox.Generally it is not required unless you want some specific behavior. Defaults to None.
- **select_pages** (Optional[Union[int, Iterable[int]]], optional):
Pages to process, can be a single page number or an iterable of page numbers. Defaults to None
- **kwargs** (dict, optional):
Additional keyword arguments to pass to the litellm.completion method.
Refer to the LiteLLM Documentation and Completion Input for details.
Returns
- ZeroxOutput:
Contains the markdown content generated by the model and also some metadata (refer below).
### Example Output (output from "azure/gpt-4o-mini")
Note the output is manually wrapped for this documentation for better readability.
````Python
ZeroxOutput(
completion_time=9432.975,
file_name='cs101',
input_tokens=36877,
output_tokens=515,
pages=[
Page(
content='| Type | Description | Wrapper Class |\n' +
'|---------|--------------------------------------|---------------|\n' +
'| byte | 8-bit signed 2s complement integer | Byte |\n' +
'| short | 16-bit signed 2s complement integer | Short |\n' +
'| int | 32-bit signed 2s complement integer | Integer |\n' +
'| long | 64-bit signed 2s complement integer | Long |\n' +
'| float | 32-bit IEEE 754 floating point number| Float |\n' +
'| double | 64-bit floating point number | Double |\n' +
'| boolean | may be set to true or false | Boolean |\n' +
'| char | 16-bit Unicode (UTF-16) character | Character |\n\n' +
'Table 26.2.: Primitive types in Java\n\n' +
'### 26.3.1. Declaration & Assignment\n\n' +
'Java is a statically typed language meaning that all variables must be declared before you can use ' +
'them or refer to them. In addition, when declaring a variable, you must specify both its type and ' +
'its identifier. For example:\n\n' +
'```java\n' +
'int numUnits;\n' +
'double costPerUnit;\n' +
'char firstInitial;\n' +
'boolean isStudent;\n' +
'```\n\n' +
'Each declaration specifies the variable’s type followed by the identifier and ending with a ' +
'semicolon. The identifier rules are fairly standard: a name can consist of lowercase and ' +
'uppercase alphabetic characters, numbers, and underscores but may not begin with a numeric ' +
'character. We adopt the modern camelCasing naming convention for variables in our code. In ' +
'general, variables must be assigned a value before you can use them in an expression. You do not ' +
'have to immediately assign a value when you declare them (though it is good practice), but some ' +
'value must be assigned before they can be used or the compiler will issue an error.\n\n' +
'The assignment operator is a single equal sign, `=` and is a right-to-left assignment. That is, ' +
'the variable that we wish to assign the value to appears on the left-hand-side while the value ' +
'(literal, variable or expression) is on the right-hand-side. Using our variables from before, ' +
'we can assign them values:\n\n' +
'> 2 Instance variables, that is variables declared as part of an object do have default values. ' +
'For objects, the default is `null`, for all numeric types, zero is the default value. For the ' +
'boolean type, `false` is the default, and the default char value is `\\0`, the null-terminating ' +
'character (zero in the ASCII table).',
content_length=2333,
page=1
)
]
)
````
## Supported File Types
We use a combination of `libreoffice` and `graphicsmagick` to do document => image conversion. For non-image / non-PDF files, we use libreoffice to convert that file to a PDF, and then to an image.
```js
[
"pdf", // Portable Document Format
"doc", // Microsoft Word 97-2003
"docx", // Microsoft Word 2007-2019
"odt", // OpenDocument Text
"ott", // OpenDocument Text Template
"rtf", // Rich Text Format
"txt", // Plain Text
"html", // HTML Document
"htm", // HTML Document (alternative extension)
"xml", // XML Document
"wps", // Microsoft Works Word Processor
"wpd", // WordPerfect Document
"xls", // Microsoft Excel 97-2003
"xlsx", // Microsoft Excel 2007-2019
"ods", // OpenDocument Spreadsheet
"ots", // OpenDocument Spreadsheet Template
"csv", // Comma-Separated Values
"tsv", // Tab-Separated Values
"ppt", // Microsoft PowerPoint 97-2003
"pptx", // Microsoft PowerPoint 2007-2019
"odp", // OpenDocument Presentation
"otp", // OpenDocument Presentation Template
];
```
## Credits
- [Litellm](https://github.com/BerriAI/litellm): <https://github.com/BerriAI/litellm> | This powers our python sdk to support all popular vision models from different providers.
### License
This project is licensed under the MIT License.
================================================
FILE: commitlint.config.js
================================================
module.exports = {
extends: [
"@commitlint/config-conventional"
],
}
================================================
FILE: jest.config.js
================================================
/** @type {import('ts-jest').JestConfigWithTsJest} **/
module.exports = {
preset: "ts-jest",
testEnvironment: "node",
moduleDirectories: ["node_modules"],
transform: {
"^.+\\.tsx?$": [
"ts-jest",
{
tsconfig: "node-zerox/tsconfig.json",
},
],
},
};
================================================
FILE: LICENSE
================================================
The MIT License (MIT)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: Makefile
================================================
# Define the package directory for zerox
PACKAGE_DIR := py_zerox
# Define directory configs
VENV_DIR := .venv
DIST_DIR := ${PACKAGE_DIR}/dist
SRC_DIR := $(PACKAGE_DIR)/zerox
TEST_DIR := $(PACKAGE_DIR)/tests
# Define the build configs
POETRY_VERSION := 1.8.3
PYTHON_VERSION := 3.11
POETRY := poetry
# Test related configs
PYTEST_OPTIONS := -v
# Default target
.PHONY: all
all: venv build test dev
# Conditional map executable
ifeq ($(VIRTUAL_ENV),)
PYTHON := python$(PYTHON_VERSION)
else
PYTHON := python
endif
# Initialization
.PHONY: init
init:
@echo "== Initializing Development Environment =="
brew install node
brew install pre-commit
curl -sSL https://install.python-poetry.org | $(PYTHON) -
@echo "== Installing Pre-Commit Hooks =="
pre-commit install
pre-commit autoupdate
pre-commit install --install-hooks
pre-commit install --hook-type commit-msg
# Create virtual environment if it doesn't exist
.PHONY: venv
venv: $(VENV_DIR)/bin/activate
$(VENV_DIR)/bin/activate:
@echo "== Creating Virtual Environment =="
$(PYTHON) -m venv $(VENV_DIR)
. $(VENV_DIR)/bin/activate && pip install --upgrade pip setuptools wheel
touch $(VENV_DIR)/bin/activate
# Resolving dependencies and build the package using SetupTools
.PHONY: build
build: venv
@echo "== Resolving dependencies and building the package using SetupTools =="
$(PYTHON) setup.py sdist --dist-dir $(DIST_DIR)
# Install test dependencies for test environment
.PHONY: install-test
install-test: venv
@echo "== Resolving test dependencies =="
$(POETRY) install --with test
# Test out the build
.PHONY: test
test: install-test
@echo "== Triggering tests =="
pytest $(TEST_DIR) $(PYTEST_OPTIONS) || (echo "Tests failed" && exit 1)
# Clean build artifacts
.PHONY: clean
clean:
@echo "== Cleaning DIST_DIR and VENV_DIR =="
rm -rf $(DIST_DIR)
rm -rf $(VENV_DIR)
# Install dev dependencies for dev environment
.PHONY: install-dev
install-dev: venv build
@echo "== Resolving development dependencies =="
$(POETRY) install --with dev
# Package Development Build
.PHONY: dev
dev:
@echo "== Preparing development build =="
$(PYTHON) -m pip install -e .
.PHONY: check
check: install-dev lint format
.PHONY: lint
lint: venv
@echo "== Running Linting =="
$(VENV_DIR)/bin/ruff check $(SRC_DIR) $(TEST_DIR)
.PHONY: format
format: venv
@echo "== Running Formatting =="
$(VENV_DIR)/bin/black --check $(SRC_DIR) $(TEST_DIR)
.PHONY: fix
fix: install-dev lint-fix format-fix
.PHONY: lint-fix
lint-fix: venv
@echo "== Running Linting =="
$(VENV_DIR)/bin/ruff check --fix $(SRC_DIR) $(TEST_DIR)
.PHONY: format-fix
format-fix: venv
@echo "== Running Formatmting =="
$(VENV_DIR)/bin/black $(SRC_DIR) $(TEST_DIR)
================================================
FILE: MANIFEST.in
================================================
include setup.py
include README.md
include LICENSE
recursive-include py_zerox/zerox *
recursive-include py_zerox/scripts *
================================================
FILE: package.json
================================================
{
"name": "zerox",
"version": "1.1.19",
"description": "ocr documents using gpt-4o-mini",
"main": "node-zerox/dist/index.js",
"scripts": {
"clean": "rm -rf node-zerox/dist",
"build": "npm run clean && tsc -p node-zerox/tsconfig.json",
"postinstall": "node node-zerox/scripts/install-dependencies.js",
"prepublishOnly": "npm run build",
"test": "ts-node node-zerox/tests/index.ts",
"test:performance": "jest node-zerox/tests/performance.test.ts --runInBand"
},
"author": "tylermaran",
"license": "MIT",
"dependencies": {
"@aws-sdk/client-bedrock-runtime": "^3.734.0",
"@google/genai": "^0.9.0",
"axios": "^1.7.2",
"child_process": "^1.0.2",
"file-type": "^16.5.4",
"fs-extra": "^11.2.0",
"heic-convert": "^2.1.0",
"libreoffice-convert": "^1.6.0",
"mime-types": "^2.1.35",
"openai": "^4.82.0",
"os": "^0.1.2",
"p-limit": "^3.1.0",
"path": "^0.12.7",
"pdf-parse": "^1.1.1",
"pdf2pic": "^3.1.1",
"sharp": "^0.33.5",
"tesseract.js": "^5.1.1",
"util": "^0.12.5",
"uuid": "^11.0.3",
"xlsx": "^0.18.5"
},
"devDependencies": {
"@types/fs-extra": "^11.0.4",
"@types/heic-convert": "^2.1.0",
"@types/jest": "^29.5.14",
"@types/mime-types": "^2.1.4",
"@types/node": "^20.14.11",
"@types/pdf-parse": "^1.1.4",
"@types/prompts": "^2.4.9",
"@types/xlsx": "^0.0.35",
"dotenv": "^16.4.5",
"jest": "^29.7.0",
"prompts": "^2.4.2",
"ts-jest": "^29.2.5",
"ts-node": "^10.9.2",
"typescript": "^5.5.3"
},
"repository": {
"type": "git",
"url": "git+https://github.com/getomni-ai/zerox.git"
},
"keywords": [
"ocr",
"document",
"llm"
],
"types": "node-zerox/dist/index.d.ts",
"bugs": {
"url": "https://github.com/getomni-ai/zerox/issues"
},
"homepage": "https://github.com/getomni-ai/zerox#readme"
}
================================================
FILE: pyproject.toml
================================================
[tool.poetry]
name = "py-zerox"
version = "0.0.7"
description = "ocr documents using vision models from all popular providers like OpenAI, Azure OpenAI, Anthropic, AWS Bedrock etc"
authors = ["wizenheimer","pradhyumna85"]
license = "MIT"
readme = "README.md"
packages = [{ include = "pyzerox", from = "py_zerox" }]
repository = "https://github.com/getomni-ai/zerox.git"
documentation = "https://github.com/getomni-ai/zerox"
keywords = ["ocr", "document", "llm"]
package-mode = false
[tool.poetry.dependencies]
python = "^3.11"
aiofiles = "^23.0"
aiohttp = "^3.9.5"
pdf2image = "^1.17.0"
litellm = "^1.44.15"
aioshutil = "^1.5"
pypdf2 = "^3.0.1"
[tool.poetry.scripts]
pre-install = "py_zerox.scripts.pre_install:check_and_install"
[tool.poetry.group.dev.dependencies]
notebook = "^7.2.1"
black = "^24.4.2"
ruff = "^0.5.5"
[tool.poetry.group.test.dependencies]
pytest = "^8.3.2"
================================================
FILE: setup.cfg
================================================
[metadata]
name = py-zerox
version = 0.0.7
description = ocr documents using vision models from all popular providers like OpenAI, Azure OpenAI, Anthropic, AWS Bedrock etc
long_description = file: README.md
long_description_content_type = text/markdown
author = wizenheimer, pradhyumna85
license = MIT
license_file = LICENSE
classifiers =
License :: OSI Approved :: MIT License
Programming Language :: Python :: 3
Programming Language :: Python :: 3.11
[options]
package_dir =
= py_zerox
packages = find:
python_requires = >=3.11
install_requires =
aiofiles>=23.0
aiohttp>=3.9.5
pdf2image>=1.17.0
litellm>=1.44.15
aioshutil>=1.5
PyPDF2>=3.0.1
[options.packages.find]
where = py_zerox.pyzerox
[options.entry_points]
console_scripts =
py-zerox-pre-install = py_zerox.scripts.pre_install:check_and_install
================================================
FILE: setup.py
================================================
from setuptools import setup, find_packages
from setuptools.command.install import install
import subprocess
import sys
class InstallSystemDependencies(install):
def run(self):
try:
subprocess.check_call(
[sys.executable, "-m", "py_zerox.scripts.pre_install"])
except subprocess.CalledProcessError as e:
print(f"Pre-install script failed: {e}", file=sys.stderr)
sys.exit(1)
install.run(self)
setup(
name="py-zerox",
cmdclass={
"install": InstallSystemDependencies,
},
version="0.0.7",
packages=find_packages(where="py_zerox"), # Specify the root folder
package_dir={"": "py_zerox"}, # Map root directory
include_package_data=True,
)
================================================
FILE: .editorconfig
================================================
# EditorConfig is awesome: https://EditorConfig.org
# top-most EditorConfig file
root = true
[*]
indent_style = space
indent_size = 4
end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = false
[{*.yaml,*.yml}]
indent_size = 2
ij_yaml_keep_indents_on_empty_lines = false
ij_yaml_keep_line_breaks = true
[Makefile]
indent_style = tab
[*.py]
indent_size = 4
[{*.js,*.ts,*.md,*.json}]
indent_size = 2
================================================
FILE: .npmignore
================================================
# Folders
node-zerox/src/
node-zerox/tests/
py_zerox/
assets/
shared/
# Config files
.pre-commit-config.yaml
.editorconfig
MANIFEST.in
commitlint.config.js
poetry.lock
pyproject.toml
setup.cfg
setup.py
Makefile
eng.traineddata
.env
# File types
*.ts
# Keep type declarations
!.gitignore
!node-zerox/dist/**/*.d.ts
================================================
FILE: .pre-commit-config.yaml
================================================
repos:
# pre-commit hooks for testing the files
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: "v4.6.0"
hooks:
- id: check-added-large-files
- id: no-commit-to-branch
- id: check-toml
- id: check-yaml
- id: check-json
- id: check-xml
- id: end-of-file-fixer
exclude: \.json$
files: \.py$
- id: trailing-whitespace
- id: mixed-line-ending
# for formatting
- repo: https://github.com/psf/black
rev: 24.4.2
hooks:
- id: black
args: ["--line-length=100"]
language_version: python3
# for linting & style checks
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.5.5
hooks:
- id: ruff
args: ["--fix"]
================================================
FILE: assets/cs101.md
================================================
| Type | Description | Wrapper Class |
| ------- | ------------------------------------- | ------------- |
| byte | 8-bit signed 2s complement integer | Byte |
| short | 16-bit signed 2s complement integer | Short |
| int | 32-bit signed 2s complement integer | Integer |
| long | 64-bit signed 2s complement integer | Long |
| float | 32-bit IEEE 754 floating point number | Float |
| double | 64-bit floating point number | Double |
| boolean | may be set to true or false | Boolean |
| char | 16-bit Unicode (UTF-16) character | Character |
Table 26.2.: Primitive types in Java
### 26.3.1. Declaration & Assignment
Java is a statically typed language meaning that all variables must be declared before you can use them or refer to them. In addition, when declaring a variable, you must specify both its type and its identifier. For example:
```java
int numUnits;
double costPerUnit;
char firstInitial;
boolean isStudent;
```
Each declaration specifies the variable’s type followed by the identifier ending with a semicolon. The identifier rules are fairly standard: a name can consist of lowercase and uppercase alphabetic characters, numbers, and underscores but may not begin with a numeric character. We adopt the modern camelCasing naming convention for variables in our code. In general, variables must be assigned a value before you can use them in an expression. You do not have to immediately assign a value when you declare them (though it is good practice), but some value must be assigned before they can be used or the compiler will issue an error.
The assignment operator is a single equal sign, `=` and is a right-to-left assignment. That is, the variable that we wish to assign the value to appears on the left-hand-side while the value (literal, variable or expression) is on the right-hand-side. Using our variables from before, we can assign them values:
```
2Instance variables, that is variables declared as part of an object do have default values. For objects, the default is `null`, for all numeric types, zero is the default value. For the `boolean` type, `false` is the default, and the default `char` value is `\0`, the null-terminating character (zero in the ASCII table).
```
```
391
```
================================================
FILE: examples/node/azure.ts
================================================
import { ModelOptions, ModelProvider } from "zerox/node-zerox/dist/types";
import { zerox } from "zerox";
/**
* Example using Azure OpenAI with Zerox to extract structured data from documents.
* This shows extraction setup with schema definition for a property report document.
*/
async function main() {
// Define the schema for property report data extraction
const schema = {
type: "object",
properties: {
commercial_office: {
type: "object",
properties: {
average: { type: "string" },
median: { type: "string" },
},
required: ["average", "median"],
},
transactions_by_quarter: {
type: "array",
items: {
type: "object",
properties: {
quarter: { type: "string" },
transactions: { type: "integer" },
},
required: ["quarter", "transactions"],
},
},
year: { type: "integer" },
},
required: ["commercial_office", "transactions_by_quarter", "year"],
};
try {
const result = await zerox({
credentials: {
apiKey: process.env.AZURE_API_KEY || "",
endpoint: process.env.AZURE_ENDPOINT || "",
},
extractOnly: true, // Skip OCR, only perform extraction (defaults to false)
filePath:
"https://omni-demo-data.s3.amazonaws.com/test/property_report.png",
model: ModelOptions.OPENAI_GPT_4O,
modelProvider: ModelProvider.AZURE,
schema,
});
console.log("Extracted data:", result.extracted);
} catch (error) {
console.error("Error extracting data:", error);
}
}
main();
================================================
FILE: examples/node/bedrock.ts
================================================
import { ModelOptions, ModelProvider } from "zerox/node-zerox/dist/types";
import { zerox } from "zerox";
/**
* Example using Bedrock Anthropic with Zerox to extract structured data from documents.
* This shows extraction setup with schema definition for a property report document.
*/
async function main() {
// Define the schema for property report data extraction
const schema = {
type: "object",
properties: {
commercial_office: {
type: "object",
properties: {
average: { type: "string" },
median: { type: "string" },
},
required: ["average", "median"],
},
transactions_by_quarter: {
type: "array",
items: {
type: "object",
properties: {
quarter: { type: "string" },
transactions: { type: "integer" },
},
required: ["quarter", "transactions"],
},
},
year: { type: "integer" },
},
required: ["commercial_office", "transactions_by_quarter", "year"],
};
try {
const result = await zerox({
credentials: {
accessKeyId: process.env.ACCESS_KEY_ID,
region: process.env.REGION || "us-east-1",
secretAccessKey: process.env.SECRET_ACCESS_KEY,
},
extractOnly: true, // Skip OCR, only perform extraction (defaults to false)
filePath:
"https://omni-demo-data.s3.amazonaws.com/test/property_report.png",
model: ModelOptions.BEDROCK_CLAUDE_3_HAIKU_2024_03,
modelProvider: ModelProvider.BEDROCK,
schema,
});
console.log("Extracted data:", result.extracted);
} catch (error) {
console.error("Error extracting data:", error);
}
}
main();
================================================
FILE: examples/node/google.ts
================================================
import { ModelOptions, ModelProvider } from "zerox/node-zerox/dist/types";
import { zerox } from "zerox";
/**
* Example using Google Gemini with Zerox to extract structured data from documents.
* This shows extraction setup with schema definition for a property report document.
*/
async function main() {
// Define the schema for property report data extraction
const schema = {
type: "object",
properties: {
commercial_office: {
type: "object",
properties: {
average: { type: "string" },
median: { type: "string" },
},
required: ["average", "median"],
},
transactions_by_quarter: {
type: "array",
items: {
type: "object",
properties: {
quarter: { type: "string" },
transactions: { type: "integer" },
},
required: ["quarter", "transactions"],
},
},
year: { type: "integer" },
},
required: ["commercial_office", "transactions_by_quarter", "year"],
};
try {
const result = await zerox({
credentials: {
apiKey: process.env.GEMINI_API_KEY || "",
},
extractOnly: true, // Skip OCR, only perform extraction (defaults to false)
filePath:
"https://omni-demo-data.s3.amazonaws.com/test/property_report.png",
model: ModelOptions.GOOGLE_GEMINI_2_FLASH,
modelProvider: ModelProvider.GOOGLE,
schema,
});
console.log("Extracted data:", result.extracted);
} catch (error) {
console.error("Error extracting data:", error);
}
}
main();
================================================
FILE: examples/node/openai.ts
================================================
import { ModelOptions, ModelProvider } from "zerox/node-zerox/dist/types";
import { zerox } from "zerox";
/**
* Example using OpenAI with Zerox to extract structured data from documents.
* This shows extraction setup with schema definition for a property report document.
*/
async function main() {
// Define the schema for property report data extraction
const schema = {
type: "object",
properties: {
commercial_office: {
type: "object",
properties: {
average: { type: "string" },
median: { type: "string" },
},
required: ["average", "median"],
},
transactions_by_quarter: {
type: "array",
items: {
type: "object",
properties: {
quarter: { type: "string" },
transactions: { type: "integer" },
},
required: ["quarter", "transactions"],
},
},
year: { type: "integer" },
},
required: ["commercial_office", "transactions_by_quarter", "year"],
};
try {
const result = await zerox({
credentials: {
apiKey: process.env.OPENAI_API_KEY || "",
},
extractOnly: true, // Skip OCR, only perform extraction (defaults to false)
filePath:
"https://omni-demo-data.s3.amazonaws.com/test/property_report.png",
model: ModelOptions.OPENAI_GPT_4O,
modelProvider: ModelProvider.OPENAI,
schema,
});
console.log("Extracted data:", result.extracted);
} catch (error) {
console.error("Error extracting data:", error);
}
}
main();
================================================
FILE: node-zerox/tsconfig.json
================================================
{
"compilerOptions": {
"target": "ES5",
"module": "commonjs",
"declaration": true,
"outDir": "./dist",
"strict": true,
"esModuleInterop": true,
"skipLibCheck": true
},
"include": ["src/**/*"],
"exclude": ["node_modules", "**/*.test.ts"]
}
================================================
FILE: node-zerox/scripts/install-dependencies.js
================================================
const { exec } = require("child_process");
const { promisify } = require("util");
const execPromise = promisify(exec);
const installPackage = async (command, packageName) => {
try {
const { stdout, stderr } = await execPromise(command);
if (stderr) {
throw new Error(`Failed to install ${packageName}: ${stderr}`);
}
return stdout;
} catch (error) {
throw new Error(`Failed to install ${packageName}: ${error.message}`);
}
};
const isSudoAvailable = async () => {
try {
// Try running a sudo command
await execPromise("sudo -n true");
return true;
} catch {
return false;
}
};
const checkAndInstall = async () => {
try {
const sudoAvailable = await isSudoAvailable();
// Check and install Ghostscript
try {
await execPromise("gs --version");
} catch {
if (process.platform === "darwin") {
await installPackage("brew install ghostscript", "Ghostscript");
} else if (process.platform === "linux") {
const command = sudoAvailable
? "sudo apt-get update && sudo apt-get install -y ghostscript"
: "apt-get update && apt-get install -y ghostscript";
await installPackage(command, "Ghostscript");
} else {
throw new Error(
"Please install Ghostscript manually from https://www.ghostscript.com/download.html"
);
}
}
// Check and install GraphicsMagick
try {
await execPromise("gm -version");
} catch {
if (process.platform === "darwin") {
await installPackage("brew install graphicsmagick", "GraphicsMagick");
} else if (process.platform === "linux") {
const command = sudoAvailable
? "sudo apt-get update && sudo apt-get install -y graphicsmagick"
: "apt-get update && apt-get install -y graphicsmagick";
await installPackage(command, "GraphicsMagick");
} else {
throw new Error(
"Please install GraphicsMagick manually from http://www.graphicsmagick.org/download.html"
);
}
}
// Check and install LibreOffice
try {
await execPromise("soffice --version");
} catch {
if (process.platform === "darwin") {
await installPackage("brew install --cask libreoffice", "LibreOffice");
} else if (process.platform === "linux") {
const command = sudoAvailable
? "sudo apt-get update && sudo apt-get install -y libreoffice"
: "apt-get update && apt-get install -y libreoffice";
await installPackage(command, "LibreOffice");
} else {
throw new Error(
"Please install LibreOffice manually from https://www.libreoffice.org/download/download/"
);
}
}
// Check and install Poppler
try {
await execPromise("pdfinfo -v || pdftoppm -v");
} catch {
if (process.platform === "darwin") {
await installPackage("brew install poppler", "Poppler");
} else if (process.platform === "linux") {
const command = sudoAvailable
? "sudo apt-get update && sudo apt-get install -y poppler-utils"
: "apt-get update && apt-get install -y poppler-utils";
await installPackage(command, "Poppler");
} else {
throw new Error(
"Please install Poppler manually from https://poppler.freedesktop.org/"
);
}
}
} catch (err) {
console.error(`Error during installation: ${err.message}`);
process.exit(1);
}
};
checkAndInstall();
================================================
FILE: node-zerox/src/constants.ts
================================================
export const ASPECT_RATIO_THRESHOLD = 5;
// This is a rough guess; this will be used to create Tesseract workers by default,
// that cater to this many pages. If a document has more than this many pages,
// then more workers will be created dynamically.
export const NUM_STARTING_WORKERS = 3;
export const CONSISTENCY_PROMPT = (priorPage: string): string =>
`Markdown must maintain consistent formatting with the following page: \n\n """${priorPage}"""`;
export const SYSTEM_PROMPT_BASE = `
Convert the following document to markdown.
Return only the markdown with no explanation text. Do not include delimiters like \`\`\`markdown or \`\`\`html.
RULES:
- You must include all information on the page. Do not exclude headers, footers, or subtext.
- Return tables in an HTML format.
- Charts & infographics must be interpreted to a markdown format. Prefer table format when applicable.
- Logos should be wrapped in brackets. Ex: <logo>Coca-Cola<logo>
- Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY<watermark>
- Page numbers should be wrapped in brackets. Ex: <page_number>14<page_number> or <page_number>9/22<page_number>
- Prefer using ☐ and ☑ for check boxes.
`;
================================================
FILE: node-zerox/src/handleWarnings.ts
================================================
// Tesseract relies on node-fetch v2, which has a deprecated version of punycode
// Suppress the warning for now. Check in when teseract updates to node-fetch v3
// https://github.com/naptha/tesseract.js/issues/876
if (process.stderr.write === process.stderr.constructor.prototype.write) {
const stdErrWrite = process.stderr.write;
process.stderr.write = function (chunk: any, ...args: any[]) {
const str = Buffer.isBuffer(chunk) ? chunk.toString() : chunk;
// Filter out the punycode deprecation warning
if (str.includes("punycode")) {
return true;
}
return stdErrWrite.apply(process.stderr, [chunk]);
};
}
================================================
FILE: node-zerox/src/index.ts
================================================
import fs from "fs-extra";
import os from "os";
import path from "path";
import pLimit from "p-limit";
import Tesseract from "tesseract.js";
import "./handleWarnings";
import {
addWorkersToTesseractScheduler,
checkIsCFBFile,
checkIsPdfFile,
cleanupImage,
CompletionProcessor,
compressImage,
convertFileToPdf,
convertHeicToJpeg,
convertPdfToImages,
downloadFile,
extractPagesFromStructuredDataFile,
getNumberOfPagesFromPdf,
getTesseractScheduler,
isCompletionResponse,
isStructuredDataFile,
prepareWorkersForImageProcessing,
runRetries,
splitSchema,
terminateScheduler,
} from "./utils";
import { createModel } from "./models";
import {
CompletionResponse,
ErrorMode,
ExtractionResponse,
HybridInput,
LogprobPage,
ModelOptions,
ModelProvider,
OperationMode,
Page,
PageStatus,
ZeroxArgs,
ZeroxOutput,
} from "./types";
import { NUM_STARTING_WORKERS } from "./constants";
export const zerox = async ({
cleanup = true,
concurrency = 10,
correctOrientation = true,
credentials = { apiKey: "" },
customModelFunction,
directImageExtraction = false,
enableHybridExtraction = false,
errorMode = ErrorMode.IGNORE,
extractionCredentials,
extractionLlmParams,
extractionModel,
extractionModelProvider,
extractionPrompt,
extractOnly = false,
extractPerPage,
filePath,
imageDensity,
imageHeight,
llmParams = {},
maintainFormat = false,
maxImageSize = 15,
maxRetries = 1,
maxTesseractWorkers = -1,
model = ModelOptions.OPENAI_GPT_4O,
modelProvider = ModelProvider.OPENAI,
openaiAPIKey = "",
outputDir,
pagesToConvertAsImages = -1,
prompt,
schema,
tempDir = os.tmpdir(),
trimEdges = true,
}: ZeroxArgs): Promise<ZeroxOutput> => {
let extracted: Record<string, unknown> | null = null;
let extractedLogprobs: LogprobPage[] = [];
let inputTokenCount: number = 0;
let outputTokenCount: number = 0;
let numSuccessfulOCRRequests: number = 0;
let numFailedOCRRequests: number = 0;
let ocrLogprobs: LogprobPage[] = [];
let priorPage: string = "";
let pages: Page[] = [];
let imagePaths: string[] = [];
const startTime = new Date();
if (openaiAPIKey && openaiAPIKey.length > 0) {
modelProvider = ModelProvider.OPENAI;
credentials = { apiKey: openaiAPIKey };
}
extractionCredentials = extractionCredentials ?? credentials;
extractionLlmParams = extractionLlmParams ?? llmParams;
extractionModel = extractionModel ?? model;
extractionModelProvider = extractionModelProvider ?? modelProvider;
// Validators
if (Object.values(credentials).every((credential) => !credential)) {
throw new Error("Missing credentials");
}
if (!filePath || !filePath.length) {
throw new Error("Missing file path");
}
if (enableHybridExtraction && (directImageExtraction || extractOnly)) {
throw new Error(
"Hybrid extraction cannot be used in direct image extraction or extract-only mode"
);
}
if (enableHybridExtraction && !schema) {
throw new Error("Schema is required when hybrid extraction is enabled");
}
if (extractOnly && !schema) {
throw new Error("Schema is required for extraction mode");
}
if (extractOnly && maintainFormat) {
throw new Error("Maintain format is only supported in OCR mode");
}
if (extractOnly) directImageExtraction = true;
let scheduler: Tesseract.Scheduler | null = null;
// Add initial tesseract workers if we need to correct orientation
if (correctOrientation) {
scheduler = await getTesseractScheduler();
const workerCount =
maxTesseractWorkers !== -1 && maxTesseractWorkers < NUM_STARTING_WORKERS
? maxTesseractWorkers
: NUM_STARTING_WORKERS;
await addWorkersToTesseractScheduler({
numWorkers: workerCount,
scheduler,
});
}
try {
// Ensure temp directory exists + create temp folder
const rand = Math.floor(1000 + Math.random() * 9000).toString();
const tempDirectory = path.join(
tempDir || os.tmpdir(),
`zerox-temp-${rand}`
);
const sourceDirectory = path.join(tempDirectory, "source");
await fs.ensureDir(sourceDirectory);
// Download the PDF. Get file name.
const { extension, localPath } = await downloadFile({
filePath,
tempDir: sourceDirectory,
});
if (!localPath) throw "Failed to save file to local drive";
// Sort the `pagesToConvertAsImages` array to make sure we use the right index
// for `formattedPages` as `pdf2pic` always returns images in order
if (Array.isArray(pagesToConvertAsImages)) {
pagesToConvertAsImages.sort((a, b) => a - b);
}
// Check if the file is a structured data file (like Excel).
// If so, skip the image conversion process and extract the pages directly
if (isStructuredDataFile(localPath)) {
pages = await extractPagesFromStructuredDataFile(localPath);
} else {
// Read the image file or convert the file to images
if (
extension === ".png" ||
extension === ".jpg" ||
extension === ".jpeg"
) {
imagePaths = [localPath];
} else if (extension === ".heic") {
const imagePath = await convertHeicToJpeg({
localPath,
tempDir: sourceDirectory,
});
imagePaths = [imagePath];
} else {
let pdfPath: string;
const isCFBFile = await checkIsCFBFile(localPath);
const isPdf = await checkIsPdfFile(localPath);
if ((extension === ".pdf" || isPdf) && !isCFBFile) {
pdfPath = localPath;
} else {
// Convert file to PDF if necessary
pdfPath = await convertFileToPdf({
extension,
localPath,
tempDir: sourceDirectory,
});
}
if (pagesToConvertAsImages !== -1) {
const totalPages = await getNumberOfPagesFromPdf({ pdfPath });
pagesToConvertAsImages = Array.isArray(pagesToConvertAsImages)
? pagesToConvertAsImages
: [pagesToConvertAsImages];
pagesToConvertAsImages = pagesToConvertAsImages.filter(
(page) => page > 0 && page <= totalPages
);
}
imagePaths = await convertPdfToImages({
imageDensity,
imageHeight,
pagesToConvertAsImages,
pdfPath,
tempDir: sourceDirectory,
});
}
// Compress images if maxImageSize is specified
if (maxImageSize && maxImageSize > 0) {
const compressPromises = imagePaths.map(async (imagePath: string) => {
const imageBuffer = await fs.readFile(imagePath);
const compressedBuffer = await compressImage(
imageBuffer,
maxImageSize
);
const originalName = path.basename(
imagePath,
path.extname(imagePath)
);
const compressedPath = path.join(
sourceDirectory,
`${originalName}_compressed.png`
);
await fs.writeFile(compressedPath, compressedBuffer);
return compressedPath;
});
imagePaths = await Promise.all(compressPromises);
}
if (correctOrientation) {
await prepareWorkersForImageProcessing({
maxTesseractWorkers,
numImages: imagePaths.length,
scheduler,
});
}
// Start processing OCR using LLM
const modelInstance = createModel({
credentials,
llmParams,
model,
provider: modelProvider,
});
if (!extractOnly) {
const processOCR = async (
imagePath: string,
pageIndex: number,
maintainFormat: boolean
): Promise<Page> => {
let pageNumber: number;
// If we convert all pages, just use the array index
if (pagesToConvertAsImages === -1) {
pageNumber = pageIndex + 1;
}
// Else if we convert specific pages, use the page number from the parameter
else if (Array.isArray(pagesToConvertAsImages)) {
pageNumber = pagesToConvertAsImages[pageIndex];
}
// Else, the parameter is a number and use it for the page number
else {
pageNumber = pagesToConvertAsImages;
}
const imageBuffer = await fs.readFile(imagePath);
const buffers = await cleanupImage({
correctOrientation,
imageBuffer,
scheduler,
trimEdges,
});
let page: Page;
try {
let rawResponse: CompletionResponse | ExtractionResponse;
if (customModelFunction) {
rawResponse = await runRetries(
() =>
customModelFunction({
buffers,
image: imagePath,
maintainFormat,
pageNumber,
priorPage,
}),
maxRetries,
pageNumber
);
} else {
rawResponse = await runRetries(
() =>
modelInstance.getCompletion(OperationMode.OCR, {
buffers,
maintainFormat,
priorPage,
prompt,
}),
maxRetries,
pageNumber
);
}
if (rawResponse.logprobs) {
ocrLogprobs.push({
page: pageNumber,
value: rawResponse.logprobs,
});
}
const response = CompletionProcessor.process(
OperationMode.OCR,
rawResponse
);
inputTokenCount += response.inputTokens;
outputTokenCount += response.outputTokens;
if (isCompletionResponse(OperationMode.OCR, response)) {
priorPage = response.content;
}
page = {
...response,
page: pageNumber,
status: PageStatus.SUCCESS,
};
numSuccessfulOCRRequests++;
} catch (error) {
console.error(`Failed to process image ${imagePath}:`, error);
if (errorMode === ErrorMode.THROW) {
throw error;
}
page = {
content: "",
contentLength: 0,
error: `Failed to process page ${pageNumber}: ${error}`,
page: pageNumber,
status: PageStatus.ERROR,
};
numFailedOCRRequests++;
}
return page;
};
if (maintainFormat) {
// Use synchronous processing
for (let i = 0; i < imagePaths.length; i++) {
const page = await processOCR(imagePaths[i], i, true);
pages.push(page);
if (page.status === PageStatus.ERROR) {
break;
}
}
} else {
const limit = pLimit(concurrency);
await Promise.all(
imagePaths.map((imagePath, i) =>
limit(() =>
processOCR(imagePath, i, false).then((page) => {
pages[i] = page;
})
)
)
);
}
}
}
// Start processing extraction using LLM
let numSuccessfulExtractionRequests: number = 0;
let numFailedExtractionRequests: number = 0;
if (schema) {
const extractionModelInstance = createModel({
credentials: extractionCredentials,
llmParams: extractionLlmParams,
model: extractionModel,
provider: extractionModelProvider,
});
const { fullDocSchema, perPageSchema } = splitSchema(
schema,
extractPerPage
);
const extractionTasks: Promise<any>[] = [];
const processExtraction = async (
input: string | string[] | HybridInput,
pageNumber: number,
schema: Record<string, unknown>
): Promise<Record<string, unknown>> => {
let result: Record<string, unknown> = {};
try {
await runRetries(
async () => {
const rawResponse = await extractionModelInstance.getCompletion(
OperationMode.EXTRACTION,
{
input,
options: { correctOrientation, scheduler, trimEdges },
prompt: extractionPrompt,
schema,
}
);
if (rawResponse.logprobs) {
extractedLogprobs.push({
page: pageNumber,
value: rawResponse.logprobs,
});
}
const response = CompletionProcessor.process(
OperationMode.EXTRACTION,
rawResponse
);
inputTokenCount += response.inputTokens;
outputTokenCount += response.outputTokens;
numSuccessfulExtractionRequests++;
for (const key of Object.keys(schema?.properties ?? {})) {
const value = response.extracted[key];
if (value !== null && value !== undefined) {
if (!Array.isArray(result[key])) {
result[key] = [];
}
(result[key] as any[]).push({ page: pageNumber, value });
}
}
},
maxRetries,
pageNumber
);
} catch (error) {
numFailedExtractionRequests++;
throw error;
}
return result;
};
if (perPageSchema) {
const inputs =
directImageExtraction && !isStructuredDataFile(localPath)
? imagePaths.map((imagePath) => [imagePath])
: enableHybridExtraction
? imagePaths.map((imagePath, index) => ({
imagePaths: [imagePath],
text: pages[index].content || "",
}))
: pages.map((page) => page.content || "");
extractionTasks.push(
...inputs.map((input, i) =>
processExtraction(input, i + 1, perPageSchema)
)
);
}
if (fullDocSchema) {
const input =
directImageExtraction && !isStructuredDataFile(localPath)
? imagePaths
: enableHybridExtraction
? {
imagePaths,
text: pages
.map((page, i) =>
i === 0 ? page.content : "\n<hr><hr>\n" + page.content
)
.join(""),
}
: pages
.map((page, i) =>
i === 0 ? page.content : "\n<hr><hr>\n" + page.content
)
.join("");
extractionTasks.push(
(async () => {
let result: Record<string, unknown> = {};
try {
await runRetries(
async () => {
const rawResponse =
await extractionModelInstance.getCompletion(
OperationMode.EXTRACTION,
{
input,
options: { correctOrientation, scheduler, trimEdges },
prompt: extractionPrompt,
schema: fullDocSchema,
}
);
if (rawResponse.logprobs) {
extractedLogprobs.push({
page: null,
value: rawResponse.logprobs,
});
}
const response = CompletionProcessor.process(
OperationMode.EXTRACTION,
rawResponse
);
inputTokenCount += response.inputTokens;
outputTokenCount += response.outputTokens;
numSuccessfulExtractionRequests++;
result = response.extracted;
},
maxRetries,
0
);
return result;
} catch (error) {
numFailedExtractionRequests++;
throw error;
}
})()
);
}
const results = await Promise.all(extractionTasks);
extracted = results.reduce((acc, result) => {
Object.entries(result || {}).forEach(([key, value]) => {
if (!acc[key]) {
acc[key] = [];
}
if (Array.isArray(value)) {
acc[key].push(...value);
} else {
acc[key] = value;
}
});
return acc;
}, {});
}
// Write the aggregated markdown to a file
const endOfPath = localPath.split("/")[localPath.split("/").length - 1];
const rawFileName = endOfPath.split(".")[0];
const fileName = rawFileName
.replace(/[^\w\s]/g, "")
.replace(/\s+/g, "_")
.toLowerCase()
.substring(0, 255); // Truncate file name to 255 characters to prevent ENAMETOOLONG errors
if (outputDir) {
const resultFilePath = path.join(outputDir, `${fileName}.md`);
const content = pages.map((page) => page.content).join("\n\n");
await fs.writeFile(resultFilePath, content);
}
// Cleanup the downloaded PDF file
if (cleanup) await fs.remove(tempDirectory);
// Format JSON response
const endTime = new Date();
const completionTime = endTime.getTime() - startTime.getTime();
return {
completionTime,
extracted,
fileName,
inputTokens: inputTokenCount,
...(ocrLogprobs.length || extractedLogprobs.length
? {
logprobs: {
ocr: !extractOnly ? ocrLogprobs : null,
extracted: schema ? extractedLogprobs : null,
},
}
: {}),
outputTokens: outputTokenCount,
pages,
summary: {
totalPages: pages.length,
ocr: !extractOnly
? {
successful: numSuccessfulOCRRequests,
failed: numFailedOCRRequests,
}
: null,
extracted: schema
? {
successful: numSuccessfulExtractionRequests,
failed: numFailedExtractionRequests,
}
: null,
},
};
} finally {
if (correctOrientation && scheduler) {
terminateScheduler(scheduler);
}
}
};
================================================
FILE: node-zerox/src/types.ts
================================================
import { ChatCompletionTokenLogprob } from "openai/resources";
import Tesseract from "tesseract.js";
export interface ZeroxArgs {
cleanup?: boolean;
concurrency?: number;
correctOrientation?: boolean;
credentials?: ModelCredentials;
customModelFunction?: (params: {
buffers: Buffer[];
image: string;
maintainFormat: boolean;
pageNumber: number;
priorPage: string;
}) => Promise<CompletionResponse>;
directImageExtraction?: boolean;
enableHybridExtraction?: boolean;
errorMode?: ErrorMode;
extractionCredentials?: ModelCredentials;
extractionLlmParams?: Partial<LLMParams>;
extractionModel?: ModelOptions | string;
extractionModelProvider?: ModelProvider | string;
extractionPrompt?: string;
extractOnly?: boolean;
extractPerPage?: string[];
filePath: string;
imageDensity?: number;
imageHeight?: number;
llmParams?: Partial<LLMParams>;
maintainFormat?: boolean;
maxImageSize?: number;
maxRetries?: number;
maxTesseractWorkers?: number;
model?: ModelOptions | string;
modelProvider?: ModelProvider | string;
openaiAPIKey?: string;
outputDir?: string;
pagesToConvertAsImages?: number | number[];
prompt?: string;
schema?: Record<string, unknown>;
tempDir?: string;
trimEdges?: boolean;
}
export interface ZeroxOutput {
completionTime: number;
extracted: Record<string, unknown> | null;
fileName: string;
inputTokens: number;
logprobs?: Logprobs;
outputTokens: number;
pages: Page[];
summary: Summary;
}
export interface AzureCredentials {
apiKey: string;
endpoint: string;
}
export interface BedrockCredentials {
accessKeyId?: string;
region: string;
secretAccessKey?: string;
sessionToken?: string;
}
export interface GoogleCredentials {
apiKey: string;
}
export interface OpenAICredentials {
apiKey: string;
}
export type ModelCredentials =
| AzureCredentials
| BedrockCredentials
| GoogleCredentials
| OpenAICredentials;
export enum ModelOptions {
// Bedrock Claude 3 Models
BEDROCK_CLAUDE_3_HAIKU_2024_10 = "anthropic.claude-3-5-haiku-20241022-v1:0",
BEDROCK_CLAUDE_3_SONNET_2024_06 = "anthropic.claude-3-5-sonnet-20240620-v1:0",
BEDROCK_CLAUDE_3_SONNET_2024_10 = "anthropic.claude-3-5-sonnet-20241022-v2:0",
BEDROCK_CLAUDE_3_HAIKU_2024_03 = "anthropic.claude-3-haiku-20240307-v1:0",
BEDROCK_CLAUDE_3_OPUS_2024_02 = "anthropic.claude-3-opus-20240229-v1:0",
BEDROCK_CLAUDE_3_SONNET_2024_02 = "anthropic.claude-3-sonnet-20240229-v1:0",
// OpenAI GPT-4 Models
OPENAI_GPT_4_1 = "gpt-4.1",
OPENAI_GPT_4_1_MINI = "gpt-4.1-mini",
OPENAI_GPT_4O = "gpt-4o",
OPENAI_GPT_4O_MINI = "gpt-4o-mini",
// Google Gemini Models
GOOGLE_GEMINI_1_5_FLASH = "gemini-1.5-flash",
GOOGLE_GEMINI_1_5_FLASH_8B = "gemini-1.5-flash-8b",
GOOGLE_GEMINI_1_5_PRO = "gemini-1.5-pro",
GOOGLE_GEMINI_2_5_PRO = "gemini-2.5-pro-preview-03-25",
GOOGLE_GEMINI_2_FLASH = "gemini-2.0-flash-001",
GOOGLE_GEMINI_2_FLASH_LITE = "gemini-2.0-flash-lite-preview-02-05",
}
export enum ModelProvider {
AZURE = "AZURE",
BEDROCK = "BEDROCK",
GOOGLE = "GOOGLE",
OPENAI = "OPENAI",
}
export enum OperationMode {
EXTRACTION = "EXTRACTION",
OCR = "OCR",
}
export enum PageStatus {
SUCCESS = "SUCCESS",
ERROR = "ERROR",
}
export interface Page {
content?: string;
contentLength?: number;
error?: string;
extracted?: Record<string, unknown>;
inputTokens?: number;
outputTokens?: number;
page: number;
status: PageStatus;
}
export interface ConvertPdfOptions {
density: number;
format: "png";
height: number;
preserveAspectRatio?: boolean;
saveFilename: string;
savePath: string;
}
export interface CompletionArgs {
buffers: Buffer[];
maintainFormat: boolean;
priorPage: string;
prompt?: string;
}
export interface CompletionResponse {
content: string;
inputTokens: number;
logprobs?: ChatCompletionTokenLogprob[] | null;
outputTokens: number;
}
export type ProcessedCompletionResponse = Omit<
CompletionResponse,
"logprobs"
> & {
contentLength: number;
};
export interface CreateModelArgs {
credentials: ModelCredentials;
llmParams: Partial<LLMParams>;
model: ModelOptions | string;
provider: ModelProvider | string;
}
export enum ErrorMode {
THROW = "THROW",
IGNORE = "IGNORE",
}
export interface ExtractionArgs {
input: string | string[] | HybridInput;
options?: {
correctOrientation?: boolean;
scheduler: Tesseract.Scheduler | null;
trimEdges?: boolean;
};
prompt?: string;
schema: Record<string, unknown>;
}
export interface ExtractionResponse {
extracted: Record<string, unknown>;
inputTokens: number;
logprobs?: ChatCompletionTokenLogprob[] | null;
outputTokens: number;
}
export type ProcessedExtractionResponse = Omit<ExtractionResponse, "logprobs">;
export interface HybridInput {
imagePaths: string[];
text: string;
}
interface BaseLLMParams {
frequencyPenalty?: number;
presencePenalty?: number;
temperature?: number;
topP?: number;
}
export interface AzureLLMParams extends BaseLLMParams {
logprobs: boolean;
maxTokens: number;
}
export interface BedrockLLMParams extends BaseLLMParams {
maxTokens: number;
}
export interface GoogleLLMParams extends BaseLLMParams {
maxOutputTokens: number;
}
export interface OpenAILLMParams extends BaseLLMParams {
logprobs: boolean;
maxTokens: number;
}
// Union type of all provider params
export type LLMParams =
| AzureLLMParams
| BedrockLLMParams
| GoogleLLMParams
| OpenAILLMParams;
export interface LogprobPage {
page: number | null;
value: ChatCompletionTokenLogprob[];
}
interface Logprobs {
ocr: LogprobPage[] | null;
extracted: LogprobPage[] | null;
}
export interface MessageContentArgs {
input: string | string[] | HybridInput;
options?: {
correctOrientation?: boolean;
scheduler: Tesseract.Scheduler | null;
trimEdges?: boolean;
};
}
export interface ModelInterface {
getCompletion(
mode: OperationMode,
params: CompletionArgs | ExtractionArgs
): Promise<CompletionResponse | ExtractionResponse>;
}
export interface Summary {
totalPages: number;
ocr: {
successful: number;
failed: number;
} | null;
extracted: {
successful: number;
failed: number;
} | null;
}
export interface ExcelSheetContent {
content: string;
contentLength: number;
sheetName: string;
}
================================================
FILE: node-zerox/src/models/azure.ts
================================================
import {
AzureCredentials,
AzureLLMParams,
CompletionArgs,
CompletionResponse,
ExtractionArgs,
ExtractionResponse,
MessageContentArgs,
ModelInterface,
OperationMode,
} from "../types";
import { AzureOpenAI } from "openai";
import {
cleanupImage,
convertKeysToCamelCase,
convertKeysToSnakeCase,
encodeImageToBase64,
} from "../utils";
import { CONSISTENCY_PROMPT, SYSTEM_PROMPT_BASE } from "../constants";
import fs from "fs-extra";
export default class AzureModel implements ModelInterface {
private client: AzureOpenAI;
private llmParams?: Partial<AzureLLMParams>;
constructor(
credentials: AzureCredentials,
model: string,
llmParams?: Partial<AzureLLMParams>
) {
this.client = new AzureOpenAI({
apiKey: credentials.apiKey,
apiVersion: "2024-10-21",
deployment: model,
endpoint: credentials.endpoint,
});
this.llmParams = llmParams;
}
async getCompletion(
mode: OperationMode,
params: CompletionArgs | ExtractionArgs
): Promise<CompletionResponse | ExtractionResponse> {
const modeHandlers = {
[OperationMode.EXTRACTION]: () =>
this.handleExtraction(params as ExtractionArgs),
[OperationMode.OCR]: () => this.handleOCR(params as CompletionArgs),
};
const handler = modeHandlers[mode];
if (!handler) {
throw new Error(`Unsupported operation mode: ${mode}`);
}
return await handler();
}
private async createMessageContent({
input,
options,
}: MessageContentArgs): Promise<any> {
const processImages = async (imagePaths: string[]) => {
const nestedImages = await Promise.all(
imagePaths.map(async (imagePath) => {
const imageBuffer = await fs.readFile(imagePath);
const buffers = await cleanupImage({
correctOrientation: options?.correctOrientation ?? false,
imageBuffer,
scheduler: options?.scheduler ?? null,
trimEdges: options?.trimEdges ?? false,
});
return buffers.map((buffer) => ({
image_url: {
url: `data:image/png;base64,${encodeImageToBase64(buffer)}`,
},
type: "image_url",
}));
})
);
return nestedImages.flat();
};
if (Array.isArray(input)) {
return processImages(input);
}
if (typeof input === "string") {
return [{ text: input, type: "text" }];
}
const { imagePaths, text } = input;
const images = await processImages(imagePaths);
return [...images, { text, type: "text" }];
}
private async handleOCR({
buffers,
maintainFormat,
priorPage,
prompt,
}: CompletionArgs): Promise<CompletionResponse> {
const systemPrompt = prompt || SYSTEM_PROMPT_BASE;
// Default system message
const messages: any = [{ role: "system", content: systemPrompt }];
// If content has already been generated, add it to context.
// This helps maintain the same format across pages
if (maintainFormat && priorPage && priorPage.length) {
messages.push({
role: "system",
content: CONSISTENCY_PROMPT(priorPage),
});
}
// Add image to request
const imageContents = buffers.map((buffer) => ({
type: "image_url",
image_url: {
url: `data:image/png;base64,${encodeImageToBase64(buffer)}`,
},
}));
messages.push({ role: "user", content: imageContents });
try {
const response = await this.client.chat.completions.create({
messages,
model: "",
...convertKeysToSnakeCase(this.llmParams ?? null),
});
const result: CompletionResponse = {
content: response.choices[0].message.content || "",
inputTokens: response.usage?.prompt_tokens || 0,
outputTokens: response.usage?.completion_tokens || 0,
};
if (this.llmParams?.logprobs) {
result["logprobs"] = convertKeysToCamelCase(
response.choices[0].logprobs
)?.content;
}
return result;
} catch (err) {
console.error("Error in Azure completion", err);
throw err;
}
}
private async handleExtraction({
input,
options,
prompt,
schema,
}: ExtractionArgs): Promise<ExtractionResponse> {
try {
const messages: any = [];
if (prompt) {
messages.push({ role: "system", content: prompt });
}
messages.push({
role: "user",
content: await this.createMessageContent({ input, options }),
});
const response = await this.client.chat.completions.create({
messages,
model: "",
response_format: {
json_schema: { name: "extraction", schema },
type: "json_schema",
},
...convertKeysToSnakeCase(this.llmParams ?? null),
});
const result: ExtractionResponse = {
extracted: JSON.parse(response.choices[0].message.content || ""),
inputTokens: response.usage?.prompt_tokens || 0,
outputTokens: response.usage?.completion_tokens || 0,
};
if (this.llmParams?.logprobs) {
result["logprobs"] = convertKeysToCamelCase(
response.choices[0].logprobs
)?.content;
}
return result;
} catch (err) {
console.error("Error in Azure completion", err);
throw err;
}
}
}
================================================
FILE: node-zerox/src/models/bedrock.ts
================================================
import {
BedrockCredentials,
BedrockLLMParams,
CompletionArgs,
CompletionResponse,
ExtractionArgs,
ExtractionResponse,
MessageContentArgs,
ModelInterface,
OperationMode,
} from "../types";
import {
BedrockRuntimeClient,
InvokeModelCommand,
} from "@aws-sdk/client-bedrock-runtime";
import {
cleanupImage,
convertKeysToSnakeCase,
encodeImageToBase64,
} from "../utils";
import { CONSISTENCY_PROMPT, SYSTEM_PROMPT_BASE } from "../constants";
import fs from "fs-extra";
// Currently only supports Anthropic models
export default class BedrockModel implements ModelInterface {
private client: BedrockRuntimeClient;
private model: string;
private llmParams?: Partial<BedrockLLMParams>;
constructor(
credentials: BedrockCredentials,
model: string,
llmParams?: Partial<BedrockLLMParams>
) {
this.client = new BedrockRuntimeClient({
region: credentials.region,
credentials: credentials.accessKeyId
? {
accessKeyId: credentials.accessKeyId,
secretAccessKey: credentials.secretAccessKey!,
sessionToken: credentials.sessionToken,
}
: undefined,
});
this.model = model;
this.llmParams = llmParams;
}
async getCompletion(
mode: OperationMode,
params: CompletionArgs | ExtractionArgs
): Promise<CompletionResponse | ExtractionResponse> {
const modeHandlers = {
[OperationMode.EXTRACTION]: () =>
this.handleExtraction(params as ExtractionArgs),
[OperationMode.OCR]: () => this.handleOCR(params as CompletionArgs),
};
const handler = modeHandlers[mode];
if (!handler) {
throw new Error(`Unsupported operation mode: ${mode}`);
}
return await handler();
}
private async createMessageContent({
input,
options,
}: MessageContentArgs): Promise<any> {
const processImages = async (imagePaths: string[]) => {
const nestedImages = await Promise.all(
imagePaths.map(async (imagePath) => {
const imageBuffer = await fs.readFile(imagePath);
const buffers = await cleanupImage({
correctOrientation: options?.correctOrientation ?? false,
imageBuffer,
scheduler: options?.scheduler ?? null,
trimEdges: options?.trimEdges ?? false,
});
return buffers.map((buffer) => ({
source: {
data: encodeImageToBase64(buffer),
media_type: "image/png",
type: "base64",
},
type: "image",
}));
})
);
return nestedImages.flat();
};
if (Array.isArray(input)) {
return processImages(input);
}
if (typeof input === "string") {
return [{ text: input, type: "text" }];
}
const { imagePaths, text } = input;
const images = await processImages(imagePaths);
return [...images, { text, type: "text" }];
}
private async handleOCR({
buffers,
maintainFormat,
priorPage,
prompt,
}: CompletionArgs): Promise<CompletionResponse> {
let systemPrompt = prompt || SYSTEM_PROMPT_BASE;
// Default system message
const messages: any = [];
// If content has already been generated, add it to context.
// This helps maintain the same format across pages
if (maintainFormat && priorPage && priorPage.length) {
systemPrompt += `\n\n${CONSISTENCY_PROMPT(priorPage)}`;
}
// Add image to request
const imageContents = buffers.map((buffer) => ({
source: {
data: encodeImageToBase64(buffer),
media_type: "image/png",
type: "base64",
},
type: "image",
}));
messages.push({ role: "user", content: imageContents });
try {
const body = {
anthropic_version: "bedrock-2023-05-31",
max_tokens: this.llmParams?.maxTokens || 4096,
messages,
system: systemPrompt,
...convertKeysToSnakeCase(this.llmParams ?? {}),
};
const command = new InvokeModelCommand({
accept: "application/json",
body: JSON.stringify(body),
contentType: "application/json",
modelId: this.model,
});
const response = await this.client.send(command);
const parsedResponse = JSON.parse(
new TextDecoder().decode(response.body)
);
return {
content: parsedResponse.content[0].text,
inputTokens: parsedResponse.usage?.input_tokens || 0,
outputTokens: parsedResponse.usage?.output_tokens || 0,
};
} catch (err) {
console.error("Error in Bedrock completion", err);
throw err;
}
}
private async handleExtraction({
input,
options,
prompt,
schema,
}: ExtractionArgs): Promise<ExtractionResponse> {
try {
const messages = [
{
role: "user",
content: await this.createMessageContent({ input, options }),
},
];
const tools = [
{
input_schema: schema,
name: "json",
},
];
const body = {
anthropic_version: "bedrock-2023-05-31",
max_tokens: this.llmParams?.maxTokens || 4096,
messages,
system: prompt,
tool_choice: { name: "json", type: "tool" },
tools,
...convertKeysToSnakeCase(this.llmParams ?? {}),
};
const command = new InvokeModelCommand({
accept: "application/json",
body: JSON.stringify(body),
contentType: "application/json",
modelId: this.model,
});
const response = await this.client.send(command);
const parsedResponse = JSON.parse(
new TextDecoder().decode(response.body)
);
return {
extracted: parsedResponse.content[0].input,
inputTokens: parsedResponse.usage?.input_tokens || 0,
outputTokens: parsedResponse.usage?.output_tokens || 0,
};
} catch (err) {
console.error("Error in Bedrock completion", err);
throw err;
}
}
}
================================================
FILE: node-zerox/src/models/google.ts
================================================
import {
cleanupImage,
convertKeysToSnakeCase,
encodeImageToBase64,
} from "../utils";
import {
CompletionArgs,
CompletionResponse,
ExtractionArgs,
ExtractionResponse,
GoogleCredentials,
GoogleLLMParams,
MessageContentArgs,
ModelInterface,
OperationMode,
} from "../types";
import { CONSISTENCY_PROMPT, SYSTEM_PROMPT_BASE } from "../constants";
import { GoogleGenAI, createPartFromBase64 } from "@google/genai";
import fs from "fs-extra";
export default class GoogleModel implements ModelInterface {
private client: GoogleGenAI;
private model: string;
private llmParams?: Partial<GoogleLLMParams>;
constructor(
credentials: GoogleCredentials,
model: string,
llmParams?: Partial<GoogleLLMParams>
) {
this.client = new GoogleGenAI({ apiKey: credentials.apiKey });
this.model = model;
this.llmParams = llmParams;
}
async getCompletion(
mode: OperationMode,
params: CompletionArgs | ExtractionArgs
): Promise<CompletionResponse | ExtractionResponse> {
const modeHandlers = {
[OperationMode.EXTRACTION]: () =>
this.handleExtraction(params as ExtractionArgs),
[OperationMode.OCR]: () => this.handleOCR(params as CompletionArgs),
};
const handler = modeHandlers[mode];
if (!handler) {
throw new Error(`Unsupported operation mode: ${mode}`);
}
return await handler();
}
private async createMessageContent({
input,
options,
}: MessageContentArgs): Promise<any> {
const processImages = async (imagePaths: string[]) => {
const nestedImages = await Promise.all(
imagePaths.map(async (imagePath) => {
const imageBuffer = await fs.readFile(imagePath);
const buffers = await cleanupImage({
correctOrientation: options?.correctOrientation ?? false,
imageBuffer,
scheduler: options?.scheduler ?? null,
trimEdges: options?.trimEdges ?? false,
});
return buffers.map((buffer) =>
createPartFromBase64(encodeImageToBase64(buffer), "image/png")
);
})
);
return nestedImages.flat();
};
if (Array.isArray(input)) {
return processImages(input);
}
if (typeof input === "string") {
return [{ text: input }];
}
const { imagePaths, text } = input;
const images = await processImages(imagePaths);
return [...images, { text }];
}
private async handleOCR({
buffers,
maintainFormat,
priorPage,
prompt,
}: CompletionArgs): Promise<CompletionResponse> {
// Insert the text prompt after the image contents array
// https://ai.google.dev/gemini-api/docs/image-understanding?lang=node#technical-details-image
// Build the prompt parts
const promptParts: any = [];
// Add image contents
const imageContents = buffers.map((buffer) =>
createPartFromBase64(encodeImageToBase64(buffer), "image/png")
);
promptParts.push(...imageContents);
// Add system prompt
promptParts.push({ text: prompt || SYSTEM_PROMPT_BASE });
// If content has already been generated, add it to context
if (maintainFormat && priorPage && priorPage.length) {
promptParts.push({ text: CONSISTENCY_PROMPT(priorPage) });
}
try {
const response = await this.client.models.generateContent({
config: convertKeysToSnakeCase(this.llmParams ?? null),
contents: promptParts,
model: this.model,
});
return {
content: response.text || "",
inputTokens: response.usageMetadata?.promptTokenCount || 0,
outputTokens: response.usageMetadata?.candidatesTokenCount || 0,
};
} catch (err) {
console.error("Error in Google completion", err);
throw err;
}
}
private async handleExtraction({
input,
options,
prompt,
schema,
}: ExtractionArgs): Promise<ExtractionResponse> {
// Build the prompt parts
const promptParts: any = [];
const parts = await this.createMessageContent({ input, options });
promptParts.push(...parts);
// Add system prompt
promptParts.push({ text: prompt || "Extract schema data" });
try {
const response = await this.client.models.generateContent({
config: {
...convertKeysToSnakeCase(this.llmParams ?? null),
responseMimeType: "application/json",
responseSchema: schema,
},
contents: promptParts,
model: this.model,
});
return {
extracted: response.text ? JSON.parse(response.text) : {},
inputTokens: response.usageMetadata?.promptTokenCount || 0,
outputTokens: response.usageMetadata?.candidatesTokenCount || 0,
};
} catch (err) {
console.error("Error in Google completion", err);
throw err;
}
}
}
================================================
FILE: node-zerox/src/models/index.ts
================================================
import {
AzureCredentials,
BedrockCredentials,
CreateModelArgs,
GoogleCredentials,
ModelInterface,
ModelProvider,
OpenAICredentials,
} from "../types";
import { validateLLMParams } from "../utils/model";
import AzureModel from "./azure";
import BedrockModel from "./bedrock";
import GoogleModel from "./google";
import OpenAIModel from "./openAI";
// Type guard for Azure credentials
const isAzureCredentials = (
credentials: any
): credentials is AzureCredentials => {
return (
credentials &&
typeof credentials.endpoint === "string" &&
typeof credentials.apiKey === "string"
);
};
// Type guard for Bedrock credentials
const isBedrockCredentials = (
credentials: any
): credentials is BedrockCredentials => {
return credentials && typeof credentials.region === "string";
};
// Type guard for Google credentials
const isGoogleCredentials = (
credentials: any
): credentials is GoogleCredentials => {
return credentials && typeof credentials.apiKey === "string";
};
// Type guard for OpenAI credentials
const isOpenAICredentials = (
credentials: any
): credentials is OpenAICredentials => {
return credentials && typeof credentials.apiKey === "string";
};
export const createModel = ({
credentials,
llmParams,
model,
provider,
}: CreateModelArgs): ModelInterface => {
const validatedParams = validateLLMParams(llmParams, provider);
switch (provider) {
case ModelProvider.AZURE:
if (!isAzureCredentials(credentials)) {
throw new Error("Invalid credentials for Azure provider");
}
return new AzureModel(credentials, model, validatedParams);
case ModelProvider.BEDROCK:
if (!isBedrockCredentials(credentials)) {
throw new Error("Invalid credentials for Bedrock provider");
}
return new BedrockModel(credentials, model, validatedParams);
case ModelProvider.GOOGLE:
if (!isGoogleCredentials(credentials)) {
throw new Error("Invalid credentials for Google provider");
}
return new GoogleModel(credentials, model, validatedParams);
case ModelProvider.OPENAI:
if (!isOpenAICredentials(credentials)) {
throw new Error("Invalid credentials for OpenAI provider");
}
return new OpenAIModel(credentials, model, validatedParams);
default:
throw new Error(`Unsupported model provider: ${provider}`);
}
};
================================================
FILE: node-zerox/src/models/openAI.ts
================================================
import {
CompletionArgs,
CompletionResponse,
ExtractionArgs,
ExtractionResponse,
MessageContentArgs,
ModelInterface,
OpenAICredentials,
OpenAILLMParams,
OperationMode,
} from "../types";
import {
cleanupImage,
convertKeysToCamelCase,
convertKeysToSnakeCase,
encodeImageToBase64,
} from "../utils";
import { CONSISTENCY_PROMPT, SYSTEM_PROMPT_BASE } from "../constants";
import axios from "axios";
import fs from "fs-extra";
export default class OpenAIModel implements ModelInterface {
private apiKey: string;
private model: string;
private llmParams?: Partial<OpenAILLMParams>;
constructor(
credentials: OpenAICredentials,
model: string,
llmParams?: Partial<OpenAILLMParams>
) {
this.apiKey = credentials.apiKey;
this.model = model;
this.llmParams = llmParams;
}
async getCompletion(
mode: OperationMode,
params: CompletionArgs | ExtractionArgs
): Promise<CompletionResponse | ExtractionResponse> {
const modeHandlers = {
[OperationMode.EXTRACTION]: () =>
this.handleExtraction(params as ExtractionArgs),
[OperationMode.OCR]: () => this.handleOCR(params as CompletionArgs),
};
const handler = modeHandlers[mode];
if (!handler) {
throw new Error(`Unsupported operation mode: ${mode}`);
}
return await handler();
}
private async createMessageContent({
input,
options,
}: MessageContentArgs): Promise<any> {
const processImages = async (imagePaths: string[]) => {
const nestedImages = await Promise.all(
imagePaths.map(async (imagePath) => {
const imageBuffer = await fs.readFile(imagePath);
const buffers = await cleanupImage({
correctOrientation: options?.correctOrientation ?? false,
imageBuffer,
scheduler: options?.scheduler ?? null,
trimEdges: options?.trimEdges ?? false,
});
return buffers.map((buffer) => ({
image_url: {
url: `data:image/png;base64,${encodeImageToBase64(buffer)}`,
},
type: "image_url",
}));
})
);
return nestedImages.flat();
};
if (Array.isArray(input)) {
return processImages(input);
}
if (typeof input === "string") {
return [{ text: input, type: "text" }];
}
const { imagePaths, text } = input;
const images = await processImages(imagePaths);
return [...images, { text, type: "text" }];
}
private async handleOCR({
buffers,
maintainFormat,
priorPage,
prompt,
}: CompletionArgs): Promise<CompletionResponse> {
const systemPrompt = prompt || SYSTEM_PROMPT_BASE;
// Default system message
const messages: any = [{ role: "system", content: systemPrompt }];
// If content has already been generated, add it to context.
// This helps maintain the same format across pages
if (maintainFormat && priorPage && priorPage.length) {
messages.push({
role: "system",
content: CONSISTENCY_PROMPT(priorPage),
});
}
// Add image to request
const imageContents = buffers.map((buffer) => ({
type: "image_url",
image_url: {
url: `data:image/png;base64,${encodeImageToBase64(buffer)}`,
},
}));
messages.push({ role: "user", content: imageContents });
try {
const response = await axios.post(
"https://api.openai.com/v1/chat/completions",
{
messages,
model: this.model,
...convertKeysToSnakeCase(this.llmParams ?? null),
},
{
headers: {
Authorization: `Bearer ${this.apiKey}`,
"Content-Type": "application/json",
},
}
);
const data = response.data;
const result: CompletionResponse = {
content: data.choices[0].message.content,
inputTokens: data.usage.prompt_tokens,
outputTokens: data.usage.completion_tokens,
};
if (this.llmParams?.logprobs) {
result["logprobs"] = convertKeysToCamelCase(
data.choices[0].logprobs
)?.content;
}
return result;
} catch (err) {
console.error("Error in OpenAI completion", err);
throw err;
}
}
private async handleExtraction({
input,
options,
prompt,
schema,
}: ExtractionArgs): Promise<ExtractionResponse> {
try {
const messages: any = [];
if (prompt) {
messages.push({ role: "system", content: prompt });
}
messages.push({
role: "user",
content: await this.createMessageContent({ input, options }),
});
const response = await axios.post(
"https://api.openai.com/v1/chat/completions",
{
messages,
model: this.model,
response_format: {
json_schema: { name: "extraction", schema },
type: "json_schema",
},
...convertKeysToSnakeCase(this.llmParams ?? null),
},
{
headers: {
Authorization: `Bearer ${this.apiKey}`,
"Content-Type": "application/json",
},
}
);
const data = response.data;
const result: ExtractionResponse = {
extracted: data.choices[0].message.content,
inputTokens: data.usage.prompt_tokens,
outputTokens: data.usage.completion_tokens,
};
if (this.llmParams?.logprobs) {
result["logprobs"] = convertKeysToCamelCase(
data.choices[0].logprobs
)?.content;
}
return result;
} catch (err) {
console.error("Error in OpenAI completion", err);
throw err;
}
}
}
================================================
FILE: node-zerox/src/utils/common.ts
================================================
export const camelToSnakeCase = (str: string) =>
str.replace(/[A-Z]/g, (letter: string) => `_${letter.toLowerCase()}`);
export const convertKeysToCamelCase = (
obj: Record<string, any> | null
): Record<string, any> => {
if (typeof obj !== "object" || obj === null) {
return obj ?? {};
}
if (Array.isArray(obj)) {
return obj.map(convertKeysToCamelCase);
}
return Object.fromEntries(
Object.entries(obj).map(([key, value]) => [
snakeToCamelCase(key),
convertKeysToCamelCase(value),
])
);
};
export const convertKeysToSnakeCase = (
obj: Record<string, any> | null
): Record<string, any> => {
if (typeof obj !== "object" || obj === null) {
return obj ?? {};
}
return Object.fromEntries(
Object.entries(obj).map(([key, value]) => [camelToSnakeCase(key), value])
);
};
export const isString = (value: string | null): value is string => {
return value !== null;
};
export const isValidUrl = (string: string): boolean => {
let url;
try {
url = new URL(string);
} catch (_) {
return false;
}
return url.protocol === "http:" || url.protocol === "https:";
};
// Strip out the ```markdown wrapper
export const formatMarkdown = (text: string): string => {
return (
text
// First preserve all language code blocks except html and markdown
.replace(/```(?!html|markdown)(\w+)([\s\S]*?)```/g, "§§§$1$2§§§")
// Then remove html and markdown code markers
.replace(/```(?:html|markdown)|````(?:html|markdown)|```/g, "")
// Finally restore all preserved language blocks
.replace(/§§§(\w+)([\s\S]*?)§§§/g, "```$1$2```")
);
};
export const runRetries = async <T>(
operation: () => Promise<T>,
maxRetries: number,
pageNumber: number
): Promise<T> => {
let retryCount = 0;
while (retryCount <= maxRetries) {
try {
return await operation();
} catch (error) {
if (retryCount === maxRetries) {
throw error;
}
console.log(`Retrying page ${pageNumber}...`);
retryCount++;
}
}
throw new Error("Unexpected retry error");
};
export const snakeToCamelCase = (str: string): string =>
str.replace(/_([a-z])/g, (_, letter: string) => letter.toUpperCase());
export const splitSchema = (
schema: Record<string, unknown>,
extractPerPage?: string[]
): {
fullDocSchema: Record<string, unknown> | null;
perPageSchema: Record<string, unknown> | null;
} => {
if (!extractPerPage?.length) {
return { fullDocSchema: schema, perPageSchema: null };
}
const fullDocSchema: Record<string, unknown> = {};
const perPageSchema: Record<string, unknown> = {};
for (const [key, value] of Object.entries(schema.properties || {})) {
(extractPerPage.includes(key) ? perPageSchema : fullDocSchema)[key] = value;
}
const requiredKeys = Array.isArray(schema.required) ? schema.required : [];
return {
fullDocSchema: Object.keys(fullDocSchema).length
? {
type: schema.type,
properties: fullDocSchema,
required: requiredKeys.filter((key) => !extractPerPage.includes(key)),
}
: null,
perPageSchema: Object.keys(perPageSchema).length
? {
type: schema.type,
properties: perPageSchema,
required: requiredKeys.filter((key) => extractPerPage.includes(key)),
}
: null,
};
};
================================================
FILE: node-zerox/src/utils/file.ts
================================================
import { convert } from "libreoffice-convert";
import { exec } from "child_process";
import { fromPath } from "pdf2pic";
import { pipeline } from "stream/promises";
import { promisify } from "util";
import { v4 as uuidv4 } from "uuid";
import { WriteImageResponse } from "pdf2pic/dist/types/convertResponse";
import axios from "axios";
import fileType from "file-type";
import fs from "fs-extra";
import heicConvert from "heic-convert";
import mime from "mime-types";
import path from "path";
import pdf from "pdf-parse";
import util from "util";
import xlsx from "xlsx";
import { ASPECT_RATIO_THRESHOLD } from "../constants";
import {
ConvertPdfOptions,
ExcelSheetContent,
Page,
PageStatus,
} from "../types";
import { isValidUrl } from "./common";
const convertAsync = promisify(convert);
const execAsync = util.promisify(exec);
// Save file to local tmp directory
export const downloadFile = async ({
filePath,
tempDir,
}: {
filePath: string;
tempDir: string;
}): Promise<{ extension: string; localPath: string }> => {
const fileNameExt = path.extname(filePath.split("?")[0]);
const localPath = path.join(tempDir, uuidv4() + fileNameExt);
let mimetype;
// Check if filePath is a URL
if (isValidUrl(filePath)) {
const writer = fs.createWriteStream(localPath);
const response = await axios({
url: filePath,
method: "GET",
responseType: "stream",
});
if (response.status !== 200) {
throw new Error(`HTTP error! Status: ${response.status}`);
}
mimetype = response.headers?.["content-type"];
await pipeline(response.data, writer);
} else {
// If filePath is a local file, copy it to the temp directory
await fs.copyFile(filePath, localPath);
}
if (!mimetype) {
mimetype = mime.lookup(localPath);
}
let extension = mime.extension(mimetype);
if (!extension) {
extension = fileNameExt || "";
}
if (!extension) {
if (mimetype === "binary/octet-stream") {
extension = ".bin";
} else {
throw new Error("File extension missing");
}
}
if (!extension.startsWith(".")) {
extension = `.${extension}`;
}
return { extension, localPath };
};
// Check if file is a Compound File Binary (legacy Office format)
export const checkIsCFBFile = async (filePath: string): Promise<boolean> => {
const type = await fileType.fromFile(filePath);
return type?.mime === "application/x-cfb";
};
// Check if file is a PDF by inspecting its magic number ("%PDF" at the beginning)
export const checkIsPdfFile = async (filePath: string): Promise<boolean> => {
const buffer = await fs.readFile(filePath);
return buffer.subarray(0, 4).toString() === "%PDF";
};
// Convert HEIC file to JPEG
export const convertHeicToJpeg = async ({
localPath,
tempDir,
}: {
localPath: string;
tempDir: string;
}): Promise<string> => {
try {
const inputBuffer = await fs.readFile(localPath);
const outputBuffer = await heicConvert({
buffer: inputBuffer,
format: "JPEG",
quality: 1,
});
const jpegPath = path.join(
tempDir,
`${path.basename(localPath, ".heic")}.jpg`
);
await fs.writeFile(jpegPath, Buffer.from(outputBuffer));
return jpegPath;
} catch (err) {
console.error(`Error converting .heic to .jpeg:`, err);
throw err;
}
};
// Convert each page (from other formats like docx) to a png and save that image to tmp
export const convertFileToPdf = async ({
extension,
localPath,
tempDir,
}: {
extension: string;
localPath: string;
tempDir: string;
}): Promise<string> => {
const inputBuffer = await fs.readFile(localPath);
const outputFilename = path.basename(localPath, extension) + ".pdf";
const outputPath = path.join(tempDir, outputFilename);
try {
const pdfBuffer = await convertAsync(inputBuffer, ".pdf", undefined);
await fs.writeFile(outputPath, pdfBuffer);
return outputPath;
} catch (err) {
console.error(`Error converting ${extension} to .pdf:`, err);
throw err;
}
};
// Convert each page to a png and save that image to tempDir
export const convertPdfToImages = async ({
imageDensity = 300,
imageHeight = 2048,
pagesToConvertAsImages,
pdfPath,
tempDir,
}: {
imageDensity?: number;
imageHeight?: number;
pagesToConvertAsImages: number | number[];
pdfPath: string;
tempDir: string;
}): Promise<string[]> => {
const aspectRatio = (await getPdfAspectRatio(pdfPath)) || 1;
const shouldAdjustHeight = aspectRatio > ASPECT_RATIO_THRESHOLD;
const adjustedHeight = shouldAdjustHeight
? Math.max(imageHeight, Math.round(aspectRatio * imageHeight))
: imageHeight;
const options: ConvertPdfOptions = {
density: imageDensity,
format: "png",
height: adjustedHeight,
preserveAspectRatio: true,
saveFilename: path.basename(pdfPath, path.extname(pdfPath)),
savePath: tempDir,
};
try {
try {
const storeAsImage = fromPath(pdfPath, options);
const convertResults: WriteImageResponse[] = await storeAsImage.bulk(
pagesToConvertAsImages
);
// Validate that all pages were converted
return convertResults.map((result) => {
if (!result.page || !result.path) {
throw new Error("Could not identify page data");
}
return result.path;
});
} catch (err) {
return await convertPdfWithPoppler(
pagesToConvertAsImages,
pdfPath,
options
);
}
} catch (err) {
console.error("Error during PDF conversion:", err);
throw err;
}
};
// Converts an Excel file to HTML format
export const convertExcelToHtml = async (
filePath: string
): Promise<ExcelSheetContent[]> => {
const tableClass = "zerox-excel-table";
try {
if (!(await fs.pathExists(filePath))) {
throw new Error(`Excel file not found: ${filePath}`);
}
const workbook = xlsx.readFile(filePath, {
type: "file",
cellStyles: true,
cellHTML: true,
});
if (!workbook || !workbook.SheetNames || workbook.SheetNames.length === 0) {
throw new Error("Invalid Excel file or no sheets found");
}
const sheets: ExcelSheetContent[] = [];
for (const sheetName of workbook.SheetNames) {
const worksheet = workbook.Sheets[sheetName];
const jsonData = xlsx.utils.sheet_to_json<any[]>(worksheet, {
header: 1,
});
let sheetContent = "";
sheetContent += `<h2>Sheet: ${sheetName}</h2>`;
sheetContent += `<table class="${tableClass}">`;
if (jsonData.length > 0) {
jsonData.forEach((row: any[], rowIndex: number) => {
sheetContent += "<tr>";
const cellTag = rowIndex === 0 ? "th" : "td";
if (row && row.length > 0) {
row.forEach((cell) => {
const cellContent =
cell !== null && cell !== undefined ? cell.toString() : "";
sheetContent += `<${cellTag}>${cellContent}</${cellTag}>`;
});
}
sheetContent += "</tr>";
});
}
sheetContent += "</table>";
sheets.push({
sheetName,
content: sheetContent,
contentLength: sheetContent.length,
});
}
return sheets;
} catch (error) {
throw error;
}
};
// Alternative PDF to PNG conversion using Poppler
const convertPdfWithPoppler = async (
pagesToConvertAsImages: number | number[],
pdfPath: string,
options: ConvertPdfOptions
): Promise<string[]> => {
const { density, format, height, saveFilename, savePath } = options;
const outputPrefix = path.join(savePath, saveFilename);
const run = async (from?: number, to?: number) => {
const pageArgs = from && to ? `-f ${from} -l ${to}` : "";
const cmd = `pdftoppm -${format} -r ${density} -scale-to-y ${height} -scale-to-x -1 ${pageArgs} "${pdfPath}" "${outputPrefix}"`;
await execAsync(cmd);
};
if (pagesToConvertAsImages === -1) {
await run();
} else if (typeof pagesToConvertAsImages === "number") {
await run(pagesToConvertAsImages, pagesToConvertAsImages);
} else if (Array.isArray(pagesToConvertAsImages)) {
await Promise.all(pagesToConvertAsImages.map((page) => run(page, page)));
}
const convertResults = await fs.readdir(savePath);
return convertResults
.filter(
(result) =>
result.startsWith(saveFilename) && result.endsWith(`.${format}`)
)
.map((result) => path.join(savePath, result));
};
// Extracts pages from a structured data file (like Excel)
export const extractPagesFromStructuredDataFile = async (
filePath: string
): Promise<Page[]> => {
if (isExcelFile(filePath)) {
const sheets = await convertExcelToHtml(filePath);
const pages: Page[] = [];
sheets.forEach((sheet: ExcelSheetContent, index: number) => {
pages.push({
content: sheet.content,
contentLength: sheet.contentLength,
page: index + 1,
status: PageStatus.SUCCESS,
});
});
return pages;
}
return [];
};
// Gets the number of pages from a PDF
export const getNumberOfPagesFromPdf = async ({
pdfPath,
}: {
pdfPath: string;
}): Promise<number> => {
const dataBuffer = await fs.readFile(pdfPath);
const data = await pdf(dataBuffer);
return data.numpages;
};
// Gets the aspect ratio (height/width) of a PDF
const getPdfAspectRatio = async (
pdfPath: string
): Promise<number | undefined> => {
return new Promise((resolve) => {
exec(`pdfinfo "${pdfPath}"`, (error, stdout) => {
if (error) return resolve(undefined);
const sizeMatch = stdout.match(/Page size:\s+([\d.]+)\s+x\s+([\d.]+)/);
if (sizeMatch) {
const height = parseFloat(sizeMatch[2]);
const width = parseFloat(sizeMatch[1]);
return resolve(height / width);
}
resolve(undefined);
});
});
};
// Checks if a file is an Excel file
export const isExcelFile = (filePath: string): boolean => {
const extension = path.extname(filePath).toLowerCase();
return (
extension === ".xlsx" ||
extension === ".xls" ||
extension === ".xlsm" ||
extension === ".xlsb"
);
};
// Checks if a file is a structured data file (like Excel)
export const isStructuredDataFile = (filePath: string): boolean => {
return isExcelFile(filePath);
};
================================================
FILE: node-zerox/src/utils/image.ts
================================================
import sharp from "sharp";
import Tesseract from "tesseract.js";
import { ASPECT_RATIO_THRESHOLD } from "../constants";
interface CleanupImageProps {
correctOrientation: boolean;
imageBuffer: Buffer;
scheduler: Tesseract.Scheduler | null;
trimEdges: boolean;
}
export const encodeImageToBase64 = (imageBuffer: Buffer) => {
return imageBuffer.toString("base64");
};
export const cleanupImage = async ({
correctOrientation,
imageBuffer,
scheduler,
trimEdges,
}: CleanupImageProps): Promise<Buffer[]> => {
const image = sharp(imageBuffer);
// Trim extra space around the content in the image
if (trimEdges) {
image.trim();
}
// scheduler would always be non-null if correctOrientation is true
// Adding this check to satisfy typescript
if (correctOrientation && scheduler) {
const optimalRotation = await determineOptimalRotation({
image,
scheduler,
});
if (optimalRotation) {
image.rotate(optimalRotation);
}
}
// Correct the image orientation
const correctedBuffer = await image.toBuffer();
return await splitTallImage(correctedBuffer);
};
// Determine the optimal image orientation based on OCR confidence
// Run Tesseract on 4 image orientations and compare the outputs
const determineOptimalRotation = async ({
image,
scheduler,
}: {
image: sharp.Sharp;
scheduler: Tesseract.Scheduler;
}): Promise<number> => {
const imageBuffer = await image.toBuffer();
const {
data: { orientation_confidence, orientation_degrees },
} = await scheduler.addJob("detect", imageBuffer);
if (orientation_degrees) {
console.log(
`Reorienting image ${orientation_degrees} degrees (confidence: ${orientation_confidence}%)`
);
return orientation_degrees;
}
return 0;
};
/**
* Compress an image to a maximum size
* @param image - The image to compress as a buffer
* @param maxSize - The maximum size in MB
* @returns The compressed image as a buffer
*/
export const compressImage = async (
image: Buffer,
maxSize: number
): Promise<Buffer> => {
if (maxSize <= 0) {
throw new Error("maxSize must be greater than 0");
}
// Convert maxSize from MB to bytes
const maxBytes = maxSize * 1024 * 1024;
if (image.length <= maxBytes) {
return image;
}
try {
// Start with quality 90 and gradually decrease if needed
let quality = 90;
let compressedImage: Buffer;
do {
compressedImage = await sharp(image).jpeg({ quality }).toBuffer();
quality -= 10;
if (quality < 20) {
throw new Error(
`Unable to compress image to ${maxSize}MB while maintaining acceptable quality.`
);
}
} while (compressedImage.length > maxBytes);
return compressedImage;
} catch (error) {
return image;
}
};
export const splitTallImage = async (
imageBuffer: Buffer
): Promise<Buffer[]> => {
const image = sharp(imageBuffer);
const metadata = await image.metadata();
const height = metadata.height || 0;
const width = metadata.width || 0;
const aspectRatio = height / width;
if (aspectRatio <= ASPECT_RATIO_THRESHOLD) {
return [await image.toBuffer()];
}
const { data: imageData } = await image
.grayscale()
.raw()
.toBuffer({ resolveWithObject: true });
const emptySpaces = new Array(height).fill(0);
// Analyze each row to find empty spaces
for (let y = 0; y < height; y++) {
let emptyPixels = 0;
for (let x = 0; x < width; x++) {
const pixelIndex = y * width + x;
if (imageData[pixelIndex] > 230) {
emptyPixels++;
}
}
// Calculate percentage of empty pixels in this row
const emptyRatio = emptyPixels / width;
// Mark rows that are mostly empty (whitespace)
emptySpaces[y] = emptyRatio > 0.95 ? 1 : 0;
}
const significantEmptySpaces = [];
let currentEmptyStart = -1;
for (let y = 0; y < height; y++) {
if (emptySpaces[y] === 1) {
if (currentEmptyStart === -1) {
currentEmptyStart = y;
}
} else {
if (currentEmptyStart !== -1) {
const emptyHeight = y - currentEmptyStart;
if (emptyHeight >= 5) {
// Minimum height for a significant empty space
significantEmptySpaces.push({
center: Math.floor(currentEmptyStart + emptyHeight / 2),
end: y - 1,
height: emptyHeight,
start: currentEmptyStart,
});
}
currentEmptyStart = -1;
}
}
}
// Handle if there's an empty space at the end
if (currentEmptyStart !== -1) {
const emptyHeight = height - currentEmptyStart;
if (emptyHeight >= 5) {
significantEmptySpaces.push({
center: Math.floor(currentEmptyStart + emptyHeight / 2),
end: height - 1,
height: emptyHeight,
start: currentEmptyStart,
});
}
}
const numSections = Math.ceil(aspectRatio);
const approxSectionHeight = Math.floor(height / numSections);
const splitPoints = [0];
for (let i = 1; i < numSections; i++) {
const targetY = i * approxSectionHeight;
// Find empty spaces near the target position
const searchRadius = Math.min(150, approxSectionHeight / 3);
const nearbyEmptySpaces = significantEmptySpaces.filter(
(space) =>
Math.abs(space.center - targetY) < searchRadius &&
space.start > splitPoints[splitPoints.length - 1] + 50
);
if (nearbyEmptySpaces.length > 0) {
// Sort by proximity to target
nearbyEmptySpaces.sort(
(a, b) => Math.abs(a.center - targetY) - Math.abs(b.center - targetY)
);
// Choose center of the best empty space
splitPoints.push(nearbyEmptySpaces[0].center);
} else {
// Fallback if no good empty spaces found
const minY = splitPoints[splitPoints.length - 1] + 50;
const maxY = Math.min(height - 50, targetY + searchRadius);
splitPoints.push(Math.max(minY, Math.min(maxY, targetY)));
}
}
splitPoints.push(height);
return Promise.all(
splitPoints.slice(0, -1).map((top, i) => {
const sectionHeight = splitPoints[i + 1] - top;
return sharp(imageBuffer)
.extract({ left: 0, top, width, height: sectionHeight })
.toBuffer();
})
);
};
================================================
FILE: node-zerox/src/utils/index.ts
================================================
export * from "./common";
export * from "./file";
export * from "./image";
export * from "./model";
export * from "./tesseract";
================================================
FILE: node-zerox/src/utils/model.ts
================================================
import {
CompletionResponse,
ExtractionResponse,
LLMParams,
ModelProvider,
OperationMode,
ProcessedCompletionResponse,
ProcessedExtractionResponse,
} from "../types";
import { formatMarkdown } from "./common";
export const isCompletionResponse = (
mode: OperationMode,
response: CompletionResponse | ExtractionResponse
): response is CompletionResponse => {
return mode === OperationMode.OCR;
};
const isExtractionResponse = (
mode: OperationMode,
response: CompletionResponse | ExtractionResponse
): response is ExtractionResponse => {
return mode === OperationMode.EXTRACTION;
};
export class CompletionProcessor {
static process<T extends OperationMode>(
mode: T,
response: CompletionResponse | ExtractionResponse
): T extends OperationMode.EXTRACTION
? ProcessedExtractionResponse
: ProcessedCompletionResponse {
const { logprobs, ...responseWithoutLogprobs } = response;
if (isCompletionResponse(mode, response)) {
const content = response.content;
return {
...responseWithoutLogprobs,
content:
typeof content === "string" ? formatMarkdown(content) : content,
contentLength: response.content?.length || 0,
} as T extends OperationMode.EXTRACTION
? ProcessedExtractionResponse
: ProcessedCompletionResponse;
}
if (isExtractionResponse(mode, response)) {
const extracted = response.extracted;
return {
...responseWithoutLogprobs,
extracted:
typeof extracted === "object" ? extracted : JSON.parse(extracted),
} as T extends OperationMode.EXTRACTION
? ProcessedExtractionResponse
: ProcessedCompletionResponse;
}
return responseWithoutLogprobs as T extends OperationMode.EXTRACTION
? ProcessedExtractionResponse
: ProcessedCompletionResponse;
}
}
const providerDefaultParams: Record<ModelProvider | string, LLMParams> = {
[ModelProvider.AZURE]: {
frequencyPenalty: 0,
logprobs: false,
maxTokens: 4000,
presencePenalty: 0,
temperature: 0,
topP: 1,
},
[ModelProvider.BEDROCK]: {
maxTokens: 4000,
temperature: 0,
topP: 1,
},
[ModelProvider.GOOGLE]: {
frequencyPenalty: 0,
maxOutputTokens: 4000,
presencePenalty: 0,
temperature: 0,
topP: 1,
},
[ModelProvider.OPENAI]: {
frequencyPenalty: 0,
logprobs: false,
maxTokens: 4000,
presencePenalty: 0,
temperature: 0,
topP: 1,
},
};
export const validateLLMParams = <T extends LLMParams>(
params: Partial<T>,
provider: ModelProvider | string
): LLMParams => {
const defaultParams = providerDefaultParams[provider];
if (!defaultParams) {
throw new Error(`Unsupported model provider: ${provider}`);
}
const validKeys = new Set(Object.keys(defaultParams));
for (const [key, value] of Object.entries(params)) {
if (!validKeys.has(key)) {
throw new Error(
`Invalid LLM parameter for ${provider}: ${key}. Valid parameters are: ${Array.from(
validKeys
).join(", ")}`
);
}
const expectedType = typeof defaultParams[key as keyof LLMParams];
if (typeof value !== expectedType) {
throw new Error(`Value for '${key}' must be a ${expectedType}`);
}
}
return { ...defaultParams, ...params };
};
================================================
FILE: node-zerox/src/utils/tesseract.ts
================================================
import * as Tesseract from "tesseract.js";
import { NUM_STARTING_WORKERS } from "../constants";
export const getTesseractScheduler = async () => {
return Tesseract.createScheduler();
};
const createAndAddWorker = async (scheduler: Tesseract.Scheduler) => {
const worker = await Tesseract.createWorker("eng", 2, {
legacyCore: true,
legacyLang: true,
});
await worker.setParameters({
tessedit_pageseg_mode: Tesseract.PSM.OSD_ONLY,
});
return scheduler.addWorker(worker);
};
export const addWorkersToTesseractScheduler = async ({
numWorkers,
scheduler,
}: {
numWorkers: number;
scheduler: Tesseract.Scheduler;
}) => {
let resArr = Array.from({ length: numWorkers });
await Promise.all(resArr.map(() => createAndAddWorker(scheduler)));
return true;
};
export const terminateScheduler = (scheduler: Tesseract.Scheduler) => {
return scheduler.terminate();
};
export const prepareWorkersForImageProcessing = async ({
numImages,
maxTesseractWorkers,
scheduler,
}: {
numImages: number;
maxTesseractWorkers: number;
scheduler: Tesseract.Scheduler | null;
}) => {
// Add more workers if correctOrientation is true
const numRequiredWorkers = numImages;
let numNewWorkers = numRequiredWorkers - NUM_STARTING_WORKERS;
if (maxTesseractWorkers !== -1) {
const numPreviouslyInitiatedWorkers =
maxTesseractWorkers < NUM_STARTING_WORKERS
? maxTesseractWorkers
: NUM_STARTING_WORKERS;
if (numRequiredWorkers > numPreviouslyInitiatedWorkers) {
numNewWorkers = Math.min(
numRequiredWorkers - numPreviouslyInitiatedWorkers,
maxTesseractWorkers - numPreviouslyInitiatedWorkers
);
} else {
numNewWorkers = 0;
}
}
// Add more workers if needed
if (numNewWorkers > 0 && maxTesseractWorkers !== 0 && scheduler)
addWorkersToTesseractScheduler({
numWorkers: numNewWorkers,
scheduler,
});
};
================================================
FILE: node-zerox/tests/README.md
================================================
# Test Script README
This script runs a quick test of the zerox output against a set keywords from known documents. This is not an exhaustive test, as it will not cover layout, but gives a good sense of any regressions.
## Overview
- **Processes Files**: Reads documents from `shared/inputs` (mix of PDFs, images, Word docs, etc.).
- **Runs OCR**: Runs `zerox` live against all the files.
- **Keyword Verification**: Compares extracted text with expected keywords from `shared/test.json`.
- **Results**: Outputs counts of keywords found and missing, and displays a summary table.
## How to Run
You should be able to run this test with `npm run test` from the root directory.
Note you will need a `.env` file in `node-zerox` with your OpenAI API key:
```
OPENAI_API_KEY=your_api_key_here
```
## Contributing new tests
1. Add Your Document:
- Place the file in `shared/inputs` (e.g., `0005.pdf`).
2. Update `test.json`:
- Add an entry:
```json
{
"file": "your_file.ext",
"expectedKeywords": [
["keyword1_page1", "keyword2_page1"],
["keyword1_page2", "keyword2_page2"]
]
}
```
3. Run the Test:
- Execute the script to include the new file.
## Performance Tests
To run the performance tests, use `npm run test:performance`.
================================================
FILE: node-zerox/tests/index.ts
================================================
import { compareKeywords } from "./utils";
import { ModelOptions } from "../src/types";
import { zerox } from "../src";
import dotenv from "dotenv";
import fs from "node:fs";
import path from "node:path";
import pLimit from "p-limit";
dotenv.config({ path: path.join(__dirname, "../.env") });
interface TestInput {
expectedKeywords: string[][];
file: string;
}
const FILE_CONCURRENCY = 10;
const INPUT_DIR = path.join(__dirname, "../../shared/inputs");
const TEST_JSON_PATH = path.join(__dirname, "../../shared/test.json");
const OUTPUT_DIR = path.join(__dirname, "results", `test-run-${Date.now()}`);
const TEMP_DIR = path.join(OUTPUT_DIR, "temp");
async function main() {
const T1 = new Date();
// Read the test inputs and expected keywords
const testInputs: TestInput[] = JSON.parse(
fs.readFileSync(TEST_JSON_PATH, "utf-8")
);
// Create the output directory
fs.mkdirSync(OUTPUT_DIR, { recursive: true });
const limit = pLimit(FILE_CONCURRENCY);
const results = await Promise.all(
testInputs.map((testInput) =>
limit(async () => {
const filePath = path.join(INPUT_DIR, testInput.file);
// Check if the file exists
if (!fs.existsSync(filePath)) {
console.warn(`File not found: ${filePath}`);
return null;
}
// Run OCR on the file
const ocrResult = await zerox({
cleanup: false,
filePath,
maintainFormat: false,
model: ModelOptions.OPENAI_GPT_4O,
openaiAPIKey: process.env.OPENAI_API_KEY,
outputDir: OUTPUT_DIR,
tempDir: TEMP_DIR,
});
// Compare expected keywords with OCR output
const keywordCounts = compareKeywords(
ocrResult.pages,
testInput.expectedKeywords
);
// Prepare the result
return {
file: testInput.file,
keywordCounts,
totalKeywords: testInput.expectedKeywords.flat().length,
};
})
)
);
// Filter out any null results (due to missing files)
const filteredResults = results.filter((result) => result !== null);
const tableData = filteredResults.map((result) => {
const totalFound =
result?.keywordCounts.reduce(
(sum, page) => sum + page.keywordsFound.length,
0
) ?? 0;
const totalMissing =
result?.keywordCounts.reduce(
(sum, page) => sum + page.keywordsMissing.length,
0
) ?? 0;
const totalKeywords = totalFound + totalMissing;
const percentage =
totalKeywords > 0
? ((totalFound / totalKeywords) * 100).toFixed(2) + "%"
: "N/A";
return {
fileName: result?.file,
keywordsFound: totalFound,
keywordsMissing: totalMissing,
percentage,
};
});
// Write the test results to output.json
fs.writeFileSync(
path.join(OUTPUT_DIR, "output.json"),
JSON.stringify(filteredResults, null, 2)
);
const T2 = new Date();
const completionTime = ((T2.getTime() - T1.getTime()) / 1000).toFixed(2);
// Calculate overall accuracy and total pages tested
const totalKeywordsFound = filteredResults.reduce(
(sum, result) =>
sum +
(result?.keywordCounts?.reduce(
(s, page) => s + (page.keywordsFound?.length ?? 0),
0
) ?? 0),
0
);
const totalKeywordsMissing = filteredResults.reduce(
(sum, result) =>
sum +
(result?.keywordCounts?.reduce(
(s, page) => s + (page.keywordsMissing?.length ?? 0),
0
) ?? 0),
0
);
const totalKeywords = totalKeywordsFound + totalKeywordsMissing;
const overallAccuracy =
totalKeywords > 0
? ((totalKeywordsFound / totalKeywords) * 100).toFixed(2) + "%"
: "N/A";
const pagesTested = filteredResults.reduce(
(sum, result) => sum + (result?.keywordCounts?.length ?? 0),
0
);
console.log("\n");
console.log("-------------------------------------------------------------");
console.log("Test complete in", completionTime, "seconds");
console.log("Overall accuracy:", overallAccuracy);
console.log("Pages tested:", pagesTested);
console.log("-------------------------------------------------------------");
console.table(tableData);
console.log("-------------------------------------------------------------");
console.log(`Full test results are available in ${OUTPUT_DIR}`);
console.log("-------------------------------------------------------------");
console.log("\n");
}
main().catch((error) => {
console.error("An error occurred during the test run:", error);
});
================================================
FILE: node-zerox/tests/performance.test.ts
================================================
import path from "path";
import fs from "fs-extra";
import { zerox } from "../src";
import { ModelOptions } from "../src/types";
const MOCK_OPENAI_TIME = 0;
const TEST_FILES_DIR = path.join(__dirname, "data");
interface TestResult {
numPages: number;
concurrency: number;
duration: number;
avgTimePerPage: number;
}
// Mock the OpenAIModel class
jest.mock("../src/models/openAI", () => {
return {
__esModule: true,
default: class MockOpenAIModel {
constructor() {
// Mock constructor
}
async getCompletion() {
await new Promise((resolve) => setTimeout(resolve, MOCK_OPENAI_TIME));
return {
content:
"# Mocked Content\n\nThis is a mocked response for testing purposes.",
inputTokens: 100,
outputTokens: 50,
};
}
},
};
});
describe("Zerox Performance Tests", () => {
const allResults: TestResult[] = [];
beforeAll(async () => {
// Ensure test directories exist
await fs.ensureDir(TEST_FILES_DIR);
});
const runPerformanceTest = async (numPages: number, concurrency: number) => {
const filePath = path.join(TEST_FILES_DIR, `${numPages}-pages.pdf`);
console.log(`\nTesting ${numPages} pages with concurrency ${concurrency}`);
console.time(`Processing ${numPages} pages`);
const startTime = Date.now();
const result = await zerox({
cleanup: true,
concurrency,
filePath,
model: ModelOptions.OPENAI_GPT_4O,
openaiAPIKey: "mock-key",
});
const duration = Date.now() - startTime;
console.timeEnd(`Processing ${numPages} pages`);
return {
numPages,
concurrency,
duration,
avgTimePerPage: duration / numPages,
successRate:
((result.summary.ocr?.successful || 0) / result.summary.totalPages) *
100,
};
};
const testCases = [
{ pages: 1, concurrency: 20 },
{ pages: 10, concurrency: 20 },
{ pages: 20, concurrency: 20 },
{ pages: 30, concurrency: 20 },
{ pages: 50, concurrency: 20 },
{ pages: 100, concurrency: 20 },
{ pages: 1, concurrency: 50 },
{ pages: 10, concurrency: 50 },
{ pages: 20, concurrency: 50 },
{ pages: 30, concurrency: 50 },
{ pages: 50, concurrency: 50 },
{ pages: 100, concurrency: 50 },
];
test.each(testCases)(
"Performance test with $pages pages and concurrency $concurrency",
async ({ pages, concurrency }) => {
const results = await runPerformanceTest(pages, concurrency);
allResults.push(results);
console.table({
"Number of Pages": results.numPages,
Concurrency: results.concurrency,
"Total Duration (ms)": results.duration,
"Avg Time per Page (ms)": Math.round(results.avgTimePerPage),
});
expect(results.duration).toBeGreaterThan(0);
},
// Set timeout to accommodate larger tests
120000
);
afterAll(() => {
// Print performance comparison
console.log("\n=== FINAL PERFORMANCE COMPARISON ===");
const comparisonTable = Array.from(new Set(testCases.map((tc) => tc.pages)))
.sort((a, b) => a - b)
.map((pages) => {
const c20 = allResults.find(
(r) => r.numPages === pages && r.concurrency === 20
);
const c50 = allResults.find(
(r) => r.numPages === pages && r.concurrency === 50
);
return {
Pages: pages,
"Time (concurrency=20) (s)": c20
? (c20.duration / 1000).toFixed(2)
: "N/A",
"Time (concurrency=50) (s)": c50
? (c50.duration / 1000).toFixed(2)
: "N/A",
Improvement:
c20 && c50
? `${((1 - c50.duration / c20.duration) * 100).toFixed(1)}%`
: "N/A",
};
});
console.table(comparisonTable);
});
});
================================================
FILE: node-zerox/tests/utils.ts
================================================
import { Page } from "../src/types";
export const compareKeywords = (
pages: Page[],
expectedKeywords: string[][]
) => {
const keywordCounts: {
keywordsFound: string[];
keywordsMissing: string[];
page: number;
totalKeywords: number;
}[] = [];
for (let i = 0; i < expectedKeywords.length; i++) {
const page = pages[i];
const keywords = expectedKeywords[i];
const keywordsFound: string[] = [];
const keywordsMissing: string[] = [];
if (page && keywords && page.content !== undefined) {
const pageContent = page.content.toLowerCase();
keywords.forEach((keyword) => {
if (pageContent.includes(keyword.toLowerCase())) {
keywordsFound.push(keyword);
} else {
keywordsMissing.push(keyword);
}
});
}
keywordCounts.push({
keywordsFound,
keywordsMissing,
page: i + 1,
totalKeywords: keywords.length,
});
}
return keywordCounts;
};
================================================
FILE: py_zerox/pyzerox/__init__.py
================================================
from .core import zerox
from .constants.prompts import Prompts
DEFAULT_SYSTEM_PROMPT = Prompts.DEFAULT_SYSTEM_PROMPT
__all__ = [
"zerox",
"Prompts",
"DEFAULT_SYSTEM_PROMPT",
]
================================================
FILE: py_zerox/pyzerox/constants/__init__.py
================================================
from .conversion import PDFConversionDefaultOptions
from .messages import Messages
from .prompts import Prompts
__all__ = [
"PDFConversionDefaultOptions",
"Messages",
"Prompts",
]
================================================
FILE: py_zerox/pyzerox/constants/conversion.py
================================================
class PDFConversionDefaultOptions:
"""Default options for converting PDFs to images"""
DPI = 300
FORMAT = "png"
SIZE = (None, 1056)
THREAD_COUNT = 4
USE_PDFTOCAIRO = True
================================================
FILE: py_zerox/pyzerox/constants/messages.py
================================================
class Messages:
"""User-facing messages"""
MISSING_ENVIRONMENT_VARIABLES = """
Required environment variable (keys) from the model are Missing. Please set the required environment variables for the model provider.
Refer: https://docs.litellm.ai/docs/providers
"""
NON_VISION_MODEL = """
The provided model is not a vision model. Please provide a vision model.
"""
MODEL_ACCESS_ERROR = """
Your provided model can't be accessed. Please make sure you have access to the model and also required environment variables are setup correctly including valid api key(s).
Refer: https://docs.litellm.ai/docs/providers
"""
CUSTOM_SYSTEM_PROMPT_WARNING = """
Custom system prompt was provided which overrides the default system prompt. We assume that you know what you are doing.
"""
MAINTAIN_FORMAT_SELECTED_PAGES_WARNING = """
The maintain_format flag is set to True in conjunction with select_pages input given. This may result in unexpected behavior.
"""
PAGE_NUMBER_OUT_OF_BOUND_ERROR = """
The page number(s) provided is out of bound. Please provide a valid page number(s).
"""
NON_200_RESPONSE = """
Model API returned status code {status_code}: {data}
Please check the litellm documentation for more information. https://docs.litellm.ai/docs/exception_mapping.
"""
COMPLETION_ERROR = """
Error in Completion Response. Error: {0}
Please check the status of your model provider API status.
"""
PDF_CONVERSION_FAILED = """
Error during PDF conversion: {0}
Please check the PDF file and try again. For more information: https://github.com/Belval/pdf2image
"""
FILE_UNREACHAGBLE = """
File not found or unreachable. Status Code: {0}
"""
FILE_PATH_MISSING = """
File path is invalid or missing.
"""
FAILED_TO_SAVE_FILE = """Failed to save file to local drive"""
FAILED_TO_PROCESS_IMAGE = """Failed to process image"""
================================================
FILE: py_zerox/pyzerox/constants/patterns.py
================================================
class Patterns:
"""Regex patterns for markdown and code blocks"""
MATCH_MARKDOWN_BLOCKS = r"^```[a-z]*\n([\s\S]*?)\n```$"
MATCH_CODE_BLOCKS = r"^```\n([\s\S]*?)\n```$"
================================================
FILE: py_zerox/pyzerox/constants/prompts.py
================================================
class Prompts:
"""Class for storing prompts for the Zerox system."""
DEFAULT_SYSTEM_PROMPT = """
Convert the following document to markdown.
Return only the markdown with no explanation text. Do not include delimiters like ```markdown or ```html.
RULES:
- You must include all information on the page. Do not exclude headers, footers, or subtext.
- Return tables in an HTML format.
- Charts & infographics must be interpreted to a markdown format. Prefer table format when applicable.
- Logos should be wrapped in brackets. Ex: <logo>Coca-Cola<logo>
- Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY<watermark>
- Page numbers should be wrapped in brackets. Ex: <page_number>14<page_number> or <page_number>9/22<page_number>
- Prefer using ☐ and ☑ for check boxes.
"""
================================================
FILE: py_zerox/pyzerox/core/__init__.py
================================================
from .zerox import zerox
__all__ = [
"zerox",
]
================================================
FILE: py_zerox/pyzerox/core/types.py
================================================
from typing import List, Optional, Dict, Any, Union, Iterable
from dataclasses import dataclass, field
@dataclass
class ZeroxArgs:
"""
Dataclass to store the arguments for the Zerox class.
"""
file_path: str
cleanup: bool = True
concurrency: int = 10
maintain_format: bool = False
model: str = "gpt-4o-mini",
output_dir: Optional[str] = None
temp_dir: Optional[str] = None
custom_system_prompt: Optional[str] = None
select_pages: Optional[Union[int, Iterable[int]]] = None
kwargs: Dict[str, Any] = field(default_factory=dict)
@dataclass
class Page:
"""
Dataclass to store the page content.
"""
content: str
content_length: int
page: int
@dataclass
class ZeroxOutput:
"""
Dataclass to store the output of the Zerox class.
"""
completion_time: float
file_name: str
input_tokens: int
output_tokens: int
pages: List[Page]
================================================
FILE: py_zerox/pyzerox/core/zerox.py
================================================
import os
import aioshutil as async_shutil
import tempfile
import warnings
from typing import List, Optional, Union, Iterable
from datetime import datetime
import aiofiles
import aiofiles.os as async_os
import asyncio
from ..constants import PDFConversionDefaultOptions
# Package Imports
from ..processor import (
convert_pdf_to_images,
download_file,
process_page,
process_pages_in_batches,
create_selected_pages_pdf,
)
from ..errors import FileUnavailable
from ..constants.messages import Messages
from ..models import litellmmodel
from .types import Page, ZeroxOutput
async def zerox(
cleanup: bool = True,
concurrency: int = 10,
file_path: Optional[str] = "",
image_density: int = PDFConversionDefaultOptions.DPI,
image_height: tuple[Optional[int], int] = PDFConversionDefaultOptions.SIZE,
maintain_format: bool = False,
model: str = "gpt-4o-mini",
output_dir: Optional[str] = None,
temp_dir: Optional[str] = None,
custom_system_prompt: Optional[str] = None,
select_pages: Optional[Union[int, Iterable[int]]] = None,
**kwargs
) -> ZeroxOutput:
"""
API to perform OCR to markdown using Vision models.
Please setup the environment variables for the model and model provider before using this API. Refer: https://docs.litellm.ai/docs/providers
:param cleanup: Whether to cleanup the temporary files after processing, defaults to True
:type cleanup: bool, optional
:param concurrency: The number of concurrent processes to run, defaults to 10
:type concurrency: int, optional
:param file_path: The path or URL to the PDF file to process.
:type file_path: str, optional
:param maintain_format: Whether to maintain the format from the previous page, defaults to False
:type maintain_format: bool, optional
:param model: The model to use for generating completions, defaults to "gpt-4o-mini". Note - Refer: https://docs.litellm.ai/docs/providers to pass correct model name as according to provider it might be different from actual name.
:type model: str, optional
:param output_dir: The directory to save the markdown output, defaults to None
:type output_dir: str, optional
:param temp_dir: The directory to store temporary files, defaults to some named folder in system's temp directory. If already exists, the contents will be deleted for zerox uses it.
:type temp_dir: str, optional
:param custom_system_prompt: The system prompt to use for the model, this overrides the default system prompt of zerox. Generally it is not required unless you want some specific behaviour. When set, it will raise a friendly warning, defaults to None
:type custom_system_prompt: str, optional
:param select_pages: Pages to process, can be a single page number or an iterable of page numbers, defaults to None
:type select_pages: int or Iterable[int], optional
:param kwargs: Additional keyword arguments to pass to the model.completion -> litellm.completion method. Refer: https://docs.litellm.ai/docs/providers and https://docs.litellm.ai/docs/completion/input
:return: The markdown content generated by the model.
"""
input_token_count = 0
output_token_count = 0
prior_page = ""
aggregated_markdown: List[str] = []
start_time = datetime.now()
# File Path Validators
if not file_path:
raise FileUnavailable()
# Create an instance of the litellm model interface
vision_model = litellmmodel(model=model,**kwargs)
# override the system prompt if a custom prompt is provided
if custom_system_prompt:
vision_model.system_prompt = custom_system_prompt
# Check if both maintain_format and select_pages are provided
if maintain_format and select_pages is not None:
warnings.warn(Messages.MAINTAIN_FORMAT_SELECTED_PAGES_WARNING)
# If select_pages is a single integer, convert it to a list for consistency
if isinstance(select_pages, int):
select_pages = [select_pages]
# Sort the pages to maintain consistency
if select_pages is not None:
select_pages = sorted(select_pages)
# Ensure the output directory exists
if output_dir:
await async_os.makedirs(output_dir, exist_ok=True)
## delete tmp_dir if exists and then recreate it
if temp_dir:
if os.path.exists(temp_dir):
await async_shutil.rmtree(temp_dir)
await async_os.makedirs(temp_dir, exist_ok=True)
# Create a temporary directory to store the PDF and images
with tempfile.TemporaryDirectory() as temp_dir_:
if temp_dir:
## use the user provided temp directory
temp_directory = temp_dir
else:
## use the system temp directory
temp_directory = temp_dir_
# Download the PDF. Get file name.
local_path = await download_file(file_path=file_path, temp_dir=temp_directory)
if not local_path:
raise FileUnavailable()
raw_file_name = os.path.splitext(os.path.basename(local_path))[0]
file_name = "".join(c.lower() if c.isalnum() else "_" for c in raw_file_name)
# Truncate file name to 255 characters to prevent ENAMETOOLONG errors
file_name = file_name[:255]
# create a subset pdf in temp dir with only the requested pages if select_pages is provided
if select_pages is not None:
subset_pdf_create_kwargs = {"original_pdf_path":local_path, "select_pages":select_pages,
"save_directory":temp_directory, "suffix":"_selected_pages"}
local_path = await asyncio.to_thread(create_selected_pages_pdf,
**subset_pdf_create_kwargs)
# Convert the file to a series of images, below function returns a list of image paths in page order
images = await convert_pdf_to_images(image_density=image_density, image_height=image_height, local_path=local_path, temp_dir=temp_directory)
if maintain_format:
for image in images:
result, input_token_count, output_token_count, prior_page = await process_page(
image,
vision_model,
temp_directory,
input_token_count,
output_token_count,
prior_page,
)
if result:
aggregated_markdown.append(result)
else:
results = await process_pages_in_batches(
images,
concurrency,
vision_model,
temp_directory,
input_token_count,
output_token_count,
prior_page,
)
aggregated_markdown = [result[0] for result in results if isinstance(result[0], str)]
## add token usage
input_token_count += sum([result[1] for result in results])
output_token_count += sum([result[2] for result in results])
# Write the aggregated markdown to a file
if output_dir:
result_file_path = os.path.join(output_dir, f"{file_name}.md")
async with aiofiles.open(result_file_path, "w", encoding="utf-8") as f:
await f.write("\n\n".join(aggregated_markdown))
# Cleanup the downloaded PDF file
if cleanup and os.path.exists(temp_directory):
await async_shutil.rmtree(temp_directory)
# Format JSON response
end_time = datetime.now()
completion_time = (end_time - start_time).total_seconds() * 1000
# Adjusting the formatted_pages logic to account for select_pages to output the correct page numbers
if select_pages is not None:
# Map aggregated markdown to the selected pages
formatted_pages = [
Page(content=content, page=select_pages[i], content_length=len(content))
for i, content in enumerate(aggregated_markdown)
]
else:
# Default behavior when no select_pages is provided
formatted_pages = [
Page(content=content, page=i + 1, content_length=len(content))
for i, content in enumerate(aggregated_markdown)
]
return ZeroxOutput(
completion_time=completion_time,
file_name=file_name,
input_tokens=input_token_count,
output_tokens=output_token_count,
pages=formatted_pages,
)
================================================
FILE: py_zerox/pyzerox/errors/__init__.py
================================================
from .exceptions import (
NotAVisionModel,
ModelAccessError,
PageNumberOutOfBoundError,
MissingEnvironmentVariables,
ResourceUnreachableException,
FileUnavailable,
FailedToSaveFile,
FailedToProcessFile,
)
__all__ = [
"NotAVisionModel",
"ModelAccessError",
"PageNumberOutOfBoundError",
"MissingEnvironmentVariables",
"ResourceUnreachableException",
"FileUnavailable",
"FailedToSaveFile",
"FailedToProcessFile",
]
================================================
FILE: py_zerox/pyzerox/errors/base.py
================================================
from typing import Optional
class CustomException(Exception):
"""
Base class for custom exceptions
"""
def __init__(
self,
message: Optional[str] = None,
extra_info: Optional[dict] = None,
):
self.message = message
self.extra_info = extra_info
super().__init__(self.message)
def __str__(self):
if self.extra_info:
return f"{self.message} (Extra Info: {self.extra_info})"
return self.message
================================================
FILE: py_zerox/pyzerox/errors/exceptions.py
================================================
from typing import Dict, Optional
# Package Imports
from ..constants import Messages
from .base import CustomException
class MissingEnvironmentVariables(CustomException):
"""Exception raised when the model provider environment variables, API key(s) are missing. Refer: https://docs.litellm.ai/docs/providers"""
def __init__(
self,
message: str = Messages.MISSING_ENVIRONMENT_VARIABLES,
extra_info: Optional[Dict] = None,
):
super().__init__(message, extra_info)
class NotAVisionModel(CustomException):
"""Exception raised when the provided model is not a vision model."""
def __init__(
self,
message: str = Messages.NON_VISION_MODEL,
extra_info: Optional[Dict] = None,
):
super().__init__(message, extra_info)
class ModelAccessError(CustomException):
"""Exception raised when the provided model can't be accessed due to incorrect credentials/keys or incorrect environent variables setup."""
def __init__(
self,
message: str = Messages.MODEL_ACCESS_ERROR,
extra_info: Optional[Dict] = None,
):
super().__init__(message, extra_info)
class PageNumberOutOfBoundError(CustomException):
"""Exception invalid page number(s) provided."""
def __init__(
self,
message: str = Messages.PAGE_NUMBER_OUT_OF_BOUND_ERROR,
extra_info: Optional[Dict] = None,
):
super().__init__(message, extra_info)
class ResourceUnreachableException(CustomException):
"""Exception raised when a resource is unreachable."""
def __init__(
self,
message: str = Messages.FILE_UNREACHAGBLE,
extra_info: Optional[Dict] = None,
):
super().__init__(message, extra_info)
class FileUnavailable(CustomException):
"""Exception raised when a file is unavailable."""
def __init__(
self,
message: str = Messages.FILE_PATH_MISSING,
extra_info: Optional[Dict] = None,
):
super().__init__(message, extra_info)
class FailedToSaveFile(CustomException):
"""Exception raised when a file fails to save."""
def __init__(
self,
message: str = Messages.FAILED_TO_SAVE_FILE,
extra_info: Optional[Dict] = None,
):
super().__init__(message, extra_info)
class FailedToProcessFile(CustomException):
"""Exception raised when a file fails to process."""
def __init__(
self,
message: str = Messages.FAILED_TO_PROCESS_IMAGE,
extra_info: Optional[Dict] = None,
):
super().__init__(message, extra_info)
================================================
FILE: py_zerox/pyzerox/models/__init__.py
================================================
from .modellitellm import litellmmodel
from .types import CompletionResponse
__all__ = [
"litellmmodel",
"CompletionResponse",
]
================================================
FILE: py_zerox/pyzerox/models/base.py
================================================
from abc import ABC, abstractmethod
from typing import Dict, Optional, Type, TypeVar, TYPE_CHECKING
if TYPE_CHECKING:
from ..models import CompletionResponse
T = TypeVar("T", bound="BaseModel")
class BaseModel(ABC):
"""
Base class for all models.
"""
@abstractmethod
async def completion(
self,
) -> "CompletionResponse":
raise NotImplementedError("Subclasses must implement this method")
@abstractmethod
def validate_access(
self,
) -> None:
raise NotImplementedError("Subclasses must implement this method")
@abstractmethod
def validate_model(
self,
) -> None:
raise NotImplementedError("Subclasses must implement this method")
def __init__(
self,
model: Optional[str] = None,
**kwargs,
):
self.model = model
self.kwargs = kwargs
## validations
# self.validate_model()
# self.validate_access()
================================================
FILE: py_zerox/pyzerox/models/modellitellm.py
================================================
import os
import aiohttp
import litellm
from typing import List, Dict, Any, Optional
# Package Imports
from .base import BaseModel
from .types import CompletionResponse
from ..errors import ModelAccessError, NotAVisionModel, MissingEnvironmentVariables
from ..constants.messages import Messages
from ..constants.prompts import Prompts
from ..processor.image import encode_image_to_base64
DEFAULT_SYSTEM_PROMPT = Prompts.DEFAULT_SYSTEM_PROMPT
class litellmmodel(BaseModel):
## setting the default system prompt
_system_prompt = DEFAULT_SYSTEM_PROMPT
def __init__(
self,
model: Optional[str] = None,
**kwargs,
):
"""
Initializes the Litellm model interface.
:param model: The model to use for generating completions, defaults to "gpt-4o-mini". Refer: https://docs.litellm.ai/docs/providers
:type model: str, optional
:param kwargs: Additional keyword arguments to pass to self.completion -> litellm.completion. Refer: https://docs.litellm.ai/docs/providers and https://docs.litellm.ai/docs/completion/input
"""
super().__init__(model=model, **kwargs)
## calling custom methods to validate the environment and model
self.validate_environment()
self.validate_model()
self.validate_access()
@property
def system_prompt(self) -> str:
'''Returns the system prompt for the model.'''
return self._system_prompt
@system_prompt.setter
def system_prompt(self, prompt: str) -> None:
'''
Sets/overrides the system prompt for the model.
'''
self._system_prompt = prompt
## custom method on top of BaseModel
def validate_environment(self) -> None:
"""Validates the environment variables required for the model."""
env_config = litellm.validate_environment(model=self.model)
if not env_config["keys_in_environment"]:
raise MissingEnvironmentVariables(extra_info=env_config)
def validate_model(self) -> None:
'''Validates the model to ensure it is a vision model.'''
if not litellm.supports_vision(model=self.model):
raise NotAVisionModel(extra_info={"model": self.model})
def validate_access(self) -> None:
"""Validates access to the model -> if environment variables are set correctly with correct values."""
if not litellm.check_valid_key(model=self.model,api_key=None):
raise ModelAccessError(extra_info={"model": self.model})
async def completion(
self,
image_path: str,
maintain_format: bool,
prior_page: str,
) -> CompletionResponse:
"""LitellM completion for image to markdown conversion.
:param image_path: Path to the image file.
:type image_path: str
:param maintain_format: Whether to maintain the format from the previous page.
:type maintain_format: bool
:param prior_page: The markdown content of the previous page.
:type prior_page: str
:return: The markdown content generated by the model.
"""
messages = await self._prepare_messages(
image_path=image_path,
maintain_format=maintain_format,
prior_page=prior_page,
)
try:
response = await litellm.acompletion(model=self.model, messages=messages, **self.kwargs)
## completion response
response = CompletionResponse(
content=response["choices"][0]["message"]["content"],
input_tokens=response["usage"]["prompt_tokens"],
output_tokens=response["usage"]["completion_tokens"],
)
return response
except Exception as err:
raise Exception(Messages.COMPLETION_ERROR.format(err))
async def _prepare_messages(
self,
image_path: str,
maintain_format: bool,
prior_page: str,
) -> List[Dict[str, Any]]:
"""Prepares the messages to send to the LiteLLM Completion API.
:param image_path: Path to the image file.
:type image_path: str
:param maintain_format: Whether to maintain the format from the previous page.
:type maintain_format: bool
:param prior_page: The markdown content of the previous page.
:type prior_page: str
"""
# Default system message
messages: List[Dict[str, Any]] = [
{
"role": "system",
"content": self._system_prompt,
},
]
# If content has already been generated, add it to context.
# This helps maintain the same format across pages.
if maintain_format and prior_page:
messages.append(
{
"role": "system",
"content": f'Markdown must maintain consistent formatting with the following page: \n\n """{prior_page}"""',
},
)
# Add Image to request
base64_image = await encode_image_to_base64(image_path)
messages.append(
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{base64_image}"},
},
],
}
)
return messages
================================================
FILE: py_zerox/pyzerox/models/types.py
================================================
from dataclasses import dataclass
@dataclass
class CompletionResponse:
"""
A class representing the response of a completion.
"""
content: str
input_tokens: int
output_tokens: int
================================================
FILE: py_zerox/pyzerox/processor/__init__.py
================================================
from .image import save_image, encode_image_to_base64
from .pdf import (
convert_pdf_to_images,
process_page,
process_pages_in_batches,
)
from .text import format_markdown
from .utils import download_file, create_selected_pages_pdf
__all__ = [
"save_image",
"encode_image_to_base64",
"convert_pdf_to_images",
"format_markdown",
"download_file",
"process_page",
"process_pages_in_batches",
"create_selected_pages_pdf",
]
================================================
FILE: py_zerox/pyzerox/processor/image.py
================================================
import aiofiles
import base64
import io
async def encode_image_to_base64(image_path: str) -> str:
"""Encode an image to base64 asynchronously."""
async with aiofiles.open(image_path, "rb") as image_file:
image_data = await image_file.read()
return base64.b64encode(image_data).decode("utf-8")
async def save_image(image, image_path: str):
"""Save an image to a file asynchronously."""
# Convert PIL Image to BytesIO object
with io.BytesIO() as buffer:
image.save(buffer, format=image.format) # Save the image to the BytesIO object
image_data = buffer.getvalue() # Get the image data from the BytesIO object
# Write image data to file asynchronously
async with aiofiles.open(image_path, "wb") as f:
await f.write(image_data)
================================================
FILE: py_zerox/pyzerox/processor/pdf.py
================================================
import logging
import os
import asyncio
from typing import List, Optional, Tuple
from pdf2image import convert_from_path
# Package Imports
from .image import save_image
from .text import format_markdown
from ..constants import PDFConversionDefaultOptions, Messages
from ..models import litellmmodel
async def convert_pdf_to_images(image_density: int, image_height: tuple[Optional[int], int], local_path: str, temp_dir: str) -> List[str]:
"""Converts a PDF file to a series of images in the temp_dir. Returns a list of image paths in page order."""
options = {
"pdf_path": local_path,
"output_folder": temp_dir,
"dpi": image_density,
"fmt": PDFConversionDefaultOptions.FORMAT,
"size": image_height,
"thread_count": PDFConversionDefaultOptions.THREAD_COUNT,
"use_pdftocairo": PDFConversionDefaultOptions.USE_PDFTOCAIRO,
"paths_only": True,
}
try:
image_paths = await asyncio.to_thread(
convert_from_path, **options
)
return image_paths
except Exception as err:
logging.error(f"Error converting PDF to images: {err}")
async def process_page(
image: str,
model: litellmmodel,
temp_directory: str = "",
input_token_count: int = 0,
output_token_count: int = 0,
prior_page: str = "",
semaphore: Optional[asyncio.Semaphore] = None,
) -> Tuple[str, int, int, str]:
"""Process a single page of a PDF"""
# If semaphore is provided, acquire it before processing the page
if semaphore:
async with semaphore:
return await process_page(
image,
model,
temp_directory,
input_token_count,
output_token_count,
prior_page,
)
image_path = os.path.join(temp_directory, image)
# Get the completion from LiteLLM
try:
completion = await model.completion(
image_path=image_path,
maintain_format=True,
prior_page=prior_page,
)
formatted_markdown = format_markdown(completion.content)
input_token_count += completion.input_tokens
output_token_count += completion.output_tokens
prior_page = formatted_markdown
return formatted_markdown, input_token_count, output_token_count, prior_page
except Exception as error:
logging.error(f"{Messages.FAILED_TO_PROCESS_IMAGE} Error:{error}")
return "", input_token_count, output_token_count, ""
async def process_pages_in_batches(
images: List[str],
concurrency: int,
model: litellmmodel,
temp_directory: str = "",
input_token_count: int = 0,
output_token_count: int = 0,
prior_page: str = "",
):
# Create a semaphore to limit the number of concurrent tasks
semaphore = asyncio.Semaphore(concurrency)
# Process each page in parallel
tasks = [
process_page(
image,
model,
temp_directory,
input_token_count,
output_token_count,
prior_page,
semaphore,
)
for image in images
]
# Wait for all tasks to complete
return await asyncio.gather(*tasks)
================================================
FILE: py_zerox/pyzerox/processor/text.py
================================================
import re
# Package imports
from ..constants.patterns import Patterns
def format_markdown(text: str) -> str:
"""Format markdown text by removing markdown and code blocks"""
formatted_markdown = re.sub(Patterns.MATCH_MARKDOWN_BLOCKS, r"\1", text)
formatted_markdown = re.sub(Patterns.MATCH_CODE_BLOCKS, r"\1", formatted_markdown)
return formatted_markdown
================================================
FILE: py_zerox/pyzerox/processor/utils.py
================================================
import os
import re
from typing import Optional, Union, Iterable
from urllib.parse import urlparse
import aiofiles
import aiohttp
from PyPDF2 import PdfReader, PdfWriter
from ..constants.messages import Messages
# Package Imports
from ..errors.exceptions import ResourceUnreachableException, PageNumberOutOfBoundError
async def download_file(
file_path: str,
temp_dir: str,
) -> Optional[str]:
"""Downloads a file from a URL or local path to a temporary directory."""
local_pdf_path = os.path.join(temp_dir, os.path.basename(file_path))
if is_valid_url(file_path):
async with aiohttp.ClientSession() as session:
async with session.get(file_path) as response:
if response.status != 200:
raise ResourceUnreachableException()
async with aiofiles.open(local_pdf_path, "wb") as f:
await f.write(await response.read())
else:
async with aiofiles.open(file_path, "rb") as src, aiofiles.open(
local_pdf_path, "wb"
) as dst:
await dst.write(await src.read())
return local_pdf_path
def is_valid_url(string: str) -> bool:
"""Checks if a string is a valid URL."""
try:
result = urlparse(string)
return all([result.scheme, result.netloc]) and result.scheme in [
"http",
"https",
]
except ValueError:
return False
def create_selected_pages_pdf(original_pdf_path: str, select_pages: Union[int, Iterable[int]],
save_directory: str, suffix: str = "_selected_pages",
sorted_pages: bool = True) -> str:
"""
Creates a new PDF with only the selected pages.
:param original_pdf_path: Path to the original PDF file.
:type original_pdf_path: str
:param select_pages: A single page number or an iterable of page numbers (1-indexed).
:type select_pages: int or Iterable[int]
:param save_directory: The directory to store the new PDF.
:type save_directory: str
:param suffix: The suffix to add to the new PDF file name, defaults to "_selected_pages".
:type suffix: str, optional
:param sorted_pages: Whether to sort the selected pages, defaults to True.
:type sorted_pages: bool, optional
:return: Path the new PDF file
"""
file_name = os.path.splitext(os.path.basename(original_pdf_path))[0]
# Write the new PDF to a temporary file
selected_pages_pdf_path = os.path.join(save_directory, f"{file_name}{suffix}.pdf")
# Ensure select_pages is iterable, if not, convert to list
if isinstance(select_pages, int):
select_pages = [select_pages]
if sorted_pages:
# Sort the pages for consistency
select_pages = sorted(list(select_pages))
with open(original_pdf_path, "rb") as orig_pdf, open(selected_pages_pdf_path, "wb") as new_pdf:
# Read the original PDF
reader = PdfReader(stream=orig_pdf)
total_pages = len(reader.pages)
# Validate page numbers
invalid_page_numbers = []
for page in select_pages:
if page < 1 or page > total_pages:
invalid_page_numbers.append(page)
## raise error if invalid page numbers
if invalid_page_numbers:
raise PageNumberOutOfBoundError(extra_info={"input_pdf_num_pages":total_pages,
"select_pages": select_pages,
"invalid_page_numbers": invalid_page_numbers})
# Create a new PDF writer
writer = PdfWriter(fileobj=new_pdf)
# Add only the selected pages
for page_number in select_pages:
writer.add_page(reader.pages[page_number - 1])
writer.write(stream=new_pdf)
return selected_pages_pdf_path
================================================
FILE: py_zerox/scripts/__init__.py
================================================
================================================
FILE: py_zerox/scripts/pre_install.py
================================================
# pre_install.py
import subprocess
import sys
import platform
def run_command(command):
try:
result = subprocess.run(command, shell=True, text=True, capture_output=True)
result.check_returncode()
return result.stdout
except subprocess.CalledProcessError as e:
raise RuntimeError(e.stderr.strip())
def install_package(command, package_name):
try:
output = run_command(command)
print(output)
return output
except RuntimeError as e:
raise RuntimeError(f"Failed to install {package_name}: {e}")
def check_and_install():
try:
# Check and install Poppler
try:
run_command("pdftoppm -h")
except RuntimeError:
if platform.system() == "Darwin": # macOS
install_package("brew install poppler", "Poppler")
elif platform.system() == "Linux": # Linux
install_package(
"sudo apt-get update && sudo apt-get install -y poppler-utils",
"Poppler",
)
else:
raise RuntimeError(
"Please install Poppler manually from https://poppler.freedesktop.org/"
)
except RuntimeError as err:
print(f"Error during installation: {err}", file=sys.stderr)
sys.exit(1)
if __name__ == "__main__":
check_and_install()
================================================
FILE: py_zerox/tests/test_noop.py
================================================
def test_noop():
assert 1 == 1
================================================
FILE: shared/systemPrompt.txt
================================================
Convert the following document to markdown.
Return only the markdown with no explanation text. Do not include delimiters like '''markdown or '''.
RULES:
- You must include all information on the page. Do not exclude headers, footers, or subtext.
- Charts & infographics must be interpreted to a markdown format. Prefer table format when applicable.
- Images without text must be replaced with [Description of image](image.png)
- For tables with double headers, prefer adding a new column.
- Logos should be wrapped in square brackets. Ex: [Coca-Cola]
- Prefer using ☐ and ☑ for check boxes.
================================================
FILE: shared/test.json
================================================
[
{
"file": "0001.png",
"expectedKeywords": [
[
"Department of the Treasury",
"Internal Revenue Service",
"U.S. Individual Income Tax Return",
"2023",
"OMB No. 1545-0074",
"IRS Use Only",
"Do not write or staple in this space.",
"For the year",
"Jan. 1",
"Dec. 31, 2023",
"other tax year beginning",
"See separate instructions",
"Your first name and middle initial",
"JOSEPH R",
"Last name",
"BIDEN JR",
"Your social security number",
"If joint return, spouse's first name and middle initial",
"JILL T",
"Spouse's social security number",
"Home address (number and street)",
"If you have a P.O. box, see instructions",
"Apt. no.",
"City, town, or post office",
"If you have a foreign address, also complete spaces below",
"Foreign country name",
"Foreign province/state/county",
"Foreign postal code",
"Presidential Election Campaign",
"Check here if you, or your spouse if filing jointly, want $3 to go to this fund",
"Checking a box below will not change your tax or refund",
"Check only one box",
"Single",
"Married filing jointly (even if only one had income)",
"Married filing separately (MFS)",
"Head of Household (HoH)",
"Qualifying surviving spouse (QSS)",
"If you checked the MFS box, enter the name of your spouse",
"If you checked the HOH or QSS box, enter the child's name if the qualifying person is a child but not your dependent",
"Digital Assets",
"At any time during 2023",
"receive (as a reward, award, or payment for property or services)",
"sell, exchange, or otherwise dispose of a digital asset",
"or a financial interest in a digital asset",
"See instructions",
"Standard Deduction",
"Someone can claim",
"You as a dependent",
"Your spouse as a dependent",
"Spouse itemizes on a separate return or you were a dual-status alien",
"Age/Blindness",
"Were born before January 2, 1959",
"Are blind",
"Is blind",
"If more than four dependents",
"If more than four dependents, see Instr. and check here",
"First name Last name",
"Social security number",
"Relationship to you",
"Check the box if qualifies for (see instr.)",
"Child tax credit",
"Credit for other dependents",
"Total amount from Form(s) W-2, box 1 (see instructions)",
"STMT 1",
"485,985"
]
]
},
{
"file": "0002.pdf",
"expectedKeywords": [
[
"Deloitte",
"Quality System Audit for BioTech Innovations (Pty) Ltd",
"02 October 2024",
"67 River Rd",
"Kensington",
"Johannesburg",
"Gauteng",
"2094",
"South Africa",
"06h30",
"Contact Person",
"Kathy Margaret",
"+14 22 045 4952",
"Opening Meeting Agenda",
"Introductions",
"Review of audit agenda",
"Confirmation of availability for required persons",
"Anna Pojawis",
"Tyler Maran",
"Kathy Margaret",
"Mark Ding",
"CTO",
"CEO",
"Associate",
"Eng",
"[email protected]",
"[email protected]",
"[email protected]",
"[email protected]",
"QAC Auditor",
"David Thompson",
"Lead Quality Auditor",
"NTA Services on behalf of BioTech Innovations",
"Biopharmaceuticals",
"Page 1 of 7",
"DELOITTE QUALITY ASSURANCE CONSULTANTS, LLC",
"450 Oceanview Drive",
"Suite 200",
"Santa Monica",
"CA 90405",
"(800) 555-1234",
"(310) 555-7890",
"(310) 555-4567",
"www.qaconsultants.com",
"[email protected]"
]
]
},
{
"file": "0003.pdf",
"expectedKeywords": [
[
"Short Time Overload",
"Insulation Resistance",
"Endurance",
"Damp Heat with Load",
"Solderability",
"Dielectric Withstanding Voltage",
"Temperature Coefficient",
"Pulse Overload",
"Resistance To Solvent",
"Terminal Strength",
"Carbon Film Leader Resistor",
"Environmental Characteristics",
"Rated Continuous Working Voltage",
"Storage Temperature",
"JIS-C-5201-1 5.5",
"JIS-C-5201-1 5.6",
"JIS-C-5201-1 7.10",
"JIS-C-5201-1 7.9",
"JIS-C-5201-1 6.5",
"JIS-C-5201-1 5.7",
"Resistance value at room temperature",
"Temperature+100°C",
"JIS-C-5201-1 5.8",
"JIS-C-5201-1 6.9",
"Direct Load for 10 seconds",
"In the direction off the terminal leads",
"working voltage for 1000",
"overload voltage for 5 seconds",
"10000 cycles with 1 second",
"and 25 seconds",
"Trichroethane",
"70±2°C",
"100V",
"DC",
"40±2°C",
"245±5°C",
"RCWV*2.5",
"4 times RCWV for 10000 cycles",
">1000MΩ",
"100KΩ±3%",
"100KΩ±5%",
"90% min. Coverage",
"350ppm",
"500ppm",
"100KΩ",
"700ppm",
"1500ppm",
"No deterioration of coatings and markings",
"Tensile: 2.5 kg"
]
]
},
{
"file": "0004.pdf",
"expectedKeywords": [
[
"Improving",
"throught",
"Han",
"Abstract",
"Howard",
"community",
"530%",
"language",
"Wang",
"environments",
"confronted",
"technique",
"same",
"accurate",
"Methodology",
"dilution",
"illustrated",
"continually",
"stems"
],
[
"Gemma",
"repeated",
"pronounced",
"designer",
"Llama",
"7.68±2.41s",
"significant",
"reiteration",
"22.50±11.19s",
"generate",
"reasoning",
"question",
"larger",
"2.0%",
"context",
"deviation",
"3.8B",
"ratio",
"unnecessary",
"Repetition",
"Experimentation",
"Conclusion",
"Transformers",
"PyTorch",
"24GB",
"Abdin",
"530%",
"Xu",
"github"
],
[
"References",
"Jacobs",
"Nguyen",
"2404.14219",
"Abhimanyu",
"Ahmad",
"Yang",
"herd",
"Team",
"Riviere",
"Mesnard",
"Bobak",
"practical",
"Lei",
"186345",
"Frontiers",
"Jingsen",
"Jiakai",
"autonomous",
"compromising",
"quality",
"ICLR",
"Saizheng",
"HotpotQA",
"explainable",
"Natural",
"Representations",
"2023",
"Narasimhan",
"Eleventh"
]
]
},
{
"file": "0005.png",
"expectedKeywords": [
[
"Quest",
"Diagnostics",
"Maternal",
"Insurance",
"DEPENDENT",
"Fasting",
"Ultrasound",
"QuestDiagnostics.com/MLCP",
"Medicaid",
"hyperGly-hCG",
"Sequential",
"Stepwise",
"SST",
"MSAFP",
"30294",
"Quad",
"Penta",
"LMP",
"Ethnic",
"fetuses",
"insulin-dependent",
"Trisomy",
"Down",
"Syndrome",
"Donor",
"cigarettes",
"Nuchal",
"Ultrasonographer",
"QD20330K"
]
]
},
{
"file": "0006.png",
"expectedKeywords": [
[
"T-Mobile",
"Monthly",
"100469352",
"Recurring",
"****9541",
"MasterCard",
"Equipment",
"Installment",
"Plan",
"(XXX)-49X-5XXX",
"$00.00",
"KickBack",
"2GB",
"AutoPay",
"my.t-mobile.com",
"t-mobile.com/pay",
"*PAY (*XXX)",
"1004693520967854630205189400463125091"
]
]
},
{
"file": "0007.png",
"expectedKeywords": [
[
"LABCORP",
"3106932528",
"Farzam",
"044-494-4741-0",
"(310) 849-7991",
"GENESEE",
"1619295474",
"Fasting",
"rflx",
"25-Hydroxy",
"x10E3/uL",
"Neutrophils",
"Lymphs (Absolute)",
"Baso (Absolute)",
"Not Estab.",
"Creatinine",
"Africn",
"mL/min/1.73",
"02/16/19 1809 ET",
"800-859-6046"
]
]
},
{
"file": "0008.png",
"expectedKeywords": [
[
"LABCORP LCLS BULK",
"3106932528",
"2 of 4",
"Farzam",
"Potassium",
"mmol/L",
"A/G Ratio",
"AST (SGOT)",
"High",
"Abnormal",
"UA/M w/rflx",
"IU/L",
"1.005 - 1.030",
"Negative/Trace",
"Semi-Qn",
"None seen/Few",
"Result 1",
"02/16/19 1809 ET",
"FINAL REPORT",
"800-859-6046",
"© 1995-2019"
]
]
},
{
"file": "0009.png",
"expectedKeywords": [
[
"3106932528",
"Page 3 of 4",
"LabCorp",
"04/11/1992",
"044-494-4741-0",
"02/13/2019 1000 Local",
"50,000-100,000",
"Triglycerides",
"Hemoglobin A1c",
"4.8 - 5.6",
"Glycemic",
"uIU/mL",
"25-Hydroxy",
"Low",
"Institute",
"25-OH",
"Bischoff-Ferrari",
":1911-30",
"Please Note:",
"Therapeutic",
"02/16/19 1809 ET",
"confidential",
"800-859-6046",
"© 1995-2019"
]
]
},
{
"file": "0010.png",
"expectedKeywords": [
[
"02/16/2019 6:09:27 PM",
"3106932528",
"Farzam",
"WEISER, CHERYL",
"60070006294",
"REFERENCE INTERVAL",
"Ferritin, Serum",
"ng/mL",
"15 - 150",
"13112 Evening Creek",
"92128-4108",
"Dir: Jenny Galloway",
"800-859-6046",
"858-668-3700",
"02/16/19 1809 ET",
"4 of 4",
"confidential",
"800-859-6046",
"America® Holdings"
]
]
},
{
"file": "0011.png",
"expectedKeywords": [
[
"Bill of Lading",
"21099992723",
"123 Pick Up Street",
"Business without Dock or Forklift",
"800-866-4870",
"15 minutes",
"[email protected]",
"[email protected]",
"Pallet",
"48x40x48 in",
"300 Lbs",
"declared value",
"per",
"TOTAL WEIGHT",
"14706(c)(1)(A)",
"certify",
"Shipper Signature",
"Freight Loaded",
"Carrier Signature",
"Pickup Date",
"LTL Only",
"123 Delivery Street",
"Vancouver",
"BCV5K 0A4",
"Canada",
"9:00AM",
"5:00PM",
"Protect From Freeze",
"EXAMPLE CARRIER",
"Seal Number",
"Freight Charges Term",
"3rd Party",
"BOL",
"POD"
]
]
},
{
"file": "0012.png",
"expectedKeywords": [
[
"UniformSoftware",
"transport business name",
"ABN",
"xx xxx xxx xxx",
"Tel",
"Cust A/c",
"Due Date",
"Inv. Date",
"Invoice #",
"1/1/2017",
"ref#1",
"MASCOT",
"SYDNENHAM",
"4PCS",
"0.500",
"187",
"$11.00",
"$0.88",
"$11.00",
"EAST BOTANY",
"PORT BOTANY",
"4.070",
"6459",
"1659",
"5/15/2017",
"44PCS",
"Please remit payment to",
"bank name",
"xxx-xxx",
"TOTAL",
"155.00",
"15.50",
"182.90"
]
]
},
{
"file": "0013.pdf",
"expectedKeywords": [
[
"MSFT",
"AAPL",
"10.6%",
"Portfolio Value",
"$545k",
"Allocation",
"82%/18% Stocks/Bonds",
"Cash",
"5.75",
"US Stocks",
"78.65",
"Non-US Stocks",
"3.45",
"Bonds",
"12.12",
"Other/Not Clsfd",
"0.03",
"Portfolio Construction",
"78.9%",
"2.4%",
"18.1%",
"0.5%",
"Cash",
"ETFs",
"Mutual Funds",
"Individual Stocks",
"Total Stock Holdings",
"634",
"Total Bond Holdings",
"9,748"
],
[
"Your Equity Allocation",
"82%",
"Individual Stocks 96%",
"ETFs 4%",
"25% - 35%",
"600",
"Diversification Analysis",
"Owning multiple funds does not always produce the anticipated diversification benefits.",
"Equity Style",
"Large",
"Mid",
"Small",
"20",
"13",
"32",
"15",
"Sectors",
"20.62",
"28.73",
"Consumer Def",
"Bmark%",
"1.45"
],
[
"Tax Transition/Overlap Analysis",
"$138,494",
"$19,943",
"PFIZER INC",
"($453)",
"2026",
"RIO TINTO PLC SPONSORED ADR",
"ISHARES 5-10 YEAR IG CORP BOND ETF",
"NXST",
"$1,198.95",
"$11.631.26",
"PROCTER & GAMBLE CO",
"PG",
"$283",
"($928)",
"1/31/2023"
]
]
},
{
"file": "0014.png",
"expectedKeywords": [
[
"Last quote update:",
"2013-11-30 05:20:51",
"54,659",
"(9.2%)",
"14.41%",
"Profit/Loss",
"66,087",
"0.36%",
"0.17%",
"60,105",
"If [Target-Actual]",
"Emerging",
"Real Estate",
"Healthcare",
"Technology",
"8%",
"Defensive",
"Sensitivity",
"DELL",
"Years-Current",
"161,200",
"-6",
"627",
"0.43%",
"29,455",
"138,898",
"53,601",
"13",
"0.08%",
"0.17%",
"13.10%",
"MTD",
"US Broad",
"CAD",
"XIU.TO"
]
]
},
{
"file": "0015.png",
"expectedKeywords": [
[
"通行费",
"湖北增值税电子普通发票",
"04201700112",
"499099660821",
"12636666022332927910",
"校 验 码",
"开票日期",
"2018年03月23日",
"密 码 区",
"030243319>1*+9239+></<59+3-",
"786-646/16<248>/-/029029>746",
"7>44<97929379677-955315>+-",
"6/53<13+8*010369194565>-5/04",
"武汉经济技术开发区车城大道7号 84289348",
"经营租货通行费",
"鄂AHG248",
"通行日期起",
"20180212",
"通行日期止",
"20180212",
"¥286.23",
"税率",
"3%",
"¥294.84",
"湖北随岳南高速公路有限公司",
"91420000753416406R",
"发票专用章",
"收 款 人",
"龙梦媛",
"复 核:",
"陈煜",
"贰佰玖拾肆元捌角贰分",
"武汉市经济开发区17C1地块东和中心B栋1601号027-83458755"
]
]
},
{
"file": "0016.pdf",
"expectedKeywords": [
[
"Valori Nutrizionali",
"Nutrition Facts",
"Nährwerte",
"Valores Nutricionales",
"Energia/Energy/Energie/Valor energético",
"Kj 2577/Kcal 616",
"di cui acidi grassi saturi/of which saturates/davon gesättigte Fettsäuren/de las cuales saturadas",
"8.3g",
"di cui zuccheri/of which sugars/davon Zucker/de los cuales azúcar",
"5.1 g",
"Proteine/Protein/Eiweiß/Proteínas",
"IT Ingredienti",
"100%",
"PEANUT BUTTER",
"BØWLPRØS",
"latte e sesamo",
"8 054145 812068",
"[email protected]",
"www.bowlpros.com",
"Da consumarsi preferibilmente entro il/Best before/Mindestens haltbar bis/Consumir preferentemente antes del"
]
]
},
{
"file": "0017.pdf",
"expectedKeywords": [
[
"御 見 積 書",
"⻑崎北郵便局",
"書類番号",
"202410-01439[01]",
"下記の通り御見積申し上げます",
"何卒御用命下さる様お願い申し上げます",
"発行日",
"⻑崎北郵便局(親時計更新)",
"60日間",
"90日間",
"荷造運賃",
"812-0026",
"福岡県福岡市博多区上川端町8-18",
"092-281-0020",
"092-281-0112",
"取付工事費",
"御見積金額",
"712,000",
"品目コード",
"親時計4回線壁掛型 タイマー・チャイム",
"単価",
"金額",
"※設置工事費含む",
"※キャンペーン期間中の為設置工事費無料です。",
"KM-82TC-4P",
"標準価格計",
"割引合計額",
"御了承ください",
"特注品の内容に関しては営業担当者へ確認ください"
]
]
},
{
"file": "0018.pdf",
"expectedKeywords": [
[
"Tesla Inc.",
"10-Q",
"Texas",
"I.R.S. Employer Identification No.",
"91-2197729",
"Zip Code",
"78725",
"Registrant’s telephone number, including area code",
"(512) 516-8177",
"Title of each class",
"Common stock",
"Name of each exchange on which registered",
"The Nasdaq Global Select Market"
],
[
"Balance Sheets",
"Current assets",
"September 30, 2024",
"December 31, 2023",
"Cash and cash equivalents",
"18,111",
"16398",
"Short-term investments",
"Accounts receivable",
"Inventory",
"Total assets",
"119,852",
"106,618",
"Liabilities",
"Current liabilities",
"Accounts payable",
"14,654",
"14,431",
"Total liabilities",
"49,142",
"43,009",
"Equity",
"$0.001 par value; 100 shares authorized; no shares issued and outstanding",
"Total liabilities and equity",
"119,852",
"106,618"
],
[
"Consolidated Statements of Operations",
"(in millions, except per share data)",
"(unaudited)",
"Revenues",
"Three Months Ended September 30,",
"Net income",
"2,183",
"1,878",
"4,821",
"7,031",
"Net income attributable to common stockholders",
"2,167",
"1,853",
"4,774",
"7,069"
],
[
"Consolidated Statements of Comprehensive Income",
"Comprehensive income attributable to common stockholders",
"2,620",
"1,571",
"4,903",
"6,738",
"(289)",
"(343)"
]
]
},
{
"file": "0019.png",
"expectedKeywords": [
[
"Walmart",
"win $1000",
"7N5N1V1XCQDQ",
"317-851-1102",
"Mgr",
"JAMIE BROOKSHIRE",
"882 S. STATE ROAD 136",
"GREENWOOD",
"IN 46143",
"05483",
"TATER TOTS",
"001312000025",
"2.96",
"SNACK BARS",
"002190848816",
"4.98",
"VOIDED ENTRY",
"HRI CL CHS",
"GALE",
"000000000003K",
"32.00",
"BAGELS",
"001376402801",
"4.66",
"TOTAL",
"144.02",
"CASH",
"150.02",
"CHANGE DUE",
"6.00",
"ITEMS SOLD 26",
"0783 5080 4072 3416 2496 6",
"04/27/19",
"12:59:46",
"Scan with Walmart app to save receipt"
]
]
},
{
"file": "0020.png",
"expectedKeywords": [
[
"ZESTADO EXPRESS",
"ABN",
"16 112 221 123",
"Bill To",
"Custom Board Makeers",
"Administration Centre",
"12 Salvage Road",
"Acaacia Ridge BC QLD 4110",
"Australia",
"Issue Date",
"8th December 2021",
"Account No.",
"101234",
"Invoice Amount",
"$5,270.00 AUD",
"Pay By",
"15th December 2021",
"Waybill No.",
"012345A",
"GC12345",
"[email protected]",
"Surfboards",
"1,010 kg",
"Kahului Maui Hawaii Port",
"Maui Surf Shop",
"13' x 18\" x 3\"",
"Eco High Performance Mini Mals",
"$1,980.00",
"4321-A1 XL Custom Supreme Light Stand Up",
"13' x 18\" x 3\"",
"$1,035.00",
"$1,210.00"
]
]
},
{
"file": "0021.png",
"expectedKeywords": [
[
"RZECZPOSPOLITA",
"BRANDT",
"27.06.1988 CRIVITZ",
"4c. STAROSTA POLICKI",
"880627",
"00359/19/3211",
"8806670172",
"AM/B1/B",
"PL"
]
]
},
{
"file": "0022.png",
"expectedKeywords": [
[
"Ohio",
"LICENSE",
"STRICKLAND",
"Mike Rankin",
"Registrar BMV",
"JANE Q",
"9900TL5467900302",
"LICENSE NO.",
"TL545786",
"07-09-1962",
"04-01-2009",
"07-09-2012",
"ENDORS",
"07-09-1962",
"BRO",
"POWER OF ATTY",
"LIFE SUSTAINING",
"EQUIPMENT"
]
]
},
{
"file": "0023.png",
"expectedKeywords": [
[
"DRIVER",
"Tennessee",
"123456789",
"02/11/2026",
"DOB",
"02/11/1974",
"ISS",
"02/11/2019",
"REST",
"5'-05",
"1234567890123456",
"SAMPLE",
"JANICE",
"123 MAIN STREET",
"APT. 1",
"NASHVILLE",
"37210",
"DL"
]
]
},
{
"file": "0024.png",
"expectedKeywords": [
[
"CALIFORNIA",
"1986",
"N8685798",
"4641 Hayvenhurst",
"91316",
"Blk",
"Brn",
"8-29-58",
"PRE LIC EXP",
"CLASS 3",
"CORRECTIVE",
"SECTION 12804",
"Michael Joe Jackson",
"4-28-83",
"clckjw",
"AHIJ"
]
]
},
{
"file": "0025.png",
"expectedKeywords": [
[
"NEW YORK",
"LEARNER PERMIT",
"Mark J.F. Schroeder",
"Commissioner",
"987 654 321",
"BLU",
"DOB",
"Issued",
"10/31/2026",
"Michelle M. Motorist",
"MOTORIST",
"MICHELLE",
"2345 ANYWHERE STREET",
"12222",
"U18 UNTIL",
"OCT 21",
"U21 UNTIL",
"OCT 31 03",
"123456789"
]
]
},
{
"file": "0026.png",
"expectedKeywords": [
[
"California",
"DRIVER LICENSE",
"11234568",
"IMA",
"2570 24TH STREET",
"ANYTOWN",
"95818",
"DOB",
"08/31/1977",
"RSTR",
"DONOR",
"VETERAN",
"BRN",
"WGT",
"125 lb",
"00/00/0000NNNAN/ANFD/YY",
"08/31/2009",
"Cardholder",
"0831977"
]
]
},
{
"file": "0027.png",
"expectedKeywords": [
[
"Pennsylvania",
"IDENTIFICATION",
"visitPA.com",
"99 999 999",
"DUPS",
"01/07/1973",
"ANDREW JASON",
"123 MAIN STREET",
"HARRISBURG",
"17101-0000",
"01/31/2026",
"01/07/2022",
"HGT",
"1234567890123",
"456789012345",
"Andrew",
"Sample"
]
]
},
{
"file": "0028.png",
"expectedKeywords": [
[
"CALIFORNIA",
"1970",
"Ronald J. Thomas",
"ADMINISTRATOR",
"David Franklin Thomas",
"5798 Olive St",
"Calif",
"95969",
"W106438",
"Gry",
"COLOR EYES",
"Blu",
"DATE OF BIRTH",
"Aug 20, 1892",
"Corrective",
"D. F. Thomas",
"6,000 LBS",
"Paradise",
"8-4-65"
]
]
},
{
"file": "0029.png",
"expectedKeywords": [
[
"CALIFORNIA",
"OPERATING",
"1984",
"W0209369",
"James Scott Garner",
"35 Oakmont Dr",
"90049",
"Blk",
"Brn",
"6-3",
"PRE LIC EXP",
"4-7-23",
"CORRECTIVE",
"CONDITIONS",
"CLASS 3",
"SECTION 12804",
"James S. Garner",
"3-25-80",
"Gln rc",
"LAMINATE"
]
]
},
{
"file": "0030.png",
"expectedKeywords": [
[
"CALIFORNIA",
"DRIVER",
"RENEWAL",
"BIRTHDAY",
"N2287802",
"Shanaberger",
"1541 Beloit Ave",
"#208",
"90025",
"Brn",
"HEIGHT",
"5-6",
"130",
"CORRECTIVE",
"CONDITIONS",
"CLASS 3",
"SECTION 12804",
"08-21-80",
"Tor mw",
"LAMINATE"
]
]
},
{
"file": "0031.png",
"expectedKeywords": [
[
"SIGNATURE",
"TITULAIRE",
"FIRMA DEL TITULAR",
"PASAPORTE",
"UNITED STATES OF AMERICA",
"Codigo",
"546844936",
"Apellidos",
"ABRENICA",
"Date de naissance",
"Lugar de nacimiento",
"NEW YORK",
"06 Jun 2016",
"Autoridad",
"SEE PAGE 27",
"<USAAABRENICA<<JARED<MICHAEL",
"5468449363USA0102100M2106054275193173<681306"
]
]
},
{
"file": "0032.png",
"expectedKeywords": [
[
"ENDORSEMENTS AND LIMITATIONS",
"OBSERVATIONS BEGINNING",
"MENTIONS ET RESTRICTIONS",
"l'intention",
"GK141569",
"CANADA",
"Pays émetteur",
"passeport",
"MANN",
"Prénoms",
"JASKARAN SINGH",
"CANADIENNE",
"naissance",
"JNAOIA",
"délivrance",
"TORONTO",
"CANMANN<<JASKARAN<<SINGH",
"GK141569<8CAN8607294M2707202",
"ED197265"
]
]
},
{
"file": "0033.png",
"expectedKeywords": [
[
"Assinatura",
"Ce passeport",
"caso de incapacidad",
"AA000000",
"REPÚBLICA FEDERATIVA DO BRASIL",
"PAÍS EMISSOR",
"SOBRENOME",
"FARIAS DOS SANTOS",
"NOME",
"RODRIGO",
"BRASILEIRO(A)",
"16 MAR/MAR 2004",
"BRASÍLIA/DF",
"AMANDA FARIAS DOS SANTOS",
"Res. CNJ 131/11, Art. 13.",
"P<BRAFARIAS<DOS<SANTOS<<RODRIGO",
"AA000000<0BRA0403162M2507053"
]
]
},
{
"file": "0034.png",
"expectedKeywords": [
[
"ENDORSEMENTS AND LIMITATIONS",
"PAGE 5 (IF APPLICABLE)",
"MENTIONS ET RESTRICTIONS",
"HK444152",
"CANADA",
"Issuing Country",
"WITTMACK",
"Prénoms",
"BRIAN FREDRICK",
"Date de naissance",
"01 NOV 47",
"CONSORT CAN",
"Date d'expiration",
"MISSISSAUGA",
"P<CANWITTMACK<<BRIAN<FREDRICK",
"HK444152<5CAN4711018M2606130",
"EGD69494"
]
]
},
{
"file": "0035.png",
"expectedKeywords": [
[
"OBSERVATIONS OFFICIELLES (11)",
"UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND",
"518242591",
"Surname/Nom (1)",
"BRITISH CITIZEN",
"CROYDON",
"Date of expiry",
"24 APR / AVR 24",
"Holder's signature",
"P<GBRWEBB<<JAMES<ROBERT",
"5182425917GBR7702174M2404244"
]
]
},
{
"file": "0036.png",
"expectedKeywords": [
[
"RESIDENZA",
"TORINO (TO)",
"COLORE DEGLI OCCHI",
"MARRONI",
"REPUBBLICA ITALIANA",
"Tipo. Type. Type.",
"Codice Paese.",
"YA8116396",
"TREVISAN",
"Nome. Given Names. Prénoms. (2)",
"FELTRE (BL)",
"MINISTRO AFFARI ESTERI",
"E COOPERAZIONE INTERNAZIONALE",
"Firma del titolare",
"P<ITATREVISAN<<MARCO",
"YA81163966ITA6602129M2507097"
]
]
},
{
"file": "0037.png",
"expectedKeywords": [
[
"We the People",
"insure domestic Tranquility",
"Constitution for the United States of America",
"SIGNATURE OF BEARER",
"UNITED STATES OF AMERICA",
"Código",
"910239248",
"Apellidos",
"OBAMA",
"Date de naissance",
"17 Jan 1964",
"ILLINOIS, U.S.A.",
"Authority / Autorité / Autoridad",
"SEE PAGE 51",
"P<USABOBAMA<<MICHELLE",
"9102392482USA6401171F1812051900781200<129676"
]
]
},
{
"file": "0038.png",
"expectedKeywords": [
[
"Of the United States",
"PASSPORT",
"PASSEPORT",
"PASAPORTE",
"Code / Code / Código",
"488839667",
"VOLD",
"STEPHEN HANSL",
"Nationality / Nationalité / Nacionalidad",
"WASHINGTON, U.S.A.",
"21 May 2012",
"United States Department of State",
"Mentions Spéciale",
"SEE PAGE 51",
"P<USAVOLD<<STEPHEN<HANSL",
"4888396671USA6008156M220520112117147143<509936"
]
]
},
{
"file": "0039.png",
"expectedKeywords": [
[
"insure domestic Tranquility",
"Constitution for the United States of America.",
"PASSPORT",
"PASSEPORT",
"PASAPORTE",
"USA",
"Type / Type / Tipo",
"963545637",
"JOHN",
"15 Mar 1996",
"Fecha de expedición",
"United States Department of State",
"Endorsements",
"Mentions Spéciales",
"Anotaciones",
"SEE PAGE 17",
"P<USAJOHN<<DOE",
"9635456374USA9603150M27041402O2113962<804330"
]
]
},
{
"file": "0040.png",
"expectedKeywords": [
[
"OBSERVATIONS OFFICIELLES (11)",
"UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND",
"Code/Code",
"925600253",
"UK SPECIMEN",
"Prénoms (2)",
"ANGELA ZOE",
"Nationality",
"Nationalité",
"CROYDON",
"16 JUL / JUIL 10",
"Holder's signature",
"P<GBRUK<SPECIMEN<<ANGELA<ZOE<<<<<<<<<<<<<<<<",
"9256002538GBR8809117F2007162"
]
]
}
]
================================================
FILE: shared/outputs/0001.md
================================================
# Form 1040
Department of the Treasury - Internal Revenue Service
## U.S. Individual Income Tax Return
## 2023
OMB No. 1545-0074
IRS Use Only - Do not write or staple in this space.
For the year Jan. 1 – Dec. 31, 2023, or other tax year beginning \_\_\_, ending \_\_\_
See separate instructions.
Your first name and middle initial
JOSEPH R.
Last name
BIDEN JR.
Your social security number
If joint return, spouse's first name and middle initial
JILL T.
Last name
BIDEN
Spouse's social security number
Home address (number and street). If you have a P.O. box, see instructions.
Apt. no.
City, town, or post office. If you have a foreign address, also complete spaces below.
State
ZIP Code
Foreign country name
Foreign province/state/country
Foreign postal code
**Presidential Election Campaign**
Check here if you, or your spouse if filing jointly, want $3 to go to this fund. Checking a box below will not change your tax or refund.
☑ You ☑ Spouse
### Filing Status
Check only one box.
□ Single
☑ Married filing jointly (even if only one had income)
□ Married filing separately (MFS)
□ Head of Household (HoH)
□ Qualifying surviving spouse (QSS)
If you checked the MFS box, enter the name of your spouse. If you checked the HOH or QSS box, enter the child's name if the qualifying person is a child but not your dependent
### Digital Assets
At any time during 2023, did you (a) receive (as a reward, award, or payment for property or services); or (b) sell, exchange, or otherwise dispose of a digital asset (or a financial interest in a digital currency)? (See instructions.)
□ Yes ☑ No
### Standard Deduction
**Someone can claim:**
□ You as a dependent
□ Your spouse as a dependent
□ Spouse itemizes on a separate return or you were a dual-status alien
### Age/Blindness
**You:** ☑ Were born before January 2, 1959 □ Are blind
**Spouse:** ☑ Was born before January 2, 1959 □ Is blind
### Dependents
If more than four dependents, see Instr. and check here □
| (see instructions): (1) First Name Last Name | (2) Social security number | (3) Relationship to you | (4) Check the box if qualifies for (see instr.): Child tax credit | (4) Check the box if qualifies for (see instr.): Credit for other dependents |
| -------------------------------------------- | -------------------------- | ----------------------- | ----------------------------------------------------------------- | ---------------------------------------------------------------------------- |
| | | | □ | □ |
| | | | □ | □ |
| | | | □ | □ |
| | | | □ | □ |
- **1a** Total amount from Form(s) W-2, box 1 (see instructions) STMT 1 **1a** 485,985.
================================================
FILE: shared/outputs/0002.md
================================================
# Deloitte.
## Quality System Audit for BioTech Innovations (Pty) Ltd
### Opening Meeting Sign-in Sheet
**Audit Date:** 02 October 2024
**Time:** 06h30
**Supplier:** BioTech Innovations (Pty) Ltd; 67 River Rd, Kensington, Johannesburg, Gauteng, 2094 South Africa.
**Contact Person:** Kathy Margaret
**Phone Number:** +14 22 045 4952
**Opening Meeting Agenda:**
- Introductions
- Review of audit agenda
- Confirmation of availability for required persons
**Opening Meeting Attendees:**
| No. | Print Name | Job Title | Email | Signature |
| --- | -------------- | --------- | --------------------------- | ----------- |
| 1 | Anna Pojawis | CTO | [email protected] | [Signature] |
| 2 | Tyler Maran | CEO | [email protected] | [Signature] |
| 3 | Kathy Margaret | Associate | [email protected] | [Signature] |
| 4 | Mark Ding | Eng | [email protected] | [Signature] |
| 5 | | | | |
**QAC Auditor:** David Thompson, Lead Quality Auditor, NTA Services on behalf of BioTech Innovations (Biopharmaceuticals).
---
Page 1 of 7
**DELOITTE QUALITY ASSURANCE CONSULTANTS, LLC**
450 Oceanview Drive, Suite 200 - Santa Monica, CA 90405 - PHONE (800) 555-1234 (310) 555-7890 - FAX (310) 555-4567
Website: [www.qaconsultants.com](http://www.qaconsultants.com) - Email: [email protected]
================================================
FILE: shared/outputs/0003.md
================================================
# [RS]
## Carbon Film Leader Resistor - Resistor
## Environmental Characteristics
| Item | Requirement | Test Method |
| ------------------------------- | ------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
| Short Time Overload | ±(0.75\*±0.05Ω) | JIS-C-5201-1 5.5 RCWV\*2.5 or Max. overload voltage for 5 seconds |
| Insulation Resistance | >1000MΩ | JIS-C-5201-1 5.6 Apply 100VDC for 1 minute |
| Endurance | ±(3%\*0.05Ω) | JIS-C-5201-1 7.10 70±2°C, Max. working voltage for 1000 hrs with 1.5 hrs "ON" and 0.5 hrs "OFF" |
| Damp Heat with Load | 100KΩ±3% 100KΩ±5% | JIS-C-5201-1 7.9 40±2°C, 90-95% R.H. Max. working voltage for 1000 hrs with 1.5 hrs "ON" and 0.5 hrs "OFF" |
| Solderability | 90% min. Coverage | JIS-C-5201-1 6.5 245±5°C for 3 seconds |
| Dielectric Withstanding Voltage | By Type | JIS-C-5201-1 5.7 Apply Max. Overload Voltage for 1 minute |
| Temperature Coefficient | < 100KΩ +350ppm~-500ppm 100KΩ~1MΩ-0ppm~-700ppm > 1 MΩ-0ppm~-1500ppm | Resistance value at room temperature and room Temperature+100°C |
| Pulse Overload | ±(1%\*±0.05Ω) | JIS-C-5201-1 5.8 4 times RCWV for 10000 cycles with 1 second "ON" and 25 seconds "OFF" |
| Resistance To Solvent | No deterioration of coatings and markings | JIS-C-5201-1 6.9 Trichroethane for 1 min. with ultrasonic |
| Terminal Strength | Tensile: 2.5 kg | Direct Load for 10 seconds In the direction off the terminal leads |
## Rated Continuous Working Voltage(RCWV) = √(P\*R)
## Storage Temperature: 25±3°C; Humidity < 80% RH
================================================
FILE: shared/outputs/0004.md
================================================
Focused ReAct: Improving ReAct through Reiterate and Early Stop
================================================================================
**Shuoqui Li**
Carnegie Mellon University
[email protected]
**Han Xu**
University of Illinois at Urbana-Champaign
[email protected]
**Haipeng Chen**
William & Mary
[email protected]
---
**Abstract**
Large language models (LLMs) have significantly improved their reasoning and decision-making capabilities, as seen in methods like ReAct. However, despite its effectiveness in tackling complex tasks, ReAct faces two main challenges: losing focus on the original question and becoming stuck in action loops. To address these issues, we introduce Focused ReAct, an enhanced version of the ReAct paradigm that incorporates reiteration and early stop mechanisms. These improvements help the model stay focused on the original query and avoid repetitive behaviors. Experimental results show accuracy gains of 18% to 530% and a runtime reduction of up to 34% compared to the original ReAct method.
1 Introduction
----------------
Recent advancements in large language models (LLMs) have enabled more sophisticated techniques for reasoning and decision-making. One such technique, the ReAct framework (Reason+Act), has gained popularity for its dual approach of alternating between reasoning and action (Yao et al., 2023). This combination allows ReAct to excel in handling complex tasks by better adapting to dynamic environments (Wang et al., 2024).
Despite its strengths in general question-answering (QA), ReAct sometimes falls short in delivering accurate results, as demonstrated in Figure 1. When confronted with lengthy or intricate questions, the model—paired with the ReAct framework—may lose track of the original question, leading to incorrect responses. Additionally, the ReAct method can get trapped in repetitive action loops, continuously performing the same action without progressing toward a solution. To address these challenges, we introduce Focused ReAct, an enhanced version of the original ReAct method specifically designed for QA tasks.
2 Methodology
----------------
This section outlines the two core features of the Focused ReAct method: reiterate and early stop.
2.1 Reiterate to Solve Context Loss
As indicated in the introduction, The first challenge stems from the extended reasoning process, where the relevance of the original question diminishes as additional reasoning and actions are appended. To resolve this, the concept of reiterate is introduced. In this approach, the original question is restated at the beginning of each reasoning step in the ReAct cycle. This technique is illustrated in the green-highlighted region of Figure 2.
By reiterating the original question at each step, the model continually emphasizes the user’s query, preventing it from being overshadowed by the increasingly long context that ReAct tends to create. This simple yet effective strategy mitigates the context dilution problem illustrated in Figure 1, ensuring that the output remains aligned with the user’s intent.
---
![Figure 1: The yellow highlighted region illustrates where the model loses focus on the original question, while the red highlighted region depicts repeated actions that result in a failed response.](#)
Question: Howard Van Doren Shaw was the designer of which 1917 Progressive Era community?
Thought: 1917 is the year of the community, so I need to search Howard Van Doren Shaw and Progressive Era.
Action: Search [Howard Van Doren Shaw]
Observation: I found the following information in order to answer "Howard Van Doren Shaw was the designer of which 1917 Progressive Era community?" …...
Thought: Hereby I have provide enough information. Please give out the answer to "Howard Van Doren Shaw was the designer of which 1917 Progressive Era community?" with the format of ……
Action: Finish [Marktown]
Evaluation: True
Figure 2: The QA process by Focused ReAct for the same question, which applies reiteration (highlighted in yellow) and early stop (highlighted in red) to resolve the context loss and the repeated action issue.
2.2 Early Stop to Prevent Action Repetition
The second challenge, as outlined in the introduction, occurs when the model gets caught in repetitive loops, generating the same response without progressing toward the correct answer. To tackle this, we propose an early stop mechanism. It assumes that by the time a duplicate action occurs, sufficient information has been gathered.
When the program detects repeated actions, it triggers a termination request - highlighted in red in Figure 2 - instructing the model to generate a final answer based on the existing information. This approach prevents unnecessary repetition and helps the QA process arrive at an accurate response more efficiently.
3 Experimentation
We evaluate Focused ReAct against the ReAct baseline using the Gemma 2 2B (Team et al., 2024), Phi-3.5-mini 3.8B (Abdin et al., 2024) and Llama 3.1 8B (Dubey et al., 2024) models. The implementation uses the PyTorch and Transformers libraries¹, with experiments conducted on a single NVIDIA L4 GPU with 24GB of memory. The dataset consists of 150 QA tasks, randomly selected from HotPotQA (Yang et al., 2018). We measure accuracy as the ratio of correctly answered tasks to the total number of tasks, while runtime is recorded for the completion of each task.
Table 1 presents the accuracy comparison between the vanilla ReAct and Focused ReAct across the Gemma 2, Phi-3.5, and Llama 3.1 models. Focused ReAct demonstrates an 18%-530% improvement in accuracy.
| Model | ReAct | Focused ReAct | abs./rel. diff |
|---------------|-------|---------------|----------------|
| Gemma 2 2B | 2.0% | 12.6% | +10.6 / 530% |
| Phi-3.5-mini 3.8B | 22.0% | 26.0% | +4.0 / 18% |
| Llama 3.1 8B | 14.0% | 23.3% | +9.3 / 66% |
Table 2: Runtime Comparison (Average and Std) for ReAct vs. Focused ReAct
| Model | ReAct | Focused ReAct | abs./rel. diff |
|---------------|---------------|---------------|----------------|
| Gemma 2 2B | 11.68±2.66s | 7.68±2.41s | -4.0 / 34% |
| Phi-3.5-mini 3.8B | 23.23±8.42s | 22.50±11.19s | -0.73 / 3% |
| Llama 3.1 8B | 24.10±23.48s | 23.12±25.35s | -0.98 / 4% |
Table 2 summarizes the average runtime and standard deviation (std) for both the original ReAct and Focused ReAct methods. Models with fewer parameters show a 34% reduction in runtime, while models with larger parameter sizes exhibit no significant decrease. This discrepancy may be attributed to the fact that smaller models, with weaker reasoning capabilities, benefit more from Focused ReAct optimizations. In contrast, larger models are more robust at maintaining context and performing deeper reasoning, which may reduce the relative impact of Focused ReAct’s efficiency gains. As a result, the runtime benefits are less pronounced compared to smaller models.
4 Conclusion
This paper identifies two common issues with the ReAct method in QA: losing focus on the original question during extended reasoning and becoming stuck in repetitive action loops. To overcome these problems, we propose Focused ReAct, which incorporates reiteration and early stop to improve upon the ReAct framework. Compared to the original ReAct method, the new approach achieves accuracy improvements between 18% and 530%, along with a reduction in runtime of up to 34%.
For future work, we plan to extend Focused ReAct to a broader range of tasks and scenarios, evaluate its generalizability and robustness, and explore techniques to further accelerate its performance (Xu et al., 2024).
¹Our code implementation and experiments are available at https://github.com/vmd3i/Focused-ReAct.
## References
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. *arXiv preprint arXiv:2404.14219*.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. *arXiv preprint arXiv:2407.21783*.
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhuptiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024. Gemma 2: Improving open language models at a practical size. *arXiv preprint arXiv:2408.00118*.
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024. A survey on large language model based autonomous agents. *Frontiers of Computer Science, 18*(6):186345.
Han Xu, Jingyang Ye, Yutong Li, and Haipeng Chen. 2024. Can speculative sampling accelerate react without compromising reasoning quality? In *The Second Tiny Papers Track at ICLR 2024*.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In *Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2023. React: Synergizing reasoning and acting in language models. In *The Eleventh International Conference on Learning Representations*.
================================================
FILE: shared/outputs/0005.md
================================================
# Quest Diagnostics
## Maternal Serum Screening
### BILL TO:
□ My Account
□ Insurance Provided
□ Lab Card/Select
□ Patient
### PRINT PATIENT NAME (LAST, FIRST, MIDDLE)
### REGISTRATION \# (IF APPLICABLE)
### DATE OF BIRTH
### SEX
### LAB REFERENCE
### CELL PHONE
### PATIENT ID \# / MRN
### PATIENT PHONE
### PATIENT EMAIL ADDRESS
### PRINT NAME OF INSURED/RESPONSIBLE PARTY (LAST, FIRST, MIDDLE) - IF OTHER THAN PATIENT
### PATIENT STREET ADDRESS (OR INSURED/RESPONSIBLE PARTY)
### APT
### KEY
### CITY
### STATE
### ZIP
### PRIMARY INSURANCE
### RELATIONSHIP TO INSURED:
□ SELF
□ SPOUSE
□ DEPENDENT
### PRIMARY INSURANCE CO. NAME
### MEMBER / INSURED ID NO. \#
### GROUP \#
### INSURANCE ADDRESS
### ACCOUNT
### CITY
### STATE
### ZIP
### ACCOUNT \#
### NAME
### ADDRESS
### CITY, STATE, ZIP
### TELEPHONE \#
### DATE COLLECTED
### TIME
□ AM
□ PM
### TOTAL VOL/hrs
\_\_\_ ML \_\_\_ HR
□ Fasting
□ Non Fasting
### NPI/UPIN ORDERING/SUPERVISING PHYSICIAN AND/OR PAYERS (MUST BE INDICATED)
### ADDIT'L PHYS.: Dr.
### NPI/UPIN
### NON-PHYSICIAN PROVIDER:
### NAME
### I.D.#
### □ Fax Result to:
### Send Duplicate Report to:
### Client # OR NAME:
### ADDRESS
### CITY
### STATE
### ZIP
### DID YOU KNOW
- Reflex Tests Are Performed At An Additional Charge.
- PSC Appointment Website And Telephone Number Information Listed On The Back.
- Each Sample Should Be Labeled With At Least Two Patient Identifiers At Time Of Collection.
### ICD Diagnosis Codes Are Mandatory. Fill in the applicable fields below.
### ABN required for tests with these symbols
Medicare Limited Coverage Tests
- @ - May not be covered for the reported diagnosis.
- F - Has a frequency limit or usage coverage.
- & - As a blood donor screening test/experimental kit.
- B - Has both donor and frequency/medical coverage limitations.
Provide signed ABN when necessary
### Provide
- ICD Diagnosis Code(s)
- ABN when necessary
### Visit QuestDiagnostics.com/MLCP for Medicare coverage guidelines
### ICD Codes (enter all that apply)
Many payers (including Medicaid) have medical necessity requirements. You should only order those tests which are medically necessary for the diagnosis and treatment of the patient.
### 1st TRIMESTER SCREENING ♦ # (1st Trimester Screening does not detect oNTDs) | Red Top SST - 1 Tube
**16020** □ 1st Trimester Screen hyperGly-hCG (PAPP-A, h-hCG) (9.0-13.9 wks gestation)
@ **16145** □ 1st Trimester Screen, hCG (PAPP-A, hCG) (10.0-13.9 wks gestation)
### INTEGRATED/SEQUENTIAL SCREENING
@ **16131** □ Sequential Integrated Screen **Part 1** (PAPP-A, hCG) # ♦ (10.0-13.9 weeks gestation)
@ **16133** □ Sequential Integrated Screen **Part 2** (AFP, hCG, uE3, DIA) (15.0-22.9 weeks gestation)
**Speciment # from Part 1**
@ **16463** □ Stepwise Sequential Screen **Part 1** (PAPP-A, hCG) # ♦ (10.0-13.9 weeks gestation)
@ **16465** □ Stepwise Sequential Screen **Part 2** (AFP, hCG, uE3, DIA) (15.0-22.9 weeks gestation)
**Speciment # from Part 1**
**16148** □ Integrated Screen **Part 1** (PAPP-A) # ♦ (NT required) (9.0-13.9 weeks gestation)
@ **16150** □ Integrated Screen **Part 2** (AFP, hCG, uE3, DIA) (15.0-22.9 weeks gestation)
**Speciment # from Part 1**
**16165** □ Serum Integrated Screen **Part 1** (PAPP-A) # (NT not required) (9.0-13.9 weeks gestation)
@ **16167** □ Serum Integrated Screen **Part 2** (AFP, hCG, uE3, DIA) (15.0-22.9 weeks gestation)
**Speciment # from Part 1**
### 2nd TRIMESTER SCREENING # | Red Top SST - 1 Tube
@ **5059** □ Maternal Serum AFP (MSAFP) (15.0-22.9 weeks gestation)
**Screens for open neural tube detects (oNTDs) only**
@ **30294** □ Quad Screen (AFP, hCG, uE3, DIA) (15.0-22.9 weeks gestation)
@ **15934** □ Penta Screen (AFP, hCG, uE3, DIA) (15.0-22.9 wks gestation)
### THIS INFORMATION IS REQUIRED FOR ALL TESTS ~ CALL 866-GENEINFO IF YOU HAVE ANY QUESTIONS
Date of Birth: \_\_/\_\_/\_\_
Collection Date: \_\_/\_\_/\_\_
Maternal Weight: \_\_LBS
### # THIS INFORMATION IS REQUIRED FOR PART 1 OF INTEGRATED/SEQUENTIAL SCREENING, 1ST AND 2ND TRIMESTER SCREENING | Red Top SST - 1 Tube
Estimated Date of Delivery (EDD): \_\_/\_\_/\_\_ determined by: ☐ Ultrasound ☐ Last Menstrual Period (LMP) ☐ Physical Exam
Mother's Ethnic Origin: ☐ African American ☐ Asian ☐ Caucasian ☐ Hispanic ☐ Other: \_\_
Number of Fetuses: ☐ One ☐ Two ☐ More than 2 | How many fetuses? \_\_
| Yes | No | |
| --- | --- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ☐ | ☐ | Patient is an insulin-dependent diabetic prior to pregnancy |
| ☐ | ☐ | This is a repeat specimen for this pregnancy (Repeat testing following a screen positive result for Down syndrome or Trisomy 18 is **NOT** recommended) |
| ☐ | ☐ | History of neural tube defect If yes, explain: \_\_ |
| ☐ | ☐ | Previous pregnancy with Down Syndrome |
| ☐ | ☐ | Pregnancy is from a donor egg Age of Donor at time of Egg Retrieval: \_\_ |
| ☐ | ☐ | Patient currently smokes cigarettes |
| | | **Other Relevant Clinical Information:** |
### ♦ THIS INFORMATION IS REQUIRED FOR 1st TRIMESTER SCREENING AND PART 1 INTEGRATED/SEQUENTIAL SCREENING.
Ultrasound date \_\_/\_\_/\_\_
Ultrasonographer's name \_\_/\_\_/\_\_
**Nuchal Translucency Measurement Credentialing Agency (required, check one box)**
☐ NTQR Ultrasonographer's ID# \_\_ | Location ID# \_\_ | Reading Physician ID# \_\_
☐ FMF Ultrasonographer's ID# \_\_
☐ Other (List) \_\_ | ID# \_\_
Quest, Quest Diagnostics, the associated logo and all associated Quest Diagnostics marks are the trademarks of Quest Diagnostics Incorporated. © Quest Diagnostics Incorporated. All rights reserved. QD20330K. Revised 8/20.
================================================
FILE: shared/outputs/0006.md
================================================
# Monthly Statement
**Statement for**
FIRST NAME LAST NAME
**Account number**
100469352
**Bill close date**
Feb 14, 2024
### FIRST NAME LAST NAME
### ADDRESS
### CITY, STATE, ZIP CODE
## Balance
| Description | Amount |
| -------------------------------- | ----------- |
| Previous balance | $95.80 |
| Credits and one time charges | ($95.80) |
| Payments received | ($0.00) |
| **Balance forward - Credit** | **($0.00)** |
| Current charges | |
| Recurring | $69.23 |
| Other | $41.47 |
| **Total amount due by 02/28/24** | **$110.70** |
Your bill is scheduled for an automatic payment on 02/28/24 using MasterCard \*\*\*\*9541.
_"Change from last month" does not include changes to taxes and fees unless associated with changes in service plan, Equipment Installment Plan, or Lease._
## Current charges
| Account and lines | Recurring | Other | Change from last month |
| -------------------- | ----------- | ---------- | ---------------------- |
| Account | $25.00 | - | $95.80 ▼ |
| (XXX)-49X-5XXX | - | $28.10 | - |
| (XXX)-49X-5XXX | - | $1.50 | - |
| (XXX)-49X-5XXX | - | - | $26.60 ▲ |
| (XXX)-49X-5XXX | - | - | $23.10 ▲ |
| (XXX)-49X-5XXX | $20.23 | - | $20.03 ▲ |
| (XXX)-49X-5XXX | - | $7.12 | $10.17 ▲ |
| New - (XXX)-496-5XXX | $24.00 | $4.75 | - |
| (XXX)-496-5XXX | - | - | - |
| 3 additional lines | - | $00.00 | 15.50 ▲ |
| **Subtotal** | **$69.23** | **$41.47** | - |
| **Total** | **$110.70** | | - |
## Bill highlights
Follow numbers throughout bill.
- **1** **_You had usage charges._**
- **2** **_Your plan changed._**
- **3** **_An Equipment Installment Plan (EIP) monthly charge was billed for the first time._**
- **i** **_Your billing address has changed._**
- **4** **_One or more of your lines did not receive KickBack because they exceeded 2GB._**
- **i** **_You're getting an AutoPay discount for using AutoPay!_**
- **i** **_Visit my.t-mobile.com or the T-Mobile App to pay your bill online, manage your account and get product support._**
**Questions?** For more information visit my.t-mobile.com.
Please detach this portion and return with your payment. Please make sure address shows through window.
**T-Mobile**
**Statement for:** FIRST NAME LAST NAME
**Account number:** 100469352
**Pay online:** t-mobile.com/pay
**Pay by phone:** *PAY (*XXX)
**Scan to pay**
| Total amount due by 02/28/24 | Amount enclosed |
| ---------------------------- | --------------- |
| **$110.70** | **AutoPay** |
T-MOBILE
PO BOX 8668
CITY, STATE, ZIP CODE
□ **Sign up for AutoPay** - Check box and complete reverse side.
□ **If you changed your address** - Check box and record new address on the reverse side.
1004693520967854630205189400463125091
================================================
FILE: shared/outputs/0007.md
================================================
02/16/2019 6:09:27 PM | FROM: LABCORP LCLS BULK | TO: 3106932528 | LABCORP | Page 1 of 4
TO: Michael Farzam MD
# LabCorp | Patient Report
**Specimen ID:** 044-494-4741-0
**Control ID:** 60070006294
**Acct #:** 04275945
**Phone:** (310) 849-7991
**Rte:** 00
**WEISER, CHERYL**
444 N GENESEE AVE
LOS ANGELES CA 90036
(213) 400-3914
**Michael Farzam MD**
258 North Bowling Green Way
LOS ANGELES CA 90049
### Patient Details
**DOB:** 04/11/1992
**Age(y/m/d):** 026/10/02
**Gender:** F
**SSN:**
**Patient ID:**
### Specimen Details
**Date collected:** 02/13/2019 1000 Local
**Date received:** 02/13/2019
**Date entered:** 02/13/2019
**Date reported:** 02/16/2019 1809 ET
### Physician Details
**Ordering:** M FARZAM
**Referring:**
**ID:**
**NPI:** 1619295474
**General Comments & Additional Information**
**Total Volume:** Not Provided
**Fasting:** Yes
**Ordered Items**
CBC With Differential/Platelet; Comp. Metabolic Panel (14); UA/M w/rflx Culture, Routine; Lipid Panel; Vitamin B12 and Folate; Hemoglobin A1c; Thyroxine (T4) Free, Direct, S; TSH; Vitamin D, 25-Hydroxy; Uric Acid; Iron; Ferritin, Serum
| TESTS | RESULT | FLAG | UNITS | REFERENCE INTERVAL | LAB |
| ---------------------------------- | ------- | -------- | ----------- | ------------------ | --- |
| **CBC With Differential/Platelet** | | | | | |
| WBC | 7.5 | | x10E3/uL | 3.4 - 10.8 | 01 |
| RBC | 4.24 | | x10E6/uL | 3.77 - 5.28 | 01 |
| Hemoglobin | 12.8 | | g/dL | 11.1 - 15.9 | 01 |
| Hematocrit | 38.4 | | % | 34.0 - 46.6 | 01 |
| MCV | 91 | | fL | 79 - 97 | 01 |
| MCH | 30.2 | | pg | 26.6 - 33.0 | 01 |
| MCHC | 33.3 | | g/dL | 31.5 - 35.7 | 01 |
| RDW | 13.2 | | % | 12.3 - 15.4 | 01 |
| Platelets | 202 | | x10E3/uL | 150 - 379 | 01 |
| Neutrophils | 41 | | % | Not Estab. | 01 |
| Lymphs | 48 | | % | Not Estab. | 01 |
| Monocytes | 8 | | % | Not Estab. | 01 |
| Eos | 3 | | % | Not Estab. | 01 |
| Basos | 0 | | % | Not Estab. | 01 |
| Neutrophils (Absolute) | 3.1 | | x10E3/uL | 1.4 - 7.0 | 01 |
| **Lymphs (Absolute)** | **3.5** | **High** | x10E3/uL | 0.7 - 3.1 | 01 |
| Monocytes (Absolute) | 0.6 | | x10E3/uL | 0.1 - 0.9 | 01 |
| Eos (Absolute) | 0.2 | | x10E3/uL | 0.0 - 0.4 | 01 |
| Baso (Absolute) | 0.0 | | x10E3/uL | 0.0 - 0.2 | 01 |
| Immature Granulocytes | 0 | | % | Not Estab. | 01 |
| Immature Grans (Abs) | 0.0 | | x10E3/uL | 0.0 - 0.1 | 01 |
| **Comp. Metabolic Panel (14)** | | | | | |
| Glucose | 76 | | mg/dL | 65 - 99 | 01 |
| **BUN** | **32** | **High** | mg/dL | 6 - 20 | 01 |
| Creatinine | 0.74 | | mg/dL | 0.57 - 1.00 | 01 |
| eGFR If NonAfricn Am | 112 | | mL/min/1.73 | >59 | |
| eGFR If Africn Am | 129 | | mL/min/1.73 | >59 | |
| **BUN/Creatinine Ratio** | **43** | **High** | | 9 - 23 | |
---
Date Issued: 02/16/19 1809 ET | **FINAL REPORT** | Page 1 of 4
This document contains private and confidential health information protected by state and federal law.
If you have received this document in error, please call 800-859-6046
© 1995-2019 Laboratory Corporation of America® Holdings
All Rights Reserved - Enterprise Report Version: 1.00
================================================
FILE: shared/outputs/0008.md
================================================
02/16/2019 6:09:27 PM | FROM: LABCORP LCLS BULK | TO: 3106932528 | LABCORP | Page 2 of 4
TO: Michael Farzam MD
# LabCorp | Patient Report
**Patient: WEISER, CHERYL**
**DOB:** 04/11/1992
**Patient ID:**
**Control ID:** 60070006294
**Specimen ID:** 044-494-4741-0
**Date collected:** 02/13/2019 1000 Local
| TESTS | RESULT | FLAG | UNITS | REFERENCE INTERVAL | LAB |
| ------------------------------------------------------------------- | ------------------ | ------------ | ------ | ------------------ | --- |
| Sodium | 136 | | mmol/L | 134 - 144 | 01 |
| Potassium | 3.9 | | mmol/L | 3.5 - 5.2 | 01 |
| Chloride | 102 | | mmol/L | 96 - 106 | 01 |
| Carbon Dioxide, Total | 21 | | mmol/L | 20 - 29 | 01 |
| Calcium | 9.2 | | mg/dL | 8.7 - 10.2 | 01 |
| Protein, Total | 7.4 | | g/dL | 6.0 - 8.5 | 01 |
| Albumin | 4.7 | | g/dL | 3.5 - 5.5 | 01 |
| Globulin, Total | 2.7 | | g/dL | 1.5 - 4.5 | |
| A/G Ratio | 1.7 | | | 1.2 - 2.2 | |
| Bilirubin, Total | 0.3 | | mg/dL | 0.0 - 1.2 | 01 |
| Alkaline Phosphatase | 92 | | IU/L | 39 - 117 | 01 |
| **AST (SGOT)** | **41** | **High** | IU/L | 0 - 40 | 01 |
| **ALT (SGPT)** | **83** | **High** | IU/L | 0 - 32 | 01 |
| **UA/M w/rflx Culture, Routine** | | | | | |
| Urinalysis Gross Exam | | | | | 01 |
| Specific Gravity | 1.019 | | | 1.005 - 1.030 | 01 |
| pH | 5.5 | | | 5.0 - 7.5 | 01 |
| **Urine-Color** | **Brown** | **Abnormal** | | Yellow | 01 |
| **Appearance** | **Cloudy** | **Abnormal** | | Clear | 01 |
| **WBC Esterase** | **1+** | **Abnormal** | | Negative | 01 |
| **Protein** | **2+** | **Abnormal** | | Negative/Trace | 01 |
| Glucose | Negative | | | Negative | 01 |
| Ketones | Negative | | | Negative | 01 |
| **Occult Blood** | **3+** | **Abnormal** | | Negative | 01 |
| Bilirubin | Negative | | | Negative | 01 |
| Urobilinogen, Semi-Qn | 0.2 | | mg/dL | 0.2 - 1.0 | 01 |
| Nitrite, Urine | Negative | | | Negative | 01 |
| Microscopic Examination<br>See below: | | | | | 01 |
| WBC | 0-5 | | /hpf | 0 - 5 | 01 |
| **RBC** | **11-30** | **Abnormal** | /hpf | 0 - 2 | 01 |
| Epithelial Cells (non renal) | 0-10 | | /hpf | 0 - 10 | 01 |
| **Crystals** | **Present** | **Abnormal** | | N/A | 01 |
| Crystal Type | Amorphous Sediment | | | N/A | 01 |
| Mucus Threads | Present | | | Not Estab. | 01 |
| Bacteria | Few | | | None seen/Few | 01 |
| Urinalysis Reflex<br>This specimen has reflexed to a Urine Culture. | | | | | 01 |
| Urine Culture, Routine<br>Final report | | | | | 01 |
| Result 1 | | | | | |
---
Date Issued: 02/16/19 1809 ET | **FINAL REPORT** | Page 2 of 4
This document contains private and confidential health information protected by state and federal law.
If you have received this document in error, please call 800-859-6046
© 1995-2019 Laboratory Corporation of America® Holdings
All Rights Reserved - Enterprise Report Version: 1.0
================================================
FILE: shared/outputs/0009.md
================================================
02/16/2019 6:09:27 PM | FROM: LABCORP LCLS BULK | TO: 3106932528 | LABCORP | Page 3 of 4
TO: Michael Farzam MD
# LabCorp | Patient Report
**Patient: WEISER, CHERYL**
**DOB:** 04/11/1992
**Patient ID:**
**Control ID:** 60070006294
**Specimen ID:** 044-494-4741-0
**Date collected:** 02/13/2019 1000 Local
| TESTS | RESULT | FLAG | UNITS | REFERENCE INTERVAL | LAB |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------- | -------- | ------ | ------------------ | --- |
| Mixed urogenital flora<br>50,000-100,000 colony forming units per mL | | | | | 01 |
| **Lipid Panel** | | | | | |
| **Cholesterol, Total** | **202** | **High** | mg/dL | 100 - 199 | 01 |
| Triglycerides | 48 | | mg/dL | 0 - 149 | 01 |
| HDL Cholesterol | 82 | | mg/dL | >39 | 01 |
| VLDL Cholesterol Cal | 10 | | mg/dL | 5 - 40 | 01 |
| **LDL Cholesterol Calc** | **110** | **High** | mg/dL | 0 - 99 | 01 |
| **Vitamin B12 and Folate** | | | | | |
| **Vitamin B12** | **>1999** | **High** | pg/mL | 232 - 1245 | 01 |
| Folate (Folic Acid), Serum<br>Note:<br>A serum folate concentration of less than 3.1 ng/mL is considered to represent clinical deficiency. | 5.1 | | ng/mL | >3.0 | 01 |
| **Hemoglobin A1c** | | | | | |
| Hemoglobin A1c<br>Please Note:<br>- Prediabetes: 5.7 - 6.4<br>- Diabetes: >6.4<br>- Glycemic control for adults with diabetes: <7.0 | 4.8 | | % | 4.8 - 5.6 | 01 |
| **Thyroxine (T4) Free, Direct, S** | | | | | |
| T4,Free(Direct) | 1.07 | | ng/dL | 0.82 - 1.77 | 01 |
| TSH | 2.200 | | uIU/mL | 0.450 - 4.500 | 01 |
| **Vitamin D, 25-Hydroxy**<br>**Vitamin D deficiency has been defined by the Institute of Medicine and an Endocrine Society practice guideline as a level of serum 25-OH vitamin D less than 20 ng/mL (1,2). The Endocrine Society went on to further define vitamin D insufficiency as a level between 21 and 29 ng/mL (2).<br>1. IOM (Institute of Medicine). 2010. Dietary reference intakes for calcium and D. Washington DC: The National Academies Press.<br>2. Holick MF, Binkley NC, Bischoff-Ferrari HA, et al. Evaluation, treatment, and prevention of vitamin D deficiency: an Endocrine Society clinical practice guideline. JCEM. 2011 Jul; 96(7):1911-30.** | **10.7** | **Low** | ng/mL | 30.0 - 100.0 | 01 |
| **Uric Acid** | | | | | |
| Uric Acid<br>Please Note:<br>Therapeutic target for gout patients: <6.0 | 2.8 | | mg/dL | 2.5 - 7.1 | 01 |
---
Date Issued: 02/16/19 1809 ET | **FINAL REPORT** | Page 3 of 4
This document contains private and confidential health information protected by state and federal law.
If you have received this document in error, please call 800-859-6046
© 1995-2019 Laboratory Corporation of America® Holdings
All Rights Reserved - Enterprise Report Version: 1.00
================================================
FILE: shared/outputs/0010.md
================================================
02/16/2019 6:09:27 PM | FROM: LABCORP LCLS BULK | TO: 3106932528 | LABCORP | Page 4 of 4
TO: Michael Farzam MD
# LabCorp | Patient Report
**Patient: WEISER, CHERYL**
**DOB:** 04/11/1992
**Patient ID:**
**Control ID:** 60070006294
**Specimen ID:** 044-494-4741-0
**Date collected:** 02/13/2019 1000 Local
| TESTS | RESULT | FLAG | UNITS | REFERENCE INTERVAL | LAB |
| --------------- | ------ | ---- | ----- | ------------------ | --- |
| Iron | 123 | | ug/dL | 27 - 159 | 01 |
| Ferritin, Serum | 44 | | ng/mL | 15 - 150 | 01 |
01 SO
LabCorp San Diego
13112 Evening Creek Dr So Ste 200, San Diego, CA
92128-4108
Dir: Jenny Galloway, MD
For inquiries, the physician may contact **Branch: 800-859-6046 Lab: 858-668-3700**
---
Date Issued: 02/16/19 1809 ET | **FINAL REPORT** | Page 4 of 4
This document contains private and confidential health information protected by state and federal law.
If you have received this document in error, please call 800-859-6046
© 1995-2019 Laboratory Corporation of America® Holdings
All Rights Reserved - Enterprise Report Version: 1.00
================================================
FILE: shared/outputs/0011.md
================================================
# Bill of Lading
**Ship Date** September 13, 2021
**Bill of Lading Number:** 21099992723
---
## Ship From
**Example Pick Up Company**
123 Pick Up Street
Vancouver, BCV5K 0A4, Canada
**Location Type:** Business without Dock or Forklift
(123) 456-7890 (Example Pick Up Contact)
**Pickup Hours:** 9:00AM to 5:00PM
**SID#:** N/A
**Special Handling:** Protect From Freeze
---
## Ship To
**Example Delivery Company**
123 Delivery Street
Toronto, ONM1B 0A1, Canada
**Location Type:** Business with Dock or Forklift
(123) 456-7890 (Example Consignee)
**Delivery Hours:** 8:00AM to 5:00PM
**CID#:** N/A
**Special Handling:** N/A
---
## Third Party Freight Charges Bill To
**Freightera Logistics Inc.**
408 - 55 Water Street, Office 8036
Vancouver, BC V6B 1A1
800-866-4870
---
## Carrier Information
**Carrier Name:** EXAMPLE CARRIER
**Trailer Number:**
**Seal Number(s):**
**Pro Number:** N/A
**Quote#:** N/A
**Customer PO#:** N/A
**Freight Charges Term** (prepaid unless otherwise marked)
- [ ] Prepaid
- [ ] Collect
- [x] 3rd Party
- [ ] Master BOL with underlying BOLs
**Please note:** Email carrier invoices, BOL, POD, any accessorial document to <[email protected]>. Additional accessorials MUST be approved by Freightera Dispatch at email <[email protected]> or call (800) 886-4870 Ext. 2.
---
## Special Instructions
- **Shipper:** Please call 15 minutes before pickup
---
## Handling Unit
| Qty | Type | Wt | Hzmt | Non-Stackable? | Description | NMFC | Class |
|-----|-------|-----|------|----------------|-----------------|------|-------|
| 1 | Pallet| 300 lb | No | No (48x40x48 in) | Example Goods | | |
**TOTAL WEIGHT 300 Lbs**
---
Where the rate is dependent on value, shippers are required to state specifically in writing the agreed or declared value on the property as follows:
"The agreed or declared value of property is specifically stated by shipper to be not exceeding _______ per _______."
Received, subject to individually determined rates or contracts that have been agreed upon in writing between carrier and shipper, if applicable, otherwise to the rates, classification and rules that have been established by the carrier and are available to the shipper, on request, and to all applicable state and federal regulations.
**Note:** Liability Limitation for loss or damage in this shipment may be applicable. See 49 U.S.C. -14706(c)(1)(A) and (B).
---
This is to certify the above named materials are properly classified, packaged, marked, and labeled, and are in proper condition for transportation according to the applicable regulations of the DOT.
| Freight Loaded | Freight Counted |
|----------------|-----------------|
| By Shipper | By Shipper |
| By Driver | By Driver/pallets said to contain |
| | By Driver Pieces|
Carrier acknowledges receipt of packages and required placards. Carrier certifies emergency response information was made available and/or carrier has DOT emergency response guidebook or equivalent documentation in the vehicle. Property described above is in good order, except as noted.
**Shipper Signature** ______________________ **Date** ___________
**Carrier Signature** ______________________ **Pickup Date** ___________
================================================
FILE: shared/outputs/0012.md
================================================
# UniformSoftware
**transport business name**
P.O. Box xxx, City
State / Province
ABN xx xxx xxx xxx
Tel: xx xxxx xxxx
---
## INVOICE
| Client | Name |
|--------|------|
| | Address |
| | City State ZIP |
| | Email |
---
| Date | Cust Ref | From | To | Descrip | Cubic | Weight | Rate | Fuel Levy | Extras | Total |
|-----------|----------|-------------|------------|---------|-------|--------|-------|-----------|--------|-------|
| 1/1/2017 | ref#1 | MASCOT | SYDENHAM | 4PCS | 0.500 | 187 | $11.00| $0.88 | | $11.00|
| 2/2/2017 | ref#2 | ALEXANDRIA | WARRIEWOOD | 4PCS | 5.000 | 1086 | $12.00| $0.96 | | $12.00|
| 3/3/2017 | ref#3 | EAST BOTANY | WARRIEWOOD | 1PC | 0.600 | 117 | $13.00| $1.04 | | $13.00|
| 4/4/2017 | ref#4 | SYDENHAM | WARRIEWOOD | 4PCS | 0.700 | 1317 | $14.00| $1.12 | | $14.00|
| 5/5/2017 | ref#1 | PORT BOTANY | WARRIEWOOD | 3PCS | 0.500 | 102 | $15.00| $1.20 | | $15.00|
| 6/6/2017 | ref#2 | PORT BOTANY | EAST BOTANY| 7PCS | 0.300 | 102 | $16.00| $1.28 | | $16.00|
| 7/7/2017 | ref#3 | PORT BOTANY | WARRWOOD | 4PCS | 6.940 | 1659 | $17.00| $1.36 | | $17.00|
| 8/8/2017 | ref#4 | PORT BOTANY | SOMERSBY | 6PCS | 4.600 | 6459 | $18.00| $1.44 | | $18.00|
| 9/9/2017 | ref#1 | MASCOT | TEMPE | 15PCS | 1.700 | 821 | $19.00| $1.52 | | $19.00|
| 5/15/2017 | ref#2 | ALEXANDRIA | WARRIEWOOD | 44PCS | 1.480 | 374 | $20.00| $1.60 | | $20.00|
---
**Please remit payment to:**
Bank: bank name
BSB: xxx-xxx
A/c No.: xxxxxxxxxx
Name: account name
---
| SUBTOTAL | 155.00 |
|----------|--------|
| GST 10.00% | 15.50 |
| TOTAL | 182.90 |
================================================
FILE: shared/outputs/0013.md
================================================
## Executive Summary
### Key Observations
1. **Concentration risk in Microsoft Corp (MSFT) and Apple Inc (AAPL); combining for 10.6% of the equity allocation.**
2. **International equity allocation falls below Clark Capital’s target range.**
3. **Mid cap stocks overweight, leaving Small cap stocks underweight.**
4. **Overweight Healthcare, leaving Consumer Discretionary underweight relative to the benchmark weight.**
5. **Fixed Income has shorter duration than current Clark Capital positioning — Limiting income generation potential.**
6. **Fixed Income has a concentrated maturity schedule between 0-3 years.**
- **Portfolio Value: $545k**
- **Allocation: 82%/18% Stocks/Bonds**
- **Profile: Growth**
### Asset Allocation
- **Cash:** 5.75%
- **US Stocks:** 78.65%
- **Non-US Stocks:** 3.45%
- **Bonds:** 12.12%
- **Other/Not Clsfd:** 0.03%
### Portfolio Construction
- **Cash**
- **ETFs**
- **Mutual Funds**
- **Individual Stocks**
**Total Stock Holdings: 634**
**Total Bond Holdings: 9,748**
---
*For one-on-one use with a client’s financial advisor only. Please see end disclosures for important information.*
Page 3
## Your Equity Allocation – 82%
### Key Observations
1. **Individual Stocks 96%, ETFs 4%**
2. **Size:** Mid cap stocks overweight, leaving Small cap stocks underweight
3. **Sectors:** Overweight Healthcare, leaving Consumer Discretionary underweight relative to the benchmark weight.
4. **International:** 4% of equity – Lower than Clark Capital’s target range of 25%-35%.
5. **Direct and indirect stock holdings in the portfolio total over 600.**
### Diversification Analysis
#### Some Portfolio Overlap – Specific Concentration Risk in MSFT and AAPL
1. Owning multiple funds does not always produce the anticipated diversification benefits. Several securities (e.g. Microsoft, Apple, Meta Platforms) are held directly and by an additional fund.
2. Fund overlap exacerbates concentration concern within the portfolio, Microsoft Corp (MSFT) and Apple Inc (AAPL) combine for 10.6% of the equity allocation, creating excessive exposure to single stock fluctuations.
---
**Equity Style**
| | Value | Blend | Growth |
|--------|-------|-------|--------|
| Large | 20 | 13 | 32 |
| Mid | 15 | 15 | 2 |
| Small | 2 | 1 | 0 |
---
**Sectors:**
- **Cyclical**
- Basic Matls: 3.00% (Bmark 2.43%)
- Consumer Cycl: 5.10% (Bmark 11.01%)
- Financial Svs: 10.90% (Bmark 12.77%)
- Real Estate: 1.62% (Bmark 2.52%)
- **Sensitive**
- Commun Svs: 11.36% (Bmark 8.39%)
- Energy: 3.37% (Bmark 3.91%)
- Industrials: 9.95% (Bmark 8.72%)
- Technology: 27.90% (Bmark 28.95%)
- **Defensive**
- Consumer Def: 6.75% (Bmark 6.24%)
- Healthcare: 16.56% (Bmark 12.68%)
- Utilities: 3.49% (Bmark 2.38%)
- **Not Classified:** 0.00% (Bmark 0.00%)
**Geographic:**
- **Americas**
- Portfolio: 96.86% (Bmark 95.30%)
- North America: 96.55% (Bmark 95.31%)
- Latin America: 0.31% (Bmark 0.00%)
- **Greater Europe**
- Portfolio: 2.25% (Bmark 3.24%)
- United Kingdom: 0.15% (Bmark 0.65%)
- Europe-Developed: 1.96% (Bmark 2.56%)
- Europe-Emerging: 0.00% (Bmark 0.00%)
- Africa/Middle East: 0.14% (Bmark 0.03%)
- **Greater Asia**
- Portfolio: 0.89% (Bmark 1.45%)
- Japan: 0.23% (Bmark 0.94%)
- Australasia: 0.00% (Bmark 0.32%)
- Asia-Developed: 0.51% (Bmark 0.19%)
- Asia-Emerging: 0.15% (Bmark 0.00%)
- **Not Classified:** 0.00% (Bmark 0.00%)
---
*Benchmark indicated is automatically customized by Morningstar based on the broad asset allocation of your portfolio. For benchmark detail, please see information in end disclosures.*
---
For one-on-one use with a client’s financial advisor only. Please see end disclosures for important information.
# Tax Transition/Overlap Analysis
**Objective**
Distribute realized gains out over multiple calendar years
**Market Value: $138,494**
**Unrealized Gains: $19,943**
| Security Name | Ticker | Units | Cost | Value | Gain/Loss | 2024 | 2025 | 2026 |
|----------------------------------------|--------|-------|--------|--------|-----------|------|------|------|
| PFIZER INC | PFE | 23.00 | $1,114.91 | $662 | ($453) | ($453) | | |
| VERIZON COMMUNICATIONS INC | VZ | 24.00 | $1,371.14 | $905 | ($466) | ($466) | | |
| YUM CHINA HOLDINGS INC | YUMC | 16.00 | $956.83 | $679 | ($278) | ($278) | | |
| FOX CORP CL A | FOXA | 28.00 | $1,132.52 | $830 | ($302) | ($302) | | |
| ROBERT HALF INC | RHI | 9.00 | $654.92 | $391 | ($264) | ($264) | | |
| BIO RAD LABS INC CL A | BIO | 2.00 | $819.31 | $645 | ($174) | ($174) | | |
| MEDTRONIC PLC | MDT | 16.00 | $1,560.10 | $1,238 | ($323) | ($323) | | |
| MODERNA INC | MRNA | 13.00 | $1,570.51 | $1,293 | ($278) | ($278) | | |
| HF SINCLAIR CORP | DINO | 14.00 | $908.61 | $778 | ($131) | ($131) | | |
| RIO TINTO PLC SPONSORED ADR | RIO | 9.00 | $770.59 | $670 | ($100) | ($100) | | |
| ARCHER DANIELS MIDLAND COMPANY | ADM | 16.00 | $1,303.25 | $1,156 | ($146) | ($146) | | |
| ISHARES 5-10 YEAR IG CORP BOND ETF | IGIB | 118.00 | $6,905.54 | $6,136 | ($770) | ($770) | | |
| ISHARES 3-7YR TREASURY BOND ETF | IEI | 79.00 | $8,181.57 | $7,953 | ($228) | ($228) | | |
| NEXSTAR MEDIA GROUP INC | NXST | 7.00 | $1,198.95 | $1,097 | ($102) | ($102) | | |
| AFFILIATED MANAGERS GROUP INC | AMG | 4.00 | $843.50 | $805 | ($38) | ($38) | | |
| COGNIZANT TECHNOLOGY SOLUTIONS CORP CL A | CTSH | 17.00 | $1,341.30 | $1,284 | ($57) | ($57) | | |
| LABORATORY CORP OF AMER HOLDINGS NEW | LH | 6.00 | $1,415.64 | $1,364 | ($52) | ($52) | | |
| UNUM GROUP | UNM | 32.00 | $1,456.47 | $1,209 | ($247) | ($209) | | |
| PIMCO MORTGAGE OPPTY'S & BOND INSTL CL | PMZIX | 311.55 | $2,930.57 | $2,957 | $26 | $26 | | |
| NVIDIA CORP | NVDA | 5.00 | $2,477.52 | $2,576 | $98 | $98 | | |
| PIMCO ENHANCED SHORT MATURITY ACTIVE ETF | MINT | 117.00 | $11,631.26 | $11,674 | $44 | $44 | | |
| ARCH CAPITAL GROUP LTD | ACGL | 7.00 | $393.87 | $418 | $24 | $24 | | |
| CONSOLIDATED EDISON INC | ED | 17.00 | $52.45 | $76 | $24 | $24 | | |
| KINGSWAY FINL SUPERMATION HOLDINGS INC | KNDX | 1.00 | $92.15 | $100 | $8 | $8 | | |
| TAIWAN SEMICON MFG CO LTD SPON ADR | TSM | 20.00 | $2,213.93 | $2,287 | $74 | $74 | | |
| CISCO SYSTEMS INC | CSCO | 38.00 | $1,381.08 | $1,970 | $589 | $589 | | |
| ELECTRONIC ARTS INC | EA | 7.00 | $912.42 | $945 | $33 | $33 | | |
| DEVON ENERGY CORP NEW | DVN | 14.00 | $803.98 | $835 | $31 | $31 | | |
| MANULIFE FINANCIAL CORP | MFC | 81.00 | $1,561.98 | $1,592 | $30 | $30 | | |
| PROCTER & GAMBLE CO | PG | 9.00 | $1,284.62 | $1,484 | $200 | $200 | | |
| TEXAS INSTRUMENTS INC | TXN | 22.00 | $3,540.00 | $3,840 | $300 | $300 | | |
| GILEAD SCIENCES INC | GILD | 36.00 | $2,633.50 | $2,916 | $283 | $283 | | |
**Tax Transition**
This approach illustrates how Clark Capital would attempt to maximize the amount of assets immediately managed within a proposed investment, while spreading realized gains out over time. Gain estimates are relevant to incoming securities only and do not reflect gain/loss from regular trading of Clark Capital investments.
In the first year it is possible to target specific tickers for liquidation/incorporation, a percentage of the remaining unrealized gains, or a dollar value. Subsequent years will target identified tickers, unless otherwise indicated. The approach demonstrated here targets specific tickers for liquidation/incorporation into our investment models.
Positions that are held are monitored on an ongoing basis in partnership with Clark Capital and the financial advisor. If there is a desire to liquidate ahead of schedule, client direction would be required.
Gain/loss estimates are based on cost basis data provided to Clark Capital. Actual gains/loss at time of liquidation will vary. Upon arrival, an updated proposed tax transition plan will be prepared and discussed with the financial advisor and Clark Capital’s Tax Transition Specialist. **The final plan will likely vary from the illustration shown here.**
---
**As of 1/31/2023**
For one-on-one use with a client’s financial advisor only. Please see end disclosures for important information.
================================================
FILE: shared/outputs/0014.md
================================================
# Portfolio Slicer - Dashboard
_Last quote update: 2013-11-30 05:20:51 ET_
## Portfolio
- Hers
- Hers-Tax
- His-CDN
- His-Tax
- Joint
## Report Currency
- **Original**
- CAD
- USD
---
### Total Value
**597,298**
- Profit / Loss: 75,651 (14.51%)
- Capital Gain: 66,087 (12.45%)
- Dividends: 9,564 (2.06%)
- Deposits / Withdrawals: 68,800
- Exchange Rate Impact: 22,886 (3.98%)
- Mgmt Fee: 1,031 (0.17%)
### Allocation Target
| Allocation | Target | Actual | $ |
|-----------------|--------|--------|-----------|
| US Broad | 30% | 35% | 211,409 |
| CDN Broad | 32% | 32% | 191,530 |
| Real Estate | 10% | 9% | 53,902 |
| Emerging | 10% | 9% | 53,902 |
| Near Cash | 5% | 5% | 29,865 |
| Gold | 5% | 4% | 23,868 |
| Cash | 5% | 5% | 29,825 |
| **Grand Total** | **100%**| **100%**| **594,659**|
### Allocation
- US Broad: 35%
- CDN Broad: 32%
- Real Estate: 9%
- Emerging: 9%
- Near Cash: 5%
- Gold: 4%
- Cash: 5%
### Sectors
- Financial: 40%
- Industrials: 13%
- Technology: 9%
- Health Care: 7%
- Materials: 7%
- Consumer: 7%
- Real Estate: 6%
- Energy: 5%
- Other: 6%
### Sensitivity
- Cyclic: 21%
- Defensive: 15%
- Sensitive: 64%
### Currency
- CAD: 36%
- USD: 64%
---
### Holdings %
- XFN.TO: 23%
- XIU.TO: 19%
- BND: 8%
- VWO: 7%
- VTI: 6%
- AMZN: 5%
- GLD: 5%
- UBA: 4%
- COST: 4%
- Other: 19%
### Top 10 Winners YTD
| Symbol | Profit/Loss |
|--------|-------------|
| XFN.TO | 26,337 |
| AMZN | 9,214 |
| VTI | 7,543 |
| PFE | 7,414 |
| MSFT | 5,143 |
| XIU.TO | 5,027 |
| C | 5,427 |
| AMD | 3,524 |
| DELL | 3,001 |
| COST | 2,791 |
### Top 10 Losers YTD
| Symbol | Profit/Loss |
|--------|-------------|
| GLD | -8,264 |
| BND | -2,000 |
| VNQ | -1,561 |
| GTY | -714 |
| HD | -471 |
| UBA | 314 |
| WMT | 699 |
| GE | 1,218 |
### Top 10 Dividends YTD
| Symbol | Dividends |
|--------|-----------|
| XFN.TO | 3,192 |
| XIU.TO | 2,400 |
| UBA | 1,240 |
| PFE | 1,206 |
| VWO | 918 |
| MSFT | 610 |
| VTI | 494 |
| TGT | 422 |
| COST | 121 |
---
## Portfolio Overview
| Portfolio | Deposits | Book Value | Equity Value | Cash Value | Total Value | Realized Cap Gain | Unrealized Cap Gain | Cap Gain | Dividends | Profit | Cap Gain Last Day | Y/Y Mgmt Fee % |
|-----------|----------|------------|--------------|------------|-------------|-------------------|---------------------|----------|-----------|--------|------------------|----------------|
| His-CDN | 161,200 | 152,431 | 24,122 | 214,801 | 38,248 | 38,248 | 38,248 | 15,353 | 15,353 | 53,601 | 667 | 0.43% |
| Hers | 95,000 | 70,707 | 13,658 | 103,659 | 939 | 10,165 | 10,165 | 15,767 | 7,926 | 2,110 | 0 | 0.00% |
| Joint | 70,200 | 70,200 | 10,365 | 80,565 | 1,515 | 1,515 | 1,515 | 10,757 | 7,926 | 2,110 | 0 | 0.00% |
| Hers-Tax | 70,200 | 70,200 | 10,365 | 80,565 | 1,515 | 1,515 | 1,515 | 10,757 | 7,926 | 2,110 | 0 | 0.00% |
| His-Tax | 24,200 | 22,888 | 3,477 | 26,365 | 1,454 | 1,454 | 1,454 | 1,454 | 1,454 | 1,454 | 0 | 0.00% |
| **Grand Total** | **458,400** | **437,948** | **542,638** | **54,659** | **597,298** | **1,454** | **104,690** | **106,144** | **32,754** | **138,898** | **2,155** | **0.17%** |
================================================
FILE: shared/outputs/0015.md
================================================
| 01,10,042001700112,07007198,286.23,20180323,12636666022332927910,1992, |
| --- |
| 二维码信息:湖北增值税普通发票 |
| 通行费 |
| 机器编号:499099660821 |
| 名称:武汉市车城物流有限公司 |
| 纳税人识别号:914201007483062457 |
| 地址、电话:武汉经济技术开发区车城大道7号 84289348 |
| 开户行及账号:中国农业银行股份有限公司武汉开发区支行 17-071201040004598 |
| 密码区:030243319>1*+9*239+></<59+3-786-646/16<248>/-029029>746*7>44<97*929379677-955315>*+-6/53<13+8*010369194565>-5/04 |
| 项目名称:*经营租赁*通行费 |
| 车牌号:鄂AHG248 |
| 类型:货车 |
| 通行日期起:20180212 |
| 通行日期止:20180212 |
| 金额:286.23 |
| 税率:3% |
| 税额:8.59 |
| 合计:¥286.23 |
| 价税合计(大写):贰佰玖拾肆元捌角贰分 |
| (小写):¥294.82 |
| 销售方名称:湖北随岳南高速公路有限公司 |
| 纳税人识别号:91420000753416406R |
| 地址、电话:武汉开发区C7C1地块东合中心B栋1601号 027-83458755 |
| 开户行及账号:民生银行武汉光谷口支行0514014170001889 |
| 备注: |
| 收款人:龙梦媛 |
| 复核:陈煜 |
| 开票人:尹晨 |
| 销售方(章):湖北随岳南高速公路有限公司 91420000753416406R 发票专用章 |
| 发票代码:042001700112 |
| 发票号码:07007198 |
| 开票日期:2018年03月23日 |
| 校验码:12636 66602 23329 27910 |
================================================
FILE: shared/outputs/0016.md
================================================
| Valori Nutrizionali/Nutrition Facts/ | per/per/pro/por |
|-----------------------------------|-----------------|
| Energia/Energy/Energie/Valor energético | Kj 2577/Kcal 616 |
| Grassi/Fat/Fett/Grasas | 49.9 g |
| di cui acidi grassi saturi/of which saturates/davon gesättigte Fettsäuren/de las cuales saturadas | 8.3 g |
| Carboidrati/Carbohydrate/Kohlenhydrate/Hidratos de carbono | 12.0 g |
| di cui zuccheri/of which sugars/davon Zucker/de los cuales azúcar | 5.1 g |
| Fibre/Fibre/Ballaststoffe/Fibra alimentaria | 8.3 g |
| Proteine/Protein/Eiweiß/Proteínas | 24.8 g |
| Sale/Salt/Salz/Sal | 0.0 g |
**IT Ingredienti:** 100% Arachidi
**EN Ingredients:** 100% Peanuts
**DE Zutaten:** 100% Erdnüsse
**ES Ingredientes:** 100% Cacahuetes
**220 g**
---
**100% PEANUT**
**PEANUT BUTTER**
**ORIGIN:** Argentina
Può contenere tracce di altra frutta a guscio, soia, latte e sesamo/May contain traces of other nuts, soya, milk and sesame/Kann Spuren von anderen Nüssen, Soja, Milch und Sesam enthalten/Puede contener trazas de otros frutos secos, soja, leche y sésamo
Conservare in luogo fresco e asciutto/Store in a cool and dry place/Kühl und trocken lagern/Conservar en un lugar fresco y seco
Prodotto e confezionato per Bowlpros Srl v.le E. Caldara 24. 20122 Milano (MI) Italia nello stabilimento di via Ferrovia 110.
80040 San Gennaro Vesuviano (NA)
<[email protected]>
<www.bowlpros.com>
Da consumarsi preferibilmente entro il/Best before/Mindestens haltbar bis/Consumir preferentemente antes del
![Barcode](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAABGdBTUEAALGPC/xhBQAAACBjSFJNAAB6JgAAgIQAAPoAAACA6AAAdTAAAOpgAAA6mAAAF3CculE8AAABgElEQVR42mJ8//8/AyUYBFRgYGBg+P//P4QxA0MDw9D8//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3+//8fQwMDw3
================================================
FILE: shared/outputs/0017.md
================================================
# 御見積書
長崎北郵便局 様
下記の通り御見積申し上げます。
何卒御用命下さる様お願い申し上げます。
---
件名: 長崎北郵便局(親時計更新)
受渡期間:
- 受注後: 60日間
- 見積有効期間: 90日間
代金:
- 消費税(別途)
- 配線工事費(別途)
- 調整費(含む)
- 取付工事費(含む)
申受条件:
- 荷造運賃(含む)
〒812-0026
福岡県福岡市博多区上川端町8-18
TEL: 092-281-0020 FAX: 092-281-0112
シチズンTIC株式会社
---
御見積金額 ¥712,000 -
| 品目コード | 製品名 | 数量 | 単位 | 単価 | 金額 |
|-------------|--------|------|------|------|------|
| KM-82TC-4P | 親時計4回線壁掛型 タイム・チャイム | 1 | 台 | 712,000 | 712,000 |
※設置工事費含む。
※キャンペーン期間中の為設置工事費無料です。
---
| 標準価格計 | 712,000 |
|-------------|---------|
| 割引合計額 | |
| 総合計 | 712,000 |
---
※受注製作品・特注品が含まれている場合、御発注後にキャンセル又は仕様変更が発生した場合別途費用を御請求させていただきます。御了承ください。
受注製作品・特注品の内容に関しては営業担当者へ確認ください。
1/1
================================================
FILE: shared/outputs/0018.md
================================================
# UNITED STATES
# SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
# FORM 10-Q
(Mark One)
☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the quarterly period ended September 30, 2024
OR
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from ______ to ______
Commission File Number: 001-34756
# Tesla, Inc
(Exact name of registrant as specified in its charter)
| Texas | 91-2197729 |
| (State or other jurisdiction of incorporation or organization) | (I.R.S. Employer Identification No.) |
1 Tesla Road
Austin, Texas
(Address of principal executive offices)
78725
(Zip Code)
(512) 516-8177
(Registrant’s telephone number, including area code)
Securities registered pursuant to Section 12(b) of the Act:
| Title of each class | Trading Symbol(s) | Name of each exchange on which registered |
|---------------------|-------------------|------------------------------------------|
| Common stock | TSLA | The Nasdaq Global Select Market |
Indicate by check mark whether the registrant (1) has
[Table of Contents](#)
**PART I. FINANCIAL INFORMATION**
**ITEM 1. FINANCIAL STATEMENTS**
**Tesla, Inc.**
**Consolidated Balance Sheets**
*(in millions, except per share data)*
*(unaudited)*
| | September 30, 2024 | December 31, 2023 |
|---|---|---|
| **Assets** | | |
| **Current assets** | | |
| Cash and cash equivalents | $18,111 | $16,398 |
| Short-term investments | 15,537 | 12,696 |
| Accounts receivable, net | 3,313 | 3,508 |
| Inventory | 14,530 | 13,626 |
| Prepaid expenses and other current assets | 4,888 | 3,388 |
| **Total current assets** | 56,379 | 49,616 |
| Operating lease vehicles, net | 5,380 | 5,989 |
| Solar energy systems, net | 5,040 | 5,229 |
| Property, plant and equipment, net | 36,116 | 29,725 |
| Operating lease right-of-use assets | 4,867 | 4,180 |
| Digital assets, net | 184 | 184 |
| Intangible assets, net | 158 | 178 |
| Goodwill | 253 | 253 |
| Deferred tax assets | 6,486 | 6,733 |
| Other non-current assets | 4,989 | 4,531 |
| **Total assets** | $119,852 | $106,618 |
| **Liabilities** | | |
| **Current liabilities** | | |
| Accounts payable | $14,654 | $14,431 |
| Accrued liabilities and other | 10,601 | 9,080 |
| Deferred revenue | 3,031 | 2,864 |
| Current portion of debt and finance leases | 2,291 | 2,373 |
| **Total current liabilities** | 30,577 | 28,748 |
| Debt and finance leases, net of current portion | 5,405 | 2,857 |
| Deferred revenue, net of current portion | 3,350 | 3,251 |
| Other long-term liabilities | 9,810 | 8,153 |
| **Total liabilities** | 49,142 | 43,009 |
| Commitments and contingencies (Note 10) | 70 | 242 |
| Redeemable noncontrolling interests in subsidiaries | | |
| **Equity** | | |
| Stockholders’ equity | | |
| Preferred stock; $0.001 par value; 100 shares authorized; no shares issued and outstanding | — | — |
| Common stock; $0.001 par value; 6,000 shares authorized; 3,207 and 3,185 shares issued and outstanding as of September 30, 2024 and December 31, 2023, respectively | 3 | 3 |
| Additional paid-in capital | 37,286 | 34,892 |
| Accumulated other comprehensive loss | (14) | (143) |
| Retained earnings | 32,656 | 27,882 |
| **Total stockholders’ equity** | 69,931 | 62,634 |
| Noncontrolling interests in subsidiaries | 709 | 733 |
| **Total liabilities and equity** | $119,852 | $106,618 |
The accompanying notes are an integral part of these consolidated financial statements.
4
# Tesla, Inc
## Consolidated Statements of Operations
### (in millions, except per share data)
### (unaudited)
| | Three Months Ended September 30, | Nine Months Ended September 30, |
|--------------------------------|----------------------------------|---------------------------------|
| | 2024 | 2023 | 2024 | 2023 |
| **Revenues** | | | | |
| Automotive sales | $18,831 | $18,582 | $53,821 | $57,879 |
| Automotive regulatory credits | 739 | 554 | 2,071 | 1,357 |
| Automotive leasing | 446 | 489 | 1,380 | 1,620 |
| **Total automotive revenues** | 20,016 | 19,625 | 57,272 | 60,856 |
| Energy generation and storage | 2,376 | 1,559 | 7,025 | 4,597 |
| Services and other | 2,790 | 2,166 | 7,686 | 6,153 |
| **Total revenues** | 25,182 | 23,350 | 71,983 | 71,606 |
| **Cost of revenues** | | | | |
| Automotive sales | 15,743 | 15,656 | 45,602 | 47,919 |
| Automotive leasing | 247 | 301 | 761 | 972 |
| **Total automotive cost of revenues** | 15,990 | 15,957 | 46,363 | 48,891 |
| Energy generation and storage | 1,651 | 1,178 | 5,157 | 3,770 |
| Services and other | 2,544 | 2,037 | 7,192 | 5,723 |
| **Total cost of revenues** | 20,185 | 19,172 | 58,712 | 58,384 |
| **Gross profit** | 4,997 | 4,178 | 13,271 | 13,222 |
| **Operating expenses** | | | | |
| Research and development | 1,039 | 1,161 | 3,264 | 2,875 |
| Selling, general and administrative | 1,186 | 1,253 | 3,837 | 3,520 |
| Restructuring and other | 55 | - | 677 | - |
| **Total operating expenses** | 2,280 | 2,414 | 7,778 | 6,395 |
| **Income from operations** | 2,717 | 1,764 | 5,493 | 6,827 |
| Interest income | 429 | 282 | 1,127 | 733 |
| Interest expense | (92) | (38) | (254) | (95) |
| Other (expense) income, net | (270) | 37 | (142) | 317 |
| **Income before income taxes** | 2,784 | 2,045 | 6,224 | 7,782 |
| Provision for income taxes | 601 | 167 | 1,403 | 751 |
| **Net income** | 2,183 | 1,878 | 4,821 | 7,031 |
| Net income (loss) attributable to noncontrolling interests and redeemable noncontrolling interests in subsidiaries | 16 | 25 | 47 | (38) |
| **Net income attributable to common stockholders** | $2,167 | $1,853 | $4,774 | $7,069 |
| **Net income per share of common stock attributable to common stockholders** | | | | |
| Basic | $0.68 | $0.58 | $1.51 | $2.23 |
| Diluted | $0.62 | $0.53 | $1.38 | $2.03 |
| Weighted average shares used in computing net income per share of common stock | | | | |
| Basic | 3,198 | 3,176 | 3,192 | 3,171 |
| Diluted | 3,497 | 3,493 | 3,489 | 3,481 |
The accompanying notes are an integral part of these consolidated financial statements.
5
# Tesla, Inc
## Consolidated Statements of Comprehensive Income
### (in millions)
### (unaudited)
| | Three Months Ended September 30, | Nine Months Ended September 30, |
|--------------------------------|----------------------------------|---------------------------------|
| | 2024 | 2023 | 2024 | 2023 |
| **Net income** | $ 2,183 | $ 1,878 | $ 4,821 | $ 7,031 |
| **Other comprehensive income (loss):** | | | | |
| Foreign currency translation adjustment | 445 | (289) | 121 | (343) |
| Unrealized net gain on investments, net of tax | 8 | 7 | 8 | 8 |
| Net loss realized and included in net income | — | — | — | 4 |
| **Comprehensive income** | 2,636 | 1,596 | 4,950 | 6,700 |
| **Less: Comprehensive income (loss) attributable to noncontrolling interests and redeemable noncontrolling interests in subsidiaries** | 16 | 25 | 47 | (38) |
| **Comprehensive income attributable to common stockholders** | $ 2,620 | $ 1,571 | $ 4,903 | $ 6,738 |
The accompanying notes are an integral part of these consolidated financial statements.
---
6
Table of Contents
---
Tesla, Inc.
Consolidated Statements of Cash Flows
(in millions)
(unaudited)
| | Nine Months Ended September 30, |
|---|---|---|
| | 2024 | 2023 |
**Cash Flows from Operating Activities**
Net income | $ 4,821 | $ 7,031
Adjustments to reconcile net income to net cash provided by operating activities:
- Depreciation, amortization and impairment | 3,872 | 3,435
- Stock-based compensation | 1,420 | 1,328
- Inventory and purchase commitments write-downs | 247 | 361
- Foreign currency transaction net unrealized loss (gain) | 197 | (317)
- Deferred income taxes | 418 | (316)
- Non-cash interest and other operating activities | 83 | 94
Changes in operating assets and liabilities:
- Accounts receivable | 144 | 377
- Inventory | (1,107) | (1,953)
- Operating lease vehicles | (82) | (1,858)
- Prepaid expenses and other assets | (2,639) | (1,992)
- Accounts payable, accrued and other liabilities | 2,504 | 1,922
- Deferred revenue | 231 | 774
Net cash provided by operating activities | 10,109 | 8,886
**Cash Flows from Investing Activities**
Purchases of property and equipment excluding finance leases, net of sales | (8,556) | (6,592)
Purchases of solar energy systems, net of sales | (6) | —
Purchases of investments | (20,797) | (13,221)
Proceeds from maturities of investments | 17,975 | 8,959
Proceeds from sales of investments | 200 | 138
Business combinations, net of cash acquired | — | (64)
Net cash used in investing activities | (11,184) | (10,780)
**Cash Flows from Financing Activities**
Proceeds from issuances of debt | 4,360 | 2,526
Repayments of debt | (1,783) | (887)
Proceeds from exercises of stock options and other stock issuances | 788 | 548
Principal payments on finance leases | (291) | (340)
Debt issuance costs | (6) | (23)
Distributions paid to noncontrolling interests in subsidiaries | (76) | (105)
Payments for buy-outs of noncontrolling interests in subsidiaries | (124) | (17)
Net cash provided by financing activities | 2,868 | 1,702
Effect of exchange rate changes on cash and cash equivalents and restricted cash | (8) | (142)
Net increase (decrease) in cash and cash equivalents and restricted cash | 1,785 | (334)
Cash and cash equivalents and restricted cash, beginning of period | 17,189 | 16,924
Cash and cash equivalents and restricted cash, end of period | $ 18,974 | $ 16,590
**Supplemental Non-Cash Investing and Financing Activities**
Acquisitions of property and equipment included in liabilities | $ 2,727 | $ 1,717
Leased assets obtained in exchange for finance lease liabilities | $ 32 | $ 1
Leased assets obtained in exchange for operating lease liabilities | $ 1,232 | $ 1,548
The accompanying notes are an integral part of these consolidated financial statements.
---
9
================================================
FILE: shared/outputs/0019.md
================================================
[Walmart]
**See back of receipt for your chance to win $1000 ID#: TN5NV1VXCQDQ**
317-851-1102 Mrg: JAMIE BROOKSHIRE
882 S. STATE ROAD 135
GREENWOOD, IN 46143
| Item Code | Description | Qty | Price |
|---------------|---------------------------|-----|-------|
| 05483 | TATER TOTS | | 2.96 |
| 0071436 | OPI | | 1.88 |
| 001320020062 | F | | 5.88 |
| 003120065 | SNACK BARS | | 5.88 |
| 003120248164 | HRI CL CHS | | 5.88 |
| 003120065000 | HRI CL CHS | | 5.88 |
| 001201254 | VOIDED ENTRY | | |
| 003120053000 | HRI 12 U SG | | 5.88 |
| 0074316 | PEANUT BUTTER | | 3.18 |
| 001376420528 | ACCESSORY | | 2.96 |
| 0000000000 | BTS DRY BLON | | 3.28 |
| 002246021090 | TR HS FRM 4 | | 32.00 |
| 003178201896 | GV SLIDERS | | 2.74 |
| 003178201526 | BAGELS | | 2.50 |
| 003178201286 | CHEEZE IT | | 4.00 |
| 003201563 | RITZ WLS 4.5 | | 2.78 |
| 004800078 | RUFFLES | | 2.78 |
| 004800092914 | GV HNY GRMS | | 2.50 |
---
**SUBTOTAL**: 139.24
**TAX 1**: 7.00%
**TOTAL**: 141.02
**CASH TEND**: 150.00
**CHANGE DUE**: 6.00
**# ITEMS SOLD**: 26
**TC#**: 0783 6080 4072 3416 2495 6
**Date**: 04/27/19
**Time**: 12:59:46
---
Scan with Walmart app to save receipts
================================================
FILE: shared/outputs/0020.md
================================================
# ZESTADO EXPRESS
**ABN:** 16 112 221 123
Mercure Tower, Floors 1 – 10
10 Queen Street, King George Square
Brisbane City QLD 4000
Australia
**Bill To:**
Custom Board Makers
Administration Centre
12 Salvage Road
Acacia Ridge BC QLD 4110
Australia
**SHIPPING INVOICE 10112**
Issue Date: 8th December 2021
Account No: 101234
Invoice Amount: $5,270.00 AUD
Please Pay By: 15th December 2021
**Page: 1 of 2**
---
## Waybill No: 012345A
### Main Shipping Information
**Sender**
**Customer Reference:** GC12345
Booking Contact: Zac Rider
Booking Phone: 61 7 4321 1234
Email: <[email protected]>
Type of Goods: Surfboards
Total Pieces: 360
Gross Weight: 1,010 kg
Place of Discharge: Port of Brisbane
Shipped Date: 12-Nov-2021
Place of Delivery: Kahului Maui Hawaii Port
Delivered Date: 23-Nov-2021
**Recipient**
Maui Surf Shop
12 Haleakala Drive
West Maui
Lahaina HI 96761
United States
**Description of Shipped Items**
| Description | Qty | Dimensions |
|-------------|-----|------------|
| 4321-A1 XL Custom Supreme Light Stand Up | 150 | 13' x 18" x 3' |
| 4421-B1 Custom Deluxe Longboards | 120 | 10' x 20" x 3' |
| 4231-C1 Eco High Performance Mini Mals | 90 | 7' x 18" x 2' |
**Description of Charges**
| Description | Amount |
|-------------|--------|
| Ocean Freight Charge | $1,500.00 |
| Insurance Cover | $250.00 |
| Terminal Handling | $55.00 |
**Charges**
| Description | Amount |
|-------------|--------|
| Customs Tax | $100.00 |
| Customs Duties | $75.00 |
**TOTAL**
$1,980.00
---
## Waybill No: 012346B
### Main Shipping Information
**Sender**
**Customer Reference:** SC12367
Booking Contact: Reece Gnarly
Booking Phone: 61 7 4331 1234
Email: <[email protected]>
Type of Goods: Surfboards
Total Pieces: 225
Gross Weight: 850 kg
Place of Discharge: Port of Brisbane
Shipped Date: 12-Nov-2021
Place of Delivery: Port of Long Beach
Delivered Date: 5-Dec-2021
**Recipient**
Long Beach Surf Shop
150 Foam Avenue
Shark Beach CA 90760
United States
**Description of Shipped Items**
| Description | Qty | Dimensions |
|-------------|-----|------------|
| 4321-A1 XL Custom Supreme Light Stand Up | 150 | 13' x 18" x 3' |
| 4123-D1 Finnastic Funboards | 75 | 7' x 18" x 2' |
**Description of Charges**
| Description | Amount |
|-------------|--------|
| Ocean Freight Charge | $980.00 |
| Terminal Handling | $55.00 |
**Charges**
| Description | Amount |
|-------------|--------|
| Customs Tax | $100.00 |
| Customs Duties | $75.00 |
**TOTAL**
$1,210.00
---
Please login to your [shipping portal](#) to pay for the invoice.
**Thank you for your business!**
================================================
FILE: shared/outputs/0021.md
================================================
**PRAWO JAZDY**
**RZECZPOSPOLITA POLSKA**
1\. BRANDT<br>
2\. MARTIN<br>
3\. 27.06.1988 CRIVITZ<br>
4a\. 24.04.2019<br>
4b\. 10.09.2033<br>
4c\. STAROSTA POLICKI<br>
4d\. 880627<br>
5\. 00359/19/3211<br>
7\.
8806670172
9\. AM/B1/B
PL
POLSKA
================================================
FILE: shared/outputs/0022.md
================================================
**Ohio**
**DRIVER LICENSE**
**DL**
**Class D**
TED STRICKLAND, GOVERNOR
Mike Rankin, Registrar BMV
9900TL5467900302
USA
1\. PUBLIC<br>
2\. JANE Q<br>
5\. 1970 W BROAD ST<br>
COLUMBUS, OH 43223<br>
4d\. LICENSE NO.<br>
TL545786<br>
3\. BIRTHDATE<br>
07-09-1962<br>
4a\. ISSUE DATE<br>
04-01-2009<br>
9\. CLASS<br>
D<br>
4b\. EXPIRES<br>
07-09-2012<br>
9a\. ENDORS<br>
12\. RESTR<br>
A<br>
07-09-1962
**Signature**
15\. Sex: F<br>
16\. Ht: 5-08<br>
17\. Wt: 130<br>
18\. Eyes: BRO<br>
19\. Hair: BRO<br>
**ORGAN DONOR OHIO**
HEALTHCARE<br>
POWER OF ATTY<br>
LIFE SUSTAINING<br>
EQUIPMENT<br>
================================================
FILE: shared/outputs/0023.md
================================================
# DRIVER LICENSE
## Tennessee
### THE VOLUNTEER STATE
USA
TN
**DL NO.** 123456789
**EXP** 02/11/2026
**DOB** 02/11/1974
**ISS** 02/11/2019
**CLASS** D
**END** NONE
**REST** 01
**SEX** F
**HGT** 5'-05"
**EYES** BLU
**DD** 1234567890123456
**SAMPLE**
**JANICE**
123 MAIN STREET
APT. 1
NASHVILLE, TN 37210
**Janice Sample**
**DL**
================================================
FILE: shared/outputs/0024.md
================================================
# CALIFORNIA
## EXPIRES ON BIRTHDAY
**1986**
## DRIVER LICENSE
- N8685798
- Michael Joe Jackson
- 4641 Hayvenhurst
- Los Angeles Ca 91316
| SEX | HAIR | EYES | HEIGHT | WEIGHT | DATE OF BIRTH |
| --- | ---- | ---- | ------ | ------ | ------------- |
| M | Blk | Brn | 5-9 | 120 | 8-29-58 |
**PRE LIC EXP** 85
**OTHER ADDRESS** CLASS 3
**MUST WEAR CORRECTIVE LENSES** □
**SEE OVER FOR ANY OTHER CONDITIONS**
**SECTION 12804 VEHICLE CODE**
**X** Michael Joe Jackson
4-28-83 | clckjw | DMV
**DO NOT LAMINATE**
**AHIJ**
================================================
FILE: shared/outputs/0025.md
================================================
# NEW YORK STATE USA
## LEARNER PERMIT
### UNDER 21
**Mark J.F. Schroeder**
Commissioner of Motor Vehicles
**ID** 987 654 321
**Class** DJ
**Sex** F
**Eyes** BLU
**Height** 5'-08"
**DOB** 10/31/2003
**Issued** 03/07/2022
**Expires** 10/31/2026
**E** NONE
**R** NONE
**Michelle M. Motorist**
**MOTORIST**
MICHELLE, MARIE
2345 ANYWHERE STREET
ALBANY, NY 12222
**U18 UNTIL** OCT 21
**U21 UNTIL** OCT 24
**Organ Donor**
**OCT 31 03 123456789**
================================================
FILE: shared/outputs/0026.md
================================================
# California USA | DRIVER LICENSE
**DL** 11234568
**EXP** 08/31/2014
**LN** CARDHOLDER
**FN** IMA
2570 24TH STREET
ANYTOWN, CA 95818
**DOB** 08/31/1977
**RSTR** NONE
**CLASS** C
**END** NONE
**DONOR**
**VETERAN**
**SEX** F
**HGT** 5'-05"
**HAIR** BRN
**WGT** 125 lb
**EYES** BRN
**DD** 00/00/0000NNNAN/ANFD/YY
**ISS** 08/31/2009
**Signature:** Ima Cardholder
**0831977**
================================================
FILE: shared/outputs/0027.md
================================================
# Pennsylvania | IDENTIFICATION CARD
visitPA.com | USA
**NOT FOR REAL ID PURPOSES**
**4d IDN:** 99 999 999
**DUPS:** 00
**3 DOB:** 01/07/1973
**1** SAMPLE
**2** ANDREW JASON
**8** 123 MAIN STREET
APT. 1
HARRISBURG, PA 17101-0000
**4b EXP:** 01/31/2026
**4a ISS:** 01/07/2022
**15 SEX:** M
**18 EYES:** BRO
**16 HGT:** 5'-11"
**5 DD:** 1234567890123
456789012345
**ID**
**❤️ ORGAN DONOR**
**SAMPLE**
**Andrew Sample**
================================================
FILE: shared/outputs/0028.md
================================================
# CALIFORNIA LICENSE
**EXPIRES ON BIRTHDAY**
**1970**
ISSUED IN ACCORDANCE WITH THE CALIFORNIA VEHICLE CODE
**Ronald J. Thomas**
DRIVERS LICENSE ADMINISTRATOR
**DRIVER**
David Franklin Thomas
5798 Olive St
Paradise, Calif 95969
**W106438**
| SEX | COLOR HAIR | COLOR EYES | HEIGHT | WEIGHT | MARRIED |
| --- | ---------- | ---------- | ------ | ------ | ------- |
| M | Gry | Blu | 6-0 | 205 | Yes |
| DATE OF BIRTH | AGE | PREVIOUS LICENSE |
| ------------- | --- | ---------------- |
| Aug 20, 1892 | 72 | Calif |
MUST WEAR
Corrective Lenses ☑
SEE OVER FOR ANY
OTHER CONDITIONS □
OTHER
ADDRESS
X D. F. Thomas
CLASS 3. MAY DRIVE 2 AXLE VEHICLE, EXCEPT BUS DESIGNED FOR MORE THAN 15 PASSENGERS. MAY TOW VEHICLE LESS THAN 6,000 LBS. GROSS.
**Office** Paradise
**Date** 8-4-65
**MUST BE CARRIED WHEN OPERATING A MOTOR VEHICLE AND WHEN APPLYING FOR RENEWAL**
================================================
FILE: shared/outputs/0029.md
================================================
**CALIFORNIA | DRIVER LICENSE**
MUST BE CARRIED WHEN OPERATING A MOTOR VEHICLE AND WHEN APPLYING FOR RENEWAL
**EXPIRES ON BIRTHDAY**
**1984**
- W0209369
- James Scott Garner
- 35 Oakmont Dr
- Los Angeles CA 90049
| SEX | HAIR | EYES | HEIGHT | WEIGHT | PRE LIC EXP |
| --- | ---- | ---- | ------ | ------ | ----------- |
| M | Blk | Brn | 6-3 | 210 | 80 |
**DATE OF BIRTH** 4-7-23
**MUST WEAR CORRECTIVE LENSES** □
**SEE OVER FOR ANY OTHER CONDITIONS**
**OTHER ADDRESS** CLASS 3
**SECTION 12804 VEHICLE CODE**
**X** James S. Garner
3-25-80 | Gln rc
**DO NOT LAMINATE**
================================================
FILE: shared/outputs/0030.md
================================================
# CALIFORNIA | DRIVER LICENSE
**MUST BE CARRIED WHEN OPERATING A MOTOR VEHICLE AND WHEN APPLYING FOR RENEWAL**
**EXPIRES ON BIRTHDAY**
- N2287802
- Kenneth Wayne Shanaberger
- 1541 Beloit Ave #208
- Los Angeles CA 90025
| SEX | HAIR | EYES | HEIGHT | WEIGHT | PRE LIC EXP |
| --- | ---- | ---- | ------ | ------ | ----------- |
| M | Brn | Brn | 5-6 | 130 | 82 |
**DATE OF BIRTH**
**MUST WEAR CORRECTIVE LENSES** □
**SEE OVER FOR ANY OTHER CONDITIONS**
**OTHER ADDRESS** CLASS 3
**SECTION 12804 VEHICLE CODE**
**X**
08-21-80 | Tor mw
**DO NOT LAMINATE**
================================================
FILE: shared/outputs/0031.md
================================================
**SIGNATURE OF BEARER / SIGNATURE DU TITULAIRE / FIRMA DEL TITULAR**
---
**PASSPORT**
**PASSEPORT**
**PASAPORTE**
**UNITED STATES OF AMERICA**
**Type / Type / Tipo** P
**Code / Code / Codigo** USA
**Passport No. / No. du Passeport / No. de Pasaporte** 546844936
**Surname / Nom / Apellidos** ABRENICA
**Given Names / Prénoms / Nombres** JARED MICHAEL
**Nationality / Nationalité / Nacionalidad** UNITED STATES OF AMERICA
**Date of birth / Date de naissance / Fecha de nacimiento** 10 Feb 2001
**Place of birth / Lieu de naissance / Lugar de nacimiento** NEW YORK, U.S.A.
**Sex / Sexe / Sexo** M
**Date of issue / Date de délivrance / Fecha de expedición** 06 Jun 2016
**Date of expiration / Date d'expiration / Fecha de expiración** 05 Jun 2021
**Authority / Autorité / Autoridad** United States Department of State
**Endorsements / Mentions Spéciales / Anotaciones** SEE PAGE 27
**USA**
P<USAABRENICA<<JARED<MICHAEL<<<<<<<<<<<<<<<<
5468449363USA0102100M2106054275193173<681306
================================================
FILE: shared/outputs/0032.md
================================================
**ENDORSEMENTS AND LIMITATIONS**
This passport is valid for all countries unless otherwise specified. The bearer must comply with any visa or other entry regulations of the countries to be visited.
SEE OBSERVATIONS BEGINNING ON PAGE 5 (IF APPLICABLE)
**MENTIONS ET RESTRICTIONS**
Ce passeport est valable pour tous les pays, sauf indication contraire. Le titulaire doit se conformer aux formalités relatives aux visas ou aux autres formalités d'entrée des pays où il a l'intention de se rendre.
VOIR LES OBSERVATIONS DÉBUTANT À LA PAGE 5 (LE CAS ÉCHÉANT)
**Signature of bearer - Signature du titulaire**
GK141569
---
**CANADA**
**PASSPORT**
**PASSEPORT**
**Type/Type** P
**Issuing Country/Pays émetteur** CAN
**Passport No./N° de passeport** GK141569
**Surname/Nom** MANN
**Given names/Prénoms** JASKARAN SINGH
**Nationality/Nationalité** CANADIAN / CANADIENNE
**Date of birth/Date de naissance** 18 FEB / FÉV 93
**Sex/Sexe** M
**Place of birth/Lieu de naissance** MAUR NABHA, JNAOIA
**Date of issue/Date de délivrance** 17 FEB / FEB 18
**Date of expiry/Date d'expiration** 16 FEB / FEB 28
**Issuing Authority/Autorité de délivrance** TORONTO
P<CANMANN<<JASKARAN<<SINGH<<<<<<<<<<<<<<<<<<
GK141569<8CAN8607294M2707202<<<<<<<<<<<<<<00
ED197265
================================================
FILE: shared/outputs/0033.md
================================================
Assinatura do titular / Signature du titulaire
Bearer's signature / Firma del titular
Este passaporte deve ser assinado pelo titular, salvo em caso de incapacidade.
Ce passeport doit être signé par le titulaire, sauf en cas d'incapacité.
This passport must be signed, except where the bearer is unable to do so.
Este pasaporte debe ser firmado por el titular, salvo en caso de incapacidad.
AA000000
---
**REPÚBLICA FEDERATIVA DO BRASIL**
**PASSAPORTE**
**PASSPORT**
**TIPO/TYPE:** P
**PAÍS EMISSOR/ISSUING COUNTRY:** BRA
**PASSAPORTE Nº/PASSPORT No.:** AA000000
**SOBRENOME/SURNAME:** FARIAS DOS SANTOS
**NOME/GIVEN NAMES:** RODRIGO
**NACIONALIDADE/NATIONALITY:** BRASILEIRO(A)
**DATA DO NASCIMENTO/DATE OF BIRTH:** 16 MAR/MAR 2004
**IDENTIDADE Nº/PERSONAL No:**
**SEXO/SEX:** M
**NATURALIDADE/PLACE OF BIRTH:** BRASÍLIA/DF
**FILIAÇÃO/FILIATION:**
MARCOS JOSÉ DOS SANTOS
AMANDA FARIAS DOS SANTOS
O titular, enquanto menor, está autorizado pelos genitores, pelo prazo deste documento, a viajar
apenas com um dos pais, indistintamente: Res. CNJ 131/11, Art. 13.
**DATA DE EXPEDIÇÃO/DATE OF ISSUE:** 06 JUL/JUL 2015
**VALIDO ATÉ/DATE OF EXPIRY:** 05 JUL/JUL 2025
**AUTORIDADE/AUTHORITY:** DPAS/DPF
P<BRAFARIAS<DOS<SANTOS<<RODRIGO<<<<<<<<<<<<<
AA000000<0BRA0403162M2507053<<<<<<<<<<<<<<04
================================================
FILE: shared/outputs/0034.md
================================================
**ENDORSEMENTS AND LIMITATIONS**
This passport is valid for all countries unless otherwise specified. The bearer must comply with any visa or other entry regulations of the countries to be visited.
SEE OBSERVATIONS BEGINNING ON PAGE 5 (IF APPLICABLE)
**MENTIONS ET RESTRICTIONS**
Ce passeport est valable pour tous les pays, sauf indication contraire. Le titulaire doit se conformer aux formalités relatives aux visas ou aux autres formalités d'entrée des pays où il a l'intention de se rendre.
VOIR LES OBSERVATIONS DÉBUTANT À LA PAGE 5 (LE CAS ÉCHÉANT)
[Signature]
**Signature of bearer - Signature du titulaire**
HK444152
---
**CANADA**
**PASSPORT**
**PASSEPORT**
**Type/Type** P
**Issuing Country/Pays émetteur** CAN
**Passport No./N° de passeport** HK444152
**Surname/Nom** WITTMACK
**Given names/Prénoms** BRIAN FREDRICK
**Nationality/Nationalité** CANADIAN/CANADIENNE
**Date of birth/Date de naissance** 01 NOV 47
**Sex/Sexe** M
**Place of birth/Lieu de naissance** CONSORT CAN
**Date of issue/Date de délivrance** 13 JUNE/JUIN 16
**Date of expiry/Date d'expiration** 13 JUNE/JUIN 26
**Issuing Authority/Autorité de délivrance** MISSISSAUGA
P<CANWITTMACK<<BRIAN<FREDRICK<<<<<<<<<<<<<<
HK444152<5CAN4711018M2606130<<<<<<<<<<<<<<06
EGD69494
================================================
FILE: shared/outputs/0035.md
================================================
**THIS PAGE IS RESERVED FOR OFFICIAL OBSERVATIONS**
**CETTE PAGE EST RÉSERVÉE AUX OBSERVATIONS OFFICIELLES (11)**
**THERE ARE NO OFFICIAL OBSERVATIONS**
---
**UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND**
**PASSPORT**
**PASSEPORT**
**Type/Type** P
**Code/Code** GBR
**Passport No./Passeport No.** 518242591
**Surname/Nom (1)** WEBB
**Given names/Prénoms (2)** JAMES ROBERT
**Nationality/Nationalité (3)** BRITISH CITIZEN
**Date of Birth/Date de naissance (4)** 17 FEB / FEV 77
**Sex/Sexe (5)** M
**Place of birth/Lieu de naissance (6)** CROYDON
**Date of issue/Date de délivrance (7)** 24 OCT / OCT 13
**Authority/Autorité (8)** IPS
**Date of expiry/Date d'expiration (9)** 24 APR / AVR 24
**Holder's signature/Signature du titulaire (10)** [Signature]
P<GBRWEBB<<JAMES<ROBERT<<<<<<<<<<<<<<<<<<<<<
5182425917GBR7702174M2404244<<<<<<<<<<<<<<06
================================================
FILE: shared/outputs/0036.md
================================================
**RESIDENZA / RESIDENCE / DOMICILE (11)** TORINO (TO)
**RESIDENZA / RESIDENCE / DOMICILE (11)**
**RESIDENZA / RESIDENCE / DOMICILE (11)**
**STATURA / HEIGHT / TAILLE (12)** 176
**COLORE DEGLI OCCHI / COLOUR OF EYES / COULEUR DES YEUX (13)** MARRONI
---
**REPUBBLICA ITALIANA**
**PASSAPORTO**
**PASSPORT**
**PASSEPORT**
**Tipo. Type. Type.** P
**Codice Paese. Code of Issuing State. Code du pays émetteur.** ITA
**Passaporto N. Passport No. Passeport N°.** YA8116396
**Cognome. Surname. Nom. (1)** TREVISAN
**Nome. Given Names. Prénoms. (2)** MARCO
**Cittadinanza. Nationality. Nationalité. (3)** ITALIANA
**Data di nascita. Date of birth. Daté de naissance. (4)** 12 FEB / FEB 1966
**Sesso. Sex. Sexe. (5)** M
**Luogo di nascita. Place of birth. Lieu de naissance. (6)** FELTRE (BL)
**Data di rilascio. Date of issue. Date de délivrance. (7)** 10 LUG / JUL 2015
**Autorità. Authority. Autorité. (9)**
MINISTRO AFFARI ESTERI
E COOPERAZIONE INTERNAZIONALE
**Data di scadenza. Date of expiry. Date d'expiration. (8)** 09 LUG / JUL 2025
**Firma del titolare. Holder's signature / Signature du titulaire. (10)** [Signature]
P<ITATREVISAN<<MARCO<<<<<<<<<<<<<<<<<<<<<<<<
YA81163966ITA6602129M2507097<<<<<<<<<<<<<<08
================================================
FILE: shared/outputs/0037.md
================================================
# We the People
**Of the United States,**
in Order to form a more perfect Union,
establish Justice, insure domestic Tranquility,
provide for the common defence,
promote the general Welfare, and secure
the Blessings of Liberty to ourselves and
our Posterity, do ordain and establish this
Constitution for the United States of America.
[Signature]
**SIGNATURE OF BEARER / SIGNATURE DU TITULAIRE / FIRMA DEL TITULAR**
---
**UNITED STATES OF AMERICA**
**PASSPORT**
**PASSEPORT**
**PASAPORTE**
**Type / Type / Tipo** P
**Code / Code / Código** USA
**Passport No. / No de Passeport / No. de Pasaporte** 910239248
**Surname / Nom / Apellidos** OBAMA
**Given Names / Prénoms / Nombres** MICHELLE
**Nationality / Nationalité / Nacionalidad** UNITED STATES OF AMERICA
**Date of birth / Date de naissance / Fecha de nacimiento** 17 Jan 1964
**Place of birth / Lieu de naissance / Lugar de nacimiento** ILLINOIS, U.S.A.
**Sex / Sexe / Sexo** F
**Date of issue / Date de délivrance / Fecha de expedición** 06 Dec 2013
**Date of expiration / Date d'expiration / Fecha de caducidad** 05 Dec 2018
**Authority / Autorité / Autoridad** United States Department of State
**Endorsements / Mentions Spéciales / Anotaciones** SEE PAGE 51
**USA**
P<USABOBAMA<<MICHELLE<<<<<<<<<<<<<<<<<<<<<<<<
9102392482USA6401171F1812051900781200<129676
**USA**
================================================
FILE: shared/outputs/0038.md
================================================
# We the People
**Of the United States,**
in Order to form a more perfect Union,
establish Justice, insure domestic Tranquility,
provide for the common defence,
promote the general Welfare, and secure
the Blessings of Liberty to ourselves and
our Posterity, do ordain and establish this
Constitution for the United States of America.
**SIGNATURE OF BEARER / SIGNATURE DU TITULAIRE / FIRMA DEL TITULAR**
---
**UNITED STATES OF AMERICA**
**PASSPORT**
**PASSEPORT**
**PASAPORTE**
**Type / Type / Tipo** P
**Code / Code / Código** USA
**Passport No. / No de Passeport / No. de Pasaporte** 488839667
**Surname / Nom / Apellidos** VOLD
**Given Names / Prénoms / Nombres** STEPHEN HANSL
**Nationality / Nationalité / Nacionalidad** UNITED STATES OF AMERICA
**Date of birth / Date de naissance / Fecha de nacimiento** 15 Aug 1960
**Place of birth / Lieu de naissance / Lugar de nacimiento** WASHINGTON, U.S.A.
**Sex / Sexe / Sexo** M
**Date of issue / Date de délivrance / Fecha de expedición** 21 May 2012
**Date of expiration / Date d'expiration / Fecha de caducidad** 20 May 2022
**Authority / Autorité / Autoridad** United States Department of State
**Endorsements / Mentions Spéciales / Anotaciones** SEE PAGE 51
**USA**
P<USAVOLD<<STEPHEN<HANSL<<<<<<<<<<<<<<<<<<<<
4888396671USA6008156M220520112117147143<509936
**USA**
================================================
FILE: shared/outputs/0039.md
================================================
# We the People
**Of the United States,**
in Order to form a more perfect Union,
establish Justice, insure domestic Tranquility,
provide for the common defence,
promote the general Welfare, and secure
the Blessings of Liberty to ourselves and
our Posterity, do ordain and establish this
Constitution for the United States of America.
[Signature]
**SIGNATURE OF BEARER / SIGNATURE DU TITULAIRE / FIRMA DEL TITULAR**
---
**UNITED STATES OF AMERICA**
**PASSPORT**
**PASSEPORT**
**PASAPORTE**
**Type / Type / Tipo** P
**Code / Code / Código** USA
**Passport No. / No de Passeport / No. de Pasaporte** 963545637
**Surname / Nom / Apellidos** JOHN
**Given Names / Prénoms / Nombres** DOE
**Nationality / Nationalité / Nacionalidad** USA
**Date of birth / Date de naissance / Fecha de nacimiento** 15 Mar 1996
**Place of birth / Lieu de naissance / Lugar de nacimiento** CALIFORNIA, U.S.A
**Sex / Sexe / Sexo** M
**Date of issue / Date de délivrance / Fecha de expedición** 14 Apr 2017
**Date of expiration / Date d'expiration / Fecha de caducidad** 14 Apr 2027
**Authority / Autorité / Autoridad** United States Department of State
**Endorsements / Mentions Spéciales / Anotaciones** SEE PAGE 17
**USA**
P<USAJOHN<<DOE<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
9635456374USA9603150M27041402O2113962<804330
**USA**
================================================
FILE: shared/outputs/0040.md
================================================
**THIS PAGE IS RESERVED FOR OFFICIAL OBSERVATIONS**
**CETTE PAGE EST RÉSERVÉE AUX OBSERVATIONS OFFICIELLES (11)**
**THERE ARE NO OFFICIAL OBSERVATIONS**
---
**UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND**
**PASSPORT**
**PASSEPORT**
**Type/Type** P
**Code/Code** GBR
**Passport No./Passeport No.** 925600253
**Surname/Nom (1)** UK SPECIMEN
**Given names/Prénoms (2)** ANGELA ZOE
**Nationality/Nationalité (3)** BRITISH CITIZEN
**Date of birth/Date de naissance (4)** 11 SEP / SEPT 88
**Sex/Sexe (5)** F
**Place of birth/Lieu de naissance (6)** CROYDON
**Date of issue/Date de délivrance (7)** 16 JUL / JUIL 10
**Authority/Autorité (8)** IPS
**Date of expiry/Date d’expiration (9)** 16 JUL / JUIL 20
**Holder's signature/Signature du titulaire (10)** A Specimen
P<GBRUK<SPECIMEN<<ANGELA<ZOE<<<<<<<<<<<<<<<<
9256002538GBR8809117F2007162<<<<<<<<<<<<<<06
================================================
FILE: .github/workflows/python-publish.yml
================================================
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python#publishing-to-package-registries
# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.
name: Deploy Python Package
on:
release:
types: [published]
permissions:
contents: read
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v3
with:
python-version: '3.x'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install build
- name: Build package
run: python -m build
- name: Publish package
uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
with:
user: __token__
password: ${{ secrets.PYPI_API_TOKEN }}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment