Directory structure:
└── getomni-ai-zerox/
    ├── README.md
    ├── commitlint.config.js
    ├── jest.config.js
    ├── LICENSE
    ├── Makefile
    ├── MANIFEST.in
    ├── package.json
    ├── pyproject.toml
    ├── setup.cfg
    ├── setup.py
    ├── .editorconfig
    ├── .npmignore
    ├── .pre-commit-config.yaml
    ├── assets/
    │   └── cs101.md
    ├── examples/
    │   └── node/
    │       ├── azure.ts
    │       ├── bedrock.ts
    │       ├── google.ts
    │       └── openai.ts
    ├── node-zerox/
    │   ├── tsconfig.json
    │   ├── scripts/
    │   │   └── install-dependencies.js
    │   ├── src/
    │   │   ├── constants.ts
    │   │   ├── handleWarnings.ts
    │   │   ├── index.ts
    │   │   ├── types.ts
    │   │   ├── models/
    │   │   │   ├── azure.ts
    │   │   │   ├── bedrock.ts
    │   │   │   ├── google.ts
    │   │   │   ├── index.ts
    │   │   │   └── openAI.ts
    │   │   └── utils/
    │   │       ├── common.ts
    │   │       ├── file.ts
    │   │       ├── image.ts
    │   │       ├── index.ts
    │   │       ├── model.ts
    │   │       └── tesseract.ts
    │   └── tests/
    │       ├── README.md
    │       ├── index.ts
    │       ├── performance.test.ts
    │       ├── utils.ts
    │       └── data/
    ├── py_zerox/
    │   ├── pyzerox/
    │   │   ├── __init__.py
    │   │   ├── constants/
    │   │   │   ├── __init__.py
    │   │   │   ├── conversion.py
    │   │   │   ├── messages.py
    │   │   │   ├── patterns.py
    │   │   │   └── prompts.py
    │   │   ├── core/
    │   │   │   ├── __init__.py
    │   │   │   ├── types.py
    │   │   │   └── zerox.py
    │   │   ├── errors/
    │   │   │   ├── __init__.py
    │   │   │   ├── base.py
    │   │   │   └── exceptions.py
    │   │   ├── models/
    │   │   │   ├── __init__.py
    │   │   │   ├── base.py
    │   │   │   ├── modellitellm.py
    │   │   │   └── types.py
    │   │   └── processor/
    │   │       ├── __init__.py
    │   │       ├── image.py
    │   │       ├── pdf.py
    │   │       ├── text.py
    │   │       └── utils.py
    │   ├── scripts/
    │   │   ├── __init__.py
    │   │   └── pre_install.py
    │   └── tests/
    │       └── test_noop.py
    ├── shared/
    │   ├── systemPrompt.txt
    │   ├── test.json
    │   ├── inputs/
    │   └── outputs/
    │       ├── 0001.md
    │       ├── 0002.md
    │       ├── 0003.md
    │       ├── 0004.md
    │       ├── 0005.md
    │       ├── 0006.md
    │       ├── 0007.md
    │       ├── 0008.md
    │       ├── 0009.md
    │       ├── 0010.md
    │       ├── 0011.md
    │       ├── 0012.md
    │       ├── 0013.md
    │       ├── 0014.md
    │       ├── 0015.md
    │       ├── 0016.md
    │       ├── 0017.md
    │       ├── 0018.md
    │       ├── 0019.md
    │       ├── 0020.md
    │       ├── 0021.md
    │       ├── 0022.md
    │       ├── 0023.md
    │       ├── 0024.md
    │       ├── 0025.md
    │       ├── 0026.md
    │       ├── 0027.md
    │       ├── 0028.md
    │       ├── 0029.md
    │       ├── 0030.md
    │       ├── 0031.md
    │       ├── 0032.md
    │       ├── 0033.md
    │       ├── 0034.md
    │       ├── 0035.md
    │       ├── 0036.md
    │       ├── 0037.md
    │       ├── 0038.md
    │       ├── 0039.md
    │       └── 0040.md
    └── .github/
        └── workflows/
            └── python-publish.yml
================================================
FILE: README.md
================================================
## Zerox OCR

<a href="https://discord.gg/smg2QfwtJ6">
  <img src="https://github.com/user-attachments/assets/cccc0e9a-e3b2-425e-9b54-e5024681b129" alt="Join us on Discord" width="200px">
</a>

A dead simple way of OCR-ing a document for AI ingestion. Documents are, after all, meant to be consumed visually. With weird layouts, tables, charts, and the like, vision models just make sense!
The general logic:

- Pass in a file (PDF, DOCX, image, etc.)
- Convert that file into a series of images
- Pass each image to GPT and ask nicely for Markdown
- Aggregate the responses and return Markdown

Try out the hosted version here: <https://getomni.ai/ocr-demo>

Or visit our full documentation at: <https://docs.getomni.ai/zerox>
## Getting Started

Zerox is available as both a Node and Python package.

- [Node README](#node-zerox) - [npm package](https://www.npmjs.com/package/zerox)
- [Python README](#python-zerox) - [pip package](https://pypi.org/project/py-zerox/)

| Feature                   | Node.js                      | Python                     |
| ------------------------- | ---------------------------- | -------------------------- |
| PDF Processing            | ✓ (requires graphicsmagick)  | ✓ (requires poppler)       |
| Image Processing          | ✓                            | ✓                          |
| OpenAI Support            | ✓                            | ✓                          |
| Azure OpenAI Support      | ✓                            | ✓                          |
| AWS Bedrock Support       | ✓                            | ✓                          |
| Google Gemini Support     | ✓                            | ✓                          |
| Vertex AI Support         | ✗                            | ✓                          |
| Data Extraction           | ✓ (`schema`)                 | ✗                          |
| Per-page Extraction       | ✓ (`extractPerPage`)         | ✗                          |
| Custom System Prompts     | ✗                            | ✓ (`custom_system_prompt`) |
| Maintain Format Option    | ✓ (`maintainFormat`)         | ✓ (`maintain_format`)      |
| Async API                 | ✓                            | ✓                          |
| Error Handling Modes      | ✓ (`errorMode`)              | ✗                          |
| Concurrent Processing     | ✓ (`concurrency`)            | ✓ (`concurrency`)          |
| Temp Directory Management | ✓ (`tempDir`)                | ✓ (`temp_dir`)             |
| Page Selection            | ✓ (`pagesToConvertAsImages`) | ✓ (`select_pages`)         |
| Orientation Correction    | ✓ (`correctOrientation`)     | ✗                          |
| Edge Trimming             | ✓ (`trimEdges`)              | ✗                          |
## Node Zerox

(Node.js SDK - supports vision models from different providers like OpenAI, Azure OpenAI, Anthropic, AWS Bedrock, Google Gemini, etc.)

### Installation

```sh
npm install zerox
```

Zerox uses `graphicsmagick` and `ghostscript` for the PDF => image processing step. These should be installed automatically, but you may need to install them manually.

On Linux, use:

```sh
sudo apt-get update
sudo apt-get install -y graphicsmagick
```
### Usage

**With file URL**

```ts
import { zerox } from "zerox";

const result = await zerox({
  filePath: "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf",
  credentials: {
    apiKey: process.env.OPENAI_API_KEY,
  },
});
```

**From local path**

```ts
import { zerox } from "zerox";
import path from "path";

const result = await zerox({
  filePath: path.resolve(__dirname, "./cs101.pdf"),
  credentials: {
    apiKey: process.env.OPENAI_API_KEY,
  },
});
```
### Parameters

```ts
const result = await zerox({
  // Required
  filePath: "path/to/file",
  credentials: {
    apiKey: "your-api-key",
    // Additional provider-specific credentials as needed
  },

  // Optional
  cleanup: true, // Clear images from tmp after run
  concurrency: 10, // Number of pages to run at a time
  correctOrientation: true, // True by default, attempts to identify and correct page orientation
  directImageExtraction: false, // Extract data directly from document images instead of the markdown
  errorMode: ErrorMode.IGNORE, // ErrorMode.THROW or ErrorMode.IGNORE, defaults to ErrorMode.IGNORE
  extractionPrompt: "", // LLM instructions for extracting data from document
  extractOnly: false, // Set to true to only extract structured data using a schema
  extractPerPage, // Extract data per page instead of the entire document
  imageDensity: 300, // DPI for image conversion
  imageHeight: 2048, // Maximum height for converted images
  llmParams: {}, // Additional parameters to pass to the LLM
  maintainFormat: false, // Slower but helps maintain consistent formatting
  maxImageSize: 15, // Maximum size of images to compress, defaults to 15MB
  maxRetries: 1, // Number of retries to attempt on a failed page, defaults to 1
  maxTesseractWorkers: -1, // Maximum number of Tesseract workers. Zerox will start with a lower number and only reach maxTesseractWorkers if needed
  model: ModelOptions.OPENAI_GPT_4O, // Model to use (supports various models from different providers)
  modelProvider: ModelProvider.OPENAI, // Choose from OPENAI, BEDROCK, GOOGLE, or AZURE
  outputDir: undefined, // Save combined result.md to a file
  pagesToConvertAsImages: -1, // Page numbers to convert to image as array (e.g. `[1, 2, 3]`) or a number (e.g. `1`). Set to -1 to convert all pages
  prompt: "", // LLM instructions for processing the document
  schema: undefined, // Schema for structured data extraction
  tempDir: "/os/tmp", // Directory to use for temporary files (default: system temp directory)
  trimEdges: true, // True by default, trims pixels from all edges that contain values similar to the given background color, which defaults to that of the top-left pixel
});
```
The `maintainFormat` option tries to return the markdown in a consistent format by passing the output of a prior page in as additional context for the next page. This requires the requests to run synchronously, so it's a lot slower, but it's valuable if your documents have a lot of tabular data or frequently have tables that cross pages.

```
Request #1 => page_1_image
Request #2 => page_1_markdown + page_2_image
Request #3 => page_2_markdown + page_3_image
```
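
For reference, enabling it is just a flag on the normal call. Here's a minimal sketch (the file path is a placeholder):

```ts
import { zerox } from "zerox";

// Pages run sequentially so each request can see the prior page's markdown
const result = await zerox({
  filePath: "path/to/multi-page-tables.pdf", // hypothetical input file
  maintainFormat: true, // pass prior-page markdown as context for the next page
  credentials: {
    apiKey: process.env.OPENAI_API_KEY,
  },
});
```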
### Example Output

```js
{
  completionTime: 10038,
  fileName: 'invoice_36258',
  inputTokens: 25543,
  outputTokens: 210,
  pages: [
    {
      page: 1,
      content: '# INVOICE # 36258\n' +
        '**Date:** Mar 06 2012 \n' +
        '**Ship Mode:** First Class \n' +
        '**Balance Due:** $50.10 \n' +
        '## Bill To:\n' +
        'Aaron Bergman \n' +
        '98103, Seattle, \n' +
        'Washington, United States \n' +
        '## Ship To:\n' +
        'Aaron Bergman \n' +
        '98103, Seattle, \n' +
        'Washington, United States \n' +
        '\n' +
        '| Item | Quantity | Rate | Amount |\n' +
        '|--------------------------------------------|----------|--------|---------|\n' +
        "| Global Push Button Manager's Chair, Indigo | 1 | $48.71 | $48.71 |\n" +
        '| Chairs, Furniture, FUR-CH-4421 | | | |\n' +
        '\n' +
        '**Subtotal:** $48.71 \n' +
        '**Discount (20%):** $9.74 \n' +
        '**Shipping:** $11.13 \n' +
        '**Total:** $50.10 \n' +
        '---\n' +
        '**Notes:** \n' +
        'Thanks for your business! \n' +
        '**Terms:** \n' +
        'Order ID : CA-2012-AB10015140-40974 ',
      contentLength: 747,
    }
  ],
  extracted: null,
  summary: {
    totalPages: 1,
    ocr: {
      failed: 0,
      successful: 1,
    },
    extracted: null,
  },
}
```
### Data Extraction

Zerox supports structured data extraction from documents using a schema. This lets you pull specific information from a document in a structured format instead of getting the full markdown conversion.

Set `extractOnly: true` and provide a `schema` to extract structured data. The schema follows the [JSON Schema standard](https://json-schema.org/understanding-json-schema/).

Use `extractPerPage` to extract data per page instead of from the whole document at once.

You can also set `extractionModel`, `extractionModelProvider`, and `extractionCredentials` to use a different model for extraction than the one used for OCR. By default, the same model is used.
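
For a quick feel of the API, here's a minimal extraction sketch. The schema fields and file path are made up for illustration; see the files under `examples/node/` later in this dump for complete versions:

```ts
import { zerox } from "zerox";

// Hypothetical schema: pull an invoice number and total from a document
const schema = {
  type: "object",
  properties: {
    invoice_number: { type: "string" },
    total: { type: "number" },
  },
  required: ["invoice_number", "total"],
};

const result = await zerox({
  filePath: "path/to/invoice.pdf", // placeholder path
  credentials: { apiKey: process.env.OPENAI_API_KEY },
  extractOnly: true, // skip the markdown pass and only extract structured data
  schema,
});

console.log(result.extracted);
```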
### Supported Models

Zerox supports a wide range of models across different providers:

- **Azure OpenAI**
  - GPT-4 Vision (gpt-4o)
  - GPT-4 Vision Mini (gpt-4o-mini)
  - GPT-4.1 (gpt-4.1)
  - GPT-4.1 Mini (gpt-4.1-mini)
- **OpenAI**
  - GPT-4 Vision (gpt-4o)
  - GPT-4 Vision Mini (gpt-4o-mini)
  - GPT-4.1 (gpt-4.1)
  - GPT-4.1 Mini (gpt-4.1-mini)
- **AWS Bedrock**
  - Claude 3 Haiku (2024.03, 2024.10)
  - Claude 3 Sonnet (2024.02, 2024.06, 2024.10)
  - Claude 3 Opus (2024.02)
- **Google Gemini**
  - Gemini 1.5 (Flash, Flash-8B, Pro)
  - Gemini 2.0 (Flash, Flash-Lite)
```ts
import { zerox } from "zerox";
import { ModelOptions, ModelProvider } from "zerox/node-zerox/dist/types";

// OpenAI
const openaiResult = await zerox({
  filePath: "path/to/file.pdf",
  modelProvider: ModelProvider.OPENAI,
  model: ModelOptions.OPENAI_GPT_4O,
  credentials: {
    apiKey: process.env.OPENAI_API_KEY,
  },
});

// Azure OpenAI
const azureResult = await zerox({
  filePath: "path/to/file.pdf",
  modelProvider: ModelProvider.AZURE,
  model: ModelOptions.OPENAI_GPT_4O,
  credentials: {
    apiKey: process.env.AZURE_API_KEY,
    endpoint: process.env.AZURE_ENDPOINT,
  },
});

// AWS Bedrock
const bedrockResult = await zerox({
  filePath: "path/to/file.pdf",
  modelProvider: ModelProvider.BEDROCK,
  model: ModelOptions.BEDROCK_CLAUDE_3_SONNET_2024_10,
  credentials: {
    accessKeyId: process.env.AWS_ACCESS_KEY_ID,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
    region: process.env.AWS_REGION,
  },
});

// Google Gemini
const geminiResult = await zerox({
  filePath: "path/to/file.pdf",
  modelProvider: ModelProvider.GOOGLE,
  model: ModelOptions.GOOGLE_GEMINI_1_5_PRO,
  credentials: {
    apiKey: process.env.GEMINI_API_KEY,
  },
});
```
## Python Zerox

(Python SDK - supports vision models from different providers like OpenAI, Azure OpenAI, Anthropic, AWS Bedrock, etc.)

### Installation

- Install **poppler** on the system and make sure it is available on the PATH. See the [pdf2image documentation](https://pdf2image.readthedocs.io/en/latest/installation.html) for instructions by platform.
- Install py-zerox:

```sh
pip install py-zerox
```

The `pyzerox.zerox` function is an asynchronous API that performs OCR (Optical Character Recognition) to markdown using vision models. It processes PDF files and converts them into markdown format. Make sure to set up the environment variables for the model and the model provider before using this API.

Refer to the [LiteLLM Documentation](https://docs.litellm.ai/docs/providers) for setting up the environment and passing the correct model name.
### Usage

```python
from pyzerox import zerox
import os
import json
import asyncio

### Model Setup (Use only Vision Models) Refer: https://docs.litellm.ai/docs/providers ###

## placeholder for additional model kwargs which might be required for some models
kwargs = {}

## system prompt to use for the vision model
custom_system_prompt = None

# to override
# custom_system_prompt = "For the below PDF page, do something..something..."  ## example

###################### Example for OpenAI ######################
model = "gpt-4o-mini"  ## openai model
os.environ["OPENAI_API_KEY"] = ""  ## your-api-key

###################### Example for Azure OpenAI ######################
model = "azure/gpt-4o-mini"  ## "azure/<your_deployment_name>" -> format <provider>/<model>
os.environ["AZURE_API_KEY"] = ""  # "your-azure-api-key"
os.environ["AZURE_API_BASE"] = ""  # "https://example-endpoint.openai.azure.com"
os.environ["AZURE_API_VERSION"] = ""  # "2023-05-15"

###################### Example for Gemini ######################
model = "gemini/gemini-1.5-pro"  ## "gemini/<gemini_model>" -> format <provider>/<model>
os.environ["GEMINI_API_KEY"] = ""  # your-gemini-api-key

###################### Example for Anthropic ######################
model = "claude-3-opus-20240229"
os.environ["ANTHROPIC_API_KEY"] = ""  # your-anthropic-api-key

###################### Vertex AI ######################
model = "vertex_ai/gemini-1.5-flash-001"  ## "vertex_ai/<model_name>" -> format <provider>/<model>

## GET CREDENTIALS
## RUN ##
# !gcloud auth application-default login - run this to add vertex credentials to your env
## OR ##
file_path = "path/to/vertex_ai_service_account.json"

# Load the JSON file
with open(file_path, "r") as file:
    vertex_credentials = json.load(file)

# Convert to JSON string
vertex_credentials_json = json.dumps(vertex_credentials)
vertex_credentials = vertex_credentials_json

## extra args
kwargs = {"vertex_credentials": vertex_credentials}

###################### For other providers refer: https://docs.litellm.ai/docs/providers ######################

# Define main async entrypoint
async def main():
    file_path = "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf"  ## local filepath and file URL supported

    ## process only some pages or all
    select_pages = None  ## None for all, but could be int or list(int) page numbers (1 indexed)

    output_dir = "./output_test"  ## directory to save the consolidated markdown file
    result = await zerox(file_path=file_path, model=model, output_dir=output_dir,
                         custom_system_prompt=custom_system_prompt, select_pages=select_pages, **kwargs)
    return result

# run the main function:
result = asyncio.run(main())

# print markdown result
print(result)
```
### Parameters

```python
async def zerox(
    cleanup: bool = True,
    concurrency: int = 10,
    file_path: Optional[str] = "",
    maintain_format: bool = False,
    model: str = "gpt-4o-mini",
    output_dir: Optional[str] = None,
    temp_dir: Optional[str] = None,
    custom_system_prompt: Optional[str] = None,
    select_pages: Optional[Union[int, Iterable[int]]] = None,
    **kwargs
) -> ZeroxOutput:
    ...
```

Parameters

- **cleanup** (bool, optional):
  Whether to clean up temporary files after processing. Defaults to True.
- **concurrency** (int, optional):
  The number of concurrent processes to run. Defaults to 10.
- **file_path** (Optional[str], optional):
  The path to the PDF file to process. Defaults to an empty string.
- **maintain_format** (bool, optional):
  Whether to maintain the format from the previous page. Defaults to False.
- **model** (str, optional):
  The model to use for generating completions. Defaults to "gpt-4o-mini".
  Refer to LiteLLM Providers for the correct model name, as it may differ depending on the provider.
- **output_dir** (Optional[str], optional):
  The directory to save the markdown output. Defaults to None.
- **temp_dir** (str, optional):
  The directory to store temporary files. Defaults to a named folder in the system's temp directory. If it already exists, its contents will be deleted before Zerox uses it.
- **custom_system_prompt** (str, optional):
  The system prompt to use for the model; this overrides Zerox's default system prompt. Generally it is not required unless you want some specific behavior. Defaults to None.
- **select_pages** (Optional[Union[int, Iterable[int]]], optional):
  Pages to process; can be a single page number or an iterable of page numbers. Defaults to None.
- **kwargs** (dict, optional):
  Additional keyword arguments to pass to the litellm.completion method.
  Refer to the LiteLLM Documentation and Completion Input for details.

Returns

- ZeroxOutput:
  Contains the markdown content generated by the model and also some metadata (refer below).
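
As a quick illustration of these parameters, here's a minimal sketch that processes only the first two pages and forwards an extra LiteLLM argument via `**kwargs` (the file path and the `temperature` value are illustrative):

```python
import asyncio
from pyzerox import zerox

async def run():
    # select_pages takes a 1-indexed page number or an iterable of them;
    # temperature is forwarded to litellm.completion via **kwargs
    return await zerox(
        file_path="path/to/document.pdf",  # placeholder path
        model="gpt-4o-mini",
        select_pages=[1, 2],
        temperature=0,
    )

result = asyncio.run(run())
print(result.pages[0].content)
```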
### Example Output (output from "azure/gpt-4o-mini")

Note: the output below is wrapped manually in this documentation for readability.

````Python
ZeroxOutput(
    completion_time=9432.975,
    file_name='cs101',
    input_tokens=36877,
    output_tokens=515,
    pages=[
        Page(
            content='| Type | Description | Wrapper Class |\n' +
                '|---------|--------------------------------------|---------------|\n' +
                '| byte | 8-bit signed 2s complement integer | Byte |\n' +
                '| short | 16-bit signed 2s complement integer | Short |\n' +
                '| int | 32-bit signed 2s complement integer | Integer |\n' +
                '| long | 64-bit signed 2s complement integer | Long |\n' +
                '| float | 32-bit IEEE 754 floating point number| Float |\n' +
                '| double | 64-bit floating point number | Double |\n' +
                '| boolean | may be set to true or false | Boolean |\n' +
                '| char | 16-bit Unicode (UTF-16) character | Character |\n\n' +
                'Table 26.2.: Primitive types in Java\n\n' +
                '### 26.3.1. Declaration & Assignment\n\n' +
                'Java is a statically typed language meaning that all variables must be declared before you can use ' +
                'them or refer to them. In addition, when declaring a variable, you must specify both its type and ' +
                'its identifier. For example:\n\n' +
                '```java\n' +
                'int numUnits;\n' +
                'double costPerUnit;\n' +
                'char firstInitial;\n' +
                'boolean isStudent;\n' +
                '```\n\n' +
                'Each declaration specifies the variable’s type followed by the identifier and ending with a ' +
                'semicolon. The identifier rules are fairly standard: a name can consist of lowercase and ' +
                'uppercase alphabetic characters, numbers, and underscores but may not begin with a numeric ' +
                'character. We adopt the modern camelCasing naming convention for variables in our code. In ' +
                'general, variables must be assigned a value before you can use them in an expression. You do not ' +
                'have to immediately assign a value when you declare them (though it is good practice), but some ' +
                'value must be assigned before they can be used or the compiler will issue an error.\n\n' +
                'The assignment operator is a single equal sign, `=` and is a right-to-left assignment. That is, ' +
                'the variable that we wish to assign the value to appears on the left-hand-side while the value ' +
                '(literal, variable or expression) is on the right-hand-side. Using our variables from before, ' +
                'we can assign them values:\n\n' +
                '> 2 Instance variables, that is variables declared as part of an object do have default values. ' +
                'For objects, the default is `null`, for all numeric types, zero is the default value. For the ' +
                'boolean type, `false` is the default, and the default char value is `\\0`, the null-terminating ' +
                'character (zero in the ASCII table).',
            content_length=2333,
            page=1
        )
    ]
)
````
## Supported File Types

We use a combination of `libreoffice` and `graphicsmagick` to do document => image conversion. For non-image / non-PDF files, we use libreoffice to convert that file to a PDF, and then to an image.

```js
[
  "pdf", // Portable Document Format
  "doc", // Microsoft Word 97-2003
  "docx", // Microsoft Word 2007-2019
  "odt", // OpenDocument Text
  "ott", // OpenDocument Text Template
  "rtf", // Rich Text Format
  "txt", // Plain Text
  "html", // HTML Document
  "htm", // HTML Document (alternative extension)
  "xml", // XML Document
  "wps", // Microsoft Works Word Processor
  "wpd", // WordPerfect Document
  "xls", // Microsoft Excel 97-2003
  "xlsx", // Microsoft Excel 2007-2019
  "ods", // OpenDocument Spreadsheet
  "ots", // OpenDocument Spreadsheet Template
  "csv", // Comma-Separated Values
  "tsv", // Tab-Separated Values
  "ppt", // Microsoft PowerPoint 97-2003
  "pptx", // Microsoft PowerPoint 2007-2019
  "odp", // OpenDocument Presentation
  "otp", // OpenDocument Presentation Template
];
```
## Credits

- [Litellm](https://github.com/BerriAI/litellm): <https://github.com/BerriAI/litellm> | This powers our Python SDK to support all popular vision models from different providers.

### License

This project is licensed under the MIT License.
================================================
FILE: commitlint.config.js
================================================

module.exports = {
  extends: [
    "@commitlint/config-conventional"
  ],
}
================================================
FILE: jest.config.js
================================================

/** @type {import('ts-jest').JestConfigWithTsJest} **/
module.exports = {
  preset: "ts-jest",
  testEnvironment: "node",
  moduleDirectories: ["node_modules"],
  transform: {
    "^.+\\.tsx?$": [
      "ts-jest",
      {
        tsconfig: "node-zerox/tsconfig.json",
      },
    ],
  },
};
================================================
FILE: LICENSE
================================================

The MIT License (MIT)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: Makefile
================================================

# Define the package directory for zerox
PACKAGE_DIR := py_zerox

# Define directory configs
VENV_DIR := .venv
DIST_DIR := ${PACKAGE_DIR}/dist
SRC_DIR := $(PACKAGE_DIR)/zerox
TEST_DIR := $(PACKAGE_DIR)/tests

# Define the build configs
POETRY_VERSION := 1.8.3
PYTHON_VERSION := 3.11
POETRY := poetry

# Test related configs
PYTEST_OPTIONS := -v

# Default target
.PHONY: all
all: venv build test dev

# Conditionally map the Python executable
ifeq ($(VIRTUAL_ENV),)
PYTHON := python$(PYTHON_VERSION)
else
PYTHON := python
endif

# Initialization
.PHONY: init
init:
	@echo "== Initializing Development Environment =="
	brew install node
	brew install pre-commit
	curl -sSL https://install.python-poetry.org | $(PYTHON) -
	@echo "== Installing Pre-Commit Hooks =="
	pre-commit install
	pre-commit autoupdate
	pre-commit install --install-hooks
	pre-commit install --hook-type commit-msg

# Create virtual environment if it doesn't exist
.PHONY: venv
venv: $(VENV_DIR)/bin/activate

$(VENV_DIR)/bin/activate:
	@echo "== Creating Virtual Environment =="
	$(PYTHON) -m venv $(VENV_DIR)
	. $(VENV_DIR)/bin/activate && pip install --upgrade pip setuptools wheel
	touch $(VENV_DIR)/bin/activate

# Resolve dependencies and build the package using SetupTools
.PHONY: build
build: venv
	@echo "== Resolving dependencies and building the package using SetupTools =="
	$(PYTHON) setup.py sdist --dist-dir $(DIST_DIR)

# Install test dependencies for test environment
.PHONY: install-test
install-test: venv
	@echo "== Resolving test dependencies =="
	$(POETRY) install --with test

# Test out the build
.PHONY: test
test: install-test
	@echo "== Triggering tests =="
	pytest $(TEST_DIR) $(PYTEST_OPTIONS) || (echo "Tests failed" && exit 1)

# Clean build artifacts
.PHONY: clean
clean:
	@echo "== Cleaning DIST_DIR and VENV_DIR =="
	rm -rf $(DIST_DIR)
	rm -rf $(VENV_DIR)

# Install dev dependencies for dev environment
.PHONY: install-dev
install-dev: venv build
	@echo "== Resolving development dependencies =="
	$(POETRY) install --with dev

# Package Development Build
.PHONY: dev
dev:
	@echo "== Preparing development build =="
	$(PYTHON) -m pip install -e .

.PHONY: check
check: install-dev lint format

.PHONY: lint
lint: venv
	@echo "== Running Linting =="
	$(VENV_DIR)/bin/ruff check $(SRC_DIR) $(TEST_DIR)

.PHONY: format
format: venv
	@echo "== Running Formatting =="
	$(VENV_DIR)/bin/black --check $(SRC_DIR) $(TEST_DIR)

.PHONY: fix
fix: install-dev lint-fix format-fix

.PHONY: lint-fix
lint-fix: venv
	@echo "== Running Linting =="
	$(VENV_DIR)/bin/ruff check --fix $(SRC_DIR) $(TEST_DIR)

.PHONY: format-fix
format-fix: venv
	@echo "== Running Formatting =="
	$(VENV_DIR)/bin/black $(SRC_DIR) $(TEST_DIR)
================================================
FILE: MANIFEST.in
================================================

include setup.py
include README.md
include LICENSE
recursive-include py_zerox/zerox *
recursive-include py_zerox/scripts *
================================================
FILE: package.json
================================================

{
  "name": "zerox",
  "version": "1.1.19",
  "description": "ocr documents using gpt-4o-mini",
  "main": "node-zerox/dist/index.js",
  "scripts": {
    "clean": "rm -rf node-zerox/dist",
    "build": "npm run clean && tsc -p node-zerox/tsconfig.json",
    "postinstall": "node node-zerox/scripts/install-dependencies.js",
    "prepublishOnly": "npm run build",
    "test": "ts-node node-zerox/tests/index.ts",
    "test:performance": "jest node-zerox/tests/performance.test.ts --runInBand"
  },
  "author": "tylermaran",
  "license": "MIT",
  "dependencies": {
    "@aws-sdk/client-bedrock-runtime": "^3.734.0",
    "@google/genai": "^0.9.0",
    "axios": "^1.7.2",
    "child_process": "^1.0.2",
    "file-type": "^16.5.4",
    "fs-extra": "^11.2.0",
    "heic-convert": "^2.1.0",
    "libreoffice-convert": "^1.6.0",
    "mime-types": "^2.1.35",
    "openai": "^4.82.0",
    "os": "^0.1.2",
    "p-limit": "^3.1.0",
    "path": "^0.12.7",
    "pdf-parse": "^1.1.1",
    "pdf2pic": "^3.1.1",
    "sharp": "^0.33.5",
    "tesseract.js": "^5.1.1",
    "util": "^0.12.5",
    "uuid": "^11.0.3",
    "xlsx": "^0.18.5"
  },
  "devDependencies": {
    "@types/fs-extra": "^11.0.4",
    "@types/heic-convert": "^2.1.0",
    "@types/jest": "^29.5.14",
    "@types/mime-types": "^2.1.4",
    "@types/node": "^20.14.11",
    "@types/pdf-parse": "^1.1.4",
    "@types/prompts": "^2.4.9",
    "@types/xlsx": "^0.0.35",
    "dotenv": "^16.4.5",
    "jest": "^29.7.0",
    "prompts": "^2.4.2",
    "ts-jest": "^29.2.5",
    "ts-node": "^10.9.2",
    "typescript": "^5.5.3"
  },
  "repository": {
    "type": "git",
    "url": "git+https://github.com/getomni-ai/zerox.git"
  },
  "keywords": [
    "ocr",
    "document",
    "llm"
  ],
  "types": "node-zerox/dist/index.d.ts",
  "bugs": {
    "url": "https://github.com/getomni-ai/zerox/issues"
  },
  "homepage": "https://github.com/getomni-ai/zerox#readme"
}
================================================
FILE: pyproject.toml
================================================

[tool.poetry]
name = "py-zerox"
version = "0.0.7"
description = "ocr documents using vision models from all popular providers like OpenAI, Azure OpenAI, Anthropic, AWS Bedrock etc"
authors = ["wizenheimer", "pradhyumna85"]
license = "MIT"
readme = "README.md"
packages = [{ include = "pyzerox", from = "py_zerox" }]
repository = "https://github.com/getomni-ai/zerox.git"
documentation = "https://github.com/getomni-ai/zerox"
keywords = ["ocr", "document", "llm"]
package-mode = false

[tool.poetry.dependencies]
python = "^3.11"
aiofiles = "^23.0"
aiohttp = "^3.9.5"
pdf2image = "^1.17.0"
litellm = "^1.44.15"
aioshutil = "^1.5"
pypdf2 = "^3.0.1"

[tool.poetry.scripts]
pre-install = "py_zerox.scripts.pre_install:check_and_install"

[tool.poetry.group.dev.dependencies]
notebook = "^7.2.1"
black = "^24.4.2"
ruff = "^0.5.5"

[tool.poetry.group.test.dependencies]
pytest = "^8.3.2"
================================================
FILE: setup.cfg
================================================

[metadata]
name = py-zerox
version = 0.0.7
description = ocr documents using vision models from all popular providers like OpenAI, Azure OpenAI, Anthropic, AWS Bedrock etc
long_description = file: README.md
long_description_content_type = text/markdown
author = wizenheimer, pradhyumna85
license = MIT
license_file = LICENSE
classifiers =
    License :: OSI Approved :: MIT License
    Programming Language :: Python :: 3
    Programming Language :: Python :: 3.11

[options]
package_dir =
    = py_zerox
packages = find:
python_requires = >=3.11
install_requires =
    aiofiles>=23.0
    aiohttp>=3.9.5
    pdf2image>=1.17.0
    litellm>=1.44.15
    aioshutil>=1.5
    PyPDF2>=3.0.1

[options.packages.find]
where = py_zerox.pyzerox

[options.entry_points]
console_scripts =
    py-zerox-pre-install = py_zerox.scripts.pre_install:check_and_install
================================================
FILE: setup.py
================================================

from setuptools import setup, find_packages
from setuptools.command.install import install
import subprocess
import sys


class InstallSystemDependencies(install):
    def run(self):
        try:
            subprocess.check_call(
                [sys.executable, "-m", "py_zerox.scripts.pre_install"])
        except subprocess.CalledProcessError as e:
            print(f"Pre-install script failed: {e}", file=sys.stderr)
            sys.exit(1)
        install.run(self)


setup(
    name="py-zerox",
    cmdclass={
        "install": InstallSystemDependencies,
    },
    version="0.0.7",
    packages=find_packages(where="py_zerox"),  # Specify the root folder
    package_dir={"": "py_zerox"},  # Map root directory
    include_package_data=True,
)
================================================
FILE: .editorconfig
================================================

# EditorConfig is awesome: https://EditorConfig.org

# top-most EditorConfig file
root = true

[*]
indent_style = space
indent_size = 4
end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = false

[{*.yaml,*.yml}]
indent_size = 2
ij_yaml_keep_indents_on_empty_lines = false
ij_yaml_keep_line_breaks = true

[Makefile]
indent_style = tab

[*.py]
indent_size = 4

[{*.js,*.ts,*.md,*.json}]
indent_size = 2
================================================
FILE: .npmignore
================================================

# Folders
node-zerox/src/
node-zerox/tests/
py_zerox/
assets/
shared/

# Config files
.pre-commit-config.yaml
.editorconfig
MANIFEST.in
commitlint.config.js
poetry.lock
pyproject.toml
setup.cfg
setup.py
Makefile
eng.traineddata
.env

# File types
*.ts

# Keep type declarations
!.gitignore
!node-zerox/dist/**/*.d.ts
================================================
FILE: .pre-commit-config.yaml
================================================

repos:
  # pre-commit hooks for testing the files
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: "v4.6.0"
    hooks:
      - id: check-added-large-files
      - id: no-commit-to-branch
      - id: check-toml
      - id: check-yaml
      - id: check-json
      - id: check-xml
      - id: end-of-file-fixer
        exclude: \.json$
        files: \.py$
      - id: trailing-whitespace
      - id: mixed-line-ending

  # for formatting
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black
        args: ["--line-length=100"]
        language_version: python3

  # for linting & style checks
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.5.5
    hooks:
      - id: ruff
        args: ["--fix"]
================================================
FILE: assets/cs101.md
================================================

| Type    | Description                           | Wrapper Class |
| ------- | ------------------------------------- | ------------- |
| byte    | 8-bit signed 2s complement integer    | Byte          |
| short   | 16-bit signed 2s complement integer   | Short         |
| int     | 32-bit signed 2s complement integer   | Integer       |
| long    | 64-bit signed 2s complement integer   | Long          |
| float   | 32-bit IEEE 754 floating point number | Float         |
| double  | 64-bit floating point number          | Double        |
| boolean | may be set to true or false           | Boolean       |
| char    | 16-bit Unicode (UTF-16) character     | Character     |

Table 26.2.: Primitive types in Java

### 26.3.1. Declaration & Assignment

Java is a statically typed language meaning that all variables must be declared before you can use them or refer to them. In addition, when declaring a variable, you must specify both its type and its identifier. For example:

```java
int numUnits;
double costPerUnit;
char firstInitial;
boolean isStudent;
```

Each declaration specifies the variable’s type followed by the identifier ending with a semicolon. The identifier rules are fairly standard: a name can consist of lowercase and uppercase alphabetic characters, numbers, and underscores but may not begin with a numeric character. We adopt the modern camelCasing naming convention for variables in our code. In general, variables must be assigned a value before you can use them in an expression. You do not have to immediately assign a value when you declare them (though it is good practice), but some value must be assigned before they can be used or the compiler will issue an error.

The assignment operator is a single equal sign, `=` and is a right-to-left assignment. That is, the variable that we wish to assign the value to appears on the left-hand-side while the value (literal, variable or expression) is on the right-hand-side. Using our variables from before, we can assign them values:

```
2Instance variables, that is variables declared as part of an object do have default values. For objects, the default is `null`, for all numeric types, zero is the default value. For the `boolean` type, `false` is the default, and the default `char` value is `\0`, the null-terminating character (zero in the ASCII table).
```

```
391
```
================================================
FILE: examples/node/azure.ts
================================================

import { ModelOptions, ModelProvider } from "zerox/node-zerox/dist/types";
import { zerox } from "zerox";

/**
 * Example using Azure OpenAI with Zerox to extract structured data from documents.
 * This shows extraction setup with schema definition for a property report document.
 */
async function main() {
  // Define the schema for property report data extraction
  const schema = {
    type: "object",
    properties: {
      commercial_office: {
        type: "object",
        properties: {
          average: { type: "string" },
          median: { type: "string" },
        },
        required: ["average", "median"],
      },
      transactions_by_quarter: {
        type: "array",
        items: {
          type: "object",
          properties: {
            quarter: { type: "string" },
            transactions: { type: "integer" },
          },
          required: ["quarter", "transactions"],
        },
      },
      year: { type: "integer" },
    },
    required: ["commercial_office", "transactions_by_quarter", "year"],
  };

  try {
    const result = await zerox({
      credentials: {
        apiKey: process.env.AZURE_API_KEY || "",
        endpoint: process.env.AZURE_ENDPOINT || "",
      },
      extractOnly: true, // Skip OCR, only perform extraction (defaults to false)
      filePath:
        "https://omni-demo-data.s3.amazonaws.com/test/property_report.png",
      model: ModelOptions.OPENAI_GPT_4O,
      modelProvider: ModelProvider.AZURE,
      schema,
    });
    console.log("Extracted data:", result.extracted);
  } catch (error) {
    console.error("Error extracting data:", error);
  }
}

main();
================================================
FILE: examples/node/bedrock.ts
================================================

import { ModelOptions, ModelProvider } from "zerox/node-zerox/dist/types";
import { zerox } from "zerox";

/**
 * Example using Bedrock Anthropic with Zerox to extract structured data from documents.
 * This shows extraction setup with schema definition for a property report document.
 */
async function main() {
  // Define the schema for property report data extraction
  const schema = {
    type: "object",
    properties: {
      commercial_office: {
        type: "object",
        properties: {
          average: { type: "string" },
          median: { type: "string" },
        },
        required: ["average", "median"],
      },
      transactions_by_quarter: {
        type: "array",
        items: {
          type: "object",
          properties: {
            quarter: { type: "string" },
            transactions: { type: "integer" },
          },
          required: ["quarter", "transactions"],
        },
      },
      year: { type: "integer" },
    },
    required: ["commercial_office", "transactions_by_quarter", "year"],
  };

  try {
    const result = await zerox({
      credentials: {
        accessKeyId: process.env.ACCESS_KEY_ID,
        region: process.env.REGION || "us-east-1",
        secretAccessKey: process.env.SECRET_ACCESS_KEY,
      },
      extractOnly: true, // Skip OCR, only perform extraction (defaults to false)
      filePath:
        "https://omni-demo-data.s3.amazonaws.com/test/property_report.png",
      model: ModelOptions.BEDROCK_CLAUDE_3_HAIKU_2024_03,
      modelProvider: ModelProvider.BEDROCK,
      schema,
    });
    console.log("Extracted data:", result.extracted);
  } catch (error) {
    console.error("Error extracting data:", error);
  }
}

main();
================================================
FILE: examples/node/google.ts
================================================

import { ModelOptions, ModelProvider } from "zerox/node-zerox/dist/types";
import { zerox } from "zerox";

/**
 * Example using Google Gemini with Zerox to extract structured data from documents.
 * This shows extraction setup with schema definition for a property report document.
 */
async function main() {
  // Define the schema for property report data extraction
  const schema = {
    type: "object",
    properties: {
      commercial_office: {
        type: "object",
        properties: {
          average: { type: "string" },
          median: { type: "string" },
        },
        required: ["average", "median"],
      },
      transactions_by_quarter: {
        type: "array",
        items: {
          type: "object",
          properties: {
            quarter: { type: "string" },
            transactions: { type: "integer" },
          },
          required: ["quarter", "transactions"],
        },
      },
      year: { type: "integer" },
    },
    required: ["commercial_office", "transactions_by_quarter", "year"],
  };

  try {
    const result = await zerox({
      credentials: {
        apiKey: process.env.GEMINI_API_KEY || "",
      },
      extractOnly: true, // Skip OCR, only perform extraction (defaults to false)
      filePath:
        "https://omni-demo-data.s3.amazonaws.com/test/property_report.png",
      model: ModelOptions.GOOGLE_GEMINI_2_FLASH,
      modelProvider: ModelProvider.GOOGLE,
      schema,
    });
    console.log("Extracted data:", result.extracted);
  } catch (error) {
    console.error("Error extracting data:", error);
  }
}

main();
================================================
FILE: examples/node/openai.ts
================================================

import { ModelOptions, ModelProvider } from "zerox/node-zerox/dist/types";
import { zerox } from "zerox";

/**
 * Example using OpenAI with Zerox to extract structured data from documents.
 * This shows extraction setup with schema definition for a property report document.
 */
async function main() {
  // Define the schema for property report data extraction
  const schema = {
    type: "object",
    properties: {
      commercial_office: {
        type: "object",
        properties: {
          average: { type: "string" },
          median: { type: "string" },
        },
        required: ["average", "median"],
      },
      transactions_by_quarter: {
        type: "array",
        items: {
          type: "object",
          properties: {
            quarter: { type: "string" },
            transactions: { type: "integer" },
          },
          required: ["quarter", "transactions"],
        },
      },
      year: { type: "integer" },
    },
    required: ["commercial_office", "transactions_by_quarter", "year"],
  };

  try {
    const result = await zerox({
      credentials: {
        apiKey: process.env.OPENAI_API_KEY || "",
      },
      extractOnly: true, // Skip OCR, only perform extraction (defaults to false)
      filePath:
        "https://omni-demo-data.s3.amazonaws.com/test/property_report.png",
      model: ModelOptions.OPENAI_GPT_4O,
      modelProvider: ModelProvider.OPENAI,
      schema,
    });
    console.log("Extracted data:", result.extracted);
  } catch (error) {
    console.error("Error extracting data:", error);
  }
}

main();
================================================
FILE: node-zerox/tsconfig.json
================================================

{
  "compilerOptions": {
    "target": "ES5",
    "module": "commonjs",
    "declaration": true,
    "outDir": "./dist",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules", "**/*.test.ts"]
}
================================================
FILE: node-zerox/scripts/install-dependencies.js
================================================

const { exec } = require("child_process");
const { promisify } = require("util");

const execPromise = promisify(exec);

const installPackage = async (command, packageName) => {
  try {
    const { stdout, stderr } = await execPromise(command);
    if (stderr) {
      throw new Error(`Failed to install ${packageName}: ${stderr}`);
    }
    return stdout;
  } catch (error) {
    throw new Error(`Failed to install ${packageName}: ${error.message}`);
  }
};

const isSudoAvailable = async () => {
  try {
    // Try running a sudo command
    await execPromise("sudo -n true");
    return true;
  } catch {
    return false;
  }
};

const checkAndInstall = async () => {
  try {
    const sudoAvailable = await isSudoAvailable();

    // Check and install Ghostscript
    try {
      await execPromise("gs --version");
    } catch {
      if (process.platform === "darwin") {
        await installPackage("brew install ghostscript", "Ghostscript");
      } else if (process.platform === "linux") {
        const command = sudoAvailable
          ? "sudo apt-get update && sudo apt-get install -y ghostscript"
          : "apt-get update && apt-get install -y ghostscript";
        await installPackage(command, "Ghostscript");
      } else {
        throw new Error(
          "Please install Ghostscript manually from https://www.ghostscript.com/download.html"
        );
      }
    }

    // Check and install GraphicsMagick
    try {
      await execPromise("gm -version");
    } catch {
      if (process.platform === "darwin") {
        await installPackage("brew install graphicsmagick", "GraphicsMagick");
      } else if (process.platform === "linux") {
        const command = sudoAvailable
          ? "sudo apt-get update && sudo apt-get install -y graphicsmagick"
          : "apt-get update && apt-get install -y graphicsmagick";
        await installPackage(command, "GraphicsMagick");
      } else {
        throw new Error(
          "Please install GraphicsMagick manually from http://www.graphicsmagick.org/download.html"
        );
      }
    }

    // Check and install LibreOffice
    try {
      await execPromise("soffice --version");
    } catch {
      if (process.platform === "darwin") {
        await installPackage("brew install --cask libreoffice", "LibreOffice");
      } else if (process.platform === "linux") {
        const command = sudoAvailable
          ? "sudo apt-get update && sudo apt-get install -y libreoffice"
          : "apt-get update && apt-get install -y libreoffice";
        await installPackage(command, "LibreOffice");
      } else {
        throw new Error(
          "Please install LibreOffice manually from https://www.libreoffice.org/download/download/"
        );
      }
    }

    // Check and install Poppler
    try {
      await execPromise("pdfinfo -v || pdftoppm -v");
    } catch {
      if (process.platform === "darwin") {
        await installPackage("brew install poppler", "Poppler");
      } else if (process.platform === "linux") {
        const command = sudoAvailable
          ? "sudo apt-get update && sudo apt-get install -y poppler-utils"
          : "apt-get update && apt-get install -y poppler-utils";
        await installPackage(command, "Poppler");
      } else {
        throw new Error(
          "Please install Poppler manually from https://poppler.freedesktop.org/"
        );
      }
    }
  } catch (err) {
    console.error(`Error during installation: ${err.message}`);
    process.exit(1);
  }
};

checkAndInstall();
================================================
FILE: node-zerox/src/constants.ts
================================================

export const ASPECT_RATIO_THRESHOLD = 5;

// This is a rough guess; this will be used to create Tesseract workers by default,
// that cater to this many pages. If a document has more than this many pages,
// then more workers will be created dynamically.
export const NUM_STARTING_WORKERS = 3;

export const CONSISTENCY_PROMPT = (priorPage: string): string =>
  `Markdown must maintain consistent formatting with the following page: \n\n """${priorPage}"""`;

export const SYSTEM_PROMPT_BASE = `
Convert the following document to markdown.
Return only the markdown with no explanation text. Do not include delimiters like \`\`\`markdown or \`\`\`html.

RULES:
- You must include all information on the page. Do not exclude headers, footers, or subtext.
- Return tables in an HTML format.
- Charts & infographics must be interpreted to a markdown format. Prefer table format when applicable.
- Logos should be wrapped in brackets. Ex: <logo>Coca-Cola<logo>
- Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY<watermark>
- Page numbers should be wrapped in brackets. Ex: <page_number>14<page_number> or <page_number>9/22<page_number>
- Prefer using ☐ and ☑ for check boxes.
`;
================================================
FILE: node-zerox/src/handleWarnings.ts
================================================

// Tesseract relies on node-fetch v2, which has a deprecated version of punycode.
// Suppress the warning for now. Check back when tesseract updates to node-fetch v3.
// https://github.com/naptha/tesseract.js/issues/876
if (process.stderr.write === process.stderr.constructor.prototype.write) {
  const stdErrWrite = process.stderr.write;
  process.stderr.write = function (chunk: any, ...args: any[]) {
    const str = Buffer.isBuffer(chunk) ? chunk.toString() : chunk;
    // Filter out the punycode deprecation warning
    if (str.includes("punycode")) {
      return true;
    }
    return stdErrWrite.apply(process.stderr, [chunk]);
  };
}
================================================
FILE: node-zerox/src/index.ts
================================================

import fs from "fs-extra";
import os from "os";
import path from "path";
import pLimit from "p-limit";
import Tesseract from "tesseract.js";

import "./handleWarnings";
import {
  addWorkersToTesseractScheduler,
  checkIsCFBFile,
  checkIsPdfFile,
  cleanupImage,
  CompletionProcessor,
  compressImage,
  convertFileToPdf,
  convertHeicToJpeg,
  convertPdfToImages,
  downloadFile,
  extractPagesFromStructuredDataFile,
  getNumberOfPagesFromPdf,
  getTesseractScheduler,
  isCompletionResponse,
  isStructuredDataFile,
  prepareWorkersForImageProcessing,
  runRetries,
  splitSchema,
  terminateScheduler,
} from "./utils";
import { createModel } from "./models";
import {
  CompletionResponse,
  ErrorMode,
  ExtractionResponse,
  HybridInput,
  LogprobPage,
  ModelOptions,
  ModelProvider,
  OperationMode,
  Page,
  PageStatus,
  ZeroxArgs,
  ZeroxOutput,
} from "./types";
import { NUM_STARTING_WORKERS } from "./constants";

export const zerox = async ({
  cleanup = true,
  concurrency = 10,
  correctOrientation = true,
  credentials = { apiKey: "" },
  customModelFunction,
  directImageExtraction = false,
  enableHybridExtraction = false,
  errorMode = ErrorMode.IGNORE,
  extractionCredentials,
  extractionLlmParams,
  extractionModel,
  extractionModelProvider,
  extractionPrompt,
  extractOnly = false,
  extractPerPage,
  filePath,
  imageDensity,
  imageHeight,
  llmParams = {},
  maintainFormat = false,
  maxImageSize = 15,
  maxRetries = 1,
  maxTesseractWorkers = -1,
  model = ModelOptions.OPENAI_GPT_4O,
  modelProvider = ModelProvider.OPENAI,
  openaiAPIKey = "",
  outputDir,
  pagesToConvertAsImages = -1,
  prompt,
  schema,
  tempDir = os.tmpdir(),
  trimEdges = true,
}: ZeroxArgs): Promise<ZeroxOutput> => {
  let extracted: Record<string, unknown> | null = null;
  let extractedLogprobs: LogprobPage[] = [];
  let inputTokenCount: number = 0;
  let outputTokenCount: number = 0;
  let numSuccessfulOCRRequests: number = 0;
  let numFailedOCRRequests: number = 0;
  let ocrLogprobs: LogprobPage[] = [];
  let priorPage: string = "";
  let pages: Page[] = [];
  let imagePaths: string[] = [];
  const startTime = new Date();

  if (openaiAPIKey && openaiAPIKey.length > 0) {
    modelProvider = ModelProvider.OPENAI;
    credentials = { apiKey: openaiAPIKey };
  }
  extractionCredentials = extractionCredentials ?? credentials;
  extractionLlmParams = extractionLlmParams ?? llmParams;
  extractionModel = extractionModel ?? model;
  extractionModelProvider = extractionModelProvider ?? modelProvider;

  // Validators
  if (Object.values(credentials).every((credential) => !credential)) {
    throw new Error("Missing credentials");
  }
  if (!filePath || !filePath.length) {
    throw new Error("Missing file path");
  }
  if (enableHybridExtraction && (directImageExtraction || extractOnly)) {
    throw new Error(
      "Hybrid extraction cannot be used in direct image extraction or extract-only mode"
    );
  }
  if (enableHybridExtraction && !schema) {
    throw new Error("Schema is required when hybrid extraction is enabled");
  }
  if (extractOnly && !schema) {
    throw new Error("Schema is required for extraction mode");
  }
  if (extractOnly && maintainFormat) {
    throw new Error("Maintain format is only supported in OCR mode");
  }
  if (extractOnly) directImageExtraction = true;

  let scheduler: Tesseract.Scheduler | null = null;

  // Add initial tesseract workers if we need to correct orientation
  if (correctOrientation) {
    scheduler = await getTesseractScheduler();
    const workerCount =
      maxTesseractWorkers !== -1 && maxTesseractWorkers < NUM_STARTING_WORKERS
        ? maxTesseractWorkers
        : NUM_STARTING_WORKERS;
    await addWorkersToTesseractScheduler({
      numWorkers: workerCount,
      scheduler,
    });
  }

  try {
    // Ensure temp directory exists + create temp folder
    const rand = Math.floor(1000 + Math.random() * 9000).toString();
    const tempDirectory = path.join(
      tempDir || os.tmpdir(),
      `zerox-temp-${rand}`
    );
    const sourceDirectory = path.join(tempDirectory, "source");
    await fs.ensureDir(sourceDirectory);

    // Download the PDF. Get file name.
    const { extension, localPath } = await downloadFile({
      filePath,
      tempDir: sourceDirectory,
    });
    if (!localPath) throw "Failed to save file to local drive";

    // Sort the `pagesToConvertAsImages` array to make sure we use the right index
    // for `formattedPages` as `pdf2pic` always returns images in order
    if (Array.isArray(pagesToConvertAsImages)) {
      pagesToConvertAsImages.sort((a, b) => a - b);
    }

    // Check if the file is a structured data file (like Excel).
    // If so, skip the image conversion process and extract the pages directly
    if (isStructuredDataFile(localPath)) {
      pages = await extractPagesFromStructuredDataFile(localPath);
    } else {
      // Read the image file or convert the file to images
      if (
        extension === ".png" ||
        extension === ".jpg" ||
        extension === ".jpeg"
      ) {
        imagePaths = [localPath];
      } else if (extension === ".heic") {
        const imagePath = await convertHeicToJpeg({
          localPath,
          tempDir: sourceDirectory,
        });
        imagePaths = [imagePath];
      } else {
        let pdfPath: string;
        const isCFBFile = await checkIsCFBFile(localPath);
        const isPdf = await checkIsPdfFile(localPath);

        if ((extension === ".pdf" || isPdf) && !isCFBFile) {
          pdfPath = localPath;
        } else {
          // Convert file to PDF if necessary
          pdfPath = await convertFileToPdf({
            extension,
            localPath,
            tempDir: sourceDirectory,
          });
        }

        if (pagesToConvertAsImages !== -1) {
          const totalPages = await getNumberOfPagesFromPdf({ pdfPath });
          pagesToConvertAsImages = Array.isArray(pagesToConvertAsImages)
            ? pagesToConvertAsImages
            : [pagesToConvertAsImages];
          pagesToConvertAsImages = pagesToConvertAsImages.filter(
            (page) => page > 0 && page <= totalPages
          );
        }

        imagePaths = await convertPdfToImages({
          imageDensity,
          imageHeight,
          pagesToConvertAsImages,
          pdfPath,
          tempDir: sourceDirectory,
        });
      }

      // Compress images if maxImageSize is specified
      if (maxImageSize && maxImageSize > 0) {
        const compressPromises = imagePaths.map(async (imagePath: string) => {
          const imageBuffer = await fs.readFile(imagePath);
          const compressedBuffer = await compressImage(
            imageBuffer,
            maxImageSize
          );
          const originalName = path.basename(
            imagePath,
            path.extname(imagePath)
          );
          const compressedPath = path.join(
            sourceDirectory,
            `${originalName}_compressed.png`
          );
          await fs.writeFile(compressedPath, compressedBuffer);
          return compressedPath;
        });
        imagePaths = await Promise.all(compressPromises);
      }

      if (correctOrientation) {
        await prepareWorkersForImageProcessing({
          maxTesseractWorkers,
          numImages: imagePaths.length,
          scheduler,
        });
      }
    }

    // Start processing OCR using LLM
    const modelInstance = createModel({
      credentials,
      llmParams,
      model,
      provider: modelProvider,
    });

    if (!extractOnly) {
      const processOCR = async (
        imagePath: string,
        pageIndex: number,
        maintainFormat: boolean
      ): Promise<Page> => {
        let pageNumber: number;
        // If we convert all pages, just use the array index
        if (pagesToConvertAsImages === -1) {
          pageNumber = pageIndex + 1;
        }
        // Else if we convert specific pages, use the page number from the parameter
        else if (Array.isArray(pagesToConvertAsImages)) {
          pageNumber = pagesToConvertAsImages[pageIndex];
        }
        // Else, the parameter is a number and use it for the page number
        else {
          pageNumber = pagesToConvertAsImages;
        }

        const imageBuffer = await fs.readFile(imagePath);
        const buffers = await cleanupImage({
          correctOrientation,
          imageBuffer,
          scheduler,
          trimEdges,
        });

        let page: Page;
        try {
          let rawResponse: CompletionResponse | ExtractionResponse;
          if (customModelFunction) {
            rawResponse = await runRetries(
              () =>
                customModelFunction({
                  buffers,
                  image: imagePath,
                  maintainFormat,
                  pageNumber,
                  priorPage,
                }),
              maxRetries,
              pageNumber
            );
          } else {
            rawResponse = await runRetries(
              () =>
                modelInstance.getCompletion(OperationMode.OCR, {
                  buffers,
                  maintainFormat,
                  priorPage,
                  prompt,
                }),
              maxRetries,
              pageNumber
            );
          }

          if (rawResponse.logprobs) {
            ocrLogprobs.push({
              page: pageNumber,
              value: rawResponse.logprobs,
            });
          }

          const response = CompletionProcessor.process(
            OperationMode.OCR,
            rawResponse
          );

          inputTokenCount += response.inputTokens;
          outputTokenCount += response.outputTokens;

          if (isCompletionResponse(OperationMode.OCR, response)) {
            priorPage = response.content;
} | |
page = { | |
...response, | |
page: pageNumber, | |
status: PageStatus.SUCCESS, | |
}; | |
numSuccessfulOCRRequests++; | |
} catch (error) { | |
console.error(`Failed to process image ${imagePath}:`, error); | |
if (errorMode === ErrorMode.THROW) { | |
throw error; | |
} | |
page = { | |
content: "", | |
contentLength: 0, | |
error: `Failed to process page ${pageNumber}: ${error}`, | |
page: pageNumber, | |
status: PageStatus.ERROR, | |
}; | |
numFailedOCRRequests++; | |
} | |
return page; | |
}; | |
if (maintainFormat) { | |
// Process pages sequentially so each request can reference the prior page's output | |
for (let i = 0; i < imagePaths.length; i++) { | |
const page = await processOCR(imagePaths[i], i, true); | |
pages.push(page); | |
if (page.status === PageStatus.ERROR) { | |
break; | |
} | |
} | |
} else { | |
// Process pages in parallel, capped by the concurrency limit | |
const limit = pLimit(concurrency); | |
await Promise.all( | |
imagePaths.map((imagePath, i) => | |
limit(() => | |
processOCR(imagePath, i, false).then((page) => { | |
pages[i] = page; | |
}) | |
) | |
) | |
); | |
} | |
} | |
} | |
// Start processing extraction using LLM | |
let numSuccessfulExtractionRequests: number = 0; | |
let numFailedExtractionRequests: number = 0; | |
if (schema) { | |
const extractionModelInstance = createModel({ | |
credentials: extractionCredentials, | |
llmParams: extractionLlmParams, | |
model: extractionModel, | |
provider: extractionModelProvider, | |
}); | |
const { fullDocSchema, perPageSchema } = splitSchema( | |
schema, | |
extractPerPage | |
); | |
const extractionTasks: Promise<any>[] = []; | |
const processExtraction = async ( | |
input: string | string[] | HybridInput, | |
pageNumber: number, | |
schema: Record<string, unknown> | |
): Promise<Record<string, unknown>> => { | |
let result: Record<string, unknown> = {}; | |
try { | |
await runRetries( | |
async () => { | |
const rawResponse = await extractionModelInstance.getCompletion( | |
OperationMode.EXTRACTION, | |
{ | |
input, | |
options: { correctOrientation, scheduler, trimEdges }, | |
prompt: extractionPrompt, | |
schema, | |
} | |
); | |
if (rawResponse.logprobs) { | |
extractedLogprobs.push({ | |
page: pageNumber, | |
value: rawResponse.logprobs, | |
}); | |
} | |
const response = CompletionProcessor.process( | |
OperationMode.EXTRACTION, | |
rawResponse | |
); | |
inputTokenCount += response.inputTokens; | |
outputTokenCount += response.outputTokens; | |
numSuccessfulExtractionRequests++; | |
for (const key of Object.keys(schema?.properties ?? {})) { | |
const value = response.extracted[key]; | |
if (value !== null && value !== undefined) { | |
if (!Array.isArray(result[key])) { | |
result[key] = []; | |
} | |
(result[key] as any[]).push({ page: pageNumber, value }); | |
} | |
} | |
}, | |
maxRetries, | |
pageNumber | |
); | |
} catch (error) { | |
numFailedExtractionRequests++; | |
throw error; | |
} | |
return result; | |
}; | |
if (perPageSchema) { | |
const inputs = | |
directImageExtraction && !isStructuredDataFile(localPath) | |
? imagePaths.map((imagePath) => [imagePath]) | |
: enableHybridExtraction | |
? imagePaths.map((imagePath, index) => ({ | |
imagePaths: [imagePath], | |
text: pages[index].content || "", | |
})) | |
: pages.map((page) => page.content || ""); | |
extractionTasks.push( | |
...inputs.map((input, i) => | |
processExtraction(input, i + 1, perPageSchema) | |
) | |
); | |
} | |
if (fullDocSchema) { | |
const input = | |
directImageExtraction && !isStructuredDataFile(localPath) | |
? imagePaths | |
: enableHybridExtraction | |
? { | |
imagePaths, | |
text: pages | |
.map((page, i) => | |
i === 0 ? page.content : "\n<hr><hr>\n" + page.content | |
) | |
.join(""), | |
} | |
: pages | |
.map((page, i) => | |
i === 0 ? page.content : "\n<hr><hr>\n" + page.content | |
) | |
.join(""); | |
extractionTasks.push( | |
(async () => { | |
let result: Record<string, unknown> = {}; | |
try { | |
await runRetries( | |
async () => { | |
const rawResponse = | |
await extractionModelInstance.getCompletion( | |
OperationMode.EXTRACTION, | |
{ | |
input, | |
options: { correctOrientation, scheduler, trimEdges }, | |
prompt: extractionPrompt, | |
schema: fullDocSchema, | |
} | |
); | |
if (rawResponse.logprobs) { | |
extractedLogprobs.push({ | |
page: null, | |
value: rawResponse.logprobs, | |
}); | |
} | |
const response = CompletionProcessor.process( | |
OperationMode.EXTRACTION, | |
rawResponse | |
); | |
inputTokenCount += response.inputTokens; | |
outputTokenCount += response.outputTokens; | |
numSuccessfulExtractionRequests++; | |
result = response.extracted; | |
}, | |
maxRetries, | |
0 | |
); | |
return result; | |
} catch (error) { | |
numFailedExtractionRequests++; | |
throw error; | |
} | |
})() | |
); | |
} | |
const results = await Promise.all(extractionTasks); | |
extracted = results.reduce((acc, result) => { | |
Object.entries(result || {}).forEach(([key, value]) => { | |
if (!acc[key]) { | |
acc[key] = []; | |
} | |
if (Array.isArray(value)) { | |
acc[key].push(...value); | |
} else { | |
acc[key] = value; | |
} | |
}); | |
return acc; | |
}, {}); | |
} | |
// Derive a sanitized file name; the aggregated markdown is written below if outputDir is set | |
const endOfPath = path.basename(localPath); | |
const rawFileName = endOfPath.split(".")[0]; | |
const fileName = rawFileName | |
.replace(/[^\w\s]/g, "") | |
.replace(/\s+/g, "_") | |
.toLowerCase() | |
.substring(0, 255); // Truncate file name to 255 characters to prevent ENAMETOOLONG errors | |
if (outputDir) { | |
const resultFilePath = path.join(outputDir, `${fileName}.md`); | |
const content = pages.map((page) => page.content).join("\n\n"); | |
await fs.writeFile(resultFilePath, content); | |
} | |
// Cleanup the temp directory (downloaded file + generated images) | |
if (cleanup) await fs.remove(tempDirectory); | |
// Format JSON response | |
const endTime = new Date(); | |
const completionTime = endTime.getTime() - startTime.getTime(); | |
return { | |
completionTime, | |
extracted, | |
fileName, | |
inputTokens: inputTokenCount, | |
...(ocrLogprobs.length || extractedLogprobs.length | |
? { | |
logprobs: { | |
ocr: !extractOnly ? ocrLogprobs : null, | |
extracted: schema ? extractedLogprobs : null, | |
}, | |
} | |
: {}), | |
outputTokens: outputTokenCount, | |
pages, | |
summary: { | |
totalPages: pages.length, | |
ocr: !extractOnly | |
? { | |
successful: numSuccessfulOCRRequests, | |
failed: numFailedOCRRequests, | |
} | |
: null, | |
extracted: schema | |
? { | |
successful: numSuccessfulExtractionRequests, | |
failed: numFailedExtractionRequests, | |
} | |
: null, | |
}, | |
}; | |
} finally { | |
if (correctOrientation && scheduler) { | |
terminateScheduler(scheduler); | |
} | |
} | |
}; | |
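// --- Usage sketch (added for illustration; not part of the original source) --- | |
// A minimal end-to-end call, assuming the file above exports a `zerox` | |
// entrypoint; the file path and environment variable are hypothetical. | |
// | |
//   import { zerox } from "zerox"; | |
// | |
//   const result = await zerox({ | |
//     cleanup: true, // remove the temp directory afterwards | |
//     concurrency: 10, // max pages OCR-ed in parallel (non-maintainFormat path) | |
//     filePath: "https://example.com/invoice.pdf", | |
//     openaiAPIKey: process.env.OPENAI_API_KEY, | |
//   }); | |
//   console.log(result.summary, result.pages[0].content); | |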
================================================ | |
FILE: node-zerox/src/types.ts | |
================================================ | |
import { ChatCompletionTokenLogprob } from "openai/resources"; | |
import Tesseract from "tesseract.js"; | |
export interface ZeroxArgs { | |
cleanup?: boolean; | |
concurrency?: number; | |
correctOrientation?: boolean; | |
credentials?: ModelCredentials; | |
customModelFunction?: (params: { | |
buffers: Buffer[]; | |
image: string; | |
maintainFormat: boolean; | |
pageNumber: number; | |
priorPage: string; | |
}) => Promise<CompletionResponse>; | |
directImageExtraction?: boolean; | |
enableHybridExtraction?: boolean; | |
errorMode?: ErrorMode; | |
extractionCredentials?: ModelCredentials; | |
extractionLlmParams?: Partial<LLMParams>; | |
extractionModel?: ModelOptions | string; | |
extractionModelProvider?: ModelProvider | string; | |
extractionPrompt?: string; | |
extractOnly?: boolean; | |
extractPerPage?: string[]; | |
filePath: string; | |
imageDensity?: number; | |
imageHeight?: number; | |
llmParams?: Partial<LLMParams>; | |
maintainFormat?: boolean; | |
maxImageSize?: number; | |
maxRetries?: number; | |
maxTesseractWorkers?: number; | |
model?: ModelOptions | string; | |
modelProvider?: ModelProvider | string; | |
openaiAPIKey?: string; | |
outputDir?: string; | |
pagesToConvertAsImages?: number | number[]; | |
prompt?: string; | |
schema?: Record<string, unknown>; | |
tempDir?: string; | |
trimEdges?: boolean; | |
} | |
export interface ZeroxOutput { | |
completionTime: number; | |
extracted: Record<string, unknown> | null; | |
fileName: string; | |
inputTokens: number; | |
logprobs?: Logprobs; | |
outputTokens: number; | |
pages: Page[]; | |
summary: Summary; | |
} | |
export interface AzureCredentials { | |
apiKey: string; | |
endpoint: string; | |
} | |
export interface BedrockCredentials { | |
accessKeyId?: string; | |
region: string; | |
secretAccessKey?: string; | |
sessionToken?: string; | |
} | |
export interface GoogleCredentials { | |
apiKey: string; | |
} | |
export interface OpenAICredentials { | |
apiKey: string; | |
} | |
export type ModelCredentials = | |
| AzureCredentials | |
| BedrockCredentials | |
| GoogleCredentials | |
| OpenAICredentials; | |
export enum ModelOptions { | |
// Bedrock Claude 3 and 3.5 models | |
BEDROCK_CLAUDE_3_HAIKU_2024_10 = "anthropic.claude-3-5-haiku-20241022-v1:0", | |
BEDROCK_CLAUDE_3_SONNET_2024_06 = "anthropic.claude-3-5-sonnet-20240620-v1:0", | |
BEDROCK_CLAUDE_3_SONNET_2024_10 = "anthropic.claude-3-5-sonnet-20241022-v2:0", | |
BEDROCK_CLAUDE_3_HAIKU_2024_03 = "anthropic.claude-3-haiku-20240307-v1:0", | |
BEDROCK_CLAUDE_3_OPUS_2024_02 = "anthropic.claude-3-opus-20240229-v1:0", | |
BEDROCK_CLAUDE_3_SONNET_2024_02 = "anthropic.claude-3-sonnet-20240229-v1:0", | |
// OpenAI GPT-4 Models | |
OPENAI_GPT_4_1 = "gpt-4.1", | |
OPENAI_GPT_4_1_MINI = "gpt-4.1-mini", | |
OPENAI_GPT_4O = "gpt-4o", | |
OPENAI_GPT_4O_MINI = "gpt-4o-mini", | |
// Google Gemini Models | |
GOOGLE_GEMINI_1_5_FLASH = "gemini-1.5-flash", | |
GOOGLE_GEMINI_1_5_FLASH_8B = "gemini-1.5-flash-8b", | |
GOOGLE_GEMINI_1_5_PRO = "gemini-1.5-pro", | |
GOOGLE_GEMINI_2_5_PRO = "gemini-2.5-pro-preview-03-25", | |
GOOGLE_GEMINI_2_FLASH = "gemini-2.0-flash-001", | |
GOOGLE_GEMINI_2_FLASH_LITE = "gemini-2.0-flash-lite-preview-02-05", | |
} | |
export enum ModelProvider { | |
AZURE = "AZURE", | |
BEDROCK = "BEDROCK", | |
GOOGLE = "GOOGLE", | |
OPENAI = "OPENAI", | |
} | |
export enum OperationMode { | |
EXTRACTION = "EXTRACTION", | |
OCR = "OCR", | |
} | |
export enum PageStatus { | |
SUCCESS = "SUCCESS", | |
ERROR = "ERROR", | |
} | |
export interface Page { | |
content?: string; | |
contentLength?: number; | |
error?: string; | |
extracted?: Record<string, unknown>; | |
inputTokens?: number; | |
outputTokens?: number; | |
page: number; | |
status: PageStatus; | |
} | |
export interface ConvertPdfOptions { | |
density: number; | |
format: "png"; | |
height: number; | |
preserveAspectRatio?: boolean; | |
saveFilename: string; | |
savePath: string; | |
} | |
export interface CompletionArgs { | |
buffers: Buffer[]; | |
maintainFormat: boolean; | |
priorPage: string; | |
prompt?: string; | |
} | |
export interface CompletionResponse { | |
content: string; | |
inputTokens: number; | |
logprobs?: ChatCompletionTokenLogprob[] | null; | |
outputTokens: number; | |
} | |
export type ProcessedCompletionResponse = Omit< | |
CompletionResponse, | |
"logprobs" | |
> & { | |
contentLength: number; | |
}; | |
export interface CreateModelArgs { | |
credentials: ModelCredentials; | |
llmParams: Partial<LLMParams>; | |
model: ModelOptions | string; | |
provider: ModelProvider | string; | |
} | |
export enum ErrorMode { | |
THROW = "THROW", | |
IGNORE = "IGNORE", | |
} | |
export interface ExtractionArgs { | |
input: string | string[] | HybridInput; | |
options?: { | |
correctOrientation?: boolean; | |
scheduler: Tesseract.Scheduler | null; | |
trimEdges?: boolean; | |
}; | |
prompt?: string; | |
schema: Record<string, unknown>; | |
} | |
export interface ExtractionResponse { | |
extracted: Record<string, unknown>; | |
inputTokens: number; | |
logprobs?: ChatCompletionTokenLogprob[] | null; | |
outputTokens: number; | |
} | |
export type ProcessedExtractionResponse = Omit<ExtractionResponse, "logprobs">; | |
export interface HybridInput { | |
imagePaths: string[]; | |
text: string; | |
} | |
interface BaseLLMParams { | |
frequencyPenalty?: number; | |
presencePenalty?: number; | |
temperature?: number; | |
topP?: number; | |
} | |
export interface AzureLLMParams extends BaseLLMParams { | |
logprobs: boolean; | |
maxTokens: number; | |
} | |
export interface BedrockLLMParams extends BaseLLMParams { | |
maxTokens: number; | |
} | |
export interface GoogleLLMParams extends BaseLLMParams { | |
maxOutputTokens: number; | |
} | |
export interface OpenAILLMParams extends BaseLLMParams { | |
logprobs: boolean; | |
maxTokens: number; | |
} | |
// Union type of all provider params | |
export type LLMParams = | |
| AzureLLMParams | |
| BedrockLLMParams | |
| GoogleLLMParams | |
| OpenAILLMParams; | |
export interface LogprobPage { | |
page: number | null; | |
value: ChatCompletionTokenLogprob[]; | |
} | |
interface Logprobs { | |
ocr: LogprobPage[] | null; | |
extracted: LogprobPage[] | null; | |
} | |
export interface MessageContentArgs { | |
input: string | string[] | HybridInput; | |
options?: { | |
correctOrientation?: boolean; | |
scheduler: Tesseract.Scheduler | null; | |
trimEdges?: boolean; | |
}; | |
} | |
export interface ModelInterface { | |
getCompletion( | |
mode: OperationMode, | |
params: CompletionArgs | ExtractionArgs | |
): Promise<CompletionResponse | ExtractionResponse>; | |
} | |
export interface Summary { | |
totalPages: number; | |
ocr: { | |
successful: number; | |
failed: number; | |
} | null; | |
extracted: { | |
successful: number; | |
failed: number; | |
} | null; | |
} | |
export interface ExcelSheetContent { | |
content: string; | |
contentLength: number; | |
sheetName: string; | |
} | |
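// --- Illustrative sketch (added; not part of the original source) --- | |
// A hypothetical ZeroxArgs value using per-page extraction: `invoiceTotal` | |
// is listed in `extractPerPage`, so splitSchema routes it to the per-page | |
// schema, while the remaining fields are extracted once per document. | |
// | |
//   const args: ZeroxArgs = { | |
//     extractPerPage: ["invoiceTotal"], | |
//     filePath: "/tmp/invoice.pdf", | |
//     schema: { | |
//       properties: { | |
//         invoiceTotal: { type: "number" }, | |
//         vendor: { type: "string" }, | |
//       }, | |
//       required: ["vendor"], | |
//       type: "object", | |
//     }, | |
//   }; | |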
================================================ | |
FILE: node-zerox/src/models/azure.ts | |
================================================ | |
import { | |
AzureCredentials, | |
AzureLLMParams, | |
CompletionArgs, | |
CompletionResponse, | |
ExtractionArgs, | |
ExtractionResponse, | |
MessageContentArgs, | |
ModelInterface, | |
OperationMode, | |
} from "../types"; | |
import { AzureOpenAI } from "openai"; | |
import { | |
cleanupImage, | |
convertKeysToCamelCase, | |
convertKeysToSnakeCase, | |
encodeImageToBase64, | |
} from "../utils"; | |
import { CONSISTENCY_PROMPT, SYSTEM_PROMPT_BASE } from "../constants"; | |
import fs from "fs-extra"; | |
export default class AzureModel implements ModelInterface { | |
private client: AzureOpenAI; | |
private llmParams?: Partial<AzureLLMParams>; | |
constructor( | |
credentials: AzureCredentials, | |
model: string, | |
llmParams?: Partial<AzureLLMParams> | |
) { | |
this.client = new AzureOpenAI({ | |
apiKey: credentials.apiKey, | |
apiVersion: "2024-10-21", | |
deployment: model, | |
endpoint: credentials.endpoint, | |
}); | |
this.llmParams = llmParams; | |
} | |
async getCompletion( | |
mode: OperationMode, | |
params: CompletionArgs | ExtractionArgs | |
): Promise<CompletionResponse | ExtractionResponse> { | |
const modeHandlers = { | |
[OperationMode.EXTRACTION]: () => | |
this.handleExtraction(params as ExtractionArgs), | |
[OperationMode.OCR]: () => this.handleOCR(params as CompletionArgs), | |
}; | |
const handler = modeHandlers[mode]; | |
if (!handler) { | |
throw new Error(`Unsupported operation mode: ${mode}`); | |
} | |
return await handler(); | |
} | |
private async createMessageContent({ | |
input, | |
options, | |
}: MessageContentArgs): Promise<any> { | |
const processImages = async (imagePaths: string[]) => { | |
const nestedImages = await Promise.all( | |
imagePaths.map(async (imagePath) => { | |
const imageBuffer = await fs.readFile(imagePath); | |
const buffers = await cleanupImage({ | |
correctOrientation: options?.correctOrientation ?? false, | |
imageBuffer, | |
scheduler: options?.scheduler ?? null, | |
trimEdges: options?.trimEdges ?? false, | |
}); | |
return buffers.map((buffer) => ({ | |
image_url: { | |
url: `data:image/png;base64,${encodeImageToBase64(buffer)}`, | |
}, | |
type: "image_url", | |
})); | |
}) | |
); | |
return nestedImages.flat(); | |
}; | |
if (Array.isArray(input)) { | |
return processImages(input); | |
} | |
if (typeof input === "string") { | |
return [{ text: input, type: "text" }]; | |
} | |
const { imagePaths, text } = input; | |
const images = await processImages(imagePaths); | |
return [...images, { text, type: "text" }]; | |
} | |
private async handleOCR({ | |
buffers, | |
maintainFormat, | |
priorPage, | |
prompt, | |
}: CompletionArgs): Promise<CompletionResponse> { | |
const systemPrompt = prompt || SYSTEM_PROMPT_BASE; | |
// Default system message | |
const messages: any = [{ role: "system", content: systemPrompt }]; | |
// If content has already been generated, add it to context. | |
// This helps maintain the same format across pages | |
if (maintainFormat && priorPage && priorPage.length) { | |
messages.push({ | |
role: "system", | |
content: CONSISTENCY_PROMPT(priorPage), | |
}); | |
} | |
// Add image to request | |
const imageContents = buffers.map((buffer) => ({ | |
type: "image_url", | |
image_url: { | |
url: `data:image/png;base64,${encodeImageToBase64(buffer)}`, | |
}, | |
})); | |
messages.push({ role: "user", content: imageContents }); | |
try { | |
const response = await this.client.chat.completions.create({ | |
messages, | |
model: "", | |
...convertKeysToSnakeCase(this.llmParams ?? null), | |
}); | |
const result: CompletionResponse = { | |
content: response.choices[0].message.content || "", | |
inputTokens: response.usage?.prompt_tokens || 0, | |
outputTokens: response.usage?.completion_tokens || 0, | |
}; | |
if (this.llmParams?.logprobs) { | |
result["logprobs"] = convertKeysToCamelCase( | |
response.choices[0].logprobs | |
)?.content; | |
} | |
return result; | |
} catch (err) { | |
console.error("Error in Azure completion", err); | |
throw err; | |
} | |
} | |
private async handleExtraction({ | |
input, | |
options, | |
prompt, | |
schema, | |
}: ExtractionArgs): Promise<ExtractionResponse> { | |
try { | |
const messages: any = []; | |
if (prompt) { | |
messages.push({ role: "system", content: prompt }); | |
} | |
messages.push({ | |
role: "user", | |
content: await this.createMessageContent({ input, options }), | |
}); | |
const response = await this.client.chat.completions.create({ | |
messages, | |
model: "", | |
response_format: { | |
json_schema: { name: "extraction", schema }, | |
type: "json_schema", | |
}, | |
...convertKeysToSnakeCase(this.llmParams ?? null), | |
}); | |
const result: ExtractionResponse = { | |
extracted: JSON.parse(response.choices[0].message.content || ""), | |
inputTokens: response.usage?.prompt_tokens || 0, | |
outputTokens: response.usage?.completion_tokens || 0, | |
}; | |
if (this.llmParams?.logprobs) { | |
result["logprobs"] = convertKeysToCamelCase( | |
response.choices[0].logprobs | |
)?.content; | |
} | |
return result; | |
} catch (err) { | |
console.error("Error in Azure completion", err); | |
throw err; | |
} | |
} | |
} | |
================================================ | |
FILE: node-zerox/src/models/bedrock.ts | |
================================================ | |
import { | |
BedrockCredentials, | |
BedrockLLMParams, | |
CompletionArgs, | |
CompletionResponse, | |
ExtractionArgs, | |
ExtractionResponse, | |
MessageContentArgs, | |
ModelInterface, | |
OperationMode, | |
} from "../types"; | |
import { | |
BedrockRuntimeClient, | |
InvokeModelCommand, | |
} from "@aws-sdk/client-bedrock-runtime"; | |
import { | |
cleanupImage, | |
convertKeysToSnakeCase, | |
encodeImageToBase64, | |
} from "../utils"; | |
import { CONSISTENCY_PROMPT, SYSTEM_PROMPT_BASE } from "../constants"; | |
import fs from "fs-extra"; | |
// Currently only supports Anthropic models | |
export default class BedrockModel implements ModelInterface { | |
private client: BedrockRuntimeClient; | |
private model: string; | |
private llmParams?: Partial<BedrockLLMParams>; | |
constructor( | |
credentials: BedrockCredentials, | |
model: string, | |
llmParams?: Partial<BedrockLLMParams> | |
) { | |
this.client = new BedrockRuntimeClient({ | |
region: credentials.region, | |
credentials: credentials.accessKeyId | |
? { | |
accessKeyId: credentials.accessKeyId, | |
secretAccessKey: credentials.secretAccessKey!, | |
sessionToken: credentials.sessionToken, | |
} | |
: undefined, | |
}); | |
this.model = model; | |
this.llmParams = llmParams; | |
} | |
async getCompletion( | |
mode: OperationMode, | |
params: CompletionArgs | ExtractionArgs | |
): Promise<CompletionResponse | ExtractionResponse> { | |
const modeHandlers = { | |
[OperationMode.EXTRACTION]: () => | |
this.handleExtraction(params as ExtractionArgs), | |
[OperationMode.OCR]: () => this.handleOCR(params as CompletionArgs), | |
}; | |
const handler = modeHandlers[mode]; | |
if (!handler) { | |
throw new Error(`Unsupported operation mode: ${mode}`); | |
} | |
return await handler(); | |
} | |
private async createMessageContent({ | |
input, | |
options, | |
}: MessageContentArgs): Promise<any> { | |
const processImages = async (imagePaths: string[]) => { | |
const nestedImages = await Promise.all( | |
imagePaths.map(async (imagePath) => { | |
const imageBuffer = await fs.readFile(imagePath); | |
const buffers = await cleanupImage({ | |
correctOrientation: options?.correctOrientation ?? false, | |
imageBuffer, | |
scheduler: options?.scheduler ?? null, | |
trimEdges: options?.trimEdges ?? false, | |
}); | |
return buffers.map((buffer) => ({ | |
source: { | |
data: encodeImageToBase64(buffer), | |
media_type: "image/png", | |
type: "base64", | |
}, | |
type: "image", | |
})); | |
}) | |
); | |
return nestedImages.flat(); | |
}; | |
if (Array.isArray(input)) { | |
return processImages(input); | |
} | |
if (typeof input === "string") { | |
return [{ text: input, type: "text" }]; | |
} | |
const { imagePaths, text } = input; | |
const images = await processImages(imagePaths); | |
return [...images, { text, type: "text" }]; | |
} | |
private async handleOCR({ | |
buffers, | |
maintainFormat, | |
priorPage, | |
prompt, | |
}: CompletionArgs): Promise<CompletionResponse> { | |
let systemPrompt = prompt || SYSTEM_PROMPT_BASE; | |
// Anthropic models take the system prompt as a top-level field, so no system message is pushed here | |
const messages: any = []; | |
// If content has already been generated, add it to context. | |
// This helps maintain the same format across pages | |
if (maintainFormat && priorPage && priorPage.length) { | |
systemPrompt += `\n\n${CONSISTENCY_PROMPT(priorPage)}`; | |
} | |
// Add image to request | |
const imageContents = buffers.map((buffer) => ({ | |
source: { | |
data: encodeImageToBase64(buffer), | |
media_type: "image/png", | |
type: "base64", | |
}, | |
type: "image", | |
})); | |
messages.push({ role: "user", content: imageContents }); | |
try { | |
const body = { | |
anthropic_version: "bedrock-2023-05-31", | |
max_tokens: this.llmParams?.maxTokens || 4096, | |
messages, | |
system: systemPrompt, | |
...convertKeysToSnakeCase(this.llmParams ?? {}), | |
}; | |
const command = new InvokeModelCommand({ | |
accept: "application/json", | |
body: JSON.stringify(body), | |
contentType: "application/json", | |
modelId: this.model, | |
}); | |
const response = await this.client.send(command); | |
const parsedResponse = JSON.parse( | |
new TextDecoder().decode(response.body) | |
); | |
return { | |
content: parsedResponse.content[0].text, | |
inputTokens: parsedResponse.usage?.input_tokens || 0, | |
outputTokens: parsedResponse.usage?.output_tokens || 0, | |
}; | |
} catch (err) { | |
console.error("Error in Bedrock completion", err); | |
throw err; | |
} | |
} | |
private async handleExtraction({ | |
input, | |
options, | |
prompt, | |
schema, | |
}: ExtractionArgs): Promise<ExtractionResponse> { | |
try { | |
const messages = [ | |
{ | |
role: "user", | |
content: await this.createMessageContent({ input, options }), | |
}, | |
]; | |
const tools = [ | |
{ | |
input_schema: schema, | |
name: "json", | |
}, | |
]; | |
const body = { | |
anthropic_version: "bedrock-2023-05-31", | |
max_tokens: this.llmParams?.maxTokens || 4096, | |
messages, | |
system: prompt, | |
tool_choice: { name: "json", type: "tool" }, | |
tools, | |
...convertKeysToSnakeCase(this.llmParams ?? {}), | |
}; | |
const command = new InvokeModelCommand({ | |
accept: "application/json", | |
body: JSON.stringify(body), | |
contentType: "application/json", | |
modelId: this.model, | |
}); | |
const response = await this.client.send(command); | |
const parsedResponse = JSON.parse( | |
new TextDecoder().decode(response.body) | |
); | |
return { | |
extracted: parsedResponse.content[0].input, | |
inputTokens: parsedResponse.usage?.input_tokens || 0, | |
outputTokens: parsedResponse.usage?.output_tokens || 0, | |
}; | |
} catch (err) { | |
console.error("Error in Bedrock completion", err); | |
throw err; | |
} | |
} | |
} | |
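// --- Note (added commentary; not part of the original source) --- | |
// handleExtraction forces structured output by declaring a single Anthropic | |
// tool named "json" whose input_schema is the caller's schema, then setting | |
// tool_choice to that tool. The model must "call" the tool, so | |
// parsedResponse.content[0].input arrives already shaped by the schema. | |
// A request body therefore looks roughly like: | |
// | |
//   { | |
//     "anthropic_version": "bedrock-2023-05-31", | |
//     "max_tokens": 4096, | |
//     "messages": [{ "role": "user", "content": [ /* images, text */ ] }], | |
//     "system": "...optional prompt...", | |
//     "tool_choice": { "name": "json", "type": "tool" }, | |
//     "tools": [{ "input_schema": { /* caller schema */ }, "name": "json" }] | |
//   } | |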
================================================ | |
FILE: node-zerox/src/models/google.ts | |
================================================ | |
import { | |
cleanupImage, | |
convertKeysToSnakeCase, | |
encodeImageToBase64, | |
} from "../utils"; | |
import { | |
CompletionArgs, | |
CompletionResponse, | |
ExtractionArgs, | |
ExtractionResponse, | |
GoogleCredentials, | |
GoogleLLMParams, | |
MessageContentArgs, | |
ModelInterface, | |
OperationMode, | |
} from "../types"; | |
import { CONSISTENCY_PROMPT, SYSTEM_PROMPT_BASE } from "../constants"; | |
import { GoogleGenAI, createPartFromBase64 } from "@google/genai"; | |
import fs from "fs-extra"; | |
export default class GoogleModel implements ModelInterface { | |
private client: GoogleGenAI; | |
private model: string; | |
private llmParams?: Partial<GoogleLLMParams>; | |
constructor( | |
credentials: GoogleCredentials, | |
model: string, | |
llmParams?: Partial<GoogleLLMParams> | |
) { | |
this.client = new GoogleGenAI({ apiKey: credentials.apiKey }); | |
this.model = model; | |
this.llmParams = llmParams; | |
} | |
async getCompletion( | |
mode: OperationMode, | |
params: CompletionArgs | ExtractionArgs | |
): Promise<CompletionResponse | ExtractionResponse> { | |
const modeHandlers = { | |
[OperationMode.EXTRACTION]: () => | |
this.handleExtraction(params as ExtractionArgs), | |
[OperationMode.OCR]: () => this.handleOCR(params as CompletionArgs), | |
}; | |
const handler = modeHandlers[mode]; | |
if (!handler) { | |
throw new Error(`Unsupported operation mode: ${mode}`); | |
} | |
return await handler(); | |
} | |
private async createMessageContent({ | |
input, | |
options, | |
}: MessageContentArgs): Promise<any> { | |
const processImages = async (imagePaths: string[]) => { | |
const nestedImages = await Promise.all( | |
imagePaths.map(async (imagePath) => { | |
const imageBuffer = await fs.readFile(imagePath); | |
const buffers = await cleanupImage({ | |
correctOrientation: options?.correctOrientation ?? false, | |
imageBuffer, | |
scheduler: options?.scheduler ?? null, | |
trimEdges: options?.trimEdges ?? false, | |
}); | |
return buffers.map((buffer) => | |
createPartFromBase64(encodeImageToBase64(buffer), "image/png") | |
); | |
}) | |
); | |
return nestedImages.flat(); | |
}; | |
if (Array.isArray(input)) { | |
return processImages(input); | |
} | |
if (typeof input === "string") { | |
return [{ text: input }]; | |
} | |
const { imagePaths, text } = input; | |
const images = await processImages(imagePaths); | |
return [...images, { text }]; | |
} | |
private async handleOCR({ | |
buffers, | |
maintainFormat, | |
priorPage, | |
prompt, | |
}: CompletionArgs): Promise<CompletionResponse> { | |
// Insert the text prompt after the image contents array | |
// https://ai.google.dev/gemini-api/docs/image-understanding?lang=node#technical-details-image | |
// Build the prompt parts | |
const promptParts: any = []; | |
// Add image contents | |
const imageContents = buffers.map((buffer) => | |
createPartFromBase64(encodeImageToBase64(buffer), "image/png") | |
); | |
promptParts.push(...imageContents); | |
// Add system prompt | |
promptParts.push({ text: prompt || SYSTEM_PROMPT_BASE }); | |
// If content has already been generated, add it to context | |
if (maintainFormat && priorPage && priorPage.length) { | |
promptParts.push({ text: CONSISTENCY_PROMPT(priorPage) }); | |
} | |
try { | |
const response = await this.client.models.generateContent({ | |
config: convertKeysToSnakeCase(this.llmParams ?? null), | |
contents: promptParts, | |
model: this.model, | |
}); | |
return { | |
content: response.text || "", | |
inputTokens: response.usageMetadata?.promptTokenCount || 0, | |
outputTokens: response.usageMetadata?.candidatesTokenCount || 0, | |
}; | |
} catch (err) { | |
console.error("Error in Google completion", err); | |
throw err; | |
} | |
} | |
private async handleExtraction({ | |
input, | |
options, | |
prompt, | |
schema, | |
}: ExtractionArgs): Promise<ExtractionResponse> { | |
// Build the prompt parts | |
const promptParts: any = []; | |
const parts = await this.createMessageContent({ input, options }); | |
promptParts.push(...parts); | |
// Add system prompt | |
promptParts.push({ text: prompt || "Extract schema data" }); | |
try { | |
const response = await this.client.models.generateContent({ | |
config: { | |
...convertKeysToSnakeCase(this.llmParams ?? null), | |
responseMimeType: "application/json", | |
responseSchema: schema, | |
}, | |
contents: promptParts, | |
model: this.model, | |
}); | |
return { | |
extracted: response.text ? JSON.parse(response.text) : {}, | |
inputTokens: response.usageMetadata?.promptTokenCount || 0, | |
outputTokens: response.usageMetadata?.candidatesTokenCount || 0, | |
}; | |
} catch (err) { | |
console.error("Error in Google completion", err); | |
throw err; | |
} | |
} | |
} | |
================================================ | |
FILE: node-zerox/src/models/index.ts | |
================================================ | |
import { | |
AzureCredentials, | |
BedrockCredentials, | |
CreateModelArgs, | |
GoogleCredentials, | |
ModelInterface, | |
ModelProvider, | |
OpenAICredentials, | |
} from "../types"; | |
import { validateLLMParams } from "../utils/model"; | |
import AzureModel from "./azure"; | |
import BedrockModel from "./bedrock"; | |
import GoogleModel from "./google"; | |
import OpenAIModel from "./openAI"; | |
// Type guard for Azure credentials | |
const isAzureCredentials = ( | |
credentials: any | |
): credentials is AzureCredentials => { | |
return ( | |
credentials && | |
typeof credentials.endpoint === "string" && | |
typeof credentials.apiKey === "string" | |
); | |
}; | |
// Type guard for Bedrock credentials | |
const isBedrockCredentials = ( | |
credentials: any | |
): credentials is BedrockCredentials => { | |
return credentials && typeof credentials.region === "string"; | |
}; | |
// Type guard for Google credentials | |
const isGoogleCredentials = ( | |
credentials: any | |
): credentials is GoogleCredentials => { | |
return credentials && typeof credentials.apiKey === "string"; | |
}; | |
// Type guard for OpenAI credentials | |
const isOpenAICredentials = ( | |
credentials: any | |
): credentials is OpenAICredentials => { | |
return credentials && typeof credentials.apiKey === "string"; | |
}; | |
export const createModel = ({ | |
credentials, | |
llmParams, | |
model, | |
provider, | |
}: CreateModelArgs): ModelInterface => { | |
const validatedParams = validateLLMParams(llmParams, provider); | |
switch (provider) { | |
case ModelProvider.AZURE: | |
if (!isAzureCredentials(credentials)) { | |
throw new Error("Invalid credentials for Azure provider"); | |
} | |
return new AzureModel(credentials, model, validatedParams); | |
case ModelProvider.BEDROCK: | |
if (!isBedrockCredentials(credentials)) { | |
throw new Error("Invalid credentials for Bedrock provider"); | |
} | |
return new BedrockModel(credentials, model, validatedParams); | |
case ModelProvider.GOOGLE: | |
if (!isGoogleCredentials(credentials)) { | |
throw new Error("Invalid credentials for Google provider"); | |
} | |
return new GoogleModel(credentials, model, validatedParams); | |
case ModelProvider.OPENAI: | |
if (!isOpenAICredentials(credentials)) { | |
throw new Error("Invalid credentials for OpenAI provider"); | |
} | |
return new OpenAIModel(credentials, model, validatedParams); | |
default: | |
throw new Error(`Unsupported model provider: ${provider}`); | |
} | |
}; | |
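// --- Usage sketch (added for illustration; credentials are hypothetical) --- | |
// Note that Google and OpenAI credentials share the same shape ({ apiKey }), | |
// so it is the `provider` argument, not the type guard, that picks the model | |
// class here. | |
// | |
//   const model = createModel({ | |
//     credentials: { apiKey: "sk-..." }, | |
//     llmParams: { maxTokens: 4096 }, | |
//     model: "gpt-4o-mini", | |
//     provider: ModelProvider.OPENAI, | |
//   }); | |
//   const response = await model.getCompletion(OperationMode.OCR, { | |
//     buffers, // page image buffers | |
//     maintainFormat: false, | |
//     priorPage: "", | |
//   }); | |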
================================================ | |
FILE: node-zerox/src/models/openAI.ts | |
================================================ | |
import { | |
CompletionArgs, | |
CompletionResponse, | |
ExtractionArgs, | |
ExtractionResponse, | |
MessageContentArgs, | |
ModelInterface, | |
OpenAICredentials, | |
OpenAILLMParams, | |
OperationMode, | |
} from "../types"; | |
import { | |
cleanupImage, | |
convertKeysToCamelCase, | |
convertKeysToSnakeCase, | |
encodeImageToBase64, | |
} from "../utils"; | |
import { CONSISTENCY_PROMPT, SYSTEM_PROMPT_BASE } from "../constants"; | |
import axios from "axios"; | |
import fs from "fs-extra"; | |
export default class OpenAIModel implements ModelInterface { | |
private apiKey: string; | |
private model: string; | |
private llmParams?: Partial<OpenAILLMParams>; | |
constructor( | |
credentials: OpenAICredentials, | |
model: string, | |
llmParams?: Partial<OpenAILLMParams> | |
) { | |
this.apiKey = credentials.apiKey; | |
this.model = model; | |
this.llmParams = llmParams; | |
} | |
async getCompletion( | |
mode: OperationMode, | |
params: CompletionArgs | ExtractionArgs | |
): Promise<CompletionResponse | ExtractionResponse> { | |
const modeHandlers = { | |
[OperationMode.EXTRACTION]: () => | |
this.handleExtraction(params as ExtractionArgs), | |
[OperationMode.OCR]: () => this.handleOCR(params as CompletionArgs), | |
}; | |
const handler = modeHandlers[mode]; | |
if (!handler) { | |
throw new Error(`Unsupported operation mode: ${mode}`); | |
} | |
return await handler(); | |
} | |
private async createMessageContent({ | |
input, | |
options, | |
}: MessageContentArgs): Promise<any> { | |
const processImages = async (imagePaths: string[]) => { | |
const nestedImages = await Promise.all( | |
imagePaths.map(async (imagePath) => { | |
const imageBuffer = await fs.readFile(imagePath); | |
const buffers = await cleanupImage({ | |
correctOrientation: options?.correctOrientation ?? false, | |
imageBuffer, | |
scheduler: options?.scheduler ?? null, | |
trimEdges: options?.trimEdges ?? false, | |
}); | |
return buffers.map((buffer) => ({ | |
image_url: { | |
url: `data:image/png;base64,${encodeImageToBase64(buffer)}`, | |
}, | |
type: "image_url", | |
})); | |
}) | |
); | |
return nestedImages.flat(); | |
}; | |
if (Array.isArray(input)) { | |
return processImages(input); | |
} | |
if (typeof input === "string") { | |
return [{ text: input, type: "text" }]; | |
} | |
const { imagePaths, text } = input; | |
const images = await processImages(imagePaths); | |
return [...images, { text, type: "text" }]; | |
} | |
private async handleOCR({ | |
buffers, | |
maintainFormat, | |
priorPage, | |
prompt, | |
}: CompletionArgs): Promise<CompletionResponse> { | |
const systemPrompt = prompt || SYSTEM_PROMPT_BASE; | |
// Default system message | |
const messages: any = [{ role: "system", content: systemPrompt }]; | |
// If content has already been generated, add it to context. | |
// This helps maintain the same format across pages | |
if (maintainFormat && priorPage && priorPage.length) { | |
messages.push({ | |
role: "system", | |
content: CONSISTENCY_PROMPT(priorPage), | |
}); | |
} | |
// Add image to request | |
const imageContents = buffers.map((buffer) => ({ | |
type: "image_url", | |
image_url: { | |
url: `data:image/png;base64,${encodeImageToBase64(buffer)}`, | |
}, | |
})); | |
messages.push({ role: "user", content: imageContents }); | |
try { | |
const response = await axios.post( | |
"https://api.openai.com/v1/chat/completions", | |
{ | |
messages, | |
model: this.model, | |
...convertKeysToSnakeCase(this.llmParams ?? null), | |
}, | |
{ | |
headers: { | |
Authorization: `Bearer ${this.apiKey}`, | |
"Content-Type": "application/json", | |
}, | |
} | |
); | |
const data = response.data; | |
const result: CompletionResponse = { | |
content: data.choices[0].message.content, | |
inputTokens: data.usage.prompt_tokens, | |
outputTokens: data.usage.completion_tokens, | |
}; | |
if (this.llmParams?.logprobs) { | |
result["logprobs"] = convertKeysToCamelCase( | |
data.choices[0].logprobs | |
)?.content; | |
} | |
return result; | |
} catch (err) { | |
console.error("Error in OpenAI completion", err); | |
throw err; | |
} | |
} | |
private async handleExtraction({ | |
input, | |
options, | |
prompt, | |
schema, | |
}: ExtractionArgs): Promise<ExtractionResponse> { | |
try { | |
const messages: any = []; | |
if (prompt) { | |
messages.push({ role: "system", content: prompt }); | |
} | |
messages.push({ | |
role: "user", | |
content: await this.createMessageContent({ input, options }), | |
}); | |
const response = await axios.post( | |
"https://api.openai.com/v1/chat/completions", | |
{ | |
messages, | |
model: this.model, | |
response_format: { | |
json_schema: { name: "extraction", schema }, | |
type: "json_schema", | |
}, | |
...convertKeysToSnakeCase(this.llmParams ?? null), | |
}, | |
{ | |
headers: { | |
Authorization: `Bearer ${this.apiKey}`, | |
"Content-Type": "application/json", | |
}, | |
} | |
); | |
const data = response.data; | |
const result: ExtractionResponse = { | |
// Parse the json_schema-constrained completion into an object (mirrors the Azure model) | |
extracted: JSON.parse(data.choices[0].message.content || ""), | |
inputTokens: data.usage.prompt_tokens, | |
outputTokens: data.usage.completion_tokens, | |
}; | |
if (this.llmParams?.logprobs) { | |
result["logprobs"] = convertKeysToCamelCase( | |
data.choices[0].logprobs | |
)?.content; | |
} | |
return result; | |
} catch (err) { | |
console.error("Error in OpenAI completion", err); | |
throw err; | |
} | |
} | |
} | |
================================================ | |
FILE: node-zerox/src/utils/common.ts | |
================================================ | |
export const camelToSnakeCase = (str: string) => | |
str.replace(/[A-Z]/g, (letter: string) => `_${letter.toLowerCase()}`); | |
export const convertKeysToCamelCase = ( | |
obj: Record<string, any> | null | |
): Record<string, any> => { | |
if (typeof obj !== "object" || obj === null) { | |
return obj ?? {}; | |
} | |
if (Array.isArray(obj)) { | |
return obj.map(convertKeysToCamelCase); | |
} | |
return Object.fromEntries( | |
Object.entries(obj).map(([key, value]) => [ | |
snakeToCamelCase(key), | |
convertKeysToCamelCase(value), | |
]) | |
); | |
}; | |
export const convertKeysToSnakeCase = ( | |
obj: Record<string, any> | null | |
): Record<string, any> => { | |
if (typeof obj !== "object" || obj === null) { | |
return obj ?? {}; | |
} | |
return Object.fromEntries( | |
Object.entries(obj).map(([key, value]) => [camelToSnakeCase(key), value]) | |
); | |
}; | |
export const isString = (value: string | null): value is string => { | |
return value !== null; | |
}; | |
export const isValidUrl = (string: string): boolean => { | |
let url; | |
try { | |
url = new URL(string); | |
} catch (_) { | |
return false; | |
} | |
return url.protocol === "http:" || url.protocol === "https:"; | |
}; | |
// Strip out the ```markdown wrapper | |
export const formatMarkdown = (text: string): string => { | |
return ( | |
text | |
// First preserve all language code blocks except html and markdown | |
.replace(/```(?!html|markdown)(\w+)([\s\S]*?)```/g, "§§§$1$2§§§") | |
// Then remove html and markdown code markers | |
.replace(/```(?:html|markdown)|````(?:html|markdown)|```/g, "") | |
// Finally restore all preserved language blocks | |
.replace(/§§§(\w+)([\s\S]*?)§§§/g, "```$1$2```") | |
); | |
}; | |
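// Worked example (added for illustration): given | |
//   "```markdown\n# Title\n```python\nprint('hi')\n```\n```" | |
// the python block is first stashed as "§§§python\nprint('hi')\n§§§", the | |
// html/markdown fences are stripped, and the stash is restored, yielding | |
//   "\n# Title\n```python\nprint('hi')\n```\n" | |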
export const runRetries = async <T>( | |
operation: () => Promise<T>, | |
maxRetries: number, | |
pageNumber: number | |
): Promise<T> => { | |
let retryCount = 0; | |
while (retryCount <= maxRetries) { | |
try { | |
return await operation(); | |
} catch (error) { | |
if (retryCount === maxRetries) { | |
throw error; | |
} | |
console.log(`Retrying page ${pageNumber}...`); | |
retryCount++; | |
} | |
} | |
throw new Error("Unexpected retry error"); | |
}; | |
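// Usage sketch (added for illustration; `fetchPage` is a hypothetical helper): | |
// | |
//   const page = await runRetries(() => fetchPage(pageNumber), 3, pageNumber); | |
// | |
// The total attempt count is maxRetries + 1 (the initial pass plus | |
// maxRetries retries); the error from the final attempt is rethrown. | |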
export const snakeToCamelCase = (str: string): string => | |
str.replace(/_([a-z])/g, (_, letter: string) => letter.toUpperCase()); | |
export const splitSchema = ( | |
schema: Record<string, unknown>, | |
extractPerPage?: string[] | |
): { | |
fullDocSchema: Record<string, unknown> | null; | |
perPageSchema: Record<string, unknown> | null; | |
} => { | |
if (!extractPerPage?.length) { | |
return { fullDocSchema: schema, perPageSchema: null }; | |
} | |
const fullDocSchema: Record<string, unknown> = {}; | |
const perPageSchema: Record<string, unknown> = {}; | |
for (const [key, value] of Object.entries(schema.properties || {})) { | |
(extractPerPage.includes(key) ? perPageSchema : fullDocSchema)[key] = value; | |
} | |
const requiredKeys = Array.isArray(schema.required) ? schema.required : []; | |
return { | |
fullDocSchema: Object.keys(fullDocSchema).length | |
? { | |
type: schema.type, | |
properties: fullDocSchema, | |
required: requiredKeys.filter((key) => !extractPerPage.includes(key)), | |
} | |
: null, | |
perPageSchema: Object.keys(perPageSchema).length | |
? { | |
type: schema.type, | |
properties: perPageSchema, | |
required: requiredKeys.filter((key) => extractPerPage.includes(key)), | |
} | |
: null, | |
}; | |
}; | |
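// Worked example (added for illustration): | |
// | |
//   splitSchema( | |
//     { | |
//       type: "object", | |
//       properties: { vendor: { type: "string" }, total: { type: "number" } }, | |
//       required: ["vendor", "total"], | |
//     }, | |
//     ["total"] | |
//   ) | |
// | |
// returns a fullDocSchema containing only `vendor` (required: ["vendor"]) | |
// and a perPageSchema containing only `total` (required: ["total"]). | |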
================================================ | |
FILE: node-zerox/src/utils/file.ts | |
================================================ | |
import { convert } from "libreoffice-convert"; | |
import { exec } from "child_process"; | |
import { fromPath } from "pdf2pic"; | |
import { pipeline } from "stream/promises"; | |
import { promisify } from "util"; | |
import { v4 as uuidv4 } from "uuid"; | |
import { WriteImageResponse } from "pdf2pic/dist/types/convertResponse"; | |
import axios from "axios"; | |
import fileType from "file-type"; | |
import fs from "fs-extra"; | |
import heicConvert from "heic-convert"; | |
import mime from "mime-types"; | |
import path from "path"; | |
import pdf from "pdf-parse"; | |
import xlsx from "xlsx"; | |
import { ASPECT_RATIO_THRESHOLD } from "../constants"; | |
import { | |
ConvertPdfOptions, | |
ExcelSheetContent, | |
Page, | |
PageStatus, | |
} from "../types"; | |
import { isValidUrl } from "./common"; | |
const convertAsync = promisify(convert); | |
const execAsync = promisify(exec); | |
// Download a remote file, or copy a local one, into the temp directory | |
export const downloadFile = async ({ | |
filePath, | |
tempDir, | |
}: { | |
filePath: string; | |
tempDir: string; | |
}): Promise<{ extension: string; localPath: string }> => { | |
const fileNameExt = path.extname(filePath.split("?")[0]); | |
const localPath = path.join(tempDir, uuidv4() + fileNameExt); | |
let mimetype; | |
// Check if filePath is a URL | |
if (isValidUrl(filePath)) { | |
const writer = fs.createWriteStream(localPath); | |
const response = await axios({ | |
url: filePath, | |
method: "GET", | |
responseType: "stream", | |
}); | |
if (response.status !== 200) { | |
throw new Error(`HTTP error! Status: ${response.status}`); | |
} | |
mimetype = response.headers?.["content-type"]; | |
await pipeline(response.data, writer); | |
} else { | |
// If filePath is a local file, copy it to the temp directory | |
await fs.copyFile(filePath, localPath); | |
} | |
if (!mimetype) { | |
mimetype = mime.lookup(localPath); | |
} | |
let extension = mime.extension(mimetype); | |
if (!extension) { | |
extension = fileNameExt || ""; | |
} | |
if (!extension) { | |
if (mimetype === "binary/octet-stream") { | |
extension = ".bin"; | |
} else { | |
throw new Error("File extension missing"); | |
} | |
} | |
if (!extension.startsWith(".")) { | |
extension = `.${extension}`; | |
} | |
return { extension, localPath }; | |
}; | |
// Check if file is a Compound File Binary (legacy Office format) | |
export const checkIsCFBFile = async (filePath: string): Promise<boolean> => { | |
const type = await fileType.fromFile(filePath); | |
return type?.mime === "application/x-cfb"; | |
}; | |
// Check if file is a PDF by inspecting its magic number ("%PDF" at the beginning) | |
export const checkIsPdfFile = async (filePath: string): Promise<boolean> => { | |
const buffer = await fs.readFile(filePath); | |
return buffer.subarray(0, 4).toString() === "%PDF"; | |
}; | |
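// Note (added commentary): index.ts combines the two checks above. Legacy | |
// Office files (.doc/.xls/.ppt) are CFB containers and sometimes arrive with | |
// a misleading extension, so a file is treated as a PDF only if its extension | |
// or magic number says PDF and it is not a CFB container; otherwise it is | |
// routed through convertFileToPdf (LibreOffice) first. | |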
// Convert HEIC file to JPEG | |
export const convertHeicToJpeg = async ({ | |
localPath, | |
tempDir, | |
}: { | |
localPath: string; | |
tempDir: string; | |
}): Promise<string> => { | |
try { | |
const inputBuffer = await fs.readFile(localPath); | |
const outputBuffer = await heicConvert({ | |
buffer: inputBuffer, | |
format: "JPEG", | |
quality: 1, | |
}); | |
const jpegPath = path.join( | |
tempDir, | |
`${path.basename(localPath, ".heic")}.jpg` | |
); | |
await fs.writeFile(jpegPath, Buffer.from(outputBuffer)); | |
return jpegPath; | |
} catch (err) { | |
console.error(`Error converting .heic to .jpeg:`, err); | |
throw err; | |
} | |
}; | |
// Convert a file in another format (like docx) to a PDF and save it to tempDir | |
export const convertFileToPdf = async ({ | |
extension, | |
localPath, | |
tempDir, | |
}: { | |
extension: string; | |
localPath: string; | |
tempDir: string; | |
}): Promise<string> => { | |
const inputBuffer = await fs.readFile(localPath); | |
const outputFilename = path.basename(localPath, extension) + ".pdf"; | |
const outputPath = path.join(tempDir, outputFilename); | |
try { | |
const pdfBuffer = await convertAsync(inputBuffer, ".pdf", undefined); | |
await fs.writeFile(outputPath, pdfBuffer); | |
return outputPath; | |
} catch (err) { | |
console.error(`Error converting ${extension} to .pdf:`, err); | |
throw err; | |
} | |
}; | |
// Convert each page to a png and save that image to tempDir | |
export const convertPdfToImages = async ({ | |
imageDensity = 300, | |
imageHeight = 2048, | |
pagesToConvertAsImages, | |
pdfPath, | |
tempDir, | |
}: { | |
imageDensity?: number; | |
imageHeight?: number; | |
pagesToConvertAsImages: number | number[]; | |
pdfPath: string; | |
tempDir: string; | |
}): Promise<string[]> => { | |
const aspectRatio = (await getPdfAspectRatio(pdfPath)) || 1; | |
const shouldAdjustHeight = aspectRatio > ASPECT_RATIO_THRESHOLD; | |
const adjustedHeight = shouldAdjustHeight | |
? Math.max(imageHeight, Math.round(aspectRatio * imageHeight)) | |
: imageHeight; | |
const options: ConvertPdfOptions = { | |
density: imageDensity, | |
format: "png", | |
height: adjustedHeight, | |
preserveAspectRatio: true, | |
saveFilename: path.basename(pdfPath, path.extname(pdfPath)), | |
savePath: tempDir, | |
}; | |
try { | |
try { | |
const storeAsImage = fromPath(pdfPath, options); | |
const convertResults: WriteImageResponse[] = await storeAsImage.bulk( | |
pagesToConvertAsImages | |
); | |
// Validate that all pages were converted | |
return convertResults.map((result) => { | |
if (!result.page || !result.path) { | |
throw new Error("Could not identify page data"); | |
} | |
return result.path; | |
}); | |
} catch (err) { | |
return await convertPdfWithPoppler( | |
pagesToConvertAsImages, | |
pdfPath, | |
options | |
); | |
} | |
} catch (err) { | |
console.error("Error during PDF conversion:", err); | |
throw err; | |
} | |
}; | |
// Converts an Excel file to HTML format | |
export const convertExcelToHtml = async ( | |
filePath: string | |
): Promise<ExcelSheetContent[]> => { | |
const tableClass = "zerox-excel-table"; | |
try { | |
if (!(await fs.pathExists(filePath))) { | |
throw new Error(`Excel file not found: ${filePath}`); | |
} | |
const workbook = xlsx.readFile(filePath, { | |
type: "file", | |
cellStyles: true, | |
cellHTML: true, | |
}); | |
if (!workbook || !workbook.SheetNames || workbook.SheetNames.length === 0) { | |
throw new Error("Invalid Excel file or no sheets found"); | |
} | |
const sheets: ExcelSheetContent[] = []; | |
for (const sheetName of workbook.SheetNames) { | |
const worksheet = workbook.Sheets[sheetName]; | |
const jsonData = xlsx.utils.sheet_to_json<any[]>(worksheet, { | |
header: 1, | |
}); | |
let sheetContent = ""; | |
sheetContent += `<h2>Sheet: ${sheetName}</h2>`; | |
sheetContent += `<table class="${tableClass}">`; | |
if (jsonData.length > 0) { | |
jsonData.forEach((row: any[], rowIndex: number) => { | |
sheetContent += "<tr>"; | |
const cellTag = rowIndex === 0 ? "th" : "td"; | |
if (row && row.length > 0) { | |
row.forEach((cell) => { | |
const cellContent = | |
cell !== null && cell !== undefined ? cell.toString() : ""; | |
sheetContent += `<${cellTag}>${cellContent}</${cellTag}>`; | |
}); | |
} | |
sheetContent += "</tr>"; | |
}); | |
} | |
sheetContent += "</table>"; | |
sheets.push({ | |
sheetName, | |
content: sheetContent, | |
contentLength: sheetContent.length, | |
}); | |
} | |
return sheets; | |
} catch (error) { | |
throw error; | |
} | |
}; | |
// Alternative PDF to PNG conversion using Poppler | |
const convertPdfWithPoppler = async ( | |
pagesToConvertAsImages: number | number[], | |
pdfPath: string, | |
options: ConvertPdfOptions | |
): Promise<string[]> => { | |
const { density, format, height, saveFilename, savePath } = options; | |
const outputPrefix = path.join(savePath, saveFilename); | |
const run = async (from?: number, to?: number) => { | |
const pageArgs = from && to ? `-f ${from} -l ${to}` : ""; | |
const cmd = `pdftoppm -${format} -r ${density} -scale-to-y ${height} -scale-to-x -1 ${pageArgs} "${pdfPath}" "${outputPrefix}"`; | |
await execAsync(cmd); | |
}; | |
if (pagesToConvertAsImages === -1) { | |
await run(); | |
} else if (typeof pagesToConvertAsImages === "number") { | |
await run(pagesToConvertAsImages, pagesToConvertAsImages); | |
} else if (Array.isArray(pagesToConvertAsImages)) { | |
await Promise.all(pagesToConvertAsImages.map((page) => run(page, page))); | |
} | |
const convertResults = await fs.readdir(savePath); | |
return convertResults | |
.filter( | |
(result) => | |
result.startsWith(saveFilename) && result.endsWith(`.${format}`) | |
) | |
.map((result) => path.join(savePath, result)); | |
}; | |
// Extracts pages from a structured data file (like Excel) | |
export const extractPagesFromStructuredDataFile = async ( | |
filePath: string | |
): Promise<Page[]> => { | |
if (isExcelFile(filePath)) { | |
const sheets = await convertExcelToHtml(filePath); | |
const pages: Page[] = []; | |
sheets.forEach((sheet: ExcelSheetContent, index: number) => { | |
pages.push({ | |
content: sheet.content, | |
contentLength: sheet.contentLength, | |
page: index + 1, | |
status: PageStatus.SUCCESS, | |
}); | |
}); | |
return pages; | |
} | |
return []; | |
}; | |
// Gets the number of pages from a PDF | |
export const getNumberOfPagesFromPdf = async ({ | |
pdfPath, | |
}: { | |
pdfPath: string; | |
}): Promise<number> => { | |
const dataBuffer = await fs.readFile(pdfPath); | |
const data = await pdf(dataBuffer); | |
return data.numpages; | |
}; | |
// Gets the aspect ratio (height/width) of a PDF | |
const getPdfAspectRatio = async ( | |
pdfPath: string | |
): Promise<number | undefined> => { | |
return new Promise((resolve) => { | |
exec(`pdfinfo "${pdfPath}"`, (error, stdout) => { | |
if (error) return resolve(undefined); | |
const sizeMatch = stdout.match(/Page size:\s+([\d.]+)\s+x\s+([\d.]+)/); | |
if (sizeMatch) { | |
const height = parseFloat(sizeMatch[2]); | |
const width = parseFloat(sizeMatch[1]); | |
return resolve(height / width); | |
} | |
resolve(undefined); | |
}); | |
}); | |
}; | |
// Checks if a file is an Excel file | |
export const isExcelFile = (filePath: string): boolean => { | |
const extension = path.extname(filePath).toLowerCase(); | |
return ( | |
extension === ".xlsx" || | |
extension === ".xls" || | |
extension === ".xlsm" || | |
extension === ".xlsb" | |
); | |
}; | |
// Checks if a file is a structured data file (like Excel) | |
export const isStructuredDataFile = (filePath: string): boolean => { | |
return isExcelFile(filePath); | |
}; | |
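// Usage sketch (added for illustration; the path is hypothetical): | |
// | |
//   if (isStructuredDataFile("/tmp/report.xlsx")) { | |
//     const pages = await extractPagesFromStructuredDataFile("/tmp/report.xlsx"); | |
//     // one Page per sheet; each content is an HTML <table> for that sheet | |
//   } | |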
================================================ | |
FILE: node-zerox/src/utils/image.ts | |
================================================ | |
import sharp from "sharp"; | |
import Tesseract from "tesseract.js"; | |
import { ASPECT_RATIO_THRESHOLD } from "../constants"; | |
interface CleanupImageProps { | |
correctOrientation: boolean; | |
imageBuffer: Buffer; | |
scheduler: Tesseract.Scheduler | null; | |
trimEdges: boolean; | |
} | |
export const encodeImageToBase64 = (imageBuffer: Buffer) => { | |
return imageBuffer.toString("base64"); | |
}; | |
export const cleanupImage = async ({ | |
correctOrientation, | |
imageBuffer, | |
scheduler, | |
trimEdges, | |
}: CleanupImageProps): Promise<Buffer[]> => { | |
const image = sharp(imageBuffer); | |
// Trim extra space around the content in the image | |
if (trimEdges) { | |
image.trim(); | |
} | |
// scheduler would always be non-null if correctOrientation is true | |
// Adding this check to satisfy typescript | |
if (correctOrientation && scheduler) { | |
const optimalRotation = await determineOptimalRotation({ | |
image, | |
scheduler, | |
}); | |
if (optimalRotation) { | |
image.rotate(optimalRotation); | |
} | |
} | |
// Apply the queued trim/rotate operations and materialize the corrected buffer | |
const correctedBuffer = await image.toBuffer(); | |
return await splitTallImage(correctedBuffer); | |
}; | |
// Determine the optimal image orientation based on OCR confidence | |
// Run Tesseract on 4 image orientations and compare the outputs | |
const determineOptimalRotation = async ({ | |
image, | |
scheduler, | |
}: { | |
image: sharp.Sharp; | |
scheduler: Tesseract.Scheduler; | |
}): Promise<number> => { | |
const imageBuffer = await image.toBuffer(); | |
const { | |
data: { orientation_confidence, orientation_degrees }, | |
} = await scheduler.addJob("detect", imageBuffer); | |
if (orientation_degrees) { | |
console.log( | |
`Reorienting image ${orientation_degrees} degrees (confidence: ${orientation_confidence}%)` | |
); | |
return orientation_degrees; | |
} | |
return 0; | |
}; | |
/** | |
* Compress an image to a maximum size | |
* @param image - The image to compress as a buffer | |
* @param maxSize - The maximum size in MB | |
* @returns The compressed image as a buffer | |
*/ | |
export const compressImage = async ( | |
image: Buffer, | |
maxSize: number | |
): Promise<Buffer> => { | |
if (maxSize <= 0) { | |
throw new Error("maxSize must be greater than 0"); | |
} | |
// Convert maxSize from MB to bytes | |
const maxBytes = maxSize * 1024 * 1024; | |
if (image.length <= maxBytes) { | |
return image; | |
} | |
try { | |
// Start with quality 90 and gradually decrease if needed | |
let quality = 90; | |
let compressedImage: Buffer; | |
do { | |
compressedImage = await sharp(image).jpeg({ quality }).toBuffer(); | |
quality -= 10; | |
if (quality < 20) { | |
throw new Error( | |
`Unable to compress image to ${maxSize}MB while maintaining acceptable quality.` | |
); | |
} | |
} while (compressedImage.length > maxBytes); | |
return compressedImage; | |
  } catch (error) { | 
    // Compression failed or hit the quality floor; fall back to the original image | 
    return image; | 
  } | 
}; | |
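// Usage sketch (assumption, not in the original file): cap the payload size | 
// before base64-encoding for a vision model request. The 5 MB limit is | 
// illustrative. | 
// | 
//   const safeBuffer = await compressImage(imageBuffer, 5); | 
//   const base64Image = encodeImageToBase64(safeBuffer); | 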
export const splitTallImage = async ( | |
imageBuffer: Buffer | |
): Promise<Buffer[]> => { | |
const image = sharp(imageBuffer); | |
const metadata = await image.metadata(); | |
const height = metadata.height || 0; | |
const width = metadata.width || 0; | |
const aspectRatio = height / width; | |
if (aspectRatio <= ASPECT_RATIO_THRESHOLD) { | |
return [await image.toBuffer()]; | |
} | |
const { data: imageData } = await image | |
.grayscale() | |
.raw() | |
.toBuffer({ resolveWithObject: true }); | |
const emptySpaces = new Array(height).fill(0); | |
// Analyze each row to find empty spaces | |
for (let y = 0; y < height; y++) { | |
let emptyPixels = 0; | |
for (let x = 0; x < width; x++) { | |
const pixelIndex = y * width + x; | |
if (imageData[pixelIndex] > 230) { | |
emptyPixels++; | |
} | |
} | |
// Calculate percentage of empty pixels in this row | |
const emptyRatio = emptyPixels / width; | |
// Mark rows that are mostly empty (whitespace) | |
emptySpaces[y] = emptyRatio > 0.95 ? 1 : 0; | |
} | |
const significantEmptySpaces = []; | |
let currentEmptyStart = -1; | |
for (let y = 0; y < height; y++) { | |
if (emptySpaces[y] === 1) { | |
if (currentEmptyStart === -1) { | |
currentEmptyStart = y; | |
} | |
} else { | |
if (currentEmptyStart !== -1) { | |
const emptyHeight = y - currentEmptyStart; | |
if (emptyHeight >= 5) { | |
// Minimum height for a significant empty space | |
significantEmptySpaces.push({ | |
center: Math.floor(currentEmptyStart + emptyHeight / 2), | |
end: y - 1, | |
height: emptyHeight, | |
start: currentEmptyStart, | |
}); | |
} | |
currentEmptyStart = -1; | |
} | |
} | |
} | |
// Handle if there's an empty space at the end | |
if (currentEmptyStart !== -1) { | |
const emptyHeight = height - currentEmptyStart; | |
if (emptyHeight >= 5) { | |
significantEmptySpaces.push({ | |
center: Math.floor(currentEmptyStart + emptyHeight / 2), | |
end: height - 1, | |
height: emptyHeight, | |
start: currentEmptyStart, | |
}); | |
} | |
} | |
const numSections = Math.ceil(aspectRatio); | |
const approxSectionHeight = Math.floor(height / numSections); | |
const splitPoints = [0]; | |
for (let i = 1; i < numSections; i++) { | |
const targetY = i * approxSectionHeight; | |
// Find empty spaces near the target position | |
const searchRadius = Math.min(150, approxSectionHeight / 3); | |
const nearbyEmptySpaces = significantEmptySpaces.filter( | |
(space) => | |
Math.abs(space.center - targetY) < searchRadius && | |
space.start > splitPoints[splitPoints.length - 1] + 50 | |
); | |
if (nearbyEmptySpaces.length > 0) { | |
// Sort by proximity to target | |
nearbyEmptySpaces.sort( | |
(a, b) => Math.abs(a.center - targetY) - Math.abs(b.center - targetY) | |
); | |
// Choose center of the best empty space | |
splitPoints.push(nearbyEmptySpaces[0].center); | |
} else { | |
// Fallback if no good empty spaces found | |
const minY = splitPoints[splitPoints.length - 1] + 50; | |
const maxY = Math.min(height - 50, targetY + searchRadius); | |
splitPoints.push(Math.max(minY, Math.min(maxY, targetY))); | |
} | |
} | |
splitPoints.push(height); | |
return Promise.all( | |
splitPoints.slice(0, -1).map((top, i) => { | |
const sectionHeight = splitPoints[i + 1] - top; | |
return sharp(imageBuffer) | |
.extract({ left: 0, top, width, height: sectionHeight }) | |
.toBuffer(); | |
}) | |
); | |
}; | |
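// End-to-end sketch (assumption): cleanupImage trims edges, fixes orientation, | 
// and splits overly tall pages; each returned slice can be encoded separately. | 
// | 
//   const buffers = await cleanupImage({ | 
//     correctOrientation: true, | 
//     imageBuffer, | 
//     scheduler, | 
//     trimEdges: true, | 
//   }); | 
//   const payloads = buffers.map(encodeImageToBase64); | 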
================================================ | |
FILE: node-zerox/src/utils/index.ts | |
================================================ | |
export * from "./common"; | |
export * from "./file"; | |
export * from "./image"; | |
export * from "./model"; | |
export * from "./tesseract"; | |
================================================ | |
FILE: node-zerox/src/utils/model.ts | |
================================================ | |
import { | |
CompletionResponse, | |
ExtractionResponse, | |
LLMParams, | |
ModelProvider, | |
OperationMode, | |
ProcessedCompletionResponse, | |
ProcessedExtractionResponse, | |
} from "../types"; | |
import { formatMarkdown } from "./common"; | |
export const isCompletionResponse = ( | |
mode: OperationMode, | |
response: CompletionResponse | ExtractionResponse | |
): response is CompletionResponse => { | |
return mode === OperationMode.OCR; | |
}; | |
const isExtractionResponse = ( | |
mode: OperationMode, | |
response: CompletionResponse | ExtractionResponse | |
): response is ExtractionResponse => { | |
return mode === OperationMode.EXTRACTION; | |
}; | |
export class CompletionProcessor { | |
static process<T extends OperationMode>( | |
mode: T, | |
response: CompletionResponse | ExtractionResponse | |
): T extends OperationMode.EXTRACTION | |
? ProcessedExtractionResponse | |
: ProcessedCompletionResponse { | |
const { logprobs, ...responseWithoutLogprobs } = response; | |
if (isCompletionResponse(mode, response)) { | |
const content = response.content; | |
return { | |
...responseWithoutLogprobs, | |
content: | |
typeof content === "string" ? formatMarkdown(content) : content, | |
contentLength: response.content?.length || 0, | |
} as T extends OperationMode.EXTRACTION | |
? ProcessedExtractionResponse | |
: ProcessedCompletionResponse; | |
} | |
if (isExtractionResponse(mode, response)) { | |
const extracted = response.extracted; | |
return { | |
...responseWithoutLogprobs, | |
extracted: | |
typeof extracted === "object" ? extracted : JSON.parse(extracted), | |
} as T extends OperationMode.EXTRACTION | |
? ProcessedExtractionResponse | |
: ProcessedCompletionResponse; | |
} | |
return responseWithoutLogprobs as T extends OperationMode.EXTRACTION | |
? ProcessedExtractionResponse | |
: ProcessedCompletionResponse; | |
} | |
} | |
const providerDefaultParams: Record<ModelProvider | string, LLMParams> = { | |
[ModelProvider.AZURE]: { | |
frequencyPenalty: 0, | |
logprobs: false, | |
maxTokens: 4000, | |
presencePenalty: 0, | |
temperature: 0, | |
topP: 1, | |
}, | |
[ModelProvider.BEDROCK]: { | |
maxTokens: 4000, | |
temperature: 0, | |
topP: 1, | |
}, | |
[ModelProvider.GOOGLE]: { | |
frequencyPenalty: 0, | |
maxOutputTokens: 4000, | |
presencePenalty: 0, | |
temperature: 0, | |
topP: 1, | |
}, | |
[ModelProvider.OPENAI]: { | |
frequencyPenalty: 0, | |
logprobs: false, | |
maxTokens: 4000, | |
presencePenalty: 0, | |
temperature: 0, | |
topP: 1, | |
}, | |
}; | |
export const validateLLMParams = <T extends LLMParams>( | |
params: Partial<T>, | |
provider: ModelProvider | string | |
): LLMParams => { | |
const defaultParams = providerDefaultParams[provider]; | |
if (!defaultParams) { | |
throw new Error(`Unsupported model provider: ${provider}`); | |
} | |
const validKeys = new Set(Object.keys(defaultParams)); | |
for (const [key, value] of Object.entries(params)) { | |
if (!validKeys.has(key)) { | |
throw new Error( | |
`Invalid LLM parameter for ${provider}: ${key}. Valid parameters are: ${Array.from( | |
validKeys | |
).join(", ")}` | |
); | |
} | |
const expectedType = typeof defaultParams[key as keyof LLMParams]; | |
if (typeof value !== expectedType) { | |
throw new Error(`Value for '${key}' must be a ${expectedType}`); | |
} | |
} | |
return { ...defaultParams, ...params }; | |
}; | |
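// Example (sketch): user params are validated against the provider defaults | 
// above, then merged over them. | 
// | 
//   validateLLMParams({ temperature: 0.2 }, ModelProvider.OPENAI); | 
//   // -> { frequencyPenalty: 0, logprobs: false, maxTokens: 4000, | 
//   //      presencePenalty: 0, temperature: 0.2, topP: 1 } | 
// | 
//   validateLLMParams({ maxTokens: "4000" }, ModelProvider.OPENAI); | 
//   // -> throws: "Value for 'maxTokens' must be a number" | 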
================================================ | |
FILE: node-zerox/src/utils/tesseract.ts | |
================================================ | |
import * as Tesseract from "tesseract.js"; | |
import { NUM_STARTING_WORKERS } from "../constants"; | |
export const getTesseractScheduler = async () => { | |
return Tesseract.createScheduler(); | |
}; | |
const createAndAddWorker = async (scheduler: Tesseract.Scheduler) => { | |
const worker = await Tesseract.createWorker("eng", 2, { | |
legacyCore: true, | |
legacyLang: true, | |
}); | |
await worker.setParameters({ | |
tessedit_pageseg_mode: Tesseract.PSM.OSD_ONLY, | |
}); | |
return scheduler.addWorker(worker); | |
}; | |
export const addWorkersToTesseractScheduler = async ({ | |
numWorkers, | |
scheduler, | |
}: { | |
numWorkers: number; | |
scheduler: Tesseract.Scheduler; | |
}) => { | |
  const workerPromises = Array.from({ length: numWorkers }, () => | 
    createAndAddWorker(scheduler) | 
  ); | 
  await Promise.all(workerPromises); | 
return true; | |
}; | |
export const terminateScheduler = (scheduler: Tesseract.Scheduler) => { | |
return scheduler.terminate(); | |
}; | |
export const prepareWorkersForImageProcessing = async ({ | |
numImages, | |
maxTesseractWorkers, | |
scheduler, | |
}: { | |
numImages: number; | |
maxTesseractWorkers: number; | |
scheduler: Tesseract.Scheduler | null; | |
}) => { | |
// Add more workers if correctOrientation is true | |
const numRequiredWorkers = numImages; | |
let numNewWorkers = numRequiredWorkers - NUM_STARTING_WORKERS; | |
if (maxTesseractWorkers !== -1) { | |
const numPreviouslyInitiatedWorkers = | |
maxTesseractWorkers < NUM_STARTING_WORKERS | |
? maxTesseractWorkers | |
: NUM_STARTING_WORKERS; | |
if (numRequiredWorkers > numPreviouslyInitiatedWorkers) { | |
numNewWorkers = Math.min( | |
numRequiredWorkers - numPreviouslyInitiatedWorkers, | |
maxTesseractWorkers - numPreviouslyInitiatedWorkers | |
); | |
} else { | |
numNewWorkers = 0; | |
} | |
} | |
// Add more workers if needed | |
  if (numNewWorkers > 0 && maxTesseractWorkers !== 0 && scheduler) { | 
    await addWorkersToTesseractScheduler({ | 
      numWorkers: numNewWorkers, | 
      scheduler, | 
    }); | 
  } | 
}; | |
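// Worker-count sketch (NUM_STARTING_WORKERS value assumed for illustration): | 
// with NUM_STARTING_WORKERS = 3, numImages = 10, and maxTesseractWorkers = 4, | 
// the pool already holds 3 workers, so min(10 - 3, 4 - 3) = 1 new worker is added. | 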
================================================ | |
FILE: node-zerox/tests/README.md | |
================================================ | |
# Test Script README | |
This script runs a quick test of the zerox output against a set of keywords from known documents. It is not an exhaustive test, as it does not cover layout, but it gives a good sense of any regressions. | 
## Overview | |
- **Processes Files**: Reads documents from `shared/inputs` (mix of PDFs, images, Word docs, etc.). | |
- **Runs OCR**: Runs `zerox` live against all the files. | |
- **Keyword Verification**: Compares extracted text with expected keywords from `shared/test.json`. | |
- **Results**: Outputs counts of keywords found and missing, and displays a summary table. | |
## How to Run | |
You should be able to run this test with `npm run test` from the root directory. | |
Note: you will need a `.env` file in `node-zerox` with your OpenAI API key: | 
``` | |
OPENAI_API_KEY=your_api_key_here | |
``` | |
## Contributing new tests | |
1. Add Your Document: | |
- Place the file in `shared/inputs` (e.g., `0005.pdf`). | |
2. Update `test.json`: | |
- Add an entry: | |
```json | |
{ | |
"file": "your_file.ext", | |
"expectedKeywords": [ | |
["keyword1_page1", "keyword2_page1"], | |
["keyword1_page2", "keyword2_page2"] | |
] | |
} | |
``` | |
3. Run the Test: | |
- Execute the script to include the new file. | |
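Keyword checks are case-insensitive substring matches against each page's output (see `tests/utils.ts`), roughly: | 
``` | 
const found = page.content.toLowerCase().includes(keyword.toLowerCase()); | 
``` | 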
## Performance Tests | |
To run the performance tests, use `npm run test:performance`. | |
================================================ | |
FILE: node-zerox/tests/index.ts | |
================================================ | |
import { compareKeywords } from "./utils"; | |
import { ModelOptions } from "../src/types"; | |
import { zerox } from "../src"; | |
import dotenv from "dotenv"; | |
import fs from "node:fs"; | |
import path from "node:path"; | |
import pLimit from "p-limit"; | |
dotenv.config({ path: path.join(__dirname, "../.env") }); | |
interface TestInput { | |
expectedKeywords: string[][]; | |
file: string; | |
} | |
const FILE_CONCURRENCY = 10; | |
const INPUT_DIR = path.join(__dirname, "../../shared/inputs"); | |
const TEST_JSON_PATH = path.join(__dirname, "../../shared/test.json"); | |
const OUTPUT_DIR = path.join(__dirname, "results", `test-run-${Date.now()}`); | |
const TEMP_DIR = path.join(OUTPUT_DIR, "temp"); | |
async function main() { | |
const T1 = new Date(); | |
// Read the test inputs and expected keywords | |
const testInputs: TestInput[] = JSON.parse( | |
fs.readFileSync(TEST_JSON_PATH, "utf-8") | |
); | |
// Create the output directory | |
fs.mkdirSync(OUTPUT_DIR, { recursive: true }); | |
const limit = pLimit(FILE_CONCURRENCY); | |
const results = await Promise.all( | |
testInputs.map((testInput) => | |
limit(async () => { | |
const filePath = path.join(INPUT_DIR, testInput.file); | |
// Check if the file exists | |
if (!fs.existsSync(filePath)) { | |
console.warn(`File not found: ${filePath}`); | |
return null; | |
} | |
// Run OCR on the file | |
const ocrResult = await zerox({ | |
cleanup: false, | |
filePath, | |
maintainFormat: false, | |
model: ModelOptions.OPENAI_GPT_4O, | |
openaiAPIKey: process.env.OPENAI_API_KEY, | |
outputDir: OUTPUT_DIR, | |
tempDir: TEMP_DIR, | |
}); | |
// Compare expected keywords with OCR output | |
const keywordCounts = compareKeywords( | |
ocrResult.pages, | |
testInput.expectedKeywords | |
); | |
// Prepare the result | |
return { | |
file: testInput.file, | |
keywordCounts, | |
totalKeywords: testInput.expectedKeywords.flat().length, | |
}; | |
}) | |
) | |
); | |
// Filter out any null results (due to missing files) | |
const filteredResults = results.filter((result) => result !== null); | |
const tableData = filteredResults.map((result) => { | |
const totalFound = | |
result?.keywordCounts.reduce( | |
(sum, page) => sum + page.keywordsFound.length, | |
0 | |
) ?? 0; | |
const totalMissing = | |
result?.keywordCounts.reduce( | |
(sum, page) => sum + page.keywordsMissing.length, | |
0 | |
) ?? 0; | |
const totalKeywords = totalFound + totalMissing; | |
const percentage = | |
totalKeywords > 0 | |
? ((totalFound / totalKeywords) * 100).toFixed(2) + "%" | |
: "N/A"; | |
return { | |
fileName: result?.file, | |
keywordsFound: totalFound, | |
keywordsMissing: totalMissing, | |
percentage, | |
}; | |
}); | |
// Write the test results to output.json | |
fs.writeFileSync( | |
path.join(OUTPUT_DIR, "output.json"), | |
JSON.stringify(filteredResults, null, 2) | |
); | |
const T2 = new Date(); | |
const completionTime = ((T2.getTime() - T1.getTime()) / 1000).toFixed(2); | |
// Calculate overall accuracy and total pages tested | |
const totalKeywordsFound = filteredResults.reduce( | |
(sum, result) => | |
sum + | |
(result?.keywordCounts?.reduce( | |
(s, page) => s + (page.keywordsFound?.length ?? 0), | |
0 | |
) ?? 0), | |
0 | |
); | |
const totalKeywordsMissing = filteredResults.reduce( | |
(sum, result) => | |
sum + | |
(result?.keywordCounts?.reduce( | |
(s, page) => s + (page.keywordsMissing?.length ?? 0), | |
0 | |
) ?? 0), | |
0 | |
); | |
const totalKeywords = totalKeywordsFound + totalKeywordsMissing; | |
const overallAccuracy = | |
totalKeywords > 0 | |
? ((totalKeywordsFound / totalKeywords) * 100).toFixed(2) + "%" | |
: "N/A"; | |
const pagesTested = filteredResults.reduce( | |
(sum, result) => sum + (result?.keywordCounts?.length ?? 0), | |
0 | |
); | |
console.log("\n"); | |
console.log("-------------------------------------------------------------"); | |
console.log("Test complete in", completionTime, "seconds"); | |
console.log("Overall accuracy:", overallAccuracy); | |
console.log("Pages tested:", pagesTested); | |
console.log("-------------------------------------------------------------"); | |
console.table(tableData); | |
console.log("-------------------------------------------------------------"); | |
console.log(`Full test results are available in ${OUTPUT_DIR}`); | |
console.log("-------------------------------------------------------------"); | |
console.log("\n"); | |
} | |
main().catch((error) => { | |
console.error("An error occurred during the test run:", error); | |
}); | |
================================================ | |
FILE: node-zerox/tests/performance.test.ts | |
================================================ | |
import path from "path"; | |
import fs from "fs-extra"; | |
import { zerox } from "../src"; | |
import { ModelOptions } from "../src/types"; | |
const MOCK_OPENAI_TIME = 0; | |
const TEST_FILES_DIR = path.join(__dirname, "data"); | |
interface TestResult { | 
  numPages: number; | 
  concurrency: number; | 
  duration: number; | 
  avgTimePerPage: number; | 
  successRate: number; | 
} | 
// Mock the OpenAIModel class | |
jest.mock("../src/models/openAI", () => { | |
return { | |
__esModule: true, | |
default: class MockOpenAIModel { | |
constructor() { | |
// Mock constructor | |
} | |
async getCompletion() { | |
await new Promise((resolve) => setTimeout(resolve, MOCK_OPENAI_TIME)); | |
return { | |
content: | |
"# Mocked Content\n\nThis is a mocked response for testing purposes.", | |
inputTokens: 100, | |
outputTokens: 50, | |
}; | |
} | |
}, | |
}; | |
}); | |
describe("Zerox Performance Tests", () => { | |
const allResults: TestResult[] = []; | |
beforeAll(async () => { | |
// Ensure test directories exist | |
await fs.ensureDir(TEST_FILES_DIR); | |
}); | |
const runPerformanceTest = async (numPages: number, concurrency: number) => { | |
const filePath = path.join(TEST_FILES_DIR, `${numPages}-pages.pdf`); | |
console.log(`\nTesting ${numPages} pages with concurrency ${concurrency}`); | |
console.time(`Processing ${numPages} pages`); | |
const startTime = Date.now(); | |
const result = await zerox({ | |
cleanup: true, | |
concurrency, | |
filePath, | |
model: ModelOptions.OPENAI_GPT_4O, | |
openaiAPIKey: "mock-key", | |
}); | |
const duration = Date.now() - startTime; | |
console.timeEnd(`Processing ${numPages} pages`); | |
return { | |
numPages, | |
concurrency, | |
duration, | |
avgTimePerPage: duration / numPages, | |
successRate: | |
((result.summary.ocr?.successful || 0) / result.summary.totalPages) * | |
100, | |
}; | |
}; | |
const testCases = [ | |
{ pages: 1, concurrency: 20 }, | |
{ pages: 10, concurrency: 20 }, | |
{ pages: 20, concurrency: 20 }, | |
{ pages: 30, concurrency: 20 }, | |
{ pages: 50, concurrency: 20 }, | |
{ pages: 100, concurrency: 20 }, | |
{ pages: 1, concurrency: 50 }, | |
{ pages: 10, concurrency: 50 }, | |
{ pages: 20, concurrency: 50 }, | |
{ pages: 30, concurrency: 50 }, | |
{ pages: 50, concurrency: 50 }, | |
{ pages: 100, concurrency: 50 }, | |
]; | |
test.each(testCases)( | |
"Performance test with $pages pages and concurrency $concurrency", | |
async ({ pages, concurrency }) => { | |
const results = await runPerformanceTest(pages, concurrency); | |
allResults.push(results); | |
console.table({ | |
"Number of Pages": results.numPages, | |
Concurrency: results.concurrency, | |
"Total Duration (ms)": results.duration, | |
"Avg Time per Page (ms)": Math.round(results.avgTimePerPage), | |
}); | |
expect(results.duration).toBeGreaterThan(0); | |
}, | |
// Set timeout to accommodate larger tests | |
120000 | |
); | |
afterAll(() => { | |
// Print performance comparison | |
console.log("\n=== FINAL PERFORMANCE COMPARISON ==="); | |
const comparisonTable = Array.from(new Set(testCases.map((tc) => tc.pages))) | |
.sort((a, b) => a - b) | |
.map((pages) => { | |
const c20 = allResults.find( | |
(r) => r.numPages === pages && r.concurrency === 20 | |
); | |
const c50 = allResults.find( | |
(r) => r.numPages === pages && r.concurrency === 50 | |
); | |
return { | |
Pages: pages, | |
"Time (concurrency=20) (s)": c20 | |
? (c20.duration / 1000).toFixed(2) | |
: "N/A", | |
"Time (concurrency=50) (s)": c50 | |
? (c50.duration / 1000).toFixed(2) | |
: "N/A", | |
Improvement: | |
c20 && c50 | |
? `${((1 - c50.duration / c20.duration) * 100).toFixed(1)}%` | |
: "N/A", | |
}; | |
}); | |
console.table(comparisonTable); | |
}); | |
}); | |
================================================ | |
FILE: node-zerox/tests/utils.ts | |
================================================ | |
import { Page } from "../src/types"; | |
export const compareKeywords = ( | |
pages: Page[], | |
expectedKeywords: string[][] | |
) => { | |
const keywordCounts: { | |
keywordsFound: string[]; | |
keywordsMissing: string[]; | |
page: number; | |
totalKeywords: number; | |
}[] = []; | |
for (let i = 0; i < expectedKeywords.length; i++) { | |
const page = pages[i]; | |
const keywords = expectedKeywords[i]; | |
const keywordsFound: string[] = []; | |
const keywordsMissing: string[] = []; | |
if (page && keywords && page.content !== undefined) { | |
const pageContent = page.content.toLowerCase(); | |
keywords.forEach((keyword) => { | |
if (pageContent.includes(keyword.toLowerCase())) { | |
keywordsFound.push(keyword); | |
} else { | |
keywordsMissing.push(keyword); | |
} | |
}); | |
} | |
keywordCounts.push({ | |
keywordsFound, | |
keywordsMissing, | |
page: i + 1, | |
totalKeywords: keywords.length, | |
}); | |
} | |
return keywordCounts; | |
}; | |
================================================ | |
FILE: py_zerox/pyzerox/__init__.py | |
================================================ | |
from .core import zerox | |
from .constants.prompts import Prompts | |
DEFAULT_SYSTEM_PROMPT = Prompts.DEFAULT_SYSTEM_PROMPT | |
__all__ = [ | |
"zerox", | |
"Prompts", | |
"DEFAULT_SYSTEM_PROMPT", | |
] | |
================================================ | |
FILE: py_zerox/pyzerox/constants/__init__.py | |
================================================ | |
from .conversion import PDFConversionDefaultOptions | |
from .messages import Messages | |
from .prompts import Prompts | |
__all__ = [ | |
"PDFConversionDefaultOptions", | |
"Messages", | |
"Prompts", | |
] | |
================================================ | |
FILE: py_zerox/pyzerox/constants/conversion.py | |
================================================ | |
class PDFConversionDefaultOptions: | |
"""Default options for converting PDFs to images""" | |
DPI = 300 | |
FORMAT = "png" | |
SIZE = (None, 1056) | |
THREAD_COUNT = 4 | |
USE_PDFTOCAIRO = True | |
================================================ | |
FILE: py_zerox/pyzerox/constants/messages.py | |
================================================ | |
class Messages: | |
"""User-facing messages""" | |
MISSING_ENVIRONMENT_VARIABLES = """ | |
Required environment variables (keys) for the model are missing. Please set the required environment variables for the model provider. | 
Refer: https://docs.litellm.ai/docs/providers | |
""" | |
NON_VISION_MODEL = """ | |
The provided model is not a vision model. Please provide a vision model. | |
""" | |
MODEL_ACCESS_ERROR = """ | |
The provided model can't be accessed. Please make sure you have access to the model and that the required environment variables are set up correctly, including valid API key(s). | 
Refer: https://docs.litellm.ai/docs/providers | |
""" | |
CUSTOM_SYSTEM_PROMPT_WARNING = """ | |
Custom system prompt was provided which overrides the default system prompt. We assume that you know what you are doing. | |
""" | |
MAINTAIN_FORMAT_SELECTED_PAGES_WARNING = """ | |
The maintain_format flag is set to True in conjunction with the select_pages input. This may result in unexpected behavior. | 
""" | |
PAGE_NUMBER_OUT_OF_BOUND_ERROR = """ | |
The page number(s) provided are out of bounds. Please provide valid page number(s). | 
""" | |
NON_200_RESPONSE = """ | |
Model API returned status code {status_code}: {data} | |
Please check the litellm documentation for more information. https://docs.litellm.ai/docs/exception_mapping. | |
""" | |
COMPLETION_ERROR = """ | |
Error in Completion Response. Error: {0} | |
Please check the status of your model provider API status. | |
""" | |
PDF_CONVERSION_FAILED = """ | |
Error during PDF conversion: {0} | |
Please check the PDF file and try again. For more information: https://github.com/Belval/pdf2image | |
""" | |
FILE_UNREACHABLE = """ | 
File not found or unreachable. Status Code: {0} | |
""" | |
FILE_PATH_MISSING = """ | |
File path is invalid or missing. | |
""" | |
FAILED_TO_SAVE_FILE = """Failed to save file to local drive""" | |
FAILED_TO_PROCESS_IMAGE = """Failed to process image""" | |
================================================ | |
FILE: py_zerox/pyzerox/constants/patterns.py | |
================================================ | |
class Patterns: | |
"""Regex patterns for markdown and code blocks""" | |
MATCH_MARKDOWN_BLOCKS = r"^```[a-z]*\n([\s\S]*?)\n```$" | |
MATCH_CODE_BLOCKS = r"^```\n([\s\S]*?)\n```$" | |
================================================ | |
FILE: py_zerox/pyzerox/constants/prompts.py | |
================================================ | |
class Prompts: | |
"""Class for storing prompts for the Zerox system.""" | |
DEFAULT_SYSTEM_PROMPT = """ | |
Convert the following document to markdown. | |
Return only the markdown with no explanation text. Do not include delimiters like ```markdown or ```html. | |
RULES: | |
- You must include all information on the page. Do not exclude headers, footers, or subtext. | |
- Return tables in an HTML format. | |
- Charts & infographics must be interpreted to a markdown format. Prefer table format when applicable. | |
- Logos should be wrapped in brackets. Ex: <logo>Coca-Cola</logo> | 
- Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark> | 
- Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number> | 
- Prefer using ☐ and ☑ for check boxes. | |
""" | |
================================================ | |
FILE: py_zerox/pyzerox/core/__init__.py | |
================================================ | |
from .zerox import zerox | |
__all__ = [ | |
"zerox", | |
] | |
================================================ | |
FILE: py_zerox/pyzerox/core/types.py | |
================================================ | |
from typing import List, Optional, Dict, Any, Union, Iterable | |
from dataclasses import dataclass, field | |
@dataclass | |
class ZeroxArgs: | |
""" | |
Dataclass to store the arguments for the Zerox class. | |
""" | |
file_path: str | |
cleanup: bool = True | |
concurrency: int = 10 | |
maintain_format: bool = False | |
    model: str = "gpt-4o-mini" | 
output_dir: Optional[str] = None | |
temp_dir: Optional[str] = None | |
custom_system_prompt: Optional[str] = None | |
select_pages: Optional[Union[int, Iterable[int]]] = None | |
kwargs: Dict[str, Any] = field(default_factory=dict) | |
@dataclass | |
class Page: | |
""" | |
Dataclass to store the page content. | |
""" | |
content: str | |
content_length: int | |
page: int | |
@dataclass | |
class ZeroxOutput: | |
""" | |
Dataclass to store the output of the Zerox class. | |
""" | |
completion_time: float | |
file_name: str | |
input_tokens: int | |
output_tokens: int | |
pages: List[Page] | |
================================================ | |
FILE: py_zerox/pyzerox/core/zerox.py | |
================================================ | |
import os | |
import aioshutil as async_shutil | |
import tempfile | |
import warnings | |
from typing import List, Optional, Union, Iterable | |
from datetime import datetime | |
import aiofiles | |
import aiofiles.os as async_os | |
import asyncio | |
from ..constants import PDFConversionDefaultOptions | |
# Package Imports | |
from ..processor import ( | |
convert_pdf_to_images, | |
download_file, | |
process_page, | |
process_pages_in_batches, | |
create_selected_pages_pdf, | |
) | |
from ..errors import FileUnavailable | |
from ..constants.messages import Messages | |
from ..models import litellmmodel | |
from .types import Page, ZeroxOutput | |
async def zerox( | |
cleanup: bool = True, | |
concurrency: int = 10, | |
file_path: Optional[str] = "", | |
image_density: int = PDFConversionDefaultOptions.DPI, | |
image_height: tuple[Optional[int], int] = PDFConversionDefaultOptions.SIZE, | |
maintain_format: bool = False, | |
model: str = "gpt-4o-mini", | |
output_dir: Optional[str] = None, | |
temp_dir: Optional[str] = None, | |
custom_system_prompt: Optional[str] = None, | |
select_pages: Optional[Union[int, Iterable[int]]] = None, | |
**kwargs | |
) -> ZeroxOutput: | |
""" | |
API to perform OCR to markdown using Vision models. | |
Please setup the environment variables for the model and model provider before using this API. Refer: https://docs.litellm.ai/docs/providers | |
:param cleanup: Whether to cleanup the temporary files after processing, defaults to True | |
:type cleanup: bool, optional | |
:param concurrency: The number of concurrent processes to run, defaults to 10 | |
:type concurrency: int, optional | |
    :param file_path: The path or URL to the PDF file to process. | 
    :type file_path: str, optional | 
    :param image_density: The DPI to use when converting the PDF to images, defaults to 300 | 
    :type image_density: int, optional | 
    :param image_height: The image size passed to pdf2image as (width, height), defaults to (None, 1056) | 
    :type image_height: tuple, optional | 
:param maintain_format: Whether to maintain the format from the previous page, defaults to False | |
:type maintain_format: bool, optional | |
    :param model: The model to use for generating completions, defaults to "gpt-4o-mini". Note - refer to https://docs.litellm.ai/docs/providers to pass the correct model name, as it may differ from the actual name depending on the provider. | 
:type model: str, optional | |
:param output_dir: The directory to save the markdown output, defaults to None | |
:type output_dir: str, optional | |
    :param temp_dir: The directory to store temporary files, defaults to a named folder in the system's temp directory. If it already exists, its contents will be deleted before zerox uses it. | 
:type temp_dir: str, optional | |
:param custom_system_prompt: The system prompt to use for the model, this overrides the default system prompt of zerox. Generally it is not required unless you want some specific behaviour. When set, it will raise a friendly warning, defaults to None | |
:type custom_system_prompt: str, optional | |
:param select_pages: Pages to process, can be a single page number or an iterable of page numbers, defaults to None | |
:type select_pages: int or Iterable[int], optional | |
:param kwargs: Additional keyword arguments to pass to the model.completion -> litellm.completion method. Refer: https://docs.litellm.ai/docs/providers and https://docs.litellm.ai/docs/completion/input | |
:return: The markdown content generated by the model. | |
""" | |
input_token_count = 0 | |
output_token_count = 0 | |
prior_page = "" | |
aggregated_markdown: List[str] = [] | |
start_time = datetime.now() | |
# File Path Validators | |
if not file_path: | |
raise FileUnavailable() | |
# Create an instance of the litellm model interface | |
vision_model = litellmmodel(model=model,**kwargs) | |
# override the system prompt if a custom prompt is provided | |
if custom_system_prompt: | |
vision_model.system_prompt = custom_system_prompt | |
# Check if both maintain_format and select_pages are provided | |
if maintain_format and select_pages is not None: | |
warnings.warn(Messages.MAINTAIN_FORMAT_SELECTED_PAGES_WARNING) | |
# If select_pages is a single integer, convert it to a list for consistency | |
if isinstance(select_pages, int): | |
select_pages = [select_pages] | |
# Sort the pages to maintain consistency | |
if select_pages is not None: | |
select_pages = sorted(select_pages) | |
# Ensure the output directory exists | |
if output_dir: | |
await async_os.makedirs(output_dir, exist_ok=True) | |
## delete tmp_dir if exists and then recreate it | |
if temp_dir: | |
if os.path.exists(temp_dir): | |
await async_shutil.rmtree(temp_dir) | |
await async_os.makedirs(temp_dir, exist_ok=True) | |
# Create a temporary directory to store the PDF and images | |
with tempfile.TemporaryDirectory() as temp_dir_: | |
if temp_dir: | |
## use the user provided temp directory | |
temp_directory = temp_dir | |
else: | |
## use the system temp directory | |
temp_directory = temp_dir_ | |
# Download the PDF. Get file name. | |
local_path = await download_file(file_path=file_path, temp_dir=temp_directory) | |
if not local_path: | |
raise FileUnavailable() | |
raw_file_name = os.path.splitext(os.path.basename(local_path))[0] | |
file_name = "".join(c.lower() if c.isalnum() else "_" for c in raw_file_name) | |
# Truncate file name to 255 characters to prevent ENAMETOOLONG errors | |
file_name = file_name[:255] | |
# create a subset pdf in temp dir with only the requested pages if select_pages is provided | |
if select_pages is not None: | |
            subset_pdf_create_kwargs = { | 
                "original_pdf_path": local_path, | 
                "select_pages": select_pages, | 
                "save_directory": temp_directory, | 
                "suffix": "_selected_pages", | 
            } | 
            local_path = await asyncio.to_thread( | 
                create_selected_pages_pdf, **subset_pdf_create_kwargs | 
            ) | 
# Convert the file to a series of images, below function returns a list of image paths in page order | |
images = await convert_pdf_to_images(image_density=image_density, image_height=image_height, local_path=local_path, temp_dir=temp_directory) | |
if maintain_format: | |
for image in images: | |
result, input_token_count, output_token_count, prior_page = await process_page( | |
image, | |
vision_model, | |
temp_directory, | |
input_token_count, | |
output_token_count, | |
prior_page, | |
) | |
if result: | |
aggregated_markdown.append(result) | |
else: | |
results = await process_pages_in_batches( | |
images, | |
concurrency, | |
vision_model, | |
temp_directory, | |
input_token_count, | |
output_token_count, | |
prior_page, | |
) | |
aggregated_markdown = [result[0] for result in results if isinstance(result[0], str)] | |
## add token usage | |
input_token_count += sum([result[1] for result in results]) | |
output_token_count += sum([result[2] for result in results]) | |
# Write the aggregated markdown to a file | |
if output_dir: | |
result_file_path = os.path.join(output_dir, f"{file_name}.md") | |
async with aiofiles.open(result_file_path, "w", encoding="utf-8") as f: | |
await f.write("\n\n".join(aggregated_markdown)) | |
# Cleanup the downloaded PDF file | |
if cleanup and os.path.exists(temp_directory): | |
await async_shutil.rmtree(temp_directory) | |
# Format JSON response | |
end_time = datetime.now() | |
completion_time = (end_time - start_time).total_seconds() * 1000 | |
# Adjusting the formatted_pages logic to account for select_pages to output the correct page numbers | |
if select_pages is not None: | |
# Map aggregated markdown to the selected pages | |
formatted_pages = [ | |
Page(content=content, page=select_pages[i], content_length=len(content)) | |
for i, content in enumerate(aggregated_markdown) | |
] | |
else: | |
# Default behavior when no select_pages is provided | |
formatted_pages = [ | |
Page(content=content, page=i + 1, content_length=len(content)) | |
for i, content in enumerate(aggregated_markdown) | |
] | |
return ZeroxOutput( | |
completion_time=completion_time, | |
file_name=file_name, | |
input_tokens=input_token_count, | |
output_tokens=output_token_count, | |
pages=formatted_pages, | |
) | |
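# Usage sketch (file paths are hypothetical, not part of this module): | 
# | 
#   import asyncio | 
#   from pyzerox import zerox | 
# | 
#   async def main(): | 
#       return await zerox( | 
#           file_path="invoice.pdf", | 
#           model="gpt-4o-mini", | 
#           select_pages=[1, 2], | 
#           output_dir="./output", | 
#       ) | 
# | 
#   result = asyncio.run(main()) | 
#   print(result.pages[0].content) | 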
================================================ | |
FILE: py_zerox/pyzerox/errors/__init__.py | |
================================================ | |
from .exceptions import ( | |
NotAVisionModel, | |
ModelAccessError, | |
PageNumberOutOfBoundError, | |
MissingEnvironmentVariables, | |
ResourceUnreachableException, | |
FileUnavailable, | |
FailedToSaveFile, | |
FailedToProcessFile, | |
) | |
__all__ = [ | |
"NotAVisionModel", | |
"ModelAccessError", | |
"PageNumberOutOfBoundError", | |
"MissingEnvironmentVariables", | |
"ResourceUnreachableException", | |
"FileUnavailable", | |
"FailedToSaveFile", | |
"FailedToProcessFile", | |
] | |
================================================ | |
FILE: py_zerox/pyzerox/errors/base.py | |
================================================ | |
from typing import Optional | |
class CustomException(Exception): | |
""" | |
Base class for custom exceptions | |
""" | |
def __init__( | |
self, | |
message: Optional[str] = None, | |
extra_info: Optional[dict] = None, | |
): | |
self.message = message | |
self.extra_info = extra_info | |
super().__init__(self.message) | |
def __str__(self): | |
if self.extra_info: | |
return f"{self.message} (Extra Info: {self.extra_info})" | |
return self.message | |
================================================ | |
FILE: py_zerox/pyzerox/errors/exceptions.py | |
================================================ | |
from typing import Dict, Optional | |
# Package Imports | |
from ..constants import Messages | |
from .base import CustomException | |
class MissingEnvironmentVariables(CustomException): | |
"""Exception raised when the model provider environment variables, API key(s) are missing. Refer: https://docs.litellm.ai/docs/providers""" | |
def __init__( | |
self, | |
message: str = Messages.MISSING_ENVIRONMENT_VARIABLES, | |
extra_info: Optional[Dict] = None, | |
): | |
super().__init__(message, extra_info) | |
class NotAVisionModel(CustomException): | |
"""Exception raised when the provided model is not a vision model.""" | |
def __init__( | |
self, | |
message: str = Messages.NON_VISION_MODEL, | |
extra_info: Optional[Dict] = None, | |
): | |
super().__init__(message, extra_info) | |
class ModelAccessError(CustomException): | |
"""Exception raised when the provided model can't be accessed due to incorrect credentials/keys or incorrect environent variables setup.""" | |
def __init__( | |
self, | |
message: str = Messages.MODEL_ACCESS_ERROR, | |
extra_info: Optional[Dict] = None, | |
): | |
super().__init__(message, extra_info) | |
class PageNumberOutOfBoundError(CustomException): | |
"""Exception invalid page number(s) provided.""" | |
def __init__( | |
self, | |
message: str = Messages.PAGE_NUMBER_OUT_OF_BOUND_ERROR, | |
extra_info: Optional[Dict] = None, | |
): | |
super().__init__(message, extra_info) | |
class ResourceUnreachableException(CustomException): | |
"""Exception raised when a resource is unreachable.""" | |
def __init__( | |
self, | |
        message: str = Messages.FILE_UNREACHABLE, | 
extra_info: Optional[Dict] = None, | |
): | |
super().__init__(message, extra_info) | |
class FileUnavailable(CustomException): | |
"""Exception raised when a file is unavailable.""" | |
def __init__( | |
self, | |
message: str = Messages.FILE_PATH_MISSING, | |
extra_info: Optional[Dict] = None, | |
): | |
super().__init__(message, extra_info) | |
class FailedToSaveFile(CustomException): | |
"""Exception raised when a file fails to save.""" | |
def __init__( | |
self, | |
message: str = Messages.FAILED_TO_SAVE_FILE, | |
extra_info: Optional[Dict] = None, | |
): | |
super().__init__(message, extra_info) | |
class FailedToProcessFile(CustomException): | |
"""Exception raised when a file fails to process.""" | |
def __init__( | |
self, | |
message: str = Messages.FAILED_TO_PROCESS_IMAGE, | |
extra_info: Optional[Dict] = None, | |
): | |
super().__init__(message, extra_info) | |
================================================ | |
FILE: py_zerox/pyzerox/models/__init__.py | |
================================================ | |
from .modellitellm import litellmmodel | |
from .types import CompletionResponse | |
__all__ = [ | |
"litellmmodel", | |
"CompletionResponse", | |
] | |
================================================ | |
FILE: py_zerox/pyzerox/models/base.py | |
================================================ | |
from abc import ABC, abstractmethod | |
from typing import Dict, Optional, Type, TypeVar, TYPE_CHECKING | |
if TYPE_CHECKING: | |
from ..models import CompletionResponse | |
T = TypeVar("T", bound="BaseModel") | |
class BaseModel(ABC): | |
""" | |
Base class for all models. | |
""" | |
@abstractmethod | |
async def completion( | |
self, | |
) -> "CompletionResponse": | |
raise NotImplementedError("Subclasses must implement this method") | |
@abstractmethod | |
def validate_access( | |
self, | |
) -> None: | |
raise NotImplementedError("Subclasses must implement this method") | |
@abstractmethod | |
def validate_model( | |
self, | |
) -> None: | |
raise NotImplementedError("Subclasses must implement this method") | |
def __init__( | |
self, | |
model: Optional[str] = None, | |
**kwargs, | |
): | |
self.model = model | |
self.kwargs = kwargs | |
## validations | |
# self.validate_model() | |
# self.validate_access() | |
================================================ | |
FILE: py_zerox/pyzerox/models/modellitellm.py | |
================================================ | |
import os | |
import aiohttp | |
import litellm | |
from typing import List, Dict, Any, Optional | |
# Package Imports | |
from .base import BaseModel | |
from .types import CompletionResponse | |
from ..errors import ModelAccessError, NotAVisionModel, MissingEnvironmentVariables | |
from ..constants.messages import Messages | |
from ..constants.prompts import Prompts | |
from ..processor.image import encode_image_to_base64 | |
DEFAULT_SYSTEM_PROMPT = Prompts.DEFAULT_SYSTEM_PROMPT | |
class litellmmodel(BaseModel): | |
## setting the default system prompt | |
_system_prompt = DEFAULT_SYSTEM_PROMPT | |
def __init__( | |
self, | |
model: Optional[str] = None, | |
**kwargs, | |
): | |
""" | |
Initializes the Litellm model interface. | |
:param model: The model to use for generating completions, defaults to "gpt-4o-mini". Refer: https://docs.litellm.ai/docs/providers | |
:type model: str, optional | |
:param kwargs: Additional keyword arguments to pass to self.completion -> litellm.completion. Refer: https://docs.litellm.ai/docs/providers and https://docs.litellm.ai/docs/completion/input | |
""" | |
super().__init__(model=model, **kwargs) | |
## calling custom methods to validate the environment and model | |
self.validate_environment() | |
self.validate_model() | |
self.validate_access() | |
@property | |
def system_prompt(self) -> str: | |
'''Returns the system prompt for the model.''' | |
return self._system_prompt | |
@system_prompt.setter | |
def system_prompt(self, prompt: str) -> None: | |
''' | |
Sets/overrides the system prompt for the model. | |
''' | |
self._system_prompt = prompt | |
## custom method on top of BaseModel | |
def validate_environment(self) -> None: | |
"""Validates the environment variables required for the model.""" | |
env_config = litellm.validate_environment(model=self.model) | |
if not env_config["keys_in_environment"]: | |
raise MissingEnvironmentVariables(extra_info=env_config) | |
def validate_model(self) -> None: | |
'''Validates the model to ensure it is a vision model.''' | |
if not litellm.supports_vision(model=self.model): | |
raise NotAVisionModel(extra_info={"model": self.model}) | |
def validate_access(self) -> None: | |
"""Validates access to the model -> if environment variables are set correctly with correct values.""" | |
if not litellm.check_valid_key(model=self.model,api_key=None): | |
raise ModelAccessError(extra_info={"model": self.model}) | |
async def completion( | |
self, | |
image_path: str, | |
maintain_format: bool, | |
prior_page: str, | |
) -> CompletionResponse: | |
"""LitellM completion for image to markdown conversion. | |
:param image_path: Path to the image file. | |
:type image_path: str | |
:param maintain_format: Whether to maintain the format from the previous page. | |
:type maintain_format: bool | |
:param prior_page: The markdown content of the previous page. | |
:type prior_page: str | |
:return: The markdown content generated by the model. | |
""" | |
messages = await self._prepare_messages( | |
image_path=image_path, | |
maintain_format=maintain_format, | |
prior_page=prior_page, | |
) | |
try: | |
response = await litellm.acompletion(model=self.model, messages=messages, **self.kwargs) | |
## completion response | |
response = CompletionResponse( | |
content=response["choices"][0]["message"]["content"], | |
input_tokens=response["usage"]["prompt_tokens"], | |
output_tokens=response["usage"]["completion_tokens"], | |
) | |
return response | |
except Exception as err: | |
raise Exception(Messages.COMPLETION_ERROR.format(err)) | |
async def _prepare_messages( | |
self, | |
image_path: str, | |
maintain_format: bool, | |
prior_page: str, | |
) -> List[Dict[str, Any]]: | |
"""Prepares the messages to send to the LiteLLM Completion API. | |
:param image_path: Path to the image file. | |
:type image_path: str | |
:param maintain_format: Whether to maintain the format from the previous page. | |
:type maintain_format: bool | |
:param prior_page: The markdown content of the previous page. | |
:type prior_page: str | |
""" | |
# Default system message | |
messages: List[Dict[str, Any]] = [ | |
{ | |
"role": "system", | |
"content": self._system_prompt, | |
}, | |
] | |
# If content has already been generated, add it to context. | |
# This helps maintain the same format across pages. | |
if maintain_format and prior_page: | |
messages.append( | |
{ | |
"role": "system", | |
"content": f'Markdown must maintain consistent formatting with the following page: \n\n """{prior_page}"""', | |
}, | |
) | |
# Add Image to request | |
base64_image = await encode_image_to_base64(image_path) | |
messages.append( | |
{ | |
"role": "user", | |
"content": [ | |
{ | |
"type": "image_url", | |
"image_url": {"url": f"data:image/png;base64,{base64_image}"}, | |
}, | |
], | |
} | |
) | |
return messages | |
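# Message-shape sketch: with maintain_format=True and a non-empty prior_page, | 
# _prepare_messages returns: | 
#   [ | 
#     {"role": "system", "content": <system prompt>}, | 
#     {"role": "system", "content": 'Markdown must maintain ... """<prior page>"""'}, | 
#     {"role": "user", "content": [{"type": "image_url", | 
#       "image_url": {"url": "data:image/png;base64,..."}}]}, | 
#   ] | 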
================================================ | |
FILE: py_zerox/pyzerox/models/types.py | |
================================================ | |
from dataclasses import dataclass | |
@dataclass | |
class CompletionResponse: | |
""" | |
A class representing the response of a completion. | |
""" | |
content: str | |
input_tokens: int | |
output_tokens: int | |
================================================ | |
FILE: py_zerox/pyzerox/processor/__init__.py | |
================================================ | |
from .image import save_image, encode_image_to_base64 | |
from .pdf import ( | |
convert_pdf_to_images, | |
process_page, | |
process_pages_in_batches, | |
) | |
from .text import format_markdown | |
from .utils import download_file, create_selected_pages_pdf | |
__all__ = [ | |
"save_image", | |
"encode_image_to_base64", | |
"convert_pdf_to_images", | |
"format_markdown", | |
"download_file", | |
"process_page", | |
"process_pages_in_batches", | |
"create_selected_pages_pdf", | |
] | |
================================================ | |
FILE: py_zerox/pyzerox/processor/image.py | |
================================================ | |
import aiofiles | |
import base64 | |
import io | |
async def encode_image_to_base64(image_path: str) -> str: | |
"""Encode an image to base64 asynchronously.""" | |
async with aiofiles.open(image_path, "rb") as image_file: | |
image_data = await image_file.read() | |
return base64.b64encode(image_data).decode("utf-8") | |
async def save_image(image, image_path: str): | |
"""Save an image to a file asynchronously.""" | |
# Convert PIL Image to BytesIO object | |
with io.BytesIO() as buffer: | |
image.save(buffer, format=image.format) # Save the image to the BytesIO object | |
image_data = buffer.getvalue() # Get the image data from the BytesIO object | |
# Write image data to file asynchronously | |
async with aiofiles.open(image_path, "wb") as f: | |
await f.write(image_data) | |
================================================ | |
FILE: py_zerox/pyzerox/processor/pdf.py | |
================================================ | |
import logging | |
import os | |
import asyncio | |
from typing import List, Optional, Tuple | |
from pdf2image import convert_from_path | |
# Package Imports | |
from .image import save_image | |
from .text import format_markdown | |
from ..constants import PDFConversionDefaultOptions, Messages | |
from ..models import litellmmodel | |
async def convert_pdf_to_images(image_density: int, image_height: tuple[Optional[int], int], local_path: str, temp_dir: str) -> List[str]: | |
"""Converts a PDF file to a series of images in the temp_dir. Returns a list of image paths in page order.""" | |
options = { | |
"pdf_path": local_path, | |
"output_folder": temp_dir, | |
"dpi": image_density, | |
"fmt": PDFConversionDefaultOptions.FORMAT, | |
"size": image_height, | |
"thread_count": PDFConversionDefaultOptions.THREAD_COUNT, | |
"use_pdftocairo": PDFConversionDefaultOptions.USE_PDFTOCAIRO, | |
"paths_only": True, | |
} | |
try: | |
image_paths = await asyncio.to_thread( | |
convert_from_path, **options | |
) | |
return image_paths | |
    except Exception as err: | 
        logging.error(f"Error converting PDF to images: {err}") | 
        raise | 
async def process_page( | |
image: str, | |
model: litellmmodel, | |
temp_directory: str = "", | |
input_token_count: int = 0, | |
output_token_count: int = 0, | |
prior_page: str = "", | |
semaphore: Optional[asyncio.Semaphore] = None, | |
) -> Tuple[str, int, int, str]: | |
"""Process a single page of a PDF""" | |
# If semaphore is provided, acquire it before processing the page | |
if semaphore: | |
async with semaphore: | |
return await process_page( | |
image, | |
model, | |
temp_directory, | |
input_token_count, | |
output_token_count, | |
prior_page, | |
) | |
image_path = os.path.join(temp_directory, image) | |
# Get the completion from LiteLLM | |
try: | |
completion = await model.completion( | |
image_path=image_path, | |
maintain_format=True, | |
prior_page=prior_page, | |
) | |
formatted_markdown = format_markdown(completion.content) | |
input_token_count += completion.input_tokens | |
output_token_count += completion.output_tokens | |
prior_page = formatted_markdown | |
return formatted_markdown, input_token_count, output_token_count, prior_page | |
except Exception as error: | |
logging.error(f"{Messages.FAILED_TO_PROCESS_IMAGE} Error:{error}") | |
return "", input_token_count, output_token_count, "" | |
async def process_pages_in_batches( | |
images: List[str], | |
concurrency: int, | |
model: litellmmodel, | |
temp_directory: str = "", | |
input_token_count: int = 0, | |
output_token_count: int = 0, | |
prior_page: str = "", | |
): | |
# Create a semaphore to limit the number of concurrent tasks | |
semaphore = asyncio.Semaphore(concurrency) | |
# Process each page in parallel | |
tasks = [ | |
process_page( | |
image, | |
model, | |
temp_directory, | |
input_token_count, | |
output_token_count, | |
prior_page, | |
semaphore, | |
) | |
for image in images | |
] | |
# Wait for all tasks to complete | |
return await asyncio.gather(*tasks) | |
================================================ | |
FILE: py_zerox/pyzerox/processor/text.py | |
================================================ | |
import re | |
# Package imports | |
from ..constants.patterns import Patterns | |
def format_markdown(text: str) -> str: | |
"""Format markdown text by removing markdown and code blocks""" | |
formatted_markdown = re.sub(Patterns.MATCH_MARKDOWN_BLOCKS, r"\1", text) | |
formatted_markdown = re.sub(Patterns.MATCH_CODE_BLOCKS, r"\1", formatted_markdown) | |
return formatted_markdown | |
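# Examples: | 
#   format_markdown("```markdown\n# Title\n```")  ->  "# Title" | 
#   format_markdown("plain text")                 ->  "plain text" (unchanged) | 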
================================================ | |
FILE: py_zerox/pyzerox/processor/utils.py | |
================================================ | |
import os | |
import re | |
from typing import Optional, Union, Iterable | |
from urllib.parse import urlparse | |
import aiofiles | |
import aiohttp | |
from PyPDF2 import PdfReader, PdfWriter | |
from ..constants.messages import Messages | |
# Package Imports | |
from ..errors.exceptions import ResourceUnreachableException, PageNumberOutOfBoundError | |
async def download_file( | |
file_path: str, | |
temp_dir: str, | |
) -> Optional[str]: | |
"""Downloads a file from a URL or local path to a temporary directory.""" | |
local_pdf_path = os.path.join(temp_dir, os.path.basename(file_path)) | |
if is_valid_url(file_path): | |
async with aiohttp.ClientSession() as session: | |
async with session.get(file_path) as response: | |
if response.status != 200: | |
raise ResourceUnreachableException() | |
async with aiofiles.open(local_pdf_path, "wb") as f: | |
await f.write(await response.read()) | |
else: | |
async with aiofiles.open(file_path, "rb") as src, aiofiles.open( | |
local_pdf_path, "wb" | |
) as dst: | |
await dst.write(await src.read()) | |
return local_pdf_path | |
def is_valid_url(string: str) -> bool: | |
"""Checks if a string is a valid URL.""" | |
try: | |
result = urlparse(string) | |
return all([result.scheme, result.netloc]) and result.scheme in [ | |
"http", | |
"https", | |
] | |
except ValueError: | |
return False | |
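# Examples: | 
#   is_valid_url("https://example.com/doc.pdf")  ->  True | 
#   is_valid_url("/tmp/doc.pdf")                 ->  False (no scheme/netloc) | 
#   is_valid_url("ftp://host/doc.pdf")           ->  False (only http/https) | 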
def create_selected_pages_pdf(original_pdf_path: str, select_pages: Union[int, Iterable[int]], | |
save_directory: str, suffix: str = "_selected_pages", | |
sorted_pages: bool = True) -> str: | |
""" | |
Creates a new PDF with only the selected pages. | |
:param original_pdf_path: Path to the original PDF file. | |
:type original_pdf_path: str | |
:param select_pages: A single page number or an iterable of page numbers (1-indexed). | |
:type select_pages: int or Iterable[int] | |
:param save_directory: The directory to store the new PDF. | |
:type save_directory: str | |
:param suffix: The suffix to add to the new PDF file name, defaults to "_selected_pages". | |
:type suffix: str, optional | |
:param sorted_pages: Whether to sort the selected pages, defaults to True. | |
:type sorted_pages: bool, optional | |
    :return: Path to the new PDF file. | 
""" | |
file_name = os.path.splitext(os.path.basename(original_pdf_path))[0] | |
    # Path where the new PDF with only the selected pages will be saved | 
selected_pages_pdf_path = os.path.join(save_directory, f"{file_name}{suffix}.pdf") | |
# Ensure select_pages is iterable, if not, convert to list | |
if isinstance(select_pages, int): | |
select_pages = [select_pages] | |
if sorted_pages: | |
# Sort the pages for consistency | |
select_pages = sorted(list(select_pages)) | |
with open(original_pdf_path, "rb") as orig_pdf, open(selected_pages_pdf_path, "wb") as new_pdf: | |
# Read the original PDF | |
reader = PdfReader(stream=orig_pdf) | |
total_pages = len(reader.pages) | |
# Validate page numbers | |
invalid_page_numbers = [] | |
for page in select_pages: | |
if page < 1 or page > total_pages: | |
invalid_page_numbers.append(page) | |
## raise error if invalid page numbers | |
if invalid_page_numbers: | |
raise PageNumberOutOfBoundError(extra_info={"input_pdf_num_pages":total_pages, | |
"select_pages": select_pages, | |
"invalid_page_numbers": invalid_page_numbers}) | |
# Create a new PDF writer | |
writer = PdfWriter(fileobj=new_pdf) | |
# Add only the selected pages | |
for page_number in select_pages: | |
writer.add_page(reader.pages[page_number - 1]) | |
writer.write(stream=new_pdf) | |
return selected_pages_pdf_path | |
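# Usage sketch (paths are hypothetical): | 
#   subset = create_selected_pages_pdf( | 
#       original_pdf_path="report.pdf", | 
#       select_pages=[2, 5], | 
#       save_directory="/tmp", | 
#   ) | 
#   # -> "/tmp/report_selected_pages.pdf" containing pages 2 and 5 (1-indexed) | 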
================================================ | |
FILE: py_zerox/scripts/__init__.py | |
================================================ | |
================================================ | |
FILE: py_zerox/scripts/pre_install.py | |
================================================ | |
# pre_install.py | |
import subprocess | |
import sys | |
import platform | |
def run_command(command): | |
try: | |
result = subprocess.run(command, shell=True, text=True, capture_output=True) | |
result.check_returncode() | |
return result.stdout | |
except subprocess.CalledProcessError as e: | |
raise RuntimeError(e.stderr.strip()) | |
def install_package(command, package_name): | |
try: | |
output = run_command(command) | |
print(output) | |
return output | |
except RuntimeError as e: | |
raise RuntimeError(f"Failed to install {package_name}: {e}") | |
def check_and_install(): | |
try: | |
# Check and install Poppler | |
try: | |
run_command("pdftoppm -h") | |
except RuntimeError: | |
if platform.system() == "Darwin": # macOS | |
install_package("brew install poppler", "Poppler") | |
elif platform.system() == "Linux": # Linux | |
install_package( | |
"sudo apt-get update && sudo apt-get install -y poppler-utils", | |
"Poppler", | |
) | |
else: | |
raise RuntimeError( | |
"Please install Poppler manually from https://poppler.freedesktop.org/" | |
) | |
except RuntimeError as err: | |
print(f"Error during installation: {err}", file=sys.stderr) | |
sys.exit(1) | |
if __name__ == "__main__": | |
check_and_install() | |
================================================ | |
FILE: py_zerox/tests/test_noop.py | |
================================================ | |
def test_noop(): | |
assert 1 == 1 | |
================================================ | |
FILE: shared/systemPrompt.txt | |
================================================ | |
Convert the following document to markdown. | |
Return only the markdown with no explanation text. Do not include delimiters like '''markdown or '''. | |
RULES: | |
- You must include all information on the page. Do not exclude headers, footers, or subtext. | |
- Charts & infographics must be interpreted to a markdown format. Prefer table format when applicable. | |
- Images without text must be replaced with [Description of image](image.png) | |
- For tables with double headers, prefer adding a new column. | |
- Logos should be wrapped in square brackets. Ex: [Coca-Cola] | |
- Prefer using ☐ and ☑ for check boxes. | |
================================================ | |
FILE: shared/test.json | |
================================================ | |
[ | |
{ | |
"file": "0001.png", | |
"expectedKeywords": [ | |
[ | |
"Department of the Treasury", | |
"Internal Revenue Service", | |
"U.S. Individual Income Tax Return", | |
"2023", | |
"OMB No. 1545-0074", | |
"IRS Use Only", | |
"Do not write or staple in this space.", | |
"For the year", | |
"Jan. 1", | |
"Dec. 31, 2023", | |
"other tax year beginning", | |
"See separate instructions", | |
"Your first name and middle initial", | |
"JOSEPH R", | |
"Last name", | |
"BIDEN JR", | |
"Your social security number", | |
"If joint return, spouse's first name and middle initial", | |
"JILL T", | |
"Spouse's social security number", | |
"Home address (number and street)", | |
"If you have a P.O. box, see instructions", | |
"Apt. no.", | |
"City, town, or post office", | |
"If you have a foreign address, also complete spaces below", | |
"Foreign country name", | |
"Foreign province/state/county", | |
"Foreign postal code", | |
"Presidential Election Campaign", | |
"Check here if you, or your spouse if filing jointly, want $3 to go to this fund", | |
"Checking a box below will not change your tax or refund", | |
"Check only one box", | |
"Single", | |
"Married filing jointly (even if only one had income)", | |
"Married filing separately (MFS)", | |
"Head of Household (HoH)", | |
"Qualifying surviving spouse (QSS)", | |
"If you checked the MFS box, enter the name of your spouse", | |
"If you checked the HOH or QSS box, enter the child's name if the qualifying person is a child but not your dependent", | |
"Digital Assets", | |
"At any time during 2023", | |
"receive (as a reward, award, or payment for property or services)", | |
"sell, exchange, or otherwise dispose of a digital asset", | |
"or a financial interest in a digital asset", | |
"See instructions", | |
"Standard Deduction", | |
"Someone can claim", | |
"You as a dependent", | |
"Your spouse as a dependent", | |
"Spouse itemizes on a separate return or you were a dual-status alien", | |
"Age/Blindness", | |
"Were born before January 2, 1959", | |
"Are blind", | |
"Is blind", | |
"If more than four dependents", | |
"If more than four dependents, see Instr. and check here", | |
"First name Last name", | |
"Social security number", | |
"Relationship to you", | |
"Check the box if qualifies for (see instr.)", | |
"Child tax credit", | |
"Credit for other dependents", | |
"Total amount from Form(s) W-2, box 1 (see instructions)", | |
"STMT 1", | |
"485,985" | |
] | |
] | |
}, | |
{ | |
"file": "0002.pdf", | |
"expectedKeywords": [ | |
[ | |
"Deloitte", | |
"Quality System Audit for BioTech Innovations (Pty) Ltd", | |
"02 October 2024", | |
"67 River Rd", | |
"Kensington", | |
"Johannesburg", | |
"Gauteng", | |
"2094", | |
"South Africa", | |
"06h30", | |
"Contact Person", | |
"Kathy Margaret", | |
"+14 22 045 4952", | |
"Opening Meeting Agenda", | |
"Introductions", | |
"Review of audit agenda", | |
"Confirmation of availability for required persons", | |
"Anna Pojawis", | |
"Tyler Maran", | |
"Kathy Margaret", | |
"Mark Ding", | |
"CTO", | |
"CEO", | |
"Associate", | |
"Eng", | |
"[email protected]", | |
"[email protected]", | |
"[email protected]", | |
"[email protected]", | |
"QAC Auditor", | |
"David Thompson", | |
"Lead Quality Auditor", | |
"NTA Services on behalf of BioTech Innovations", | |
"Biopharmaceuticals", | |
"Page 1 of 7", | |
"DELOITTE QUALITY ASSURANCE CONSULTANTS, LLC", | |
"450 Oceanview Drive", | |
"Suite 200", | |
"Santa Monica", | |
"CA 90405", | |
"(800) 555-1234", | |
"(310) 555-7890", | |
"(310) 555-4567", | |
"www.qaconsultants.com", | |
"[email protected]" | |
] | |
] | |
}, | |
{ | |
"file": "0003.pdf", | |
"expectedKeywords": [ | |
[ | |
"Short Time Overload", | |
"Insulation Resistance", | |
"Endurance", | |
"Damp Heat with Load", | |
"Solderability", | |
"Dielectric Withstanding Voltage", | |
"Temperature Coefficient", | |
"Pulse Overload", | |
"Resistance To Solvent", | |
"Terminal Strength", | |
"Carbon Film Leader Resistor", | |
"Environmental Characteristics", | |
"Rated Continuous Working Voltage", | |
"Storage Temperature", | |
"JIS-C-5201-1 5.5", | |
"JIS-C-5201-1 5.6", | |
"JIS-C-5201-1 7.10", | |
"JIS-C-5201-1 7.9", | |
"JIS-C-5201-1 6.5", | |
"JIS-C-5201-1 5.7", | |
"Resistance value at room temperature", | |
"Temperature+100°C", | |
"JIS-C-5201-1 5.8", | |
"JIS-C-5201-1 6.9", | |
"Direct Load for 10 seconds", | |
"In the direction off the terminal leads", | |
"working voltage for 1000", | |
"overload voltage for 5 seconds", | |
"10000 cycles with 1 second", | |
"and 25 seconds", | |
"Trichroethane", | |
"70±2°C", | |
"100V", | |
"DC", | |
"40±2°C", | |
"245±5°C", | |
"RCWV*2.5", | |
"4 times RCWV for 10000 cycles", | |
">1000MΩ", | |
"100KΩ±3%", | |
"100KΩ±5%", | |
"90% min. Coverage", | |
"350ppm", | |
"500ppm", | |
"100KΩ", | |
"700ppm", | |
"1500ppm", | |
"No deterioration of coatings and markings", | |
"Tensile: 2.5 kg" | |
] | |
] | |
}, | |
{ | |
"file": "0004.pdf", | |
"expectedKeywords": [ | |
[ | |
"Improving", | |
"throught", | |
"Han", | |
"Abstract", | |
"Howard", | |
"community", | |
"530%", | |
"language", | |
"Wang", | |
"environments", | |
"confronted", | |
"technique", | |
"same", | |
"accurate", | |
"Methodology", | |
"dilution", | |
"illustrated", | |
"continually", | |
"stems" | |
], | |
[ | |
"Gemma", | |
"repeated", | |
"pronounced", | |
"designer", | |
"Llama", | |
"7.68±2.41s", | |
"significant", | |
"reiteration", | |
"22.50±11.19s", | |
"generate", | |
"reasoning", | |
"question", | |
"larger", | |
"2.0%", | |
"context", | |
"deviation", | |
"3.8B", | |
"ratio", | |
"unnecessary", | |
"Repetition", | |
"Experimentation", | |
"Conclusion", | |
"Transformers", | |
"PyTorch", | |
"24GB", | |
"Abdin", | |
"530%", | |
"Xu", | |
"github" | |
], | |
[ | |
"References", | |
"Jacobs", | |
"Nguyen", | |
"2404.14219", | |
"Abhimanyu", | |
"Ahmad", | |
"Yang", | |
"herd", | |
"Team", | |
"Riviere", | |
"Mesnard", | |
"Bobak", | |
"practical", | |
"Lei", | |
"186345", | |
"Frontiers", | |
"Jingsen", | |
"Jiakai", | |
"autonomous", | |
"compromising", | |
"quality", | |
"ICLR", | |
"Saizheng", | |
"HotpotQA", | |
"explainable", | |
"Natural", | |
"Representations", | |
"2023", | |
"Narasimhan", | |
"Eleventh" | |
] | |
] | |
}, | |
{ | |
"file": "0005.png", | |
"expectedKeywords": [ | |
[ | |
"Quest", | |
"Diagnostics", | |
"Maternal", | |
"Insurance", | |
"DEPENDENT", | |
"Fasting", | |
"Ultrasound", | |
"QuestDiagnostics.com/MLCP", | |
"Medicaid", | |
"hyperGly-hCG", | |
"Sequential", | |
"Stepwise", | |
"SST", | |
"MSAFP", | |
"30294", | |
"Quad", | |
"Penta", | |
"LMP", | |
"Ethnic", | |
"fetuses", | |
"insulin-dependent", | |
"Trisomy", | |
"Down", | |
"Syndrome", | |
"Donor", | |
"cigarettes", | |
"Nuchal", | |
"Ultrasonographer", | |
"QD20330K" | |
] | |
] | |
}, | |
{ | |
"file": "0006.png", | |
"expectedKeywords": [ | |
[ | |
"T-Mobile", | |
"Monthly", | |
"100469352", | |
"Recurring", | |
"****9541", | |
"MasterCard", | |
"Equipment", | |
"Installment", | |
"Plan", | |
"(XXX)-49X-5XXX", | |
"$00.00", | |
"KickBack", | |
"2GB", | |
"AutoPay", | |
"my.t-mobile.com", | |
"t-mobile.com/pay", | |
"*PAY (*XXX)", | |
"1004693520967854630205189400463125091" | |
] | |
] | |
}, | |
{ | |
"file": "0007.png", | |
"expectedKeywords": [ | |
[ | |
"LABCORP", | |
"3106932528", | |
"Farzam", | |
"044-494-4741-0", | |
"(310) 849-7991", | |
"GENESEE", | |
"1619295474", | |
"Fasting", | |
"rflx", | |
"25-Hydroxy", | |
"x10E3/uL", | |
"Neutrophils", | |
"Lymphs (Absolute)", | |
"Baso (Absolute)", | |
"Not Estab.", | |
"Creatinine", | |
"Africn", | |
"mL/min/1.73", | |
"02/16/19 1809 ET", | |
"800-859-6046" | |
] | |
] | |
}, | |
{ | |
"file": "0008.png", | |
"expectedKeywords": [ | |
[ | |
"LABCORP LCLS BULK", | |
"3106932528", | |
"2 of 4", | |
"Farzam", | |
"Potassium", | |
"mmol/L", | |
"A/G Ratio", | |
"AST (SGOT)", | |
"High", | |
"Abnormal", | |
"UA/M w/rflx", | |
"IU/L", | |
"1.005 - 1.030", | |
"Negative/Trace", | |
"Semi-Qn", | |
"None seen/Few", | |
"Result 1", | |
"02/16/19 1809 ET", | |
"FINAL REPORT", | |
"800-859-6046", | |
"© 1995-2019" | |
] | |
] | |
}, | |
{ | |
"file": "0009.png", | |
"expectedKeywords": [ | |
[ | |
"3106932528", | |
"Page 3 of 4", | |
"LabCorp", | |
"04/11/1992", | |
"044-494-4741-0", | |
"02/13/2019 1000 Local", | |
"50,000-100,000", | |
"Triglycerides", | |
"Hemoglobin A1c", | |
"4.8 - 5.6", | |
"Glycemic", | |
"uIU/mL", | |
"25-Hydroxy", | |
"Low", | |
"Institute", | |
"25-OH", | |
"Bischoff-Ferrari", | |
":1911-30", | |
"Please Note:", | |
"Therapeutic", | |
"02/16/19 1809 ET", | |
"confidential", | |
"800-859-6046", | |
"© 1995-2019" | |
] | |
] | |
}, | |
{ | |
"file": "0010.png", | |
"expectedKeywords": [ | |
[ | |
"02/16/2019 6:09:27 PM", | |
"3106932528", | |
"Farzam", | |
"WEISER, CHERYL", | |
"60070006294", | |
"REFERENCE INTERVAL", | |
"Ferritin, Serum", | |
"ng/mL", | |
"15 - 150", | |
"13112 Evening Creek", | |
"92128-4108", | |
"Dir: Jenny Galloway", | |
"800-859-6046", | |
"858-668-3700", | |
"02/16/19 1809 ET", | |
"4 of 4", | |
"confidential", | |
"800-859-6046", | |
"America® Holdings" | |
] | |
] | |
}, | |
{ | |
"file": "0011.png", | |
"expectedKeywords": [ | |
[ | |
"Bill of Lading", | |
"21099992723", | |
"123 Pick Up Street", | |
"Business without Dock or Forklift", | |
"800-866-4870", | |
"15 minutes", | |
"[email protected]", | |
"[email protected]", | |
"Pallet", | |
"48x40x48 in", | |
"300 Lbs", | |
"declared value", | |
"per", | |
"TOTAL WEIGHT", | |
"14706(c)(1)(A)", | |
"certify", | |
"Shipper Signature", | |
"Freight Loaded", | |
"Carrier Signature", | |
"Pickup Date", | |
"LTL Only", | |
"123 Delivery Street", | |
"Vancouver", | |
"BCV5K 0A4", | |
"Canada", | |
"9:00AM", | |
"5:00PM", | |
"Protect From Freeze", | |
"EXAMPLE CARRIER", | |
"Seal Number", | |
"Freight Charges Term", | |
"3rd Party", | |
"BOL", | |
"POD" | |
] | |
] | |
}, | |
{ | |
"file": "0012.png", | |
"expectedKeywords": [ | |
[ | |
"UniformSoftware", | |
"transport business name", | |
"ABN", | |
"xx xxx xxx xxx", | |
"Tel", | |
"Cust A/c", | |
"Due Date", | |
"Inv. Date", | |
"Invoice #", | |
"1/1/2017", | |
"ref#1", | |
"MASCOT", | |
"SYDNENHAM", | |
"4PCS", | |
"0.500", | |
"187", | |
"$11.00", | |
"$0.88", | |
"$11.00", | |
"EAST BOTANY", | |
"PORT BOTANY", | |
"4.070", | |
"6459", | |
"1659", | |
"5/15/2017", | |
"44PCS", | |
"Please remit payment to", | |
"bank name", | |
"xxx-xxx", | |
"TOTAL", | |
"155.00", | |
"15.50", | |
"182.90" | |
] | |
] | |
}, | |
{ | |
"file": "0013.pdf", | |
"expectedKeywords": [ | |
[ | |
"MSFT", | |
"AAPL", | |
"10.6%", | |
"Portfolio Value", | |
"$545k", | |
"Allocation", | |
"82%/18% Stocks/Bonds", | |
"Cash", | |
"5.75", | |
"US Stocks", | |
"78.65", | |
"Non-US Stocks", | |
"3.45", | |
"Bonds", | |
"12.12", | |
"Other/Not Clsfd", | |
"0.03", | |
"Portfolio Construction", | |
"78.9%", | |
"2.4%", | |
"18.1%", | |
"0.5%", | |
"Cash", | |
"ETFs", | |
"Mutual Funds", | |
"Individual Stocks", | |
"Total Stock Holdings", | |
"634", | |
"Total Bond Holdings", | |
"9,748" | |
], | |
[ | |
"Your Equity Allocation", | |
"82%", | |
"Individual Stocks 96%", | |
"ETFs 4%", | |
"25% - 35%", | |
"600", | |
"Diversification Analysis", | |
"Owning multiple funds does not always produce the anticipated diversification benefits.", | |
"Equity Style", | |
"Large", | |
"Mid", | |
"Small", | |
"20", | |
"13", | |
"32", | |
"15", | |
"Sectors", | |
"20.62", | |
"28.73", | |
"Consumer Def", | |
"Bmark%", | |
"1.45" | |
], | |
[ | |
"Tax Transition/Overlap Analysis", | |
"$138,494", | |
"$19,943", | |
"PFIZER INC", | |
"($453)", | |
"2026", | |
"RIO TINTO PLC SPONSORED ADR", | |
"ISHARES 5-10 YEAR IG CORP BOND ETF", | |
"NXST", | |
"$1,198.95", | |
"$11.631.26", | |
"PROCTER & GAMBLE CO", | |
"PG", | |
"$283", | |
"($928)", | |
"1/31/2023" | |
] | |
] | |
}, | |
{ | |
"file": "0014.png", | |
"expectedKeywords": [ | |
[ | |
"Last quote update:", | |
"2013-11-30 05:20:51", | |
"54,659", | |
"(9.2%)", | |
"14.41%", | |
"Profit/Loss", | |
"66,087", | |
"0.36%", | |
"0.17%", | |
"60,105", | |
"If [Target-Actual]", | |
"Emerging", | |
"Real Estate", | |
"Healthcare", | |
"Technology", | |
"8%", | |
"Defensive", | |
"Sensitivity", | |
"DELL", | |
"Years-Current", | |
"161,200", | |
"-6", | |
"627", | |
"0.43%", | |
"29,455", | |
"138,898", | |
"53,601", | |
"13", | |
"0.08%", | |
"0.17%", | |
"13.10%", | |
"MTD", | |
"US Broad", | |
"CAD", | |
"XIU.TO" | |
] | |
] | |
}, | |
{ | |
"file": "0015.png", | |
"expectedKeywords": [ | |
[ | |
"通行费", | |
"湖北增值税电子普通发票", | |
"04201700112", | |
"499099660821", | |
"12636666022332927910", | |
"校 验 码", | |
"开票日期", | |
"2018年03月23日", | |
"密 码 区", | |
"030243319>1*+9239+></<59+3-", | |
"786-646/16<248>/-/029029>746", | |
"7>44<97929379677-955315>+-", | |
"6/53<13+8*010369194565>-5/04", | |
"武汉经济技术开发区车城大道7号 84289348", | |
"经营租货通行费", | |
"鄂AHG248", | |
"通行日期起", | |
"20180212", | |
"通行日期止", | |
"20180212", | |
"¥286.23", | |
"税率", | |
"3%", | |
"¥294.84", | |
"湖北随岳南高速公路有限公司", | |
"91420000753416406R", | |
"发票专用章", | |
"收 款 人", | |
"龙梦媛", | |
"复 核:", | |
"陈煜", | |
"贰佰玖拾肆元捌角贰分", | |
"武汉市经济开发区17C1地块东和中心B栋1601号027-83458755" | |
] | |
] | |
}, | |
{ | |
"file": "0016.pdf", | |
"expectedKeywords": [ | |
[ | |
"Valori Nutrizionali", | |
"Nutrition Facts", | |
"Nährwerte", | |
"Valores Nutricionales", | |
"Energia/Energy/Energie/Valor energético", | |
"Kj 2577/Kcal 616", | |
"di cui acidi grassi saturi/of which saturates/davon gesättigte Fettsäuren/de las cuales saturadas", | |
"8.3g", | |
"di cui zuccheri/of which sugars/davon Zucker/de los cuales azúcar", | |
"5.1 g", | |
"Proteine/Protein/Eiweiß/Proteínas", | |
"IT Ingredienti", | |
"100%", | |
"PEANUT BUTTER", | |
"BØWLPRØS", | |
"latte e sesamo", | |
"8 054145 812068", | |
"[email protected]", | |
"www.bowlpros.com", | |
"Da consumarsi preferibilmente entro il/Best before/Mindestens haltbar bis/Consumir preferentemente antes del" | |
] | |
] | |
}, | |
{ | |
"file": "0017.pdf", | |
"expectedKeywords": [ | |
[ | |
"御 見 積 書", | |
"⻑崎北郵便局", | |
"書類番号", | |
"202410-01439[01]", | |
"下記の通り御見積申し上げます", | |
"何卒御用命下さる様お願い申し上げます", | |
"発行日", | |
"⻑崎北郵便局(親時計更新)", | |
"60日間", | |
"90日間", | |
"荷造運賃", | |
"812-0026", | |
"福岡県福岡市博多区上川端町8-18", | |
"092-281-0020", | |
"092-281-0112", | |
"取付工事費", | |
"御見積金額", | |
"712,000", | |
"品目コード", | |
"親時計4回線壁掛型 タイマー・チャイム", | |
"単価", | |
"金額", | |
"※設置工事費含む", | |
"※キャンペーン期間中の為設置工事費無料です。", | |
"KM-82TC-4P", | |
"標準価格計", | |
"割引合計額", | |
"御了承ください", | |
"特注品の内容に関しては営業担当者へ確認ください" | |
] | |
] | |
}, | |
{ | |
"file": "0018.pdf", | |
"expectedKeywords": [ | |
[ | |
"Tesla Inc.", | |
"10-Q", | |
"Texas", | |
"I.R.S. Employer Identification No.", | |
"91-2197729", | |
"Zip Code", | |
"78725", | |
"Registrant’s telephone number, including area code", | |
"(512) 516-8177", | |
"Title of each class", | |
"Common stock", | |
"Name of each exchange on which registered", | |
"The Nasdaq Global Select Market" | |
], | |
[ | |
"Balance Sheets", | |
"Current assets", | |
"September 30, 2024", | |
"December 31, 2023", | |
"Cash and cash equivalents", | |
"18,111", | |
"16398", | |
"Short-term investments", | |
"Accounts receivable", | |
"Inventory", | |
"Total assets", | |
"119,852", | |
"106,618", | |
"Liabilities", | |
"Current liabilities", | |
"Accounts payable", | |
"14,654", | |
"14,431", | |
"Total liabilities", | |
"49,142", | |
"43,009", | |
"Equity", | |
"$0.001 par value; 100 shares authorized; no shares issued and outstanding", | |
"Total liabilities and equity", | |
"119,852", | |
"106,618" | |
], | |
[ | |
"Consolidated Statements of Operations", | |
"(in millions, except per share data)", | |
"(unaudited)", | |
"Revenues", | |
"Three Months Ended September 30,", | |
"Net income", | |
"2,183", | |
"1,878", | |
"4,821", | |
"7,031", | |
"Net income attributable to common stockholders", | |
"2,167", | |
"1,853", | |
"4,774", | |
"7,069" | |
], | |
[ | |
"Consolidated Statements of Comprehensive Income", | |
"Comprehensive income attributable to common stockholders", | |
"2,620", | |
"1,571", | |
"4,903", | |
"6,738", | |
"(289)", | |
"(343)" | |
] | |
] | |
}, | |
{ | |
"file": "0019.png", | |
"expectedKeywords": [ | |
[ | |
"Walmart", | |
"win $1000", | |
"7N5N1V1XCQDQ", | |
"317-851-1102", | |
"Mgr", | |
"JAMIE BROOKSHIRE", | |
"882 S. STATE ROAD 136", | |
"GREENWOOD", | |
"IN 46143", | |
"05483", | |
"TATER TOTS", | |
"001312000025", | |
"2.96", | |
"SNACK BARS", | |
"002190848816", | |
"4.98", | |
"VOIDED ENTRY", | |
"HRI CL CHS", | |
"GALE", | |
"000000000003K", | |
"32.00", | |
"BAGELS", | |
"001376402801", | |
"4.66", | |
"TOTAL", | |
"144.02", | |
"CASH", | |
"150.02", | |
"CHANGE DUE", | |
"6.00", | |
"ITEMS SOLD 26", | |
"0783 5080 4072 3416 2496 6", | |
"04/27/19", | |
"12:59:46", | |
"Scan with Walmart app to save receipt" | |
] | |
] | |
}, | |
{ | |
"file": "0020.png", | |
"expectedKeywords": [ | |
[ | |
"ZESTADO EXPRESS", | |
"ABN", | |
"16 112 221 123", | |
"Bill To", | |
"Custom Board Makeers", | |
"Administration Centre", | |
"12 Salvage Road", | |
"Acaacia Ridge BC QLD 4110", | |
"Australia", | |
"Issue Date", | |
"8th December 2021", | |
"Account No.", | |
"101234", | |
"Invoice Amount", | |
"$5,270.00 AUD", | |
"Pay By", | |
"15th December 2021", | |
"Waybill No.", | |
"012345A", | |
"GC12345", | |
"[email protected]", | |
"Surfboards", | |
"1,010 kg", | |
"Kahului Maui Hawaii Port", | |
"Maui Surf Shop", | |
"13' x 18\" x 3\"", | |
"Eco High Performance Mini Mals", | |
"$1,980.00", | |
"4321-A1 XL Custom Supreme Light Stand Up", | |
"13' x 18\" x 3\"", | |
"$1,035.00", | |
"$1,210.00" | |
] | |
] | |
}, | |
{ | |
"file": "0021.png", | |
"expectedKeywords": [ | |
[ | |
"RZECZPOSPOLITA", | |
"BRANDT", | |
"27.06.1988 CRIVITZ", | |
"4c. STAROSTA POLICKI", | |
"880627", | |
"00359/19/3211", | |
"8806670172", | |
"AM/B1/B", | |
"PL" | |
] | |
] | |
}, | |
{ | |
"file": "0022.png", | |
"expectedKeywords": [ | |
[ | |
"Ohio", | |
"LICENSE", | |
"STRICKLAND", | |
"Mike Rankin", | |
"Registrar BMV", | |
"JANE Q", | |
"9900TL5467900302", | |
"LICENSE NO.", | |
"TL545786", | |
"07-09-1962", | |
"04-01-2009", | |
"07-09-2012", | |
"ENDORS", | |
"07-09-1962", | |
"BRO", | |
"POWER OF ATTY", | |
"LIFE SUSTAINING", | |
"EQUIPMENT" | |
] | |
] | |
}, | |
{ | |
"file": "0023.png", | |
"expectedKeywords": [ | |
[ | |
"DRIVER", | |
"Tennessee", | |
"123456789", | |
"02/11/2026", | |
"DOB", | |
"02/11/1974", | |
"ISS", | |
"02/11/2019", | |
"REST", | |
"5'-05", | |
"1234567890123456", | |
"SAMPLE", | |
"JANICE", | |
"123 MAIN STREET", | |
"APT. 1", | |
"NASHVILLE", | |
"37210", | |
"DL" | |
] | |
] | |
}, | |
{ | |
"file": "0024.png", | |
"expectedKeywords": [ | |
[ | |
"CALIFORNIA", | |
"1986", | |
"N8685798", | |
"4641 Hayvenhurst", | |
"91316", | |
"Blk", | |
"Brn", | |
"8-29-58", | |
"PRE LIC EXP", | |
"CLASS 3", | |
"CORRECTIVE", | |
"SECTION 12804", | |
"Michael Joe Jackson", | |
"4-28-83", | |
"clckjw", | |
"AHIJ" | |
] | |
] | |
}, | |
{ | |
"file": "0025.png", | |
"expectedKeywords": [ | |
[ | |
"NEW YORK", | |
"LEARNER PERMIT", | |
"Mark J.F. Schroeder", | |
"Commissioner", | |
"987 654 321", | |
"BLU", | |
"DOB", | |
"Issued", | |
"10/31/2026", | |
"Michelle M. Motorist", | |
"MOTORIST", | |
"MICHELLE", | |
"2345 ANYWHERE STREET", | |
"12222", | |
"U18 UNTIL", | |
"OCT 21", | |
"U21 UNTIL", | |
"OCT 31 03", | |
"123456789" | |
] | |
] | |
}, | |
{ | |
"file": "0026.png", | |
"expectedKeywords": [ | |
[ | |
"California", | |
"DRIVER LICENSE", | |
"11234568", | |
"IMA", | |
"2570 24TH STREET", | |
"ANYTOWN", | |
"95818", | |
"DOB", | |
"08/31/1977", | |
"RSTR", | |
"DONOR", | |
"VETERAN", | |
"BRN", | |
"WGT", | |
"125 lb", | |
"00/00/0000NNNAN/ANFD/YY", | |
"08/31/2009", | |
"Cardholder", | |
"0831977" | |
] | |
] | |
}, | |
{ | |
"file": "0027.png", | |
"expectedKeywords": [ | |
[ | |
"Pennsylvania", | |
"IDENTIFICATION", | |
"visitPA.com", | |
"99 999 999", | |
"DUPS", | |
"01/07/1973", | |
"ANDREW JASON", | |
"123 MAIN STREET", | |
"HARRISBURG", | |
"17101-0000", | |
"01/31/2026", | |
"01/07/2022", | |
"HGT", | |
"1234567890123", | |
"456789012345", | |
"Andrew", | |
"Sample" | |
] | |
] | |
}, | |
{ | |
"file": "0028.png", | |
"expectedKeywords": [ | |
[ | |
"CALIFORNIA", | |
"1970", | |
"Ronald J. Thomas", | |
"ADMINISTRATOR", | |
"David Franklin Thomas", | |
"5798 Olive St", | |
"Calif", | |
"95969", | |
"W106438", | |
"Gry", | |
"COLOR EYES", | |
"Blu", | |
"DATE OF BIRTH", | |
"Aug 20, 1892", | |
"Corrective", | |
"D. F. Thomas", | |
"6,000 LBS", | |
"Paradise", | |
"8-4-65" | |
] | |
] | |
}, | |
{ | |
"file": "0029.png", | |
"expectedKeywords": [ | |
[ | |
"CALIFORNIA", | |
"OPERATING", | |
"1984", | |
"W0209369", | |
"James Scott Garner", | |
"35 Oakmont Dr", | |
"90049", | |
"Blk", | |
"Brn", | |
"6-3", | |
"PRE LIC EXP", | |
"4-7-23", | |
"CORRECTIVE", | |
"CONDITIONS", | |
"CLASS 3", | |
"SECTION 12804", | |
"James S. Garner", | |
"3-25-80", | |
"Gln rc", | |
"LAMINATE" | |
] | |
] | |
}, | |
{ | |
"file": "0030.png", | |
"expectedKeywords": [ | |
[ | |
"CALIFORNIA", | |
"DRIVER", | |
"RENEWAL", | |
"BIRTHDAY", | |
"N2287802", | |
"Shanaberger", | |
"1541 Beloit Ave", | |
"#208", | |
"90025", | |
"Brn", | |
"HEIGHT", | |
"5-6", | |
"130", | |
"CORRECTIVE", | |
"CONDITIONS", | |
"CLASS 3", | |
"SECTION 12804", | |
"08-21-80", | |
"Tor mw", | |
"LAMINATE" | |
] | |
] | |
}, | |
{ | |
"file": "0031.png", | |
"expectedKeywords": [ | |
[ | |
"SIGNATURE", | |
"TITULAIRE", | |
"FIRMA DEL TITULAR", | |
"PASAPORTE", | |
"UNITED STATES OF AMERICA", | |
"Codigo", | |
"546844936", | |
"Apellidos", | |
"ABRENICA", | |
"Date de naissance", | |
"Lugar de nacimiento", | |
"NEW YORK", | |
"06 Jun 2016", | |
"Autoridad", | |
"SEE PAGE 27", | |
"<USAAABRENICA<<JARED<MICHAEL", | |
"5468449363USA0102100M2106054275193173<681306" | |
] | |
] | |
}, | |
{ | |
"file": "0032.png", | |
"expectedKeywords": [ | |
[ | |
"ENDORSEMENTS AND LIMITATIONS", | |
"OBSERVATIONS BEGINNING", | |
"MENTIONS ET RESTRICTIONS", | |
"l'intention", | |
"GK141569", | |
"CANADA", | |
"Pays émetteur", | |
"passeport", | |
"MANN", | |
"Prénoms", | |
"JASKARAN SINGH", | |
"CANADIENNE", | |
"naissance", | |
"JNAOIA", | |
"délivrance", | |
"TORONTO", | |
"CANMANN<<JASKARAN<<SINGH", | |
"GK141569<8CAN8607294M2707202", | |
"ED197265" | |
] | |
] | |
}, | |
{ | |
"file": "0033.png", | |
"expectedKeywords": [ | |
[ | |
"Assinatura", | |
"Ce passeport", | |
"caso de incapacidad", | |
"AA000000", | |
"REPÚBLICA FEDERATIVA DO BRASIL", | |
"PAÍS EMISSOR", | |
"SOBRENOME", | |
"FARIAS DOS SANTOS", | |
"NOME", | |
"RODRIGO", | |
"BRASILEIRO(A)", | |
"16 MAR/MAR 2004", | |
"BRASÍLIA/DF", | |
"AMANDA FARIAS DOS SANTOS", | |
"Res. CNJ 131/11, Art. 13.", | |
"P<BRAFARIAS<DOS<SANTOS<<RODRIGO", | |
"AA000000<0BRA0403162M2507053" | |
] | |
] | |
}, | |
{ | |
"file": "0034.png", | |
"expectedKeywords": [ | |
[ | |
"ENDORSEMENTS AND LIMITATIONS", | |
"PAGE 5 (IF APPLICABLE)", | |
"MENTIONS ET RESTRICTIONS", | |
"HK444152", | |
"CANADA", | |
"Issuing Country", | |
"WITTMACK", | |
"Prénoms", | |
"BRIAN FREDRICK", | |
"Date de naissance", | |
"01 NOV 47", | |
"CONSORT CAN", | |
"Date d'expiration", | |
"MISSISSAUGA", | |
"P<CANWITTMACK<<BRIAN<FREDRICK", | |
"HK444152<5CAN4711018M2606130", | |
"EGD69494" | |
] | |
] | |
}, | |
{ | |
"file": "0035.png", | |
"expectedKeywords": [ | |
[ | |
"OBSERVATIONS OFFICIELLES (11)", | |
"UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND", | |
"518242591", | |
"Surname/Nom (1)", | |
"BRITISH CITIZEN", | |
"CROYDON", | |
"Date of expiry", | |
"24 APR / AVR 24", | |
"Holder's signature", | |
"P<GBRWEBB<<JAMES<ROBERT", | |
"5182425917GBR7702174M2404244" | |
] | |
] | |
}, | |
{ | |
"file": "0036.png", | |
"expectedKeywords": [ | |
[ | |
"RESIDENZA", | |
"TORINO (TO)", | |
"COLORE DEGLI OCCHI", | |
"MARRONI", | |
"REPUBBLICA ITALIANA", | |
"Tipo. Type. Type.", | |
"Codice Paese.", | |
"YA8116396", | |
"TREVISAN", | |
"Nome. Given Names. Prénoms. (2)", | |
"FELTRE (BL)", | |
"MINISTRO AFFARI ESTERI", | |
"E COOPERAZIONE INTERNAZIONALE", | |
"Firma del titolare", | |
"P<ITATREVISAN<<MARCO", | |
"YA81163966ITA6602129M2507097" | |
] | |
] | |
}, | |
{ | |
"file": "0037.png", | |
"expectedKeywords": [ | |
[ | |
"We the People", | |
"insure domestic Tranquility", | |
"Constitution for the United States of America", | |
"SIGNATURE OF BEARER", | |
"UNITED STATES OF AMERICA", | |
"Código", | |
"910239248", | |
"Apellidos", | |
"OBAMA", | |
"Date de naissance", | |
"17 Jan 1964", | |
"ILLINOIS, U.S.A.", | |
"Authority / Autorité / Autoridad", | |
"SEE PAGE 51", | |
"P<USABOBAMA<<MICHELLE", | |
"9102392482USA6401171F1812051900781200<129676" | |
] | |
] | |
}, | |
{ | |
"file": "0038.png", | |
"expectedKeywords": [ | |
[ | |
"Of the United States", | |
"PASSPORT", | |
"PASSEPORT", | |
"PASAPORTE", | |
"Code / Code / Código", | |
"488839667", | |
"VOLD", | |
"STEPHEN HANSL", | |
"Nationality / Nationalité / Nacionalidad", | |
"WASHINGTON, U.S.A.", | |
"21 May 2012", | |
"United States Department of State", | |
"Mentions Spéciale", | |
"SEE PAGE 51", | |
"P<USAVOLD<<STEPHEN<HANSL", | |
"4888396671USA6008156M220520112117147143<509936" | |
] | |
] | |
}, | |
{ | |
"file": "0039.png", | |
"expectedKeywords": [ | |
[ | |
"insure domestic Tranquility", | |
"Constitution for the United States of America.", | |
"PASSPORT", | |
"PASSEPORT", | |
"PASAPORTE", | |
"USA", | |
"Type / Type / Tipo", | |
"963545637", | |
"JOHN", | |
"15 Mar 1996", | |
"Fecha de expedición", | |
"United States Department of State", | |
"Endorsements", | |
"Mentions Spéciales", | |
"Anotaciones", | |
"SEE PAGE 17", | |
"P<USAJOHN<<DOE", | |
"9635456374USA9603150M27041402O2113962<804330" | |
] | |
] | |
}, | |
{ | |
"file": "0040.png", | |
"expectedKeywords": [ | |
[ | |
"OBSERVATIONS OFFICIELLES (11)", | |
"UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND", | |
"Code/Code", | |
"925600253", | |
"UK SPECIMEN", | |
"Prénoms (2)", | |
"ANGELA ZOE", | |
"Nationality", | |
"Nationalité", | |
"CROYDON", | |
"16 JUL / JUIL 10", | |
"Holder's signature", | |
"P<GBRUK<SPECIMEN<<ANGELA<ZOE<<<<<<<<<<<<<<<<", | |
"9256002538GBR8809117F2007162" | |
] | |
] | |
} | |
] | |
================================================ | |
FILE: shared/outputs/0001.md | |
================================================ | |
# Form 1040 | |
Department of the Treasury - Internal Revenue Service | |
## U.S. Individual Income Tax Return | |
## 2023 | |
OMB No. 1545-0074 | |
IRS Use Only - Do not write or staple in this space. | |
For the year Jan. 1 – Dec. 31, 2023, or other tax year beginning \_\_\_, ending \_\_\_ | |
See separate instructions. | |
Your first name and middle initial | |
JOSEPH R. | |
Last name | |
BIDEN JR. | |
Your social security number | |
If joint return, spouse's first name and middle initial | |
JILL T. | |
Last name | |
BIDEN | |
Spouse's social security number | |
Home address (number and street). If you have a P.O. box, see instructions. | |
Apt. no. | |
City, town, or post office. If you have a foreign address, also complete spaces below. | |
State | |
ZIP Code | |
Foreign country name | |
Foreign province/state/country | |
Foreign postal code | |
**Presidential Election Campaign** | |
Check here if you, or your spouse if filing jointly, want $3 to go to this fund. Checking a box below will not change your tax or refund. | |
☑ You ☑ Spouse | |
### Filing Status | |
Check only one box. | |
□ Single | |
☑ Married filing jointly (even if only one had income) | |
□ Married filing separately (MFS) | |
□ Head of Household (HoH) | |
□ Qualifying surviving spouse (QSS) | |
If you checked the MFS box, enter the name of your spouse. If you checked the HOH or QSS box, enter the child's name if the qualifying person is a child but not your dependent | |
### Digital Assets | |
At any time during 2023, did you (a) receive (as a reward, award, or payment for property or services); or (b) sell, exchange, or otherwise dispose of a digital asset (or a financial interest in a digital currency)? (See instructions.) | |
□ Yes ☑ No | |
### Standard Deduction | |
**Someone can claim:** | |
□ You as a dependent | |
□ Your spouse as a dependent | |
□ Spouse itemizes on a separate return or you were a dual-status alien | |
### Age/Blindness | |
**You:** ☑ Were born before January 2, 1959 □ Are blind | |
**Spouse:** ☑ Was born before January 2, 1959 □ Is blind | |
### Dependents | |
If more than four dependents, see Instr. and check here □ | |
| (see instructions): (1) First Name Last Name | (2) Social security number | (3) Relationship to you | (4) Check the box if qualifies for (see instr.): Child tax credit | (4) Check the box if qualifies for (see instr.): Credit for other dependents | | |
| -------------------------------------------- | -------------------------- | ----------------------- | ----------------------------------------------------------------- | ---------------------------------------------------------------------------- | | |
| | | | □ | □ | | |
| | | | □ | □ | | |
| | | | □ | □ | | |
| | | | □ | □ | | |
- **1a** Total amount from Form(s) W-2, box 1 (see instructions) STMT 1 **1a** 485,985. | |
================================================ | |
FILE: shared/outputs/0002.md | |
================================================ | |
# Deloitte. | |
## Quality System Audit for BioTech Innovations (Pty) Ltd | |
### Opening Meeting Sign-in Sheet | |
**Audit Date:** 02 October 2024 | |
**Time:** 06h30 | |
**Supplier:** BioTech Innovations (Pty) Ltd; 67 River Rd, Kensington, Johannesburg, Gauteng, 2094 South Africa. | |
**Contact Person:** Kathy Margaret | |
**Phone Number:** +14 22 045 4952 | |
**Opening Meeting Agenda:** | |
- Introductions | |
- Review of audit agenda | |
- Confirmation of availability for required persons | |
**Opening Meeting Attendees:** | |
| No. | Print Name | Job Title | Email | Signature | | |
| --- | -------------- | --------- | --------------------------- | ----------- | | |
| 1 | Anna Pojawis | CTO | [email protected] | [Signature] | | |
| 2 | Tyler Maran | CEO | [email protected] | [Signature] | | |
| 3 | Kathy Margaret | Associate | [email protected] | [Signature] | | |
| 4 | Mark Ding | Eng | [email protected] | [Signature] | | |
| 5 | | | | | | |
**QAC Auditor:** David Thompson, Lead Quality Auditor, NTA Services on behalf of BioTech Innovations (Biopharmaceuticals). | |
--- | |
Page 1 of 7 | |
**DELOITTE QUALITY ASSURANCE CONSULTANTS, LLC** | |
450 Oceanview Drive, Suite 200 - Santa Monica, CA 90405 - PHONE (800) 555-1234 (310) 555-7890 - FAX (310) 555-4567 | |
Website: [www.qaconsultants.com](http://www.qaconsultants.com) - Email: [email protected] | |
================================================ | |
FILE: shared/outputs/0003.md | |
================================================ | |
# [RS] | |
## Carbon Film Leader Resistor - Resistor | |
## Environmental Characteristics | |
| Item | Requirement | Test Method | | |
| ------------------------------- | ------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- | | |
| Short Time Overload | ±(0.75\*±0.05Ω) | JIS-C-5201-1 5.5 RCWV\*2.5 or Max. overload voltage for 5 seconds | | |
| Insulation Resistance | >1000MΩ | JIS-C-5201-1 5.6 Apply 100VDC for 1 minute | | |
| Endurance | ±(3%\*0.05Ω) | JIS-C-5201-1 7.10 70±2°C, Max. working voltage for 1000 hrs with 1.5 hrs "ON" and 0.5 hrs "OFF" | | |
| Damp Heat with Load | 100KΩ±3% 100KΩ±5% | JIS-C-5201-1 7.9 40±2°C, 90-95% R.H. Max. working voltage for 1000 hrs with 1.5 hrs "ON" and 0.5 hrs "OFF" | | |
| Solderability | 90% min. Coverage | JIS-C-5201-1 6.5 245±5°C for 3 seconds | | |
| Dielectric Withstanding Voltage | By Type | JIS-C-5201-1 5.7 Apply Max. Overload Voltage for 1 minute | | |
| Temperature Coefficient | < 100KΩ +350ppm~-500ppm 100KΩ~1MΩ-0ppm~-700ppm > 1 MΩ-0ppm~-1500ppm | Resistance value at room temperature and room Temperature+100°C | | |
| Pulse Overload | ±(1%\*±0.05Ω) | JIS-C-5201-1 5.8 4 times RCWV for 10000 cycles with 1 second "ON" and 25 seconds "OFF" | | |
| Resistance To Solvent | No deterioration of coatings and markings | JIS-C-5201-1 6.9 Trichroethane for 1 min. with ultrasonic | | |
| Terminal Strength | Tensile: 2.5 kg | Direct Load for 10 seconds In the direction off the terminal leads | | |
## Rated Continuous Working Voltage(RCWV) = √(P\*R) | |
## Storage Temperature: 25±3°C; Humidity < 80% RH | |
================================================ | |
FILE: shared/outputs/0004.md | |
================================================ | |
Focused ReAct: Improving ReAct through Reiterate and Early Stop | |
================================================================================ | |
**Shuoqui Li** | |
Carnegie Mellon University | |
[email protected] | |
**Han Xu** | |
University of Illinois at Urbana-Champaign | |
[email protected] | |
**Haipeng Chen** | |
William & Mary | |
[email protected] | |
--- | |
**Abstract** | |
Large language models (LLMs) have significantly improved their reasoning and decision-making capabilities, as seen in methods like ReAct. However, despite its effectiveness in tackling complex tasks, ReAct faces two main challenges: losing focus on the original question and becoming stuck in action loops. To address these issues, we introduce Focused ReAct, an enhanced version of the ReAct paradigm that incorporates reiteration and early stop mechanisms. These improvements help the model stay focused on the original query and avoid repetitive behaviors. Experimental results show accuracy gains of 18% to 530% and a runtime reduction of up to 34% compared to the original ReAct method. | |
1 Introduction | |
---------------- | |
Recent advancements in large language models (LLMs) have enabled more sophisticated techniques for reasoning and decision-making. One such technique, the ReAct framework (Reason+Act), has gained popularity for its dual approach of alternating between reasoning and action (Yao et al., 2023). This combination allows ReAct to excel in handling complex tasks by better adapting to dynamic environments (Wang et al., 2024). | |
Despite its strengths in general question-answering (QA), ReAct sometimes falls short in delivering accurate results, as demonstrated in Figure 1. When confronted with lengthy or intricate questions, the model—paired with the ReAct framework—may lose track of the original question, leading to incorrect responses. Additionally, the ReAct method can get trapped in repetitive action loops, continuously performing the same action without progressing toward a solution. To address these challenges, we introduce Focused ReAct, an enhanced version of the original ReAct method specifically designed for QA tasks. | |
2 Methodology | |
---------------- | |
This section outlines the two core features of the Focused ReAct method: reiterate and early stop. | |
2.1 Reiterate to Solve Context Loss | |
As indicated in the introduction, The first challenge stems from the extended reasoning process, where the relevance of the original question diminishes as additional reasoning and actions are appended. To resolve this, the concept of reiterate is introduced. In this approach, the original question is restated at the beginning of each reasoning step in the ReAct cycle. This technique is illustrated in the green-highlighted region of Figure 2. | |
By reiterating the original question at each step, the model continually emphasizes the user’s query, preventing it from being overshadowed by the increasingly long context that ReAct tends to create. This simple yet effective strategy mitigates the context dilution problem illustrated in Figure 1, ensuring that the output remains aligned with the user’s intent. | |
--- | |
 | |
Question: Howard Van Doren Shaw was the designer of which 1917 Progressive Era community? | |
Thought: 1917 is the year of the community, so I need to search Howard Van Doren Shaw and Progressive Era. | |
Action: Search [Howard Van Doren Shaw] | |
Observation: I found the following information in order to answer "Howard Van Doren Shaw was the designer of which 1917 Progressive Era community?" …... | |
Thought: Hereby I have provide enough information. Please give out the answer to "Howard Van Doren Shaw was the designer of which 1917 Progressive Era community?" with the format of …… | |
Action: Finish [Marktown] | |
Evaluation: True | |
Figure 2: The QA process by Focused ReAct for the same question, which applies reiteration (highlighted in yellow) and early stop (highlighted in red) to resolve the context loss and the repeated action issue. | |
2.2 Early Stop to Prevent Action Repetition | |
The second challenge, as outlined in the introduction, occurs when the model gets caught in repetitive loops, generating the same response without progressing toward the correct answer. To tackle this, we propose an early stop mechanism. It assumes that by the time a duplicate action occurs, sufficient information has been gathered. | |
When the program detects repeated actions, it triggers a termination request - highlighted in red in Figure 2 - instructing the model to generate a final answer based on the existing information. This approach prevents unnecessary repetition and helps the QA process arrive at an accurate response more efficiently. | |
3 Experimentation | |
We evaluate Focused ReAct against the ReAct baseline using the Gemma 2 2B (Team et al., 2024), Phi-3.5-mini 3.8B (Abdin et al., 2024) and Llama 3.1 8B (Dubey et al., 2024) models. The implementation uses the PyTorch and Transformers libraries¹, with experiments conducted on a single NVIDIA L4 GPU with 24GB of memory. The dataset consists of 150 QA tasks, randomly selected from HotPotQA (Yang et al., 2018). We measure accuracy as the ratio of correctly answered tasks to the total number of tasks, while runtime is recorded for the completion of each task. | |
Table 1 presents the accuracy comparison between the vanilla ReAct and Focused ReAct across the Gemma 2, Phi-3.5, and Llama 3.1 models. Focused ReAct demonstrates an 18%-530% improvement in accuracy. | |
| Model | ReAct | Focused ReAct | abs./rel. diff | | |
|---------------|-------|---------------|----------------| | |
| Gemma 2 2B | 2.0% | 12.6% | +10.6 / 530% | | |
| Phi-3.5-mini 3.8B | 22.0% | 26.0% | +4.0 / 18% | | |
| Llama 3.1 8B | 14.0% | 23.3% | +9.3 / 66% | | |
Table 2: Runtime Comparison (Average and Std) for ReAct vs. Focused ReAct | |
| Model | ReAct | Focused ReAct | abs./rel. diff | | |
|---------------|---------------|---------------|----------------| | |
| Gemma 2 2B | 11.68±2.66s | 7.68±2.41s | -4.0 / 34% | | |
| Phi-3.5-mini 3.8B | 23.23±8.42s | 22.50±11.19s | -0.73 / 3% | | |
| Llama 3.1 8B | 24.10±23.48s | 23.12±25.35s | -0.98 / 4% | | |
Table 2 summarizes the average runtime and standard deviation (std) for both the original ReAct and Focused ReAct methods. Models with fewer parameters show a 34% reduction in runtime, while models with larger parameter sizes exhibit no significant decrease. This discrepancy may be attributed to the fact that smaller models, with weaker reasoning capabilities, benefit more from Focused ReAct optimizations. In contrast, larger models are more robust at maintaining context and performing deeper reasoning, which may reduce the relative impact of Focused ReAct’s efficiency gains. As a result, the runtime benefits are less pronounced compared to smaller models. | |
4 Conclusion | |
This paper identifies two common issues with the ReAct method in QA: losing focus on the original question during extended reasoning and becoming stuck in repetitive action loops. To overcome these problems, we propose Focused ReAct, which incorporates reiteration and early stop to improve upon the ReAct framework. Compared to the original ReAct method, the new approach achieves accuracy improvements between 18% and 530%, along with a reduction in runtime of up to 34%. | |
For future work, we plan to extend Focused ReAct to a broader range of tasks and scenarios, evaluate its generalizability and robustness, and explore techniques to further accelerate its performance (Xu et al., 2024). | |
¹Our code implementation and experiments are available at https://github.com/vmd3i/Focused-ReAct. | |
## References | |
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. *arXiv preprint arXiv:2404.14219*. | |
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. *arXiv preprint arXiv:2407.21783*. | |
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhuptiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024. Gemma 2: Improving open language models at a practical size. *arXiv preprint arXiv:2408.00118*. | |
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024. A survey on large language model based autonomous agents. *Frontiers of Computer Science, 18*(6):186345. | |
Han Xu, Jingyang Ye, Yutong Li, and Haipeng Chen. 2024. Can speculative sampling accelerate react without compromising reasoning quality? In *The Second Tiny Papers Track at ICLR 2024*. | |
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In *Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*. | |
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2023. React: Synergizing reasoning and acting in language models. In *The Eleventh International Conference on Learning Representations*. | |
================================================ | |
FILE: shared/outputs/0005.md | |
================================================ | |
# Quest Diagnostics | |
## Maternal Serum Screening | |
### BILL TO: | |
□ My Account | |
□ Insurance Provided | |
□ Lab Card/Select | |
□ Patient | |
### PRINT PATIENT NAME (LAST, FIRST, MIDDLE) | |
### REGISTRATION \# (IF APPLICABLE) | |
### DATE OF BIRTH | |
### SEX | |
### LAB REFERENCE | |
### CELL PHONE | |
### PATIENT ID \# / MRN | |
### PATIENT PHONE | |
### PATIENT EMAIL ADDRESS | |
### PRINT NAME OF INSURED/RESPONSIBLE PARTY (LAST, FIRST, MIDDLE) - IF OTHER THAN PATIENT | |
### PATIENT STREET ADDRESS (OR INSURED/RESPONSIBLE PARTY) | |
### APT | |
### KEY | |
### CITY | |
### STATE | |
### ZIP | |
### PRIMARY INSURANCE | |
### RELATIONSHIP TO INSURED: | |
□ SELF | |
□ SPOUSE | |
□ DEPENDENT | |
### PRIMARY INSURANCE CO. NAME | |
### MEMBER / INSURED ID NO. \# | |
### GROUP \# | |
### INSURANCE ADDRESS | |
### ACCOUNT | |
### CITY | |
### STATE | |
### ZIP | |
### ACCOUNT \# | |
### NAME | |
### ADDRESS | |
### CITY, STATE, ZIP | |
### TELEPHONE \# | |
### DATE COLLECTED | |
### TIME | |
□ AM | |
□ PM | |
### TOTAL VOL/hrs | |
\_\_\_ ML \_\_\_ HR | |
□ Fasting | |
□ Non Fasting | |
### NPI/UPIN ORDERING/SUPERVISING PHYSICIAN AND/OR PAYERS (MUST BE INDICATED) | |
### ADDIT'L PHYS.: Dr. | |
### NPI/UPIN | |
### NON-PHYSICIAN PROVIDER: | |
### NAME | |
### I.D.# | |
### □ Fax Result to: | |
### Send Duplicate Report to: | |
### Client # OR NAME: | |
### ADDRESS | |
### CITY | |
### STATE | |
### ZIP | |
### DID YOU KNOW | |
- Reflex Tests Are Performed At An Additional Charge. | |
- PSC Appointment Website And Telephone Number Information Listed On The Back. | |
- Each Sample Should Be Labeled With At Least Two Patient Identifiers At Time Of Collection. | |
### ICD Diagnosis Codes Are Mandatory. Fill in the applicable fields below. | |
### ABN required for tests with these symbols | |
Medicare Limited Coverage Tests | |
- @ - May not be covered for the reported diagnosis. | |
- F - Has a frequency limit or usage coverage. | |
- & - As a blood donor screening test/experimental kit. | |
- B - Has both donor and frequency/medical coverage limitations. | |
Provide signed ABN when necessary | |
### Provide | |
- ICD Diagnosis Code(s) | |
- ABN when necessary | |
### Visit QuestDiagnostics.com/MLCP for Medicare coverage guidelines | |
### ICD Codes (enter all that apply) | |
Many payers (including Medicaid) have medical necessity requirements. You should only order those tests which are medically necessary for the diagnosis and treatment of the patient. | |
### 1st TRIMESTER SCREENING ♦ # (1st Trimester Screening does not detect oNTDs) | Red Top SST - 1 Tube | |
**16020** □ 1st Trimester Screen hyperGly-hCG (PAPP-A, h-hCG) (9.0-13.9 wks gestation) | |
@ **16145** □ 1st Trimester Screen, hCG (PAPP-A, hCG) (10.0-13.9 wks gestation) | |
### INTEGRATED/SEQUENTIAL SCREENING | |
@ **16131** □ Sequential Integrated Screen **Part 1** (PAPP-A, hCG) # ♦ (10.0-13.9 weeks gestation) | |
@ **16133** □ Sequential Integrated Screen **Part 2** (AFP, hCG, uE3, DIA) (15.0-22.9 weeks gestation) | |
**Speciment # from Part 1** | |
@ **16463** □ Stepwise Sequential Screen **Part 1** (PAPP-A, hCG) # ♦ (10.0-13.9 weeks gestation) | |
@ **16465** □ Stepwise Sequential Screen **Part 2** (AFP, hCG, uE3, DIA) (15.0-22.9 weeks gestation) | |
**Speciment # from Part 1** | |
**16148** □ Integrated Screen **Part 1** (PAPP-A) # ♦ (NT required) (9.0-13.9 weeks gestation) | |
@ **16150** □ Integrated Screen **Part 2** (AFP, hCG, uE3, DIA) (15.0-22.9 weeks gestation) | |
**Speciment # from Part 1** | |
**16165** □ Serum Integrated Screen **Part 1** (PAPP-A) # (NT not required) (9.0-13.9 weeks gestation) | |
@ **16167** □ Serum Integrated Screen **Part 2** (AFP, hCG, uE3, DIA) (15.0-22.9 weeks gestation) | |
**Speciment # from Part 1** | |
### 2nd TRIMESTER SCREENING # | Red Top SST - 1 Tube | |
@ **5059** □ Maternal Serum AFP (MSAFP) (15.0-22.9 weeks gestation) | |
**Screens for open neural tube detects (oNTDs) only** | |
@ **30294** □ Quad Screen (AFP, hCG, uE3, DIA) (15.0-22.9 weeks gestation) | |
@ **15934** □ Penta Screen (AFP, hCG, uE3, DIA) (15.0-22.9 wks gestation) | |
### THIS INFORMATION IS REQUIRED FOR ALL TESTS ~ CALL 866-GENEINFO IF YOU HAVE ANY QUESTIONS | |
Date of Birth: \_\_/\_\_/\_\_ | |
Collection Date: \_\_/\_\_/\_\_ | |
Maternal Weight: \_\_LBS | |
### # THIS INFORMATION IS REQUIRED FOR PART 1 OF INTEGRATED/SEQUENTIAL SCREENING, 1ST AND 2ND TRIMESTER SCREENING | Red Top SST - 1 Tube | |
Estimated Date of Delivery (EDD): \_\_/\_\_/\_\_ determined by: ☐ Ultrasound ☐ Last Menstrual Period (LMP) ☐ Physical Exam | |
Mother's Ethnic Origin: ☐ African American ☐ Asian ☐ Caucasian ☐ Hispanic ☐ Other: \_\_ | |
Number of Fetuses: ☐ One ☐ Two ☐ More than 2 | How many fetuses? \_\_ | |
| Yes | No | | | |
| --- | --- | ------------------------------------------------------------------------------------------------------------------------------------------------------- | | |
| ☐ | ☐ | Patient is an insulin-dependent diabetic prior to pregnancy | | |
| ☐ | ☐ | This is a repeat specimen for this pregnancy (Repeat testing following a screen positive result for Down syndrome or Trisomy 18 is **NOT** recommended) | | |
| ☐ | ☐ | History of neural tube defect If yes, explain: \_\_ | | |
| ☐ | ☐ | Previous pregnancy with Down Syndrome | | |
| ☐ | ☐ | Pregnancy is from a donor egg Age of Donor at time of Egg Retrieval: \_\_ | | |
| ☐ | ☐ | Patient currently smokes cigarettes | | |
| | | **Other Relevant Clinical Information:** | | |
### ♦ THIS INFORMATION IS REQUIRED FOR 1st TRIMESTER SCREENING AND PART 1 INTEGRATED/SEQUENTIAL SCREENING. | |
Ultrasound date \_\_/\_\_/\_\_ | |
Ultrasonographer's name \_\_/\_\_/\_\_ | |
**Nuchal Translucency Measurement Credentialing Agency (required, check one box)** | |
☐ NTQR Ultrasonographer's ID# \_\_ | Location ID# \_\_ | Reading Physician ID# \_\_ | |
☐ FMF Ultrasonographer's ID# \_\_ | |
☐ Other (List) \_\_ | ID# \_\_ | |
Quest, Quest Diagnostics, the associated logo and all associated Quest Diagnostics marks are the trademarks of Quest Diagnostics Incorporated. © Quest Diagnostics Incorporated. All rights reserved. QD20330K. Revised 8/20. | |
================================================ | |
FILE: shared/outputs/0006.md | |
================================================ | |
# Monthly Statement | |
**Statement for** | |
FIRST NAME LAST NAME | |
**Account number** | |
100469352 | |
**Bill close date** | |
Feb 14, 2024 | |
### FIRST NAME LAST NAME | |
### ADDRESS | |
### CITY, STATE, ZIP CODE | |
## Balance | |
| Description | Amount | | |
| -------------------------------- | ----------- | | |
| Previous balance | $95.80 | | |
| Credits and one time charges | ($95.80) | | |
| Payments received | ($0.00) | | |
| **Balance forward - Credit** | **($0.00)** | | |
| Current charges | | | |
| Recurring | $69.23 | | |
| Other | $41.47 | | |
| **Total amount due by 02/28/24** | **$110.70** | | |
Your bill is scheduled for an automatic payment on 02/28/24 using MasterCard \*\*\*\*9541. | |
_"Change from last month" does not include changes to taxes and fees unless associated with changes in service plan, Equipment Installment Plan, or Lease._ | |
## Current charges | |
| Account and lines | Recurring | Other | Change from last month | | |
| -------------------- | ----------- | ---------- | ---------------------- | | |
| Account | $25.00 | - | $95.80 ▼ | | |
| (XXX)-49X-5XXX | - | $28.10 | - | | |
| (XXX)-49X-5XXX | - | $1.50 | - | | |
| (XXX)-49X-5XXX | - | - | $26.60 ▲ | | |
| (XXX)-49X-5XXX | - | - | $23.10 ▲ | | |
| (XXX)-49X-5XXX | $20.23 | - | $20.03 ▲ | | |
| (XXX)-49X-5XXX | - | $7.12 | $10.17 ▲ | | |
| New - (XXX)-496-5XXX | $24.00 | $4.75 | - | | |
| (XXX)-496-5XXX | - | - | - | | |
| 3 additional lines | - | $00.00 | 15.50 ▲ | | |
| **Subtotal** | **$69.23** | **$41.47** | - | | |
| **Total** | **$110.70** | | - | | |
## Bill highlights | |
Follow numbers throughout bill. | |
- **1** **_You had usage charges._** | |
- **2** **_Your plan changed._** | |
- **3** **_An Equipment Installment Plan (EIP) monthly charge was billed for the first time._** | |
- **i** **_Your billing address has changed._** | |
- **4** **_One or more of your lines did not receive KickBack because they exceeded 2GB._** | |
- **i** **_You're getting an AutoPay discount for using AutoPay!_** | |
- **i** **_Visit my.t-mobile.com or the T-Mobile App to pay your bill online, manage your account and get product support._** | |
**Questions?** For more information visit my.t-mobile.com. | |
Please detach this portion and return with your payment. Please make sure address shows through window. | |
**T-Mobile** | |
**Statement for:** FIRST NAME LAST NAME | |
**Account number:** 100469352 | |
**Pay online:** t-mobile.com/pay | |
**Pay by phone:** *PAY (*XXX) | |
**Scan to pay** | |
| Total amount due by 02/28/24 | Amount enclosed | | |
| ---------------------------- | --------------- | | |
| **$110.70** | **AutoPay** | | |
T-MOBILE | |
PO BOX 8668 | |
CITY, STATE, ZIP CODE | |
□ **Sign up for AutoPay** - Check box and complete reverse side. | |
□ **If you changed your address** - Check box and record new address on the reverse side. | |
1004693520967854630205189400463125091 | |
================================================ | |
FILE: shared/outputs/0007.md | |
================================================ | |
02/16/2019 6:09:27 PM | FROM: LABCORP LCLS BULK | TO: 3106932528 | LABCORP | Page 1 of 4 | |
TO: Michael Farzam MD | |
# LabCorp | Patient Report | |
**Specimen ID:** 044-494-4741-0 | |
**Control ID:** 60070006294 | |
**Acct #:** 04275945 | |
**Phone:** (310) 849-7991 | |
**Rte:** 00 | |
**WEISER, CHERYL** | |
444 N GENESEE AVE | |
LOS ANGELES CA 90036 | |
(213) 400-3914 | |
**Michael Farzam MD** | |
258 North Bowling Green Way | |
LOS ANGELES CA 90049 | |
### Patient Details | |
**DOB:** 04/11/1992 | |
**Age(y/m/d):** 026/10/02 | |
**Gender:** F | |
**SSN:** | |
**Patient ID:** | |
### Specimen Details | |
**Date collected:** 02/13/2019 1000 Local | |
**Date received:** 02/13/2019 | |
**Date entered:** 02/13/2019 | |
**Date reported:** 02/16/2019 1809 ET | |
### Physician Details | |
**Ordering:** M FARZAM | |
**Referring:** | |
**ID:** | |
**NPI:** 1619295474 | |
**General Comments & Additional Information** | |
**Total Volume:** Not Provided | |
**Fasting:** Yes | |
**Ordered Items** | |
CBC With Differential/Platelet; Comp. Metabolic Panel (14); UA/M w/rflx Culture, Routine; Lipid Panel; Vitamin B12 and Folate; Hemoglobin A1c; Thyroxine (T4) Free, Direct, S; TSH; Vitamin D, 25-Hydroxy; Uric Acid; Iron; Ferritin, Serum | |
| TESTS | RESULT | FLAG | UNITS | REFERENCE INTERVAL | LAB | | |
| ---------------------------------- | ------- | -------- | ----------- | ------------------ | --- | | |
| **CBC With Differential/Platelet** | | | | | | | |
| WBC | 7.5 | | x10E3/uL | 3.4 - 10.8 | 01 | | |
| RBC | 4.24 | | x10E6/uL | 3.77 - 5.28 | 01 | | |
| Hemoglobin | 12.8 | | g/dL | 11.1 - 15.9 | 01 | | |
| Hematocrit | 38.4 | | % | 34.0 - 46.6 | 01 | | |
| MCV | 91 | | fL | 79 - 97 | 01 | | |
| MCH | 30.2 | | pg | 26.6 - 33.0 | 01 | | |
| MCHC | 33.3 | | g/dL | 31.5 - 35.7 | 01 | | |
| RDW | 13.2 | | % | 12.3 - 15.4 | 01 | | |
| Platelets | 202 | | x10E3/uL | 150 - 379 | 01 | | |
| Neutrophils | 41 | | % | Not Estab. | 01 | | |
| Lymphs | 48 | | % | Not Estab. | 01 | | |
| Monocytes | 8 | | % | Not Estab. | 01 | | |
| Eos | 3 | | % | Not Estab. | 01 | | |
| Basos | 0 | | % | Not Estab. | 01 | | |
| Neutrophils (Absolute) | 3.1 | | x10E3/uL | 1.4 - 7.0 | 01 | | |
| **Lymphs (Absolute)** | **3.5** | **High** | x10E3/uL | 0.7 - 3.1 | 01 | | |
| Monocytes (Absolute) | 0.6 | | x10E3/uL | 0.1 - 0.9 | 01 | | |
| Eos (Absolute) | 0.2 | | x10E3/uL | 0.0 - 0.4 | 01 | | |
| Baso (Absolute) | 0.0 | | x10E3/uL | 0.0 - 0.2 | 01 | | |
| Immature Granulocytes | 0 | | % | Not Estab. | 01 | | |
| Immature Grans (Abs) | 0.0 | | x10E3/uL | 0.0 - 0.1 | 01 | | |
| **Comp. Metabolic Panel (14)** | | | | | | | |
| Glucose | 76 | | mg/dL | 65 - 99 | 01 | | |
| **BUN** | **32** | **High** | mg/dL | 6 - 20 | 01 | | |
| Creatinine | 0.74 | | mg/dL | 0.57 - 1.00 | 01 | | |
| eGFR If NonAfricn Am | 112 | | mL/min/1.73 | >59 | | | |
| eGFR If Africn Am | 129 | | mL/min/1.73 | >59 | | | |
| **BUN/Creatinine Ratio** | **43** | **High** | | 9 - 23 | | | |
--- | |
Date Issued: 02/16/19 1809 ET | **FINAL REPORT** | Page 1 of 4 | |
This document contains private and confidential health information protected by state and federal law. | |
If you have received this document in error, please call 800-859-6046 | |
© 1995-2019 Laboratory Corporation of America® Holdings | |
All Rights Reserved - Enterprise Report Version: 1.00 | |
================================================ | |
FILE: shared/outputs/0008.md | |
================================================ | |
02/16/2019 6:09:27 PM | FROM: LABCORP LCLS BULK | TO: 3106932528 | LABCORP | Page 2 of 4 | |
TO: Michael Farzam MD | |
# LabCorp | Patient Report | |
**Patient: WEISER, CHERYL** | |
**DOB:** 04/11/1992 | |
**Patient ID:** | |
**Control ID:** 60070006294 | |
**Specimen ID:** 044-494-4741-0 | |
**Date collected:** 02/13/2019 1000 Local | |
| TESTS | RESULT | FLAG | UNITS | REFERENCE INTERVAL | LAB | | |
| ------------------------------------------------------------------- | ------------------ | ------------ | ------ | ------------------ | --- | | |
| Sodium | 136 | | mmol/L | 134 - 144 | 01 | | |
| Potassium | 3.9 | | mmol/L | 3.5 - 5.2 | 01 | | |
| Chloride | 102 | | mmol/L | 96 - 106 | 01 | | |
| Carbon Dioxide, Total | 21 | | mmol/L | 20 - 29 | 01 | | |
| Calcium | 9.2 | | mg/dL | 8.7 - 10.2 | 01 | | |
| Protein, Total | 7.4 | | g/dL | 6.0 - 8.5 | 01 | | |
| Albumin | 4.7 | | g/dL | 3.5 - 5.5 | 01 | | |
| Globulin, Total | 2.7 | | g/dL | 1.5 - 4.5 | | | |
| A/G Ratio | 1.7 | | | 1.2 - 2.2 | | | |
| Bilirubin, Total | 0.3 | | mg/dL | 0.0 - 1.2 | 01 | | |
| Alkaline Phosphatase | 92 | | IU/L | 39 - 117 | 01 | | |
| **AST (SGOT)** | **41** | **High** | IU/L | 0 - 40 | 01 | | |
| **ALT (SGPT)** | **83** | **High** | IU/L | 0 - 32 | 01 | | |
| **UA/M w/rflx Culture, Routine** | | | | | | | |
| Urinalysis Gross Exam | | | | | 01 | | |
| Specific Gravity | 1.019 | | | 1.005 - 1.030 | 01 | | |
| pH | 5.5 | | | 5.0 - 7.5 | 01 | | |
| **Urine-Color** | **Brown** | **Abnormal** | | Yellow | 01 | | |
| **Appearance** | **Cloudy** | **Abnormal** | | Clear | 01 | | |
| **WBC Esterase** | **1+** | **Abnormal** | | Negative | 01 | | |
| **Protein** | **2+** | **Abnormal** | | Negative/Trace | 01 | | |
| Glucose | Negative | | | Negative | 01 | | |
| Ketones | Negative | | | Negative | 01 | | |
| **Occult Blood** | **3+** | **Abnormal** | | Negative | 01 | | |
| Bilirubin | Negative | | | Negative | 01 | | |
| Urobilinogen, Semi-Qn | 0.2 | | mg/dL | 0.2 - 1.0 | 01 | | |
| Nitrite, Urine | Negative | | | Negative | 01 | | |
| Microscopic Examination<br>See below: | | | | | 01 | | |
| WBC | 0-5 | | /hpf | 0 - 5 | 01 | | |
| **RBC** | **11-30** | **Abnormal** | /hpf | 0 - 2 | 01 | | |
| Epithelial Cells (non renal) | 0-10 | | /hpf | 0 - 10 | 01 | | |
| **Crystals** | **Present** | **Abnormal** | | N/A | 01 | | |
| Crystal Type | Amorphous Sediment | | | N/A | 01 | | |
| Mucus Threads | Present | | | Not Estab. | 01 | | |
| Bacteria | Few | | | None seen/Few | 01 | | |
| Urinalysis Reflex<br>This specimen has reflexed to a Urine Culture. | | | | | 01 | | |
| Urine Culture, Routine<br>Final report | | | | | 01 | | |
| Result 1 | | | | | | | |
--- | |
Date Issued: 02/16/19 1809 ET | **FINAL REPORT** | Page 2 of 4 | |
This document contains private and confidential health information protected by state and federal law. | |
If you have received this document in error, please call 800-859-6046 | |
© 1995-2019 Laboratory Corporation of America® Holdings | |
All Rights Reserved - Enterprise Report Version: 1.00
================================================ | |
FILE: shared/outputs/0009.md | |
================================================ | |
02/16/2019 6:09:27 PM | FROM: LABCORP LCLS BULK | TO: 3106932528 | LABCORP | Page 3 of 4 | |
TO: Michael Farzam MD | |
# LabCorp | Patient Report | |
**Patient: WEISER, CHERYL** | |
**DOB:** 04/11/1992 | |
**Patient ID:** | |
**Control ID:** 60070006294 | |
**Specimen ID:** 044-494-4741-0 | |
**Date collected:** 02/13/2019 1000 Local | |
| TESTS | RESULT | FLAG | UNITS | REFERENCE INTERVAL | LAB | | |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------- | -------- | ------ | ------------------ | --- | | |
| Mixed urogenital flora<br>50,000-100,000 colony forming units per mL | | | | | 01 | | |
| **Lipid Panel** | | | | | | | |
| **Cholesterol, Total** | **202** | **High** | mg/dL | 100 - 199 | 01 | | |
| Triglycerides | 48 | | mg/dL | 0 - 149 | 01 | | |
| HDL Cholesterol | 82 | | mg/dL | >39 | 01 | | |
| VLDL Cholesterol Cal | 10 | | mg/dL | 5 - 40 | 01 | | |
| **LDL Cholesterol Calc** | **110** | **High** | mg/dL | 0 - 99 | 01 | | |
| **Vitamin B12 and Folate** | | | | | | | |
| **Vitamin B12** | **>1999** | **High** | pg/mL | 232 - 1245 | 01 | | |
| Folate (Folic Acid), Serum<br>Note:<br>A serum folate concentration of less than 3.1 ng/mL is considered to represent clinical deficiency. | 5.1 | | ng/mL | >3.0 | 01 | | |
| **Hemoglobin A1c** | | | | | | | |
| Hemoglobin A1c<br>Please Note:<br>- Prediabetes: 5.7 - 6.4<br>- Diabetes: >6.4<br>- Glycemic control for adults with diabetes: <7.0 | 4.8 | | % | 4.8 - 5.6 | 01 | | |
| **Thyroxine (T4) Free, Direct, S** | | | | | | | |
| T4,Free(Direct) | 1.07 | | ng/dL | 0.82 - 1.77 | 01 | | |
| TSH | 2.200 | | uIU/mL | 0.450 - 4.500 | 01 | | |
| **Vitamin D, 25-Hydroxy**<br>**Vitamin D deficiency has been defined by the Institute of Medicine and an Endocrine Society practice guideline as a level of serum 25-OH vitamin D less than 20 ng/mL (1,2). The Endocrine Society went on to further define vitamin D insufficiency as a level between 21 and 29 ng/mL (2).<br>1. IOM (Institute of Medicine). 2010. Dietary reference intakes for calcium and D. Washington DC: The National Academies Press.<br>2. Holick MF, Binkley NC, Bischoff-Ferrari HA, et al. Evaluation, treatment, and prevention of vitamin D deficiency: an Endocrine Society clinical practice guideline. JCEM. 2011 Jul; 96(7):1911-30.** | **10.7** | **Low** | ng/mL | 30.0 - 100.0 | 01 | | |
| **Uric Acid** | | | | | | | |
| Uric Acid<br>Please Note:<br>Therapeutic target for gout patients: <6.0 | 2.8 | | mg/dL | 2.5 - 7.1 | 01 | | |
--- | |
Date Issued: 02/16/19 1809 ET | **FINAL REPORT** | Page 3 of 4 | |
This document contains private and confidential health information protected by state and federal law. | |
If you have received this document in error, please call 800-859-6046 | |
© 1995-2019 Laboratory Corporation of America® Holdings | |
All Rights Reserved - Enterprise Report Version: 1.00 | |
================================================ | |
FILE: shared/outputs/0010.md | |
================================================ | |
02/16/2019 6:09:27 PM | FROM: LABCORP LCLS BULK | TO: 3106932528 | LABCORP | Page 4 of 4 | |
TO: Michael Farzam MD | |
# LabCorp | Patient Report | |
**Patient: WEISER, CHERYL** | |
**DOB:** 04/11/1992 | |
**Patient ID:** | |
**Control ID:** 60070006294 | |
**Specimen ID:** 044-494-4741-0 | |
**Date collected:** 02/13/2019 1000 Local | |
| TESTS | RESULT | FLAG | UNITS | REFERENCE INTERVAL | LAB | | |
| --------------- | ------ | ---- | ----- | ------------------ | --- | | |
| Iron | 123 | | ug/dL | 27 - 159 | 01 | | |
| Ferritin, Serum | 44 | | ng/mL | 15 - 150 | 01 | | |
01 SO | |
LabCorp San Diego | |
13112 Evening Creek Dr So Ste 200, San Diego, CA | |
92128-4108 | |
Dir: Jenny Galloway, MD | |
For inquiries, the physician may contact **Branch: 800-859-6046 Lab: 858-668-3700** | |
--- | |
Date Issued: 02/16/19 1809 ET | **FINAL REPORT** | Page 4 of 4 | |
This document contains private and confidential health information protected by state and federal law. | |
If you have received this document in error, please call 800-859-6046 | |
© 1995-2019 Laboratory Corporation of America® Holdings | |
All Rights Reserved - Enterprise Report Version: 1.00 | |
================================================ | |
FILE: shared/outputs/0011.md | |
================================================ | |
# Bill of Lading | |
**Ship Date** September 13, 2021 | |
**Bill of Lading Number:** 21099992723 | |
--- | |
## Ship From | |
**Example Pick Up Company** | |
123 Pick Up Street | |
Vancouver, BC V5K 0A4, Canada
**Location Type:** Business without Dock or Forklift | |
(123) 456-7890 (Example Pick Up Contact) | |
**Pickup Hours:** 9:00AM to 5:00PM | |
**SID#:** N/A | |
**Special Handling:** Protect From Freeze | |
--- | |
## Ship To | |
**Example Delivery Company** | |
123 Delivery Street | |
Toronto, ON M1B 0A1, Canada
**Location Type:** Business with Dock or Forklift | |
(123) 456-7890 (Example Consignee) | |
**Delivery Hours:** 8:00AM to 5:00PM | |
**CID#:** N/A | |
**Special Handling:** N/A | |
--- | |
## Third Party Freight Charges Bill To | |
**Freightera Logistics Inc.** | |
408 - 55 Water Street, Office 8036 | |
Vancouver, BC V6B 1A1 | |
800-866-4870 | |
--- | |
## Carrier Information | |
**Carrier Name:** EXAMPLE CARRIER | |
**Trailer Number:** | |
**Seal Number(s):** | |
**Pro Number:** N/A | |
**Quote#:** N/A | |
**Customer PO#:** N/A | |
**Freight Charges Term** (prepaid unless otherwise marked) | |
- [ ] Prepaid | |
- [ ] Collect | |
- [x] 3rd Party | |
- [ ] Master BOL with underlying BOLs | |
**Please note:** Email carrier invoices, BOL, POD, any accessorial document to <[email protected]>. Additional accessorials MUST be approved by Freightera Dispatch at email <[email protected]> or call (800) 886-4870 Ext. 2. | |
--- | |
## Special Instructions | |
- **Shipper:** Please call 15 minutes before pickup | |
--- | |
## Handling Unit | |
| Qty | Type | Wt | Hzmt | Non-Stackable? | Description | NMFC | Class |
|-----|------|----|------|----------------|-------------|------|-------|
| 1 | Pallet | 300 lb | No | No | Example Goods (48x40x48 in) | | |
**TOTAL WEIGHT 300 Lbs** | |
--- | |
Where the rate is dependent on value, shippers are required to state specifically in writing the agreed or declared value on the property as follows: | |
"The agreed or declared value of property is specifically stated by shipper to be not exceeding _______ per _______." | |
Received, subject to individually determined rates or contracts that have been agreed upon in writing between carrier and shipper, if applicable, otherwise to the rates, classification and rules that have been established by the carrier and are available to the shipper, on request, and to all applicable state and federal regulations. | |
**Note:** Liability Limitation for loss or damage in this shipment may be applicable. See 49 U.S.C. -14706(c)(1)(A) and (B). | |
--- | |
This is to certify the above named materials are properly classified, packaged, marked, and labeled, and are in proper condition for transportation according to the applicable regulations of the DOT. | |
| Freight Loaded | Freight Counted | | |
|----------------|-----------------| | |
| By Shipper | By Shipper | | |
| By Driver | By Driver/pallets said to contain | | |
| | By Driver Pieces| | |
Carrier acknowledges receipt of packages and required placards. Carrier certifies emergency response information was made available and/or carrier has DOT emergency response guidebook or equivalent documentation in the vehicle. Property described above is in good order, except as noted. | |
**Shipper Signature** ______________________ **Date** ___________ | |
**Carrier Signature** ______________________ **Pickup Date** ___________ | |
================================================ | |
FILE: shared/outputs/0012.md | |
================================================ | |
# UniformSoftware | |
**transport business name** | |
P.O. Box xxx, City | |
State / Province | |
ABN xx xxx xxx xxx | |
Tel: xx xxxx xxxx | |
--- | |
## INVOICE | |
| Client | Name | | |
|--------|------| | |
| | Address | | |
| | City State ZIP | | |
| | Email | | |
--- | |
| Date | Cust Ref | From | To | Descrip | Cubic | Weight | Rate | Fuel Levy | Extras | Total | | |
|-----------|----------|-------------|------------|---------|-------|--------|-------|-----------|--------|-------| | |
| 1/1/2017 | ref#1 | MASCOT | SYDENHAM | 4PCS | 0.500 | 187 | $11.00| $0.88 | | $11.00| | |
| 2/2/2017 | ref#2 | ALEXANDRIA | WARRIEWOOD | 4PCS | 5.000 | 1086 | $12.00| $0.96 | | $12.00| | |
| 3/3/2017 | ref#3 | EAST BOTANY | WARRIEWOOD | 1PC | 0.600 | 117 | $13.00| $1.04 | | $13.00| | |
| 4/4/2017 | ref#4 | SYDENHAM | WARRIEWOOD | 4PCS | 0.700 | 1317 | $14.00| $1.12 | | $14.00| | |
| 5/5/2017 | ref#1 | PORT BOTANY | WARRIEWOOD | 3PCS | 0.500 | 102 | $15.00| $1.20 | | $15.00| | |
| 6/6/2017 | ref#2 | PORT BOTANY | EAST BOTANY| 7PCS | 0.300 | 102 | $16.00| $1.28 | | $16.00| | |
| 7/7/2017 | ref#3 | PORT BOTANY | WARRWOOD | 4PCS | 6.940 | 1659 | $17.00| $1.36 | | $17.00| | |
| 8/8/2017 | ref#4 | PORT BOTANY | SOMERSBY | 6PCS | 4.600 | 6459 | $18.00| $1.44 | | $18.00| | |
| 9/9/2017 | ref#1 | MASCOT | TEMPE | 15PCS | 1.700 | 821 | $19.00| $1.52 | | $19.00| | |
| 5/15/2017 | ref#2 | ALEXANDRIA | WARRIEWOOD | 44PCS | 1.480 | 374 | $20.00| $1.60 | | $20.00| | |
--- | |
**Please remit payment to:** | |
Bank: bank name | |
BSB: xxx-xxx | |
A/c No.: xxxxxxxxxx | |
Name: account name | |
--- | |
| SUBTOTAL | 155.00 | | |
|----------|--------| | |
| GST 10.00% | 15.50 | | |
| TOTAL | 182.90 | | |
================================================ | |
FILE: shared/outputs/0013.md | |
================================================ | |
## Executive Summary | |
### Key Observations | |
1. **Concentration risk in Microsoft Corp (MSFT) and Apple Inc (AAPL), which combine for 10.6% of the equity allocation.**
2. **International equity allocation falls below Clark Capital’s target range.** | |
3. **Mid cap stocks overweight, leaving Small cap stocks underweight.** | |
4. **Overweight Healthcare, leaving Consumer Discretionary underweight relative to the benchmark weight.** | |
5. **Fixed Income has a shorter duration than current Clark Capital positioning, limiting income generation potential.**
6. **Fixed Income has a concentrated maturity schedule between 0-3 years.** | |
- **Portfolio Value: $545k** | |
- **Allocation: 82%/18% Stocks/Bonds** | |
- **Profile: Growth** | |
### Asset Allocation | |
- **Cash:** 5.75% | |
- **US Stocks:** 78.65% | |
- **Non-US Stocks:** 3.45% | |
- **Bonds:** 12.12% | |
- **Other/Not Clsfd:** 0.03% | |
### Portfolio Construction | |
- **Cash** | |
- **ETFs** | |
- **Mutual Funds** | |
- **Individual Stocks** | |
**Total Stock Holdings: 634** | |
**Total Bond Holdings: 9,748** | |
--- | |
*For one-on-one use with a client’s financial advisor only. Please see end disclosures for important information.* | |
Page 3 | |
## Your Equity Allocation – 82% | |
### Key Observations | |
1. **Individual Stocks 96%, ETFs 4%** | |
2. **Size:** Mid cap stocks overweight, leaving Small cap stocks underweight | |
3. **Sectors:** Overweight Healthcare, leaving Consumer Discretionary underweight relative to the benchmark weight. | |
4. **International:** 4% of equity – Lower than Clark Capital’s target range of 25%-35%. | |
5. **Direct and indirect stock holdings in the portfolio total over 600.** | |
### Diversification Analysis | |
#### Some Portfolio Overlap – Specific Concentration Risk in MSFT and AAPL | |
1. Owning multiple funds does not always produce the anticipated diversification benefits. Several securities (e.g. Microsoft, Apple, Meta Platforms) are held directly and by an additional fund. | |
2. Fund overlap exacerbates concentration risk within the portfolio: Microsoft Corp (MSFT) and Apple Inc (AAPL) combine for 10.6% of the equity allocation, creating excessive exposure to single-stock fluctuations.
--- | |
**Equity Style** | |
| | Value | Blend | Growth | | |
|--------|-------|-------|--------| | |
| Large | 20 | 13 | 32 | | |
| Mid | 15 | 15 | 2 | | |
| Small | 2 | 1 | 0 | | |
--- | |
**Sectors:** | |
- **Cyclical** | |
- Basic Matls: 3.00% (Bmark 2.43%) | |
- Consumer Cycl: 5.10% (Bmark 11.01%) | |
- Financial Svs: 10.90% (Bmark 12.77%) | |
- Real Estate: 1.62% (Bmark 2.52%) | |
- **Sensitive** | |
- Commun Svs: 11.36% (Bmark 8.39%) | |
- Energy: 3.37% (Bmark 3.91%) | |
- Industrials: 9.95% (Bmark 8.72%) | |
- Technology: 27.90% (Bmark 28.95%) | |
- **Defensive** | |
- Consumer Def: 6.75% (Bmark 6.24%) | |
- Healthcare: 16.56% (Bmark 12.68%) | |
- Utilities: 3.49% (Bmark 2.38%) | |
- **Not Classified:** 0.00% (Bmark 0.00%) | |
**Geographic:** | |
- **Americas** | |
- Portfolio: 96.86% (Bmark 95.30%) | |
- North America: 96.55% (Bmark 95.31%) | |
- Latin America: 0.31% (Bmark 0.00%) | |
- **Greater Europe** | |
- Portfolio: 2.25% (Bmark 3.24%) | |
- United Kingdom: 0.15% (Bmark 0.65%) | |
- Europe-Developed: 1.96% (Bmark 2.56%) | |
- Europe-Emerging: 0.00% (Bmark 0.00%) | |
- Africa/Middle East: 0.14% (Bmark 0.03%) | |
- **Greater Asia** | |
- Portfolio: 0.89% (Bmark 1.45%) | |
- Japan: 0.23% (Bmark 0.94%) | |
- Australasia: 0.00% (Bmark 0.32%) | |
- Asia-Developed: 0.51% (Bmark 0.19%) | |
- Asia-Emerging: 0.15% (Bmark 0.00%) | |
- **Not Classified:** 0.00% (Bmark 0.00%) | |
--- | |
*Benchmark indicated is automatically customized by Morningstar based on the broad asset allocation of your portfolio. For benchmark detail, please see information in end disclosures.* | |
--- | |
For one-on-one use with a client’s financial advisor only. Please see end disclosures for important information. | |
# Tax Transition/Overlap Analysis | |
**Objective** | |
Distribute realized gains out over multiple calendar years | |
**Market Value: $138,494** | |
**Unrealized Gains: $19,943** | |
| Security Name | Ticker | Units | Cost | Value | Gain/Loss | 2024 | 2025 | 2026 | | |
|----------------------------------------|--------|-------|--------|--------|-----------|------|------|------| | |
| PFIZER INC | PFE | 23.00 | $1,114.91 | $662 | ($453) | ($453) | | | | |
| VERIZON COMMUNICATIONS INC | VZ | 24.00 | $1,371.14 | $905 | ($466) | ($466) | | | | |
| YUM CHINA HOLDINGS INC | YUMC | 16.00 | $956.83 | $679 | ($278) | ($278) | | | | |
| FOX CORP CL A | FOXA | 28.00 | $1,132.52 | $830 | ($302) | ($302) | | | | |
| ROBERT HALF INC | RHI | 9.00 | $654.92 | $391 | ($264) | ($264) | | | | |
| BIO RAD LABS INC CL A | BIO | 2.00 | $819.31 | $645 | ($174) | ($174) | | | | |
| MEDTRONIC PLC | MDT | 16.00 | $1,560.10 | $1,238 | ($323) | ($323) | | | | |
| MODERNA INC | MRNA | 13.00 | $1,570.51 | $1,293 | ($278) | ($278) | | | | |
| HF SINCLAIR CORP | DINO | 14.00 | $908.61 | $778 | ($131) | ($131) | | | | |
| RIO TINTO PLC SPONSORED ADR | RIO | 9.00 | $770.59 | $670 | ($100) | ($100) | | | | |
| ARCHER DANIELS MIDLAND COMPANY | ADM | 16.00 | $1,303.25 | $1,156 | ($146) | ($146) | | | | |
| ISHARES 5-10 YEAR IG CORP BOND ETF | IGIB | 118.00 | $6,905.54 | $6,136 | ($770) | ($770) | | | | |
| ISHARES 3-7YR TREASURY BOND ETF | IEI | 79.00 | $8,181.57 | $7,953 | ($228) | ($228) | | | | |
| NEXSTAR MEDIA GROUP INC | NXST | 7.00 | $1,198.95 | $1,097 | ($102) | ($102) | | | | |
| AFFILIATED MANAGERS GROUP INC | AMG | 4.00 | $843.50 | $805 | ($38) | ($38) | | | | |
| COGNIZANT TECHNOLOGY SOLUTIONS CORP CL A | CTSH | 17.00 | $1,341.30 | $1,284 | ($57) | ($57) | | | | |
| LABORATORY CORP OF AMER HOLDINGS NEW | LH | 6.00 | $1,415.64 | $1,364 | ($52) | ($52) | | | | |
| UNUM GROUP | UNM | 32.00 | $1,456.47 | $1,209 | ($247) | ($209) | | | | |
| PIMCO MORTGAGE OPPTY'S & BOND INSTL CL | PMZIX | 311.55 | $2,930.57 | $2,957 | $26 | $26 | | | | |
| NVIDIA CORP | NVDA | 5.00 | $2,477.52 | $2,576 | $98 | $98 | | | | |
| PIMCO ENHANCED SHORT MATURITY ACTIVE ETF | MINT | 117.00 | $11,631.26 | $11,674 | $44 | $44 | | | | |
| ARCH CAPITAL GROUP LTD | ACGL | 7.00 | $393.87 | $418 | $24 | $24 | | | | |
| CONSOLIDATED EDISON INC | ED | 17.00 | $52.45 | $76 | $24 | $24 | | | | |
| KINGSWAY FINL SUPERMATION HOLDINGS INC | KNDX | 1.00 | $92.15 | $100 | $8 | $8 | | | | |
| TAIWAN SEMICON MFG CO LTD SPON ADR | TSM | 20.00 | $2,213.93 | $2,287 | $74 | $74 | | | | |
| CISCO SYSTEMS INC | CSCO | 38.00 | $1,381.08 | $1,970 | $589 | $589 | | | | |
| ELECTRONIC ARTS INC | EA | 7.00 | $912.42 | $945 | $33 | $33 | | | | |
| DEVON ENERGY CORP NEW | DVN | 14.00 | $803.98 | $835 | $31 | $31 | | | | |
| MANULIFE FINANCIAL CORP | MFC | 81.00 | $1,561.98 | $1,592 | $30 | $30 | | | | |
| PROCTER & GAMBLE CO | PG | 9.00 | $1,284.62 | $1,484 | $200 | $200 | | | | |
| TEXAS INSTRUMENTS INC | TXN | 22.00 | $3,540.00 | $3,840 | $300 | $300 | | | | |
| GILEAD SCIENCES INC | GILD | 36.00 | $2,633.50 | $2,916 | $283 | $283 | | | | |
**Tax Transition** | |
This approach illustrates how Clark Capital would attempt to maximize the amount of assets immediately managed within a proposed investment, while spreading realized gains out over time. Gain estimates are relevant to incoming securities only and do not reflect gain/loss from regular trading of Clark Capital investments. | |
In the first year it is possible to target specific tickers for liquidation/incorporation, a percentage of the remaining unrealized gains, or a dollar value. Subsequent years will target identified tickers, unless otherwise indicated. The approach demonstrated here targets specific tickers for liquidation/incorporation into our investment models. | |
Positions that are held are monitored on an ongoing basis in partnership with Clark Capital and the financial advisor. If there is a desire to liquidate ahead of schedule, client direction would be required. | |
Gain/loss estimates are based on cost basis data provided to Clark Capital. Actual gains/loss at time of liquidation will vary. Upon arrival, an updated proposed tax transition plan will be prepared and discussed with the financial advisor and Clark Capital’s Tax Transition Specialist. **The final plan will likely vary from the illustration shown here.** | |
--- | |
**As of 1/31/2023** | |
For one-on-one use with a client’s financial advisor only. Please see end disclosures for important information. | |
================================================ | |
FILE: shared/outputs/0014.md | |
================================================ | |
# Portfolio Slicer - Dashboard | |
_Last quote update: 2013-11-30 05:20:51 ET_ | |
## Portfolio | |
- Hers | |
- Hers-Tax | |
- His-CDN | |
- His-Tax | |
- Joint | |
## Report Currency | |
- **Original** | |
- CAD | |
- USD | |
--- | |
### Total Value | |
**597,298** | |
- Profit / Loss: 75,651 (14.51%) | |
- Capital Gain: 66,087 (12.45%) | |
- Dividends: 9,564 (2.06%) | |
- Deposits / Withdrawals: 68,800 | |
- Exchange Rate Impact: 22,886 (3.98%) | |
- Mgmt Fee: 1,031 (0.17%) | |
### Allocation Target | |
| Allocation | Target | Actual | $ | | |
|-----------------|--------|--------|-----------| | |
| US Broad | 30% | 35% | 211,409 | | |
| CDN Broad | 32% | 32% | 191,530 | | |
| Real Estate | 10% | 9% | 53,902 | | |
| Emerging | 10% | 9% | 53,902 | | |
| Near Cash | 5% | 5% | 29,865 | | |
| Gold | 5% | 4% | 23,868 | | |
| Cash | 5% | 5% | 29,825 | | |
| **Grand Total** | **100%** | **100%** | **594,659** |
### Allocation | |
- US Broad: 35% | |
- CDN Broad: 32% | |
- Real Estate: 9% | |
- Emerging: 9% | |
- Near Cash: 5% | |
- Gold: 4% | |
- Cash: 5% | |
### Sectors | |
- Financial: 40% | |
- Industrials: 13% | |
- Technology: 9% | |
- Health Care: 7% | |
- Materials: 7% | |
- Consumer: 7% | |
- Real Estate: 6% | |
- Energy: 5% | |
- Other: 6% | |
### Sensitivity | |
- Cyclic: 21% | |
- Defensive: 15% | |
- Sensitive: 64% | |
### Currency | |
- CAD: 36% | |
- USD: 64% | |
--- | |
### Holdings % | |
- XFN.TO: 23% | |
- XIU.TO: 19% | |
- BND: 8% | |
- VWO: 7% | |
- VTI: 6% | |
- AMZN: 5% | |
- GLD: 5% | |
- UBA: 4% | |
- COST: 4% | |
- Other: 19% | |
### Top 10 Winners YTD | |
| Symbol | Profit/Loss | | |
|--------|-------------| | |
| XFN.TO | 26,337 | | |
| AMZN | 9,214 | | |
| VTI | 7,543 | | |
| PFE | 7,414 | | |
| MSFT | 5,143 | | |
| XIU.TO | 5,027 | | |
| C | 5,427 | | |
| AMD | 3,524 | | |
| DELL | 3,001 | | |
| COST | 2,791 | | |
### Top 10 Losers YTD | |
| Symbol | Profit/Loss | | |
|--------|-------------| | |
| GLD | -8,264 | | |
| BND | -2,000 | | |
| VNQ | -1,561 | | |
| GTY | -714 | | |
| HD | -471 | | |
| UBA | 314 | | |
| WMT | 699 | | |
| GE | 1,218 | | |
### Top 10 Dividends YTD | |
| Symbol | Dividends | | |
|--------|-----------| | |
| XFN.TO | 3,192 | | |
| XIU.TO | 2,400 | | |
| UBA | 1,240 | | |
| PFE | 1,206 | | |
| VWO | 918 | | |
| MSFT | 610 | | |
| VTI | 494 | | |
| TGT | 422 | | |
| COST | 121 | | |
--- | |
## Portfolio Overview | |
| Portfolio | Deposits | Book Value | Equity Value | Cash Value | Total Value | Realized Cap Gain | Unrealized Cap Gain | Cap Gain | Dividends | Profit | Cap Gain Last Day | Y/Y Mgmt Fee % | | |
|-----------|----------|------------|--------------|------------|-------------|-------------------|---------------------|----------|-----------|--------|------------------|----------------| | |
| His-CDN | 161,200 | 152,431 | 24,122 | 214,801 | 38,248 | 38,248 | 38,248 | 15,353 | 15,353 | 53,601 | 667 | 0.43% | | |
| Hers | 95,000 | 70,707 | 13,658 | 103,659 | 939 | 10,165 | 10,165 | 15,767 | 7,926 | 2,110 | 0 | 0.00% | | |
| Joint | 70,200 | 70,200 | 10,365 | 80,565 | 1,515 | 1,515 | 1,515 | 10,757 | 7,926 | 2,110 | 0 | 0.00% | | |
| Hers-Tax | 70,200 | 70,200 | 10,365 | 80,565 | 1,515 | 1,515 | 1,515 | 10,757 | 7,926 | 2,110 | 0 | 0.00% | | |
| His-Tax | 24,200 | 22,888 | 3,477 | 26,365 | 1,454 | 1,454 | 1,454 | 1,454 | 1,454 | 1,454 | 0 | 0.00% | | |
| **Grand Total** | **458,400** | **437,948** | **542,638** | **54,659** | **597,298** | **1,454** | **104,690** | **106,144** | **32,754** | **138,898** | **2,155** | **0.17%** | | |
================================================ | |
FILE: shared/outputs/0015.md | |
================================================ | |
| 01,10,042001700112,07007198,286.23,20180323,12636666022332927910,1992, | | |
| --- | | |
| 二维码信息:湖北增值税普通发票 | | |
| 通行费 | | |
| 机器编号:499099660821 | | |
| 名称:武汉市车城物流有限公司 | | |
| 纳税人识别号:914201007483062457 | | |
| 地址、电话:武汉经济技术开发区车城大道7号 84289348 | | |
| 开户行及账号:中国农业银行股份有限公司武汉开发区支行 17-071201040004598 | | |
| 密码区:030243319>1*+9*239+></<59+3-786-646/16<248>/-029029>746*7>44<97*929379677-955315>*+-6/53<13+8*010369194565>-5/04 | | |
| 项目名称:*经营租赁*通行费 | | |
| 车牌号:鄂AHG248 | | |
| 类型:货车 | | |
| 通行日期起:20180212 | | |
| 通行日期止:20180212 | | |
| 金额:286.23 | | |
| 税率:3% | | |
| 税额:8.59 | | |
| 合计:¥286.23 | | |
| 价税合计(大写):贰佰玖拾肆元捌角贰分 | | |
| (小写):¥294.82 | | |
| 销售方名称:湖北随岳南高速公路有限公司 | | |
| 纳税人识别号:91420000753416406R | | |
| 地址、电话:武汉开发区C7C1地块东合中心B栋1601号 027-83458755 | | |
| 开户行及账号:民生银行武汉光谷口支行0514014170001889 | | |
| 备注: | | |
| 收款人:龙梦媛 | | |
| 复核:陈煜 | | |
| 开票人:尹晨 | | |
| 销售方(章):湖北随岳南高速公路有限公司 91420000753416406R 发票专用章 | | |
| 发票代码:042001700112 | | |
| 发票号码:07007198 | | |
| 开票日期:2018年03月23日 | | |
| 校验码:12636 66602 23329 27910 | | |
================================================ | |
FILE: shared/outputs/0016.md | |
================================================ | |
| Valori Nutrizionali/Nutrition Facts/ | per/per/pro/por | | |
|-----------------------------------|-----------------| | |
| Energia/Energy/Energie/Valor energético | Kj 2577/Kcal 616 | | |
| Grassi/Fat/Fett/Grasas | 49.9 g | | |
| di cui acidi grassi saturi/of which saturates/davon gesättigte Fettsäuren/de las cuales saturadas | 8.3 g | | |
| Carboidrati/Carbohydrate/Kohlenhydrate/Hidratos de carbono | 12.0 g | | |
| di cui zuccheri/of which sugars/davon Zucker/de los cuales azúcar | 5.1 g | | |
| Fibre/Fibre/Ballaststoffe/Fibra alimentaria | 8.3 g | | |
| Proteine/Protein/Eiweiß/Proteínas | 24.8 g | | |
| Sale/Salt/Salz/Sal | 0.0 g | | |
**IT Ingredienti:** 100% Arachidi | |
**EN Ingredients:** 100% Peanuts | |
**DE Zutaten:** 100% Erdnüsse | |
**ES Ingredientes:** 100% Cacahuetes | |
**220 g** | |
--- | |
**100% PEANUT** | |
**PEANUT BUTTER** | |
**ORIGIN:** Argentina | |
Può contenere tracce di altra frutta a guscio, soia, latte e sesamo/May contain traces of other nuts, soya, milk and sesame/Kann Spuren von anderen Nüssen, Soja, Milch und Sesam enthalten/Puede contener trazas de otros frutos secos, soja, leche y sésamo | |
Conservare in luogo fresco e asciutto/Store in a cool and dry place/Kühl und trocken lagern/Conservar en un lugar fresco y seco | |
Prodotto e confezionato per Bowlpros Srl v.le E. Caldara 24. 20122 Milano (MI) Italia nello stabilimento di via Ferrovia 110. | |
80040 San Gennaro Vesuviano (NA) | |
<[email protected]> | |
<www.bowlpros.com> | |
Da consumarsi preferibilmente entro il/Best before/Mindestens haltbar bis/Consumir preferentemente antes del | |
 | |
受渡期間: | |
- 受注後: 60日間 | |
- 見積有効期間: 90日間 | |
代金: | |
- 消費税(別途) | |
- 配線工事費(別途) | |
- 調整費(含む) | |
- 取付工事費(含む) | |
申受条件: | |
- 荷造運賃(含む) | |
〒812-0026 | |
福岡県福岡市博多区上川端町8-18 | |
TEL: 092-281-0020 FAX: 092-281-0112 | |
シチズンTIC株式会社 | |
--- | |
御見積金額 ¥712,000 - | |
| 品目コード | 製品名 | 数量 | 単位 | 単価 | 金額 | | |
|-------------|--------|------|------|------|------| | |
| KM-82TC-4P | 親時計4回線壁掛型 タイム・チャイム | 1 | 台 | 712,000 | 712,000 | | |
※設置工事費含む。 | |
※キャンペーン期間中の為設置工事費無料です。 | |
--- | |
| 標準価格計 | 712,000 | | |
|-------------|---------| | |
| 割引合計額 | | | |
| 総合計 | 712,000 | | |
--- | |
※受注製作品・特注品が含まれている場合、御発注後にキャンセル又は仕様変更が発生した場合別途費用を御請求させていただきます。御了承ください。 | |
受注製作品・特注品の内容に関しては営業担当者へ確認ください。 | |
1/1 | |
================================================ | |
FILE: shared/outputs/0018.md | |
================================================ | |
# UNITED STATES | |
# SECURITIES AND EXCHANGE COMMISSION | |
Washington, D.C. 20549 | |
# FORM 10-Q | |
(Mark One) | |
☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 | |
For the quarterly period ended September 30, 2024 | |
OR | |
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 | |
For the transition period from ______ to ______ | |
Commission File Number: 001-34756 | |
# Tesla, Inc.
(Exact name of registrant as specified in its charter) | |
| Texas | 91-2197729 |
|---|---|
| (State or other jurisdiction of incorporation or organization) | (I.R.S. Employer Identification No.) |
1 Tesla Road | |
Austin, Texas | |
(Address of principal executive offices) | |
78725 | |
(Zip Code) | |
(512) 516-8177 | |
(Registrant’s telephone number, including area code) | |
Securities registered pursuant to Section 12(b) of the Act: | |
| Title of each class | Trading Symbol(s) | Name of each exchange on which registered | | |
|---------------------|-------------------|------------------------------------------| | |
| Common stock | TSLA | The Nasdaq Global Select Market | | |
Indicate by check mark whether the registrant (1) has | |
[Table of Contents](#) | |
**PART I. FINANCIAL INFORMATION** | |
**ITEM 1. FINANCIAL STATEMENTS** | |
**Tesla, Inc.** | |
**Consolidated Balance Sheets** | |
*(in millions, except per share data)* | |
*(unaudited)* | |
| | September 30, 2024 | December 31, 2023 | | |
|---|---|---| | |
| **Assets** | | | | |
| **Current assets** | | | | |
| Cash and cash equivalents | $18,111 | $16,398 | | |
| Short-term investments | 15,537 | 12,696 | | |
| Accounts receivable, net | 3,313 | 3,508 | | |
| Inventory | 14,530 | 13,626 | | |
| Prepaid expenses and other current assets | 4,888 | 3,388 | | |
| **Total current assets** | 56,379 | 49,616 | | |
| Operating lease vehicles, net | 5,380 | 5,989 | | |
| Solar energy systems, net | 5,040 | 5,229 | | |
| Property, plant and equipment, net | 36,116 | 29,725 | | |
| Operating lease right-of-use assets | 4,867 | 4,180 | | |
| Digital assets, net | 184 | 184 | | |
| Intangible assets, net | 158 | 178 | | |
| Goodwill | 253 | 253 | | |
| Deferred tax assets | 6,486 | 6,733 | | |
| Other non-current assets | 4,989 | 4,531 | | |
| **Total assets** | $119,852 | $106,618 | | |
| **Liabilities** | | | | |
| **Current liabilities** | | | | |
| Accounts payable | $14,654 | $14,431 | | |
| Accrued liabilities and other | 10,601 | 9,080 | | |
| Deferred revenue | 3,031 | 2,864 | | |
| Current portion of debt and finance leases | 2,291 | 2,373 | | |
| **Total current liabilities** | 30,577 | 28,748 | | |
| Debt and finance leases, net of current portion | 5,405 | 2,857 | | |
| Deferred revenue, net of current portion | 3,350 | 3,251 | | |
| Other long-term liabilities | 9,810 | 8,153 | | |
| **Total liabilities** | 49,142 | 43,009 | | |
| Commitments and contingencies (Note 10) | | |
| Redeemable noncontrolling interests in subsidiaries | 70 | 242 |
| **Equity** | | | | |
| Stockholders’ equity | | | | |
| Preferred stock; $0.001 par value; 100 shares authorized; no shares issued and outstanding | — | — | | |
| Common stock; $0.001 par value; 6,000 shares authorized; 3,207 and 3,185 shares issued and outstanding as of September 30, 2024 and December 31, 2023, respectively | 3 | 3 | | |
| Additional paid-in capital | 37,286 | 34,892 | | |
| Accumulated other comprehensive loss | (14) | (143) | | |
| Retained earnings | 32,656 | 27,882 | | |
| **Total stockholders’ equity** | 69,931 | 62,634 | | |
| Noncontrolling interests in subsidiaries | 709 | 733 | | |
| **Total liabilities and equity** | $119,852 | $106,618 | | |
The accompanying notes are an integral part of these consolidated financial statements. | |
4 | |
# Tesla, Inc.
## Consolidated Statements of Operations | |
### (in millions, except per share data) | |
### (unaudited) | |
| | Three Months Ended September 30, 2024 | Three Months Ended September 30, 2023 | Nine Months Ended September 30, 2024 | Nine Months Ended September 30, 2023 |
|---|---|---|---|---|
| **Revenues** | | | | | | |
| Automotive sales | $18,831 | $18,582 | $53,821 | $57,879 | | |
| Automotive regulatory credits | 739 | 554 | 2,071 | 1,357 | | |
| Automotive leasing | 446 | 489 | 1,380 | 1,620 | | |
| **Total automotive revenues** | 20,016 | 19,625 | 57,272 | 60,856 | | |
| Energy generation and storage | 2,376 | 1,559 | 7,025 | 4,597 | | |
| Services and other | 2,790 | 2,166 | 7,686 | 6,153 | | |
| **Total revenues** | 25,182 | 23,350 | 71,983 | 71,606 | | |
| **Cost of revenues** | | | | | | |
| Automotive sales | 15,743 | 15,656 | 45,602 | 47,919 | | |
| Automotive leasing | 247 | 301 | 761 | 972 | | |
| **Total automotive cost of revenues** | 15,990 | 15,957 | 46,363 | 48,891 | | |
| Energy generation and storage | 1,651 | 1,178 | 5,157 | 3,770 | | |
| Services and other | 2,544 | 2,037 | 7,192 | 5,723 | | |
| **Total cost of revenues** | 20,185 | 19,172 | 58,712 | 58,384 | | |
| **Gross profit** | 4,997 | 4,178 | 13,271 | 13,222 | | |
| **Operating expenses** | | | | | | |
| Research and development | 1,039 | 1,161 | 3,264 | 2,875 | | |
| Selling, general and administrative | 1,186 | 1,253 | 3,837 | 3,520 | | |
| Restructuring and other | 55 | - | 677 | - | | |
| **Total operating expenses** | 2,280 | 2,414 | 7,778 | 6,395 | | |
| **Income from operations** | 2,717 | 1,764 | 5,493 | 6,827 | | |
| Interest income | 429 | 282 | 1,127 | 733 | | |
| Interest expense | (92) | (38) | (254) | (95) | | |
| Other (expense) income, net | (270) | 37 | (142) | 317 | | |
| **Income before income taxes** | 2,784 | 2,045 | 6,224 | 7,782 | | |
| Provision for income taxes | 601 | 167 | 1,403 | 751 | | |
| **Net income** | 2,183 | 1,878 | 4,821 | 7,031 | | |
| Net income (loss) attributable to noncontrolling interests and redeemable noncontrolling interests in subsidiaries | 16 | 25 | 47 | (38) | | |
| **Net income attributable to common stockholders** | $2,167 | $1,853 | $4,774 | $7,069 | | |
| **Net income per share of common stock attributable to common stockholders** | | | | | | |
| Basic | $0.68 | $0.58 | $1.51 | $2.23 | | |
| Diluted | $0.62 | $0.53 | $1.38 | $2.03 | | |
| Weighted average shares used in computing net income per share of common stock | | | | | | |
| Basic | 3,198 | 3,176 | 3,192 | 3,171 | | |
| Diluted | 3,497 | 3,493 | 3,489 | 3,481 | | |
The accompanying notes are an integral part of these consolidated financial statements. | |
5 | |
# Tesla, Inc.
## Consolidated Statements of Comprehensive Income | |
### (in millions) | |
### (unaudited) | |
| | Three Months Ended September 30, 2024 | Three Months Ended September 30, 2023 | Nine Months Ended September 30, 2024 | Nine Months Ended September 30, 2023 |
|---|---|---|---|---|
| **Net income** | $ 2,183 | $ 1,878 | $ 4,821 | $ 7,031 | | |
| **Other comprehensive income (loss):** | | | | | | |
| Foreign currency translation adjustment | 445 | (289) | 121 | (343) | | |
| Unrealized net gain on investments, net of tax | 8 | 7 | 8 | 8 | | |
| Net loss realized and included in net income | — | — | — | 4 | | |
| **Comprehensive income** | 2,636 | 1,596 | 4,950 | 6,700 | | |
| **Less: Comprehensive income (loss) attributable to noncontrolling interests and redeemable noncontrolling interests in subsidiaries** | 16 | 25 | 47 | (38) | | |
| **Comprehensive income attributable to common stockholders** | $ 2,620 | $ 1,571 | $ 4,903 | $ 6,738 | | |
The accompanying notes are an integral part of these consolidated financial statements. | |
--- | |
6 | |
Table of Contents | |
--- | |
# Tesla, Inc.
## Consolidated Statements of Cash Flows
### (in millions)
### (unaudited)
| | Nine Months Ended September 30, 2024 | Nine Months Ended September 30, 2023 |
|---|---|---|
| **Cash Flows from Operating Activities** | | |
| Net income | $ 4,821 | $ 7,031 |
| Adjustments to reconcile net income to net cash provided by operating activities: | | |
| Depreciation, amortization and impairment | 3,872 | 3,435 |
| Stock-based compensation | 1,420 | 1,328 |
| Inventory and purchase commitments write-downs | 247 | 361 |
| Foreign currency transaction net unrealized loss (gain) | 197 | (317) |
| Deferred income taxes | 418 | (316) |
| Non-cash interest and other operating activities | 83 | 94 |
| Changes in operating assets and liabilities: | | |
| Accounts receivable | 144 | 377 |
| Inventory | (1,107) | (1,953) |
| Operating lease vehicles | (82) | (1,858) |
| Prepaid expenses and other assets | (2,639) | (1,992) |
| Accounts payable, accrued and other liabilities | 2,504 | 1,922 |
| Deferred revenue | 231 | 774 |
| Net cash provided by operating activities | 10,109 | 8,886 |
| **Cash Flows from Investing Activities** | | |
| Purchases of property and equipment excluding finance leases, net of sales | (8,556) | (6,592) |
| Purchases of solar energy systems, net of sales | (6) | — |
| Purchases of investments | (20,797) | (13,221) |
| Proceeds from maturities of investments | 17,975 | 8,959 |
| Proceeds from sales of investments | 200 | 138 |
| Business combinations, net of cash acquired | — | (64) |
| Net cash used in investing activities | (11,184) | (10,780) |
| **Cash Flows from Financing Activities** | | |
| Proceeds from issuances of debt | 4,360 | 2,526 |
| Repayments of debt | (1,783) | (887) |
| Proceeds from exercises of stock options and other stock issuances | 788 | 548 |
| Principal payments on finance leases | (291) | (340) |
| Debt issuance costs | (6) | (23) |
| Distributions paid to noncontrolling interests in subsidiaries | (76) | (105) |
| Payments for buy-outs of noncontrolling interests in subsidiaries | (124) | (17) |
| Net cash provided by financing activities | 2,868 | 1,702 |
| Effect of exchange rate changes on cash and cash equivalents and restricted cash | (8) | (142) |
| Net increase (decrease) in cash and cash equivalents and restricted cash | 1,785 | (334) |
| Cash and cash equivalents and restricted cash, beginning of period | 17,189 | 16,924 |
| Cash and cash equivalents and restricted cash, end of period | $ 18,974 | $ 16,590 |
| **Supplemental Non-Cash Investing and Financing Activities** | | |
| Acquisitions of property and equipment included in liabilities | $ 2,727 | $ 1,717 |
| Leased assets obtained in exchange for finance lease liabilities | $ 32 | $ 1 |
| Leased assets obtained in exchange for operating lease liabilities | $ 1,232 | $ 1,548 |
The accompanying notes are an integral part of these consolidated financial statements. | |
--- | |
9 | |
================================================ | |
FILE: shared/outputs/0019.md | |
================================================ | |
[Walmart] | |
**See back of receipt for your chance to win $1000 ID#: TN5NV1VXCQDQ** | |
317-851-1102 Mgr: JAMIE BROOKSHIRE
882 S. STATE ROAD 135 | |
GREENWOOD, IN 46143 | |
| Item Code | Description | Qty | Price | | |
|---------------|---------------------------|-----|-------| | |
| 05483 | TATER TOTS | | 2.96 | | |
| 0071436 | OPI | | 1.88 | | |
| 001320020062 | F | | 5.88 | | |
| 003120065 | SNACK BARS | | 5.88 | | |
| 003120248164 | HRI CL CHS | | 5.88 | | |
| 003120065000 | HRI CL CHS | | 5.88 | | |
| 001201254 | VOIDED ENTRY | | | | |
| 003120053000 | HRI 12 U SG | | 5.88 | | |
| 0074316 | PEANUT BUTTER | | 3.18 | | |
| 001376420528 | ACCESSORY | | 2.96 | | |
| 0000000000 | BTS DRY BLON | | 3.28 | | |
| 002246021090 | TR HS FRM 4 | | 32.00 | | |
| 003178201896 | GV SLIDERS | | 2.74 | | |
| 003178201526 | BAGELS | | 2.50 | | |
| 003178201286 | CHEEZE IT | | 4.00 | | |
| 003201563 | RITZ WLS 4.5 | | 2.78 | | |
| 004800078 | RUFFLES | | 2.78 | | |
| 004800092914 | GV HNY GRMS | | 2.50 | | |
--- | |
**SUBTOTAL**: 139.24 | |
**TAX 1**: 7.00% | |
**TOTAL**: 141.02 | |
**CASH TEND**: 150.00 | |
**CHANGE DUE**: 6.00 | |
**# ITEMS SOLD**: 26 | |
**TC#**: 0783 6080 4072 3416 2495 6 | |
**Date**: 04/27/19 | |
**Time**: 12:59:46 | |
--- | |
Scan with Walmart app to save receipts | |
================================================ | |
FILE: shared/outputs/0020.md | |
================================================ | |
# ZESTADO EXPRESS | |
**ABN:** 16 112 221 123 | |
Mercure Tower, Floors 1 – 10 | |
10 Queen Street, King George Square | |
Brisbane City QLD 4000 | |
Australia | |
**Bill To:** | |
Custom Board Makers | |
Administration Centre | |
12 Salvage Road | |
Acacia Ridge BC QLD 4110 | |
Australia | |
**SHIPPING INVOICE 10112** | |
Issue Date: 8th December 2021 | |
Account No: 101234 | |
Invoice Amount: $5,270.00 AUD | |
Please Pay By: 15th December 2021 | |
**Page: 1 of 2** | |
--- | |
## Waybill No: 012345A | |
### Main Shipping Information | |
**Sender** | |
**Customer Reference:** GC12345 | |
Booking Contact: Zac Rider | |
Booking Phone: 61 7 4321 1234 | |
Email: <[email protected]> | |
Type of Goods: Surfboards | |
Total Pieces: 360 | |
Gross Weight: 1,010 kg | |
Place of Discharge: Port of Brisbane | |
Shipped Date: 12-Nov-2021 | |
Place of Delivery: Kahului Maui Hawaii Port | |
Delivered Date: 23-Nov-2021 | |
**Recipient** | |
Maui Surf Shop | |
12 Haleakala Drive | |
West Maui | |
Lahaina HI 96761 | |
United States | |
**Description of Shipped Items** | |
| Description | Qty | Dimensions | | |
|-------------|-----|------------| | |
| 4321-A1 XL Custom Supreme Light Stand Up | 150 | 13' x 18" x 3' | | |
| 4421-B1 Custom Deluxe Longboards | 120 | 10' x 20" x 3' | | |
| 4231-C1 Eco High Performance Mini Mals | 90 | 7' x 18" x 2' | | |
**Description of Charges** | |
| Description | Amount | | |
|-------------|--------| | |
| Ocean Freight Charge | $1,500.00 | | |
| Insurance Cover | $250.00 | | |
| Terminal Handling | $55.00 | | |
**Charges** | |
| Description | Amount | | |
|-------------|--------| | |
| Customs Tax | $100.00 | | |
| Customs Duties | $75.00 | | |
**TOTAL** | |
$1,980.00 | |
--- | |
## Waybill No: 012346B | |
### Main Shipping Information | |
**Sender** | |
**Customer Reference:** SC12367 | |
Booking Contact: Reece Gnarly | |
Booking Phone: 61 7 4331 1234 | |
Email: <[email protected]> | |
Type of Goods: Surfboards | |
Total Pieces: 225 | |
Gross Weight: 850 kg | |
Place of Discharge: Port of Brisbane | |
Shipped Date: 12-Nov-2021 | |
Place of Delivery: Port of Long Beach | |
Delivered Date: 5-Dec-2021 | |
**Recipient** | |
Long Beach Surf Shop | |
150 Foam Avenue | |
Shark Beach CA 90760 | |
United States | |
**Description of Shipped Items** | |
| Description | Qty | Dimensions | | |
|-------------|-----|------------| | |
| 4321-A1 XL Custom Supreme Light Stand Up | 150 | 13' x 18" x 3' | | |
| 4123-D1 Finnastic Funboards | 75 | 7' x 18" x 2' | | |
**Description of Charges** | |
| Description | Amount | | |
|-------------|--------| | |
| Ocean Freight Charge | $980.00 | | |
| Terminal Handling | $55.00 | | |
**Charges** | |
| Description | Amount | | |
|-------------|--------| | |
| Customs Tax | $100.00 | | |
| Customs Duties | $75.00 | | |
**TOTAL** | |
$1,210.00 | |
--- | |
Please login to your [shipping portal](#) to pay for the invoice. | |
**Thank you for your business!** | |
================================================ | |
FILE: shared/outputs/0021.md | |
================================================ | |
**PRAWO JAZDY** | |
**RZECZPOSPOLITA POLSKA** | |
1\. BRANDT<br> | |
2\. MARTIN<br> | |
3\. 27.06.1988 CRIVITZ<br> | |
4a\. 24.04.2019<br> | |
4b\. 10.09.2033<br> | |
4c\. STAROSTA POLICKI<br> | |
4d\. 880627<br> | |
5\. 00359/19/3211<br> | |
7\. | |
8806670172 | |
9\. AM/B1/B | |
PL | |
POLSKA | |
================================================ | |
FILE: shared/outputs/0022.md | |
================================================ | |
**Ohio** | |
**DRIVER LICENSE** | |
**DL** | |
**Class D** | |
TED STRICKLAND, GOVERNOR | |
Mike Rankin, Registrar BMV | |
9900TL5467900302 | |
USA | |
1\. PUBLIC<br> | |
2\. JANE Q<br> | |
5\. 1970 W BROAD ST<br> | |
COLUMBUS, OH 43223<br> | |
4d\. LICENSE NO.<br> | |
TL545786<br> | |
3\. BIRTHDATE<br> | |
07-09-1962<br> | |
4a\. ISSUE DATE<br> | |
04-01-2009<br> | |
9\. CLASS<br> | |
D<br> | |
4b\. EXPIRES<br> | |
07-09-2012<br> | |
9a\. ENDORS<br> | |
12\. RESTR<br> | |
A<br> | |
07-09-1962 | |
**Signature** | |
15\. Sex: F<br> | |
16\. Ht: 5-08<br> | |
17\. Wt: 130<br> | |
18\. Eyes: BRO<br> | |
19\. Hair: BRO<br> | |
**ORGAN DONOR OHIO** | |
HEALTHCARE<br> | |
POWER OF ATTY<br> | |
LIFE SUSTAINING<br> | |
EQUIPMENT<br> | |
================================================ | |
FILE: shared/outputs/0023.md | |
================================================ | |
# DRIVER LICENSE | |
## Tennessee | |
### THE VOLUNTEER STATE | |
USA | |
TN | |
**DL NO.** 123456789 | |
**EXP** 02/11/2026 | |
**DOB** 02/11/1974 | |
**ISS** 02/11/2019 | |
**CLASS** D | |
**END** NONE | |
**REST** 01 | |
**SEX** F | |
**HGT** 5'-05" | |
**EYES** BLU | |
**DD** 1234567890123456 | |
**SAMPLE** | |
**JANICE** | |
123 MAIN STREET | |
APT. 1 | |
NASHVILLE, TN 37210 | |
**Janice Sample** | |
**DL** | |
================================================ | |
FILE: shared/outputs/0024.md | |
================================================ | |
# CALIFORNIA | |
## EXPIRES ON BIRTHDAY | |
**1986** | |
## DRIVER LICENSE | |
- N8685798 | |
- Michael Joe Jackson | |
- 4641 Hayvenhurst | |
- Los Angeles Ca 91316 | |
| SEX | HAIR | EYES | HEIGHT | WEIGHT | DATE OF BIRTH | | |
| --- | ---- | ---- | ------ | ------ | ------------- | | |
| M | Blk | Brn | 5-9 | 120 | 8-29-58 | | |
**PRE LIC EXP** 85 | |
**OTHER ADDRESS** CLASS 3 | |
**MUST WEAR CORRECTIVE LENSES** □ | |
**SEE OVER FOR ANY OTHER CONDITIONS** | |
**SECTION 12804 VEHICLE CODE** | |
**X** Michael Joe Jackson | |
4-28-83 | clckjw | DMV | |
**DO NOT LAMINATE** | |
**AHIJ** | |
================================================ | |
FILE: shared/outputs/0025.md | |
================================================ | |
# NEW YORK STATE USA | |
## LEARNER PERMIT | |
### UNDER 21 | |
**Mark J.F. Schroeder** | |
Commissioner of Motor Vehicles | |
**ID** 987 654 321 | |
**Class** DJ | |
**Sex** F | |
**Eyes** BLU | |
**Height** 5'-08" | |
**DOB** 10/31/2003 | |
**Issued** 03/07/2022 | |
**Expires** 10/31/2026 | |
**E** NONE | |
**R** NONE | |
**Michelle M. Motorist** | |
**MOTORIST** | |
MICHELLE, MARIE | |
2345 ANYWHERE STREET | |
ALBANY, NY 12222 | |
**U18 UNTIL** OCT 21 | |
**U21 UNTIL** OCT 24 | |
**Organ Donor** | |
**OCT 31 03 123456789** | |
================================================ | |
FILE: shared/outputs/0026.md | |
================================================ | |
# California USA | DRIVER LICENSE | |
**DL** 11234568 | |
**EXP** 08/31/2014 | |
**LN** CARDHOLDER | |
**FN** IMA | |
2570 24TH STREET | |
ANYTOWN, CA 95818 | |
**DOB** 08/31/1977 | |
**RSTR** NONE | |
**CLASS** C | |
**END** NONE | |
**DONOR** | |
**VETERAN** | |
**SEX** F | |
**HGT** 5'-05" | |
**HAIR** BRN | |
**WGT** 125 lb | |
**EYES** BRN | |
**DD** 00/00/0000NNNAN/ANFD/YY | |
**ISS** 08/31/2009 | |
**Signature:** Ima Cardholder | |
**0831977** | |
================================================ | |
FILE: shared/outputs/0027.md | |
================================================ | |
# Pennsylvania | IDENTIFICATION CARD | |
visitPA.com | USA | |
**NOT FOR REAL ID PURPOSES** | |
**4d IDN:** 99 999 999 | |
**DUPS:** 00 | |
**3 DOB:** 01/07/1973 | |
**1** SAMPLE | |
**2** ANDREW JASON | |
**8** 123 MAIN STREET | |
APT. 1 | |
HARRISBURG, PA 17101-0000 | |
**4b EXP:** 01/31/2026 | |
**4a ISS:** 01/07/2022 | |
**15 SEX:** M | |
**18 EYES:** BRO | |
**16 HGT:** 5'-11" | |
**5 DD:** 1234567890123 | |
456789012345 | |
**ID** | |
**❤️ ORGAN DONOR** | |
**SAMPLE** | |
**Andrew Sample** | |
================================================ | |
FILE: shared/outputs/0028.md | |
================================================ | |
# CALIFORNIA LICENSE | |
**EXPIRES ON BIRTHDAY** | |
**1970** | |
ISSUED IN ACCORDANCE WITH THE CALIFORNIA VEHICLE CODE | |
**Ronald J. Thomas** | |
DRIVERS LICENSE ADMINISTRATOR | |
**DRIVER** | |
David Franklin Thomas | |
5798 Olive St | |
Paradise, Calif 95969 | |
**W106438** | |
| SEX | COLOR HAIR | COLOR EYES | HEIGHT | WEIGHT | MARRIED | | |
| --- | ---------- | ---------- | ------ | ------ | ------- | | |
| M | Gry | Blu | 6-0 | 205 | Yes | | |
| DATE OF BIRTH | AGE | PREVIOUS LICENSE | | |
| ------------- | --- | ---------------- | | |
| Aug 20, 1892 | 72 | Calif | | |
MUST WEAR | |
Corrective Lenses ☑ | |
SEE OVER FOR ANY | |
OTHER CONDITIONS □ | |
OTHER | |
ADDRESS | |
X D. F. Thomas | |
CLASS 3. MAY DRIVE 2 AXLE VEHICLE, EXCEPT BUS DESIGNED FOR MORE THAN 15 PASSENGERS. MAY TOW VEHICLE LESS THAN 6,000 LBS. GROSS. | |
**Office** Paradise | |
**Date** 8-4-65 | |
**MUST BE CARRIED WHEN OPERATING A MOTOR VEHICLE AND WHEN APPLYING FOR RENEWAL** | |
================================================ | |
FILE: shared/outputs/0029.md | |
================================================ | |
**CALIFORNIA | DRIVER LICENSE** | |
MUST BE CARRIED WHEN OPERATING A MOTOR VEHICLE AND WHEN APPLYING FOR RENEWAL | |
**EXPIRES ON BIRTHDAY** | |
**1984** | |
- W0209369 | |
- James Scott Garner | |
- 35 Oakmont Dr | |
- Los Angeles CA 90049 | |
| SEX | HAIR | EYES | HEIGHT | WEIGHT | PRE LIC EXP | | |
| --- | ---- | ---- | ------ | ------ | ----------- | | |
| M | Blk | Brn | 6-3 | 210 | 80 | | |
**DATE OF BIRTH** 4-7-23 | |
**MUST WEAR CORRECTIVE LENSES** □ | |
**SEE OVER FOR ANY OTHER CONDITIONS** | |
**OTHER ADDRESS** CLASS 3 | |
**SECTION 12804 VEHICLE CODE** | |
**X** James S. Garner | |
3-25-80 | Gln rc | |
**DO NOT LAMINATE** | |
================================================ | |
FILE: shared/outputs/0030.md | |
================================================ | |
# CALIFORNIA | DRIVER LICENSE | |
**MUST BE CARRIED WHEN OPERATING A MOTOR VEHICLE AND WHEN APPLYING FOR RENEWAL** | |
**EXPIRES ON BIRTHDAY** | |
- N2287802 | |
- Kenneth Wayne Shanaberger | |
- 1541 Beloit Ave #208 | |
- Los Angeles CA 90025 | |
| SEX | HAIR | EYES | HEIGHT | WEIGHT | PRE LIC EXP | | |
| --- | ---- | ---- | ------ | ------ | ----------- | | |
| M | Brn | Brn | 5-6 | 130 | 82 | | |
**DATE OF BIRTH** | |
**MUST WEAR CORRECTIVE LENSES** □ | |
**SEE OVER FOR ANY OTHER CONDITIONS** | |
**OTHER ADDRESS** CLASS 3 | |
**SECTION 12804 VEHICLE CODE** | |
**X** | |
08-21-80 | Tor mw | |
**DO NOT LAMINATE** | |
================================================ | |
FILE: shared/outputs/0031.md | |
================================================ | |
**SIGNATURE OF BEARER / SIGNATURE DU TITULAIRE / FIRMA DEL TITULAR** | |
--- | |
**PASSPORT** | |
**PASSEPORT** | |
**PASAPORTE** | |
**UNITED STATES OF AMERICA** | |
**Type / Type / Tipo** P | |
**Code / Code / Codigo** USA | |
**Passport No. / No. du Passeport / No. de Pasaporte** 546844936 | |
**Surname / Nom / Apellidos** ABRENICA | |
**Given Names / Prénoms / Nombres** JARED MICHAEL | |
**Nationality / Nationalité / Nacionalidad** UNITED STATES OF AMERICA | |
**Date of birth / Date de naissance / Fecha de nacimiento** 10 Feb 2001 | |
**Place of birth / Lieu de naissance / Lugar de nacimiento** NEW YORK, U.S.A. | |
**Sex / Sexe / Sexo** M | |
**Date of issue / Date de délivrance / Fecha de expedición** 06 Jun 2016 | |
**Date of expiration / Date d'expiration / Fecha de expiración** 05 Jun 2021 | |
**Authority / Autorité / Autoridad** United States Department of State | |
**Endorsements / Mentions Spéciales / Anotaciones** SEE PAGE 27 | |
**USA** | |
P<USAABRENICA<<JARED<MICHAEL<<<<<<<<<<<<<<<< | |
5468449363USA0102100M2106054275193173<681306 | |
================================================ | |
FILE: shared/outputs/0032.md | |
================================================ | |
**ENDORSEMENTS AND LIMITATIONS** | |
This passport is valid for all countries unless otherwise specified. The bearer must comply with any visa or other entry regulations of the countries to be visited. | |
SEE OBSERVATIONS BEGINNING ON PAGE 5 (IF APPLICABLE) | |
**MENTIONS ET RESTRICTIONS** | |
Ce passeport est valable pour tous les pays, sauf indication contraire. Le titulaire doit se conformer aux formalités relatives aux visas ou aux autres formalités d'entrée des pays où il a l'intention de se rendre. | |
VOIR LES OBSERVATIONS DÉBUTANT À LA PAGE 5 (LE CAS ÉCHÉANT) | |
**Signature of bearer - Signature du titulaire** | |
GK141569 | |
--- | |
**CANADA** | |
**PASSPORT** | |
**PASSEPORT** | |
**Type/Type** P | |
**Issuing Country/Pays émetteur** CAN | |
**Passport No./N° de passeport** GK141569 | |
**Surname/Nom** MANN | |
**Given names/Prénoms** JASKARAN SINGH | |
**Nationality/Nationalité** CANADIAN / CANADIENNE | |
**Date of birth/Date de naissance** 18 FEB / FÉV 93 | |
**Sex/Sexe** M | |
**Place of birth/Lieu de naissance** MAUR NABHA, INDIA
**Date of issue/Date de délivrance** 17 FEB / FEB 18 | |
**Date of expiry/Date d'expiration** 16 FEB / FEB 28 | |
**Issuing Authority/Autorité de délivrance** TORONTO | |
P<CANMANN<<JASKARAN<<SINGH<<<<<<<<<<<<<<<<<< | |
GK141569<8CAN8607294M2707202<<<<<<<<<<<<<<00 | |
ED197265 | |
================================================ | |
FILE: shared/outputs/0033.md | |
================================================ | |
Assinatura do titular / Signature du titulaire | |
Bearer's signature / Firma del titular | |
Este passaporte deve ser assinado pelo titular, salvo em caso de incapacidade. | |
Ce passeport doit être signé par le titulaire, sauf en cas d'incapacité. | |
This passport must be signed, except where the bearer is unable to do so. | |
Este pasaporte debe ser firmado por el titular, salvo en caso de incapacidad. | |
AA000000 | |
--- | |
**REPÚBLICA FEDERATIVA DO BRASIL** | |
**PASSAPORTE** | |
**PASSPORT** | |
**TIPO/TYPE:** P | |
**PAÍS EMISSOR/ISSUING COUNTRY:** BRA | |
**PASSAPORTE Nº/PASSPORT No.:** AA000000 | |
**SOBRENOME/SURNAME:** FARIAS DOS SANTOS | |
**NOME/GIVEN NAMES:** RODRIGO | |
**NACIONALIDADE/NATIONALITY:** BRASILEIRO(A) | |
**DATA DO NASCIMENTO/DATE OF BIRTH:** 16 MAR/MAR 2004 | |
**IDENTIDADE Nº/PERSONAL No:** | |
**SEXO/SEX:** M | |
**NATURALIDADE/PLACE OF BIRTH:** BRASÍLIA/DF | |
**FILIAÇÃO/FILIATION:** | |
MARCOS JOSÉ DOS SANTOS | |
AMANDA FARIAS DOS SANTOS | |
O titular, enquanto menor, está autorizado pelos genitores, pelo prazo deste documento, a viajar | |
apenas com um dos pais, indistintamente: Res. CNJ 131/11, Art. 13. | |
**DATA DE EXPEDIÇÃO/DATE OF ISSUE:** 06 JUL/JUL 2015 | |
**VALIDO ATÉ/DATE OF EXPIRY:** 05 JUL/JUL 2025 | |
**AUTORIDADE/AUTHORITY:** DPAS/DPF | |
P<BRAFARIAS<DOS<SANTOS<<RODRIGO<<<<<<<<<<<<< | |
AA000000<0BRA0403162M2507053<<<<<<<<<<<<<<04 | |
================================================ | |
FILE: shared/outputs/0034.md | |
================================================ | |
**ENDORSEMENTS AND LIMITATIONS** | |
This passport is valid for all countries unless otherwise specified. The bearer must comply with any visa or other entry regulations of the countries to be visited. | |
SEE OBSERVATIONS BEGINNING ON PAGE 5 (IF APPLICABLE) | |
**MENTIONS ET RESTRICTIONS** | |
Ce passeport est valable pour tous les pays, sauf indication contraire. Le titulaire doit se conformer aux formalités relatives aux visas ou aux autres formalités d'entrée des pays où il a l'intention de se rendre. | |
VOIR LES OBSERVATIONS DÉBUTANT À LA PAGE 5 (LE CAS ÉCHÉANT) | |
[Signature] | |
**Signature of bearer - Signature du titulaire** | |
HK444152 | |
--- | |
**CANADA** | |
**PASSPORT** | |
**PASSEPORT** | |
**Type/Type** P | |
**Issuing Country/Pays émetteur** CAN | |
**Passport No./N° de passeport** HK444152 | |
**Surname/Nom** WITTMACK | |
**Given names/Prénoms** BRIAN FREDRICK | |
**Nationality/Nationalité** CANADIAN/CANADIENNE | |
**Date of birth/Date de naissance** 01 NOV 47 | |
**Sex/Sexe** M | |
**Place of birth/Lieu de naissance** CONSORT CAN | |
**Date of issue/Date de délivrance** 13 JUNE/JUIN 16 | |
**Date of expiry/Date d'expiration** 13 JUNE/JUIN 26 | |
**Issuing Authority/Autorité de délivrance** MISSISSAUGA | |
P<CANWITTMACK<<BRIAN<FREDRICK<<<<<<<<<<<<<< | |
HK444152<5CAN4711018M2606130<<<<<<<<<<<<<<06 | |
EGD69494 | |
================================================ | |
FILE: shared/outputs/0035.md | |
================================================ | |
**THIS PAGE IS RESERVED FOR OFFICIAL OBSERVATIONS** | |
**CETTE PAGE EST RÉSERVÉE AUX OBSERVATIONS OFFICIELLES (11)** | |
**THERE ARE NO OFFICIAL OBSERVATIONS** | |
--- | |
**UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND** | |
**PASSPORT** | |
**PASSEPORT** | |
**Type/Type** P | |
**Code/Code** GBR | |
**Passport No./Passeport No.** 518242591 | |
**Surname/Nom (1)** WEBB | |
**Given names/Prénoms (2)** JAMES ROBERT | |
**Nationality/Nationalité (3)** BRITISH CITIZEN | |
**Date of Birth/Date de naissance (4)** 17 FEB / FEV 77 | |
**Sex/Sexe (5)** M | |
**Place of birth/Lieu de naissance (6)** CROYDON | |
**Date of issue/Date de délivrance (7)** 24 OCT / OCT 13 | |
**Authority/Autorité (8)** IPS | |
**Date of expiry/Date d'expiration (9)** 24 APR / AVR 24 | |
**Holder's signature/Signature du titulaire (10)** [Signature] | |
P<GBRWEBB<<JAMES<ROBERT<<<<<<<<<<<<<<<<<<<<< | |
5182425917GBR7702174M2404244<<<<<<<<<<<<<<06 | |
================================================ | |
FILE: shared/outputs/0036.md | |
================================================ | |
**RESIDENZA / RESIDENCE / DOMICILE (11)** TORINO (TO) | |
**RESIDENZA / RESIDENCE / DOMICILE (11)** | |
**RESIDENZA / RESIDENCE / DOMICILE (11)** | |
**STATURA / HEIGHT / TAILLE (12)** 176 | |
**COLORE DEGLI OCCHI / COLOUR OF EYES / COULEUR DES YEUX (13)** MARRONI | |
--- | |
**REPUBBLICA ITALIANA** | |
**PASSAPORTO** | |
**PASSPORT** | |
**PASSEPORT** | |
**Tipo. Type. Type.** P | |
**Codice Paese. Code of Issuing State. Code du pays émetteur.** ITA | |
**Passaporto N. Passport No. Passeport N°.** YA8116396 | |
**Cognome. Surname. Nom. (1)** TREVISAN | |
**Nome. Given Names. Prénoms. (2)** MARCO | |
**Cittadinanza. Nationality. Nationalité. (3)** ITALIANA | |
**Data di nascita. Date of birth. Daté de naissance. (4)** 12 FEB / FEB 1966 | |
**Sesso. Sex. Sexe. (5)** M | |
**Luogo di nascita. Place of birth. Lieu de naissance. (6)** FELTRE (BL) | |
**Data di rilascio. Date of issue. Date de délivrance. (7)** 10 LUG / JUL 2015 | |
**Autorità. Authority. Autorité. (9)** | |
MINISTRO AFFARI ESTERI | |
E COOPERAZIONE INTERNAZIONALE | |
**Data di scadenza. Date of expiry. Date d'expiration. (8)** 09 LUG / JUL 2025 | |
**Firma del titolare. Holder's signature / Signature du titulaire. (10)** [Signature] | |
P<ITATREVISAN<<MARCO<<<<<<<<<<<<<<<<<<<<<<<< | |
YA81163966ITA6602129M2507097<<<<<<<<<<<<<<08 | |
================================================ | |
FILE: shared/outputs/0037.md | |
================================================ | |
# We the People | |
**Of the United States,** | |
in Order to form a more perfect Union, | |
establish Justice, insure domestic Tranquility, | |
provide for the common defence, | |
promote the general Welfare, and secure | |
the Blessings of Liberty to ourselves and | |
our Posterity, do ordain and establish this | |
Constitution for the United States of America. | |
[Signature] | |
**SIGNATURE OF BEARER / SIGNATURE DU TITULAIRE / FIRMA DEL TITULAR** | |
--- | |
**UNITED STATES OF AMERICA** | |
**PASSPORT** | |
**PASSEPORT** | |
**PASAPORTE** | |
**Type / Type / Tipo** P | |
**Code / Code / Código** USA | |
**Passport No. / No de Passeport / No. de Pasaporte** 910239248 | |
**Surname / Nom / Apellidos** OBAMA | |
**Given Names / Prénoms / Nombres** MICHELLE | |
**Nationality / Nationalité / Nacionalidad** UNITED STATES OF AMERICA | |
**Date of birth / Date de naissance / Fecha de nacimiento** 17 Jan 1964 | |
**Place of birth / Lieu de naissance / Lugar de nacimiento** ILLINOIS, U.S.A. | |
**Sex / Sexe / Sexo** F | |
**Date of issue / Date de délivrance / Fecha de expedición** 06 Dec 2013 | |
**Date of expiration / Date d'expiration / Fecha de caducidad** 05 Dec 2018 | |
**Authority / Autorité / Autoridad** United States Department of State | |
**Endorsements / Mentions Spéciales / Anotaciones** SEE PAGE 51 | |
**USA** | |
P<USABOBAMA<<MICHELLE<<<<<<<<<<<<<<<<<<<<<<<< | |
9102392482USA6401171F1812051900781200<129676 | |
**USA** | |
================================================ | |
FILE: shared/outputs/0038.md | |
================================================ | |
# We the People | |
**Of the United States,** | |
in Order to form a more perfect Union, | |
establish Justice, insure domestic Tranquility, | |
provide for the common defence, | |
promote the general Welfare, and secure | |
the Blessings of Liberty to ourselves and | |
our Posterity, do ordain and establish this | |
Constitution for the United States of America. | |
**SIGNATURE OF BEARER / SIGNATURE DU TITULAIRE / FIRMA DEL TITULAR** | |
--- | |
**UNITED STATES OF AMERICA** | |
**PASSPORT** | |
**PASSEPORT** | |
**PASAPORTE** | |
**Type / Type / Tipo** P | |
**Code / Code / Código** USA | |
**Passport No. / No de Passeport / No. de Pasaporte** 488839667 | |
**Surname / Nom / Apellidos** VOLD | |
**Given Names / Prénoms / Nombres** STEPHEN HANSL | |
**Nationality / Nationalité / Nacionalidad** UNITED STATES OF AMERICA | |
**Date of birth / Date de naissance / Fecha de nacimiento** 15 Aug 1960 | |
**Place of birth / Lieu de naissance / Lugar de nacimiento** WASHINGTON, U.S.A. | |
**Sex / Sexe / Sexo** M | |
**Date of issue / Date de délivrance / Fecha de expedición** 21 May 2012 | |
**Date of expiration / Date d'expiration / Fecha de caducidad** 20 May 2022 | |
**Authority / Autorité / Autoridad** United States Department of State | |
**Endorsements / Mentions Spéciales / Anotaciones** SEE PAGE 51 | |
**USA** | |
P<USAVOLD<<STEPHEN<HANSL<<<<<<<<<<<<<<<<<<<< | |
4888396671USA6008156M220520112117147143<509936 | |
**USA** | |
================================================ | |
FILE: shared/outputs/0039.md | |
================================================ | |
# We the People | |
**Of the United States,** | |
in Order to form a more perfect Union, | |
establish Justice, insure domestic Tranquility, | |
provide for the common defence, | |
promote the general Welfare, and secure | |
the Blessings of Liberty to ourselves and | |
our Posterity, do ordain and establish this | |
Constitution for the United States of America. | |
[Signature] | |
**SIGNATURE OF BEARER / SIGNATURE DU TITULAIRE / FIRMA DEL TITULAR** | |
--- | |
**UNITED STATES OF AMERICA** | |
**PASSPORT** | |
**PASSEPORT** | |
**PASAPORTE** | |
**Type / Type / Tipo** P | |
**Code / Code / Código** USA | |
**Passport No. / No de Passeport / No. de Pasaporte** 963545637 | |
**Surname / Nom / Apellidos** JOHN | |
**Given Names / Prénoms / Nombres** DOE | |
**Nationality / Nationalité / Nacionalidad** USA | |
**Date of birth / Date de naissance / Fecha de nacimiento** 15 Mar 1996 | |
**Place of birth / Lieu de naissance / Lugar de nacimiento** CALIFORNIA, U.S.A | |
**Sex / Sexe / Sexo** M | |
**Date of issue / Date de délivrance / Fecha de expedición** 14 Apr 2017 | |
**Date of expiration / Date d'expiration / Fecha de caducidad** 14 Apr 2027 | |
**Authority / Autorité / Autoridad** United States Department of State | |
**Endorsements / Mentions Spéciales / Anotaciones** SEE PAGE 17 | |
**USA** | |
P<USAJOHN<<DOE<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< | |
9635456374USA9603150M27041402O2113962<804330 | |
**USA** | |
================================================ | |
FILE: shared/outputs/0040.md | |
================================================ | |
**THIS PAGE IS RESERVED FOR OFFICIAL OBSERVATIONS** | |
**CETTE PAGE EST RÉSERVÉE AUX OBSERVATIONS OFFICIELLES (11)** | |
**THERE ARE NO OFFICIAL OBSERVATIONS** | |
--- | |
**UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND** | |
**PASSPORT** | |
**PASSEPORT** | |
**Type/Type** P | |
**Code/Code** GBR | |
**Passport No./Passeport No.** 925600253 | |
**Surname/Nom (1)** UK SPECIMEN | |
**Given names/Prénoms (2)** ANGELA ZOE | |
**Nationality/Nationalité (3)** BRITISH CITIZEN | |
**Date of birth/Date de naissance (4)** 11 SEP / SEPT 88 | |
**Sex/Sexe (5)** F | |
**Place of birth/Lieu de naissance (6)** CROYDON | |
**Date of issue/Date de délivrance (7)** 16 JUL / JUIL 10 | |
**Authority/Autorité (8)** IPS | |
**Date of expiry/Date d’expiration (9)** 16 JUL / JUIL 20 | |
**Holder's signature/Signature du titulaire (10)** A Specimen | |
P<GBRUK<SPECIMEN<<ANGELA<ZOE<<<<<<<<<<<<<<<< | |
9256002538GBR8809117F2007162<<<<<<<<<<<<<<06 | |
================================================ | |
FILE: .github/workflows/python-publish.yml | |
================================================ | |
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python#publishing-to-package-registries

# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.

name: Deploy Python Package

on:
  release:
    types: [published]

permissions:
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v3
        with:
          python-version: '3.x'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install build
      - name: Build package
        run: python -m build
      - name: Publish package
        uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
        with:
          user: __token__
          password: ${{ secrets.PYPI_API_TOKEN }}
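
The machine-readable zone (MRZ) lines that close each of the `shared/outputs` passport fixtures above embed ICAO 9303 check digits; for example, the document number `925600253` in `shared/outputs/0040.md` is immediately followed by its check digit `8`. As a minimal sketch for sanity-checking those fixtures (this helper is not part of the repo), the standard scheme weights characters 7, 3, 1 repeating, keeps digits as-is, maps letters A=10 through Z=35, and counts the `<` filler as 0:

```python
# Hypothetical helper (not part of this repo): computes the ICAO 9303
# check digit used in the MRZ lines of the passport fixtures above.
# Weights cycle 7, 3, 1; digits keep their value, letters map to
# A=10 .. Z=35, and the '<' filler counts as 0.
def mrz_check_digit(field: str) -> int:
    weights = (7, 3, 1)
    total = 0
    for i, ch in enumerate(field):
        if ch.isdigit():
            value = int(ch)
        elif ch.isalpha():
            value = ord(ch.upper()) - ord("A") + 10
        else:  # '<' filler
            value = 0
        total += value * weights[i % 3]
    return total % 10

# Document number "925600253" from shared/outputs/0040.md is followed
# by "8" in its MRZ line, which this reproduces.
assert mrz_check_digit("925600253") == 8
```

Running this against the MRZ fields in the outputs above is a quick way to tell genuine OCR misreads (a stray `O` where a `0` belongs, for instance) apart from intentionally synthetic specimen data.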