Mistral OCR
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "ahC1s4OldYyM"
},
"source": [
"# OCR Cookbook\n",
"\n",
"---\n",
"\n",
"## Apply OCR to Convert Images into Text\n",
"\n",
"Optical Character Recognition (OCR) allows you to retrieve text data from images. With Mistral OCR, you can do this extremely fast and effectively, extracting text from hundreds and thousands of images (or PDFs).\n",
"\n",
"In this simple cookbook, we will extract text from a set of images using two methods:\n",
"- [Without Batch Inference](#scrollTo=qmXyB3rPlXQW): Looping through the dataset, extracting text from each image, and saving the result.\n",
"- [With Batch Inference](#scrollTo=jYfWYjzTmixB): Leveraging Batch Inference to extract text with a 50% cost reduction.\n",
"\n",
"---\n",
"\n",
"### Used\n",
"\n",
"- OCR\n",
"- Batch Inference"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Sf84okJJmm7M"
},
"source": [
"### Setup\n",
"First, let's install `mistralai` and `datasets`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "X1EBW_a6gRUD",
"outputId": "fb38d445-466b-447f-e49f-a991526d29fc"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: mistralai in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (1.5.0)\n",
"Requirement already satisfied: datasets in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (3.2.0)\n",
"Requirement already satisfied: eval-type-backport>=0.2.0 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from mistralai) (0.2.0)\n",
"Requirement already satisfied: httpx>=0.27.0 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from mistralai) (0.27.0)\n",
"Requirement already satisfied: jsonpath-python>=1.0.6 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from mistralai) (1.0.6)\n",
"Requirement already satisfied: pydantic>=2.9.0 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from mistralai) (2.9.2)\n",
"Requirement already satisfied: python-dateutil>=2.8.2 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from mistralai) (2.8.2)\n",
"Requirement already satisfied: typing-inspect>=0.9.0 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from mistralai) (0.9.0)\n",
"Requirement already satisfied: filelock in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from datasets) (3.13.1)\n",
"Requirement already satisfied: numpy>=1.17 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from datasets) (1.26.2)\n",
"Requirement already satisfied: pyarrow>=15.0.0 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from datasets) (15.0.0)\n",
"Requirement already satisfied: dill<0.3.9,>=0.3.0 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from datasets) (0.3.7)\n",
"Requirement already satisfied: pandas in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from datasets) (2.1.4)\n",
"Requirement already satisfied: requests>=2.32.2 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from datasets) (2.32.3)\n",
"Requirement already satisfied: tqdm>=4.66.3 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from datasets) (4.67.1)\n",
"Requirement already satisfied: xxhash in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from datasets) (3.4.1)\n",
"Requirement already satisfied: multiprocess<0.70.17 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from datasets) (0.70.15)\n",
"Requirement already satisfied: fsspec<=2024.9.0,>=2023.1.0 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets) (2024.9.0)\n",
"Requirement already satisfied: aiohttp in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from datasets) (3.9.3)\n",
"Requirement already satisfied: huggingface-hub>=0.23.0 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from datasets) (0.28.1)\n",
"Requirement already satisfied: packaging in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from datasets) (23.2)\n",
"Requirement already satisfied: pyyaml>=5.1 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from datasets) (6.0.1)\n",
"Requirement already satisfied: aiosignal>=1.1.2 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from aiohttp->datasets) (1.3.1)\n",
"Requirement already satisfied: attrs>=17.3.0 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from aiohttp->datasets) (23.2.0)\n",
"Requirement already satisfied: frozenlist>=1.1.1 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from aiohttp->datasets) (1.4.1)\n",
"Requirement already satisfied: multidict<7.0,>=4.5 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from aiohttp->datasets) (6.0.5)\n",
"Requirement already satisfied: yarl<2.0,>=1.0 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from aiohttp->datasets) (1.9.4)\n",
"Requirement already satisfied: anyio in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from httpx>=0.27.0->mistralai) (3.7.1)\n",
"Requirement already satisfied: certifi in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from httpx>=0.27.0->mistralai) (2024.2.2)\n",
"Requirement already satisfied: httpcore==1.* in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from httpx>=0.27.0->mistralai) (1.0.4)\n",
"Requirement already satisfied: idna in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from httpx>=0.27.0->mistralai) (2.10)\n",
"Requirement already satisfied: sniffio in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from httpx>=0.27.0->mistralai) (1.3.0)\n",
"Requirement already satisfied: h11<0.15,>=0.13 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from httpcore==1.*->httpx>=0.27.0->mistralai) (0.14.0)\n",
"Requirement already satisfied: typing-extensions>=3.7.4.3 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from huggingface-hub>=0.23.0->datasets) (4.12.2)\n",
"Requirement already satisfied: annotated-types>=0.6.0 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from pydantic>=2.9.0->mistralai) (0.6.0)\n",
"Requirement already satisfied: pydantic-core==2.23.4 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from pydantic>=2.9.0->mistralai) (2.23.4)\n",
"Requirement already satisfied: six>=1.5 in c:\\users\\di-co\\appdata\\roaming\\python\\python312\\site-packages (from python-dateutil>=2.8.2->mistralai) (1.16.0)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from requests>=2.32.2->datasets) (3.3.2)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from requests>=2.32.2->datasets) (2.2.1)\n",
"Requirement already satisfied: colorama in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from tqdm>=4.66.3->datasets) (0.4.6)\n",
"Requirement already satisfied: mypy-extensions>=0.3.0 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from typing-inspect>=0.9.0->mistralai) (1.0.0)\n",
"Requirement already satisfied: pytz>=2020.1 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from pandas->datasets) (2023.3.post1)\n",
"Requirement already satisfied: tzdata>=2022.1 in c:\\users\\di-co\\appdata\\local\\programs\\python\\python312\\lib\\site-packages (from pandas->datasets) (2023.3)\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"[notice] A new release of pip is available: 24.0 -> 25.0.1\n",
"[notice] To update, run: python.exe -m pip install --upgrade pip\n"
]
}
],
"source": [
"!pip install mistralai datasets"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nTpiGWkpmvSb"
},
"source": [
"We can now set up our client. You can create an API key on our [Plateforme](https://console.mistral.ai/api-keys/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "AwG2kwfTlbW1"
},
"outputs": [],
"source": [
"from mistralai import Mistral\n",
"\n",
"api_key = \"API_KEY\"\n",
"client = Mistral(api_key=api_key)\n",
"ocr_model = \"mistral-ocr-latest\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qmXyB3rPlXQW"
},
"source": [
"## Without Batch\n",
"\n",
"As an example, let's use Mistral OCR to extract text from multiple images.\n",
"\n",
"We will use a dataset containing raw image data. To send this data via an image URL, we need to encode it in base64. For more information, please visit our [Vision Documentation](https://docs.mistral.ai/capabilities/vision/#passing-a-base64-encoded-image)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "sNo8DZ7WbaBq"
},
"outputs": [],
"source": [
"import base64\n",
"from io import BytesIO\n",
"from PIL import Image\n",
"\n",
"def encode_image_data(image_data):\n",
" try:\n",
" # Ensure image_data is bytes\n",
" if isinstance(image_data, bytes):\n",
" # Directly encode bytes to base64\n",
" return base64.b64encode(image_data).decode('utf-8')\n",
" else:\n",
" # Convert image data to bytes if it's not already\n",
" buffered = BytesIO()\n",
" image_data.save(buffered, format=\"JPEG\")\n",
" return base64.b64encode(buffered.getvalue()).decode('utf-8')\n",
" except Exception as e:\n",
" print(f\"Error encoding image: {e}\")\n",
" return None"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3m7d2yBImCfO"
},
"source": [
"For this demo, we will use a simple dataset containing numerous documents and scans in image format. Specifically, we will use the `HuggingFaceM4/DocumentVQA` dataset, loaded via the `datasets` library.\n",
"\n",
"We will download only 100 samples for this demonstration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZJ55QEifcgUq"
},
"outputs": [],
"source": [
"from datasets import load_dataset\n",
"\n",
"n_samples = 100\n",
"dataset = load_dataset(\"HuggingFaceM4/DocumentVQA\", split=\"train\", streaming=True)\n",
"subset = list(dataset.take(n_samples))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8TEAOOVUmVNu"
},
"source": [
"With our subset of 100 samples ready, we can loop through each image to extract the text.\n",
"\n",
"We will save the results in a new dataset and export it as a JSONL file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OOUGkkrfce63",
"outputId": "58510dc0-8191-4ad6-fea4-b005f5198926"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" 0%| | 0/100 [00:00<?, ?it/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 100/100 [02:13<00:00, 1.33s/it]\n"
]
}
],
"source": [
"from tqdm import tqdm\n",
"\n",
"ocr_dataset = []\n",
"for sample in tqdm(subset):\n",
" image_data = sample['image'] # 'image' contains the actual image data\n",
"\n",
" # Encode the image data to base64\n",
" base64_image = encode_image_data(image_data)\n",
" image_url = f\"data:image/jpeg;base64,{base64_image}\"\n",
"\n",
" # Process the image using Mistral OCR\n",
" response = client.ocr.process(\n",
" model=ocr_model,\n",
" document={\n",
" \"type\": \"image_url\",\n",
" \"image_url\": image_url,\n",
" }\n",
" )\n",
"\n",
" # Store the image data and OCR content in the new dataset\n",
" ocr_dataset.append({\n",
" 'image': base64_image,\n",
" 'ocr_content': response.pages[0].markdown # Since we are dealing with single images, there will be only one page\n",
" })"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8bcncJL0dAFk"
},
"outputs": [],
"source": [
"import json\n",
"\n",
"with open('ocr_dataset.json', 'w') as f:\n",
" json.dump(ocr_dataset, f, indent=4)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jYfWYjzTmixB"
},
"source": [
"Perfect, we have extracted all text from the 100 samples. However, this process can be made more cost-efficient using Batch Inference.\n",
"\n",
"## With Batch\n",
"\n",
"To use Batch Inference, we need to create a JSONL file containing all the image data and request information for our batch.\n",
"\n",
"Let's create a function called `create_batch_file` to handle this task by generating a file in the proper format."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "EHKPWyqVhFn_"
},
"outputs": [],
"source": [
"def create_batch_file(image_urls, output_file):\n",
" with open(output_file, 'w') as file:\n",
" for index, url in enumerate(image_urls):\n",
" entry = {\n",
" \"custom_id\": str(index),\n",
" \"body\": {\n",
" \"document\": {\n",
" \"type\": \"image_url\",\n",
" \"image_url\": url\n",
" },\n",
" \"include_image_base64\": True\n",
" }\n",
" }\n",
" file.write(json.dumps(entry) + '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-gmWu5dGm79P"
},
"source": [
"The next step involves encoding the data of each image into base64 and saving the URL of each image that will be used."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "v_Fg_t-shHgj",
"outputId": "8fbe8b03-7d8a-4c13-96b0-98d56a4c4c82"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" 48%|████▊ | 48/100 [00:00<00:01, 41.07it/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 100/100 [00:04<00:00, 24.48it/s]\n"
]
}
],
"source": [
"image_urls = []\n",
"for sample in tqdm(subset):\n",
" image_data = sample['image'] # 'image' contains the actual image data\n",
"\n",
" # Encode the image data to base64 and add the url to the list\n",
" base64_image = encode_image_data(image_data)\n",
" image_url = f\"data:image/jpeg;base64,{base64_image}\"\n",
" image_urls.append(image_url)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "h6q1JR73nIhe"
},
"source": [
"We can now create our batch file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7A3cRUP0gf6W"
},
"outputs": [],
"source": [
"batch_file = \"batch_file.jsonl\"\n",
"create_batch_file(image_urls, batch_file)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Z4ME_pJCnM6-"
},
"source": [
"With everything ready, we can upload it to the API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "YGBd3kRfgnyq"
},
"outputs": [],
"source": [
"batch_data = client.files.upload(\n",
" file={\n",
" \"file_name\": batch_file,\n",
" \"content\": open(batch_file, \"rb\")},\n",
" purpose = \"batch\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9Q7dpJ07nVOO"
},
"source": [
"The file is uploaded, but the batch inference has not started yet. To initiate it, we need to create a job."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Y2iBH3Ikgzb7"
},
"outputs": [],
"source": [
"created_job = client.batch.jobs.create(\n",
" input_files=[batch_data.id],\n",
" model=ocr_model,\n",
" endpoint=\"/v1/ocr\",\n",
" metadata={\"job_type\": \"testing\"}\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JE_DBRu4nbXz"
},
"source": [
"Our batch is ready and running!\n",
"\n",
"We can retrieve information using the following method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0dIph-xtg94n",
"outputId": "cf8adcc4-310c-435f-eab0-7fd2c706d4b9"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Status: QUEUED\n",
"Total requests: 100\n",
"Failed requests: 0\n",
"Successful requests: 0\n",
"Percent done: 0.0%\n"
]
}
],
"source": [
"retrieved_job = client.batch.jobs.get(job_id=created_job.id)\n",
"print(f\"Status: {retrieved_job.status}\")\n",
"print(f\"Total requests: {retrieved_job.total_requests}\")\n",
"print(f\"Failed requests: {retrieved_job.failed_requests}\")\n",
"print(f\"Successful requests: {retrieved_job.succeeded_requests}\")\n",
"print(\n",
" f\"Percent done: {round((retrieved_job.succeeded_requests + retrieved_job.failed_requests) / retrieved_job.total_requests, 4) * 100}%\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZWMupK2Sng-5"
},
"source": [
"Let's automate this feedback loop and download the results once they are ready!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "oarRyxv4jV6B",
"outputId": "1639a6a0-8a3a-450e-e11e-da9974cdded0"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Status: SUCCESS\n",
"Total requests: 100\n",
"Failed requests: 0\n",
"Successful requests: 100\n",
"Percent done: 100.0%\n"
]
}
],
"source": [
"import time\n",
"from IPython.display import clear_output\n",
"\n",
"while retrieved_job.status in [\"QUEUED\", \"RUNNING\"]:\n",
" retrieved_job = client.batch.jobs.get(job_id=created_job.id)\n",
"\n",
" clear_output(wait=True) # Clear the previous output ( User Friendly )\n",
" print(f\"Status: {retrieved_job.status}\")\n",
" print(f\"Total requests: {retrieved_job.total_requests}\")\n",
" print(f\"Failed requests: {retrieved_job.failed_requests}\")\n",
" print(f\"Successful requests: {retrieved_job.succeeded_requests}\")\n",
" print(\n",
" f\"Percent done: {round((retrieved_job.succeeded_requests + retrieved_job.failed_requests) / retrieved_job.total_requests, 4) * 100}%\"\n",
" )\n",
" time.sleep(2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UKH-dYmxhBjX",
"outputId": "36f505eb-e04e-45fe-cee4-abb9e3e24433"
},
"outputs": [
{
"data": {
"text/plain": [
"<Response [200 OK]>"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"client.files.download(file_id=retrieved_job.output_file)"
]
},
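{
"cell_type": "markdown",
"metadata": {},
"source": [
"The downloaded batch output is a JSONL file with one OCR result per line. As a minimal, hypothetical sketch (assuming the download call returns an HTTP response exposing the raw bytes via `.content`, and that each result line carries a `custom_id` plus the OCR body under `response.body`), the results could be parsed like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch: parse the downloaded batch results (JSONL).\n",
"# Assumptions: the download response exposes the body via .content,\n",
"# and each line holds a custom_id plus the OCR body under response.body.\n",
"import json\n",
"\n",
"output = client.files.download(file_id=retrieved_job.output_file)\n",
"batch_ocr = {}\n",
"for line in output.content.decode(\"utf-8\").splitlines():\n",
"    if not line.strip():\n",
"        continue\n",
"    result = json.loads(line)\n",
"    pages = result[\"response\"][\"body\"][\"pages\"]\n",
"    batch_ocr[result[\"custom_id\"]] = pages[0][\"markdown\"]"
]
},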
{
"cell_type": "markdown",
"metadata": {
"id": "uGyDanavnq0C"
},
"source": [
"Done! With this method, you can perform OCR tasks in bulk in a very cost-effective way."
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

Mistral Generic API Calls:

curl --location "https://api.mistral.ai/v1/embeddings" \
     --header 'Content-Type: application/json' \
     --header 'Accept: application/json' \
     --header "Authorization: Bearer $MISTRAL_API_KEY" \
     --data '{
    "model": "mistral-embed",
    "input": ["Embed this sentence.", "As well as this one."]
  }'
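
The same call can be made with the Python SDK. A minimal sketch, assuming the mistralai package is installed and MISTRAL_API_KEY is set in the environment:

import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Embed two sentences with mistral-embed, mirroring the curl call above.
response = client.embeddings.create(
    model="mistral-embed",
    inputs=["Embed this sentence.", "As well as this one."],
)
# response.data is expected to hold one embedding vector per input sentence.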

Mistral OCR:

OCR and Document Understanding

Document OCR processor

The Document OCR (Optical Character Recognition) processor, powered by our latest OCR model mistral-ocr-latest, enables you to extract text and structured content from PDF documents.

Key features:

  • Extracts text content while maintaining document structure and hierarchy
  • Preserves formatting like headers, paragraphs, lists and tables
  • Returns results in markdown format for easy parsing and rendering
  • Handles complex layouts including multi-column text and mixed content
  • Processes documents at scale with high accuracy
  • Supports multiple document formats including PDF, images, and uploaded documents

The OCR processor returns both the extracted text content and metadata about the document structure, making it easy to work with the recognized content programmatically.
curl https://api.mistral.ai/v1/ocr \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
  -d '{
    "model": "mistral-ocr-latest",
    "document": {
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    "include_image_base64": true
  }' -o ocr_output.json

Or via base64:

curl https://api.mistral.ai/v1/ocr \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
  -d '{
    "model": "mistral-ocr-latest",
    "document": {
        "type": "document_url",
        "document_url": "data:application/pdf;base64,<base64_pdf>"
    },
    "include_image_base64": true
  }' -o ocr_output.json
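
Equivalently in Python, a minimal sketch using client.ocr.process (the same call used in the notebook above); the PDF URL is illustrative, and a base64 data URI can be passed in the document_url field the same way:

import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# OCR a PDF referenced by URL and keep any embedded images as base64.
ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234",
    },
    include_image_base64=True,
)
print(ocr_response.pages[0].markdown)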

Output Example:

{
    "pages": [
        {
            "index": 1,
            "markdown": "# d data from the target distribution, that is comparatively abundant, to predict model performance. Note that in this work, our focus is not to improve performance on the target but, rather, to estimate the accuracy on the target for a given classifier.\n\n[^0]\n[^0]:    * Work done in part while Saurabh Garg was interning at Google\n    ${ }^{1}$ Code is available at https://github.com/saurabhgarg1996/ATC_code.",
            "images": [],
            "dimensions": {
                "dpi": 200,
                "height": 2200,
                "width": 1700
            }
        },
        {
            "index": 2,
            "markdown": "![img-0.jpeg](img-0.jpeg)\n\nFigure 1: Illustration of our proposed method ATC. Left: using source domain validation data, we identify a threshold on a score (e.g. negative entropy) computed on model confidence such that fraction of examples abovey, our work takes a step forward in positively answering the question raised in Deng \\& Zheng (2021); Deng et al. (2021) about a practical strategy to select a threshold that enables accuracy prediction with thresholded model confidence.",
            "images": [
                {
                    "id": "img-0.jpeg",
                    "top_left_x": 292,
                    "top_left_y": 217,
                    "bottom_right_x": 1405,
                    "bottom_right_y": 649,
                    "image_base64": "..."
                }
            ],
            "dimensions": {
                "dpi": 200,
                "height": 2200,
                "width": 1700
            }
        },
        {
            "index": 3,
            "markdown": "ATC is simple to implement with existing frameworks, compatible with arbitrary model classes, and dominates other contemporary methods. Across several model architecturless, in our work, we only assume access to labeled data from the source domain presuming no access to labeled target domains or information about how to simulate them.",
            "images": [],
            "dimensions": {
                "dpi": 200,
                "height": 2200,
                "width": 1700
            }
        },
        {
            "index": 4,
            "markdown": "Moreover, unlike the parallel work of Deng et al. (2021), we do not focus on methods that alter the training on source data to aid accuracy prediction on the target data. Chen et al. (2021b) propose an importance re-weighting based approach that leverages (additional) information about the axis along which distribution is shifting in formwhere we use FCN. Across all datasets, we observe that ATC achieves superior performance (lower MAE is better). For GDE post T and pre T estimates match since TS doesn't alter the argmax prediction. Results reported by aggregating MAE numbers over 4 different seeds. Values in parenthesis (i.e., $(\\cdot)$ ) denote standard deviation values.",
            "images": [],
            "dimensions": {
                "dpi": 200,
                "height": 2200,
                "width": 1700
            }
        },
        {
            "index": 5,
            "markdown": "| Dataset | Shift | IM |  | AC |  | DOC |  | GDE | ATC-MC (Ours) |  | ATC-NE (Ours) |  |\n| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |\n|  |  | Pre T | Post T | Pre T | Post T | Pre T | Post T | Post T | Pre T | Post T | Pre T | Post T |\n| CIFAR10 | Natural | 7.14 | 6.20 | 10.25 | 7.06 | 7.68 | 6.35 | 5.74 | 4.02 | 3.85 | 3.76 | 3.38 |\n|  |  | (0.14) | (0.11) | (0.31) | (0.33) | (0.28) | (0.27) | (0.25) | (0.38) | (0.30) | (0.33) | (0.32) |\n|  | Synthetic | 12.62 | 10.75 | 16.50 | 11.91 | 13.93 | 11.20 | 7.97 | 5.66 | 5.03 | 4.87 | 3.63 |\n|  |  | (0.76) | (0.71) | (0.28) | (0.24) | (0.29) | (0.28) | (0.13) | (0.64) | (0.71) | (0.71) | (0.62) |\n| CIFAR100 | Synthetic | 12.77 | 12.34 | 16.89 | 12.73 | 11.18 | 9.63 | 12.00 | 5.61 | 5.55 | 5.65 | 5.76 |\n|  |  | (0.43) | (0.68) | (0.20) | (2.59) | (0.35) | (1.25) | (0.48) | (0.51) | (0.55) | (0.35) | (0.27) |\n| ImageNet200 | Natural | 12.63 | 7.99 | 23.08 | 7.22 | 15.40 | 6.33 | 5.00 | 4.60 | 1.80 | 4.06 | 1.38 |\n|  |  | (0.59) | (0.47) | (0.31) | (0.22) | (0.42) | (0.24) | (0.36) | (0.63) | (0.17) | (0.69) | (0.29) |\n|  | Synthetic | 20.17 | 11.74 | 33.69 | 9.51 | 25.49 | 8.61 | 4.19 | 5.37 | 2.78 | 4.53 | 3.58 |\n|  |  | (0.74) | (0.80) | (0.73) | (0.51) | (0.66) | (0.50) | (0.14) | (0.88) | (0.23) | (0.79) | (0.33) |\n| ImageNet | Natural | 8.09 | 6.42 | 21.66 | 5.91 | 8.53 | 5.21 | 5.90 | 3.93 | 1.89 | 2.45 | 0.73 |\n|  |  | (0.25) | (0.28) | (0.38) | (0.22) | (0.26) | (0.25) | (0.44) | (0.26) | (0.21) | (0.16) | (0.10) |\n|  | Synthetic | 13.93 | 9.90 | 28.05 | 7.56 | 13.82 | 6.19 | 6.70 | 3.33 | 2.55 | 2.12 | 5.06 |\n|  |  | (0.14) | (0.23) | (0.39) | (0.13) | (0.31) | (0.07) | (0.52) | (0.25) | (0.25) | (0.31) | (0.27) |\n| FMoW-WILDS | Natural | 5.15 | 3.55 | 34.64 | 5.03 | 5.58 | 3.46 | 5.08 | 2.59 | 2.33 | 2.52 | 2.22 |\n|  |  | (0.19) | (0.41) | (0.22) | (0.29) | (0.17) | (0.37) | (0.46) | (0.32) | (0.28) | (0.25) | (0.30) |\n| RxRx1-WILDS | Natural | 6.17 | 6.11 | 21.05 | 5.21 | 6.54 | 6.27 | 6.82 | 5.30 | 5.20 | 5.19 | 5.63 |\n|  |  | (0.20) | (0.24) | (0.31) | (0.18) | (0.21) | (0.20) | (0.31) | (0.30) | (0.44) | (0.43) | (0.55) |\n| Entity-13 | Same | 18.32 | 14.38 | 27.79 | 13.56 | 20.50 | 13.22 | 16.09 | 9.35 | 7.50 | 7.80 | 6.94 |\n|  |  | (0.29) | (0.53) | (1.18) | (0.58) | (0.47) | (0.58) | (0.84) | (0.79) | (0.65) | (0.62) | (0.71) |\n|  | Novel | 28.82 | 24.03 | 38.97 | 22.96 | 31.66 | 22.61 | 25.26 | 17.11 | 13.96 | 14.75 | 9.94 |\n|  |  | (0.30) | (0.55) | (1.32) | (0.59) | (0.54) | (0.58) | (1.08) | (0.93) | (0.64) | (0.78) |  |\n| Entity-30 | Same | 16.91 | 14.61 | 26.84 | 14.37 | 18.60 | 13.11 | 13.74 | 8.54 | 7.94 | 7.77 | 8.04 |\n|  |  | (1.33) | (1.11) | (2.15) | (1.34) | (1.69) | (1.30) | (1.07) | (1.47) | (1.38) | (1.44) | (1.51) |\n|  | Novel | 28.66 | 25.83 | 39.21 | 25.03 | 30.95 | 23.73 | 23.15 | 15.57 | 13.24 | 12.44 | 11.05 |\n|  |  | (1.16) | (0.88) | (2.03) | (1.11) | (1.64) | (1.11) | (0.51) | (1.44) | (1.15) | (1.26) | (1.13) |\n| NonLIVING-26 | Same | 17.43 | 15.95 | 27.70 | 15.40 | 18.06 | 14.58 | 16.99 | 10.79 | 10.13 | 10.05 | 10.29 |\n|  |  | (0.90) | (0.86) | (0.90) | (0.69) | (1.00) | (0.78) | (1.25) | (0.62) | (0.32) | (0.46) | (0.79) |\n|  | Novel | 29.51 | 27.75 | 40.02 | 26.77 | 30.36 | 25.93 | 27.70 | 19.64 | 17.75 | 16.90 | 15.69 |\n|  |  | (0.86) | (0.82) | (0.76) | (0.82) | (0.95) | (0.80) | (1.42) | (0.68) | (0.53) | (0.60) | (0.83) |\n| LIVING-17 | Same | 14.28 | 12.21 | 23.46 | 11.16 | 15.22 | 10.78 | 
10.49 | 4.92 | 4.23 | 4.19 | 4.73 |\n|  |  | (0.96) | (0.93) | (1.16) | (0.90) | (0.96) | (0.99) | (0.97) | (0.57) | (0.42) | (0.35) | (0.24) |\n|  | Novel | 28.91 | 26.35 | 38.62 | 24.91 | 30.32 | 24.52 | 22.49 | 15.42 | 13.02 | 12.29 | 10.34 |\n|  |  | (0.66) | (0.73) | (1.01) | (0.61) | (0.59) | (0.74) | (0.85) | (0.59) | (0.53) | (0.73) | (0.62) |\n\nTable 4: Mean Absolute estimation Error (MAE) results for different datasets in our setup grouped by the nature of shift for ResNet model. 'Same' refers to same subpopulation shifts and 'Novel' refers novel subpopulation shifts. We include details about the target sets considered in each shift in Table 2. Post T denotes use of TS calibration on source. Across all datasets, we observe that ATC achieves superior performance (lower MAE is better). For GDE post T and pre T estimates match since TS doesn't alter the argmax prediction. Results reported by aggregating MAE numbers over 4 different seeds. Values in parenthesis (i.e., $(\\cdot)$ ) denote standard deviation values.",
            "images": [],
            "dimensions": {
                "dpi": 200,
                "height": 2200,
                "width": 1700
            }
        }
    ],
    "model": "mistral-ocr-2503-completion",
    "usage_info": {
        "pages_processed": 29,
        "doc_size_bytes": null
    }
}

OCR with uploaded PDF

You can also upload a PDF file and get the OCR results from the uploaded PDF.

Upload a file:

curl https://api.mistral.ai/v1/files \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -F purpose="ocr" \
  -F file="@uploaded_file.pdf"

Retrieve File:

curl -X GET "https://api.mistral.ai/v1/files/$id" \
     -H "Accept: application/json" \
     -H "Authorization: Bearer $MISTRAL_API_KEY"

id='00edaf84-95b0-45db-8f83-f71138491f23' object='file' size_bytes=3749788 created_at=1741023462 filename='uploaded_file.pdf' purpose='ocr' sample_type='ocr_input' source='upload' deleted=False num_lines=None

Get signed URL:

curl -X GET "https://api.mistral.ai/v1/files/$id/url?expiry=24" \
     -H "Accept: application/json" \
     -H "Authorization: Bearer $MISTRAL_API_KEY"

Get OCR results:

curl https://api.mistral.ai/v1/ocr \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
  -d '{
    "model": "mistral-ocr-latest",
    "document": {
        "type": "document_url",
        "document_url": "<signed_url>"
    },
    "include_image_base64": true
  }' -o ocr_output.json
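
The upload, signed URL, and OCR steps can also be sketched with the Python SDK. This is a hypothetical sketch: files.upload and ocr.process appear in the notebook above, while files.get_signed_url and the .url attribute on its result are assumed from the docs:

import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Upload the PDF for OCR processing.
uploaded = client.files.upload(
    file={"file_name": "uploaded_file.pdf", "content": open("uploaded_file.pdf", "rb")},
    purpose="ocr",
)

# Get a temporary signed URL for the uploaded file (assumed SDK method).
signed_url = client.files.get_signed_url(file_id=uploaded.id)

# Run OCR against the signed URL.
ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url", "document_url": signed_url.url},
    include_image_base64=True,
)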

Document understanding

Document understanding combines OCR with large language models to enable natural language interaction with document content. This allows you to extract information and insights from documents by asking questions in natural language.

The workflow consists of two main steps:

Document Processing:

OCR extracts text, structure, and formatting, creating a machine-readable version of the document.

Language Model Understanding:

The extracted document content is analyzed by a large language model. You can ask questions or request information in natural language. The model understands context and relationships within the document and can provide relevant answers based on the document content.

Key capabilities:

  • Question answering about specific document content
  • Information extraction and summarization
  • Document analysis and insights
  • Multi-document queries and comparisons
  • Context-aware responses that consider the full document

Common use cases:

  • Analyzing research papers and technical documents
  • Extracting information from business documents
  • Processing legal documents and contracts
  • Building document Q&A applications
  • Automating document-based workflows

The examples below show how to interact with a PDF document using natural language:


curl https://api.mistral.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
  -d '{
    "model": "mistral-small-latest",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "what is the last sentence in the document"
          },
          {
            "type": "document_url",
            "document_url": "https://arxiv.org/pdf/1805.04770"
          }
        ]
      }
    ],
    "document_image_limit": 8,
    "document_page_limit": 64
  }'
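
A Python sketch of the same request, assuming the SDK's chat.complete accepts document_url content chunks as in the curl above:

import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Ask a question about a PDF by pairing a text chunk with a document_url chunk.
chat_response = client.chat.complete(
    model="mistral-small-latest",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "what is the last sentence in the document"},
                {"type": "document_url", "document_url": "https://arxiv.org/pdf/1805.04770"},
            ],
        }
    ],
)
print(chat_response.choices[0].message.content)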

FAQ

Are there any limits regarding the OCR API? Yes, there are certain limitations for the OCR API. Uploaded document files must not exceed 50 MB in size and should be no longer than 1,000 pages.
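
A quick client-side guard against the 50 MB limit before uploading (hypothetical helper; checking the 1,000-page limit would additionally require a PDF library):

import os

MAX_BYTES = 50 * 1024 * 1024  # 50 MB upload limit mentioned above

def check_ocr_upload_size(path: str) -> None:
    # Fail early instead of letting the API reject an oversized file.
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        raise ValueError(f"{path} is {size} bytes, above the 50 MB OCR limit")

check_ocr_upload_size("uploaded_file.pdf")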
