Last active
April 24, 2025 04:28
Extract structured data in JSON from a PDF using an LLM (like Claude) via the AWS SDK with Bedrock
# Query an LLM like Claude using AWS Bedrock to extract structured data from a PDF
#
# https://community.aws/content/2i4v2vZRb9YgL2RxkawPiF8f0lZ/using-document-chat-with-the-amazon-bedrock-converse-api?lang=en#read-a-document-and-add-it-to-a-message
# https://community.aws/content/2hWA16FSt2bIzKs0Z1fgJBwu589/generating-json-with-the-amazon-bedrock-converse-api
require 'aws-sdk-bedrockruntime'
require 'json'

class PdfExtractor
  # https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html
  CLAUDE_3_5_HAIKU_MODEL_ID = 'anthropic.claude-3-5-haiku-20241022-v1:0'

  DEFAULT_PROMPT = <<~PROMPT
    You are an expert at extracting invoice information from PDFs.
    Using the extract_invoice_information tool, extract the invoice information.
    Some important notes:
    - Pay special attention to table structures that might contain invoice details
    - Most PDF documents will only have one invoice, but some may have multiple, hence the array return type
    - All numeric values must be floats without commas. For example, $79,372.50 should be written as 79372.5
    If you're unsure about any value, mark it as null and explain why in the reasoning.
  PROMPT
  # Tool definition in the Converse API format; the JSON schema describes the
  # structured output the model is asked to produce.
  TOOL_LIST = [
    {
      tool_spec: {
        name: "extract_invoice_information",
        description: "Extract invoice information from a PDF.",
        input_schema: {
          json: {
            type: "object",
            properties: {
              invoice_information: {
                type: "array",
                items: {
                  type: "object",
                  properties: {
                    invoice_date: {
                      type: "string",
                      description: "The date when the invoice was issued (YYYY-MM-DD format)"
                    },
                    invoice_amount: {
                      type: "number",
                      description: "The total amount of the invoice"
                    },
                    confidence: {
                      type: "string",
                      description: "Confidence level in the extraction",
                      enum: %w[high medium low]
                    }
                  },
                  required: %w[
                    invoice_date
                    invoice_amount
                    confidence
                  ]
                }
              }
            },
            required: %w[invoice_information]
          }
        }
      }
    }
  ]

  def initialize
    @bedrock = Aws::BedrockRuntime::Client.new(
      region: "us-west-2",
      access_key_id: ENV.fetch('AWS_ACCESS_KEY_ID', nil),
      secret_access_key: ENV.fetch('AWS_SECRET_ACCESS_KEY', nil)
    )
  end

  # Note: pdf_url is read as a local file path, despite its name.
  def analyze_pdf(
    pdf_url: "spec/fixtures/invoice-sample.pdf",
    prompt: DEFAULT_PROMPT,
    tool_list: TOOL_LIST,
    model_id: CLAUDE_3_5_HAIKU_MODEL_ID
  )
    response = @bedrock.converse(
      model_id: model_id,
      messages: [
        {
          role: "user",
          content: [
            {
              document: {
                name: "Document 1",
                format: "pdf",
                source: {
                  # Read in binary mode so the PDF bytes are passed through unchanged
                  bytes: File.binread(pdf_url)
                }
              }
            },
            { text: prompt }
          ]
        }
      ],
      # tool_config is a top-level parameter of converse, not a key of a message
      tool_config: {
        tools: tool_list,
        # Force the model to call the first tool in the list
        tool_choice: {
          tool: {
            name: tool_list.first[:tool_spec][:name]
          }
        }
      }
    )

    # deep_symbolize_keys comes from ActiveSupport; this class assumes a Rails app
    response.output.message.content.first.tool_use.input.deep_symbolize_keys[:invoice_information]
  rescue Aws::BedrockRuntime::Errors::ServiceError => e
    Rails.logger.error("AWS Bedrock error: #{e.message}")
    raise "Failed to analyze PDF: #{e.message}"
  end
end
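
The tool schema above asks for a YYYY-MM-DD date, a numeric amount, and a confidence of high, medium, or low, but a model may not always honor those constraints, so a caller might want to check each returned row before using it. The following is a minimal sketch of such a check, assuming rows are symbol-keyed hashes like the ones analyze_pdf returns. The `InvoiceRowCheck` module and the sample rows are hypothetical and not part of the original gist.

```ruby
require 'date'

# Hypothetical helper (not part of the gist): checks one extracted row against
# the constraints stated in the extract_invoice_information tool schema.
module InvoiceRowCheck
  CONFIDENCE_LEVELS = %w[high medium low].freeze
  ISO_DATE = /\A(\d{4})-(\d{2})-(\d{2})\z/

  # Returns a list of problems; an empty list means the row passed every check.
  def self.problems(row)
    issues = []

    # The date must match YYYY-MM-DD and be a real calendar date.
    match = ISO_DATE.match(row[:invoice_date].to_s)
    unless match && Date.valid_date?(*match.captures.map(&:to_i))
      issues << "invoice_date is not a valid YYYY-MM-DD date"
    end

    # The amount must already be a number, not a formatted string.
    unless row[:invoice_amount].is_a?(Numeric)
      issues << "invoice_amount is not numeric"
    end

    # The confidence must be one of the enum values from the schema.
    unless CONFIDENCE_LEVELS.include?(row[:confidence])
      issues << "confidence is not high, medium, or low"
    end

    issues
  end
end

# Example with hand-written rows (made-up values, not real model output).
rows = [
  { invoice_date: "2024-03-15", invoice_amount: 79372.5, confidence: "high" },
  { invoice_date: "15/03/2024", invoice_amount: "79,372.50", confidence: "certain" }
]
rows.each { |row| puts InvoiceRowCheck.problems(row).inspect }
```

Returning a list of problems instead of raising lets the caller decide whether to reject the row, retry the extraction, or flag it for manual review.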