@jkotchoff
Last active April 24, 2025 04:28
Extract structured data in JSON from a PDF using an LLM (like Claude) via the AWS SDK with Bedrock
# Query an LLM like Claude using AWS Bedrock to extract structured data from a PDF
#
# https://community.aws/content/2i4v2vZRb9YgL2RxkawPiF8f0lZ/using-document-chat-with-the-amazon-bedrock-converse-api?lang=en#read-a-document-and-add-it-to-a-message
# https://community.aws/content/2hWA16FSt2bIzKs0Z1fgJBwu589/generating-json-with-the-amazon-bedrock-converse-api
require 'aws-sdk-bedrockruntime'
require 'json'

class PdfExtractor
  # https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html
  CLAUDE_3_5_HAIKU_MODEL_ID = 'anthropic.claude-3-5-haiku-20241022-v1:0'
  DEFAULT_PROMPT = <<~PROMPT
    You are an expert at extracting invoice information from PDFs.
    Using the extract_invoice_information tool, extract the invoice information.
    Some important notes:
    - Pay special attention to table structures that might contain invoice details
    - Most PDF documents will only have one invoice, but some may have multiple, hence the array return type
    - All numeric values must be floats without commas. For example, $79,372.50 should be written as 79372.5
    If you're unsure about any value, give your best reading and set confidence to low.
  PROMPT
  TOOL_LIST = [
    {
      tool_spec: {
        name: "extract_invoice_information",
        description: "Extract invoice information from a PDF.",
        input_schema: {
          json: {
            type: "object",
            properties: {
              invoice_information: {
                type: "array",
                items: {
                  type: "object",
                  properties: {
                    invoice_date: {
                      type: "string",
                      description: "The date when the invoice was issued (YYYY-MM-DD format)"
                    },
                    invoice_amount: {
                      type: "number",
                      description: "The total amount of the invoice"
                    },
                    confidence: {
                      type: "string",
                      description: "Confidence level in the extraction",
                      enum: %w[high medium low]
                    },
                  },
                  required: %w[
                    invoice_date
                    invoice_amount
                    confidence
                  ]
                }
              }
            },
            required: %w[invoice_information]
          }
        }
      }
    }
  ]

  def initialize
    @bedrock = Aws::BedrockRuntime::Client.new(
      region: "us-west-2",
      access_key_id: ENV.fetch('AWS_ACCESS_KEY_ID', nil),
      secret_access_key: ENV.fetch('AWS_SECRET_ACCESS_KEY', nil),
    )
  end
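
  # pdf_url is read with File.binread, so it must be a local file path, not a remote URL.
  def analyze_pdf(pdf_url: "spec/fixtures/invoice-sample.pdf", prompt: DEFAULT_PROMPT, tool_list: TOOL_LIST, model_id: CLAUDE_3_5_HAIKU_MODEL_ID)
    response = @bedrock.converse(
      model_id: model_id,
      messages: [
        {
          role: "user",
          content: [
            {
              document: {
                name: "Document 1",
                format: "pdf",
                source: {
                  # Read in binary mode so the PDF bytes are not treated as text
                  bytes: File.binread(pdf_url)
                }
              }
            },
            { text: prompt }
          ]
        }
      ],
      # tool_config is a top-level converse parameter, not part of a message
      tool_config: {
        tools: tool_list,
        # Force the model to call the first tool in the list
        tool_choice: {
          tool: {
            name: tool_list.first[:tool_spec][:name]
          }
        }
      }
    )
    # deep_symbolize_keys comes from ActiveSupport
    response.output.message.content.first.tool_use.input.deep_symbolize_keys[:invoice_information]
  rescue Aws::BedrockRuntime::Errors::ServiceError => e
    # Rails.logger assumes this class runs inside a Rails app
    Rails.logger.error("AWS Bedrock error: #{e.message}")
    raise "Failed to analyze PDF: #{e.message}"
  end
end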
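
The tool schema above asks for an ISO-style date string, a plain float, and one of three confidence labels, but nothing on the Ruby side checks that the model complied. A minimal sketch of such a check follows, using only the standard library. `normalize_invoices` and `ALLOWED_CONFIDENCE` are hypothetical names that are not part of the gist, and the sample hash is made up to match the schema rather than taken from a real model response.

```ruby
require 'date'

# Hypothetical helper, not part of the gist: checks the shape requested by the
# extract_invoice_information schema and converts values to Ruby types.
ALLOWED_CONFIDENCE = %w[high medium low].freeze

def normalize_invoices(tool_input)
  # Accept either string or symbol keys at the top level.
  invoices = tool_input['invoice_information'] || tool_input[:invoice_information]
  raise ArgumentError, 'invoice_information must be an array' unless invoices.is_a?(Array)

  invoices.map do |invoice|
    row = invoice.transform_keys(&:to_sym)
    confidence = row[:confidence].to_s
    unless ALLOWED_CONFIDENCE.include?(confidence)
      raise ArgumentError, "unexpected confidence: #{confidence.inspect}"
    end

    {
      invoice_date: Date.iso8601(row.fetch(:invoice_date)), # raises if not an ISO 8601 date
      invoice_amount: Float(row.fetch(:invoice_amount)),    # raises on strings like "79,372.50"
      confidence: confidence
    }
  end
end

# Made-up sample input shaped like the schema (not real model output).
sample = {
  'invoice_information' => [
    { 'invoice_date' => '2024-03-15', 'invoice_amount' => 79372.5, 'confidence' => 'high' }
  ]
}

result = normalize_invoices(sample)
```

In a Rails app the same check could be run on the value returned by `analyze_pdf` by wrapping it as `{ invoice_information: ... }`; raising on bad input is one design choice, and collecting errors per invoice would work as well.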