Created April 3, 2025 06:28
Extract structured data from a PDF using an LLM (like Claude) via the AWS SDK with Bedrock
# Query an LLM like Claude using AWS Bedrock to extract structured data from a PDF
require 'aws-sdk-bedrockruntime'
require 'json'

class PdfExtractor
  # https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html
  CLAUDE_3_5_HAIKU_MODEL_ID = 'anthropic.claude-3-5-haiku-20241022-v1:0'

  DEFAULT_PROMPT = <<~PROMPT
    You are an expert at extracting invoice information from PDFs.
    Extract the following information from this document in JSON format:
    - date_of_invoice: The date when the invoice was issued (YYYY-MM-DD format)
    - invoice_number: The invoice number
    - invoice_amount: The total amount of the invoice
    ...
    Some important notes:
    - Pay special attention to table structures that might contain transaction details
    - Always return an array of extracted data in the JSON. Most PDF documents will only have one transaction, but some may have multiple
    - Convert all comma-separated numbers like 424,405 to integers like 424405
    If you're unsure about any value, mark it as null and explain why in the reasoning.
    Return the JSON data structure only.
  PROMPT

  def initialize
    # Credentials are read from the environment; the region is hard-coded.
    @bedrock = Aws::BedrockRuntime::Client.new(
      region: "us-west-2",
      access_key_id: ENV.fetch('AWS_ACCESS_KEY_ID', nil),
      secret_access_key: ENV.fetch('AWS_SECRET_ACCESS_KEY', nil),
    )
  end

  # Sends the PDF and the prompt to the model and returns the parsed JSON.
  # Despite its name, pdf_url is a path to a local file.
  def analyze_pdf(pdf_url: "spec/fixtures/invoice-sample.pdf", prompt: DEFAULT_PROMPT, model_id: CLAUDE_3_5_HAIKU_MODEL_ID)
    response = @bedrock.converse(
      model_id: model_id,
      messages: [
        {
          role: "user",
          content: [
            {
              document: {
                name: "Document 1",
                format: "pdf",
                source: {
                  # PDFs are binary, so read raw bytes rather than text
                  bytes: File.binread(pdf_url)
                }
              }
            },
            { text: prompt }
          ]
        }
      ]
    )

    # The first content block of the reply is expected to hold the JSON text
    JSON.parse(response.output.message.content.first.text)
  rescue Aws::BedrockRuntime::Errors::ServiceError => e
    # Rails.logger assumes this class is loaded inside a Rails application
    Rails.logger.error("AWS Bedrock error: #{e.message}")
    raise "Failed to analyze PDF: #{e.message}"
  rescue JSON::ParserError => e
    Rails.logger.error("JSON parsing error: #{e.message}")
    raise "Failed to parse Bedrock response: #{e.message}"
  end
end
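
The prompt asks for JSON only and for comma-separated numbers to be converted to integers, but a model may not always comply: replies sometimes arrive wrapped in a Markdown code fence or with a sentence of prose around the JSON, and amounts may still come back as strings like "424,405". The following is a minimal sketch of defensive post-processing, not part of the original gist. Both `parse_model_json` and `normalize_amount` are hypothetical helper names introduced here for illustration.

```ruby
require 'json'

# Hypothetical helper (not part of the gist): parse JSON from a model reply
# that may be wrapped in a Markdown code fence or surrounded by prose.
def parse_model_json(text)
  # Prefer the contents of a fenced block (three backticks, optional "json").
  fenced = text[/`{3}(?:json)?\s*(.*?)`{3}/m, 1]
  candidate = (fenced || text).strip
  # Otherwise fall back to the outermost [...] or {...} span in the text.
  candidate = candidate[/[\[{].*[\]}]/m] || candidate
  JSON.parse(candidate)
end

# Hypothetical helper: turn a string such as "424,405" into a number.
# Non-strings (nil, Integer, Float) and unparseable strings are returned as-is.
def normalize_amount(value)
  return value unless value.is_a?(String)

  cleaned = value.delete(', ')
  # Base 10 is explicit so a leading zero is not read as octal.
  Integer(cleaned, 10, exception: false) ||
    Float(cleaned, exception: false) ||
    value
end
```

Under these assumptions, the last line of `analyze_pdf` could call `parse_model_json(response.output.message.content.first.text)` instead of `JSON.parse` directly, and `normalize_amount` could be applied to each record's `invoice_amount`. The existing `JSON::ParserError` rescue still covers the case where no valid JSON can be recovered.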