@jkotchoff
Created April 3, 2025 06:28
Extract structured data from a PDF using an LLM (like Claude) via the AWS SDK with Bedrock
# Query an LLM like Claude using AWS Bedrock to extract structured data from a PDF
require 'aws-sdk-bedrockruntime'
require 'json'

class PdfExtractor
  # Model IDs: https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html
  CLAUDE_3_5_HAIKU_MODEL_ID = 'anthropic.claude-3-5-haiku-20241022-v1:0'
  DEFAULT_PROMPT = <<~PROMPT
    You are an expert at extracting invoice information from PDFs.
    Extract the following information from this document in JSON format:
    - date_of_invoice: The date when the invoice was issued (YYYY-MM-DD format)
    - invoice_number: The invoice number
    - invoice_amount: The total amount of the invoice
    ...
    Some important notes:
    - Pay special attention to table structures that might contain transaction details
    - Always return an array of extracted data in the JSON - most PDF documents will only have one transaction, but some may have multiple
    - Convert comma-separated numbers like 424,405 to integers like 424405
    If you're unsure about any value, mark it as null and explain why in the reasoning.
    Return the JSON data structure only
  PROMPT

  def initialize
    # Credentials are read from the environment.
    @bedrock = Aws::BedrockRuntime::Client.new(
      region: "us-west-2",
      access_key_id: ENV.fetch('AWS_ACCESS_KEY_ID', nil),
      secret_access_key: ENV.fetch('AWS_SECRET_ACCESS_KEY', nil),
    )
  end

  # Despite its name, pdf_url is a path to a local file.
  def analyze_pdf(pdf_url: "spec/fixtures/invoice-sample.pdf", prompt: DEFAULT_PROMPT, model_id: CLAUDE_3_5_HAIKU_MODEL_ID)
    response = @bedrock.converse(
      model_id: model_id,
      messages: [
        {
          role: "user",
          content: [
            {
              document: {
                name: "Document 1",
                format: "pdf",
                source: {
                  # Read in binary mode so the PDF bytes are sent unmodified.
                  bytes: File.binread(pdf_url)
                }
              }
            },
            { text: prompt }
          ]
        }
      ]
    )
    # The model's reply is expected to be a bare JSON document.
    JSON.parse(response.output.message.content.first.text)
  rescue Aws::BedrockRuntime::Errors::ServiceError => e
    Rails.logger.error("AWS Bedrock error: #{e.message}")
    raise "Failed to analyze PDF: #{e.message}"
  rescue JSON::ParserError => e
    Rails.logger.error("JSON parsing error: #{e.message}")
    raise "Failed to parse Bedrock response: #{e.message}"
  end
end
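
The prompt asks for the JSON data structure only, but a model may still wrap its reply in a Markdown code fence, and `JSON.parse` would then raise `JSON::ParserError`. The sketch below shows one way to tolerate that. `ResponseJson` and its `parse` method are hypothetical names, not part of the original gist, and it assumes the reply is either bare JSON or a single fenced block with nothing around it.

```ruby
require 'json'

# Hypothetical helper (not in the original gist): parses a model reply that
# is either bare JSON or JSON wrapped in a single Markdown code fence.
module ResponseJson
  # Matches a reply that consists solely of one fenced block, with an
  # optional "json" language tag after the opening fence.
  FENCE = /\A`{3}(?:json)?\s*\n(.*?)\n?`{3}\s*\z/m

  def self.parse(text)
    stripped = text.to_s.strip
    match = FENCE.match(stripped)
    # Use the fenced payload when a fence is present, else the whole reply.
    payload = match ? match[1] : stripped
    JSON.parse(payload)
  end
end
```

If this fits your use case, the `JSON.parse(...)` call in `analyze_pdf` could be swapped for `ResponseJson.parse(...)`. The existing `rescue JSON::ParserError` branch would still handle replies that are not valid JSON.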