@vjt
Created November 19, 2011 12:21
Parse a Word OpenXML document file with a predefined structure
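#!/usr/bin/env ruby
# Print each table row of a .docx file as one comma-separated line.
# A cell containing a hyperlink yields the link target before the cell text.
require 'nokogiri'
# rubyzip < 1.0 API; rubyzip >= 1.0 renamed these to 'zip/filesystem' and Zip::File.
require 'zip/zipfilesystem'

file = ARGV[0] or abort("Usage: #$PROGRAM_NAME <file.docx>")

Zip::ZipFile.open(file) do |zip|
  doc  = Nokogiri::XML(zip.file.read('word/document.xml'))
  # The relationships part maps ids (such as a hyperlink's r:id) to targets.
  rels = Nokogiri::XML(zip.file.read('word/_rels/document.xml.rels'))

  doc.xpath('//w:tr').each do |row|
    fields = row.xpath('.//w:tc').map do |cell|
      [].tap do |ret|
        if (link = cell.xpath('.//w:hyperlink').first)
          # Try both spellings, since namespaced attribute lookup may differ
          # between Nokogiri versions. Internal anchor links carry no id.
          id  = link['r:id'] || link['id']
          rel = id && rels.at_css(%(Relationship[Id="#{id}"]))
          ret << rel['Target'] if rel
        end
        ret << cell.xpath('.//w:t').text
      end
    end

    # Array#join flattens the nested per-cell arrays.
    puts fields.join(',')
  end
end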