@fervisa
Created May 22, 2014 00:20
Cleaning converted docx -> html file
require 'rubygems'
require 'sanitize'
require 'nokogiri-styles'
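# Transformer: remove elements that contain nothing but whitespace text.
no_whitespaces = lambda { |env|
  node = env[:node]
  return unless node.elem?
  # Keep the node if any child is a non-text node or non-blank text.
  unless node.children.any? { |c| !c.text? || c.content.strip.length > 0 }
    node.unlink
  end
}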
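# Transformer: turn the first row of each table into a gray header row.
# It is not registered in :transformers below; to use it, add it there and
# add 'th' to :elements so the renamed cells are not stripped.
first_table_row = lambda { |env|
  node = env[:node]
  return unless node.elem?
  if node.name == 'table'
    # Look up the first <tr> instead of children.first, which may be a
    # whitespace text node or a <tbody>.
    tr = node.at_css('tr')
    return unless tr
    tr.element_children.each do |td|
      td.name = 'th'
    end
    tr['class'] = 'gray'
  end
}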
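# Transformer: map inline background colors on cells and rows to CSS
# classes, then drop the inline style attribute.
color_cell = lambda { |env|
  node = env[:node]
  return if !node.elem? || !%w(td tr).member?(node.name)
  node['class'] = 'blue' if node.styles['background'] == '#99CCFF'
  node['class'] = 'gray' if %w(#737373 gray).member?(node.styles['background'])
  node.delete 'style'
  # Whitelist the node so Sanitize's built-in cleaning leaves it as is.
  return { :node_whitelist => [node] }
}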
html = File.read('Contractor Monitoring Form.htm')
result = Sanitize.clean(
  html,
  attributes: {
    # Sanitize matches attribute names as strings, so list them as strings.
    all: ['class', 'colspan']
  },
  elements: ['html', 'body', 'table', 'tbody', 'tr', 'td', 'div', 'style', 'title'],
  transformers: [no_whitespaces, color_cell]
)
File.open('converted.html', 'w') { |file| file.write result }
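
The color_cell transformer relies on nokogiri-styles to read the background property from each node's inline style. The following is a dependency-free sketch of that same style-to-class mapping, handy for checking which class a given style string would get. The names BACKGROUND_CLASSES, parse_inline_style, and class_for_style are hypothetical and not part of the gist, and unlike the script above the sketch compares colors case-insensitively.

```ruby
# Hypothetical lookup table mirroring the colors used by color_cell.
BACKGROUND_CLASSES = {
  '#99ccff' => 'blue',
  '#737373' => 'gray',
  'gray'    => 'gray'
}.freeze

# Parse an inline style string such as "background:#99CCFF; width:10pt"
# into a hash of lowercase property names to trimmed values.
def parse_inline_style(style)
  style.to_s.split(';').each_with_object({}) do |declaration, styles|
    property, value = declaration.split(':', 2)
    next if property.nil? || value.nil?
    styles[property.strip.downcase] = value.strip
  end
end

# Return the CSS class for a style string, or nil when the background is
# missing or is not one of the known colors. Assumption: matching is
# case-insensitive, which the original transformer does not do.
def class_for_style(style)
  background = parse_inline_style(style)['background']
  background && BACKGROUND_CLASSES[background.downcase]
end

p class_for_style('background:#99CCFF; width:10pt') # => "blue"
p class_for_style('background: gray')               # => "gray"
p class_for_style('color:red')                      # => nil
```

Keeping the mapping in one hash means a new Word color only needs one new entry, rather than another conditional in the transformer.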