Created
February 2, 2011 01:42
Requires the loofah gem to be installed (gem install loofah)
desc <<END
Find all HTML files under the specified directory (recursively) and clean them: remove comments, whitespace-only text nodes, and leading/trailing whitespace (including carriage returns) from text.
Before using, install the loofah gem:
gem install loofah
Usage:
rake clean_html DIR=my_dir
Warning: files are overwritten in place. By default, DIR points to the current directory: #{Dir.pwd}
END
task 'clean_html' do
  require 'loofah'
  require 'active_support/core_ext/string' # provides String#blank?

  clean = Loofah::Scrubber.new do |node|
    if node.type == Nokogiri::XML::Node::COMMENT_NODE
      # Drop HTML comments entirely.
      node.remove
    elsif node.name == 'pre'
      # Whitespace inside <pre> is significant.
      Loofah::Scrubber::STOP # don't bother with the rest of the subtree
    elsif node.type == Nokogiri::XML::Node::TEXT_NODE
      if node.content.blank?
        # Whitespace-only text node.
        node.remove
      else
        node.content = node.content.strip
      end
    end
  end
  ENV['DIR'] ||= Dir.pwd
  files = Dir["#{ENV['DIR']}/**/*.html"]
  files.each do |file_path|
    puts "parse file: #{file_path}"
    # File.read and File.write close the file handle for us.
    html = File.read(file_path)
    clear_tree = Loofah.document(html).scrub!(clean)
    clear_html = clear_tree.to_html(:indent_text => '', :indent => 0, :save_with => 0)
    File.write(file_path, clear_html)
  end
end
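Because the task rewrites every matched file in place and DIR falls back to the current directory, it can help to list the files first. The following is a minimal sketch of such a dry run using only the Ruby standard library. The helper name `html_files_under` is hypothetical and not part of the gist; it assumes the same `**/*.html` glob and the same DIR fallback as the task.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper (not in the original gist): returns the files the
# clean_html task would rewrite, using the same glob and DIR fallback.
def html_files_under(dir = ENV['DIR'] || Dir.pwd)
  Dir.glob(File.join(dir, '**', '*.html')).sort
end

# Demo in a throwaway directory so no real files are touched.
Dir.mktmpdir do |tmp|
  FileUtils.mkdir_p(File.join(tmp, 'pages', 'nested'))
  File.write(File.join(tmp, 'index.html'), '<p>hi</p>')
  File.write(File.join(tmp, 'pages', 'nested', 'about.html'), '<p>about</p>')
  File.write(File.join(tmp, 'notes.txt'), 'not html')

  # Print paths relative to the demo directory; notes.txt is skipped.
  html_files_under(tmp).each { |path| puts path.sub("#{tmp}/", '') }
end
```

Calling `html_files_under` with no argument from the directory where you would run rake shows what `rake clean_html` would touch when DIR is not set.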