Created July 3, 2009 07:23
A Ruby snippet for parsing and cleaning Word HTML
#
# This function takes messy Word HTML pasted into a WYSIWYG and cleans it up
# It leaves the tags and attributes specified in the params
# Copyright (c) 2009, Radio New Zealand
# Released under the MIT license
require 'rubygems'
require 'sanitize'
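def clean_up_word_html(html, elements = ['p', 'b', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'], attributes = {})
  email_regex = /<p>Email:\s+((\w|\-|\_|\.)+\@((\w|\-|\_)+\.)+[a-zA-Z]{2,})/i
  # strip line breaks on a copy, so the caller's string is not mutated
  # (a "|" inside a character class is a literal pipe, so it is left out here)
  html = html.gsub(/[\n\r]/, '')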
  # keep only the things we want.
  # (Sanitize.clean is the API of the time; Sanitize 3.0 and later name this call Sanitize.fragment)
  html = Sanitize.clean(html, :elements => elements, :attributes => attributes)
  # turn non-breaking spaces (entity or literal U+00A0) into plain spaces
  html.gsub!(/&nbsp;|\u00A0/, ' ')
  # butt up any tags
  html.gsub!(/>\s+</, '><')
  # remove email address lines
  html.gsub!(email_regex, '<p>')
  # post-sanitize cleanup of empty blocks
  # the order of removal is important - this is the way Word stacks these elements
  html.gsub!(/<i><\/i>/, '')
  html.gsub!(/<b><\/b>/, '')
  html.gsub!(/<\/b><b>/, '')
  html.gsub!(/<p><\/p>/, '')
  html.gsub!(/<p><b><\/b><\/p>/, '')
  # misc - fix butted times ("9am " becomes "9 am ")
  html.gsub!(/(\d)am /, '\1 am ')
  html.gsub!(/(\d)pm /, '\1 pm ')
  # misc - collapse multiple spaces that may cause doc-specific regexes to fail (in dates, for example)
  html.gsub!(/\s+/, ' ')
  # add new lines after block-level closing tags and after an opening <dl>
  html.gsub!(/<\/(p|h\d|dt|dd|dl)>/, "</\\1>\n")
  html.gsub!(/<dl>/, "<dl>\n")
  html
end
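
A small worked example may make the effect of the regex passes easier to see. The sketch below is not part of the original gist: `tidy_sanitized_html` is a hypothetical helper that repeats only the post-sanitize cleanup steps, so it runs without the `sanitize` gem, and it assumes its input has already been reduced to the allowed tags. The sample string is invented for illustration.

```ruby
# Hypothetical helper (not in the original gist): repeats the regex passes
# that clean_up_word_html applies around the Sanitize call. Assumes the
# input already contains only the allowed tags.
def tidy_sanitized_html(html)
  html = html.gsub(/[\n\r]/, '')              # drop line breaks
  html = html.gsub(/&nbsp;|\u00A0/, ' ')      # non-breaking spaces -> plain spaces
  html = html.gsub(/>\s+</, '><')             # butt up adjacent tags
  # remove empty blocks, innermost first
  html = html.gsub(/<b><\/b>/, '')
  html = html.gsub(/<\/b><b>/, '')            # merge back-to-back bold runs
  html = html.gsub(/<p><\/p>/, '')
  html = html.gsub(/(\d)(am|pm) /, '\1 \2 ')  # "9am " -> "9 am "
  html = html.gsub(/\s+/, ' ')                # collapse runs of whitespace
  html.gsub(/<\/(p|h\d)>/, "</\\1>\n")        # one block element per line
end

# Invented sample input, shaped like sanitized Word output.
SAMPLE = "<h2>News</h2>\r\n<p><b>On air</b><b> at 9am  today</b></p> <p><b></b></p>"
CLEANED = tidy_sanitized_html(SAMPLE)
puts CLEANED
# <h2>News</h2>
# <p><b>On air at 9 am today</b></p>
```

The empty `<p><b></b></p>` paragraph disappears because the inner `<b></b>` is removed first, leaving a `<p></p>` for the following pass to catch. This is why the order of the removals matters.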