Last active
September 15, 2019 04:34
Ruby helpers that filter unwanted class names or style properties out of HTML attributes, including an example that strips Microsoft Office classes and styles
| class String | |
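| # Rewrites every attribute_key="..." attribute in place: splits the value on args[:split], | |
| # drops the items matching args[:match], and rejoins the rest with args[:join]. | |
| def filter_html_attribute(attribute_key, args={}) | |
| self.gsub!(/#{attribute_key}="([^"]*)"/) do | |
| attribute_value = $1.split(args[:split]).delete_if {|i| i.match(args[:match]) }.join(args[:join]) | |
| attribute_value += args[:append].to_s unless attribute_value.empty? | |
| "#{attribute_key}=\"#{attribute_value}\"" | |
| end | |
| # Remove attributes that the filter left empty | |
| self.gsub!(/ (#{attribute_key})=""/, "") | |
| # gsub! returns nil when nothing was replaced, so return the string explicitly | |
| self | |
| end | |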
| def filter_html_class(args={}) | |
| args = {split: /\s+/, join: " "}.merge(args) | |
| self.filter_html_attribute("class", args) | |
| end | |
| def filter_html_style(args={}) | |
| args = {split: /\s*;\s*/, join: "; ", append: ";"}.merge(args) | |
| self.filter_html_attribute("style", args) | |
| end | |
| end |
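The core of `filter_html_attribute` is a split, reject, join pipeline applied to a single attribute value. The sketch below isolates that step as a standalone method so it can be tried without reopening `String`. The name `filter_attribute_value` and its keyword arguments are hypothetical and not part of the gist; they only mirror the `split`, `match`, `join`, and `append` options used above.

```ruby
# Hypothetical helper (not in the original gist): applies the same
# split -> reject -> join steps to one attribute value.
def filter_attribute_value(value, split:, match:, join:, append: "")
  kept = value.split(split).reject { |item| item.match?(match) }
  # An empty result means the whole attribute should be dropped by the caller.
  return "" if kept.empty?
  kept.join(join) + append
end

# Class names are separated by whitespace.
puts filter_attribute_value("first MsoNormal last", split: /\s+/, match: /\AMso/, join: " ")
# prints: first last

# Style declarations are separated by semicolons and end with one.
puts filter_attribute_value("color:red; mso-x:none;", split: /\s*;\s*/, match: /\Amso/, join: "; ", append: ";")
# prints: color:red;
```

Returning an empty string for a fully filtered value corresponds to the second `gsub!` in the gist, which removes attributes that end up as `class=""` or `style=""`.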
| html = <<HTML | |
| <p class="MsoNormal">Lorem ipsum dolor sit amet</p> | |
| <p class="first MsoNormal last">Lorem ipsum dolor sit amet</p> | |
| <p class="first MsoNormal">Lorem ipsum dolor sit amet</p> | |
| <p class="3DMsoNormal last"><span style="mso-border-insideh:none;mso-border-insidev:none;">Lorem ipsum dolor sit amet</span></p> | |
| <p><span style="color:red; mso-border-insideh:none;mso-border-insidev:none; background: white; ">Lorem ipsum dolor sit amet</span></p> | |
| HTML | |
| puts html | |
| class String | |
| # Filter out Microsoft Office classes and styles | |
| def filter_out_mso | |
| self.filter_html_class(match: /(M|m)so\S+/) | |
| self.filter_html_style(match: /^(M|m)so/) | |
| end | |
| end | |
| puts html.filter_out_mso | |
| # Result: | |
| # <p>Lorem ipsum dolor sit amet</p> | |
| # <p class="first last">Lorem ipsum dolor sit amet</p> | |
| # <p class="first">Lorem ipsum dolor sit amet</p> | |
| # <p class="last"><span>Lorem ipsum dolor sit amet</span></p> | |
| # <p><span style="color:red; background: white;">Lorem ipsum dolor sit amet</span></p> |
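The helpers above reopen `String` and modify the receiver in place. Where that is undesirable, the same filtering could be written as a non-mutating module, as in the minimal sketch below. `MsoFilter`, `filter_attribute`, and `strip_mso` are hypothetical names introduced for illustration; this is an alternative under those assumptions, not the gist author's implementation.

```ruby
# Hypothetical module (not in the original gist): a non-mutating variant
# that leaves String untouched and always returns a new string.
module MsoFilter
  module_function

  # Rewrites each ` name="..."` attribute, keeping only the items that do
  # not match `reject`; attributes left with no items are removed entirely.
  def filter_attribute(html, name, split:, join:, reject:, append: "")
    html.gsub(/ #{Regexp.escape(name)}="([^"]*)"/) do
      kept = Regexp.last_match(1).split(split).reject { |item| item.match?(reject) }
      kept.empty? ? "" : %( #{name}="#{kept.join(join)}#{append}")
    end
  end

  # Strips Microsoft Office class names and style properties.
  def strip_mso(html)
    html = filter_attribute(html, "class", split: /\s+/, join: " ", reject: /[Mm]so\S+/)
    filter_attribute(html, "style", split: /\s*;\s*/, join: "; ", reject: /\A[Mm]so/, append: ";")
  end
end

input = '<p class="first MsoNormal"><span style="color:red; mso-border-insideh:none;">Lorem</span></p>'
puts MsoFilter.strip_mso(input)
# prints: <p class="first"><span style="color:red;">Lorem</span></p>
```

Because `gsub` (without the bang) returns a new string even when nothing matches, callers of this variant never receive `nil` and `input` itself is left unchanged.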