@jrochkind
Created October 15, 2012 17:19
Truncating html with nokogiri, with/without Rails

Rails has a handy truncate helper (mostly implemented as a method added to String), but its documentation warns that it's not safe to use on HTML source: it can cut off end tags and the like.

What if you want an HTML-safe one? There are a variety of suggested solutions you can google, none of which were quite robust or powerful enough for me.

So I started with my favorite, by Andrea Singh, using nokogiri.

But:

  • I modified it to be a static method instead of monkey-patching Nokogiri (sadly making already confusing code more confusing still, but I didn't want to monkey-patch Nokogiri).

  • I made it smarter about putting the mark of omission inside the tag whose text ended up truncated, instead of at the end of the source. This is not perfect either, but works well enough for most common use cases.

  • I made it handle the Rails separator option (spelled :seperator throughout this code). Again, far from perfectly: it will often break at a tag boundary instead of at the actual best separator, but in ways that should be good enough for most common use cases (tag boundaries are usually good breaking points too).

  • I made the top-level invocation method a Rails helper that handles both HTML-safe and ordinary truncation: if the string is marked html_safe, it uses HTML-safe truncation and returns a string that is still html_safe; otherwise it falls back to the ordinary Rails truncate.

  • I added some tests (my tests run at the rails-helper method level, because that was convenient for me).
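The separator handling described above comes down to a little index arithmetic on a single run of text. Here is a minimal pure-Ruby sketch of that step, with no Nokogiri involved. `cut_text` is a hypothetical name used only for illustration and is not part of the gist; the sample values mirror the gist's own tests, where 5 of the 10 allowed characters are already used up before the text in question.

```ruby
# Hypothetical helper (not part of the gist): truncate one run of plain text.
def cut_text(text, max_length, omission = "…", separator = nil)
  return text if text.length <= max_length

  # Leave room for the omission marker, but never go below zero.
  endpoint = [0, max_length - omission.length].max

  if separator
    # Back up to the last separator at or before the endpoint, if there is one.
    endpoint = text.rindex(separator, endpoint) || endpoint
  end

  text[0, endpoint] + omission
end

puts cut_text("67 901234", 5, "…", " ")  # prints 67…
puts cut_text("678901234", 5, "…", " ")  # prints 6789…
puts cut_text("67 901234", 5)            # prints 67 9…
```

When no separator is found before the endpoint, the cut simply falls back to the plain character limit, which is why the second call still truncates.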

See the tests for what it does and doesn't do. It's not perfect, and there are a variety of different implementation or API choices that could be made, but it's good enough for me, and for others with use cases like mine it is possibly better than anything else easily findable on the net.

If there's a lot of interest, I could turn this into an actual gem.

Ultimately, though, for use in Rails, what I think should really happen is for this functionality to be added to the Rails HTML sanitize helper. Times when you want to sanitize overlap extensively with times when you want to truncate (since both normally deal with HTML as 'input' to your program), and both require an HTML parse. Better to do the HTML parse just once for both functions than to do it once for sanitizing and again for truncating. (Rails sanitize doesn't use Nokogiri, but its own weird HTML parser.)

# encoding: UTF-8
# Nothing in here assumes Rails
require 'nokogiri'

module Util
  # An HTML-safe truncation using nokogiri, based off of:
  # http://blog.madebydna.com/all/code/2010/06/04/ruby-helper-to-cleanly-truncate-html.html
  #
  # but without monkey-patching, and with behavior more consistent with Rails
  # truncate.
  #
  # It's hard to get all the edge cases right. We probably miscalculate slightly
  # on some of them, and we aren't always able to strictly respect :seperator, sometimes
  # breaking on tag boundaries instead. But this should be good enough for actual use
  # cases, where those kinds of slightly-off results are still acceptable.
  #
  # Ruby 1.9 only; in 1.8.7 non-ASCII won't be handled quite right.
  #
  # Pass in a Nokogiri node, probably created with Nokogiri::HTML::DocumentFragment.parse(string)
  #
  # For efficiency, you might want to check the length of your string first: if, even
  # with HTML tags, it's still under the limit, there is no need to parse it and pass it in here.
  #
  # Get back a Nokogiri node; call #inner_html on it to go back to a string
  # (and you probably want to call .html_safe on the string you get back for use
  # in a Rails view).
  def self.nokogiri_truncate(node, max_length, omission = '…', seperator = nil)
    if node.kind_of?(::Nokogiri::XML::Text)
      if node.content.length > max_length
        # Leave room for the omission marker, but never go below zero.
        allowable_endpoint = [0, max_length - omission.length].max
        if seperator
          # Back up to the last separator before the endpoint, if there is one.
          allowable_endpoint = (node.content.rindex(seperator, allowable_endpoint) || allowable_endpoint)
        end
        # Text.new expects the owning document as its second argument.
        ::Nokogiri::XML::Text.new(node.content.slice(0, allowable_endpoint) + omission, node.document)
      else
        node.dup
      end
    else # DocumentFragment or Element
      return node if node.inner_text.length <= max_length

      # Start from an empty copy of this node and re-add children until the budget runs out.
      truncated_node = node.dup
      truncated_node.children.remove
      remaining_length = max_length

      node.children.each do |child|
        if remaining_length == 0
          # Budget used up exactly at a tag boundary: mark the omission here.
          truncated_node.add_child ::Nokogiri::XML::Text.new(omission, truncated_node.document)
          break
        elsif remaining_length < 0
          break
        end
        truncated_node.add_child nokogiri_truncate(child, remaining_length, omission, seperator)
        # can end up less than 0 if the child was truncated to fit, that's
        # fine:
        remaining_length = remaining_length - child.inner_text.length
      end
      truncated_node
    end
  end
end
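To make the budget walk above easier to follow, here is a toy pure-Ruby model of the same recursion. It is only a sketch and not the gist's code: instead of Nokogiri nodes it assumes a made-up tree shape where a text node is a String and an element is a `[tag_name, children]` pair (with a `nil` tag standing in for the fragment root), and the names `text_length`, `truncate_tree` and `render` are hypothetical. Separator handling is left out to keep it short.

```ruby
# Toy model: a text node is a String; an element is [tag_name, children].
def text_length(node)
  return node.length if node.is_a?(String)
  node[1].inject(0) { |total, child| total + text_length(child) }
end

def truncate_tree(node, max_length, omission = "…")
  if node.is_a?(String)
    return node if node.length <= max_length
    # Leave room for the omission marker, never going below zero.
    return node[0, [0, max_length - omission.length].max] + omission
  end
  return node if text_length(node) <= max_length

  tag, children = node
  kept = []
  remaining = max_length
  children.each do |child|
    if remaining == 0
      # Budget ran out exactly at a tag boundary: add the marker here.
      kept << omission
      break
    elsif remaining < 0
      break
    end
    kept << truncate_tree(child, remaining, omission)
    # May drop below zero when the child itself was truncated; that ends the walk.
    remaining -= text_length(child)
  end
  [tag, kept]
end

def render(node)
  return node if node.is_a?(String)
  tag, children = node
  inner = children.map { |child| render(child) }.join
  tag ? "<#{tag}>#{inner}</#{tag}>" : inner
end

# 123456<p><b>78901234567</b>890</p>
fragment = [nil, ["123456", ["p", [["b", ["78901234567"]], "890"]]]]
puts render(truncate_tree(fragment, 10))  # prints 123456<p><b>789…</b></p>
```

The "budget hits zero at a tag boundary" branch is what produces results such as `<p>1234567890…</p>`, where the marker pushes the visible text one character over the limit.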
# encoding: UTF-8
require 'util'

module SomeHelper
  # Like the Rails truncate helper, and taking the same options, but HTML-safe.
  #
  # If the input string is NOT marked html_safe?, simply passes it to the Rails truncate helper.
  # If the string IS marked html_safe?, uses nokogiri to parse it, and truncates the
  # actual displayed text to max_length, while keeping the HTML structure valid.
  #
  # Default omission marker is a unicode ellipsis, unlike Rails' three periods.
  #
  # :length option defaults to 280, what we think is a good
  # length for abstract/snippet display, unlike Rails' 30.
  def special_truncate(str, options = {})
    # Merge into a copy so the caller's hash is not modified.
    options = options.reverse_merge(:omission => "…", :length => 280)

    # Rails itself spells this option :separator; accept the :seperator
    # spelling used in this code as well, so both code paths honor it.
    misspelled = options.delete(:seperator)
    options[:separator] ||= misspelled if misspelled

    # Works for non-HTML of course, but for HTML this is a quick check
    # to avoid an expensive nokogiri parse if the whole string, even
    # with tags, is already within the max length.
    return str if str.length <= options[:length]

    if str.html_safe?
      noko = Nokogiri::HTML::DocumentFragment.parse(str)
      Util.nokogiri_truncate(noko, options[:length], options[:omission], options[:separator]).inner_html.html_safe
    else
      truncate(str, options)
    end
  end
end
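Two small ideas in the helper above can be shown without Rails: filling in defaults without touching the caller's hash, and the cheap length check that skips parsing. The sketch below uses plain `Hash#merge` in place of ActiveSupport's `reverse_merge`. `DEFAULT_OPTIONS`, `truncate_options` and `needs_truncation?` are hypothetical names for illustration only.

```ruby
# Hypothetical defaults, taken from the values the helper uses.
DEFAULT_OPTIONS = { :omission => "…", :length => 280 }.freeze

# Returns a new hash; caller-supplied values win over the defaults.
def truncate_options(options = {})
  DEFAULT_OPTIONS.merge(options)
end

# Markup only ever adds characters, so the raw length is an upper bound
# on the visible text length. If the raw string fits, no parse is needed.
def needs_truncation?(str, limit)
  str.length > limit
end

caller_options = { :length => 10 }
merged = truncate_options(caller_options)

p merged[:omission]                          # prints "…"
p caller_options                             # still only contains :length
p needs_truncation?("<p>hi</p>", 280)        # prints false
p needs_truncation?("<b>12345678901</b>", 10) # prints true
```

Note that the length check can only say "definitely fits"; a string that is over the limit with tags included may still turn out to fit once parsed.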
# encoding: UTF-8
require 'test_helper'

# NOTE: the class line is missing from this copy of the file; the class name
# and superclass below are assumptions, chosen so the helper is available in tests.
class SomeHelperTest < ActionView::TestCase
  def test_truncate_basic
    # Basic test
    output = special_truncate("12345678901234567890", :length => 10)
    assert_equal "123456789…", output
  end

  def test_truncate_tags
    # With tags
    html_input = "123456<p><b>78901234567</b>890</p>".html_safe
    html_output = special_truncate(html_input, :length => 10)
    assert html_output.html_safe?, "truncated html_safe? is still html_safe?"
    assert_equal "123456<p><b>789…</b></p>", html_output
  end

  def test_truncate_tag_boundary
    # With break on tag boundary. Yes, there's an error not accounting
    # for length of omission marker in this particular edge case,
    # hard to fix, good enough for now.
    html_input = "<p>1234567890<b>123456</b>7890</p>".html_safe
    html_output = special_truncate(html_input, :length => 10)
    assert_equal "<p>1234567890…</p>", html_output
  end

  def test_truncate_boundary_edge_case
    html_input = "12345<p>6789<b>0123456</b>7890</p>".html_safe
    html_output = special_truncate(html_input, :length => 10)
    # yeah, weird ellipsis in a <b> of its own, so it goes.
    assert_equal "12345<p>6789<b>…</b></p>", html_output
  end

  def test_truncate_another_edge_case
    html_input = "12345<p>67890<b>123456</b>7890</p>".html_safe
    html_output = special_truncate(html_input, :length => 10)
    assert_equal "12345<p>67890…</p>", html_output
  end

  def test_truncate_html_with_seperator
    html_input = "12345<p>67 901234<b></p>".html_safe
    html_output = special_truncate(html_input, :length => 10, :seperator => ' ')
    assert_equal "12345<p>67…</p>", html_output
  end

  def test_truncate_html_with_seperator_unavailable
    html_input = "12345<p>678901234<b></p>".html_safe
    html_output = special_truncate(html_input, :length => 10, :seperator => ' ')
    assert_equal "12345<p>6789…</p>", html_output
  end

  def test_truncate_html_with_boundary_seperator
    # Known edge case we don't handle, sorry. If this test
    # fails, that could be a good thing if you've fixed the edge case!
    html_input = "12345<p>6 8<b>90123456</b>7890</p>".html_safe
    html_output = special_truncate(html_input, :length => 10, :seperator => ' ')
    assert_equal "12345<p>6 8<b>9…</b></p>", html_output
  end
end
nono commented Oct 15, 2012

If you want some inspiration for more special cases, you can look at the HTML-Truncator specs: https://github.com/nono/HTML-Truncator/blob/master/spec/html_truncator_spec.rb

@jrochkind
Author

Cool, I don't think I had come across your HTML-Truncator before. Perhaps it already does what I'm trying to do anyway, and I should have just used that?

NARKOZ commented Oct 20, 2012

require 'truncate_html'

some_html = '<ul><li><a href="http://whatever">This is a link</a></li></ul>'
truncate_html(some_html, :length => 15, :omission => '...(continued)')
# => <ul><li><a href="http://whatever">This...(continued)</a></li></ul>

@jerome-diver

NARKOZ,
the require is fine, but where does it go? Which versions of Rails can it be used with? Does truncate_html need to be installed, and from where?
Could you give some more relevant and useful information, please?
Thank you for sharing.
