Created
December 8, 2011 23:42
require 'nokogiri'
require 'stringio'

html = '<html><body><br/></body></html>'

# Nokogiri's new HTML encoding detection uses a custom SAX document handler to
# "peek" at an IO before parsing it.
#
# It interrupts the SAX parser by throwing from within a SAX document handler
# callback:
# https://github.com/tenderlove/nokogiri/blob/master/lib/nokogiri/html/document.rb#L144
#
# This causes a memory leak since the libxml2 parser does not expect its
# callbacks to longjmp out of it. Nokogiri leaks a little bit of memory every
# time we open an HTML document from an IO.
loop do
  doc = Nokogiri::HTML::Document::EncodingReader::SAXHandler.new(:foo)
  prs = Nokogiri::HTML::SAX::Parser.new(doc)
  ctx = Nokogiri::HTML::SAX::ParserContext.memory(html, 'UTF-8')
  catch(:foo) do
    ctx.parse_with(prs)
  end
end

# The above shows what is going on behind the scenes. Here is a much easier way
# to trigger this memory leak (the loop above never exits, so run one or the
# other):
loop { Nokogiri::HTML(StringIO.new(html)) }

# The proper fix for this issue is intrusive. I am unsure if we want to
# incorporate it into a stable release. It involves wrapping every rb_funcall in
# xml_sax_parser.c so that it:
# * intercepts exceptions and throws
# * stashes the exception in a new C struct associated with the handler or the
#   parser
# * tells libxml2 to stop parsing
# * re-throws the exception after libxml2 finishes its cleanup
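
The four steps listed above can be sketched in plain Ruby, without Nokogiri or libxml2. This is only an illustration of the stash-and-re-raise pattern, not the actual patch: `FakeParser`, `stop!`, `cleaned_up`, and `parse_safely` are hypothetical names invented here, and the real fix would live in C around each `rb_funcall`. The sketch covers exceptions only; a `throw` cannot be intercepted generically from Ruby code, so that part would have to be handled at the C level.

```ruby
# Hypothetical stand-in for a C parser that must finish its own cleanup.
# Like C code, it has no `ensure`: if a callback jumps out of #parse,
# the cleanup line is never reached.
class FakeParser
  attr_reader :cleaned_up

  def initialize(events)
    @events = events
    @stopped = false
    @cleaned_up = false
  end

  # Ask the parser to stop at the next opportunity.
  def stop!
    @stopped = true
  end

  def parse
    @events.each do |event|
      break if @stopped
      yield event
    end
    @cleaned_up = true # skipped entirely if the callback jumps out of #parse
  end
end

# Wrap every callback so that it never unwinds through the parser:
# intercept the exception, stash it, tell the parser to stop, and
# re-raise only after the parser has returned normally.
def parse_safely(parser)
  stashed = nil
  parser.parse do |event|
    begin
      yield event
    rescue Exception => e # deliberately broad: nothing may escape the callback
      stashed = e
      parser.stop!
    end
  end
  raise stashed if stashed
end

# Usage: the callback raises on the second event, yet the parser still
# gets to run its cleanup before the error reaches the caller.
parser = FakeParser.new(%i[start_document start_element end_document])
begin
  parse_safely(parser) do |event|
    raise ArgumentError, "stop at #{event}" if event == :start_element
  end
rescue ArgumentError => e
  puts "re-raised after cleanup: #{e.message} (cleaned_up=#{parser.cleaned_up})"
end
```

Calling `parser.parse` directly with the same raising block would leave `cleaned_up` false, which is the analogue of the leak described in the script.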