Re: [xml] Non recursive html parser



On Wed, Feb 17, 2010 at 10:04:08AM +0100, Daniel Veillard wrote:
On Tue, Feb 16, 2010 at 10:00:03AM +0300, Eugene Pimenov wrote:
Hello everyone,

As my colleague pointed out in December (http://mail.gnome.org/archives/xml/2009-December/msg00036.html ; 
although he didn't do it in a clear manner), there're real world examples of  HTML pages that overflows 
stack. We're using libxml through nokogiri ( http://nokogiri.org/ it's a Ruby library). 

E. g.
    >> Nokogiri::HTML::SAX::Parser.new(Nokogiri::XML::SAX::Document.new).parse_memory("<b>"*100_000)
    #=> SystemStackError: stack level too deep

In the patch I change htmlParseElement to return immediately and let the caller htmlParseContent do the 
job.

htmlParseElement is not a static function, and I changed it behavior! I googled around 
(http://google.com/codesearch?q=htmlParseElement&hl=en&btnG=Search+Code) and I don't see everyone 
actually using it. But if this is an issue, I can make htmlParseElement call the secret (static) 
htmlParseElement and then htmlParseContent until level matches. I'd rather see htmlParseElement converted 
to static though.

  The patch as is also breaks the ABI by changing the ordering of
elements in the public structure of a parser context.
  Also changing the behaviour of the function to that extend is not
correct and googling is not sufficient to answer if that behaviour would
be in use somewhere. libxml2 is used in many embedded project, you won't
find their source in google !
  So at the minimum the public function must be preserved. The new
elements in the parser context must be added at the end of the
structure (I also need to be convinced this is really needed) and
all the regression tests must still pass with the patch applied.

  So I had to look at this more closely. I ended up keeping
htmlParseElement and htmlParseContent as-is, that's the old code
now there for compatibility. I renamed your modified version
htmlParseElementInternal and htmlParseContentInternal and made
sure we were calling the new one. I also moved your extensions to
the parsing context to the end of the structure to preserve ABI.
I gave it a fair bit of testing and this looks solid. So that's now
applied but I kept the ABI intact. The new functions are static of
course.

I also attach weirdness.patch that deletes double definitions, and sets nameMax to 0 if it fails to 
allocate some memory.

  That one sounds fine, right !

  And I applied that one as-is,

   thanks a lot !

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]