Re: [xml] Non recursive html parser

From: Daniel Veillard <veillard redhat com>
To: Eugene Pimenov <libc me com>
Cc: xml gnome org
Subject: Re: [xml] Non recursive html parser
Date: Wed, 17 Feb 2010 10:04:08 +0100

On Tue, Feb 16, 2010 at 10:00:03AM +0300, Eugene Pimenov wrote:

Hello everyone,

As my colleague pointed out in December (http://mail.gnome.org/archives/xml/2009-December/msg00036.html, 
although not very clearly), there are real-world examples of HTML pages that overflow the 
stack. We're using libxml through nokogiri (http://nokogiri.org/, a Ruby library). 

E.g.:
      >> Nokogiri::HTML::SAX::Parser.new(Nokogiri::XML::SAX::Document.new).parse_memory("<b>"*100_000)
      #=> SystemStackError: stack level too deep

In the patch I change htmlParseElement to return immediately and let the caller htmlParseContent do the job.

htmlParseElement is not a static function, and I changed its behavior! I googled around 
(http://google.com/codesearch?q=htmlParseElement&hl=en&btnG=Search+Code) and I don't see anyone actually 
using it. But if this is an issue, I can make htmlParseElement call the internal (static) variant of 
htmlParseElement and then htmlParseContent until the level matches. I'd rather see htmlParseElement 
converted to static though.
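
For readers following the thread, the general idea of replacing recursion with a loop can be sketched as below. This is a toy under stated assumptions, not the patch and not libxml2 code: toy_stack, toy_push and toy_parse are hypothetical names, and the parser only understands "<x>" and "</x>" tags with single-character names. Open elements are kept on a heap-allocated stack, so nesting depth is limited by available memory rather than by the C call stack.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical toy parser, not libxml2 code. */
typedef struct {
    char *names;   /* names of the currently open elements */
    size_t depth;  /* number of open elements */
    size_t cap;    /* allocated capacity of names */
} toy_stack;

/* Push one element name, doubling the heap buffer when it is full. */
static int toy_push(toy_stack *s, char name) {
    assert(s != NULL);
    if (s->depth == s->cap) {
        size_t ncap = s->cap ? s->cap * 2 : 16;
        char *tmp = realloc(s->names, ncap);
        if (tmp == NULL)
            return -1;
        s->names = tmp;
        s->cap = ncap;
    }
    s->names[s->depth++] = name;
    return 0;
}

/* Returns the deepest nesting level seen, or -1 on a mismatched end tag
 * or allocation failure. One loop handles every level; nothing recurses.
 * Elements still open at end of input are tolerated. */
long toy_parse(const char *in) {
    toy_stack s = { NULL, 0, 0 };
    long max_depth = 0;

    assert(in != NULL);
    while (*in != '\0') {
        if (in[0] == '<' && in[1] == '/' && in[2] != '\0' && in[3] == '>') {
            /* End tag: must match the innermost open element. */
            if (s.depth == 0 || s.names[s.depth - 1] != in[2])
                goto fail;
            s.depth--;
            in += 4;
        } else if (in[0] == '<' && in[1] != '\0' && in[1] != '/' && in[2] == '>') {
            /* Start tag: record it on the heap stack instead of recursing. */
            if (toy_push(&s, in[1]) < 0)
                goto fail;
            if ((long)s.depth > max_depth)
                max_depth = (long)s.depth;
            in += 3;
        } else {
            in++;  /* character data: skip */
        }
    }
    free(s.names);
    return max_depth;

fail:
    free(s.names);
    return -1;
}
```

With this shape, an input of 100,000 consecutive "<b>" tags costs about 100 KB of heap and a constant amount of C stack.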


  The patch as is also breaks the ABI by changing the ordering of
elements in the public structure of a parser context.
  Also, changing the behaviour of the function to that extent is not
correct, and googling is not sufficient to tell whether that behaviour
is in use somewhere. libxml2 is used in many embedded projects; you won't
find their source on Google!
  So at the minimum the public function must be preserved. The new
elements in the parser context must be added at the end of the
structure (I also need to be convinced this is really needed), and
all the regression tests must still pass with the patch applied.
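
The layout concern can be illustrated with a small sketch. The structures below are hypothetical and are not libxml2's actual parser context; newState stands in for whatever field a patch might add. Already-compiled callers reach fields by fixed byte offsets, so a field added at the front or in the middle moves everything after it, while a field appended at the end leaves the existing offsets alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structures for illustration only. */
typedef struct {            /* layout existing binaries were compiled against */
    int depth;
    const char *name;
    int nameMax;
} ctxt_v1;

typedef struct {            /* new field placed first: every old field moves */
    void *newState;
    int depth;
    const char *name;
    int nameMax;
} ctxt_v2_reordered;

typedef struct {            /* new field appended: old fields stay in place */
    int depth;
    const char *name;
    int nameMax;
    void *newState;
} ctxt_v2_appended;

/* Expands to 1 when every v1 field has the same offset in type T. */
#define SAME_OFFSETS(T)                                    \
    (offsetof(T, depth) == offsetof(ctxt_v1, depth) &&     \
     offsetof(T, name) == offsetof(ctxt_v1, name) &&       \
     offsetof(T, nameMax) == offsetof(ctxt_v1, nameMax))

int reordered_layout_is_compatible(void) {
    return SAME_OFFSETS(ctxt_v2_reordered);
}

int appended_layout_is_compatible(void) {
    return SAME_OFFSETS(ctxt_v2_appended);
}
```

Appending still changes the size of the structure, so it is only safe as long as callers obtain the structure from the library rather than allocating it themselves.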

I also attach weirdness.patch, which deletes duplicate definitions and sets nameMax to 0 if it fails to 
allocate memory.


  That one sounds fine!
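
The code touched by weirdness.patch is not shown in this message, so the following is only a sketch of the pattern it describes, assuming the point of zeroing nameMax is to keep the recorded capacity consistent with a table that failed to allocate. The name_stack type and its functions are hypothetical; the field names merely echo nameTab, nameNr and nameMax. The allocator is passed in so the failure path can be exercised.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical name stack, not libxml2's code. */
typedef struct {
    const char **tab;  /* storage for element names */
    int nr;            /* entries in use */
    int max;           /* entries allocated */
} name_stack;

typedef void *(*alloc_fn)(size_t);

/* Allocator that always fails, to exercise the error path. */
void *failing_alloc(size_t size) {
    (void)size;
    return NULL;
}

/* Returns 0 on success, -1 on failure. On failure the capacity is reset
 * so that max never claims room that tab does not have. */
int name_stack_init(name_stack *s, int capacity, alloc_fn alloc) {
    assert(s != NULL && capacity > 0 && alloc != NULL);
    s->nr = 0;
    s->max = capacity;
    s->tab = alloc((size_t)capacity * sizeof(*s->tab));
    if (s->tab == NULL) {
        s->max = 0;
        return -1;
    }
    return 0;
}

/* Push guarded by max: with max == 0 and tab == NULL it refuses to write. */
int name_stack_push(name_stack *s, const char *name) {
    assert(s != NULL);
    if (s->nr >= s->max)
        return -1;  /* a real parser would try to grow the table here */
    s->tab[s->nr++] = name;
    return 0;
}
```

If max kept its nonzero value after a failed allocation, the bounds check in the push function would pass and the write would go through a NULL table.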

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
