Re: [xml] Non recursive html parser
- From: Daniel Veillard <veillard redhat com>
- To: Eugene Pimenov <libc me com>
- Cc: xml gnome org
- Subject: Re: [xml] Non recursive html parser
- Date: Mon, 15 Mar 2010 15:19:42 +0100
On Wed, Feb 17, 2010 at 10:04:08AM +0100, Daniel Veillard wrote:
On Tue, Feb 16, 2010 at 10:00:03AM +0300, Eugene Pimenov wrote:
Hello everyone,
As my colleague pointed out in December (http://mail.gnome.org/archives/xml/2009-December/msg00036.html ;
although he didn't do it in a clear manner), there're real world examples of HTML pages that overflows
stack. We're using libxml through nokogiri ( http://nokogiri.org/ it's a Ruby library).
E. g.
>> Nokogiri::HTML::SAX::Parser.new(Nokogiri::XML::SAX::Document.new).parse_memory("<b>"*100_000)
#=> SystemStackError: stack level too deep
In the patch I change htmlParseElement to return immediately and let the caller htmlParseContent do the
job.
htmlParseElement is not a static function, and I changed it behavior! I googled around
(http://google.com/codesearch?q=htmlParseElement&hl=en&btnG=Search+Code) and I don't see everyone
actually using it. But if this is an issue, I can make htmlParseElement call the secret (static)
htmlParseElement and then htmlParseContent until level matches. I'd rather see htmlParseElement converted
to static though.
The patch as is also breaks the ABI by changing the ordering of
elements in the public structure of a parser context.
Also changing the behaviour of the function to that extend is not
correct and googling is not sufficient to answer if that behaviour would
be in use somewhere. libxml2 is used in many embedded project, you won't
find their source in google !
So at the minimum the public function must be preserved. The new
elements in the parser context must be added at the end of the
structure (I also need to be convinced this is really needed) and
all the regression tests must still pass with the patch applied.
So I had to look at this more closely. I ended up keeping
htmlParseElement and htmlParseContent as-is, that's the old code
now there for compatibility. I renamed your modified version
htmlParseElementInternal and htmlParseContentInternal and made
sure we were calling the new one. I also moved your extensions to
the parsing context to the end of the structure to preserve ABI.
I gave it a fair bit of testing and this looks solid. So that's now
applied but I kept the ABI intact. The new functions are static of
course.
I also attach weirdness.patch that deletes double definitions, and sets nameMax to 0 if it fails to
allocate some memory.
That one sounds fine, right !
And I applied that one as-is,
thanks a lot !
Daniel
--
Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
daniel veillard com | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library http://libvirt.org/
[
Date Prev][
Date Next] [
Thread Prev][
Thread Next]
[
Thread Index]
[
Date Index]
[
Author Index]