[xml] Non recursive html parser

From: Eugene Pimenov <libc me com>
To: xml gnome org
Subject: [xml] Non recursive html parser
Date: Tue, 16 Feb 2010 10:00:03 +0300

Hello everyone,

As my colleague pointed out in December (see http://mail.gnome.org/archives/xml/2009-December/msg00036.html 
though he didn't put it very clearly), there are real-world HTML pages that overflow the 
stack. We're using libxml through nokogiri ( http://nokogiri.org/ is a Ruby library).

For example:
        >> Nokogiri::HTML::SAX::Parser.new(Nokogiri::XML::SAX::Document.new).parse_memory("<b>"*100_000)
        #=> SystemStackError: stack level too deep

In the patch I changed htmlParseElement to return immediately instead of recursing, leaving the rest of the 
work to its caller, htmlParseContent.
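
To illustrate the idea, here is a minimal sketch. It is a toy, not the code from the patch: the toy_* names and the toy_ctxt struct are made up for this example and have nothing to do with libxml's real parser context. The element function consumes one start tag and returns; the content function owns the loop and tracks nesting with a counter, so C stack usage stays constant however deep the markup is.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy parser state; not libxml's htmlParserCtxt. */
typedef struct {
    const char *cur;   /* next unread byte of the input */
    size_t depth;      /* number of currently open elements */
    size_t max_depth;  /* deepest nesting seen so far */
} toy_ctxt;

/* Advance past the next '>' (or to the end of input if there is none). */
static void toy_skip_tag(toy_ctxt *ctxt)
{
    const char *end = strchr(ctxt->cur, '>');
    ctxt->cur = end ? end + 1 : ctxt->cur + strlen(ctxt->cur);
}

/* Consume one start tag and return immediately.
 * It does NOT descend into the element's content. */
static void toy_parse_element(toy_ctxt *ctxt)
{
    toy_skip_tag(ctxt);
    ctxt->depth++;
    if (ctxt->depth > ctxt->max_depth)
        ctxt->max_depth = ctxt->depth;
}

/* The caller owns the loop: nesting is a counter, not a chain of calls. */
static void toy_parse_content(toy_ctxt *ctxt)
{
    while (*ctxt->cur != '\0') {
        if (ctxt->cur[0] == '<' && ctxt->cur[1] == '/') {
            toy_skip_tag(ctxt);              /* end tag closes one level */
            if (ctxt->depth > 0)
                ctxt->depth--;
        } else if (ctxt->cur[0] == '<') {
            toy_parse_element(ctxt);         /* start tag opens one level */
        } else {
            ctxt->cur++;                     /* skip character data */
        }
    }
}

/* Convenience helper: deepest nesting level reached in the input. */
static size_t toy_max_depth(const char *input)
{
    toy_ctxt ctxt = { input, 0, 0 };
    assert(input != NULL);
    toy_parse_content(&ctxt);
    return ctxt.max_depth;
}
```

With this shape, an input like "<b>" repeated 100,000 times only drives the depth counter up; no function calls itself.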

Note that htmlParseElement is not a static function, and I changed its behavior! I searched around 
(http://google.com/codesearch?q=htmlParseElement&hl=en&btnG=Search+Code) and I don't see anyone actually 
using it. But if this is an issue, I can make the public htmlParseElement call an internal (static) version 
of itself and then call htmlParseContent until the nesting level matches the one it started at. I'd rather 
see htmlParseElement converted to static, though.
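
A rough sketch of what such a compatibility wrapper could look like, again on a made-up toy parser (every toy_* name is hypothetical, and this is not how libxml structures its code): the public function records the current depth, runs the internal one-tag step, and then keeps consuming content until the depth is back where it started.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy parser state; not libxml's htmlParserCtxt. */
typedef struct {
    const char *cur;   /* next unread byte of the input */
    size_t depth;      /* number of currently open elements */
} toy_ctxt;

/* Advance past the next '>' (or to the end of input if there is none). */
static void toy_skip_tag(toy_ctxt *ctxt)
{
    const char *end = strchr(ctxt->cur, '>');
    ctxt->cur = end ? end + 1 : ctxt->cur + strlen(ctxt->cur);
}

/* Internal, non-recursive step: handles exactly one start tag. */
static void toy_parse_element_internal(toy_ctxt *ctxt)
{
    toy_skip_tag(ctxt);
    ctxt->depth++;
}

/* One step of content parsing: one tag or one character of text. */
static void toy_parse_content_step(toy_ctxt *ctxt)
{
    if (ctxt->cur[0] == '<' && ctxt->cur[1] == '/') {
        toy_skip_tag(ctxt);
        if (ctxt->depth > 0)
            ctxt->depth--;
    } else if (ctxt->cur[0] == '<') {
        toy_parse_element_internal(ctxt);
    } else {
        ctxt->cur++;
    }
}

/* Public wrapper that keeps the old contract: when it returns, the
 * whole element, children included, has been consumed. */
void toy_parse_element(toy_ctxt *ctxt)
{
    size_t start_depth;

    assert(ctxt != NULL && ctxt->cur != NULL);
    start_depth = ctxt->depth;
    toy_parse_element_internal(ctxt);
    /* Loop until the level matches the one we started at. */
    while (ctxt->depth > start_depth && *ctxt->cur != '\0')
        toy_parse_content_step(ctxt);
}
```

An external caller would see the same behavior as before, while the parser itself only ever uses the internal step.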

I also attach weirdness.patch, which deletes duplicate definitions and sets nameMax to 0 if a memory 
allocation fails.
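
As an illustration of why resetting a capacity field on allocation failure matters, here is a small sketch on a made-up name stack. It does not reproduce weirdness.patch; the struct, the toy_* functions and the allocator hook are assumptions for the example. If the buffer could not be allocated but the capacity still claimed free slots, a later push would write through a NULL pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical name stack; the fields only loosely mirror the idea of
 * a table pointer plus used/allocated counters. */
typedef struct {
    const char **tab;  /* array of element names */
    int nr;            /* slots in use */
    int max;           /* slots allocated */
} toy_name_stack;

/* Allocator hook so that a failure can be simulated. */
typedef void *(*toy_alloc_fn)(size_t);

static int toy_name_stack_init(toy_name_stack *s, int capacity,
                               toy_alloc_fn alloc)
{
    assert(s != NULL && capacity > 0);
    s->nr = 0;
    s->max = capacity;
    s->tab = alloc((size_t)capacity * sizeof(*s->tab));
    if (s->tab == NULL) {
        s->max = 0;    /* keep max consistent with the missing buffer */
        return -1;
    }
    return 0;
}

static int toy_name_push(toy_name_stack *s, const char *name)
{
    if (s->nr >= s->max)
        return -1;     /* no room: refuse instead of writing out of bounds */
    s->tab[s->nr++] = name;
    return 0;
}

/* Allocator that always fails, for demonstration only. */
static void *toy_failing_alloc(size_t n)
{
    (void)n;
    return NULL;
}
```

After a failed init, max is 0, so the bounds check in the push function rejects every push rather than dereferencing the NULL table.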

Good day, everyone :)

Attachment: non-recursive-html-parser.patch
Description: Binary data

Attachment: weirdness.patch
Description: Binary data

Follow-Ups:
- Re: [xml] Non recursive html parser
  - From: Daniel Veillard
