[xml] Non recursive html parser



Hello everyone,

As my colleague pointed out in December (http://mail.gnome.org/archives/xml/2009-December/msg00036.html ; 
although not very clearly), there are real-world examples of HTML pages that overflow the 
stack. We're using libxml through nokogiri (http://nokogiri.org/, a Ruby library). 

E.g.:
        >> Nokogiri::HTML::SAX::Parser.new(Nokogiri::XML::SAX::Document.new).parse_memory("<b>"*100_000)
        #=> SystemStackError: stack level too deep

In the patch I change htmlParseElement to return immediately instead of recursing into the element's content, 
and let the caller, htmlParseContent, do the job.
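
To illustrate the idea only (this is not the code from the patch), here is a minimal toy sketch in C. All names in it (toy_ctxt, toy_parse_element, toy_parse_content) are hypothetical and are not libxml2 API. It assumes a drastically simplified input made of "<name>", "</name>" and character data; the element function consumes one start tag, pushes the name on a heap-allocated stack and returns, while the caller's loop does everything else, so nesting depth costs heap memory rather than C stack frames.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy parser state: an explicit, heap-allocated stack of open tag names. */
typedef struct {
    const char *cur;       /* current read position */
    char      **names;     /* stack of open element names */
    size_t      depth;     /* number of currently open elements */
    size_t      cap;       /* allocated slots in names */
    size_t      max_depth; /* deepest nesting seen so far */
} toy_ctxt;

/* Push a copy of a tag name; returns 0 on success, -1 on allocation failure. */
static int toy_push(toy_ctxt *c, const char *name, size_t len) {
    char *copy;
    if (c->depth == c->cap) {
        size_t ncap = c->cap ? c->cap * 2 : 16;
        char **tmp = realloc(c->names, ncap * sizeof *tmp);
        if (tmp == NULL) return -1;
        c->names = tmp;
        c->cap = ncap;
    }
    copy = malloc(len + 1);
    if (copy == NULL) return -1;
    memcpy(copy, name, len);
    copy[len] = '\0';
    c->names[c->depth++] = copy;
    if (c->depth > c->max_depth) c->max_depth = c->depth;
    return 0;
}

/* Pop the innermost open element, if any. */
static void toy_pop(toy_ctxt *c) {
    if (c->depth > 0) free(c->names[--c->depth]);
}

/* Consume one start tag and return immediately: no recursion into children. */
static int toy_parse_element(toy_ctxt *c) {
    const char *start = c->cur + 1;        /* skip '<' */
    const char *end = strchr(start, '>');
    if (end == NULL) return -1;            /* unterminated tag */
    c->cur = end + 1;
    return toy_push(c, start, (size_t)(end - start));
}

/* The caller's loop does the job. Returns the deepest nesting, or -1 on error.
 * Simplification: an end tag pops the innermost element without name matching. */
long toy_parse_content(const char *input) {
    toy_ctxt c = { input, NULL, 0, 0, 0 };
    int failed = 0;
    long result;

    assert(input != NULL);
    while (*c.cur != '\0') {
        if (c.cur[0] == '<' && c.cur[1] == '/') {
            const char *end = strchr(c.cur, '>');
            if (end == NULL) { failed = 1; break; }
            toy_pop(&c);
            c.cur = end + 1;
        } else if (c.cur[0] == '<') {
            if (toy_parse_element(&c) != 0) { failed = 1; break; }
        } else {
            c.cur++;                       /* character data */
        }
    }
    result = failed ? -1 : (long)c.max_depth;
    while (c.depth > 0) toy_pop(&c);       /* auto-close whatever is still open */
    free(c.names);
    return result;
}
```

With this shape, an input like "<b>" repeated 100_000 times only grows the heap-allocated name stack; the call depth stays constant.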

htmlParseElement is not a static function, and I changed its behavior! I googled around 
(http://google.com/codesearch?q=htmlParseElement&hl=en&btnG=Search+Code) and I don't see anyone actually 
using it. But if this is an issue, I can make htmlParseElement call a secret (static) version of itself and 
then htmlParseContent until the level matches. I'd rather see htmlParseElement converted to static though.

I also attach weirdness.patch, which deletes duplicate definitions and sets nameMax to 0 if a memory 
allocation fails.
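
The reasoning behind the nameMax change can be sketched as follows. This is a toy illustration, not the contents of weirdness.patch: the struct, the function names and the injectable allocator are all hypothetical, and the initial size of 10 is an assumption. The point is that a recorded capacity must never claim slots that were not actually allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator hook, so that an allocation failure can be simulated. */
typedef void *(*toy_alloc_fn)(size_t);

/* Toy stand-in for the name-stack fields of a parser context. */
typedef struct {
    const char **name_tab; /* stack of element names */
    int          name_nr;  /* entries in use */
    int          name_max; /* entries allocated */
} toy_name_stack;

/* Initialise the stack; on allocation failure, record a capacity of 0 so
 * later code cannot believe there is room in a table that does not exist. */
int toy_name_stack_init(toy_name_stack *s, toy_alloc_fn alloc_fn) {
    assert(s != NULL && alloc_fn != NULL);
    s->name_nr = 0;
    s->name_tab = alloc_fn(10 * sizeof *s->name_tab);
    if (s->name_tab == NULL) {
        s->name_max = 0;   /* keep the bookkeeping consistent with reality */
        return -1;
    }
    s->name_max = 10;
    return 0;
}

/* Allocator that always fails, for exercising the error path. */
void *toy_failing_alloc(size_t size) {
    (void)size;
    return NULL;
}
```

Any push that first checks name_nr against name_max then takes the grow-or-fail path instead of writing through a NULL table.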

Good day, everyone :)

Attachment: non-recursive-html-parser.patch
Description: Binary data

Attachment: weirdness.patch
Description: Binary data


