[xml] Parsing a particular HTML file causes parse_html_string to hang



Hello all.  I've hit a problem using libxml2 to parse HTML files.  Usually everything works great, but on one particular input file the process hangs, hogging the CPU until killed.  When I run the file through xmllint I see (aside from a bunch of run-of-the-mill HTML parsing warnings):

  $ /usr/local/bin/xmllint --html fail.html
  fail.html:927: parser error : Excessive depth in document: change xmlParserMaxDepth = 1024
  marcy playground<br /><option><em>

Then xmllint hangs, using 100% of the CPU until killed.

One more note: my first attempt at a workaround was to add an alarm() call before parsing, hoping to abort the parse if it took too long.  For some reason that didn't work - the alarm signal never reached my signal handler.  Any ideas why?  I'm OK with the parser failing on bad HTML - that's just a fact of life - but I can't allow it to hang indefinitely!

This is libxml2 v2.6.30 on Linux:

  $ /usr/local/bin/xmllint --html --version                   
  /usr/local/bin/xmllint: using libxml version 20630
     compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib

Would you like me to send in the killer file?  It's around 208k, so I didn't think it would be polite to send it unasked.

Thanks for any help you can give me!

-sam



