[xml] HTMLparser comment parsing bug and patch




A correspondent has reported major performance problems with
mod_proxy_html[1] when parsing large HTML files.  mod_proxy_html
is a SAX application that uses libxml2's HTML push parser
(htmlParseChunk), and runs as a filter in Apache's pipelined
architecture.

My correspondent had profiled the problem, and found it came
entirely from within the final call to htmlParseChunk.
Furthermore, no data are being passed down the filter chain
until the final call to htmlParseChunk, so pipelining is broken.

I was able to confirm this, and to refine the diagnosis by profiling
with mod_diagnostics and flushing output frequently.  What is in
fact happening is that when an HTML comment is encountered, the
parser never finds the end of the comment (htmlParseLookupSequence
always returns -1), so all input thereafter is not parsed but is
simply appended to the buffer until the final call.  The offending
code is around line 4355 of HTMLparser.c (in version 2.5.8).  I
cannot see the purpose of this code at all, and simply disabling it
(as in the patch) fixes the problem.
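
To make the failure mode concrete, here is a toy sketch of a buffered
push parser.  It is not libxml2 code: toy_parser, toy_feed, toy_run
and lookup_sequence are hypothetical names invented for illustration,
and the "broken" flag merely imitates a lookup that always reports
-1, as described above for htmlParseLookupSequence.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: index of the first occurrence of seq in
 * buf[0..len), or -1 if it is not there. */
int lookup_sequence(const char *buf, size_t len, const char *seq)
{
    size_t n = strlen(seq);
    if (n == 0 || len < n)
        return -1;
    for (size_t i = 0; i + n <= len; i++) {
        if (memcmp(buf + i, seq, n) == 0)
            return (int)i;
    }
    return -1;
}

/* Toy push-parser state (illustration only). */
struct toy_parser {
    char   buf[256];
    size_t len;      /* bytes buffered but not yet consumed     */
    int    broken;   /* nonzero: comment lookup always fails    */
    size_t emitted;  /* bytes "passed down the filter chain"    */
};

/* Append one chunk, then consume as much of the buffer as possible. */
void toy_feed(struct toy_parser *p, const char *chunk)
{
    size_t n = strlen(chunk);
    assert(n <= sizeof p->buf - p->len);   /* toy: fixed capacity */
    memcpy(p->buf + p->len, chunk, n);
    p->len += n;

    while (p->len > 0) {
        size_t used;
        if (p->len >= 4 && memcmp(p->buf, "<!--", 4) == 0) {
            /* Inside a comment: wait until its end is buffered. */
            int end = p->broken
                ? -1
                : lookup_sequence(p->buf + 4, p->len - 4, "-->");
            if (end < 0)
                return;                    /* ask for more input */
            used = 4 + (size_t)end + 3;
        } else {
            /* Plain content: consume up to the next comment start. */
            int next = lookup_sequence(p->buf, p->len, "<!--");
            used = next < 0 ? p->len : (size_t)next;
        }
        p->emitted += used;
        memmove(p->buf, p->buf + used, p->len - used);
        p->len -= used;
    }
}

/* Feed a fixed four-chunk document and return the final state. */
struct toy_parser toy_run(int broken)
{
    static const char *chunks[] = {
        "<p>a</p>", "<!-- note -->", "<p>b</p>", "<p>c</p>"
    };
    struct toy_parser p = { .len = 0, .broken = broken, .emitted = 0 };
    for (size_t i = 0; i < sizeof chunks / sizeof chunks[0]; i++)
        toy_feed(&p, chunks[i]);
    return p;
}
```

With broken set to 0, toy_run passes all 37 input bytes on and leaves
nothing buffered.  With broken set to 1, only the 8 bytes before the
comment are passed on and the remaining 29 pile up in the buffer,
which mirrors the stalled pipeline described above.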


[1] http://www.webthing.com/software/mod_proxy_html/

Regards,

-- 
Nick Kew

In urgent need of paying work - http://www.webthing.com/~nick/cv.html

Attachment: patch
Description: Text document


