Re: [xml] HTMLparser comment parsing bug and patch

Nick Kew wrote:
A correspondent has reported major performance problems with
mod_proxy_html[1] parsing large HTML files.  mod_proxy_html
is a SAX application using htmlPushParser, and a filter in
Apache's pipelined architecture.

My correspondent had profiled the problem, and found it came
entirely from within the final call to htmlParseChunk.
Furthermore, no data are being passed down the filter chain
until the final call to htmlParseChunk, so pipelining is broken.

I was able to confirm this, and refine the diagnosis by profiling
with mod_diagnostics and flushing output frequently.  What is
in fact happening is that when an HTML comment is encountered,
the parser never finds the end of the comment (htmlParseLookupSequence
always returns -1), so all input thereafter is not parsed but is
appended to the buffer.  The offending code is around line 4355
(in version 2.5.8).  I cannot see the purpose of this code at
all, and simply disabling it (as in the patch) fixes the problem.
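
To illustrate the symptom described above, here is a minimal, self-contained
sketch of a push parser that must find "-->" before it can consume a comment.
It is not libxml2 code: toy_parser, toy_feed and lookup_sequence are
hypothetical names invented for this illustration, and the fixed 256-byte
buffer is an assumption made to keep it short.  It only shows the general
mechanism: while the lookup returns -1, each chunk is appended to the buffer
and nothing is consumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the libxml2 function): search buf[0..len)
 * for seq and return its offset, or -1 if it is not in the buffer yet. */
static int lookup_sequence(const char *buf, size_t len, const char *seq)
{
    size_t n = strlen(seq);
    assert(buf != NULL && n > 0);
    for (size_t i = 0; i + n <= len; i++) {
        if (memcmp(buf + i, seq, n) == 0)
            return (int)i;
    }
    return -1;
}

/* Toy push-parser state: bytes received but not yet consumed. */
typedef struct {
    char buf[256];
    size_t used;
} toy_parser;

static void toy_init(toy_parser *p)
{
    p->used = 0;
}

/* Append one chunk, then consume as much as possible.  Returns the
 * number of bytes consumed by this call.  Simplification: a comment
 * opener split across two chunks is treated as ordinary data. */
static size_t toy_feed(toy_parser *p, const char *chunk, size_t size)
{
    size_t consumed = 0;

    assert(p->used + size <= sizeof p->buf);
    memcpy(p->buf + p->used, chunk, size);
    p->used += size;

    for (;;) {
        size_t n;
        if (p->used >= 4 && memcmp(p->buf, "<!--", 4) == 0) {
            /* Inside a comment: it can only be consumed whole. */
            int end = lookup_sequence(p->buf + 4, p->used - 4, "-->");
            if (end < 0)
                break;              /* terminator not seen: keep buffering */
            n = 4 + (size_t)end + 3;
        } else {
            /* Ordinary data: consume up to the next comment opener. */
            int next = lookup_sequence(p->buf, p->used, "<!--");
            n = next < 0 ? p->used : (size_t)next;
            if (n == 0)
                break;              /* buffer is empty */
        }
        memmove(p->buf, p->buf + n, p->used - n);
        p->used -= n;
        consumed += n;
    }
    return consumed;
}
```

Feeding "<!-- a ", then "comment ", then "--><p>" consumes nothing on the
first two calls and everything on the third.  If the lookup never succeeded,
every later call would likewise consume nothing and the buffer would simply
keep growing until the final call, which matches the behaviour reported above.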



Nick Kew

In urgent need of paying work -

Good catch, and nice debugging!

I can see the purpose of the code, and it is well-intentioned but,
unfortunately, incorrect.  Although I'm sure your patch fixed your
current problem, it is not good for many other cases involving
comments.  I think the attached patch should be the proper
correction.  Could you test it for me and let me know?  If it's ok,
I'll commit it to CVS.

Bill Brack

Attachment: patch
Description: Text document
