Nick Kew wrote:
A correspondent has reported major performance problems with mod_proxy_html[1] parsing large HTML files. mod_proxy_html is a SAX application using htmlPushParser, and a filter in Apache's pipelined architecture.

My correspondent had profiled the problem, and found it came entirely from within the final call to htmlParseChunk. Furthermore, no data are being passed down the filter chain until the final call to htmlParseChunk, so pipelining is broken.

I was able to confirm this, and refine the diagnosis by profiling with mod_diagnostics and flushing output frequently. What is in fact happening is that when an HTML comment is encountered, it never finds the end of the comment (htmlParseLookupSequence always returns -1) so all input thereafter is not parsed but is appended to the buffer.

The offending code is around line 4355 (in version 2.5.8). I cannot see the purpose of this code at all, and simply disabling it (as in the patch) fixes the problem.

[1] http://www.webthing.com/software/mod_proxy_html/

Regards,

--
Nick Kew
In urgent need of paying work - http://www.webthing.com/~nick/cv.html
Good catch, and nice debugging! I can see the purpose of the code, and it is well-intentioned but, unfortunately, incorrect. Although I'm sure your patch fixed your current problem, it is not good for many other cases involving comments.

I think the attached patch should be the proper correction. Could you test it for me and let me know? If it's ok, I'll commit it to CVS.

Bill Brack
Attachment: patch (Description: Text document)