A correspondent has reported major performance problems with mod_proxy_html[1] when parsing large HTML files. mod_proxy_html is a SAX application using htmlPushParser, and runs as a filter in Apache's pipelined architecture.

My correspondent profiled the problem and found that it came entirely from within the final call to htmlParseChunk. Furthermore, no data are passed down the filter chain until that final call, so pipelining is broken. I was able to confirm this, and to refine the diagnosis by profiling with mod_diagnostics and flushing output frequently.

What is in fact happening is that when an HTML comment is encountered, the parser never finds the end of the comment (htmlParseLookupSequence always returns -1), so all input thereafter is not parsed but is simply appended to the buffer.

The offending code is around line 4355 (in version 2.5.8). I cannot see the purpose of this code at all, and simply disabling it (as in the patch) fixes the problem.

[1] http://www.webthing.com/software/mod_proxy_html/

Regards,

-- 
Nick Kew

In urgent need of paying work - http://www.webthing.com/~nick/cv.html
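P.S. For anyone who wants to see the symptom in isolation, here is a minimal sketch of the mechanism. It is a toy model and not libxml2's code: toy_parser, lookup_sequence, toy_parse_chunk and the "broken" flag are hypothetical names invented for illustration. It only assumes what is described above, namely that a push parser which cannot find the comment terminator keeps buffering input instead of passing it on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy push parser (hypothetical, not libxml2): it only counts the
 * bytes it has been able to hand downstream. */
typedef struct {
    char   buf[256];   /* unparsed input carried between chunks */
    size_t len;        /* bytes currently held in buf */
    size_t emitted;    /* bytes already passed down the chain */
    int    broken;     /* nonzero: terminator lookup always fails */
} toy_parser;

/* Return the offset of seq in the buffered input, or -1 if absent. */
static int lookup_sequence(const toy_parser *p, const char *seq)
{
    size_t n = strlen(seq);
    if (p->broken)
        return -1;                  /* models the reported symptom */
    for (size_t i = 0; i + n <= p->len; i++)
        if (memcmp(p->buf + i, seq, n) == 0)
            return (int)i;
    return -1;
}

/* Drop the first n buffered bytes, counting them as emitted. */
static void consume(toy_parser *p, size_t n)
{
    memmove(p->buf, p->buf + n, p->len - n);
    p->len -= n;
    p->emitted += n;
}

/* Feed one chunk; on terminate, flush whatever is still buffered. */
void toy_parse_chunk(toy_parser *p, const char *chunk, int terminate)
{
    size_t n = strlen(chunk);
    assert(p->len + n <= sizeof p->buf);
    memcpy(p->buf + p->len, chunk, n);
    p->len += n;

    while (p->len > 0) {
        size_t k = p->len < 4 ? p->len : 4;
        if (memcmp(p->buf, "<!--", k) != 0) {
            consume(p, 1);          /* ordinary byte: pass it on */
            continue;
        }
        if (k < 4)
            break;                  /* opener may be split across chunks */
        int end = lookup_sequence(p, "-->");
        if (end < 0)
            break;                  /* no terminator yet: keep buffering */
        consume(p, (size_t)end + 3);
    }
    if (terminate)
        consume(p, p->len);         /* final call deals with the backlog */
}
```

With a zero-initialised toy_parser, feeding "<p>a</p><!-- c -->" followed by "<p>b</p>" passes everything on as it arrives. With broken set to 1, the same input stalls at the comment: only the 8 bytes before it are emitted, and the rest sits in the buffer until the terminating call, which is the pattern I see in the filter.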
Attachment:
patch
Description: Text document