Re: [xml] Patch to improve HTMLparser's robustness



On Tue, Apr 22, 2008 at 03:56:33PM +0200, Arnold Hendriks wrote:
Daniel Veillard wrote:
I didn't forget about the issue, and got a bit of time to test yesterday
and look at it. First, the patch makes sense: it fixes a serious problem and
there is no leak, that's fine, but the result is still problematic
[..]
 Basically the error is correctly displayed, but the close of the embedded
body and html tags generates a serious mess. We are able to detect the
embedding, but the autoclose kind of misbehaves. Moreover, if using the push
parser, the autoclose ends the document immediately:
 
Can I cheat? :) Given the fact that nothing should appear between 
</body> and </html>, and </html> is always the last tag, it's easiest to 
just ignore them and let the autoclose deal with it...
[...]

Which looks good enough to me. It's probably at least enough to get it 
properly through my html email sanitizer.

  Unfortunately that means you don't get the SAX end element callbacks for
body and head when they arise. It's a bit too much cheating; it would kill apps
relying on those. I far prefer marking the fact that some element ends need to
be ignored instead. I need to think about it; maybe the real solution is to
still push the html/body/head on the nameTab stack but not generate the
associated callbacks ...
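
The idea of pushing embedded html/body/head entries without firing callbacks
can be sketched as follows. This is only a toy model under stated assumptions,
not libxml2 code: ToyParser, toy_start, toy_end and the "silent" flag are
hypothetical names invented for illustration, and the counters stand in for
the SAX startElement/endElement callbacks.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH 64

/* One entry of a toy name stack (a stand-in for the parser's nameTab). */
typedef struct {
    const char *name;
    int silent;            /* 1: kept for bookkeeping only, no callbacks */
} Entry;

typedef struct {
    Entry tab[MAX_DEPTH];
    int depth;
    int start_events;      /* counts simulated startElement callbacks */
    int end_events;        /* counts simulated endElement callbacks */
} ToyParser;

static int is_structural(const char *name) {
    return strcmp(name, "html") == 0 || strcmp(name, "body") == 0 ||
           strcmp(name, "head") == 0;
}

/* Is an element with this name currently open? */
static int is_open(const ToyParser *p, const char *name) {
    for (int i = 0; i < p->depth; i++)
        if (strcmp(p->tab[i].name, name) == 0) return 1;
    return 0;
}

/* Is a silent (embedded) entry currently on the stack? */
static int has_silent(const ToyParser *p) {
    for (int i = 0; i < p->depth; i++)
        if (p->tab[i].silent) return 1;
    return 0;
}

static void toy_start(ToyParser *p, const char *name) {
    assert(p->depth < MAX_DEPTH);
    /* Assumption: html/body/head count as embedded when the same element is
       already open or when we are already inside an embedded one. */
    int silent = is_structural(name) &&
                 (is_open(p, name) || has_silent(p));
    p->tab[p->depth].name = name;
    p->tab[p->depth].silent = silent;
    p->depth++;
    if (!silent) p->start_events++;
}

static void toy_end(ToyParser *p, const char *name) {
    /* Toy simplification: ignore end tags that do not match the top entry. */
    if (p->depth == 0 || strcmp(p->tab[p->depth - 1].name, name) != 0)
        return;
    p->depth--;
    if (!p->tab[p->depth].silent) p->end_events++;
}
```

In this model the embedded </body> and </html> pop only their own silent
entries, so the outer html and body stay open, later content still lands
inside the outer body, and the outer elements still get their end callbacks.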


 I think the embedding error condition should be noted somewhere in the 
parser state and disable at least partially the closing tag processing so
that the 'end text' paragraph shows up as a sibling of the 'embbeded text'
paragraph.
 
It probably should generate an error, yes. My patch simply ignores the 
situation.

  But it breaks the normal cases, which is not acceptable. Nice try ;-)

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/


