[xml] HTML parser space dropping (with patch)




I'm sending this again because the mailing list bounced it the last time as having a "suspicious header". Dunno. Apologies if you get this twice.

I'm running into a problem with the HTML parser dropping lots of spaces. There was a patch for this a while back, and I've verified that my source tree still contains that patch, but it isn't solving the problem.

Trivial example: feed the following to xmllint --html --htmlout

<html>
<head></head><body>
<pre>
Word1<!-- comment 1 --> <!-- comment2 -->Word2
</pre>
</body></html>

Note that the contents are in a <pre> tag, so all spaces should be kept. However, the space between the two comments is dropped, so Word1 and Word2 run together in the output. Imagine a few thousand of these in a doc and you see my problem. :-)

I've also found that there's no way to turn this space dropping off. The obvious flags (keepSpaces, for example) don't work.

After looking at the code, it appears that the current behavior is to keep a space only if one of the following holds:

1.  The space is the first thing in an element that allows spaces
2.  The space follows another container that allows spaces

This second part, of course, fails for comments because they aren't really elements, so they can't legitimately allow spaces.
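To make the failure concrete, here is a rough model of the two rules as I read them. This is a sketch of the behavior described above, not the parser's actual code; the type PrevKind and the function keeps_space_current are names made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical description of what precedes a whitespace-only run
 * inside its parent element. These are not libxml2 types. */
typedef enum { PREV_NONE, PREV_ELEMENT, PREV_COMMENT } PrevKind;

/*
 * Model of the two rules listed above (an assumption drawn from reading
 * the code, not a copy of it).
 *   parent_allows_spaces: the enclosing element allows spaces
 *   prev:                 what comes right before the whitespace run
 *   prev_allows_spaces:   only meaningful when prev == PREV_ELEMENT
 */
bool keeps_space_current(bool parent_allows_spaces, PrevKind prev,
                         bool prev_allows_spaces)
{
    /* Rule 1: first thing in an element that allows spaces. */
    if (prev == PREV_NONE)
        return parent_allows_spaces;

    /* Rule 2: follows another container that allows spaces. */
    if (prev == PREV_ELEMENT)
        return prev_allows_spaces;

    /* A comment is not an element, so neither rule can match. */
    return false;
}
```

In the <pre> example, the space sits between two comments, so the model is called with prev == PREV_COMMENT and returns false even though the parent allows spaces.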

I suspect that there are other cases where spaces are being dropped incorrectly as well. The ones that stand out from looking at the code are spaces after <br> and after <img>.

In any case, it seems to me that the current model is more complicated than necessary, and that any space within an element that follows a mixed content model should automatically be considered sacred, regardless of what element (if any) precedes it. If an app needs the current behavior, it's easy to nuke the spaces in the tree. By contrast, you can never get back spaces that have been dropped during the parse.
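To back up the claim that removing spaces after the parse is easy, here is a minimal sketch of such a pass. It uses a hypothetical stand-in node type (Node, NodeKind) rather than the real tree structures, and the function names are made up for illustration. It only unlinks nodes; freeing them is left to the caller.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical minimal tree node; a stand-in for a real parser's node type. */
typedef enum { NODE_ELEMENT, NODE_TEXT, NODE_COMMENT } NodeKind;

typedef struct Node {
    NodeKind kind;
    const char *content;       /* text of a text or comment node, else NULL */
    struct Node *first_child;
    struct Node *next;         /* next sibling */
} Node;

/* True if s is non-NULL and contains nothing but whitespace characters. */
static bool is_blank(const char *s)
{
    if (s == NULL)
        return false;
    for (; *s != '\0'; s++)
        if (!isspace((unsigned char)*s))
            return false;
    return true;
}

/*
 * Unlink every whitespace-only text node below parent and return how many
 * were unlinked. Nodes are not freed here; the caller owns the memory.
 */
size_t strip_blank_text(Node *parent)
{
    size_t removed = 0;
    Node **link = &parent->first_child;

    while (*link != NULL) {
        Node *cur = *link;
        if (cur->kind == NODE_TEXT && is_blank(cur->content)) {
            *link = cur->next;          /* splice the node out of the list */
            cur->next = NULL;
            removed++;
        } else {
            removed += strip_blank_text(cur);   /* descend into children */
            link = &cur->next;
        }
    }
    return removed;
}
```

Against a real libxml2 tree the same walk would presumably be written with the library's own helpers (xmlIsBlankNode, xmlUnlinkNode, xmlFreeNode). The point is that this direction is a short loop, while the reverse direction is impossible once the parser has discarded the text.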

Suggested patch attached.


Thoughts?
David

Attachment: libxml-html-whitespace-new.patch
Description: Binary data



