[xml] Stripping of blank content in HTML parser



Hi there,
I've found that the HTML parser strips blank (whitespace-only) text nodes
directly underneath the BODY element. Given the attached HTML file
message.html, the output of running "xmllint --html" on it is in the file
message_xmllint.html (xmllint reported no errors).

Any blank nodes which are children of the body element are stripped,
which removes the spaces between words in the message.
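
To make the effect concrete, here is a minimal sketch in Python (standard
library only, not libxml2). The sample markup and the names TextCollector,
visible_text and strip_blank are made up for illustration; this is not the
attached message.html. It only shows why dropping whitespace-only text
nodes between inline elements glues words together.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, standing in for a message with inline formatting.
SAMPLE = (
    "<html><body>This <b>message</b> <i>has</i> <u>several</u> "
    "<tt>formatting</tt> <b>options</b></body></html>"
)


class TextCollector(HTMLParser):
    """Collects text nodes, optionally dropping whitespace-only ones."""

    def __init__(self, strip_blank):
        super().__init__()
        self.strip_blank = strip_blank
        self.chunks = []

    def handle_data(self, data):
        # A "blank" node here means text consisting only of whitespace.
        if self.strip_blank and data.strip() == "":
            return
        self.chunks.append(data)


def visible_text(html, strip_blank):
    """Return the concatenated text content of the given HTML string."""
    parser = TextCollector(strip_blank)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(visible_text(SAMPLE, strip_blank=False))
    print(visible_text(SAMPLE, strip_blank=True))
```

With strip_blank=False the words stay separated; with strip_blank=True
only the space that belongs to the non-blank node "This " survives.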

According to the HTML 4.01 loose DTD, the body element can contain
PCDATA; relevant lines from http://www.w3.org/TR/REC-html40/loose.dtd are:

<!ELEMENT BODY O O (%flow;)* +(INS|DEL) -- document body -->
<!ENTITY % flow "%block; | %inline;">
<!ENTITY % inline "#PCDATA | %fontstyle; | %phrase; | %special; | %formctrl;">

This is not true of the strict DTD; the body element in that case cannot
contain PCDATA.
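
One way to express that distinction, sketched in Python under some
assumptions: the helper name keep_blank_in_body is hypothetical, it is not
how libxml2 decides this, and treating a missing or unknown DOCTYPE like
the loose DTD is my own choice. The two public identifiers are the ones
the HTML 4.01 specification gives for the strict and loose DTDs.

```python
# Public identifiers as given in the HTML 4.01 specification.
STRICT_ID = "-//W3C//DTD HTML 4.01//EN"
LOOSE_ID = "-//W3C//DTD HTML 4.01 Transitional//EN"


def keep_blank_in_body(public_id):
    """Hypothetical helper: keep whitespace-only text directly under BODY?"""
    # Strict DTD: BODY has no #PCDATA in its content model, so a blank
    # text node directly under BODY can be treated as ignorable.
    if public_id == STRICT_ID:
        return False
    # Loose DTD: BODY may contain #PCDATA, so blanks can be significant.
    # Assumption: a missing or unrecognised DOCTYPE is treated the same way.
    return True
```

For example, keep_blank_in_body(LOOSE_ID) says to keep the blanks, while
keep_blank_in_body(STRICT_ID) says they can be dropped.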

The bug I filed is at http://bugzilla.gnome.org/show_bug.cgi?id=304637
and an initial suggested patch at
http://bugzilla.gnome.org/attachment.cgi?id=46595&action=view

Daniel did say in the bug that it might be better if this change were
added as an HTML parser option.

Comments?

Thanks,
Gary.
This message has several formatting options applied all at once.
This messagehasseveral formattingoptionsappliedallatonce

.


