Re: [xml] Stripping of blank content in HTML parser

On Wed, May 18, 2005 at 02:09:20PM +0100, Gary Coady wrote:
Hi there,
I've found that blank content nodes are stripped underneath the BODY
level. Given the attached HTML file message.html, the result of running
"xmllint --html" is in the file message_xmllint.html (xmllint reported
no errors).

Any blank nodes which are children of the body element are stripped,
causing the removal of spaces from the message.
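
To make the effect concrete, here is a minimal sketch using Python's standard html.parser. It is a toy illustration only: it is not libxml2's parser and does not reproduce its actual stripping rule. The class BodyText, the function body_text, and the sample markup are all made up for this example.

```python
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Collect text nodes seen inside <body>, optionally dropping blank ones."""

    def __init__(self, strip_blanks):
        super().__init__()
        self.strip_blanks = strip_blanks
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if not self.in_body:
            return
        # A "blank" node here is a text node made only of whitespace.
        if self.strip_blanks and data.strip() == "":
            return
        self.chunks.append(data)


def body_text(html, strip_blanks):
    parser = BodyText(strip_blanks)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


SAMPLE = "<html><body><b>Hello</b> <i>world</i></body></html>"
print(repr(body_text(SAMPLE, strip_blanks=False)))  # 'Hello world'
print(repr(body_text(SAMPLE, strip_blanks=True)))   # 'Helloworld'
```

The single space between the two inline elements is its own text node, so dropping whitespace-only nodes joins the two words.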

According to the HTML 4.01 loose DTD, the body element can contain
PCDATA; the relevant lines from the DTD are:

<!ELEMENT BODY O O (%flow;)* +(INS|DEL) -- document body -->
<!ENTITY % flow "%block; | %inline;">
<!ENTITY % inline "#PCDATA | %fontstyle; | %phrase; | %special; |

This is not true of the strict DTD; the body element in that case cannot
contain PCDATA.
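
One way to express that distinction in code is to key the decision on the document's public identifier. The sketch below is an assumption about how such a policy could look, not something taken from libxml2 or the patch; body_allows_pcdata is a hypothetical helper, and treating unknown doctypes as permissive is a choice made only for this example.

```python
# Public identifiers of the HTML 4.01 strict and loose (Transitional) DTDs.
STRICT_ID = "-//W3C//DTD HTML 4.01//EN"
LOOSE_ID = "-//W3C//DTD HTML 4.01 Transitional//EN"


def body_allows_pcdata(public_id):
    """Hypothetical helper: True unless the doctype is HTML 4.01 Strict.

    Unknown or missing doctypes are treated as permissive (an assumption
    of this sketch), so whitespace is kept when in doubt.
    """
    return public_id != STRICT_ID


for public_id in (LOOSE_ID, STRICT_ID, None):
    print(public_id, body_allows_pcdata(public_id))
```

Under this policy, blank text nodes directly under body would only be candidates for stripping when the document declares the strict DTD.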

The bug I filed is at
and an initial suggested patch at

Daniel did say in the bug that it might be better if this change was
added as an HTML parser option.

  Okay, thanks for coming back with complete information.
Based on the loose DTD fragment, I think it would be fine to remove the
stripping of blank text nodes under body elements, even by default. But
others may care and disagree.
  We will just have to make sure that repeatedly parsing and serializing
such an instance with libxml2 does not break (should be fine, I think).
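
The repeated parse/serialize check could be sketched along these lines. This is a toy under stated assumptions: roundtrip is a stand-in built on Python's html.parser, not libxml2, and it ignores attributes, comments and doctypes; is_stable and the sample markup are likewise hypothetical. A real test would substitute a libxml2 parse-and-dump step for roundtrip.

```python
from html.parser import HTMLParser


class Reserializer(HTMLParser):
    """Toy parse-and-serialize step that keeps every text node, blank or not."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append("<%s>" % tag)  # attributes are ignored in this toy

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)


def roundtrip(html):
    parser = Reserializer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def is_stable(step, html, passes=3):
    """True if the output of `step` stops changing after the first pass."""
    current = step(html)
    for _ in range(passes):
        following = step(current)
        if following != current:
            return False
        current = following
    return True


SAMPLE = "<html><body><b>Hello</b> <i>world</i>\n</body></html>"
print(is_stable(roundtrip, SAMPLE))              # True
print(is_stable(lambda s: s + "\n", SAMPLE))     # False: grows on every pass
```

The property being checked is that the first serialization is a fixed point: feeding the output back through the same step must not add, drop or move whitespace.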


Daniel Veillard      | Red Hat Desktop team
veillard redhat com  | libxml GNOME XML XSLT toolkit | Rpmfind RPM search engine
