Re: [xml] Stripping of blank content in HTML parser



On Wed, May 18, 2005 at 02:09:20PM +0100, Gary Coady wrote:
Hi there,
I've found that blank content nodes are stripped underneath the BODY
level. Given the attached HTML file message.html, the result of running
"xmllint --html" is in the file message_xmllint.html (xmllint reported
no errors).

Any blank nodes which are children of the html element are stripped,
causing the removal of spaces from the message.

According to the HTML 4.01 loose DTD, the body element can contain
PCDATA; relevant lines from http://www.w3.org/TR/REC-html40/loose.dtd are:

<!ELEMENT BODY O O (%flow;)* +(INS|DEL) -- document body -->
<!ENTITY % flow "%block; | %inline;">
<!ENTITY % inline "#PCDATA | %fontstyle; | %phrase; | %special; | %formctrl;">

This is not true of the strict DTD; the body element in that case cannot
contain PCDATA.

The bug I filed is at http://bugzilla.gnome.org/show_bug.cgi?id=304637
and an initial suggested patch at
http://bugzilla.gnome.org/attachment.cgi?id=46595&action=view

Daniel did say in the bug that it might be better if this change were
added as an HTML parser option.

  Okay, thanks for coming back with complete information.
Based on the loose DTD fragment, I think it would be fine to stop
stripping blank text nodes from body elements, even by default. But others
may care and disagree.
  We will just have to make sure that repeatedly parsing and serializing
such a document with libxml2 does not break anything (it should be fine, I think).
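
For illustration only, the repeated parse/serialize check could be sketched
roughly as follows. This is not libxml2 code: it is a toy stand-in built on
Python's standard html.parser, and the names Roundtrip, roundtrip, is_stable,
and the strip_body_blanks flag are all hypothetical. It ignores comments,
doctypes, and implied end tags, so it only shows the shape of the check: run
the output back through the parser a few times and confirm it stops changing.

```python
from html.parser import HTMLParser

# Elements that never have an end tag; they must not become a "parent".
VOID = {"br", "hr", "img", "input", "meta", "link"}


class Roundtrip(HTMLParser):
    """Toy re-serializer (assumption: stands in for a real HTML parser)."""

    def __init__(self, strip_body_blanks):
        super().__init__(convert_charrefs=False)
        self.strip_body_blanks = strip_body_blanks
        self.stack = []  # names of the currently open elements
        self.out = []    # serialized output fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit as written, no end tag.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open element.
            while self.stack.pop() != tag:
                pass
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        parent = self.stack[-1] if self.stack else None
        blank = data.strip() == ""
        if self.strip_body_blanks and parent == "body" and blank:
            return  # mimic stripping blank text directly under <body>
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def roundtrip(html, strip_body_blanks):
    """Parse once and serialize back to a string."""
    parser = Roundtrip(strip_body_blanks)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def is_stable(html, strip_body_blanks, rounds=3):
    """True if re-parsing the first output never changes it again."""
    first = roundtrip(html, strip_body_blanks)
    current = first
    for _ in range(rounds):
        current = roundtrip(current, strip_body_blanks)
        if current != first:
            return False
    return True


sample = "<html><body><b>Hello</b> <i>world</i></body></html>"
# With stripping on, the space between </b> and <i> is lost.
print(repr(roundtrip(sample, strip_body_blanks=True)))
# With stripping off, the blank text node survives.
print(repr(roundtrip(sample, strip_body_blanks=False)))
print(is_stable(sample, strip_body_blanks=False))
```

With a real parser, the same idea would apply by swapping roundtrip for a
function that parses and dumps the document through that parser.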

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


