Re: [xml] HTMLparser: whitespace in <body> tags

From: Malcolm Tredinnick <malcolm commsecure com au>
To: Benj Carson <benjcarson digitaljunkies ca>
Cc: xml gnome org
Subject: Re: [xml] HTMLparser: whitespace in <body> tags
Date: Sun, 26 Sep 2004 12:17:12 +1000

On Sat, 2004-09-25 at 14:51 -0600, Benj Carson wrote:

Hello,

This may or may not be a bug, but I've discovered that the html parser 
removes whitespace between inline elements that are direct children of body 
elements, e.g.:

$ cat adj_inline.html
<html>
<body>
<a>a</a> <a>b</a>
</body>
</html>

$ xmllint --html adj_inline.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<a>a</a><a>b</a>
</body></html>

I haven't been able to find anything explicit about the significance of 
whitespace within body tags in the html spec, but from the point of view of 
a visual browser, I would think that whitespace is significant (i.e. there 
is a difference between "a b" and "ab").


Here is my attempt at interpreting what should be happening:

Section 9.1 of the HTML 4.01 spec says that although consecutive
whitespace occurrences can be collapsed to a single whitespace character
(outside of PRE elements), inter-word whitespace should still be
displayed. So I don't think libxml should be eating all of the spaces
here.
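
To make the section 9.1 point concrete, here is a minimal Python sketch (not libxml code) of the collapsing rule as I read it: runs of whitespace outside PRE shrink to one space, but the space itself survives. The names TextCollector and rendered_text are made up for illustration, and the whitespace set is simplified to the common ASCII characters.

```python
import re
from html.parser import HTMLParser

# Simplified whitespace set (assumption): space, tab, LF, CR, form feed.
WS_RUN = re.compile(r"[ \t\n\r\f]+")


class TextCollector(HTMLParser):
    """Collect character data, collapsing whitespace runs outside PRE."""

    def __init__(self):
        super().__init__()
        self.pre_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth > 0:
            self.pre_depth -= 1

    def handle_data(self, data):
        if self.pre_depth == 0:
            data = WS_RUN.sub(" ", data)
            # Avoid a doubled space where two text chunks meet.
            if self.chunks and self.chunks[-1].endswith(" ") and data.startswith(" "):
                data = data[1:]
        self.chunks.append(data)


def rendered_text(markup):
    """Return the text a visual user agent would roughly display."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


print(repr(rendered_text("<a>a</a> <a>b</a>")))   # input from the report
print(repr(rendered_text("<a>a</a><a>b</a>")))    # what xmllint emitted
```

The two calls give different strings, which is the "a b" versus "ab" difference described in the report.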

I've written a patch that solves this problem for me (see below).  I chose 
to prevent the creation of implicit <p> tags by removing "body" from 
htmlNoContentElements.  This may or may not be desired, but made the most 
sense to me.


This is tricky to say: the problem is that you are using the HTML 4.01
Loose DTD and libxml implements parsing according to HTML 4.01 Strict
DTD. In 4.01 Loose, the body element is defined as:

        <!ELEMENT BODY O O (%flow;)* +(INS|DEL) -- document body -->
        <!ENTITY % flow "%block; | %inline;">
        <!ENTITY % inline "#PCDATA | %fontstyle; | %phrase; | %special; | %formctrl;">

So a body element can contain PCDATA directly, which appears at odds
with its inclusion in htmlNoContentElements. However, in 4.01 Strict,
the body element is:

        <!ELEMENT BODY O O (%block;|SCRIPT)+ +(INS|DEL) -- document body -->

and you cannot have inline content directly in the body (and even
assuming an implicit P element is not strictly valid -- opening P tags
are required).
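
The difference between the two content models can be sketched as a small Python check. This is only an illustration under stated assumptions: the BLOCK and INLINE sets are abbreviated (the real %block; and %inline; entities list more elements), and allowed_in_body is a hypothetical helper, not anything libxml provides.

```python
# Abbreviated element sets (assumption); the DTD entities are longer.
BLOCK = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "pre", "dl",
         "div", "blockquote", "form", "hr", "table", "address"}
INLINE = {"a", "em", "strong", "span", "img", "br", "b", "i", "tt", "script"}
PCDATA = "#PCDATA"


def allowed_in_body(child, dtd):
    """Return True if `child` may appear directly inside BODY.

    `child` is a lowercase element name or PCDATA;
    `dtd` is "strict" or "loose".
    """
    if child in ("ins", "del"):
        # +(INS|DEL) inclusion, present in both declarations.
        return True
    if dtd == "strict":
        # (%block;|SCRIPT)+
        return child in BLOCK or child == "script"
    if dtd == "loose":
        # (%flow;)* where %flow; is %block; | %inline;, and %inline;
        # includes #PCDATA.
        return child in BLOCK or child in INLINE or child == PCDATA
    raise ValueError("unknown DTD flavour: %r" % (dtd,))


# Direct children of BODY in the reported document.
children = ["a", PCDATA, "a"]
print([allowed_in_body(c, "loose") for c in children])
print([allowed_in_body(c, "strict") for c in children])
```

Under Loose every child in the example is permitted; under Strict none of them is, since both the A elements and the text between them would need a block-level wrapper.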

I don't know if any of this is correct or not, so please point me to the 
appropriate sections of the spec if I'm way off here.


One constant on this list is that whenever I try to claim anything based
on the specs, I inevitably screw up and Daniel corrects me. It's good
for my humility. :-)

So don't read too much into the above, but I have tried to give you some
references that might be useful. I'm not really sure what the solution
here is, since my understanding is that true HTML (non-XHTML) parsing is
kind of a value-add in libxml and not as fully implemented as XML
parsing. To get this completely correct, I think you need to teach
libxml to detect the DTD you are using and adapt appropriately.
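
As a rough idea of what "detect the DTD" could mean, here is a Python sketch that classifies a document by the public identifier in its DOCTYPE. It is not libxml code; dtd_flavour and its default argument are hypothetical, and what to assume when no DOCTYPE is present (as in the reported input file) is a policy choice I am only guessing at.

```python
import re

# Pull the public identifier out of an HTML DOCTYPE declaration.
DOCTYPE_PUBLIC = re.compile(
    r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]*)"', re.IGNORECASE)

# W3C public identifiers for HTML 4.0 and 4.01.
W3C_HTML4 = re.compile(
    r"-//W3C//DTD HTML 4\.0(?:1)?( Transitional| Frameset)?//EN")


def dtd_flavour(markup, default="loose"):
    """Return "strict", "loose" or "frameset" for HTML 4.x documents.

    Falls back to `default` when the DOCTYPE is missing or unrecognised.
    """
    found = DOCTYPE_PUBLIC.search(markup)
    if found is None:
        return default
    matched = W3C_HTML4.fullmatch(found.group(1))
    if matched is None:
        return default
    variant = matched.group(1)
    if variant is None:
        return "strict"
    return "loose" if variant.strip() == "Transitional" else "frameset"


doc = ('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" '
       '"http://www.w3.org/TR/REC-html40/loose.dtd">\n<html></html>')
print(dtd_flavour(doc))
```

A parser could then use the returned flavour to decide whether text directly inside body is kept as-is or wrapped in an implied block element.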

Cheers,
Malcolm
