[xml] HTMLparser: whitespace in <body> tags


This may or may not be a bug, but I've discovered that the HTML parser 
removes whitespace between inline elements that are direct children of the 
body element, e.g.:

$ cat adj_inline.html
<a>a</a> <a>b</a>

$ xmllint --html adj_inline.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 

I haven't been able to find anything explicit about the significance of 
whitespace directly inside the body element in the HTML spec, but from the 
point of view of a visual browser, I would think that whitespace is 
significant (i.e. there is a difference between "a b" and "ab").

I've written a patch that solves this problem for me (see below).  I chose 
to prevent the creation of implicit <p> tags by removing "body" from 
htmlNoContentElements.  This may or may not be desired, but made the most 
sense to me.
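
To make the effect of the second hunk concrete, here is a rough standalone 
sketch of the decision it changes.  This is not libxml2 code: 
blank_is_ignorable() and its drop_in_body flag are hypothetical names, and 
I'm assuming the "head" test visible in the hunk context returns 1 the same 
way the removed "body" test does.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Hypothetical stand-in for the parser's "is this text run ignorable?"
 * check.  Not libxml2 code: the name and signature are invented here.
 *
 *   parent        name of the element currently open
 *   text          candidate text run, NUL-terminated
 *   drop_in_body  1 models the unpatched check, 0 models the patched one
 *
 * Returns 1 if the run would be dropped, 0 if it would be kept.
 */
static int blank_is_ignorable(const char *parent, const char *text,
                              int drop_in_body)
{
    const char *p;

    assert(parent != NULL);
    assert(text != NULL);

    /* Any non-whitespace character makes the run significant. */
    for (p = text; *p != '\0'; p++)
        if (!isspace((unsigned char)*p))
            return 0;

    /* Assumption: the "head" test in the hunk context drops blanks. */
    if (strcmp(parent, "head") == 0)
        return 1;

    /* Models the two lines the patch deletes. */
    if (drop_in_body && strcmp(parent, "body") == 0)
        return 1;

    /* The real function goes on to look at ctxt->node and its last
     * child; this sketch simply keeps the text. */
    return 0;
}
```

For the space between the two <a> elements in adj_inline.html, 
blank_is_ignorable("body", " ", 1) returns 1 (the space is dropped, giving 
"ab"), while blank_is_ignorable("body", " ", 0) returns 0 (the space is 
kept, giving "a b").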

I don't know if any of this is correct or not, so please point me to the 
appropriate sections of the spec if I'm way off here.


Benj Carson

HTMLParser.c.diff (against 2.6.8):

--- HTMLparser.c.orig 2004-09-25 14:27:41.000000000 -0600
+++ HTMLparser.c 2004-09-25 14:22:15.000000000 -0600
@@ -940,7 +940,6 @@
 static const char *htmlNoContentElements[] = {
-    "body",
@@ -2022,8 +2021,6 @@
   if (xmlStrEqual(ctxt->name, BAD_CAST"head"))
-  if (xmlStrEqual(ctxt->name, BAD_CAST"body")) 
-  return(1); 
   if (ctxt->node == NULL) return(0);
   lastChild = xmlGetLastChild(ctxt->node);
