[xml] HTMLparser: whitespace in <body> tags



Hello,

This may or may not be a bug, but I've discovered that the HTML parser 
removes whitespace between inline elements that are direct children of the 
body element, e.g.:

$ cat adj_inline.html
<html>
<body>
<a>a</a> <a>b</a>
</body>
</html>

$ xmllint --html adj_inline.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<a>a</a><a>b</a>
</body></html>

I haven't been able to find anything explicit about the significance of 
whitespace directly inside the body element in the HTML spec, but from the 
point of view of a visual browser, I would think that this whitespace is 
significant (i.e. there is a difference between "a b" and "ab").

I've written a patch that solves this problem for me (see below). It 
removes the special-case check for "body" shown in the second hunk. I also 
chose to prevent the creation of implicit <p> tags by removing "body" from 
htmlNoContentElements. That second change may or may not be desired, but it 
made the most sense to me.
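
To make the effect of the htmlNoContentElements change concrete, here is a 
minimal standalone sketch of a lookup over a NULL-terminated name list like 
the one in the patch. It is not the libxml2 code: the names 
no_content_elements and is_no_content_element are made up for illustration, 
and plain strcmp stands in for xmlStrEqual. It only assumes, as described 
above, that an element found in the list gets an implied <p> around text 
placed directly inside it.

```c
#include <stddef.h>
#include <string.h>

/* The list as it stands after the patch: "body" has been removed. */
static const char *no_content_elements[] = {
    "html",
    "head",
    NULL
};

/*
 * Hypothetical helper (not a libxml2 function): returns 1 if `name`
 * appears in the list above and 0 otherwise.
 */
int is_no_content_element(const char *name)
{
    size_t i;

    if (name == NULL)
        return 0;   /* sketch choice: no open element means no match */

    /* Linear scan up to the NULL sentinel that terminates the list. */
    for (i = 0; no_content_elements[i] != NULL; i++) {
        if (strcmp(name, no_content_elements[i]) == 0)
            return 1;
    }
    return 0;
}
```

With the patched list, is_no_content_element("body") returns 0, while 
"html" and "head" still return 1, so under the assumption above only those 
two would still trigger an implied <p>.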

I don't know if any of this is correct or not, so please point me to the 
appropriate sections of the spec if I'm way off here.

Thanks,


Benj Carson


HTMLparser.c.diff (against libxml2 2.6.8):

--- HTMLparser.c.orig 2004-09-25 14:27:41.000000000 -0600
+++ HTMLparser.c 2004-09-25 14:22:15.000000000 -0600
@@ -940,7 +940,6 @@
 static const char *htmlNoContentElements[] = {
     "html",
     "head",
-    "body",
     NULL
 };
 
@@ -2022,8 +2021,6 @@
         return(1);
     if (xmlStrEqual(ctxt->name, BAD_CAST"head"))
         return(1);
-    if (xmlStrEqual(ctxt->name, BAD_CAST"body"))
-        return(1);
     if (ctxt->node == NULL) return(0);
 
     lastChild = xmlGetLastChild(ctxt->node);


