[xml] HTMLparser: whitespace in <body> tags
- From: Benj Carson <benjcarson digitaljunkies ca>
- To: xml gnome org
- Subject: [xml] HTMLparser: whitespace in <body> tags
- Date: Sat, 25 Sep 2004 14:51:15 -0600
Hello,
This may or may not be a bug, but I've discovered that the HTML parser
removes whitespace between inline elements that are direct children of the
body element, e.g.:
$ cat adj_inline.html
<html>
<body>
<a>a</a> <a>b</a>
</body>
</html>
$ xmllint --html adj_inline.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<a>a</a><a>b</a>
</body></html>
I haven't been able to find anything explicit about the significance of
whitespace directly inside the body element in the HTML spec, but from the
point of view of a visual browser, I would think that whitespace is
significant (i.e. there is a difference between "a b" and "ab").
I've written a patch that solves this problem for me (see below). I chose
to prevent the creation of implicit <p> tags by removing "body" from
htmlNoContentElements. This may or may not be desired, but it made the most
sense to me.
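
To make the effect of the second hunk of the patch easier to see, here is a
rough, standalone sketch of the name checks it touches. It is not the
library's code: blanks_ignorable(), all_blank() and the "patched" flag are
names made up for illustration, the "html" check is assumed from the leading
return(1) context line in the diff, and the checks that follow in the real
function (the ones starting at lastChild) are not modelled at all.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Return 1 if every byte of text is whitespace (an empty string counts). */
static int all_blank(const char *text)
{
    assert(text != NULL);
    for (; *text != '\0'; text++) {
        if (!isspace((unsigned char)*text))
            return 0;
    }
    return 1;
}

/*
 * Hypothetical model of the parent-name checks shown in the diff.
 * name:    name of the element the text node would be added to
 * text:    candidate text content
 * patched: 0 models the original checks, 1 models them with "body" removed
 * Returns 1 if the whitespace would be dropped by these checks, else 0.
 */
static int blanks_ignorable(const char *name, const char *text, int patched)
{
    assert(name != NULL);
    if (!all_blank(text))
        return 0;               /* real text is never dropped here */
    if (strcmp(name, "html") == 0)
        return 1;               /* assumed from the diff context */
    if (strcmp(name, "head") == 0)
        return 1;
    if (!patched && strcmp(name, "body") == 0)
        return 1;               /* the check removed by the patch */
    return 0;                   /* later checks are not modelled */
}
```

In this model, blanks_ignorable("body", " ", 0) returns 1, so the space
between the two <a> elements is dropped, while blanks_ignorable("body", " ", 1)
returns 0 and the space is kept. Whitespace directly inside html or head is
dropped either way.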
I don't know if any of this is correct or not, so please point me to the
appropriate sections of the spec if I'm way off here.
Thanks,
Benj Carson
HTMLParser.c.diff (against 2.6.8):
--- HTMLparser.c.orig 2004-09-25 14:27:41.000000000 -0600
+++ HTMLparser.c 2004-09-25 14:22:15.000000000 -0600
@@ -940,7 +940,6 @@
 static const char *htmlNoContentElements[] = {
     "html",
     "head",
-    "body",
     NULL
 };
@@ -2022,8 +2021,6 @@
         return(1);
     if (xmlStrEqual(ctxt->name, BAD_CAST"head"))
         return(1);
-    if (xmlStrEqual(ctxt->name, BAD_CAST"body"))
-        return(1);
     if (ctxt->node == NULL) return(0);
     lastChild = xmlGetLastChild(ctxt->node);