Re: [xml] Apparently incorrect paragraph wrapping in HTML parser
- From: Gary Coady <gary lyranthe org>
- To: xml gnome org
- Subject: Re: [xml] Apparently incorrect paragraph wrapping in HTML parser
- Date: Thu, 12 Jan 2006 01:29:39 +0000
John Hockaday ga gov au wrote:
> Hi All,
>
> I personally believe that it should be based on the DTD being used for the
> HTML.
>
> I use XHTML Strict (-//W3C//DTD XHTML 1.0 Strict//EN) and I would expect any
> conversion from XML to XHTML to make an XML document that is valid against
> the DTD. If libxml2 does what you want then this will *not* be the case and
> hence all my XHTML would be invalid.
>
> In the future people will have problems with their XHTML if they do not
> consider using the strict version. The semantic web and machine to machine
> communications will need to depend on the documents being as compliant as
> possible to the standard.
>
> "-//W3C//DTD XHTML 1.0 Transitional//EN" is supposed to be for "transitional"
> use while one is going from HTML 4.0 to XHTML. I believe that
> XHTML1.0-Strict is the expected standard until it is replaced by the W3C XML
> Schema version. For this reason I believe Libxml2 should automatically
> provide XHTML1.0-Strict. If not then should libxml2 be creating HTML 1.0?
> ;--)

I agree with most of the above, but an alteration would not affect the
behaviour you're worried about; XHTML should be parsed with the XML
parser, not the legacy HTML parser - and this issue involves the
behaviour of the latter.
A few months ago, I came across a "bug" where whitespace-only text nodes
that were direct children of the <body> element would be removed. The
problem is similar in that pure whitespace nodes are forbidden there by
the strict DTD, but allowed by the transitional DTD.
In this case, the applied patch checked the DTD in use with code like

    dtd = xmlGetIntSubset(ctxt->myDoc);
    if (dtd != NULL && dtd->ExternalID != NULL) {
        if (!xmlStrcasecmp(dtd->ExternalID,
                           BAD_CAST "-//W3C//DTD HTML 4.01//EN") ||
            !xmlStrcasecmp(dtd->ExternalID,
                           BAD_CAST "-//W3C//DTD HTML 4//EN"))
        {

(line 2060, HTMLparser.c).
This code assumes that HTML 4 and HTML 4.01 are the only strict non-XML
DTDs in existence.
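
One way to loosen that assumption would be to keep the strict public
identifiers in a table instead of a chain of comparisons. The following is
only a standalone sketch, not the applied patch: is_strict_html_dtd and
ascii_casecmp are hypothetical names, plain C strings stand in for xmlChar,
and the table is an assumed list (the two identifiers from the excerpt plus
the HTML 4.0 strict identifier, as an example of an entry the existing
check would not match).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive ASCII comparison; returns 0 when the strings match. */
static int ascii_casecmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a != '\0' && *b != '\0') {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);
        if (ca != cb)
            return ca - cb;
        a++;
        b++;
    }
    return tolower((unsigned char) *a) - tolower((unsigned char) *b);
}

/* Public identifiers treated as strict, non-XML HTML DTDs (assumed list). */
static const char *const strict_html_ids[] = {
    "-//W3C//DTD HTML 4.01//EN",
    "-//W3C//DTD HTML 4//EN",
    "-//W3C//DTD HTML 4.0//EN"
};

/* Hypothetical helper: 1 if public_id is in the strict table, else 0. */
int is_strict_html_dtd(const char *public_id)
{
    size_t i;
    size_t n = sizeof(strict_html_ids) / sizeof(strict_html_ids[0]);

    if (public_id == NULL)
        return 0;
    for (i = 0; i < n; i++) {
        if (ascii_casecmp(public_id, strict_html_ids[i]) == 0)
            return 1;
    }
    return 0;
}
```

With a helper like this, the check in the parser would reduce to one call
on dtd->ExternalID, and supporting another strict DTD would mean adding a
row to the table rather than another comparison.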
Something similar might be useful for this issue - the <p> tags are not
needed for a Transitional DTD. I'll have a look to see if there's an
easy fix at the weekend, if nobody's supplied a patch before that :-)
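
For illustration, the decision about the implied <p> could be as small as a
lookup on the public identifier. This is a rough sketch under stated
assumptions, not a patch against HTMLparser.c: should_imply_paragraph is a
hypothetical name, the identifier list is assumed, matching is exact-case
for brevity, and a missing or unrecognised identifier keeps the existing
wrapping behaviour.

```c
#include <stddef.h>
#include <string.h>

/* Transitional HTML 4 public identifiers (assumed list, exact-case match). */
static const char *const transitional_html_ids[] = {
    "-//W3C//DTD HTML 4.01 Transitional//EN",
    "-//W3C//DTD HTML 4.0 Transitional//EN"
};

/*
 * Hypothetical policy helper: returns 1 if bare text inside <body> should
 * be wrapped in an implied <p>, 0 if it may stay as a direct child.
 * A NULL or unrecognised identifier keeps the wrapping behaviour.
 */
int should_imply_paragraph(const char *public_id)
{
    size_t i;
    size_t n = sizeof(transitional_html_ids) / sizeof(transitional_html_ids[0]);

    if (public_id == NULL)
        return 1;
    for (i = 0; i < n; i++) {
        if (strcmp(public_id, transitional_html_ids[i]) == 0)
            return 0;
    }
    return 1;
}
```

Defaulting to 1 is a deliberate choice in this sketch: documents without a
recognised transitional DTD would be parsed exactly as they are today.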
Gary.