Re: [xml] Apparently incorrect paragraph wrapping in HTML parser

From: Daniel Veillard <veillard redhat com>
To: Gary Coady <gary lyranthe org>
Cc: xml gnome org
Subject: Re: [xml] Apparently incorrect paragraph wrapping in HTML parser
Date: Thu, 12 Jan 2006 01:58:39 -0500

On Thu, Jan 12, 2006 at 01:29:39AM +0000, Gary Coady wrote:

John Hockaday ga gov au wrote:

Hi All,

I personally believe that it should be based on the DTD being used for the
HTML.  

I use XHTML Strict (-//W3C//DTD XHTML 1.0 Strict//EN) and I would expect any
conversion from XML to XHTML to make an XML document that is valid against
the DTD.  If libxml2 does what you want then this will *not* be the case and
hence all my XHTML would be invalid.

In the future people will have problems with their XHTML if they do not
consider using the strict version.  The semantic web and machine to machine
communications will need to depend on the documents being as compliant as
possible to the standard.  

"-//W3C//DTD XHTML 1.0 Transitional//EN" is supposed to be for "transitional"
use while one is going from HTML 4.0 to XHTML.  I believe that
XHTML1.0-Strict is the expected standard until it is replaced by the W3C XML
Schema version.  For this reason I believe libxml2 should automatically
provide XHTML1.0-Strict.  If not then should libxml2 be creating HTML 1.0?
;--)


I agree with most of the above, but an alteration would not affect the
behaviour you're worried about; XHTML should be parsed with the XML
parser, not the legacy HTML parser - and this issue involves the
behaviour of the latter.


  Right.

A few months ago, I came across a "bug" where whitespace nodes as a
direct child of the <body> tag would be removed. The problem is similar
in that pure whitespace nodes are forbidden by the strict DTD, but
allowed by the transitional DTD.

In this case, the applied patch checked the DTD in use with code like

dtd = xmlGetIntSubset(ctxt->myDoc);
if (dtd != NULL && dtd->ExternalID != NULL) {
    if (!xmlStrcasecmp(dtd->ExternalID,
                       BAD_CAST "-//W3C//DTD HTML 4.01//EN") ||
        !xmlStrcasecmp(dtd->ExternalID,
                       BAD_CAST "-//W3C//DTD HTML 4//EN")) {
(line 2060, HTMLparser.c).

This code assumes that HTML 4 and HTML 4.01 are the only strict non-XML
DTDs in existence.
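
Keeping the identifiers in one table would make that assumption explicit and
easier to extend. The following is only a standalone sketch, not the code in
HTMLparser.c: the helper name is_strict_html_public_id and the table are
hypothetical, the table lists just the two identifiers quoted above, and POSIX
strcasecmp stands in for xmlStrcasecmp.

```c
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Public identifiers treated as strict: only the two quoted in the patch. */
static const char *const strict_html_ids[] = {
    "-//W3C//DTD HTML 4.01//EN",
    "-//W3C//DTD HTML 4//EN",
};

/* Hypothetical helper: returns 1 if public_id matches one of the strict
 * identifiers above (compared case-insensitively), 0 otherwise. */
static int is_strict_html_public_id(const char *public_id)
{
    size_t i;

    if (public_id == NULL)
        return 0;
    for (i = 0; i < sizeof(strict_html_ids) / sizeof(strict_html_ids[0]); i++) {
        if (strcasecmp(public_id, strict_html_ids[i]) == 0)
            return 1;
    }
    return 0;
}
```

A Transitional identifier such as "-//W3C//DTD HTML 4.01 Transitional//EN"
does not match either entry, so it would take the non-strict path.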

Something similar might be useful for this issue - the <p> tags are not
needed for a Transitional DTD. I'll have a look to see if there's an
easy fix at the weekend, if nobody's supplied a patch before that :-)


  Yes, thanks! That sounds like the right approach to me. I would just merge
that with a new htmlParserOption, HTML_PARSE_STRICT, which could either be
passed by the user to maintain the current behaviour or activated by default
when the DOCTYPE is read, if it happens to be a Strict HTML one.

   Does that make sense?
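
As a rough illustration of how those two triggers might combine, here is a
standalone sketch. Everything in it is an assumption made for the example:
HTML_PARSE_STRICT is only proposed in this thread, so the flag is given a
placeholder name and an arbitrary bit value, and the sketch_* helpers are
hypothetical rather than part of libxml2.

```c
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Placeholder for the proposed HTML_PARSE_STRICT; the bit value is arbitrary. */
#define SKETCH_HTML_PARSE_STRICT (1 << 20)

/* Hypothetical check: 1 if the DOCTYPE public identifier is one of the two
 * strict HTML identifiers mentioned in this thread, 0 otherwise. */
static int sketch_doctype_is_strict(const char *public_id)
{
    if (public_id == NULL)
        return 0;
    return strcasecmp(public_id, "-//W3C//DTD HTML 4.01//EN") == 0 ||
           strcasecmp(public_id, "-//W3C//DTD HTML 4//EN") == 0;
}

/* Strict handling is on if the user asked for it or the DOCTYPE is strict.
 * Other option bits are passed through unchanged. */
static int sketch_effective_options(int user_options,
                                    const char *doctype_public_id)
{
    if (sketch_doctype_is_strict(doctype_public_id))
        user_options |= SKETCH_HTML_PARSE_STRICT;
    return user_options;
}
```

With this shape, a caller that passes the flag explicitly gets strict handling
regardless of the DOCTYPE, and a document with a strict DOCTYPE gets it
without the caller asking.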

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
