Re: [xml] Apparently incorrect paragraph wrapping in HTML parser
- From: Daniel Veillard <veillard redhat com>
- To: iSteve <isteve deadcd org>
- Cc: xml gnome org
- Subject: Re: [xml] Apparently incorrect paragraph wrapping in HTML parser
- Date: Mon, 9 Jan 2006 09:19:05 -0500
On Mon, Jan 09, 2006 at 02:44:34PM +0100, iSteve wrote:
Greetings,
for the past week, I've been fixing various bugs in gtkhtml2. Recently,
I've found an issue that I -- hopefully correctly -- traced back to
libxml2's HTML parser.
When parsing HTML such as:
<html><body> xxx <div>aaa</div> yyy <div>bbb</div> zzz </body></html>
I get the 'xxx', 'yyy' and 'zzz' wrapped into paragraphs ("p" element,
e.g. "[...]<body><p>xxx</p><div>[...]").
The HTML:
<html><body>some <img src="foo.bar"> text</body></html>
turns into:
<html><body><p>some <img src="foo.bar"> text</p></body></html>
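The behaviour reported above can be mimicked with a small toy model. This is only a sketch built on Python's standard html.parser, not libxml2's code: the class name, the abridged BLOCK and VOID sets and the model_parse() helper are all assumptions made for illustration. The model opens an implied "p" whenever non-blank text appears directly under "body", and closes it at the next block element or at the end of "body".

```python
from html.parser import HTMLParser

# Abridged element lists; assumptions for this sketch, not libxml2's tables.
BLOCK = {"div", "p", "ul", "ol", "table", "pre", "blockquote", "h1", "h2", "h3"}
VOID = {"img", "br", "hr"}   # elements that never get an end tag
IMPLIED = "#implied-p"       # stack marker for a paragraph the model opened


class ImpliedParagraphModel(HTMLParser):
    """Toy model: text found directly under <body> opens an implied <p>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # serialized output fragments
        self.stack = []  # currently open elements

    def _close_implied(self):
        # Close the implied paragraph if it is the innermost open element.
        if self.stack and self.stack[-1] == IMPLIED:
            self.stack.pop()
            self.out.append("</p>")

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK:
            self._close_implied()  # a block element ends the implied paragraph
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{v}"' for k, v in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        self._close_implied()  # e.g. </body> ends a still-open implied paragraph
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Non-blank text sitting directly in <body> gets wrapped.
        if self.stack and self.stack[-1] == "body" and data.strip():
            self.out.append("<p>")
            self.stack.append(IMPLIED)
        self.out.append(data)


def model_parse(html):
    """Run the toy model over an HTML string and return the serialized result."""
    parser = ImpliedParagraphModel()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Fed the second document above, model_parse() returns the same string as the reported output, with "some", the img element and "text" all inside one "p". Fed the first document, it produces three "p" elements, one each around xxx, yyy and zzz (whitespace is kept as-is in this model).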
The reason is apparently that each text should be in its own block;
unfortunately, wrapping them right into paragraph elements has quite a
few drawbacks:
a) During later processing, e.g. a stylesheet may (and in fact does)
get applied to the "p" element; imagine, for example, having a
background-image set for all <p>, and you'll suddenly see it even where
it shouldn't be at all... It may therefore also break rendering of e.g.
float (please find the two attached test HTML files, one without "p"
elements, one with them).
b) It doesn't appear to be compliant with the standard either; at
least I didn't find any such rule in the HTML 4.01 standard.
c) I have no idea why the text goes into <p> in the second example,
too...
The spec for body is at:
http://www.w3.org/TR/REC-html40/struct/global.html#h-7.5.1
<!ELEMENT BODY O O (%block;|SCRIPT)+ +(INS|DEL) -- document body -->
I'm not sure text nodes are to be accepted directly as children of a body element.
For div, it seems adding the <p> is superfluous:
http://www.w3.org/TR/REC-html40/struct/global.html#edef-DIV
<!ELEMENT DIV - - (%flow;)* -- generic language/style container -->
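The two content models quoted above can be checked mechanically. The following is a minimal sketch, assuming a hand-abridged encoding of the %block; and %inline; entities from the HTML 4.01 Strict DTD; the set contents are shortened and allows_direct_text() is a hypothetical helper, and the +(INS|DEL) inclusion is left out.

```python
PCDATA = "#PCDATA"

# Abridged %block; and %inline; entities (assumption: only a few members listed).
BLOCK = {"p", "h1", "h2", "ul", "ol", "pre", "dl", "div", "blockquote",
         "form", "hr", "table", "address"}
INLINE = {PCDATA, "em", "strong", "a", "img", "span", "br"}
FLOW = BLOCK | INLINE  # %flow; is %block; | %inline;

# Content models as quoted from the DTD: BODY takes (%block;|SCRIPT)+,
# DIV takes (%flow;)*.
CONTENT_MODEL = {
    "body": BLOCK | {"script"},
    "div": FLOW,
}


def allows_direct_text(element):
    """Return True if character data may be a direct child of `element`."""
    return PCDATA in CONTENT_MODEL[element]
```

Under this encoding allows_direct_text("body") is False and allows_direct_text("div") is True, which matches the reading that text directly inside body is not valid under the quoted declaration while text inside div is. The Transitional DTD is looser and declares BODY as (%flow;)*, so the answer depends on which DTD the parser is modelled on.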
I do not believe that wrapping the text into a paragraph (which, I
believe, is performed by htmlCheckParagraph()) is the best way; perhaps
setting the tag name to e.g. NULL instead, or a zero-size string (as a
special value) would be a better way to resolve points a) and b). If
no styling and rendering were applied to the reported block (by the
aforementioned fix), it would imply that c) would no longer matter
anyway, too.
An element with no name or with an empty name would break so much
code assuming a correct name that nothing could justify such a hack, sorry!
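To make the concern about NULL or empty element names concrete, here is a small sketch of the kind of consumer code that takes a usable name for granted. Both helpers are hypothetical and stand in for application code in general; they are not taken from libxml2 or gtkhtml2.

```python
def serialize(name, text):
    """Naive serializer that trusts every element to have a usable name."""
    return f"<{name}>{text}</{name}>"


def style_key(name):
    """Normalize an element name for a stylesheet lookup."""
    return name.lower()
```

With a normal name, serialize("p", "xxx") gives "<p>xxx</p>". With a zero-size name it gives "<>xxx</>", which is not well-formed markup, and with a missing name (None standing in for NULL here) style_key() raises AttributeError instead of returning a key.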
Daniel
--
Daniel Veillard | Red Hat http://redhat.com/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/