Re: [xml] Apparently incorrect paragraph wrapping in HTML parser



On Mon, Jan 09, 2006 at 02:44:34PM +0100, iSteve wrote:
Greetings,
for the past week, I've been fixing various bugs in gtkhtml2. Recently, 
I've found an issue that I -- hopefully correctly -- traced back to 
libxml2's HTML parser.

When parsing HTML such as:
<html><body> xxx <div>aaa</div> yyy <div>bbb</div> zzz </body></html>

I get the 'xxx', 'yyy' and 'zzz' wrapped into paragraphs ("p" elements, 
e.g. "[...]<body><p>xxx</p><div>[...]").

The html:
<html><body>some <img src="foo.bar"> text</body></html>
turns into:
<html><body><p>some <img src="foo.bar"> text</p></body></html>
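
To make the reported behaviour easy to reproduce without a libxml2 build, here is a minimal sketch of the wrapping rule as described above: inline content found directly under "body" gets an implied "p". It uses Python's standard html.parser and is not libxml2's code. The function name, the class name, and the short BLOCK_ELEMENTS and VOID_ELEMENTS sets are assumptions made for illustration, and whitespace trimming is not modelled.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: a small, incomplete subset of HTML 4.01 block-level elements.
BLOCK_ELEMENTS = {"div", "p", "ul", "ol", "table", "pre", "blockquote",
                  "form", "hr", "h1", "h2", "h3", "h4", "h5", "h6"}
# Assumption: elements that never take an end tag.
VOID_ELEMENTS = {"img", "br", "hr", "input", "meta", "link"}


class ImpliedParagraphWrapper(HTMLParser):
    """Re-serialize HTML, wrapping inline content directly under <body> in <p>."""

    def __init__(self):
        super().__init__()
        self.out = []           # output fragments
        self.stack = []         # names of the currently open elements
        self.implied_p = False  # True while an implied <p> is open

    def _open_implied_p(self):
        # Only content whose parent is <body> triggers the implied paragraph.
        if self.stack and self.stack[-1] == "body" and not self.implied_p:
            self.out.append("<p>")
            self.implied_p = True

    def _close_implied_p(self):
        if self.implied_p:
            self.out.append("</p>")
            self.implied_p = False

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_ELEMENTS:
            self._close_implied_p()   # a block element ends the paragraph
        else:
            self._open_implied_p()    # an inline element starts one
        self.out.append(self.get_starttag_text())
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if tag == "body":
            self._close_implied_p()
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if data.strip():              # whitespace-only text starts nothing
            self._open_implied_p()
        self.out.append(escape(data, quote=False))


def wrap_implied_paragraphs(markup):
    """Return markup with implied <p> elements added (hypothetical helper)."""
    wrapper = ImpliedParagraphWrapper()
    wrapper.feed(markup)
    wrapper.close()
    return "".join(wrapper.out)


if __name__ == "__main__":
    print(wrap_implied_paragraphs(
        '<html><body>some <img src="foo.bar"> text</body></html>'))
```

Fed the second example, the sketch emits the same "<p>some <img src="foo.bar"> text</p>" wrapping as reported: the img is inline, so it never closes the implied paragraph.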

The reason is apparently that each text should be in its own block; 
unfortunately, wrapping them right into paragraph elements has quite a 
few drawbacks:

 a) During later processing, a stylesheet may (and in fact does) 
get applied to the "p" element; imagine, for example, having a 
background-image set for all <p>, and you'll suddenly see it even where 
it shouldn't be at all... It may therefore also break rendering of e.g. 
floats (please find the two attached test HTMLs, one without "p" 
elements, one with them).

 b) It doesn't appear to be compliant with the standard either; at 
least I didn't find any such requirement in the HTML 4.01 standard.

 c) I have no idea why the text goes into <p> in the second example, 
too...

  The spec for body is at:

  http://www.w3.org/TR/REC-html40/struct/global.html#h-7.5.1
    <!ELEMENT BODY O O (%block;|SCRIPT)+ +(INS|DEL) -- document body -->

I'm not sure text nodes are to be accepted directly as children of a body element.
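
To see which parts of a document run into this, here is a minimal sketch that lists the text sitting directly under "body", i.e. the content that the BODY declaration quoted above (block-level elements and SCRIPT only) does not cover. It uses Python's standard html.parser rather than libxml2. The class name, the function name, and the short VOID_ELEMENTS set are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: elements that never take an end tag (incomplete list).
VOID_ELEMENTS = {"img", "br", "hr", "input", "meta", "link"}


class BodyTextFinder(HTMLParser):
    """Collect non-whitespace text whose parent element is <body>."""

    def __init__(self):
        super().__init__()
        self.stack = []       # names of the currently open elements
        self.stray_text = []  # text found directly under <body>

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and self.stack[-1] == "body" and data.strip():
            self.stray_text.append(data.strip())


def text_directly_in_body(markup):
    """Return the text runs that are direct children of <body> (hypothetical helper)."""
    finder = BodyTextFinder()
    finder.feed(markup)
    finder.close()
    return finder.stray_text


if __name__ == "__main__":
    sample = "<html><body> xxx <div>aaa</div> yyy <div>bbb</div> zzz </body></html>"
    print(text_directly_in_body(sample))
```

For the first example in this thread it reports 'xxx', 'yyy' and 'zzz', and nothing for the text inside the div elements.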

  For div, adding the <p> seems superfluous:

  http://www.w3.org/TR/REC-html40/struct/global.html#edef-DIV
<!ELEMENT DIV - - (%flow;)*            -- generic language/style container -->

I do not believe that wrapping the text into a paragraph (which, I 
believe, is performed by htmlCheckParagraph()) is the best way; perhaps 
setting the tag name to e.g. NULL instead, or a zero-size string (as a 

  An element with no name or with an empty name would break so much
code assuming a correct tree that nothing could justify such a hack, sorry!

special value) would be a better way to resolve points a) and b). If 
no styling and rendering were applied to the reported block (by the 
aforementioned fix), it would imply that c) would no longer matter 
anyway, too.

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


