[xml] Apparently incorrect paragraph wrapping in HTML parser



Greetings,
for the past week, I've been fixing various bugs in gtkhtml2. Recently, I've found an issue that I -- hopefully correctly -- traced back to libxml2's HTML parser.

When parsing HTML such as:
<html><body> xxx <div>aaa</div> yyy <div>bbb</div> zzz </body></html>

I get the 'xxx', 'yyy' and 'zzz' wrapped into paragraphs ("p" elements, e.g. "[...]<body><p>xxx</p><div>[...]").

The HTML:
<html><body>some <img src="foo.bar"> text</body></html>
turns into:
<html><body><p>some <img src="foo.bar"> text</p></body></html>

The reason is apparently that each piece of text should be in its own block; unfortunately, wrapping the text directly in paragraph elements has quite a few drawbacks:

a) During later processing, a stylesheet may (and in fact does) get applied to the "p" element. Imagine, for example, a background-image set for all <p>: you'll suddenly see it even where it shouldn't be at all. It may therefore also break the rendering of e.g. floats (please find the two attached test HTMLs, one without "p" elements, one with them).

b) It doesn't appear to be compliant with the standard either; at least I didn't find any such requirement in the HTML 4.01 standard.

c) I have no idea why the text goes into <p> in the second example, too...

I do not believe that wrapping the text in a paragraph (which, I believe, is performed by htmlCheckParagraph()) is the best way; perhaps setting the tag name to e.g. NULL, or to a zero-length string (as a special value), would be a better way to resolve points a) and b). If no styling or rendering were applied to the reported block (thanks to the aforementioned fix), then c) would no longer matter either.
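
To make the difference concrete, here is a minimal sketch in Python. It is a toy model of the behaviour I observe and of the change I am proposing, not libxml2's actual code: Element, wrap_inline_runs() and select() are names made up for this illustration, the list of block tags is a tiny subset, and the whitespace around the text is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

BLOCK_TAGS = {"div", "p"}  # toy subset; real HTML has many more block elements


@dataclass
class Element:
    tag: Optional[str]  # None marks an anonymous block (the proposal)
    children: List["Node"] = field(default_factory=list)


Node = Union[str, Element]


def wrap_inline_runs(children, wrapper_tag):
    """Group each run of consecutive inline nodes into one wrapper element."""
    out, run = [], []
    for child in children:
        is_block = isinstance(child, Element) and child.tag in BLOCK_TAGS
        if is_block:
            if run:  # close the pending inline run before the block
                out.append(Element(wrapper_tag, run))
                run = []
            out.append(child)
        else:
            run.append(child)
    if run:  # trailing inline run
        out.append(Element(wrapper_tag, run))
    return out


def select(nodes, tag):
    """Return every element named `tag`, as a type selector would."""
    found = []
    for node in nodes:
        if isinstance(node, Element):
            if node.tag == tag:
                found.append(node)
            found.extend(select(node.children, tag))
    return found


# Children of <body> in the first example above.
body = ["xxx", Element("div", ["aaa"]), "yyy", Element("div", ["bbb"]), "zzz"]

as_reported = wrap_inline_runs(body, "p")   # text ends up inside <p>
as_proposed = wrap_inline_runs(body, None)  # text ends up in unnamed blocks

print(len(select(as_reported, "p")))  # 3: a `p` rule now styles xxx, yyy, zzz
print(len(select(as_proposed, "p")))  # 0: the same rule matches nothing
```

With an unnamed wrapper the text still sits in its own block, but a rule written for "p" no longer picks it up, which is the whole point of a) above.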

Thanks in advance for any reply.
 -- iSteve

xxx
aaa
yyy
bbb
zzz

xxx

aaa

yyy

bbb

zzz


