Re: [xml] htmlDocDumpMemory() vs xmlDocDumpMemory()



Hi Julien,

Finally got some time to look at this more. Comments inline.

On Feb 23, 2009, at 2:25 AM, Julien Chaffraix wrote:

Hi,

Sorry for not catching this in the first place.

On Mon, Feb 23, 2009 at 9:31 AM, Daniel Veillard <veillard redhat com> wrote:
On Tue, Feb 17, 2009 at 12:29:02PM -0800, Rush Manbert wrote:
I am processing XHTML source files, rendering them to HTML strings, then
loading the HTML string into a browser control (Webkit).

Rendering XHTML files as HTML is asking for trouble. HTML has some
quirks that will make your XHTML page render strangely.

Originally I was generating the string by calling xmlDocDumpMemory(),
but I kept reading articles that suggested you render as HTML if the
result is being displayed by a browser. I changed to use
htmlDocDumpMemory(), and my application still worked with no problems.

Recently, however, we were developing a new set of web pages, and I had occasion to load the HTML string output into a real browser (Safari), by first writing the HTML string to a file, then opening the file in the browser. To my surprise, the JavaScript error console displayed quite a few errors. Many of them were complaints that the HTML contained element pairs such as "<br></br>", or "<p></p>". Someone had asked me why we had extra blank lines in the browser display, and I finally realized it was
because Safari was treating <br></br> as <br><br> (which is what the
error message said it would do).

I had a look at our HTML parser and it seems that in quirks mode,
</br> is interpreted as <br> as you were reporting
(just check the comment at
http://trac.webkit.org/browser/trunk/WebCore/html/HTMLParser.cpp#L204).
So it is not a bug but a compatibility quirk (provided you are indeed
in quirks mode). I think the complaint about <p></p> is an overzealous
check for </p> with unmatched <p> (again in quirks mode) but I may be
wrong here.
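
To make the quirk concrete, here is a minimal sketch that uses Python's standard html.parser module, not WebKit's parser. The rule that a stray </br> counts as another <br> is taken from the description above; the names BreakCounter and count_breaks are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class BreakCounter(HTMLParser):
    """Counts line breaks, applying the </br> quirk described above."""

    def __init__(self):
        super().__init__()
        self.breaks = 0

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.breaks += 1

    def handle_startendtag(self, tag, attrs):
        # An HTML parser ignores the trailing slash in <br />,
        # so it is handled like a plain start tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Assumed quirk: a stray </br> is treated as another <br>.
        if tag == "br":
            self.breaks += 1


def count_breaks(markup):
    """Return how many line breaks the markup would produce."""
    parser = BreakCounter()
    parser.feed(markup)
    parser.close()
    return parser.breaks
```

Under this rule, count_breaks("a<br>b") and count_breaks("a<br />b") each give 1, while count_breaks("a<br></br>b") gives 2, which matches the extra blank line reported above.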

Rush, have you specified a doctype in your html file? Have you checked
how other browsers behave?

Here is a sample of rendered output, using htmlDocDumpMemory():

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<!--Template match for /x:html/x:head-->
<head><base href="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/ ">
<!--No imltemplate file, or no x:imltemplatehead.-->
<!--Processing $docHeadContent directly-->
<title>RenderingTestPage</title></head>
<!--Template match for /x:html/x:body-->
<body><!--No imltemplate file, or no x:imltemplatebody.-->
<!--Processing $docBodyContent directly-->
<p>Line 1</p><i>Line 2</i><p> Line 3: This comes before a XML-legal "br" element<br></br>Line 4: And this comes immediately after it</p><img src="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/images/engChurchill.jpg "></img></body>
</html>

I didn't clean it up at all, so it has a bunch of trace output that my XSL processing inserts. I have a doctype that seems to be inserted by libxml. My doctype declaration in the XHTML source specifies my own DTD, which is an extension of XHTML. I also tried changing it to this: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd" > but the rendered output has the same doctype declaration shown above, whether I use htmlDocDumpMemory() or xmlDocDumpMemory().

I tried loading this from a file with Safari and Firefox on the Mac and IE and Firefox on Windows. They all treat the </br> as <br>, and output an extra blank line, but only Safari said anything about it on the console.

I'm afraid that I lied about the <p></p>. They are fine with that. However, Safari also complained about the </img>, while no one else does (but it displayed the image).



To an XML parser, <br /> and <br></br> are strictly equivalent (well,
except for the Microsoft reader API, which distinguishes the two but
should not), so if your browser is loading the file with an XML parser
then the two forms are equivalent (BTW Safari is using libxml2 for XML
parsing so maybe someone can comment about this in more detail ;-)

Sure :-)

WebKit is using libxml2's SAX callbacks. Both forms should lead to the
same sequence of callbacks and thus result in the same element being
created. I have tried this and it is the case in Safari 3.2.1.
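
The equivalence is easy to check with any SAX-style parser. As a rough sketch, the following uses Python's standard xml.sax module (backed by expat, not libxml2, so it only illustrates the principle). The Recorder class and the sax_events helper are hypothetical names introduced for this example.

```python
import xml.sax


class Recorder(xml.sax.ContentHandler):
    """Records element start/end callbacks in the order they arrive."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def sax_events(document):
    """Parse an XML string and return the recorded callback sequence."""
    handler = Recorder()
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler.events


short_form = sax_events("<p>Line 3<br/>Line 4</p>")
long_form = sax_events("<p>Line 3<br></br>Line 4</p>")
```

Both short_form and long_form hold the same start/end sequence for p and br, so code built on these callbacks cannot tell the two spellings apart.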

Now an HTML parser should make no difference between <br /> and
<br>, that's why it's suggested to serialize XHTML that way.

The behaviour you mention sounds like a bug in my opinion; <br />
should be safe for both kinds of parsing, unless Safari internally
loads as XML, reserializes as <br></br> and then hands this to the
HTML parser. I don't see any other logical way to achieve what you got.

No, we avoid moving documents from one parser to another. We determine
the document type using different methods (content-type header,
extension ...) and then use either the XML parser that uses libxml2 or
our own HTML parser.

I guess one thing that I don't understand is why htmlDocDumpMemory() would ever insert a </br>, since HTML doesn't need it, or a separate </img>, since there can't be any content between the <img> and the </img>.
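
For comparison, here is a small sketch of the three spellings of an empty element that a serializer can produce. It uses Python's standard xml.etree.ElementTree rather than libxml2, so it says nothing about why htmlDocDumpMemory() behaves as it does; it only shows what each form looks like for the same tree.

```python
import xml.etree.ElementTree as ET

# The same in-memory tree, serialized three ways.
para = ET.fromstring("<p>Line 3<br/>Line 4</p>")

# XML rules with the short empty-element form.
as_xml = ET.tostring(para, encoding="unicode", method="xml")

# HTML rules: void elements such as br get no end tag.
as_html = ET.tostring(para, encoding="unicode", method="html")

# XML rules with empty elements written as a start/end pair.
expanded = ET.tostring(
    para, encoding="unicode", method="xml", short_empty_elements=False
)

print(as_xml)    # <p>Line 3<br />Line 4</p>
print(as_html)   # <p>Line 3<br>Line 4</p>
print(expanded)  # <p>Line 3<br></br>Line 4</p>
```

The first two forms are the ones an HTML parser handles without surprises; the third is well-formed XML but is the spelling that triggered the extra line break.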

However, if it's safer to render as XML, I can go back to that.

Here is the same page rendered with xmlDocDumpMemory():

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<!--Template match for /x:html/x:head-->
<head><base href="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/ "/><!--No imltemplate file,
or no x:imltemplatehead.-->
<!--Processing $docHeadContent directly-->
<title>RenderingTestPage</title></head>
<!--Template match for /x:html/x:body-->
<body><!--No imltemplate file, or no x:imltemplatebody.-->
<!--Processing $docBodyContent directly-->
<p>Line 1</p><i>Line 2</i><p> Line 3: This comes before a XML-legal "br" element<br/>Line 4: And this comes immediately after it</p><img src="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/images/engChurchill.jpg "/></body>
</html>



And here is the output of diff xml html:

1,2c1
< <?xml version="1.0" encoding="utf-8" standalone="yes"?>
< <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "">
---
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
5c4,5
< <head><base href="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/ "/><!--No imltemplate file, or no x:imltemplatehead.-->
---
> <head><base href="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/ ">
> <!--No imltemplate file, or no x:imltemplatehead.-->
11c11
< <p>Line 1</p><i>Line 2</i><p> Line 3: This comes before a XML-legal "br" element<br/>Line 4: And this comes immediately after it</p><img src="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/images/engChurchill.jpg "/></body>
---
> <p>Line 1</p><i>Line 2</i><p> Line 3: This comes before a XML-legal "br" element<br></br>Line 4: And this comes immediately after it</p><img src="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/images/engChurchill.jpg "></img></body>

To me, the XHTML output looks to be better behaved and it sounds like your recommendation would be to keep my output as XHTML.
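
One way to check which output is still usable as XML is to feed it back to an XML parser. This is a minimal sketch using Python's standard xml.etree.ElementTree (again, not libxml2); is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


xhtml_style = "<p>Line 3<br/>Line 4</p>"       # xmlDocDumpMemory() style
paired_style = "<p>Line 3<br></br>Line 4</p>"  # the form seen in the HTML dump
html_style = "<p>Line 3<br>Line 4</p>"         # plain HTML, no end tag

results = [is_well_formed(s) for s in (xhtml_style, paired_style, html_style)]
```

Here results comes out as [True, True, False]: the paired form is fine as XML and only causes trouble once an HTML parser sees it, while plain HTML with an unclosed <br> is rejected by an XML parser.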

My question to Daniel would be this: If I want to render as XHTML, should I use the new API, or just stay with xmlDocDumpMemory()?

Best regards,
Rush



