Re: [xml] htmlDocDumpMemory() vs xmlDocDumpMemory()
- From: Rush Manbert <rush manbert com>
- To: Julien Chaffraix <julien chaffraix gmail com>
- Cc: xml gnome org, veillard redhat com
- Subject: Re: [xml] htmlDocDumpMemory() vs xmlDocDumpMemory()
- Date: Fri, 27 Feb 2009 17:50:23 -0800
Hi Julien,
Finally got some time to look at this more. Comments inline.
On Feb 23, 2009, at 2:25 AM, Julien Chaffraix wrote:
Hi,
Sorry for not catching this in the first place.
On Mon, Feb 23, 2009 at 9:31 AM, Daniel Veillard
<veillard redhat com> wrote:
On Tue, Feb 17, 2009 at 12:29:02PM -0800, Rush Manbert wrote:
I am processing XHTML source files, rendering them to HTML strings, then
loading the HTML string into a browser control (Webkit).
Rendering XHTML files as HTML is asking for trouble. HTML has some
quirks that will make your XHTML page render strangely.
Originally I was generating the string by calling xmlDocDumpMemory(),
but I kept reading articles that suggested you render as HTML if the
result is being displayed by a browser. I changed to use
htmlDocDumpMemory(), and my application still worked with no problems.
Recently, however, we were developing a new set of web pages, and I had
occasion to load the HTML string output into a real browser (Safari), by
first writing the HTML string to a file, then opening the file in the
browser. To my surprise, the JavaScript error console displayed quite a
few errors. Many of them were complaints that the HTML contained element
pairs such as "<br></br>", or "<p></p>". Someone had asked me why we had
extra blank lines in the browser display, and I finally realized it was
because Safari was treating <br></br> as <br><br> (which is what the
error message said it would do).
I had a look at our HTML parser and it seems that in quirks mode,
</br> is interpreted as <br> as you were reporting
(just check the comment at
http://trac.webkit.org/browser/trunk/WebCore/html/HTMLParser.cpp#L204).
So it is not a bug but a compatibility quirk (provided you are indeed
in quirks mode). I think the complaint about <p></p> is an overzealous
check for </p> with unmatched <p> (again in quirks mode) but I may be
wrong here.
Rush, have you specified a doctype in your html file? Have you checked
how other browsers behave?
Here is a sample of rendered output, using htmlDocDumpMemory():
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<!--Template match for /x:html/x:head-->
<head><base href="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/
">
<!--No imltemplate file, or no x:imltemplatehead.-->
<!--Processing $docHeadContent directly-->
<title>RenderingTestPage</title></head>
<!--Template match for /x:html/x:body-->
<body><!--No imltemplate file, or no x:imltemplatebody.-->
<!--Processing $docBodyContent directly-->
<p>Line 1</p><i>Line 2</i><p> Line 3: This comes before a XML-legal
"br" element<br></br>Line 4: And this comes immediately after it</p><img
src="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/images/engChurchill.jpg
"></img></body>
</html>
I didn't clean it up at all, so it has a bunch of trace output that my
XSL processing inserts. I have a doctype that seems to be inserted by
libxml. My doctype declaration in the XHTML source specifies my own
DTD, which is an extension of XHTML. I also tried changing it to this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd" >
but the rendered output has the same doctype declaration shown above,
whether I use htmlDocDumpMemory() or xmlDocDumpMemory().
I tried loading this from a file with Safari and Firefox on the Mac
and IE and Firefox on Windows. They all treat the </br> as <br>, and
output an extra blank line, but only Safari said anything about it on
the console.
I'm afraid that I lied about the <p></p>. They are fine with that.
However, Safari also complained about the </img>, while no one else
did (but it displayed the image).
From an XML parser's point of view, <br /> and <br></br> are strictly
equivalent (well, except for the Microsoft reader API, which
distinguishes the two but should not), so if your browser is loading
the file with an XML parser then the two forms are equivalent (BTW
Safari is using libxml2 for XML parsing, so maybe someone can comment
about this in more detail ;-)
Sure :-)
WebKit is using libxml2's SAX callbacks. Both forms should lead to the
same sequence of callbacks and thus result in the same element being
created. I have tried this and it is the case in Safari 3.2.1.
Now an HTML parser should make no difference between <br /> and
<br>, that's why it's suggested to serialize XHTML that way.
The behaviour you mention sounds like a bug in my opinion: <br />
should be safe for both kinds of parsing, unless Safari internally
loads the file as XML, reserializes it as <br></br> and then hands this
to the HTML parser. I don't see any other logical way to achieve what
you got.
No, we avoid moving documents from one parser to another. We determine
the document type using different methods (content-type header,
extension ...) and then use either the XML parser that uses libxml2 or
our own HTML parser.
I guess one thing that I don't understand is why htmlDocDumpMemory()
would ever insert a </br>, since HTML doesn't need it, or a separate
</img>, since there can't be any content between the <img> and the </img>.
However, if it's safer to render as XML, I can go back to that.
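
To make the question above concrete: in HTML 4, elements such as br, img
and base are declared EMPTY and take no end tag, so a serializer that
writes HTML has to consult a table of those names. The sketch below is a
hypothetical illustration of that rule only; it is not libxml2's code.
The names is_html_empty_element and write_childless_element are made up
for this example, and element names are assumed to be lowercase, as they
are in XHTML.

```c
#include <stdio.h>
#include <string.h>

/* Elements declared EMPTY in the HTML 4.01 DTD: they never take an end tag. */
static const char *const kEmptyElements[] = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param"
};

/* Returns 1 if `name` is an HTML empty element, 0 otherwise.
 * Assumption: `name` is already lowercase (no case folding is done). */
int is_html_empty_element(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof kEmptyElements / sizeof kEmptyElements[0]; i++) {
        if (strcmp(name, kEmptyElements[i]) == 0)
            return 1;
    }
    return 0;
}

/* Writes an attribute-less element that has no children into `out`.
 * Returns whatever snprintf returns (the length that was needed). */
int write_childless_element(char *out, size_t size, const char *name,
                            int as_html)
{
    if (!as_html)
        return snprintf(out, size, "<%s/>", name);        /* XML form */
    if (is_html_empty_element(name))
        return snprintf(out, size, "<%s>", name);         /* start tag only */
    return snprintf(out, size, "<%s></%s>", name, name);  /* needs end tag */
}
```

Under this rule "br" comes out as <br> in HTML mode and <br/> in XML
mode, while "p" keeps its end tag in HTML mode. The <br></br> form seen
in the htmlDocDumpMemory() output is what results when the element is
not matched against such a table.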
Here is the same page rendered with xmlDocDumpMemory():
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<!--Template match for /x:html/x:head-->
<head><base href="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/
"/><!--No imltemplate file,
or no x:imltemplatehead.-->
<!--Processing $docHeadContent directly-->
<title>RenderingTestPage</title></head>
<!--Template match for /x:html/x:body-->
<body><!--No imltemplate file, or no x:imltemplatebody.-->
<!--Processing $docBodyContent directly-->
<p>Line 1</p><i>Line 2</i><p> Line 3: This comes before a XML-legal
"br" element<br/>Line 4: And this comes immediately after it</p><img
src=
"file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/images/engChurchill.jpg
"/></body>
</html>
And here is the output of diff xml html:
1,2c1
< <?xml version="1.0" encoding="utf-8" standalone="yes"?>
< <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "">
---
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
5c4,5
< <head><base href="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/
"/><!--No imltemplate file, or no x:imltemplatehead.-->
---
> <head><base href="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/
">
> <!--No imltemplate file, or no x:imltemplatehead.-->
11c11
< <p>Line 1</p><i>Line 2</i><p> Line 3: This comes before a XML-legal
"br" element<br/>Line 4: And this comes immediately after it</p><img
src="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/images/engChurchill.jpg
"/></body>
---
> <p>Line 1</p><i>Line 2</i><p> Line 3: This comes before a XML-legal
"br" element<br></br>Line 4: And this comes immediately after it</p><img
src="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/images/engChurchill.jpg
"></img></body>
To me, the XHTML output looks to be better behaved and it sounds like
your recommendation would be to keep my output as XHTML.
My question to Daniel would be this: If I want to render as XHTML,
should I use the new API, or just stay with xmlDocDumpMemory()?
Best regards,
Rush