Re: [xml] htmlDocDumpMemory() vs xmlDocDumpMemory()

From: Rush Manbert <rush manbert com>
To: Julien Chaffraix <julien chaffraix gmail com>
Cc: xml gnome org, veillard redhat com
Subject: Re: [xml] htmlDocDumpMemory() vs xmlDocDumpMemory()
Date: Fri, 27 Feb 2009 17:50:23 -0800

Hi Julien,

Finally got some time to look at this more. Comments inline.

On Feb 23, 2009, at 2:25 AM, Julien Chaffraix wrote:

Hi,

Sorry for not catching this in the first place.
On Mon, Feb 23, 2009 at 9:31 AM, Daniel Veillard <veillard redhat com> wrote:
On Tue, Feb 17, 2009 at 12:29:02PM -0800, Rush Manbert wrote:
I am processing XHTML source files, rendering them to HTML strings, then
loading the HTML string into a browser control (Webkit).
Rendering XHTML files as HTML is asking for trouble. HTML has some
quirks that will make your XHTML page render strangely.
Originally I was generating the string by calling xmlDocDumpMemory(),
but I kept reading articles that suggested you render as HTML if the
result is being displayed by a browser. I changed to use
htmlDocDumpMemory(), and my application still worked with no problems.
Recently, however, we were developing a new set of web pages, and I had
occasion to load the HTML string output into a real browser (Safari), by
first writing the HTML string to a file, then opening the file in the
browser. To my surprise, the JavaScript error console displayed quite a
few errors. Many of them were complaints that the HTML contained element
pairs such as " ", or "". Someone had asked me why we had extra blank
lines in the browser display, and I finally realized it was because
Safari was treating </br> as <br> (which is what the error message said
it would do).
I had a look at our HTML parser and it seems that in quirks mode,
</br> is interpreted as <br>, as you were reporting
(just check the comment at
http://trac.webkit.org/browser/trunk/WebCore/html/HTMLParser.cpp#L204).
So it is not a bug but a compatibility quirk (provided you are indeed
in quirks mode). I think the complaint about is an overzealous
check for with unmatched (again in quirks mode) but I may be
wrong here.

Rush, have you specified a doctype in your html file? Have you checked
how other browsers behave?


Here is a sample of rendered output, using htmlDocDumpMemory():

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">

<!--Template match for /x:html/x:head-->

<head><base href="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/">

<!--No imltemplate file, or no x:imltemplatehead.-->
<!--Processing $docHeadContent directly-->
<title>RenderingTestPage</title></head>
<!--Template match for /x:html/x:body-->
<body><!--No imltemplate file, or no x:imltemplatebody.-->
<!--Processing $docBodyContent directly-->

Line 1Line 2 Line 3: This comes before a XML-legal"br" element Line 4: And this comes immediately after it<img src="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/images/engChurchill.jpg"></img></body>

</html>

I didn't clean it up at all, so it has a bunch of trace output that my
XSL processing inserts. I have a doctype that seems to be inserted by
libxml. My doctype declaration in the XHTML source specifies my own
DTD, which is an extension of XHTML. I also tried changing it to this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd" >
but the rendered output has the same doctype declaration shown above,
whether I use htmlDocDumpMemory() or xmlDocDumpMemory().

I tried loading this from a file with Safari and Firefox on the Mac
and IE and Firefox on Windows. They all treat the </br> as <br>, and
output an extra blank line, but only Safari said anything about it on
the console.

I'm afraid that I lied about the . They are fine with that.
However, Safari also complained about the </img>, while no one else
does (but it displayed the image).

From an XML parser <br /> and <br></br> are strictly equivalent (well
except for the Microsoft reader API which distinguishes the two but
should not), so if your browser is loading the file with an XML parser
then the two forms are equivalent (BTW Safari is using libxml2 for XML
parsing so maybe someone can comment about this in more detail ;-)


Sure :-)

WebKit is using libxml2's SAX callbacks. Both forms should lead to the
same sequence of callbacks and thus will result in the same element
being created. I have tried this and it is the case in Safari 3.2.1.
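
The equivalence can be checked outside WebKit as well. The sketch below is
only an illustration: it uses Python's standard xml.sax module (backed by
expat, not libxml2), and the EventRecorder class and sax_events() helper
are hypothetical names introduced here. It records the element callbacks
for both spellings of an empty "br" so they can be compared.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Collects (kind, element name) pairs for each SAX element callback."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def sax_events(markup):
    """Parse an XML string and return the recorded element events."""
    recorder = EventRecorder()
    xml.sax.parseString(markup.encode("utf-8"), recorder)
    return recorder.events


# The same fragment, once with a self-closed br and once with an explicit pair.
self_closed = sax_events("<p>Line 1<br />Line 2</p>")
explicit_pair = sax_events("<p>Line 1<br></br>Line 2</p>")

print(self_closed == explicit_pair)
```

With an XML parser, both inputs should produce a start and an end event for
"br", so the two lists compare equal. The difference only appears once the
markup is handed to an HTML parser.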

Now an HTML parser should make no difference between <br /> and
<br>, that's why it's suggested to serialize XHTML that way.

The behaviour you mention sounds like a bug in my opinion, <br />
should be safe for both kinds of parsing, except if internally Safari
loads as XML, reserializes as <br></br> and then hands this to the
HTML parser, I don't see any other logical way to achieve what you got.


No, we avoid moving documents from one parser to another. We determine
the document type using different methods (content-type header,
extension ...) and then use either the XML parser that uses libxml2 or
our own HTML parser.
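
As a rough sketch of that kind of dispatch (not WebKit's actual logic):
the choose_parser() helper, the MIME type set, and the extension set below
are all assumptions made for illustration. The idea is that the declared
content type decides first, and the file extension is only a fallback.

```python
import os

# Assumed lists for illustration; a real browser recognizes more types.
XML_MIME_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}
XML_EXTENSIONS = {".xhtml", ".xht", ".xml"}


def choose_parser(content_type=None, filename=None):
    """Return "xml" or "html" for a resource (hypothetical helper)."""
    if content_type:
        # Drop parameters such as "; charset=utf-8" before comparing.
        mime = content_type.split(";", 1)[0].strip().lower()
        if mime in XML_MIME_TYPES:
            return "xml"
        if mime == "text/html":
            return "html"
    if filename:
        extension = os.path.splitext(filename)[1].lower()
        if extension in XML_EXTENSIONS:
            return "xml"
    # Anything unrecognized falls back to the HTML parser.
    return "html"


print(choose_parser("application/xhtml+xml; charset=utf-8"))
print(choose_parser(None, "rendered.html"))
```

Under these assumptions, XHTML markup written to a file with an .html
extension would go through the HTML parser, which is where quirks such as
the </br> handling come into play.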

I guess one thing that I don't understand is why htmlDocDumpMemory()
would ever insert a </br>, since HTML doesn't need it, or a separate
</img>, since there can't be any content between the <img> and the
</img>.


However, if it's safer to render as XML, I can go back to that.

Here is the same page rendered with xmlDocDumpMemory():

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "">

<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">

<!--Template match for /x:html/x:head-->

<head><base href="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/"/><!--No imltemplate file, or no x:imltemplatehead.-->
<!--Processing $docHeadContent directly-->
<title>RenderingTestPage</title></head>
<!--Template match for /x:html/x:body-->
<body><!--No imltemplate file, or no x:imltemplatebody.-->
<!--Processing $docBodyContent directly-->

Line 1Line 2 Line 3: This comes before a XML-legal"br" element Line 4: And this comes immediately after it<img src="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/images/engChurchill.jpg"/></body>

</html>



And here is the output of diff xml html:

1,2c1
< <?xml version="1.0" encoding="utf-8" standalone="yes"?>
< <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "">
---
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
5c4,5
< <head><base href="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/"/>
---
> <head><base href="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/">
> <!--No imltemplate file, or no x:imltemplatehead.-->
11c11
< Line 1Line 2 Line 3: This comes before a XML-legal"br" element Line 4: And this comes immediately after it<img src="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/images/engChurchill.jpg"/></body>
---
> Line 1Line 2 Line 3: This comes before a XML-legal"br" element Line 4: And this comes immediately after it<img src="file://localhost//Users/rmanbert/development/libxmlRendering/client/pace/demos/iml/imlDemoV0_1/images/engChurchill.jpg"></img></body>

To me, the XHTML output looks to be better behaved and it sounds like
your recommendation would be to keep my output as XHTML.
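
For comparison, the three spellings of an empty element discussed in this
thread can be reproduced with a different serializer. This is only an
analogy: the sketch uses Python's standard xml.etree.ElementTree, not
libxml2, so it says nothing about what htmlDocDumpMemory() or
xmlDocDumpMemory() do internally. The sample markup and the a.jpg file
name are made up for the example.

```python
import xml.etree.ElementTree as ET

# Small stand-in for the rendered page body (no namespace, made-up image name).
root = ET.fromstring('<p>Line 1<br/>Line 2<img src="a.jpg"/></p>')

# XML serialization with self-closed empty elements.
as_xml = ET.tostring(root, encoding="unicode", method="xml")

# XML serialization that always writes an explicit end tag.
as_pairs = ET.tostring(
    root, encoding="unicode", method="xml", short_empty_elements=False
)

# HTML serialization: void elements such as br and img get no end tag.
as_html = ET.tostring(root, encoding="unicode", method="html")

for label, text in (("xml", as_xml), ("pairs", as_pairs), ("html", as_html)):
    print(label, text)
```

The second form, with separate </br> and </img> end tags, is the one that
triggers the behaviour described above when it reaches an HTML parser; the
first and third forms avoid it.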

My question to Daniel would be this: If I want to render as XHTML,
should I use the new API, or just stay with xmlDocDumpMemory()?


Best regards,
Rush

References:
- [xml] htmlDocDumpMemory() vs xmlDocDumpMemory()
  - From: Rush Manbert
- Re: [xml] htmlDocDumpMemory() vs xmlDocDumpMemory()
  - From: Daniel Veillard
- Re: [xml] htmlDocDumpMemory() vs xmlDocDumpMemory()
  - From: Julien Chaffraix
