Re: [xml] htmlDocDumpMemory() vs xmlDocDumpMemory()

From: Julien Chaffraix <julien chaffraix gmail com>
To: veillard redhat com
Cc: xml gnome org
Subject: Re: [xml] htmlDocDumpMemory() vs xmlDocDumpMemory()
Date: Mon, 23 Feb 2009 11:25:14 +0100

Hi,

Sorry for not catching this in the first place.

On Mon, Feb 23, 2009 at 9:31 AM, Daniel Veillard <veillard redhat com> wrote:

On Tue, Feb 17, 2009 at 12:29:02PM -0800, Rush Manbert wrote:

I am processing XHTML source files, rendering them to HTML strings, then
loading the HTML string into a browser control (Webkit).


Rendering XHTML files as HTML is asking for trouble. HTML has some
quirks that will make your XHTML page render strangely.

Originally I was generating the string by calling xmlDocDumpMemory(),
but I kept reading articles that suggested you render as HTML if the
result is being displayed by a browser. I changed to use
htmlDocDumpMemory(), and my application still worked with no problems.

Recently, however, we were developing a new set of web pages, and I had
occasion to load the HTML string output into a real browser (Safari), by
first writing the HTML string to a file, then opening the file in the
browser. To my surprise, the JavaScript error console displayed quite a
few errors. Many of them were complaints that the HTML contained element
pairs such as "<br></br>", or "<p></p>". Someone had asked me why we had
extra blank lines in the browser display, and I finally realized it was
because Safari was treating <br></br> as <br><br> (which is what the
error message said it would do).


I had a look at our HTML parser and it seems that, in quirks mode,
</br> is interpreted as <br>, as you reported
(just check the comment at
http://trac.webkit.org/browser/trunk/WebCore/html/HTMLParser.cpp#L204).
So it is not a bug but a compatibility quirk (provided you are indeed
in quirks mode). I think the complaint about <p></p> is an overzealous
check for </p> with an unmatched <p> (again in quirks mode), but I may be
wrong here.
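
To make the quirk concrete, here is a toy sketch in Python. It is not WebKit's code: the QuirksBreakCounter class and count_breaks() are hypothetical names, and the standard-library html.parser is used only as a tokenizer. The one assumption baked in is the behaviour described above, namely that a stray </br> counts as one more <br>.

```python
from html.parser import HTMLParser


class QuirksBreakCounter(HTMLParser):
    """Hypothetical helper: counts line breaks the way the quirk is described."""

    def __init__(self):
        super().__init__()
        self.breaks = 0

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.breaks += 1

    def handle_startendtag(self, tag, attrs):
        # <br /> is a single tag, so it counts once.
        if tag == "br":
            self.breaks += 1

    def handle_endtag(self, tag):
        # Assumed quirk: a stray </br> is treated like another <br>.
        if tag == "br":
            self.breaks += 1


def count_breaks(markup):
    counter = QuirksBreakCounter()
    counter.feed(markup)
    counter.close()
    return counter.breaks


for markup in ("a<br>b", "a<br />b", "a<br></br>b"):
    print(repr(markup), count_breaks(markup))
```

Under that assumption, <br> and <br /> each give one break, while <br></br> gives two, which would match the extra blank lines seen in the browser.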

Rush, have you specified a doctype in your html file? Have you checked
how other browsers behave?

 From an XML parser's point of view, <br /> and <br></br> are strictly
equivalent (well, except for the Microsoft reader API, which distinguishes
the two but should not), so if your browser is loading the file with an
XML parser then the two forms are equivalent (BTW Safari is using libxml2
for XML parsing, so maybe someone can comment on this in more detail ;-)


Sure :-)

WebKit is using libxml2's SAX callbacks. Both forms should lead to the
same sequence of callbacks and thus result in the same element being
created. I have tried this and it is the case in Safari 3.2.1.
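
The same point can be checked outside a browser. A minimal sketch, assuming Python's standard-library xml.sax (which sits on expat, not libxml2) as a stand-in SAX parser; EventRecorder and sax_events() are illustrative names only:

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Records element start/end callbacks in the order they arrive."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def sax_events(markup):
    recorder = EventRecorder()
    xml.sax.parseString(markup.encode("utf-8"), recorder)
    return recorder.events


short_form = sax_events("<p>one<br />two</p>")
long_form = sax_events("<p>one<br></br>two</p>")
print(short_form == long_form)  # prints True
```

Both inputs produce a start callback followed by an end callback for br, so a consumer that builds elements from the callbacks cannot tell the two spellings apart.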

 Now an HTML parser should make no difference between <br /> and
<br>, that's why it's suggested to serialize XHTML that way.
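
To see the three spellings side by side, here is a small sketch using Python's xml.etree.ElementTree. It is only an analogy for the XML-versus-HTML serializer choice discussed in this thread and says nothing about what libxml2 itself emits; the fragment is deliberately namespace-free to keep the output readable.

```python
import xml.etree.ElementTree as ET

# A namespace-free fragment keeps the output free of generated prefixes.
root = ET.fromstring("<p>one<br/>two</p>")

# XML serialization with the minimized empty-element form.
minimized = ET.tostring(root, encoding="unicode", method="xml")

# XML serialization with an explicit start/end tag pair.
expanded = ET.tostring(root, encoding="unicode", method="xml",
                       short_empty_elements=False)

# HTML serialization: void elements such as br get no end tag.
as_html = ET.tostring(root, encoding="unicode", method="html")

print(minimized)  # <p>one<br />two</p>
print(expanded)   # <p>one<br></br>two</p>
print(as_html)    # <p>one<br>two</p>
```

Only the expanded form contains a </br> end tag, which is the spelling an HTML parser may trip over.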

 The behaviour you mention sounds like a bug in my opinion: <br />
should be safe for both kinds of parsing, unless Safari internally
loads the file as XML, reserializes it as <br></br>, and then hands
that to the HTML parser. I don't see any other logical way to get
what you got.


No, we avoid moving documents from one parser to another. We determine
the document type using different methods (content-type header,
extension ...) and then use either the XML parser that uses libxml2 or
our own HTML parser.
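
As a rough illustration of that kind of dispatch (a hypothetical sketch, not WebKit's actual logic): choose_parser(), the MIME type set, and the extension set below are all assumptions made for the example.

```python
import os

# Assumed lists for illustration only.
XML_TYPES = {"text/xml", "application/xml", "application/xhtml+xml"}
XML_EXTENSIONS = {".xml", ".xhtml", ".xht"}


def choose_parser(content_type=None, filename=None):
    """Hypothetical dispatch: return "xml" or "html" for a document."""
    if content_type:
        # Drop parameters such as "; charset=utf-8".
        mime = content_type.split(";", 1)[0].strip().lower()
        if mime in XML_TYPES or mime.endswith("+xml"):
            return "xml"
        return "html"
    if filename:
        extension = os.path.splitext(filename)[1].lower()
        if extension in XML_EXTENSIONS:
            return "xml"
    # Fall back to the HTML parser when nothing says XML.
    return "html"


print(choose_parser("application/xhtml+xml; charset=utf-8"))
print(choose_parser(filename="page.html"))
```

Under these assumptions, the same markup saved as page.html would go to the HTML parser, while page.xhtml would go to the XML parser, so the file name alone could decide how <br></br> is interpreted.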

Regards,
Julien
