Re: [xml] Adding default DOCTYPE when HTML document doesn't have any



On Mon, Jul 26, 2010 at 10:24:29AM +0200, Damian Pietras wrote:
Hi, I use libxml to do HTML processing using htmlParseDocument, then do
some simple transformations (like replacing URIs to correct relative
paths etc.) and then save the document using xmlSaveDoc(). The output is
an HTML file that is passed to the web browser.

The problem is that when there is no DOCTYPE declaration in the
input document, libxml2 adds a default one:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
      "http://www.w3.org/TR/REC-html40/loose.dtd">

There is a difference in rendering of pages by web browsers that comes
from various quirks modes that are turned on or off based on the DOCTYPE
declaration. To illustrate the difference there is a test page where you
can see the same HTML/CSS code with various DOCTYPEs prepended:

http://dbaron.org/mozilla/tests/compat?doctype=
http://dbaron.org/mozilla/tests/compat?doctype=%3C!DOCTYPE+HTML+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+HTML+4.01+Transitional%2F%2FEN%22+%22http%3A%2F%2Fwww.w3.org%2FTR%2Fhtml4%2Floose.dtd%22%3E
http://dbaron.org/mozilla/tests/compat?doctype=%3C!DOCTYPE+HTML+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+HTML+4.01+Transitional%2F%2FEN%22%3E
http://dbaron.org/mozilla/tests/compat?doctype=%3C!DOCTYPE+HTML%3E

Although in the cases I've seen a web page with no DOCTYPE is rendered
the same as with the DOCTYPE that libxml2 prepends, I would be happy if
there was a way not to add the default DOCTYPE, or to know that the
original document had no DOCTYPE at all. Is there a way to do that?

  Hum,

this is added automatically at the end of htmlParseDocument() if no
doctype was found, and until now there is no option to turn this off.

Since this is an arbitrary behaviour from libxml2, and while this can
be fixed (by finding and removing said DTD from the resulting tree),
I think it's best to provide a new HTML_PARSE_NODEFDTD parsing option
for the HTML parser to avoid this. The code is actually fairly simple;
I'm attaching the patch I will commit soon.

I'm also adding a --nodefdtd option to xmllint, to be used with --html,
in order to activate the flag:

paphio:~/XML -> xmllint --html --debug tst.html
HTML DOCUMENT
URL=tst.html
standalone=true
  DTD(html), PUBLIC -//W3C//DTD HTML 4.0 Transitional//EN, SYSTEM http://www.w3.org/TR/REC-html40/loose.dtd
  ELEMENT html
    ELEMENT body
      TEXT
        content=
paphio:~/XML -> xmllint --html --nodefdtd --debug tst.html
HTML DOCUMENT
URL=tst.html
standalone=true
  ELEMENT html
    ELEMENT body
      TEXT
        content=
paphio:~/XML ->

  thanks for raising the issue,

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

Attachment: no_html_default_dtd.patch
Description: Text document


