Re: [xml] LibXML Incorrectly Parses Tables by Omitting Implied TBODY

From: Daniel Veillard <veillard redhat com>
To: Alan Hogan <alanhogan gmail com>
Cc: xml gnome org
Subject: Re: [xml] LibXML Incorrectly Parses Tables by Omitting Implied TBODY
Date: Mon, 26 Sep 2011 15:59:46 +0800

On Thu, Sep 22, 2011 at 11:52:16AM -0700, Alan Hogan wrote:

According to HTML 4, HTML5, and all the browsers I have tested* (including Firefox, IE7/8/9, Chrome, 
Safari, Opera, Android, iOS):
- No <table> should be without a <tbody>. 

- No <tr> should exist outside of a <thead>, <tfoot>, or <tbody>. 

- The first <tr> encountered in a <table>, if not within a <thead> or <tfoot>, 
and if no <tbody> was manually defined, implies that a <tbody> element was 
just created as well (as the parent of this and all subsequent <tr>s). 

LibXML, however, seems happy to parse <tr> elements as if they were direct children of a table. 
This is simply wrong, nonstandard, and incompatible with user agents. 

It is creating a headache for me because CSS / XPath selections will not act as expected, and in an 
asymmetrical way with regard to actual users' browsers.
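
The asymmetry can be illustrated with a minimal sketch. It uses Python's standard xml.etree.ElementTree as a stand-in for any parser that keeps the tree as written; it is not libxml2, and the markup is a made-up example.

```python
import xml.etree.ElementTree as ET

# Well-formed markup with no explicit <tbody>, parsed without any
# inference of missing elements (made-up example document).
doc = ET.fromstring(
    "<html><body><table>"
    "<tr><td>a</td></tr>"
    "<tr><td>b</td></tr>"
    "</table></body></html>"
)

# Path written against a browser's DOM, where <tbody> has been implied.
rows_via_tbody = doc.findall(".//table/tbody/tr")
# Path that matches the tree as it was actually parsed.
rows_direct = doc.findall(".//table/tr")

print(len(rows_via_tbody), len(rows_direct))  # prints "0 2"
```

A selector copied from a browser inspector therefore matches nothing in the parsed tree, while the direct-child path matches both rows.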

Can we get this to be considered a bug? 

After all, it's not that the document author declared there was no 
tbody. Wittingly or no, they implied its presence; LibXML is simply failing to 
make the correct inference.


  The big problem is that when you start making inferences like that,
you do change the document. In some basic cases it's rather hard to go
wrong, but real-world HTML is not about basic cases; it's about an ocean
of HTML broken in all possible ways.

  Even something as simple as implying <body> gets nasty really fast.
Assume a document starts with <p>: you would think that per the rules you
can add an implicit <html><body> ... well, until you hit

<p>blah
<title>foo

Yes, that's wrong; yes, it exists; engines will parse and render this
silently. And no, I won't try to fix it: maybe <p> was added by some
broken customization layer, maybe the beginner who typed this thought
title was a good substitute for h1. If libxml2 starts doing this, it will
be setting policies on how to handle brokenness, and since it's a library
it's the wrong place to put them. Browsers are mostly end applications,
so it's fine for them to implement policies; for libxml2, as a building
block, we can't.

Now for tables, that's even more complex.
Sometimes the best thing at the parser level is to just *parse* and leave
the interpretation of the result to the application, because if you try to
interpret based on the specification, well, in real HTML you're guaranteed
to blow up one way or another.
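
As a rough sketch of what leaving the interpretation to the application could look like, the snippet below wraps stray rows in a <tbody> after parsing. It uses Python's standard xml.etree.ElementTree rather than libxml2, the helper name wrap_stray_rows is hypothetical, and the policy it applies (one new <tbody> per consecutive run of direct-child <tr> elements) is just one possible choice, not what any browser or specification mandates in full.

```python
import xml.etree.ElementTree as ET

def wrap_stray_rows(table):
    """Group direct-child <tr> elements of `table` under new <tbody> elements.

    Hypothetical application-level policy: each consecutive run of stray
    rows gets its own <tbody>; other children are left where they are.
    Only the given table is touched; nested tables are not traversed.
    """
    new_children = []
    current = None  # <tbody> collecting the current run of stray rows
    for child in list(table):
        if child.tag == "tr":
            if current is None:
                current = ET.Element("tbody")
                new_children.append(current)
            current.append(child)
        else:
            current = None  # a non-row child ends the current run
            new_children.append(child)
    # Replace the table's children with the regrouped list.
    table[:] = new_children

table = ET.fromstring("<table><tr><td>a</td></tr><tr><td>b</td></tr></table>")
wrap_stray_rows(table)
print(ET.tostring(table, encoding="unicode"))
# prints "<table><tbody><tr><td>a</td></tr><tr><td>b</td></tr></tbody></table>"
```

Because the rewrite happens in the application, each consumer can pick the policy that suits its documents, and the parser output itself stays unchanged.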

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

