Re: [xml] Support of HTML v5 parsing



On Mon, Jun 29, 2015 at 05:36:58PM +0200, Stefan Behnel wrote:
Bruce Miller schrieb am 28.05.2015 um 18:37:
On 05/28/2015 12:29 PM, Noam Postavsky wrote:
On Thu, May 28, 2015 at 12:13 PM, Frank Gross wrote:
  Are there any plans to support parsing HTML5 in libxml? I tried the
function htmlCtxtReadMemory(), but it raises an error for HTML documents
containing tags introduced in HTML5, such as: Tag header invalid.

I'd love to see this happen!  I'm so used to the libxml2 tools,
and the tools built upon them, it would SO simplify my life.

I think the same question has already been asked, and answered at
https://mail.gnome.org/archives/xml/2013-April/msg00006.html

Sorta, yes. But HTML5 is essentially _defined_ by its parser rather than
by its spec. In particular the (annoying) way that it rewrites the DOM
to turn what you wrote into what it wants.  That being the case, there's
more to adapting libxml's HTML parser than just being more forgiving about
the unrecognized tags --- the resulting DOM might not be quite what HTML5
specifies!

I think most people would be happy if the new tags were recognised
correctly, e.g. the self-closing ones. Whether or not the resulting DOM
tree strictly conforms to HTML5 parsing - does it really matter that
much?

 I assume that would not make us conformant, but that would make us less bad :-)


Which is all to say that it's not quite trivial; would probably amount to
importing the "official" parser and modifying it to create libxml's internal
structure.  Sadly, Daniel doesn't have the time.   Nor, alas, do I.

There's a long list of tag metadata in the HTMLparser.c file. I'm sure a
patch that adds just a couple of the new tags would be warmly appreciated.
As long as everyone just goes "*I* don't have time ATM, not even to add one
little tag", nothing's going to change here.

  Agreed, that's one way to do it, and based on my current work status
I don't see any "ample free time" coming any time soon, so we'd better be
very practical.
  Recognizing that a document is HTML5, extending the list of tag names
(did HTML5 deprecate some of those from HTML4?) and associated attributes
would be a relatively simple first step.
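
  As a rough illustration of that first step, here is a minimal sketch in
plain C. It does not use libxml2's actual data structures or the real table
in HTMLparser.c: the names html5_tag, html5_new_tags, is_html5_doctype() and
html5_lookup() are hypothetical, the tag list is partial, and the doctype
check only accepts the short "<!DOCTYPE html>" form (the legacy-compat
variant is ignored).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical entry type; NOT libxml2's element description struct. */
typedef struct {
    const char *name;   /* lowercase element name */
    int is_void;        /* 1 if the element never takes an end tag */
} html5_tag;

/* Partial list of elements added in HTML5 relative to HTML4. */
static const html5_tag html5_new_tags[] = {
    {"article", 0}, {"aside", 0},  {"audio", 0},  {"canvas", 0},
    {"embed", 1},   {"figure", 0}, {"footer", 0}, {"header", 0},
    {"main", 0},    {"nav", 0},    {"section", 0}, {"source", 1},
    {"track", 1},   {"video", 0},  {"wbr", 1}
};

/* Returns 1 if s begins with prefix, ignoring ASCII case. */
int starts_with_icase(const char *s, const char *prefix) {
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Returns 1 if a and b are equal, ignoring ASCII case. */
int equals_icase(const char *a, const char *b) {
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Returns 1 if the document starts with the short HTML5 doctype. */
int is_html5_doctype(const char *doc) {
    while (isspace((unsigned char)*doc)) doc++;
    if (!starts_with_icase(doc, "<!doctype")) return 0;
    doc += 9;                               /* length of "<!doctype" */
    if (!isspace((unsigned char)*doc)) return 0;
    while (isspace((unsigned char)*doc)) doc++;
    if (!starts_with_icase(doc, "html")) return 0;
    doc += 4;                               /* length of "html" */
    while (isspace((unsigned char)*doc)) doc++;
    return *doc == '>';                     /* anything else: not HTML5 */
}

/* Looks up a tag name in the table; returns NULL if it is not listed. */
const html5_tag *html5_lookup(const char *name) {
    size_t n = sizeof(html5_new_tags) / sizeof(html5_new_tags[0]);
    assert(name != NULL);
    for (size_t i = 0; i < n; i++) {
        if (equals_icase(name, html5_new_tags[i].name))
            return &html5_new_tags[i];
    }
    return NULL;
}
```

  A parser could call is_html5_doctype() once on the input and, when it
returns 1, consult html5_lookup() before reporting a tag such as "header" as
invalid. The is_void flag covers the self-closing elements mentioned earlier
in the thread. A linear scan is used only to keep the sketch short.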

  Is someone up to the task, or is there a list somewhere of HTML5 extensions
compared to HTML4?

   thanks,

Daniel

-- 
Daniel Veillard      | Open Source and Standards, Red Hat
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/

