Re: [xml] Support of HTML v5 parsing



Bruce Miller wrote on 28.05.2015 at 18:37:
On 05/28/2015 12:29 PM, Noam Postavsky wrote:
On Thu, May 28, 2015 at 12:13 PM, Frank Gross wrote:
  Are there any plans to support parsing of HTML5 in libxml? I tried the
function htmlCtxtReadMemory(), but it raises an error for HTML documents
containing tags introduced in HTML5, such as: Tag header invalid.

I'd love to see this happen!  I'm so used to the libxml2 tools,
and the tools built upon them, it would SO simplify my life.

I think the same question has already been asked, and answered at
https://mail.gnome.org/archives/xml/2013-April/msg00006.html

Sorta, yes. But HTML5 is essentially _defined_ by its parser rather than
by its spec. In particular the (annoying) way that it rewrites the DOM
to turn what you wrote into what it wants.  That being the case, there's
more to adapting libxml's HTML parser than just being more forgiving about
the unrecognized tags --- the resulting DOM might not be quite what HTML5
specifies!

I think most people would be happy if the new tags were recognised
correctly, e.g. the self-closing ones. Whether or not the resulting DOM
tree strictly conforms to HTML5 parsing - does it really matter that
much?


Which is all to say that it's not quite trivial; would probably amount to
importing the "official" parser and modifying it to create libxml's internal
structure.  Sadly, Daniel doesn't have the time.   Nor, alas, do I.

There's a long list of tag metadata in the HTMLparser.c file. I'm sure a
patch that adds just a couple of the new tags would be warmly appreciated.
As long as everyone just goes "*I* don't have time ATM, not even to add one
little tag", nothing's going to change here.
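
To make the "tag metadata" idea concrete, here is a minimal sketch of what
such a table and its lookup could look like. The struct, the field names and
lookup_tag() are hypothetical and much simpler than the real entries in
HTMLparser.c; the point is only that recognising a new tag mostly means
adding one row with the right flags (e.g. whether it is a void element).

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical, simplified element description. The real table in
 * libxml2's HTMLparser.c carries more information per entry. */
struct tag_info {
    const char *name;  /* lower-case element name */
    int is_void;       /* 1 if the element never takes an end tag */
    int is_inline;     /* 1 for inline-level, 0 for block-level */
};

static const struct tag_info known_tags[] = {
    /* a few elements that predate HTML5 */
    { "br",      1, 1 },
    { "div",     0, 0 },
    { "p",       0, 0 },
    { "span",    0, 1 },
    /* elements introduced in HTML5 */
    { "article", 0, 0 },
    { "footer",  0, 0 },
    { "header",  0, 0 },
    { "nav",     0, 0 },
    { "section", 0, 0 },
    { "wbr",     1, 1 },
};

/* Case-insensitive comparison, since HTML tag names ignore case. */
static int names_equal(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;  /* equal only if both strings ended together */
}

/* Returns the table entry for name, or NULL if the tag is unknown. */
const struct tag_info *lookup_tag(const char *name)
{
    size_t n = sizeof known_tags / sizeof known_tags[0];
    for (size_t i = 0; i < n; i++) {
        if (names_equal(known_tags[i].name, name))
            return &known_tags[i];
    }
    return NULL;
}
```

With rows like these in place, lookup_tag("header") returns an entry instead
of NULL, so a parser consulting the table would have no reason to report
"Tag header invalid". It does nothing about the HTML5 tree-rewriting rules
mentioned above.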

Stefan


