Re: [xml] Fwd: HTML5 test cases



On 11/03/2010 02:50 PM, Daniel Veillard wrote:

  Well if there is now a good semantic about what an HTML parser should
do in corner cases, I have no problem with getting patches in!
  The current HTML parser was basically implemented using the HTML4 spec
but without the craziness of trying to mimic what browsers do with
that input. The main usage is screen-scraping or conversion to XML
(at least for me) and that wasn't looking worth the effort.
   Now if there is a decent semantic about what a parser should do with
HTML5 and HTML5-like (that's the problem) kind of input, then nice,
I'm sure once it gets REC status then people will be enthusiastic to
develop small parsers and maybe libxml2 can be one of them.
   Me I'm really welcoming HTML5 parser patches, one can probably make
a new parsing option for the existing parser to allow old and new
behaviour (or switch automatically but we all know it's error prone :-)
But I have no time to develop this myself, libvirt is what I'm
working on ATM,

This does not need to wait until REC status; the parsing algorithm is fairly stable.

Some background: Henri wrote a fully compliant HTML parser in Java, and has been keeping it in sync with the specification (at times even writing bug reports against the HTML5 spec as required):

http://about.validator.nu/htmlparser/

He then wrote a translator which mechanically converts his usage of Java into a C++ program with dependencies on some Mozilla libraries:

http://groups.google.com/group/mozilla.dev.platform/msg/35ace94ab1ae1511?pli=1
http://mxr.mozilla.org/mozilla-central/source/parser/

The result is not only compliant with the HTML5 specification, it is the actual parser which will ship with Firefox 4:

http://hg.mozilla.org/mozilla-central/rev/129e19d979f0

Oversimplifying, but if this same code could target libxml2's underlying string and DOM handling routines, the result of a parse would be immediately useful to applications which build on top of libxml2.

Daniel

- Sam Ruby



