Re: [xml] Cleaning the Web - Implementing HTML 5 parsing in libxml2

From: Andi Sidwell <andi takkaria org>
To: Stefan Behnel <stefan_ml behnel de>
Cc: xml gnome org, "Michael \(tm\) Smith" <mike w3 org>, Nick Kew <nick webthing com>
Subject: Re: [xml] Cleaning the Web - Implementing HTML 5 parsing in libxml2
Date: Wed, 20 Aug 2008 15:34:41 +0100

Stefan Behnel wrote:

Hi,

Karl Dubost wrote:

   Nick Kew weighed in and proposed that we should target [6]libxml
   which includes an HTML parser and is already supported by Apache
   server and many other tools.

      [6] http://xmlsoft.org/html/libxml-HTMLparser.html

   From here it would be interesting to implement the HTML 5 parsing
   algorithm in libxml2. It would benefit the community at large.


Have you tried joining forces with the people who started the C implementation
of html5lib? Maybe they have ideas to contribute or (partially) working code
that you can look at. It may even happen that you get them convinced of the
project.

In any case, having working implementations in Python and Java should get you
a lot closer to your goal by looking under the hood.


FWIW, I've spent the summer working on a C HTML5 parser called
Hubbub[1], which is approaching stability.  It's about half as fast as
libxml2 at parsing the HTML 5 spec with an O(1) treebuilder, and it's
fairly easy to bind to the libxml2 interfaces (it is being used in lieu
of the libxml2 HTML parser in the development branch of a small Web
browser, NetSurf[2]).  Note that it a) isn't buildable as a shared
library and b) hasn't had a formal release, but if someone wants an
HTML5 parser in C, then it's probably not a bad bet.

[1] http://www.netsurf-browser.org/projects/hubbub/
[2] http://www.netsurf-browser.org/

