Re: [xml] Cleaning the Web - Implementing HTML 5 parsing in libxml2



Replying to multiple emails, now that I'm back from my break.

On 8 August 2008 at 17:25, Stefan Behnel wrote:
There's already an HTML5 implementation in Python (html5lib) which you can use together with lxml (so you can benefit from both HTML5 *and* libxml2 already).

Yes I know, I gave a pointer in the document. The issue is that it is slow.


IMHO, it's better to stick with higher level implementations during the
specification phase, and to push the work on an optimised, low-level C
implementation back until the target is a bit more focussed. But then, maybe
that's just me...

Mike Smith (co-chair of the HTML WG) should be able to give hints about the stability of the parsing section.


On 8 August 2008 at 18:33, Daniel Veillard wrote:
 Well as long as any technical argument is kept on the list that's
fine.

agreed.

 My main concern is that HTML5 is a working draft. I can't tell just
from the draft (or rather http://www.w3.org/TR/2008/WD-html5-diff-20080610/)
if people globally agree on the parsing processing or if changes are likely
in the future.

see above.

On 9 August 2008 at 00:51, Chris Wilson wrote:
You're serving it as XHTML.

Yes, it is an XHTML 1.1 document served as application/xhtml+xml. I have put a text copy below.

On 9 August 2008 at 17:56, Michael Day wrote:
In summary: it would be great if libxml2 was also a HTML5 parser!
Is anyone available to implement it? :)

That's the core issue. I wonder if Nick Kew is reading here?


Thanks everyone.


                      Clean the Web with libxml2

Introduction

   The Web (of HTML/XHTML documents) is largely made of tag soup:
   invalid and non-well-formed markup.

   In 1996, at WWW5, the paper "[1]An Investigation of Documents from
   the World Wide Web" reported on 2.6 million HTML documents collected
   by the Inktomi Web crawler. The authors found that "over 40% of the
   documents in our study contain at least one error". There have been
   a number of surveys since; [2]The Web Authoring Statistics by Ian
   Hickson at Google is one of the most recent. According to these
   surveys, 90% to 95% of the Web is invalid and/or non-well-formed.

      [1] http://www.paulaoki.com/papers/www5-color.pdf
      [2] http://code.google.com/webstats/index.html
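   The distinction between "invalid" and "non-well-formed" can be seen
   with a quick Python standard-library sketch (the markup and class
   names below are mine, purely for illustration): an XML parser must
   reject a document with unclosed tags outright, while an HTML
   tokenizer consumes the same input without complaint.

```python
import html.parser
import xml.etree.ElementTree as ET

soup = "<p>Unclosed paragraph<br><b>bold text"

# An XML parser rejects this outright: it is not well-formed.
try:
    ET.fromstring(soup)
    well_formed = True
except ET.ParseError:
    well_formed = False

# An HTML tokenizer happily consumes the same input.
class TagCollector(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed(soup)
print(well_formed, collector.tags)  # False ['p', 'br', 'b']
```

   This is exactly why tag soup survives on the Web: nothing in the
   traditional HTML toolchain forces an error to surface.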

HTML 5 goals

   In March 2007, the W3C restarted its work on HTML, building on the
   work done by the WHATWG and its editor, Ian Hickson, to define
   HTML 5. HTML 5 is far more than an evolution of HTML 4.01: it
   includes the DOM, some APIs, and a custom parsing algorithm. For the
   first time, HTML is defined in terms of a DOM, which is how browsers
   interpret the Web. Once this DOM tree has been created, there is a
   choice between two serializations, XML and HTML. The XML
   serialization has to be served as application/xhtml+xml; the HTML
   serialization has to be served as text/html.

   [Figure: HTML 5 serializations]

   [3]HTML 5, one vocabulary, two serializations, W3C Q&A blog, January
   15, 2008

      [3] http://www.w3.org/QA/2008/01/html5-is-html-and-xml.html
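   The difference between the two serializations of one tree can be
   sketched with Python's standard library (this only illustrates the
   surface syntax of void elements; HTML 5 defines its own
   serialization rules in full):

```python
import xml.etree.ElementTree as ET

# One tree, two serializations of the same <br> element.
tree = ET.fromstring("<p>line one<br/>line two</p>")
print(ET.tostring(tree, method="xml"))   # b'<p>line one<br />line two</p>'
print(ET.tostring(tree, method="html"))  # b'<p>line one<br>line two</p>'
```

   Same DOM, two different byte streams: the XML form self-closes the
   void element, the HTML form does not.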

   When reading a document from the Web (likely to be invalid) and
   creating the DOM tree, clients have to recover from syntax errors.
   The [4]HTML 5 parsing algorithm describes precisely how to recover
   from erroneous syntax.

      [4] http://www.w3.org/TR/html5/parsing.html#parsing
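   As a rough illustration only (this is not the real algorithm, and
   every name below is mine), here is a toy Python tree builder
   applying one such recovery rule: an open <p> is implicitly closed
   when a new <p> starts, and anything still open is closed at the end
   of input.

```python
import re

VOID = {"br", "img", "hr", "meta", "link", "input"}

def parse(soup):
    """Build a nested (tag, children) tree from tag soup, auto-closing
    <p> when a new <p> starts -- one recovery rule, as a sketch."""
    root = ("root", [])
    stack = [root]
    pos = 0
    for m in re.finditer(r"<(/?)([a-zA-Z0-9]+)[^>]*>", soup):
        text = soup[pos:m.start()]
        if text.strip():
            stack[-1][1].append(text.strip())
        pos = m.end()
        closing, name = m.group(1), m.group(2).lower()
        if closing:
            # Pop to the matching open tag; ignore stray close tags.
            if any(n == name for n, _ in stack[1:]):
                while stack[-1][0] != name:
                    stack.pop()
                stack.pop()
        else:
            if name == "p" and stack[-1][0] == "p":
                stack.pop()  # recovery: implicit </p>
            node = (name, [])
            stack[-1][1].append(node)
            if name not in VOID:
                stack.append(node)
    tail = soup[pos:]
    if tail.strip():
        stack[-1][1].append(tail.strip())
    return root

tree = parse("<p>one<p>two<b>bold")
print(tree)
# ('root', [('p', ['one']), ('p', ['two', ('b', ['bold'])])])
```

   The real algorithm is far larger (an insertion-mode state machine,
   the adoption agency, foster parenting, ...), but the point is the
   same: every recovery step is fully specified, so all conforming
   parsers build the identical tree from the same soup.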

Cleaning the Web - Implementing HTML 5 parsing in libxml2

   The HTML 5 parsing algorithm is starting to be implemented in some
   clients, and some libraries have been developed. The "[5]How-To for
   HTML 5 parsing" lists ongoing implementations (Python, Java, Ruby).
   Some of them are quite slow.

      [5] http://www.w3.org/QA/2008/07/html5-parsing-howto.html

   The original idea was an Apache module that could clean up the
   content before pushing the page to clients, so that clients would no
   longer have to recover from broken documents and could be more
   effective. At the same time it would be a lot easier to create
   quality-reporting tools for webmasters and/or CMSes, since the error
   analysis would be done on the server. Basically, it raises the
   quality of the content step by step.

   Nick Kew weighed in and proposed that we target [6]libxml2, which
   includes an HTML parser and is already supported by the Apache
   server and many other tools.

      [6] http://xmlsoft.org/html/libxml-HTMLparser.html
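   For comparison, lxml (Python bindings to libxml2, assuming it is
   installed) exposes the recovery that libxml2's current HTML parser
   already performs, with its own heuristics that predate the HTML 5
   algorithm:

```python
# Sketch using lxml, a third-party Python binding that wraps libxml2;
# this shows libxml2's existing (pre-HTML 5) recovery heuristics.
from lxml import etree, html

soup = "<p>one<p>two<b>bold"
doc = html.document_fromstring(soup)
result = etree.tostring(doc)
print(result)
```

   libxml2 already closes the first <p> implicitly and closes <b> at
   end of input; implementing the HTML 5 algorithm would replace these
   heuristics with the fully specified recovery that all conforming
   parsers share.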

   From here, it would be interesting to implement the HTML 5 parsing
   algorithm in libxml2. It would benefit the community at large.

HTML 5 Community

     * [7]HTML 5 specification
          + [8]commit-watchers mailing list
          + [9]interactive Web interface
          + [10]CVS webview
          + [11]Subversion interface
          + [12]Twitter messages (non-editorial changes only)
          + [13]HTML diff with the last version in Subversion
     * IRC channels: #html-wg on W3C, #whatwg on FreeNode (all
       [14]logged)
     * [15]HTML WG Home page
     * [16]Michael(tm) Smith, W3C, co-chair
     * [17]Chris Wilson, Microsoft, co-chair
     * [18]Dan Connolly, W3C, team contact

      [7] http://www.w3.org/TR/html5/
      [8] http://lists.whatwg.org/listinfo.cgi/commit-watchers-whatwg.org
      [9] http://html5.org/tools/web-apps-tracker
     [10] http://dev.w3.org/cvsweb/html5/spec/Overview.html
     [11] http://svn.whatwg.org/
     [12] http://twitter.com/WHATWG
     [13] http://whatwg.org/specs/web-apps/current-work/index-diff
     [14] http://krijnhoetmer.nl/irc-logs/
     [15] http://www.w3.org/html
     [16] http://people.w3.org/mike/
     [17] http://blogs.msdn.com/cwilso/
     [18] http://www.w3.org/People/Connolly/

More references

     * October 1996, [19]An Investigation of Documents from the World
       Wide Web, A. Woodruff, P.M. Aoki, E. Brewer, P. Gauthier and
       L.A. Rowe. "2.6 million HTML documents, over 40% of the
       documents contain at least one error."
     * 4 December 2001, [20]How to cope with incorrect HTML, Dagfinn
       Parnas. "2.4 million URI sample. Only 0.71% of documents were
       valid."

     [19] http://www.paulaoki.com/papers/www5-color.pdf
     [20] http://www.ub.uib.no/elpub/2001/h/413001/Hovedoppgave.pdf


    Created on August 8, 2008 by [21]Karl Dubost
    $Id: libxml.xhtml,v 1.4 2008/08/08 09:16:50 kdubost Exp $

     [21] http://www.w3.org/People/karl/



--
Karl Dubost - W3C
http://www.w3.org/QA/
Be Strict To Be Cool








