Re: [xml] Cleaning the Web - Implementing HTML 5 parsing in libxml2
- From: Karl Dubost <karl w3 org>
- To: Stefan Behnel <stefan_ml behnel de>
- Cc: xml gnome org, "Michael (tm) Smith" <mike w3 org>, Nick Kew <nick webthing com>
- Subject: Re: [xml] Cleaning the Web - Implementing HTML 5 parsing in libxml2
- Date: Mon, 18 Aug 2008 13:15:12 +0900
Replying to multiple emails; back from my break.
On 8 August 2008 at 17:25, Stefan Behnel wrote:
There's already an HTML5 implementation in Python (html5lib) which you
can use together with lxml (so you can benefit from both HTML5 *and*
libxml2 already).
Yes I know, I gave a pointer in the document. The issue is that it is
slow.
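For readers who have not tried that combination, a minimal sketch
(assuming the html5lib and lxml packages; the markup is made up for the
example):

  import html5lib
  from lxml import etree

  soup = "<title>Tag soup</title><p>one<p>two"   # deliberately broken

  # html5lib applies the HTML 5 parsing algorithm and can build an lxml
  # tree, so everything after this line is ordinary lxml/libxml2 machinery.
  doc = html5lib.parse(soup, treebuilder="lxml")
  print(etree.tostring(doc, pretty_print=True).decode())

The slowness is mostly in the pure-Python parsing step, not on the lxml
side.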
IMHO, it's better to stick with higher level implementations during the
specification phase, and to push the work on an optimised, low-level C
implementation back until the target is a bit more focussed. But then,
maybe that's just me...
Mike Smith, co-chair of the HTML WG, should be able to give hints about
the stability of the parsing section.
On 8 August 2008 at 18:33, Daniel Veillard wrote:
Well as long as any technical argument is kept on the list that's
fine.
agreed.
My main concern is that HTML5 is a working draft. I can't tell just from
the draft (or rather http://www.w3.org/TR/2008/WD-html5-diff-20080610/)
if people globally agree on the parsing processing or if changes are
likely in the future.
see above.
On 9 August 2008 at 00:51, Chris Wilson wrote:
You're serving it as XHTML.
Yes, it is an XHTML 1.1 document served as application/xhtml+xml. I put
the text copy below.
On 9 August 2008 at 17:56, Michael Day wrote:
In summary: it would be great if libxml2 was also an HTML5 parser!
Is anyone available to implement it? :)
That's the core issue. I wonder if Nick Kew is reading here?
Thanks everyone.
Clean the Web with libxml2
Introduction
The Web (of HTML/XHTML documents) is largely made of tag soup:
invalid and non-well-formed syntax.
In 1996 at WWW5, the paper "[1]An Investigation of Documents from the
World Wide Web" reported on 2.6 million HTML documents collected by the
Inktomi Web Crawler. The authors found that "over 40% of the documents
in our study contain at least one error". Since then there have been a
number of surveys; [2]The Web Authoring Statistics by Ian Hickson at
Google is one of the most recent. According to these surveys, 90% to
95% of the Web is invalid and/or not well-formed.
[1] http://www.paulaoki.com/papers/www5-color.pdf
[2] http://code.google.com/webstats/index.html
HTML 5 goals
In March 2007, the W3C restarted the work on HTML, building on the work
done by the WHATWG and its editor, Ian Hickson, to define HTML 5.
HTML 5 is far more than an evolution of HTML 4.01. It includes the
DOM, some APIs and its own parsing algorithm. For the first time,
HTML is defined in terms of a DOM, which is the way browsers
interpret the Web. Once this DOM tree has been created, there is a
choice between two serializations, XML and HTML. The XML
serialization has to be served as application/xhtml+xml; the HTML
serialization has to be served as text/html.
[Figure: HTML 5 serializations]
[3]HTML 5, one vocabulary, two serializations, W3C Q&A blog, January
15, 2008
[3] http://www.w3.org/QA/2008/01/html5-is-html-and-xml.html
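A rough sketch of the choice, using lxml only as an illustration (the
serializations defined by HTML 5 carry more rules than a generic
serializer applies):

  from lxml import etree

  root = etree.Element("html")
  body = etree.SubElement(root, "body")
  etree.SubElement(body, "br")                   # a void element

  # One tree, two outputs: the html serialization leaves <br> unclosed,
  # the xml serialization must be well-formed and writes <br/>.
  print(etree.tostring(root, method="html").decode())
  print(etree.tostring(root, method="xml").decode())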
When reading a document from the Web (likely to be invalid) and
creating the DOM tree, clients have to recover from syntax errors.
The [4]HTML 5 parsing algorithm describes precisely how to recover
from erroneous syntax.
[4] http://www.w3.org/TR/html5/parsing.html#parsing
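As a small illustration of what "recover" means, here is html5lib
(which implements the algorithm) on some invented broken markup:

  import html5lib

  broken = "No html, no head, no body...<p>first<p>second"
  doc = html5lib.parse(broken)         # default tree builder: xml.etree

  def show(element, depth=0):
      # Print the recovered tree structure, one tag per line.
      print("  " * depth + element.tag.split("}")[-1])
      for child in element:
          show(child, depth + 1)

  # Prints html, then the implied head and body, then the two <p> whose
  # missing end tags were recovered.
  show(doc)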
Cleaning the Web - Implementing HTML 5 parsing in libxml2
The HTML 5 parsing algorithm is starting to be implemented in some
clients, and some libraries have been developed. The "[5]How-To for
HTML 5 parsing" lists ongoing implementations (Python, Java, Ruby).
Some of them are quite slow.
[5] http://www.w3.org/QA/2008/07/html5-parsing-howto.html
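To get a feeling for the gap, a quick (and crude) timing sketch; the
numbers vary a lot with the document, and sample.html is just a
placeholder for any local page:

  import timeit

  setup = "import html5lib, lxml.html; data = open('sample.html', 'rb').read()"

  # html5lib: pure Python, follows the HTML 5 parsing algorithm.
  print(timeit.timeit("html5lib.parse(data)", setup=setup, number=10))
  # libxml2's current HTML parser, reached through lxml.html.
  print(timeit.timeit("lxml.html.fromstring(data)", setup=setup, number=10))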
The original idea was to have an Apache module that could clean up
the content before pushing the page to clients, so that clients which
do not handle recovery from broken documents could be more effective.
At the same time it would be a lot easier to create quality reporting
tools for webmasters and/or CMSes, with the error analysis being done
on the server. Basically, it raises the quality of the content step by
step.
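Leaving Apache aside for a moment, the cleaning step itself could be as
small as the sketch below (html5lib for parsing, lxml for
re-serializing; a real server filter would also have to deal with
streaming, encodings and caching):

  import html5lib
  from lxml import etree

  def clean(tag_soup):
      # Parse with the HTML 5 algorithm, then hand back well-formed
      # markup so that clients never have to do error recovery themselves.
      tree = html5lib.parse(tag_soup, treebuilder="lxml")
      return etree.tostring(tree, method="xml")

  print(clean("<p>one<p>two").decode())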
Nick Kew weighed in and proposed that we target [6]libxml2, which
includes an HTML parser and is already used by the Apache server and
many other tools.
[6] http://xmlsoft.org/html/libxml-HTMLparser.html
From here it would be interesting to implement the HTML 5 parsing
algorithm in libxml2. It would benefit the community at large.
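For comparison, this is roughly what libxml2's current HTML parser
already does today, reached here through lxml; it recovers from broken
markup, but with its own heuristics rather than the tree construction
rules of HTML 5:

  from lxml import etree

  parser = etree.HTMLParser(recover=True)        # libxml2's HTML parser
  doc = etree.fromstring("No html, no head...<p>first<p>second", parser)
  print(etree.tostring(doc, pretty_print=True).decode())

  # Implementing the HTML 5 algorithm in libxml2 would make this tree
  # match, node for node, the one the specification defines.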
HTML 5 Community
* [7]HTML 5 specification
+ [8]commit-watchers mailing list
+ [9]interactive Web interface
+ [10]CVS webview
+ [11]Subversion interface
+ [12]Twitter messages (non-editorial changes only)
+ [13]HTML diff with the last version in Subversion
* IRC channels: #html-wg on W3C, #whatwg on FreeNode (all
[14]logged)
* [15]HTML WG Home page
* [16]Michael(tm) Smith, W3C, co-chair
* [17]Chris Wilson, Microsoft, co-chair
* [18]Dan Connolly, W3C, team contact
[7] http://www.w3.org/TR/html5/
[8] http://lists.whatwg.org/listinfo.cgi/commit-watchers-whatwg.org
[9] http://html5.org/tools/web-apps-tracker
[10] http://dev.w3.org/cvsweb/html5/spec/Overview.html
[11] http://svn.whatwg.org/
[12] http://twitter.com/WHATWG
[13] http://whatwg.org/specs/web-apps/current-work/index-diff
[14] http://krijnhoetmer.nl/irc-logs/
[15] http://www.w3.org/html
[16] http://people.w3.org/mike/
[17] http://blogs.msdn.com/cwilso/
[18] http://www.w3.org/People/Connolly/
More references
* October 1996, [19]An Investigation of Documents from the World
Wide Web, A. Woodruff, P.M. Aoki, E. Brewer, P. Gauthier and
L.A. Rowe. "2.6 million HTML documents, over 40% of the
documents contain at least one error."
* 4th December 2001, [20]How to cope with incorrect HTML, Dagfinn
Parnas. "2.4 million URIs sample; only 0.71% of documents were
valid."
[19] http://www.paulaoki.com/papers/www5-color.pdf
[20] http://www.ub.uib.no/elpub/2001/h/413001/Hovedoppgave.pdf
Created on August 8, 2008 by [21]Karl Dubost
$Id: libxml.xhtml,v 1.4 2008/08/08 09:16:50 kdubost Exp $
[21] http://www.w3.org/People/karl/
--
Karl Dubost - W3C
http://www.w3.org/QA/
Be Strict To Be Cool