Re: [xml] Cleaning the Web - Implementing HTML 5 parsing in libxml2

Karl Dubost <karl <at>> writes:
I have written a short document to explain the project [Cleaning the  
It describes what is html5 and what would be the benefits of  
implementing the html 5 parsing algorithm in libxml2 html parser.

There's already an HTML5 implementation in Python (html5lib) which you can use 
together with lxml (so you can benefit from both HTML5 *and* libxml2 already). 
IIRC, there was also a push towards a C implementation, but I'm not sure that 
really led anywhere. What's in SVN doesn't look very complete:

IMHO, it's better to stick with higher level implementations during the 
specification phase, and to push the work on an optimised, low-level C 
implementation back until the target is a bit more focussed. But then, maybe 
that's just me...

I didn't read your proposal, so I'll just assume you meant to extend the 
existing HTML parser instead of writing a new one. That sounds more 
promising than starting from scratch.
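
For reference, here is a minimal sketch of the html5lib + lxml combination 
mentioned above. It assumes both packages are installed and uses the 
html5lib.parse() entry point of current html5lib releases (the API at the time 
of this thread may have differed). The helper name parse_html5 and the sample 
markup are made up for illustration.

```python
import html5lib  # third-party; the "lxml" tree builder also needs lxml installed


def parse_html5(markup):
    """Parse markup with html5lib's HTML5 algorithm into an lxml ElementTree.

    parse_html5 is a hypothetical helper name, not part of either library.
    """
    # treebuilder="lxml" makes html5lib build an lxml tree instead of its
    # default xml.etree one; namespaceHTMLElements=False keeps tag names
    # plain ("p" rather than "{http://www.w3.org/1999/xhtml}p").
    return html5lib.parse(markup, treebuilder="lxml", namespaceHTMLElements=False)


# Deliberately sloppy input: no <html>/<body>, unclosed <p> and <b>.
broken = "<p>unclosed paragraph<p>second <b>bold"
tree = parse_html5(broken)

# The result is a regular lxml tree, so XPath and friends are available.
paragraphs = tree.xpath("//p")
for p in paragraphs:
    print(p.tag, repr(p.text))
```

The HTML5 algorithm supplies the missing html/head/body elements and closes the 
first paragraph when the second one starts, so both paragraphs end up as 
siblings in the tree.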

