Re: [xml] xmlReader and HTML

On Fri, Jun 10, 2011 at 08:57:47PM +0200, Joachim Zobel wrote:
On Fri, 2011-06-10 at 22:29 +0800, Daniel Veillard wrote:
  It doesn't, right now the reader is always operating on top of an
XML parser, not an HTML one, hence your result.

Why are there different parsers for DOM/SAX and xmlReader? It should be
possible to build xmlReader on top of SAX, or am I missing something?

  The reader uses a mixture of the tree and SAX APIs to do its work.
It's a bit complex in coding but stable in practice.

  Except modifying it to allow HTML parsing 

It would be a bit tedious to maintain 2 HTML parsers. HTML is a moving

  Well, I don't remember anybody asking for the reader for HTML. I think
the main point is that HTML documents are usually not that large, so
building the full tree (instead of a sliding window of the tree as in
the reader) is acceptable. I think you're the first one who really wants
to unify the two, and IMHO, as a temporary measure, running the reader
on top of a full HTML tree should be okay unless you have memory pressure
or huge HTML documents.

I can offer a bit of help with modifying the parsing, but I wouldn't
dare to touch the core unless there are good tests.

  The place where the distinction should be made is where
xmlTextReaderSetup() calls xmlCreatePushParserCtxt(): either we duplicate
all the entry points for creating XML readers to provide HTML readers,
or we add a new XML parser option to tell it to parse as HTML.
Unfortunately, I'm afraid the first one is the best long term, even though
it leads to some code duplication.
  W.r.t. good tests, there are none since the functionality doesn't
exist yet, but one good way to test is to make sure that the reader for
HTML and walking from an xmlReaderForDoc created by parsing the HTML
document lead to identical results. Duplicating the code for the API
entry points also nearly guarantees that the stability of the XML reader
entry points wouldn't be affected.

  It should not be too hard, a bit tedious that's all :-)


Daniel Veillard      | libxml Gnome XML XSLT toolkit
daniel veillard com  | Rpmfind RPM search engine | virtualization library
