Re: [xml] Parsing tag-soup HTML

On Sun, 17 Jun 2007 11:42:08 -0400
Daniel Veillard <veillard redhat com> wrote:

> Coming back with some kind of definition of what a tag soup parser
> behaviour is is probably more important than digging in libxml2 code.

A slightly circular argument in this case.  What I really need to
do is review the case history of what users complain about, and
relate that to how the parser works.  Bear in mind this is a 
streaming SAX parser: other APIs are way too slow and therefore
of no interest in this context.

If I write a new parser from scratch, it'll be a simpleminded thing
based on what bad tag-soup "html" expects:

  <foo ...> generates a startElement event
  </foo> generates an endElement event
  <!-- generates a start-comment which is terminated by -->
  <script> and <style> treat their contents as a black box
  terminated by </script>/</style> and nothing else.
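
To make those four rules concrete, here is a rough sketch of such a
simpleminded tokenizer. It is written in Python purely for brevity and
has nothing to do with libxml2's actual code; the function name
soup_events and the event tuples are made up for illustration.
Attributes are dropped, a ">" inside a quoted attribute value is not
handled, and declarations such as a doctype simply fall through as text.

```python
import re

RAW_TEXT = ("script", "style")            # contents are passed through unparsed
TAG_NAME = re.compile(r"[A-Za-z][^\s/>]*")


def soup_events(html):
    """Yield (kind, value) events for tag-soup input, inferring nothing."""
    pos, n = 0, len(html)
    while pos < n:
        lt = html.find("<", pos)
        if lt == -1:
            yield ("text", html[pos:])
            return
        if lt > pos:
            yield ("text", html[pos:lt])
        if html.startswith("<!--", lt):
            end = html.find("-->", lt + 4)
            if end == -1:
                end = n                   # unterminated comment runs to end of input
            yield ("comment", html[lt + 4:end])
            pos = min(end + 3, n)
            continue
        closing = html.startswith("</", lt)
        m = TAG_NAME.match(html, lt + (2 if closing else 1))
        gt = html.find(">", m.end()) if m else -1
        if gt == -1:
            # Not a recognisable tag: hand the stray "<" back as text.
            yield ("text", "<")
            pos = lt + 1
            continue
        name = m.group().lower()
        # Attributes between the name and ">" are ignored in this sketch.
        yield ("endElement" if closing else "startElement", name)
        pos = gt + 1
        if not closing and name in RAW_TEXT:
            # Black box: only the matching end tag terminates the contents.
            close = re.compile(r"</%s\s*>" % name, re.I).search(html, pos)
            end = close.start() if close else n
            if end > pos:
                yield ("text", html[pos:end])
            pos = end                     # the end tag is handled by the main loop
```

Fed something like '<p>one<p>two<script>if (a<b) x="</p>";</script></td>',
it reports two startElement events for p with no closing tag invented
between them, hands over the script body as a single text event, and
passes the stray </td> through as an ordinary endElement.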

With libxml2 we can add value to that by inserting implied closing
tags.  But in some cases, we need to avoid inserting implied opening
tags.  And we should dispense with some error corrections such as 
rejecting an <html> opening tag after a document has opened.
In fact, I think we need to dispense with generating *any* implied
opening tags when in tag-soup mode.  Which in turn means we can't
imply closing tags, lest they be unmatched!
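
The "lest they be unmatched" point can be checked mechanically. The
sketch below (Python again, with a made-up helper unmatched_ends) walks
a stream of (kind, name) events and collects every endElement that has
no open startElement to close. The two event streams are hypothetical,
written by hand to illustrate the argument; they are not captured
libxml2 output.

```python
def unmatched_ends(events):
    """Return the endElement names that have no startElement still open."""
    open_names = []
    orphans = []
    for kind, name in events:
        if kind == "startElement":
            open_names.append(name)
        elif kind == "endElement":
            if name in open_names:
                # Close the most recent open element of that name,
                # along with anything still open inside it.
                idx = len(open_names) - 1 - open_names[::-1].index(name)
                del open_names[idx:]
            else:
                orphans.append(name)
    return orphans


# Hypothetical stream: implied closing tags kept, implied opening tags dropped.
closers_only = [
    ("startElement", "td"), ("text", "x"), ("endElement", "td"),
    ("endElement", "body"), ("endElement", "html"),
]

# Hypothetical stream: both implied opening and implied closing tags present.
both_implied = [
    ("startElement", "html"), ("startElement", "body"),
    ("startElement", "td"), ("text", "x"), ("endElement", "td"),
    ("endElement", "body"), ("endElement", "html"),
]
```

In the first stream the body and html closers come back as orphans; in
the second nothing does. So dropping only the implied openers leaves the
consumer with end events it never saw opened.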

So in terms of a first-iteration draft wishlist, tag-soup mode should:
  - avoid inserting any implied tags in a SAX parse
  - treat contents of <script></script> and <style></style> as raw
    CDATA, and not parse them.

> which it seems would defeat your first example I guess.
> The problem really is to try to come back to a set of guarantees and
> behavior rules. Reading the slides pointed to from the end of that page
> may help. I'm not sure it's what you want, but since you use the
> same name, it should hopefully be close.

Sounds like he's using "tag soup" to mean something that cleans it up,
in the tradition of Tidy or AccessValet.  I'm contemplating the exact
opposite: something that leaves it intact!

Nick Kew

Application Development with Apache - the Apache Modules Book
