Re: [xml] Parsing tag-soup HTML



On Mon, Jun 18, 2007 at 11:07:42AM +0100, Nick Kew wrote:
On Sun, 17 Jun 2007 11:42:08 -0400
Daniel Veillard <veillard redhat com> wrote:

 Coming back with some kind of definition of what tag-soup parser
behaviour is would probably be more important than digging in libxml2 code.

A slightly circular argument in this case.  What I really need to
do is review the case history of what users complain about, and
relate that to how the parser works.  Bear in mind this is a 
streaming SAX parser: other APIs are way too slow and therefore
of no interest in this context.

  Out of context. I wonder why you think the reader would be that
much slower. I only ran XML tests, but the cost was within 20% of the
SAX parsing speed.

If I write a new parser from scratch, it'll be a simpleminded thing
based on what bad tag-soup "html" expects:

  <foo ...> generates a startElement event
  </foo> generates an endElement event
  <!-- generates a start-comment which is terminated by -->
  <script> and <style> treat their contents as a black-box
  terminated by </script>/</style> and nothing else.
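
For illustration only, the four rules above can be sketched in a few lines
of Python. This is not libxml2 code and not Nick's parser; the names
soup_events and RAW_TEXT are made up here, attribute values containing '>'
are not handled, and the end-of-script rule (case-insensitive name, optional
whitespace before '>') is an assumption.

```python
import re

RAW_TEXT = ("script", "style")  # hypothetical: elements treated as a black box

def soup_events(text):
    """Yield (kind, value) events; never inserts or corrects any tag."""
    pos, n = 0, len(text)
    while pos < n:
        lt = text.find("<", pos)
        if lt == -1:                      # no more markup: rest is text
            yield ("text", text[pos:])
            return
        if lt > pos:
            yield ("text", text[pos:lt])
        if text.startswith("<!--", lt):   # comment runs until the next -->
            end = text.find("-->", lt + 4)
            if end == -1:                 # unterminated: swallow the rest
                yield ("comment", text[lt + 4:])
                return
            yield ("comment", text[lt + 4:end])
            pos = end + 3
            continue
        gt = text.find(">", lt)
        if gt == -1:                      # stray '<' with no '>': plain text
            yield ("text", text[lt:])
            return
        body = text[lt + 1:gt].strip()
        pos = gt + 1
        if body.startswith("/"):
            yield ("end", body[1:].strip().lower())
            continue
        name = body.split(None, 1)[0].rstrip("/").lower() if body else ""
        if not name:                      # "<>" or "</>"-like junk: plain text
            yield ("text", text[lt:gt + 1])
            continue
        yield ("start", name)
        if name in RAW_TEXT:              # skip to the matching end tag only
            close = re.compile(r"</%s\s*>" % name, re.IGNORECASE).search(text, pos)
            stop = close.start() if close else n
            if stop > pos:
                yield ("text", text[pos:stop])
            if close:
                yield ("end", name)
            pos = close.end() if close else n

events = list(soup_events('<p>hi<script>if (a<b) x="</p>";</SCRIPT ><!-- c --></p>'))
```

Note that the "</p>" inside the script is passed through as text, and that
nothing is implied or reordered.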

With libxml2 we can add value to that by inserting implied closing
tags.  But in some cases, we need to avoid inserting implied opening
tags.  And we should dispense with some error corrections such as 
rejecting an <html> opening tag after a document has opened.
In fact, I think we need to dispense with generating *any* implied
opening tags when in tag-soup mode.  Which in turn means we can't
imply closing tags, lest they be unmatched!

So in terms of a first-iteration draft wishlist, tag-soup mode should:
  - avoid inserting any implied tags in a SAX parse

  That would be contrary to what Tag Soup actually means for most people
as I pointed out.

  - treat contents of <script></script> and <style></style> as raw
    CDATA, and don't parse it.

  You need *some* parsing just to detect the end tag, and now you're
back to the original question: which criteria will you keep

    </
    </sc
    </script
    </script>
    </SCRIPT
    </ScRIpT
    </SCRIPT >
 
 ?
    
which, it seems, would defeat your first example, I guess.
The real problem is to come back to a set of guarantees and
behavior rules. Reading the slides linked from the end of that page
may help. I'm not sure it's what you want, but since you use the
same name, it should hopefully be close.
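
To make the end-tag question concrete, here is a small Python sketch of one
possible rule, checked against the seven candidates listed above. The rule
itself (the full name "script" in any case, optional whitespace, then ">")
is only an assumption picked for illustration; SCRIPT_END and closes_script
are hypothetical names, and this is not what libxml2 does.

```python
import re

# Assumed rule: "</script" in any letter case, optional whitespace, then ">".
SCRIPT_END = re.compile(r"</script\s*>", re.IGNORECASE)

CANDIDATES = [
    "</",
    "</sc",
    "</script",
    "</script>",
    "</SCRIPT",
    "</ScRIpT",
    "</SCRIPT >",
]

def closes_script(fragment):
    """True if the whole fragment counts as a script end tag under the rule."""
    return SCRIPT_END.fullmatch(fragment) is not None

accepted = [c for c in CANDIDATES if closes_script(c)]
print(accepted)
```

Under this rule only "</script>" and "</SCRIPT >" end the element. A looser
rule (say, any "</" at all) would accept more of the list, and each choice
changes which scripts get cut short.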

Sounds like he's using "tag soup" to mean something that cleans it up,
in the tradition of Tidy or AccessValet.  I'm contemplating the exact
opposite: something that leaves it intact!

  And I think as an API you just can't! You will break apps if you deliver
    <em> aaa <b> bbb </em> ccc </b>
 as two opening tags followed by two closing tags in the wrong order.
It seems what you want is textual transformation only, and in that case a parser
doesn't sound like the best tool to implement this. But maybe I misunderstand.
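
The breakage can be seen with any event API that passes tags through
untouched. The sketch below uses Python's standard html.parser, which does no
nesting repair, purely as a stand-in for the proposed tag-soup mode; EventLog
and misnested are hypothetical helper names, and the stack check is the kind
of assumption a typical SAX consumer makes.

```python
from html.parser import HTMLParser

class EventLog(HTMLParser):
    """Record start/end tag events exactly as the tokenizer reports them."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

def misnested(events):
    """Return end tags that do not close the innermost open element."""
    stack, bad = [], []
    for kind, tag in events:
        if kind == "start":
            stack.append(tag)
        elif stack and stack[-1] == tag:
            stack.pop()
        else:
            bad.append(tag)
    return bad

log = EventLog()
log.feed("<em> aaa <b> bbb </em> ccc </b>")
log.close()
print(log.events)
print(misnested(log.events))
```

The consumer receives start em, start b, end em, end b, so an application
that keeps a stack of open elements sees an end tag for "em" while "b" is
still the innermost element.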

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/


