Re: [xml] Parsing tag-soup HTML

On Mon, 18 Jun 2007 08:14:01 -0400
Daniel Veillard <veillard redhat com> wrote:

  Out of context. I wonder why you think the reader would be that
much slower. I did only XML tests but the cost was within 20% of the
SAX parsing speed.

Because it lacks a ParseChunk API, which means it can't work with
Apache's pipelined filter architecture.  Unless you've added
such an API since I last looked.

So in terms of a first-iteration draft wishlist, tag-soup mode
  - avoid inserting any implied tags in a SAX parse

  That would be contrary to what Tag Soup actually means for most
people as I pointed out.

OK, consider the example referenced from my blog in my first post,
coming from a Microsoft SharePoint backend, which inserted a bogus
<meta> at the top.

Try running the following through "xmllint --html":

<meta http-equiv="content-type" content="text/html;charset=ascii" />
<html lang="en">
<body><h1>Hello, World</h1></body>

and it becomes:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
<head><meta http-equiv="content-type"
<p> lang="en"&gt;
<h1>Hello, World</h1>

From the point of view of the user, that's worse than the original,
because real-life browsers will render that first bogus paragraph.
It's because of examples like that that I want to make it a
configurable option NOT to insert any inferred tags.

  - treat contents of <script></script> and <style></style> as raw
    CDATA, and don't parse it.

  You need *some* parsing just to detect the end of tag, and now
you're back to the origin, what criteria will you keep

    </SCRIPT >

Case-insensitive "</script" is the token to look for.
Having found it, we then look for ">" preceded by zero or
more whitespace chars.
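
A minimal sketch of that scan in C, for illustration only.
find_script_end is a hypothetical helper, not a libxml2 or Apache
API, and it assumes the script body is already in one contiguous
buffer (a real filter would also have to cope with the token being
split across chunks).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of libxml2.
 * Scans buf[0..len) for a script end tag: "</script" in any letter
 * case, then zero or more whitespace characters, then ">".
 * Returns the offset of the "<" that starts the end tag, or -1 if
 * no end tag is found. On success, *end_len (if non-NULL) receives
 * the length of the whole end tag, including the ">". */
static long find_script_end(const char *buf, size_t len, size_t *end_len)
{
    static const char token[] = "</script";
    const size_t tlen = sizeof(token) - 1;

    assert(buf != NULL);

    /* i + tlen < len leaves room for at least one byte after the token */
    for (size_t i = 0; i + tlen < len; i++) {
        size_t j = 0;

        /* case-insensitive comparison against "</script" */
        while (j < tlen && tolower((unsigned char)buf[i + j]) == token[j])
            j++;
        if (j != tlen)
            continue;

        /* skip optional whitespace between the token and ">" */
        size_t k = i + tlen;
        while (k < len && isspace((unsigned char)buf[k]))
            k++;

        if (k < len && buf[k] == '>') {
            if (end_len != NULL)
                *end_len = k - i + 1;
            return (long)i;
        }
    }
    return -1;
}
```

Called on "var x = 1;</SCRIPT >rest" it returns offset 10 with a tag
length of 10; everything before that offset would be passed through
unparsed.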

Yes, that'll still screw up on document.write('</script>').
Needs more thought.  But at least it will leave things like

Sounds like he's using "tag soup" to mean something that cleans it
up, in the tradition of Tidy or AccessValet.  I'm contemplating the
exact opposite: something that leaves it intact!

  And I think as an API you just can't! You will break apps if you
deliver <em> aaa <b> bbb </em> ccc </b>
 as 2 opening tags and then 2 closing tags, but inverted.

Cases like that don't seem to hit my inbox.  I guess that's because
even frontpage-weenies don't produce code like that (or if they do,
they can see what's wrong for themselves).

Seems what you want is textual transformation only, and in that case
a parser doesn't sound like the best tool to implement this. But
maybe I misunderstand.

Yes, you could be right.  That's the other option.

I already have a simple sed-like filter (mod_line_edit), which
offers a fallback to users with hopelessly broken markup they
can't do anything about.  But that loses the point and the power
of a markup-aware parser generating a stream of events.

Nick Kew

Application Development with Apache - the Apache Modules Book

