Re: [xml] Parsing tag-soup HTML
- From: Nick Kew <nick webthing com>
- To: xml gnome org
- Subject: Re: [xml] Parsing tag-soup HTML
- Date: Mon, 18 Jun 2007 14:02:39 +0100
On Mon, 18 Jun 2007 08:14:01 -0400
Daniel Veillard <veillard redhat com> wrote:
> Out of context. I wonder why you think the reader would be that
> much slower. I did only XML tests but the cost was within 20% of the
> SAX parsing speed.
Because it lacks a ParseChunk API, which means it can't work with
Apache's pipelined filter architecture. Unless you've added
such an API since I last looked.
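
To make the point about chunked input concrete, here is a minimal sketch of a push-style parse. It uses Python's standard html.parser as a stand-in, not libxml2, and the names EventCollector and parse_in_chunks are invented for the illustration. The parser is handed whatever fragment arrives and emits events as soon as they are complete, which is the property a pipelined filter needs.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records parse events in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, tuple(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Text can arrive split across chunks, so merge adjacent pieces.
        if self.events and self.events[-1][0] == "data":
            self.events[-1] = ("data", self.events[-1][1] + data)
        else:
            self.events.append(("data", data))


def parse_in_chunks(chunks):
    """Feed each chunk as it arrives, the way a filter sees its input."""
    parser = EventCollector()
    for chunk in chunks:
        parser.feed(chunk)  # push interface: partial input is acceptable
    parser.close()          # flush anything still buffered
    return parser.events
```

Feeding a document in arbitrary 7-byte slices should give the same event list as feeding it in one piece, since a tag split across two chunks is held back until the rest arrives.
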
>> So in terms of a first-iteration draft wishlist, tag-soup mode
>> should:
>> - avoid inserting any implied tags in a SAX parse
> That would be contrary to what Tag Soup actually means for most
> people as I pointed out.
OK, consider the example referenced from my blog in my first post,
coming from a Microsoft SharePoint backend, which inserted a bogus
<meta> at the top.
Try running the following through "xmllint --html":
<meta http-equiv="content-type" content="text/html;charset=ascii" />
<html lang="en">
<head><title>foo</title></head>
<body><h1>Hello, World</h1></body>
</html>
and it becomes:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><meta http-equiv="content-type"
content="text/html;charset=ascii"></head>
<body>
<p> lang="en">
</p>
<title>foo</title>
<h1>Hello, World</h1>
</body>
</html>
From the point of view of the user, that's worse than the original,
because real-life browsers will render that first bogus paragraph.
It's because of examples like that that I want to make it a
configurable option NOT to insert any inferred tags.
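
As a rough sketch of the requested behaviour (events reported in source order, nothing inferred), the snippet below runs the same input through Python's standard html.parser, which does not insert implied tags. This is only an analogy and not libxml2 code; StartTagLogger and start_tags_in_source_order are made-up names.

```python
from html.parser import HTMLParser

SAMPLE = """<meta http-equiv="content-type" content="text/html;charset=ascii" />
<html lang="en">
<head><title>foo</title></head>
<body><h1>Hello, World</h1></body>
</html>
"""


class StartTagLogger(HTMLParser):
    """Hypothetical helper: logs start tags exactly as they appear."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Also reached for "<meta ... />", via the default
        # handle_startendtag implementation.
        self.tags.append((tag, dict(attrs)))


def start_tags_in_source_order(markup):
    parser = StartTagLogger()
    parser.feed(markup)
    parser.close()
    return parser.tags
```

With this input the bogus <meta> is reported first and <html lang="en"> second, attributes intact; no <p> is invented, so a filter can pass the markup through as it found it.
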
>> - treat contents of <script></script> and <style></style> as raw
>> CDATA, and don't parse it.
> You need *some* parsing just to detect the end of tag, and now
> you're back to the origin, what criteria will you keep
> </
> </sc
> </script
> </script>
> </SCRIPT
> </ScRIpT
> </SCRIPT >
> ?
Case-insensitive "</script" is the token to look for.
Having found it, we then look for ">" preceded by zero or
more whitespace chars.
Yes, that'll still screw up on document.write('</script>').
Needs more thought. But at least it will leave things like
<script>
document.write('<p>Something</p>');
</script>
intact.
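
The rule above (case-insensitive "</script", then optional whitespace, then ">") can be sketched in a few lines. This is an illustration in Python rather than parser code, and split_script is a hypothetical name.

```python
import re

# Case-insensitive "</script", zero or more whitespace chars, then ">".
SCRIPT_END = re.compile(r"</script\s*>", re.IGNORECASE)


def split_script(text, start=0):
    """Return (raw_body, index_after_end_tag), or None when no complete
    end tag has been seen yet (e.g. more input is still to come)."""
    match = SCRIPT_END.search(text, start)
    if match is None:
        return None
    return text[start:match.start()], match.end()
```

Of the candidates listed, only "</script>" and "</SCRIPT >" terminate the element; the truncated forms return None, which in a chunked parse means "wait for more data". The known weakness shows up too: given document.write('</script>') the body is cut off at the first match, inside the string literal.
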
>> Sounds like he's using "tag soup" to mean something that cleans it
>> up, in the tradition of Tidy or AccessValet. I'm contemplating the
>> exact opposite: something that leaves it intact!
> And I think as an API you just can't! You will break apps if you
> deliver <em> aaa <b> bbb </em> ccc </b>
> as 2 opening tags and then 2 closing tags, but inverted.
Cases like that don't seem to hit my inbox. I guess that's because
even frontpage-weenies don't produce code like that (or if they do,
they can see what's wrong for themselves).
> Seems what you want is textual transformation only, and in that case
> a parser doesn't sound like the best tool to implement this. But
> maybe I misunderstand.
Yes, you could be right. That's the other option.
I already have a simple sed-like filter (mod_line_edit), which
offers a fallback to users with hopelessly broken markup they
can't do anything about. But that loses the point and the power
of a markup-aware parser generating a stream of events.
--
Nick Kew
Application Development with Apache - the Apache Modules Book
http://www.apachetutor.org/