Re: [xml] Bug in HTML parser output?
- From: Matt Sergeant <matt sergeant org>
- To: Daniel Veillard <veillard redhat com>
- Cc: "xml gnome org" <xml gnome org>
- Subject: Re: [xml] Bug in HTML parser output?
- Date: Fri, 26 Apr 2002 12:00:43 +0100 (BST)
On Fri, 26 Apr 2002, Daniel Veillard wrote:

> On Fri, Apr 26, 2002 at 11:16:51AM +0100, Matt Sergeant wrote:
> > In using libxml2's HTML parser to create valid XML, I noticed a "bug"...
> >
> >   xmllint --html --format http://www.messagelabs.com/VirusEye/ | xmllint -
> >
> > Croaks on the bad ---> comment in the HTML.
> >
> > Is there any way to make this just "work"?
>
> Hmm, right, this seems to be a loophole: the HTML parser is overly flexible
> so that it can parse what's found on the net, but it doesn't take corrective
> measures to clean up things like HTML comments.
>
> > (yeah I know I should get them to fix their nasty HTML too)
>
> I wonder what's the best approach:
>   - fix the HTML importer
>   - fix the XML serializer
> The second case sounds more generic, so I would be tempted to go that
> way.

Yeah, unfortunately there's no way in XML to allow that dash through, even
by turning it into two comments. But frankly I wouldn't care if it got
stripped completely ;-)

> How urgent is this?

It's not. I just added an additional "| perl -pe 's/--->/-->/g'" to
the pipeline to fix it for now. Plus I work for MessageLabs, so I can kick
some butt fairly easily ;-)
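
For anyone who wants something a little more general than the perl one-liner, here is a rough sketch of the same idea in Python. It is not what libxml2 does or will do, just a text-level filter. The names fix_comments and _clean_body are made up for this example. It assumes the rule from the XML spec that a comment body may not contain "--" and may not end in "-", and it assumes that a regex pass is good enough, which it is not if comment-like text shows up inside script or CDATA sections.

```python
import re

# Matches one comment, up to the first "-->" (non-greedy, may span lines).
_COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def _clean_body(body: str) -> str:
    """Make a comment body legal for XML (hypothetical helper).

    Collapses any run of two or more dashes to a single dash, then drops
    trailing dashes so the body cannot end in "-".
    """
    body = re.sub(r"-{2,}", "-", body)
    return body.rstrip("-")


def fix_comments(markup: str) -> str:
    """Rewrite every comment in `markup` so its body is XML-safe."""
    return _COMMENT.sub(
        lambda m: "<!--" + _clean_body(m.group(1)) + "-->", markup
    )


if __name__ == "__main__":
    import sys

    # Use as a pipeline stage: reads stdin, writes the fixed text to stdout.
    sys.stdout.write(fix_comments(sys.stdin.read()))
```

Saved under some name of your choosing, it could sit in the pipeline where the perl substitution is now, between the two xmllint calls. Unlike the one-liner it also touches "--" in the middle of a comment, which the XML parser would reject as well.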
--
<!-- Matt -->
<:->Get a smart net</:->