Re: [xml] Bug in HTML parser output?

From: Matt Sergeant <matt sergeant org>
To: Daniel Veillard <veillard redhat com>
Cc: "xml gnome org" <xml gnome org>
Subject: Re: [xml] Bug in HTML parser output?
Date: Fri, 26 Apr 2002 12:00:43 +0100 (BST)

On Fri, 26 Apr 2002, Daniel Veillard wrote:

On Fri, Apr 26, 2002 at 11:16:51AM +0100, Matt Sergeant wrote:

In using libxml2's HTML parser to create valid XML, I noticed a "bug"...

xmllint --html --format http://www.messagelabs.com/VirusEye/ | xmllint -

The second xmllint croaks on the bad ---> comment terminator in the HTML.

Is there any way to make this just "work"?


  Hmm, right, this looks like a loophole: the HTML parser is deliberately
lenient so that it can parse what's found on the net, but it doesn't take
corrective measures to clean up things like HTML comments

(yeah I know I should get them to fix their nasty HTML too)


  I wonder what's the best approach:
    - fix the HTML importer
    - fix the XML serializer

The second option sounds more generic, so I would be tempted to go that
way.


Yeah, unfortunately XML has no way to let that extra dash through, even
by splitting it into two comments. But frankly I wouldn't care if it got
stripped completely ;-)

How urgent is this?


It's not. I just added in an additional "| perl -pe 's/--->/-->/g'" into
the pipeline to fix it for now. Plus I work for MessageLabs, so I can kick
some butt fairly easily ;-)
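
In case anyone wants the same stopgap without perl, a minimal Python
equivalent might look like the sketch below. The function name
fix_comment_terminators is hypothetical, and the pattern is slightly broader
than my one-liner: it treats any run of three or more dashes before ">" as a
broken terminator. Like the perl version it is a blunt text substitution, so
it will also rewrite "--->" outside comments, and it does nothing about "--"
in the middle of a comment.

```python
import re


def fix_comment_terminators(markup: str) -> str:
    """Rewrite '--->' (or longer dash runs before '>') to a plain '-->'."""
    # Blunt textual fix: it does not check that the match is really the
    # end of a comment, exactly like the perl substitution it mirrors.
    return re.sub(r"-{3,}>", "-->", markup)
```

To use it in the pipeline, a small script could call
sys.stdout.write(fix_comment_terminators(sys.stdin.read())) and sit between
the two xmllint invocations.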

-- 
<!-- Matt -->
<:->Get a smart net</:->

