Re: [xml] Bug in HTML parser output?



On Fri, 26 Apr 2002, Daniel Veillard wrote:

On Fri, Apr 26, 2002 at 11:16:51AM +0100, Matt Sergeant wrote:
In using libxml2's HTML parser to create valid XML, I noticed a "bug"...

xmllint --html --format http://www.messagelabs.com/VirusEye/ | xmllint -

The second xmllint croaks on the malformed "--->" comment terminator in the HTML.

Is there any way to make this just "work"?

  Hum, right, this looks like a loophole: the HTML parser is deliberately
lenient so it can parse what's found on the net, but it doesn't take
corrective measures to clean up things like malformed HTML comments.

(yeah I know I should get them to fix their nasty HTML too)

  I wonder what's the best approach:
    - fix the HTML importer
    - fix the XML serializer

The second option sounds rather more generic, so I would be tempted to go
that way.

Yeah, unfortunately there's no way in XML to allow through that dash, even
by turning it into two comments. But frankly I wouldn't care if it got
stripped completely ;-)
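
For what it's worth, here is a rough Python sketch of the kind of cleanup a serializer could apply to comment text before writing it out. This is not libxml2 code; the function names are made up for illustration, and collapsing the dashes is just one possible policy (stripping them would do as well). It relies on the XML rule that a comment may not contain "--" and may not end in "-".

```python
import re
import xml.etree.ElementTree as ET


def sanitize_comment_text(text: str) -> str:
    """Return comment text that is legal inside an XML comment.

    Hypothetical helper: collapses runs of hyphens and pads a trailing one.
    """
    # "--" is forbidden inside a comment, so collapse any run of two or
    # more hyphens down to a single hyphen.
    text = re.sub(r"-{2,}", "-", text)
    # A trailing "-" would produce "--->" once the closer is appended,
    # so pad it with a space.
    if text.endswith("-"):
        text += " "
    return text


def make_comment(text: str) -> str:
    """Serialize text as an XML comment after sanitizing it."""
    return "<!--" + sanitize_comment_text(text) + "-->"


if __name__ == "__main__":
    # The body of a comment that was closed with "--->" in the HTML.
    bad = " virus stats -"
    doc = "<root>" + make_comment(bad) + "</root>"
    ET.fromstring(doc)  # raises ParseError if the result is not well-formed
    print(doc)  # <root><!-- virus stats - --></root>
```

The trade-off is that the comment text is altered slightly, which seems acceptable given that the alternative is output no XML parser will read.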

How urgent is this ?

It's not. I just added in an additional "| perl -pe 's/--->/-->/g'" into
the pipeline to fix it for now. Plus I work for MessageLabs, so I can kick
some butt fairly easily ;-)
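
In case anyone wants the same stopgap without perl, a rough Python equivalent of that substitution is sketched below. The function name is made up, and like the one-liner it works on raw text, so it is not comment-aware: it will also rewrite a "--->" that appears outside a comment, and it will mangle a comment made up entirely of dashes.

```python
import re

# Three or more hyphens directly before ">" are treated as a comment
# closer with extra dashes (a slightly broader match than the perl
# one-liner, which only handles exactly "--->").
_EXTRA_DASH_CLOSER = re.compile(r"-{3,}>")


def fix_comment_closers(markup: str) -> str:
    """Rewrite closers such as '--->' to the legal '-->' (text-level only)."""
    return _EXTRA_DASH_CLOSER.sub("-->", markup)


if __name__ == "__main__":
    sample = "<p>hi</p><!-- stats --->"
    print(fix_comment_closers(sample))  # <p>hi</p><!-- stats -->
```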

-- 
<!-- Matt -->
<:->Get a smart net</:->



