Re: [xml] Parsing tag-soup HTML

From: Stefan Behnel <stefan_ml behnel de>
To: Nick Kew <nick webthing com>
Cc: xml gnome org
Subject: Re: [xml] Parsing tag-soup HTML
Date: Mon, 18 Jun 2007 15:31:27 +0200



Nick Kew wrote:

Stefan Behnel <stefan_ml behnel de> wrote:

Nick Kew wrote:

On Mon, 18 Jun 2007 08:14:01 -0400
Try running the following through "xmllint --html":

<meta http-equiv="content-type" content="text/html;charset=ascii" />
<html lang="en">
<head><title>foo</title></head>
<body><h1>Hello, World</h1></body>
</html>

In that case I would actually prefer a general special-case rule in the
current parser that interprets a leading <meta> tag as an encoding hint.
That would add a sizable portion of real-world non-HTML to the set of
parsable (i.e. fixable) documents.

[...]

I'm trying to get away from ad-hoc fixes!


I don't consider that an ad-hoc fix. It's just special-casing a specific type
of broken HTML that exists in real life. I wouldn't even mind if the <meta>
tag were discarded; it should just

a) be interpreted as an encoding hint
and
b) not change the remaining 'real' markup.

I think such a rule should go into the mainstream parser.
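
To make the two requirements concrete, here is a minimal sketch of such a rule
as a pre-processing step in Python. It is not libxml2 code and not how any
existing parser behaves. The helper name split_leading_meta and the regular
expressions are assumptions made up for illustration. It only looks at a <meta>
tag that is the very first thing in the input, reads a charset from it, and
hands back the rest of the bytes unmodified.

```python
import re

# Hypothetical helper for illustration only; not part of libxml2 or lxml.
# Matches a <meta ...> tag only when it is the first markup in the input.
_LEADING_META = re.compile(rb'^\s*<meta\b[^>]*>', re.IGNORECASE)
# Pulls the charset name out of the matched tag, quoted or unquoted.
_CHARSET = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE)


def split_leading_meta(data: bytes):
    """Return (encoding_hint, remaining_markup).

    encoding_hint is None, and the input is returned unchanged, when the
    document does not start with a <meta> tag that names a charset.
    """
    match = _LEADING_META.match(data)
    if match is None:
        return None, data
    charset = _CHARSET.search(match.group(0))
    if charset is None:
        return None, data
    # (a) treat the leading <meta> as an encoding hint
    hint = charset.group(1).decode("ascii").lower()
    # (b) discard only that tag; the 'real' markup after it is not touched
    return hint, data[match.end():]


if __name__ == "__main__":
    doc = (
        b'<meta http-equiv="content-type" content="text/html;charset=ascii" />\n'
        b'<html lang="en">\n'
        b'<head><title>foo</title></head>\n'
        b'<body><h1>Hello, World</h1></body>\n'
        b'</html>\n'
    )
    hint, rest = split_leading_meta(doc)
    print(hint)
    print(rest.decode("ascii"))
```

The returned hint could then be passed to whatever parser is in use as its
input encoding, and the remaining bytes parsed as usual. Whether the tag
should be dropped or kept in the tree is a separate decision; this sketch
drops it.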

Stefan
