Re: [xml] Parsing tag-soup HTML

From: Stefan Behnel <stefan_ml behnel de>
To: Nick Kew <nick webthing com>
Cc: xml gnome org
Subject: Re: [xml] Parsing tag-soup HTML
Date: Mon, 18 Jun 2007 12:17:43 +0200

Nick Kew wrote:

> On Sun, 17 Jun 2007 11:42:08 -0400
> Daniel Veillard <veillard redhat com> wrote:
>
>> The problem really is to try to come back to a set of guarantees and
>> behavior rules. Reading the slides pointed from the end of that page
>> may help. But I'm not sure it's what you want, but since you use the
>> same name, it should hopefully be close.
>
> Sounds like he's using "tag soup" to mean something that cleans it up,
> in the tradition of Tidy or AccessValet.  I'm contemplating the exact
> opposite: something that leaves it intact!


I don't think libxml2 is the right place for something that "leaves tag soup
intact". It has an XML tree model, so you can't leave tags unclosed, for example.
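To illustrate the tree-model point, here is a rough sketch. It deliberately
uses Python's standard library (html.parser plus xml.etree.ElementTree), not
libxml2 or its HTML parser, and the names SoupToTree, VOID and roundtrip are
made up for this example. The repair rules are far cruder than what a real
HTML parser does, and it assumes the input has a single top-level element.
The only thing it is meant to show is that once tag soup has been put into
a tree, an "unclosed" element has nowhere to live, so it comes back out
closed.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical, incomplete list of elements that never take an end tag.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class SoupToTree(HTMLParser):
    """Feed tag soup into an ElementTree through a TreeBuilder."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of elements still open in the tree

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as empty strings.
        clean = {k: (v if v is not None else "") for k, v in attrs}
        self.builder.start(tag, clean)
        if tag in VOID:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: a tree has no place to record it
        # Close everything up to and including the matching open element.
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # ignore text outside the root element
            self.builder.data(data)

    def finish(self):
        self.close()  # flush any text the parser is still buffering
        while self.open_tags:  # the tree forces every element closed
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def roundtrip(soup):
    """Parse tag soup into a tree and serialize the tree again."""
    parser = SoupToTree()
    parser.feed(soup)
    return ET.tostring(parser.finish(), encoding="unicode")


print(roundtrip("<div><b>bold<i>both</div>"))
# <div><b>bold<i>both</i></b></div>
```

The missing </i> and </b> cannot be preserved, because the tree only records
that the i element is a child of b and b is a child of div. Keeping the
input byte-for-byte intact would need a different data model, for example a
flat token stream.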

I actually think that most use cases want something that's cleaned up and
conforms to some spec when it comes in rather than to write something back out
that's horribly broken. The current parser tries to deal with broken legacy
HTML code and makes it usable. It doesn't try to preserve its brokenness.

Stefan

