Re: [xml] Parsing tag-soup HTML

Nick Kew wrote:
On Sun, 17 Jun 2007 11:42:08 -0400
Daniel Veillard <veillard redhat com> wrote:
The problem really is to try to come back to a set of guarantees and
behavior rules. Reading the slides linked from the end of that page
may help. I'm not sure it's what you want, but since you use the
same name, it should hopefully be close.

Sounds like he's using "tag soup" to mean something that cleans it up,
in the tradition of Tidy or AccessValet.  I'm contemplating the exact
opposite: something that leaves it intact!

I don't think libxml2 is the right place for something that "leaves tag soup
intact". It has an XML tree model, so you can't leave tags unclosed, for example.

I actually think that most use cases want input that is cleaned up and
conforms to some spec when it comes in, rather than writing something back out
that's horribly broken. The current parser tries to deal with broken legacy
HTML and make it usable. It doesn't try to preserve its brokenness.
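
To illustrate why a tree model can't keep tag soup intact, here is a rough
sketch. It is not libxml2 code and not the recovery algorithm the HTML parser
actually uses; it is a toy written against Python's standard library, and the
names SoupToTree, VOID and clean are made up for this example. The point is
only that once the input lives in a tree, the serializer closes every element,
so the unclosed <b> cannot survive a round trip.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag (deliberately incomplete list).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class SoupToTree(HTMLParser):
    """Toy recovery: build an element tree from unbalanced HTML."""

    def __init__(self):
        super().__init__()
        # Assumption: wrap everything in a synthetic <html> root.
        self.root = ET.Element("html")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag,
                             {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, implicitly
        # closing anything left open inside it. Stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        cur = self.stack[-1]
        if len(cur):
            # Text after a child element is stored as that child's tail.
            last = cur[-1]
            last.tail = (last.tail or "") + data
        else:
            cur.text = (cur.text or "") + data


def clean(soup):
    """Parse tag soup and serialize the resulting (well-formed) tree."""
    parser = SoupToTree()
    parser.feed(soup)
    parser.close()  # flush any buffered trailing text
    return ET.tostring(parser.root, encoding="unicode")
```

With this sketch, clean("<p>one<b>two</p>three") gives
<html><p>one<b>two</b></p>three</html>: the missing </b> has been supplied,
and there is no way to ask the tree for the original, broken markup back.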

