Re: [xml] Parsing tag-soup HTML

From: Daniel Veillard <veillard redhat com>
To: Nick Kew <nick webthing com>
Cc: xml gnome org
Subject: Re: [xml] Parsing tag-soup HTML
Date: Sun, 17 Jun 2007 11:42:08 -0400

On Sun, Jun 17, 2007 at 03:52:28PM +0100, Nick Kew wrote:

> On Sun, 17 Jun 2007 10:18:29 -0400
> Daniel Veillard <veillard redhat com> wrote:
>
> > > So, what do you think?  Is this something the libxml2 project
> > > would like to see, or would you prefer to steer well clear?
> >
> >   I'm not averse to adding a new HTML parsing option for 'tag soup'
> > but you would have to define clearly what the new parsing strategy is
> > before I (and others on this list) can say yes or no to that option.
> > So what would the 'tag soup' parser do that the current HTML parser
> > does not, and vice versa? If you could define this other than by an
> > accumulation of specific cases then that's probably viable, but if
> > it's just an ever-growing list of individual preferences on a
> > case-by-case basis, it doesn't sound okay to say yes to your selection
> > rather than some other application's own set.
> >   Makes sense?
>
> Thanks for the quick response.
>
> Yes, of course I didn't expect a straight "yes" to such a vague
> proposal.  My question concerned whether I should invest the time
> and effort to determine the details of how this should look in the
> context of HTMLparser.
>
> I'll take your reply as a yes in principle, and dive into the code
> to think it through a little more.  If it looks promising, I'll
> come back to you with more concrete proposals.


  Coming back with some kind of definition of what tag soup parser
behaviour is would probably be more important than digging into the
libxml2 code. I am not sure we can emulate the behaviour of web browser
parsers. There is John Cowan's TagSoup, which is probably what most
people will think of in terms of implementation:

  http://ccil.org/~cowan/XML/tagsoup/

  "It does guarantee well-structured results: tags will wind up properly
   nested, default attributes will appear appropriately, and so on"

but also

  "For example, if the first tag is LI, it will supply the application
   with enclosing HTML, BODY, and UL tags."

which, I guess, would defeat your first example.
The real problem is to come up with a set of guarantees and
behaviour rules. Reading the slides linked from the end of that page
may help. I'm not sure it's what you want, but since you use the
same name, it should hopefully be close.
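
To make "a set of guarantees and behaviour rules" a bit more concrete,
here is a rough sketch of how the second TagSoup quote above could be
written as one general rule driven by a table, rather than as a list of
special cases. It is only an illustration, not TagSoup's or libxml2's
code: the names REQUIRED_ANCESTORS, EnclosingTagSupplier and parse_events
are hypothetical, the table contents are an assumption, and Python's
standard html.parser is used purely as a tokenizer.

```python
from html.parser import HTMLParser

# Hypothetical rule table (not taken from TagSoup or libxml2): for each
# element, the ancestors that must be open before it may start.
REQUIRED_ANCESTORS = {
    "body": ["html"],
    "ul": ["html", "body"],
    "li": ["html", "body", "ul"],
}


class EnclosingTagSupplier(HTMLParser):
    """Report start/end events, opening any missing required ancestors."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of currently open tag names
        self.events = []         # (kind, tag_or_text) tuples, in order

    def _start(self, tag):
        self.open_elements.append(tag)
        self.events.append(("start", tag))

    def handle_starttag(self, tag, attrs):
        # Rule: supply every required ancestor that is not already open.
        for ancestor in REQUIRED_ANCESTORS.get(tag, []):
            if ancestor not in self.open_elements:
                self._start(ancestor)
        self._start(tag)

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))

    def handle_endtag(self, tag):
        # Rule: an end tag closes everything opened since its start tag;
        # a stray end tag with no matching start tag is ignored.
        if tag not in self.open_elements:
            return
        while self.open_elements:
            closed = self.open_elements.pop()
            self.events.append(("end", closed))
            if closed == tag:
                break

    def close(self):
        super().close()
        # Guarantee: whatever is still open at end of input gets closed.
        while self.open_elements:
            self.events.append(("end", self.open_elements.pop()))


def parse_events(markup):
    """Return the list of events produced for a markup string."""
    parser = EnclosingTagSupplier()
    parser.feed(markup)
    parser.close()
    return parser.events
```

Feeding it the bare fragment "<li>one" reports start events for html,
body, ul and li before the text, and matching end events afterwards. An
application that wants the fragment left alone would need a different
rule, which is why the rules have to be agreed on first.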

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/
