Re: [xml] Parsing tag-soup HTML

From: Daniel Veillard <veillard redhat com>
To: Nick Kew <nick webthing com>
Cc: xml gnome org
Subject: Re: [xml] Parsing tag-soup HTML
Date: Sun, 17 Jun 2007 10:18:29 -0400

On Sun, Jun 17, 2007 at 02:39:57PM +0100, Nick Kew wrote:

I've been using libxml2 for some years to parse both XML and HTML
in the context of Apache filter modules.  All these modules use the
parseChunk API, which is the only reasonable option in the context
of the Apache filter architecture.  My most widely-used libxml2-based
module is mod_proxy_html, which serves to rewrite HTML links in a
reverse proxy.

A FAQ arising in this context is why some pages get mangled.
The straight answer is that they're hopelessly malformed tag-soup,
and HTMLparser is somewhat less forgiving than mainstream browsers.
Common examples include:
  - Documents that start with a <meta ...>, followed by
    <html>(normal contents)</html>
  - <script> sections that are prematurely closed by things
    like document.write("<p>foo</p>");
  - Documents with multiple <html> or multiple <body> tags.
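
As a rough illustration of the first and third cases above, here is a minimal
sketch that feeds a document to a parser in small chunks and merely records
the oddities, without correcting anything. It uses Python's standard
html.parser rather than libxml2's HTMLparser, so it says nothing about how
libxml2 itself behaves. The names SoupProbe and probe_in_chunks, the chunk
size, and the sample document are all hypothetical and chosen only for this
example.

```python
from collections import Counter
from html.parser import HTMLParser


class SoupProbe(HTMLParser):
    """Hypothetical helper: records structural oddities, fixes nothing."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()   # number of <html> / <body> start tags seen
        self.before_html = []     # other start tags seen before the first <html>

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        if tag in ("html", "body"):
            self.counts[tag] += 1
        elif self.counts["html"] == 0:
            self.before_html.append(tag)


def probe_in_chunks(document, chunk_size=7):
    """Feed the document a few characters at a time, as a filter would."""
    probe = SoupProbe()
    for start in range(0, len(document), chunk_size):
        probe.feed(document[start:start + chunk_size])
    probe.close()  # process whatever is still buffered
    return probe


# Made-up sample: a <meta> before <html>, and two <body> tags.
SAMPLE = ('<meta name="generator" content="x">'
          '<html><body><p>one</p></body><body><p>two</p></body></html>')

if __name__ == "__main__":
    result = probe_in_chunks(SAMPLE)
    print("before <html>:", result.before_html)
    print("tag counts:", dict(result.counts))
```

Detecting such cases is the easy part. Deciding what a forgiving parser
should do with them is the open question.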

I have some hacks to error-correct for some of these: for example
as described at
http://bahumbug.wordpress.com/2006/10/12/mod_proxy_html-revisited/
But now I'm looking at providing a systematically more forgiving
parser as an option to my users.  That leaves me two options:
  (1) Write a new tag-soup parser from scratch, and make the
      choice of parser a configuration option for users.
  (2) Work within your existing HTMLparser to make it (optionally)
      more forgiving.
The second is only realistically an option if I can feed back
changes to the libxml2 codebase and not land myself with an
unmaintainable branch.

So, what do you think?  Is this something the libxml2 project
would like to see, or would you prefer to steer well clear?


  I'm not averse to adding a new HTML parsing option for 'tag soup',
but you would have to define clearly what the new parsing strategy is
before I (and others on this list) can say yes or no to that option.
So what would the 'tag soup' parser do that the current HTML parser
does not, and vice versa? If you could define this other than by an
accumulation of specific cases, then that's probably viable. But if
it's just an ever-growing list of individual preferences on a
case-by-case basis, I don't see how we could say yes to your selection
rather than to some other application's own set.
  Makes sense?

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/
