[xml] Parsing tag-soup HTML

From: Nick Kew <nick webthing com>
To: xml gnome org
Subject: [xml] Parsing tag-soup HTML
Date: Sun, 17 Jun 2007 14:39:57 +0100

I've been using libxml2 for some years to parse both XML and HTML
in the context of Apache filter modules.  All these modules use the
parseChunk API, which is the only reasonable option in the context
of the Apache filter architecture.  My most widely-used libxml2-based
module is mod_proxy_html, which serves to rewrite HTML links in a
reverse proxy.

A frequent question from users of these modules is why some pages
come out mangled.  The straight answer is that those pages are
hopelessly malformed tag-soup, and libxml2's HTMLparser is somewhat
less forgiving than mainstream browsers.  Common examples include:
  - Documents that start with a <meta ...>, followed by
    <html>(normal contents)</html>
  - <script> sections that are prematurely closed by things
    like document.write("<p>foo</p>");
  - Documents with multiple <html> or multiple <body> tags.
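
To make the <script> case concrete, here is a minimal sketch in C of one
more forgiving rule: treat everything after the <script ...> start tag as
opaque text until a literal "</script" appears, so that markup inside a
string such as document.write("<p>foo</p>") cannot end the element.  This
is an assumption about what a forgiving parser might do, loosely modelled
on browser behaviour.  It is not libxml2 code, and the names
find_script_end and ci_prefix are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of the first n bytes of a against b.
 * Returns 1 on a match, 0 otherwise (including when a ends early).
 * b must be at least n bytes long. */
static int ci_prefix(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] == '\0' ||
            tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    }
    return 1;
}

/* Hypothetical helper: given the NUL-terminated text that follows a
 * <script ...> start tag, return the byte offset of the "</script"
 * that ends the element, or -1 if there is none.  Any other markup,
 * such as "</p>" inside a string literal, is skipped. */
long find_script_end(const char *body)
{
    static const char close_tag[] = "</script";
    const size_t len = sizeof close_tag - 1;

    for (const char *p = body; (p = strchr(p, '<')) != NULL; p++) {
        if (ci_prefix(p, close_tag, len)) {
            /* Require a delimiter so that "</scripts>" does not match. */
            char next = p[len];
            if (next == '>' || next == '/' || isspace((unsigned char)next))
                return (long)(p - body);
        }
    }
    return -1;
}
```

The sketch works on a complete buffer.  In a chunked (parseChunk-style)
setting the caller would also have to cope with "</script" being split
across two chunks, which is not handled here.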

I have some hacks to correct for some of these errors, for example
as described at
http://bahumbug.wordpress.com/2006/10/12/mod_proxy_html-revisited/
But now I'm looking at offering my users a systematically more
forgiving parser.  That leaves me two options:
  (1) Write a new tag-soup parser from scratch, and make the
      choice of parser a configuration option for users.
  (2) Work within your existing HTMLparser to make it (optionally)
      more forgiving.
The second is realistic only if I can feed my changes back into
the libxml2 codebase, rather than landing myself with an
unmaintainable branch.

So, what do you think?  Is this something the libxml2 project
would like to see, or would you prefer to steer well clear?

-- 
Nick Kew

Application Development with Apache - the Apache Modules Book
http://www.apachetutor.org/

Follow-Ups:
- Re: [xml] Parsing tag-soup HTML
  - From: Daniel Veillard
