[xml] Parsing tag-soup HTML
- From: Nick Kew <nick webthing com>
- To: xml gnome org
- Subject: [xml] Parsing tag-soup HTML
- Date: Sun, 17 Jun 2007 14:39:57 +0100
I've been using libxml2 for some years to parse both XML and HTML
in the context of Apache filter modules. All these modules use the
parseChunk API, which is the only reasonable option in the context
of the Apache filter architecture. My most widely-used libxml2-based
module is mod_proxy_html, which serves to rewrite HTML links in a
A FAQ arising in this context is why some pages get mangled.
The straight answer is that they're hopelessly malformed tag-soup,
and HTMLparser is somewhat less forgiving than mainstream browsers.
Common examples include:
- Documents that start with a <meta ...>, followed by
- <script> sections that are prematurely closed by things
- Documents with multiple <html> or multiple <body> tags.
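To make the last case concrete, here is a minimal sketch of how a filter might detect repeated <html> or <body> tags in a buffer before deciding whether to apply a workaround. The helper name count_open_tags is hypothetical; it is not part of libxml2 or mod_proxy_html. The sketch also assumes the whole buffer is available at once, so it does not handle a tag split across chunk boundaries, and it does not skip comments or <script> contents.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libxml2 or mod_proxy_html:
 * counts opening tags called `name` in the NUL-terminated buffer
 * `buf`, ignoring case.  Closing tags such as </html> are not counted. */
static size_t count_open_tags(const char *buf, const char *name)
{
    size_t count = 0;
    size_t nlen = strlen(name);
    const char *p;

    for (p = buf; (p = strchr(p, '<')) != NULL; p++) {
        size_t i = 0;

        /* Compare the tag name case-insensitively; the comparison
         * stops at the terminating NUL, so it never reads past it. */
        while (i < nlen &&
               tolower((unsigned char)p[1 + i]) ==
               tolower((unsigned char)name[i]))
            i++;

        if (i == nlen) {
            /* Require a delimiter after the name, so that "<body"
             * does not match a longer name such as "<bodyguard". */
            unsigned char next = (unsigned char)p[1 + nlen];
            if (next == '>' || next == '/' || isspace(next))
                count++;
        }
    }
    return count;
}
```

A caller could treat count_open_tags(buf, "html") > 1 or count_open_tags(buf, "body") > 1 as a sign that the document falls into the third category above.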
I have some hacks to error-correct for some of these: for example
as described at
But now I'm looking at providing a systematically more forgiving
parser as an option to my users. That leaves me two options:
(1) Write a new tag-soup parser from scratch, and make the
choice of parser a configuration option for users.
(2) Work within your existing HTMLparser to make it (optionally)
more forgiving.
The second is only realistically an option if I can feed back
changes to the libxml2 codebase and not land myself with an
So, what do you think? Is this something the libxml2 project
would like to see, or would you prefer to steer well clear?
Application Development with Apache - the Apache Modules Book