Re: [xml] Parsing tag-soup HTML
- From: Daniel Veillard <veillard redhat com>
- To: Nick Kew <nick webthing com>
- Cc: xml gnome org
- Subject: Re: [xml] Parsing tag-soup HTML
- Date: Sun, 17 Jun 2007 10:18:29 -0400
On Sun, Jun 17, 2007 at 02:39:57PM +0100, Nick Kew wrote:
> I've been using libxml2 for some years to parse both XML and HTML
> in the context of Apache filter modules. All these modules use the
> parseChunk API, which is the only reasonable option in the context
> of the Apache filter architecture. My most widely-used libxml2-based
> module is mod_proxy_html, which serves to rewrite HTML links in a
> reverse proxy.
>
> A FAQ arising in this context is why some pages get mangled.
> The straight answer is that they're hopelessly malformed tag-soup,
> and HTMLparser is somewhat less forgiving than mainstream browsers.
> Common examples include:
> - Documents that start with a <meta ...>, followed by
> <html>(normal contents)</html>
> - <script> sections that are prematurely closed by things
> like document.write("<p>foo</p>");
> - Documents with multiple <html> or multiple <body> tags.
>
> I have some hacks to error-correct for some of these: for example
> as described at
> http://bahumbug.wordpress.com/2006/10/12/mod_proxy_html-revisited/
> But now I'm looking at providing a systematically more forgiving
> parser as an option to my users. That leaves me two options:
> (1) Write a new tag-soup parser from scratch, and make the
> choice of parser a configuration option for users.
> (2) Work within your existing HTMLparser to make it (optionally)
> more forgiving.
> The second is only realistically an option if I can feed back
> changes to the libxml2 codebase and not land myself with an
> unmaintainable branch.
>
> So, what do you think? Is this something the libxml2 project
> would like to see, or would you prefer to steer well clear?
I'm not averse to adding a new HTML parsing option for 'tag soup',
but you would have to define clearly what the new parsing strategy is
before I (and others on this list) can say yes or no to that option.
So what would the 'tag soup' parser do that the current HTML parser
does not, and vice versa? If you can define this other than by an
accumulation of specific cases, then that's probably viable. But if
it's just an ever-growing list of individual preferences on a
case-by-case basis, I can't justify saying yes to your selection
rather than to some other application's own set.
Make sense?
Daniel
--
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard | virtualization library http://libvirt.org/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
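
To make the discussion above a little more concrete, here is a minimal sketch of push-style (chunked) parsing combined with one recovery rule that is stated generally rather than as a list of special cases. It is not libxml2 code: it uses Python's standard html.parser as a stand-in for HTMLparser's parseChunk interface. The names SoupCollector, ONCE_ONLY, parse_in_chunks and SAMPLE, and the rule itself ("html, head and body may be opened only once; later start tags for them are dropped"), are assumptions made purely for illustration and were not proposed in this thread.

```python
from html.parser import HTMLParser

# Hypothetical rule for this sketch: these elements may be opened only once
# per document; any later start tag for them is dropped.
ONCE_ONLY = {"html", "head", "body"}


class SoupCollector(HTMLParser):
    """Collects a flat event list while applying the ONCE_ONLY rule."""

    def __init__(self):
        super().__init__()
        self.opened = set()   # once-only elements already seen
        self.events = []      # ("start" | "end" | "text", value) tuples
        self.dropped = []     # duplicate start tags that were ignored

    def handle_starttag(self, tag, attrs):
        if tag in ONCE_ONLY:
            if tag in self.opened:
                self.dropped.append(tag)
                return
            self.opened.add(tag)
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # End tags are passed through unchanged in this sketch.
        self.events.append(("end", tag))

    def handle_data(self, data):
        # A text run may arrive in several pieces when the input is chunked,
        # so merge adjacent text events to keep the result chunk-independent.
        if self.events and self.events[-1][0] == "text":
            self.events[-1] = ("text", self.events[-1][1] + data)
        else:
            self.events.append(("text", data))


def parse_in_chunks(html, chunk_size=16):
    """Feed the document to the parser chunk_size characters at a time."""
    parser = SoupCollector()
    for i in range(0, len(html), chunk_size):
        parser.feed(html[i:i + chunk_size])
    parser.close()
    return parser


# Tag soup modelled on the examples in the quoted message: a <meta> before
# <html>, markup inside a document.write() call, and a second <body>.
SAMPLE = (
    '<meta name="x" content="y">'
    '<html><body><script>document.write("<p>foo</p>");</script>'
    '<body><p>text</p></body></html>'
)

if __name__ == "__main__":
    result = parse_in_chunks(SAMPLE, chunk_size=5)
    for event in result.events:
        print(event)
    print("dropped:", result.dropped)
```

In this sketch the second <body> is the only tag dropped, and the script body is kept as a single text event, because Python's parser treats <script> content as raw text up to the closing </script>. Whether libxml2's HTMLparser should adopt rules of this kind is exactly the question raised above; the point is only that "open once, ignore repeats" is a rule that can be written down in one sentence, unlike a growing list of per-site fixes.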