Re: [xml] Parsing a file that I didn't create

From: "Jeffrey Bigham" <jbigham u washington edu>
To: "Bjoern Hoehrmann" <derhoermi gmx net>
Cc: xml gnome org
Subject: Re: [xml] Parsing a file that I didn't create
Date: Sat, 14 Oct 2006 21:07:59 -0700

* Jeffrey Bigham wrote:
>libxml correctly messes this up because the closing HTML tags between
>the </script> tags aren't correctly written as <\/name>.  Is there a
>way to use libxml (I'm currently using the SAX parser) without having
>it try to fix things for me?  If not, is there another C library that
>people know of that can just return each tag to me, one at a time,
>without enforcing adherence to the standard?

HTML Tidy (http://tidy.sf.net/) is able to cope with most of these cases
and you could use it as replacement or as pre-processor (e.g., you could
use it to convert the tag soup into well-formed XML and parse that with
libxml2). Perl's HTML::Parser (http://search.cpan.org/dist/HTML-Parser/)
is also written in C and can handle such tag soup in a similar way.


Thanks for the suggestions.  Tidy isn't attractive because time is of
paramount concern and I don't really want to have to do two passes
over the data.  I took a look at the Perl version and I think it
could probably work for my purposes, although it doesn't look like
there's an easy way to just drop it into my current project.

Isn't there a flag or something I could set in libxml that would tell
it not to output a tag if it doesn't exist in the original source?  If
not, why not?
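
For illustration, the behaviour asked for above (each tag returned once,
in source order, with nothing repaired or implied) could be sketched in
plain C roughly as follows.  This is only a sketch under stated
assumptions, not libxml2 code: scan_tags and tag_cb are hypothetical
names invented for the example, and it deliberately ignores comments,
CDATA sections and quoted attribute values.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type (not part of libxml2): receives each raw
 * tag exactly as it appears in the source, angle brackets included. */
typedef void (*tag_cb)(const char *tag, size_t len, void *user);

/* Scan buf once and report every "<...>" span in source order.
 * Nothing is repaired, implied, or auto-closed, so a "</b>" inside a
 * <script> block is reported like any other tag.
 * Known limitation: a '>' inside a quoted attribute value ends the
 * tag early.  Returns the number of tags reported. */
static size_t scan_tags(const char *buf, size_t len, tag_cb cb, void *user)
{
    size_t count = 0;
    const char *p = buf;
    const char *end = buf + len;

    while (p < end) {
        const char *lt = memchr(p, '<', (size_t)(end - p));
        if (lt == NULL)
            break;                  /* no further tags in the buffer */

        const char *gt = memchr(lt, '>', (size_t)(end - lt));
        if (gt == NULL)
            break;                  /* unterminated tag: stop, do not guess */

        if (cb != NULL)
            cb(lt, (size_t)(gt - lt) + 1, user);
        count++;
        p = gt + 1;                 /* resume right after this tag */
    }
    return count;
}
```

Because the scanner only looks for '<' and '>', it makes a single pass
over the data and never emits a tag that is absent from the input; the
price is that it does no validation at all.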

Thanks again!
Jeff

--
Björn Höhrmann · mailto:bjoern hoehrmann de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Follow-Ups:
- Re: [xml] Parsing a file that I didn't create
  - From: Daniel Veillard

References:
- [xml] Parsing a file that I didn't create
  - From: Jeffrey Bigham
- Re: [xml] Parsing a file that I didn't create
  - From: Bjoern Hoehrmann
