Re: [xml] Parsing a file that I didn't create
- From: Bjoern Hoehrmann <derhoermi gmx net>
- To: "Jeffrey Bigham" <jbigham u washington edu>
- Cc: xml gnome org
- Subject: Re: [xml] Parsing a file that I didn't create
- Date: Sat, 14 Oct 2006 20:15:20 +0200
* Jeffrey Bigham wrote:
> libxml correctly messes this up because the closing HTML tags between
> the </script> tags aren't correctly written as <\/name>. Is there a
> way to use libxml (I'm currently using the SAX parser) without having
> it try to fix things for me? If not, is there another C library that
> people know of that can just return each tag to me, one at a time,
> without enforcing adherence to the standard?

HTML Tidy (http://tidy.sf.net/) copes with most of these cases, and you
could use it as a replacement or as a pre-processor (e.g., you could
use it to convert the tag soup into well-formed XML and then parse that
with libxml2). Perl's HTML::Parser (http://search.cpan.org/dist/HTML-Parser/)
is also implemented in C and handles such tag soup in a similar way.
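
If all you need is "each tag, one at a time" with no repair at all, the
idea can be sketched in plain C. This is only an illustration, not how
libxml2, HTML Tidy, or HTML::Parser work internally. The names scan_tags,
tag_cb, and print_tag are hypothetical, and the scanner is deliberately
naive: it assumes every '<' starts a tag, ignores comments and '>' inside
attribute values, and treats everything after a <script ...> tag as raw
text up to the next "</script".

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback: receives one raw tag, e.g. "<p>" or "</script>".
 * The tag is NOT NUL-terminated; use len. */
typedef void (*tag_cb)(const char *tag, size_t len, void *user);

/* Case-insensitive test: does s begin with prefix? */
static int starts_with_ci(const char *s, const char *prefix)
{
    while (*prefix) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Report every "<...>" run in html to cb (cb may be NULL) and return the
 * number of tags seen. No validation or repair is attempted. */
static size_t scan_tags(const char *html, tag_cb cb, void *user)
{
    size_t count = 0;
    const char *p = html;

    while ((p = strchr(p, '<')) != NULL) {
        const char *end = strchr(p, '>');
        if (end == NULL)
            break;                      /* unterminated tag: stop scanning */

        if (cb)
            cb(p, (size_t)(end - p) + 1, user);
        count++;

        /* Is this an opening <script> or <script ...> tag? */
        int is_script = starts_with_ci(p, "<script") &&
                        (p[7] == '>' || isspace((unsigned char)p[7]));
        p = end + 1;

        /* Script bodies are raw text: skip ahead to "</script" so that
         * strings such as "</b>" inside the script are not reported. */
        if (is_script)
            while (*p && !starts_with_ci(p, "</script"))
                p++;
    }
    return count;
}

/* Example callback: print each tag on its own line. */
static void print_tag(const char *tag, size_t len, void *user)
{
    (void)user;
    printf("%.*s\n", (int)len, tag);
}
```

Calling scan_tags(buf, print_tag, NULL) on
<p><script>document.write("</b>");</script></p> reports <p>, <script>,
</script>, and </p>, and leaves the </b> inside the script alone. For
anything beyond a quick experiment, Tidy or HTML::Parser remain the
better choice.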
--
Björn Höhrmann · mailto:bjoern hoehrmann de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/