Re: [xml] Parsing a file that I didn't create



On Sat, Oct 14, 2006 at 09:07:59PM -0700, Jeffrey Bigham wrote:
* Jeffrey Bigham wrote:
libxml (correctly) messes this up because the closing HTML tags between
the <script> and </script> tags aren't written in escaped form as
<\/name>.  Is there a
way to use libxml (I'm currently using the SAX parser) without having
it try to fix things for me?  If not, is there another C library that
people know of that can just return each tag to me, one at a time,
without enforcing adherence to the standard?

HTML Tidy (http://tidy.sf.net/) is able to cope with most of these cases
and you could use it as replacement or as pre-processor (e.g., you could
use it to convert the tag soup into well-formed XML and parse that with
libxml2). Perl's HTML::Parser (http://search.cpan.org/dist/HTML-Parser/)
is also written in C and can handle such tag soup in a similar way.

Thanks for the suggestions.  Tidy isn't attractive because time is of
paramount concern and I don't really want to have to do two passes
over the data.  I took a look at the Perl version and I think it
could probably work for my purposes, although it doesn't look like
there's an easy way to just drop it into my current project.

Isn't there a flag or something I could set in libxml that would tell
it not to output a tag if it doesn't exist in the original source?  If
not, why not?

 In your case, it *is* present in the original source. Reread the HTML
spec on the condition for closing <script>: the <script> *is* closed, and the
next < marks the beginning of an opening tag. Sorry, but the behaviour you
intend for the application in that case is to ignore tags which *are* present.
Stating that libxml2 should not add tags which don't exist is a reformulation
of the problem: the input is broken, not libxml2, and you must agree
that special, diverging processing will be needed to cope with it.
Being willing to parse and accept specially broken input costs everybody a lot,
and you must be ready to bear this cost if you want to accept this
input in a broken way. It is a sad situation, but it is the current one. If you
start changing the parser to make the broken behaviour the default, then you
will break correctly written pages as far as I can tell, so the choice is
relatively obvious to me.

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/


