[xml] Event based HTML parsing with libxml2, and...
- From: Bill Moseley <moseley hank org>
- To: xml gnome org
- Subject: [xml] Event based HTML parsing with libxml2, and...
- Date: Thu, 09 Aug 2001 22:24:53 -0700
I'm one of the developers on the open source swish-e search engine.
The old swish-e code used its own XML and HTML parsers. I've replaced the
XML parser with Expat. I'd like to now replace the HTML parser, too.
libxml2 seems like it might be a good solution, as I could use it for both
HTML and XML parsing.
I assume there's a SAX-style, event-based parser for HTML, but I haven't
been able to locate any examples -- or haven't recognized them when I saw
them.
My needs are simple. With Expat, all I'm using are:
XML_SetUserData();
XML_SetElementHandler();
XML_SetCharacterDataHandler();
XML_SetCommentHandler();
Swish doesn't need to know much when parsing. For HTML it really only
needs to extract the text and know where that text is located (title, body,
emphasized, or from a meta tag). That's about it.
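A minimal sketch of the state tracking described above, assuming handlers
shaped like Expat's element and character-data callbacks. The IndexState
struct, the flag names, the buffer sizes, and the choice of b/strong/em/i as
"emphasized" tags are all hypothetical and not taken from the swish-e source.
No parser is involved here; any event-based parser that delivers start, end,
and text events could drive these functions.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical location flags; not from the swish-e source. */
enum { IN_TITLE = 1, IN_BODY = 2 };

typedef struct {
    unsigned where;         /* which of title/body is currently open */
    int emph_depth;         /* nesting depth of emphasis tags */
    char title[256];        /* text seen inside <title> */
    char body[1024];        /* non-emphasized text inside <body> */
    char emph[1024];        /* emphasized text inside <body> */
    char meta_name[128];    /* name attribute of the last <meta> seen */
    char meta_content[512]; /* content attribute of the last <meta> seen */
} IndexState;

/* Case-insensitive tag/attribute name comparison (HTML names are not
 * case-sensitive). */
static int name_eq(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Assumed set of tags that count as "emphasized". */
static int is_emph(const char *name)
{
    return name_eq(name, "b") || name_eq(name, "strong") ||
           name_eq(name, "em") || name_eq(name, "i");
}

/* Append len bytes of s to dst, truncating to fit a buffer of cap bytes. */
static void append(char *dst, size_t cap, const char *s, int len)
{
    size_t used = strlen(dst);
    size_t room = cap - used - 1;
    size_t n = (size_t)len < room ? (size_t)len : room;
    memcpy(dst + used, s, n);
    dst[used + n] = '\0';
}

/* Start-tag event: atts is a NULL-terminated list of name/value pairs. */
void start_element(void *user_data, const char *name, const char **atts)
{
    IndexState *st = user_data;
    if (name_eq(name, "title")) {
        st->where |= IN_TITLE;
    } else if (name_eq(name, "body")) {
        st->where |= IN_BODY;
    } else if (is_emph(name)) {
        st->emph_depth++;
    } else if (name_eq(name, "meta") && atts) {
        for (int i = 0; atts[i] && atts[i + 1]; i += 2) {
            const char *val = atts[i + 1];
            if (name_eq(atts[i], "name")) {
                st->meta_name[0] = '\0';
                append(st->meta_name, sizeof st->meta_name,
                       val, (int)strlen(val));
            } else if (name_eq(atts[i], "content")) {
                st->meta_content[0] = '\0';
                append(st->meta_content, sizeof st->meta_content,
                       val, (int)strlen(val));
            }
        }
    }
}

/* End-tag event: close whichever region the tag opened. */
void end_element(void *user_data, const char *name)
{
    IndexState *st = user_data;
    if (name_eq(name, "title"))
        st->where &= ~(unsigned)IN_TITLE;
    else if (name_eq(name, "body"))
        st->where &= ~(unsigned)IN_BODY;
    else if (is_emph(name) && st->emph_depth > 0)
        st->emph_depth--;
}

/* Text event: s is not NUL-terminated, so len must be honored. */
void char_data(void *user_data, const char *s, int len)
{
    IndexState *st = user_data;
    if (st->where & IN_TITLE)
        append(st->title, sizeof st->title, s, len);
    else if ((st->where & IN_BODY) && st->emph_depth > 0)
        append(st->emph, sizeof st->emph, s, len);
    else if (st->where & IN_BODY)
        append(st->body, sizeof st->body, s, len);
}
```

With Expat, functions of this shape would be registered through the
XML_SetUserData(), XML_SetElementHandler(), and XML_SetCharacterDataHandler()
calls listed above, with a zero-initialized IndexState as the user data.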
So, can I do event-based parsing of HTML with libxml2, and can someone
provide or point me to an example?
Second, if we end up bundling libxml2 with swish-e, would someone be able
to offer help on how to incorporate it into swish? Is the best way to
include the entire package, let our ./configure script call libxml2's
./configure, run make, and then link swish-e against the resulting
library? Does that sound about right?
Swish-e is only 600K as a tarball, so adding libxml2 would triple its
size. Is there a "lite" package we could include instead?
Thanks very much,
Oh, BTW --
~/libxml2/libxml2-2.4.1 > ./configure
creating cache ./config.cache
checking host system type... configure: error: can not guess host type; you
must specify one
It's rare that I don't have good luck with configure scripts on Linux.
Bill Moseley
mailto:moseley hank org