[xml] Event based HTML parsing with libxml2, and...



I'm one of the developers on the open source swish-e search engine.

The old swish-e code used its own XML and HTML parsers.  I've replaced the
XML parser with Expat.  I'd like to now replace the HTML parser, too.
libxml2 seems like it might be a good solution, as I could use it for both
HTML and XML parsing.

I assume there's a SAX type of event based parser for HTML, but I haven't
been able to locate any examples -- or haven't recognized them when I saw
them.

My needs are simple.  With Expat, all I'm using are:

XML_SetUserData();
XML_SetElementHandler();
XML_SetCharacterDataHandler();
XML_SetCommentHandler();

Swish doesn't need to know much when parsing.  For HTML it really only
needs to extract text and know where that text came from (title, body,
emphasized text, or a meta tag).  That's about it.

So, can I do event-based parsing of HTML with libxml2, and can someone
provide or point me to an example?

Second, if we end up bundling libxml2 with swish-e, would someone be able
to offer help on how to incorporate it into swish-e?  Is the best way to
include the entire package, have our ./configure script call libxml2's
./configure, run make, and then link swish-e against the resulting
library?  Does that sound about right?

Swish-e is only 600K as a tarball, so adding libxml2 would triple its
size.  Is there a "lite" package we could include instead?

Thanks very much,

Oh, BTW --

~/libxml2/libxml2-2.4.1 > ./configure
creating cache ./config.cache
checking host system type... configure: error: can not guess host type; you
must specify one

It's rare that I don't have good luck with configure scripts on Linux.


Bill Moseley
mailto:moseley hank org



