Re: Gnome Website Search Demo

On Thu, 2004-07-15 at 11:16 -0400, Michael Henson wrote:
> On Fri, 2004-07-16 at 00:56 +1000, Jeff Waugh wrote:
> >
> > 
> > What are the advantages of using plucene over other indexing tools?
> [paraphrase: i.e. namazu and htdig]
> > We have
> > namazu installed on the GNOME servers (for use with the mailing lists), and
> > there have been vague plans to use htdig (despite its warts) on the website
> > at some stage. Using another tool (not packaged by Red Hat for RHEL3) would
> > be a maintenance cost -> is a Plucene-based index/search solution worth it?
> > 
> I haven't really looked over namazu or htdig. I'll check them out
> tonight if I have a chance.

So, I've taken a look at both namazu and htdig. Both look to be
essentially general purpose indexers. Namazu is aware of multiple file
types, which is nifty, and htdig seems to be a good general purpose
solution for indexing web sites. I'm only going to compare plucene and
htdig though, since they have the most features.

Both plucene and htdig support a nearly mind-numbing number of search
options: boolean searches, sound-based searches, proximity matches,
etc. I can see some of these search types -- especially good word
stemming -- being beneficial, and both programs support it. htdig is
essentially built for crawling web pages. It is written in C++ and
seems to be fairly straightforward to set up. It provides some fairly
smart indexing scripts, can read certain custom meta tags, and has a
PageRank-like algorithm for giving higher precedence to pages that are
frequently linked to.

Plucene, on the other hand, is a roll-your-own indexing and searching
library. Basically you have a document index -- you create documents
with a number of fields and add them to the index. The way in which
the text is tokenized depends on the type of analyzer you use -- and
it's really easy to make your own. As opposed to htdig, which primarily
lets you search just text-like content, plucene searches can be run
across multiple fields. So for instance, if I were to index the mailing
lists I might include the following fields: mailing_list, subject,
author, body. I could then run the following search:
	mailing_list:desktop-devel-list subject:themes
And have it return something sensible. Now those who know the system can
use these special queries directly, but it's easy to make wrapper forms
for them too. Also, since plucene is a roll-your-own system, you can
write custom responses that return more information -- e.g. for a
given symbol in the API, you could also suggest links to related
tutorials, etc. So we have a large amount of flexibility, but of course
the con is that some code has to be written :D
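
To make the fielded-document idea a bit more concrete, here is a rough
sketch in Python. This is NOT the Plucene API (Plucene is a Perl
library); it is just a toy in-memory index that I'm assuming behaves
roughly like the model described above: documents are sets of named
fields, an analyzer turns field text into tokens, and a query is a list
of field:term clauses that must all match. The class and function
names, the sample messages and the author names are all made up for
illustration.

```python
from collections import defaultdict


def simple_analyzer(text):
    """Hypothetical analyzer: lowercase, then split on whitespace."""
    return text.lower().split()


class FieldedIndex:
    """Toy inverted index keyed by (field, token). Not Plucene's API."""

    def __init__(self, analyzer=simple_analyzer):
        self.analyzer = analyzer
        self.docs = []                    # doc_id -> original document
        self.postings = defaultdict(set)  # (field, token) -> doc_ids

    def add(self, doc):
        """Index every field of a document and return its id."""
        doc_id = len(self.docs)
        self.docs.append(doc)
        for field, value in doc.items():
            for token in self.analyzer(value):
                self.postings[(field, token)].add(doc_id)
        return doc_id

    def search(self, query):
        """Return documents matching ALL 'field:term' clauses."""
        result = None
        for clause in query.split():
            field, _, term = clause.partition(":")
            for token in self.analyzer(term):
                hits = self.postings.get((field, token), set())
                result = hits if result is None else result & hits
        return [self.docs[i] for i in sorted(result or ())]


# Made-up sample messages using the fields from the example above.
index = FieldedIndex()
index.add({"mailing_list": "desktop-devel-list",
           "subject": "New themes proposal",
           "author": "alice", "body": "some text"})
index.add({"mailing_list": "gnome-list",
           "subject": "themes question",
           "author": "bob", "body": "some text"})
index.add({"mailing_list": "desktop-devel-list",
           "subject": "Release schedule",
           "author": "carol", "body": "some text"})

matches = index.search("mailing_list:desktop-devel-list subject:themes")
for doc in matches:
    print(doc["author"], "-", doc["subject"])
```

Swapping in a different analyzer (say, one that does stemming) only
means passing another function to FieldedIndex, which is roughly the
kind of flexibility I mean. A real setup would of course use the
library's own index, analyzer and query parser classes rather than
this toy.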

-- Michael
