New Indexer



I have been working on an alternate document indexer and backend that
implements some of the ideas on relevance that I described in a previous
posting.

The advantages over the existing one are:
1) Supports a lot of document types (currently plain text, PDF, png,
html, and many word processor formats). This is easily extensible by
adding subclasses.
2) Fast backend: simple single-word queries are very fast.
3) Infrastructure for Bayesian relevance processing, but I haven't coded
all of that yet.
4) All data lives in a relational DB, rather than the mixed strategy of
the existing doc backend.
5) The indexer is very fast, except for docs that use slow external
converters - in practice this is mostly a problem for html.
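
As an illustration of points 1, 2 and 4, here is a minimal sketch of how
converter subclasses and a relational word index could fit together. It is
not the actual indexer code: the class names, the table layout, and the use
of SQLite are all assumptions made for the example.

```python
import re
import sqlite3
from collections import Counter


class Converter:
    """Hypothetical base class: turns a document's raw bytes into text."""

    extensions = ()  # file extensions this converter claims

    def to_text(self, data):
        raise NotImplementedError


class PlainTextConverter(Converter):
    extensions = (".txt",)

    def to_text(self, data):
        return data.decode("utf-8", errors="replace")


def find_converter(filename):
    """Return an instance of the first direct subclass claiming the file."""
    for cls in Converter.__subclasses__():
        if filename.lower().endswith(cls.extensions):
            return cls()
    return None


def open_index(path=":memory:"):
    """Create the (assumed) schema: one row per document, one per word/doc."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS docs "
               "(id INTEGER PRIMARY KEY, path TEXT UNIQUE)")
    # The (word, doc_id) primary key doubles as the lookup index for queries.
    db.execute("CREATE TABLE IF NOT EXISTS postings "
               "(word TEXT, doc_id INTEGER, count INTEGER, "
               "PRIMARY KEY (word, doc_id))")
    return db


def index_document(db, path, data):
    """Convert one document and store its word counts; False if unsupported."""
    converter = find_converter(path)
    if converter is None:
        return False
    words = re.findall(r"[a-z0-9]+", converter.to_text(data).lower())
    doc_id = db.execute("INSERT INTO docs (path) VALUES (?)", (path,)).lastrowid
    db.executemany("INSERT INTO postings VALUES (?, ?, ?)",
                   [(w, doc_id, n) for w, n in Counter(words).items()])
    db.commit()
    return True


def search(db, word):
    """Single-word query: one indexed lookup, most frequent documents first."""
    rows = db.execute(
        "SELECT docs.path FROM postings "
        "JOIN docs ON docs.id = postings.doc_id "
        "WHERE postings.word = ? "
        "ORDER BY postings.count DESC, docs.path", (word.lower(),))
    return [row[0] for row in rows]
```

Supporting a new format in this sketch means adding one more Converter
subclass; the index and query code stay untouched. Ordering by raw word
count is only a placeholder for a real relevance measure.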

The main remaining problems are:
1) I need to do some robustness work on the indexer. If the Abiword
process crashes, I don't handle it correctly.
2) I need to use or write a better html->text converter. Abiword is slow
at this, and passes through a lot of tags that it does not understand. It
also crashes on some html docs (see 1). For a good example, see
/usr/share/doc/abiword-2.0.0/roadmap.html
3) I don't do anything with metadata (except for checking creation time)
yet, although it would be easy enough to extend my design to do this.
4) Need to complete the relevance ordering.
5) Doesn't yet do quite the right thing with updated documents.
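
For problem 2, one possible direction is an in-process converter that never
emits tags at all, known or unknown. The sketch below uses Python's standard
html.parser module; the class and function names are made up for the
example, and it makes no claim to match Abiword's output.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data only, so no tag can end up in the output."""

    SKIP = {"script", "style"}  # elements whose contents are not prose

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an html string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with spaces can split a word that has inline markup inside it,
    # which is acceptable for a rough sketch but not for exact text recovery.
    return " ".join(" ".join(parser.parts).split())
```

Because only character data is collected, a tag the parser has never heard
of is dropped in the same way as a known one.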

A neat extra feature for the future would be to add extra subclasses for
parsing some programming languages, and to index text (comments and
strings) independently from code, so that a query in the appropriate
context would retrieve only the right kind of match.
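
To make the idea concrete, here is a rough sketch that handles Python source
only, using the standard tokenize module. The function name and the choice
of Python as the language being parsed are assumptions for the example;
other languages would each need their own tokenizer.

```python
import io
import tokenize


def split_source(source):
    """Split Python source into (text, code) token lists.

    text holds comments and string literals; code holds names, numbers
    and operators. Each list could then be indexed separately.
    """
    text, code = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.STRING):
            text.append(tok.string)
        elif tok.type in (tokenize.NAME, tokenize.NUMBER, tokenize.OP):
            code.append(tok.string)
    return text, code
```

A search for "counter" as prose would then look only at the text list, while
a search for an identifier would look only at the code list.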

We could also index the man pages through this. I think it would be much
faster than the current backend.

It also occurs to me that the same infrastructure could easily be hooked
up to an explicit query front-end to give a blindingly fast medusa
replacement.

Julian



