Re: New Indexer

From: msevior physics unimelb edu au
To: "Julian Satchell" <j satchell eris qinetiq com>
Cc: dashboard-hackers gnome org, nat nat org, msevior physics unimelb edu au
Subject: Re: New Indexer
Date: Sat, 21 Feb 2004 01:55:31 +1100 (EST)

> I have been working on an alternate document indexer and backend that
> implements some of the ideas on relevance that I described in a previous
> posting.
>
> The advantages over the existing one are:
> 1) Supports a lot of document types (currently plain text, PDF, png,
> html, lots of word wordprocessor docs). This is easily extensible by
> adding subclasses.
> 2) Fast backend - simple single word queries are very fast.
> 3) Infrastructure for Bayesian relevance processing, but I haven't coded
> all of that yet.
> 4) All data lives in a relational DB, rather than the mixed strategy of
> the existing doc backend.
> 5) The indexer is very fast, except for docs that use slow external
> converters - in practice this is mostly a problem for html.
>
> The main remaining problems are:
> 1) I need to do some robustness work on the indexer, if the Abiword
> process crashes I don't handle it right.
> 2) I need to use or write a better html->text converter. Abiword is slow
> at this, and leaks a lot of tags that it does not understand. It also
> crashes on some html docs (see 1). As a good example see
> /usr/share/doc/abiword-2.0.0/roadmap.html

Right. The HTML importer has been on the cusp of being substantially
improved for some time. One day it might actually happen :-)

On the other hand, for HTML and for the purposes of dashboard, maybe just a
simple text parser to remove all <...> tags would be sufficient?
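
To make the idea concrete, here is a rough sketch of such a tag stripper.
It assumes Python and its standard html.parser module, which is only an
illustration and not something dashboard or the new indexer is known to
use; the names TagStripper and html_to_text are made up for this example.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect the text of an HTML document and drop every <...> tag.

    Hypothetical helper for illustration only; not part of dashboard
    or AbiWord.
    """

    # Contents of these elements are not useful text for an index.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Unknown tags need no special handling: they are simply ignored.
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text of `html` with tags removed and whitespace collapsed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    # Chunks are joined with a space, so a word split by an inline tag
    # (e.g. "foo<b>bar</b>") becomes two words. That is a simplification.
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    sample = "<html><body><h1>Road map</h1><p>AbiWord &amp; friends</p></body></html>"
    print(html_to_text(sample))
```

Because tags the parser does not recognise are just skipped, nothing leaks
into the indexed text, and no external converter process is involved for
html at all.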

> 3) I don't do anything with metadata (except for checking creation time)
> yet, although it would be easy enough to extend my design to do this.
> 4) Need to complete the relevance ordering.
> 5) Doesn't yet do quite the right thing with updated documents.
>
> A neat extra feature for the future would be to add extra subclasses for
> parsing some programming languages, and index text (comments and
> strings) independently from code, so that appropriate contexts would
> retrieve only the correct thing.
>
> We could also do the man pages through this, I think it would be much
> faster than the current backend.
>
> It also occurs to me that the same infrastructure could easily be hooked
> up to an explicit query front-end to give a blindingly fast medusa
> replacement.
>
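
The suggestion above of indexing comments and strings separately from
code could look roughly like the sketch below. It assumes Python source
as the language being indexed and uses Python's standard tokenize module;
the function name split_source is hypothetical and nothing here reflects
how the new indexer is actually written.

```python
import io
import tokenize


def split_source(source):
    """Return (text_tokens, code_names) for a piece of Python source.

    text_tokens holds comments and string literals, code_names holds
    identifiers and keywords. Hypothetical helper for illustration only.
    """
    text_tokens, code_names = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.STRING):
            text_tokens.append(tok.string)
        elif tok.type == tokenize.NAME:
            code_names.append(tok.string)
    return text_tokens, code_names


if __name__ == "__main__":
    sample = 'x = "hello world"  # greet the user\nprint(x)\n'
    text_tokens, code_names = split_source(sample)
    print(text_tokens)
    print(code_names)
```

The two lists could then be fed to separate indexes, so that a prose
context matches only comments and strings while a code context matches
only identifiers.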

This all sounds cool. I'm happy someone jumped up and ran with the AbiWord
converter.

Have you looked at the first page preview functions to create png previews
of the first page?

Cheers

Martin

> Julian
>
>

References:
- New Indexer
  - From: Julian Satchell
