Indexer



The current cut of code for my indexer, with a matching backend, is
available if you email me. It is 240K, including test data, so I didn't
want to spam everybody with it.

There are many areas for improvement:
1) Proper build support - it is left as an exercise for the student
   at present, with a manual makefile you must edit for the indexer;
   you will need to hack in backends to get the backend built. It
   depends on dblib.dll, exported from the docindex directory.
2) Make it robust - it has fragile dependencies on external
   converters, and if some of them hang the indexer gets confused.
3) Support more file types.
4) Extract metadata from more file types, as well as text content.
   Currently we only half-heartedly support HTML.
5) Use the metadata to support more clue types, as well as better
   searching for existing clues.
6) Chain some more clue types.
7) One day, an alternate GUI that exposes a rich explicit search
   interface, but still uses the existing infrastructure.
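
On item 2, one possible approach is to run each external converter
under a timeout, so that a hung converter is killed and the document is
skipped rather than stalling the whole run. A minimal sketch in Python,
not taken from the indexer itself; the function name run_converter and
the 30 second default are assumptions made up for illustration:

```python
import subprocess
import sys


def run_converter(cmd, timeout_s=30.0):
    """Run an external converter command (hypothetical helper).

    Returns the converter's stdout as text, or None if the converter
    could not be started, exited with an error, or ran past timeout_s.
    """
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # never let the converter wait on input
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,          # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None                     # converter hung: skip this document
    except OSError:
        return None                     # converter missing or not executable
    if result.returncode != 0:
        return None                     # converter reported a failure
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Stand-in "converter": the Python interpreter printing some text.
    text = run_converter([sys.executable, "-c", "print('hello')"])
    print("skipped" if text is None else text.strip())
```

The caller treats None as "index nothing for this file and carry on",
which keeps one misbehaving converter from confusing the rest of the
run.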

The indexer and backend can be built against either sqlite or postgres.
sqlite is fine for, say, 1500 documents, but too slow with 7000.
Postgres is blindingly fast up to 7000.

I sent an earlier version of it to Nat for review, but haven't heard
back yet, and as I will go home for Easter in a few hours, I thought I
should announce it anyway. Please remember the code has not been
approved by Nat, and may never make it into the code base.

Requests for code that arrive after about 5.15pm GMT Thursday will
probably not be answered before the Tuesday after Easter, as I won't be
here.

Julian Satchell





