Re: Followup: opinions on Search services

On Thu, 2005-04-07 at 17:41 +0100, Jamie McCracken wrote:

> > Let me illustrate with an example:
> > 	"To index a 1 gigabyte file, do I need 1 gigabyte of memory?"
> > Clearly if your answer is `yes', then you are not the most astute
> > programmer, nor the sharpest knife in the drawer.
> No but depending on how it's implemented you still have to filter the
> file into plain text and then generate a unique word list from it. This 
> word list can potentially be quite large for large files and would 
> occupy a fair amount of memory.

My lq-text package can index multiple gigabytes of text without needing
to have all the words from any file in memory at any one time, and it
was written (mostly) in 1989, so it is hardly new technology.  The
algorithms have been published and the code is available.  It does have
one limitation: all occurrences of a single word must fit in memory
during indexing (although not during retrieval), so if the record for
"the" doesn't fit, you may have to make it a stopword.  Zipf's law
applies remarkably well, so it's very rare to need more than a few
stopwords even on small systems.
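
To make the memory argument concrete, here is a minimal sketch of one
well-known way to do it: buffer a bounded number of (word, position)
pairs, spill them to sorted run files, and merge the runs.  This is NOT
lq-text's actual algorithm or code; the names index_stream and
max_buffered are invented for the illustration.  Like the limitation
described above, it only ever needs one word's occurrences in memory
during the merge.

```python
import heapq
import os
import re
import tempfile
from itertools import groupby

WORD = re.compile(r"[a-z0-9]+")  # assumption: a deliberately crude tokenizer


def _spill(buffer, run_paths, tmpdir):
    """Write the buffered (word, position) pairs to disk as one sorted run."""
    path = os.path.join(tmpdir, f"run{len(run_paths)}.txt")
    with open(path, "w", encoding="utf-8") as out:
        for word, pos in sorted(buffer):
            out.write(f"{word}\t{pos}\n")
    run_paths.append(path)
    buffer.clear()


def _read_run(path):
    """Lazily read one sorted run back as (word, position) pairs."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, pos = line.rstrip("\n").split("\t")
            yield word, int(pos)


def index_stream(lines, max_buffered=100_000):
    """Yield (word, [positions]) in word order.

    While reading, at most max_buffered pairs are held in memory.
    While merging, only the positions of the current word are held.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        run_paths, buffer, pos = [], [], 0
        for line in lines:
            for word in WORD.findall(line.lower()):
                buffer.append((word, pos))
                pos += 1
                if len(buffer) >= max_buffered:
                    _spill(buffer, run_paths, tmpdir)
        if buffer:
            _spill(buffer, run_paths, tmpdir)
        # heapq.merge streams the sorted runs without loading them whole.
        merged = heapq.merge(*(_read_run(p) for p in run_paths))
        for word, group in groupby(merged, key=lambda pair: pair[0]):
            yield word, [p for _, p in group]
```

Passing an open file object as `lines` reads it a line at a time, so
the size of the file never dictates the size of the buffer.  A word
whose position list does not fit in memory is the case where a
stopword would be needed.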

Merely recording which words are in which files leads to what the
information retrieval researchers call low precision -- if you're
searching for the New York Times you don't want the times that there
was news about York Minster in England.  The more documents you have,
the more you need search services, and the more you need high precision.
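
The precision problem can be shown with a toy example.  The sketch
below is an illustration only (the function names and the two sample
documents are made up): a document-level match returns both documents
for "New York Times", while a positional index that checks adjacency
and order returns only the one that contains the phrase.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")  # assumption: a deliberately crude tokenizer


def build_positional_index(docs):
    """Map word -> {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(WORD.findall(text.lower())):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def match_all_words(index, query):
    """Document-level match: every query word occurs somewhere in the file."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    return set.intersection(*(set(index.get(w, {})) for w in words))


def match_phrase(index, query):
    """Phrase match: the query words occur consecutively and in order."""
    words = WORD.findall(query.lower())
    hits = set()
    for doc_id in match_all_words(index, query):
        # Candidate start positions are where the first word occurs.
        starts = set(index[words[0]][doc_id])
        for offset, word in enumerate(words[1:], start=1):
            starts &= {p - offset for p in index[word][doc_id]}
        if starts:
            hits.add(doc_id)
    return hits


docs = {
    "nyt": "The New York Times reported the story.",
    "minster": "Several times there was new information about York Minster.",
}
index = build_positional_index(docs)
```

Here `match_all_words(index, "New York Times")` finds both documents,
and `match_phrase(index, "New York Times")` finds only "nyt".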

Note that Google is also subtly sensitive to word order, and can match

Arguments about technology really ought to come after arguments about
use cases and needs.

It might be that there is a lot of merit in integrating some sort of
indexing framework into the desktop -- indexing services sometimes
work best if they are told *before* a file is deleted or renamed,
for example, so they can "unindex" it efficiently.
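
A sketch of why the "before" matters, under the assumption of an index
that stores only word -> files and keeps no per-file word list.  The
class and method names (TinyIndex, about_to_delete, about_to_rename)
are hypothetical and do not belong to any existing desktop API.  If
the index is told while the file still exists, it can re-read the file
and touch only the posting lists for the words in it; told afterwards,
it would have to scan every posting list.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")  # assumption: a deliberately crude tokenizer


class TinyIndex:
    """Toy inverted index: word -> set of paths, with no per-file word list."""

    def __init__(self):
        self.postings = defaultdict(set)

    def _words(self, path):
        """Collect the distinct words of a file, reading it line by line."""
        words = set()
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                words.update(WORD.findall(line.lower()))
        return words

    def file_created(self, path):
        for word in self._words(path):
            self.postings[word].add(path)

    def about_to_delete(self, path):
        # Must be called while the file still exists: its words are
        # re-read so that only the relevant posting lists are visited.
        for word in self._words(path):
            self.postings[word].discard(path)
            if not self.postings[word]:
                del self.postings[word]

    def about_to_rename(self, old_path, new_path):
        # Same idea: the old file is still readable under its old name.
        for word in self._words(old_path):
            self.postings[word].discard(old_path)
            self.postings[word].add(new_path)
```

The caller (a file manager, say) would invoke `about_to_delete(path)`
and only then remove the file.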

An API for this might benefit other applications, especially if it
helps people to find out "which application made this file and why".
"This data file is needed by the game of Empire you've been running for
12 years.  If you delete it, your game will be lost.  Continue?" is
clearer than "really delete emp3016.dat?"
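
As a rough sketch of what such an API might offer (everything here,
including ProvenanceRegistry and its methods, is hypothetical):
applications register which files they created and why, and a file
manager asks for a prompt before deleting.

```python
from dataclasses import dataclass


@dataclass
class FileOwner:
    application: str  # which application made the file
    purpose: str      # why it exists, in words meant for the user


class ProvenanceRegistry:
    """Hypothetical in-memory registry of path -> owning application."""

    def __init__(self):
        self._owners = {}

    def register(self, path, application, purpose):
        self._owners[path] = FileOwner(application, purpose)

    def delete_prompt(self, path):
        """Return the question to ask the user before deleting path."""
        owner = self._owners.get(path)
        if owner is None:
            # Nothing is known about the file, so fall back to the bare name.
            return f"Really delete {path}?"
        return (f"This file is needed by {owner.application}: "
                f"{owner.purpose} Delete {path} anyway?")
```

For example, after `register("emp3016.dat", "Empire", "It holds your
saved game.")`, `delete_prompt("emp3016.dat")` names the game instead
of just the file.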

So I think there might be useful things to consider, but at the
interoperability level, not at the specific implementation level.


Liam Quin, W3C XML Activity Lead,
Pictures from old books:
IRC (chat) programs:
