Policy question



The sqlite database makes the document indexer's backend slow for
large numbers of documents.

I can cure the problem by preloading all the data that maps keywords to
documents into memory, so queries no longer touch the database. This makes
the backend very fast - it is now the first backend to respond to a
query.
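
To make the idea concrete, here is a minimal sketch of the preload step in
Python. The table and column names (keyword_map, keyword, doc_id) are
placeholders for illustration, not the real schema.

    import sqlite3
    from collections import defaultdict

    def preload_index(db_path):
        """Load the whole keyword -> document mapping into memory once."""
        index = defaultdict(list)   # keyword -> list of document ids
        conn = sqlite3.connect(db_path)
        try:
            # Hypothetical table/columns; stands in for the real schema.
            for keyword, doc_id in conn.execute(
                    "SELECT keyword, doc_id FROM keyword_map"):
                index[keyword].append(doc_id)
        finally:
            conn.close()
        return index

    # After the preload, answering a query is just a dictionary lookup
    # and never touches sqlite:
    #   docs = index.get("sqlite", [])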

The downside is twofold:

1) Startup is slow and increases linearly with the amount of text that
has been indexed; it can easily take tens of seconds.

2) Memory consumption is increased and will rise in proportion to the
amount of text indexed. The consumption has been minimised by object
sharing, but it still requires one object reference per textual word (which
is the minimum I can imagine). I think this is 4 bytes per keyword on
x86. If you have a lot of big documents, you could easily have many
millions of words (for example, I often write or work with reports that
are 5,000 to 20,000 words long, and many people would have hundreds of
documents like this). The memory consumption could easily run to many
tens of megabytes (a rough back-of-envelope sketch follows this list).
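
The estimate below illustrates the arithmetic behind point 2; the document
counts and sizes are purely hypothetical, and 4 bytes per reference assumes
32-bit x86. The intern_word helper is just a sketch of the object-sharing
idea (one shared string per distinct keyword).

    import sys

    words_per_doc = 10_000                   # e.g. a 10,000-word report
    num_docs      = 500                      # a few hundred such documents
    refs          = words_per_doc * num_docs # one reference per word
    mb = refs * 4 / (1024 * 1024)
    print(f"{mb:.1f} MB for the references alone")   # ~19 MB

    # Object sharing: duplicate words cost only the 4-byte reference,
    # not a fresh string object each time.
    _shared = {}
    def intern_word(word):
        return _shared.setdefault(word, word)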

Is there a policy about performance vs resource consumption trade-offs?

Julian



