Re: Size of index



Hi,

On Sat, 2006-12-16 at 07:40 -0800, Gregoire Gentil wrote:
> I would like to get an idea of the size of Beagle index. Let's say that
> you have xGB of structured data (Word, OpenOffice, text files,
> emails...) yGB of Music and zGB of Video on your system, how much is the
> size of Beagle index? Obviously, indexation of music and Video is almost
> zero. But for "standard" data (=xGB), what's the ratio of index size?

It's hard to put a concrete number on it, because as you say, the type
of data can vary the figure pretty wildly.  There are other constraints
too, such as the number of terms within a document.  Lucene (Beagle's
underlying search engine) doesn't index more than 10,000 terms per
field by default, so huge files don't go entirely into the index.
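
To give a feel for what a term cap does to a huge file, here is a rough
Python sketch.  It is not Lucene's code: the whitespace split stands in
for a real analyzer, and the names indexed_terms and MAX_TERMS are made
up for illustration.  Only the 10,000 figure comes from the paragraph
above.

```python
MAX_TERMS = 10_000  # default cap mentioned above; the constant name is hypothetical


def indexed_terms(text, max_terms=MAX_TERMS):
    """Return the terms that would survive a simple term cap.

    Splits on whitespace (a stand-in for a real analyzer) and drops
    everything after the first max_terms tokens.
    """
    tokens = text.split()
    return tokens[:max_terms]


if __name__ == "__main__":
    # A made-up document with 25,000 terms.
    doc = " ".join(f"word{i}" for i in range(25_000))
    kept = indexed_terms(doc)
    print(len(kept), "of", len(doc.split()), "terms indexed")
```

Anything past the cap adds nothing to the index, which is one reason
index size doesn't grow in step with file size for very large files.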

As far as files go, plain text files have the greatest "density" when it
comes to indexing.  Although they have no metadata, all of their content
is indexable text.  Emails are also incredibly dense because they
contain almost entirely text, plus pretty straightforward metadata.

I generally keep around a sandbox for testing file indexing.  It's a
directory of 60 megs, mostly text files and various office documents
(OOo, MS Office, PDFs).  Its index is 1.1 megs, and the "text cache" is
also 1.1 megs.  That's about 1.7% each of total disk usage.

On my laptop, I am using 19 gigs overall in my home directory.  913 megs
of that, or 4.8%, is used by my indexes (which are email-heavy, with
several hundred thousand emails indexed) and 66 megs, or 0.34%, is used
by my text cache.
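
For anyone who wants to redo the arithmetic, here is a small Python
sketch using the sizes quoted above.  Two things are assumptions: a gig
is taken as 1,000 megs, and the sandbox's "total disk usage" is taken
as data plus index plus text cache.  The sizes themselves are rounded,
so the last digit can differ slightly from the percentages in the text.

```python
def percent_of(part_mb, total_mb):
    """Return part_mb as a percentage of total_mb."""
    return 100.0 * part_mb / total_mb


# Sandbox: 60 MB of data, 1.1 MB index, 1.1 MB text cache.
# Assumption: "total disk usage" means data + index + text cache.
sandbox_total = 60 + 1.1 + 1.1
sandbox_index = percent_of(1.1, sandbox_total)

# Laptop: 19 GB home directory (assumed to be 19,000 MB),
# 913 MB of indexes and 66 MB of text cache.
laptop_total = 19 * 1000
laptop_index = percent_of(913, laptop_total)
laptop_cache = percent_of(66, laptop_total)

print(f"sandbox index: {sandbox_index:.2f}%")
print(f"laptop index:  {laptop_index:.2f}%")
print(f"laptop cache:  {laptop_cache:.2f}%")
```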

I suspect other people will have similar figures.  I generally tend to
think that 10% is a reasonable figure for index size, although my
indexes don't appear to be anywhere near that.
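
If you want to check the ratio on your own machine, something like the
following Python sketch should do.  The function names are made up for
illustration, and you have to pass in the index and data directories
yourself; nothing here is part of Beagle.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                # File vanished or is unreadable; leave it out of the total.
                pass
    return total


def index_ratio(index_dir, data_dir):
    """Return the size of index_dir as a percentage of data_dir."""
    data = dir_size_bytes(data_dir)
    if data == 0:
        return 0.0
    return 100.0 * dir_size_bytes(index_dir) / data
```

Call it as index_ratio(index_dir, data_dir) with your own paths.  If
the index directory lives inside the data directory, the index is
counted in the total as well, which matches the way the laptop figures
above are worked out against the whole home directory.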

Thanks,
Joe



