Re: question on performance evaluation of beagle's indexing


Thanks for your reply.

Actually this is the part I am most interested in --
> For testing, we have an in-house stress tester and file system
> correctness checker (trunk/testsuite/bludgeon/) - it creates different
> directory structures, fills them with random files and indexes/queries
> them looking for inconsistency.

What I meant was all in the context of file system structure and file content.
So for the moment just assume we are using beagle-build-index, with

My first question was: do you make design decisions (e.g., choose one indexing
algorithm over another) based on attributes such as
1. how deep the directory tree is
2. what the distribution of file sizes is
3. how many media vs. text files there are
and other such distributions characterizing a file system?
Does it matter?
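
To make the three attributes above concrete, here is a minimal Python sketch
of how one might measure them for an existing directory tree. It is only an
illustration, not anything beagle ships: the function name profile_tree and
the MEDIA_EXTS extension list are hypothetical, and classifying files by
extension is a crude assumption.

```python
import os
from collections import Counter

# Hypothetical extension list used to tell media files from everything else.
MEDIA_EXTS = {".jpg", ".png", ".mp3", ".ogg", ".avi", ".mpg"}

def profile_tree(root):
    """Return the max depth, file sizes and media/other counts under root."""
    root = os.path.abspath(root)
    max_depth = 0
    sizes = []
    kinds = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # The root itself is depth 0; each path component adds one level.
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        max_depth = max(max_depth, depth)
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append(os.path.getsize(path))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            ext = os.path.splitext(name)[1].lower()
            kinds["media" if ext in MEDIA_EXTS else "other"] += 1
    return {"max_depth": max_depth, "sizes": sizes, "kinds": dict(kinds)}
```

The returned sizes list can then be fed into a histogram or summary
statistics to compare one file system against another.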

In practice, different file systems will have different distributions, so there
is no one-size-fits-all solution, but still, you can optimize for the common case.
And even better, allow the user/beagle to select the best indexing strategy
(if many exist) for a given file system distribution.

Secondly, when you use your in-house stress tester, you say it creates different
directory structures and fills them with random files.
Is random good enough? Is it meaningful? Do you think you would benefit if you
had precise control over the statistical distributions?
And do you use it only to look for inconsistencies? What about performance?
Do you analyze why performance is good or bad for a particular setup and
try to improve on it?
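
By "precise control" I mean something like the following Python sketch, which
builds a test tree from a fixed seed and draws file sizes from a log-normal
distribution. This is not how bludgeon works (I have not read its code); the
function build_tree, its parameters and the choice of a log-normal
distribution are all assumptions made for illustration.

```python
import os
import random

def build_tree(root, n_dirs, files_per_dir, mean_log_size, sigma, seed=0):
    """Create a random directory tree whose file sizes are log-normal."""
    rng = random.Random(seed)  # a fixed seed makes the layout reproducible
    os.makedirs(root, exist_ok=True)
    dirs = [root]
    for i in range(n_dirs):
        # Picking the parent among all existing directories gives a mix
        # of shallow and deep paths.
        parent = rng.choice(dirs)
        path = os.path.join(parent, "d%d" % i)
        os.mkdir(path)
        dirs.append(path)
    sizes = []
    for d in dirs:
        for j in range(files_per_dir):
            size = int(rng.lognormvariate(mean_log_size, sigma))
            with open(os.path.join(d, "f%d.txt" % j), "wb") as fh:
                fh.write(b"x" * size)
            sizes.append(size)
    return sizes
```

Because the seed fixes every random choice, the same call recreates the same
tree, so a performance difference between two runs can be attributed to the
indexer rather than to the test data.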

Sorry for the plethora of questions; please feel free to respond at
your convenience.


On 4/2/08, D Bera <dbera web gmail com> wrote:

> I was wondering how you currently evaluate beagle's
> index performance, and decide on tradeoffs between
> different indexing options? Do you have some sort of
> in-house case base of test file systems?

Usually beagled (the indexer component of beagle) throttles itself so
that it does not consume CPU continuously. You can disable this internal
scheduling by setting the environment variable BEAGLE_EXERCISE_THE_DOG.
In that case you might also want to pass "--indexing-delay 0" to start
indexing right away; there is usually a 60-second gap (the option could
be set automatically by the EXERCISE_THE_DOG setting, I am not sure).
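
For a benchmark harness, the two settings above could be put together roughly
as in this Python sketch. The helper name benchmark_launch is made up, and
the value "1" for the variable is an assumption; I only know that the
variable has to be set.

```python
import os

def benchmark_launch(base_env=None):
    """Return (argv, env) for starting beagled without self-throttling."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: any value works; the text only says the variable is set.
    env["BEAGLE_EXERCISE_THE_DOG"] = "1"
    # Start indexing immediately instead of after the usual delay.
    argv = ["beagled", "--indexing-delay", "0"]
    return argv, env
```

The pair can be handed to subprocess.Popen(argv, env=env) on a machine where
beagled is installed.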

Beagle stores its per-file bookkeeping information in the extended
attributes of the files or, if that fails, in an sqlite database.
Folklore has it that using the extended attributes is significantly
faster than the sqlite database (disk access vs sqlite access). But
depending on disk I/O speed and other system load, sqlite access could
sometimes be faster IMO. For simplicity we ignore the tradeoff and
always suggest and prefer extended attributes over sqlite. You can
force sqlite by setting BEAGLE_DISABLE_XATTR.
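
If you want to check the folklore on a particular machine, a rough
micro-benchmark along these lines might help. It is only a sketch: the
attribute name user.demo.stamp, the attrs table and the function
time_bookkeeping are invented for the example and do not reflect beagle's
actual attribute names or database schema. os.setxattr is available only on
Linux, and the file system must support user xattrs.

```python
import os
import sqlite3
import tempfile
import time

def time_bookkeeping(n=200):
    """Time n small per-file writes through xattrs and through sqlite."""
    results = {"xattr": None, "sqlite": None}
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(n):
            p = os.path.join(tmp, "f%d" % i)
            open(p, "w").close()
            paths.append(p)
        if hasattr(os, "setxattr"):  # Linux only
            try:
                start = time.perf_counter()
                for p in paths:
                    os.setxattr(p, "user.demo.stamp", b"20080402")
                results["xattr"] = time.perf_counter() - start
            except OSError:
                results["xattr"] = None  # xattrs unsupported on this fs
        db = sqlite3.connect(os.path.join(tmp, "attrs.db"))
        db.execute("CREATE TABLE attrs (path TEXT PRIMARY KEY, stamp TEXT)")
        start = time.perf_counter()
        for p in paths:
            db.execute("INSERT INTO attrs VALUES (?, ?)", (p, "20080402"))
        db.commit()  # one commit for the whole batch
        results["sqlite"] = time.perf_counter() - start
        db.close()
    return results
```

Note that the sqlite side commits once per batch; committing after every
insert would be much slower, so the outcome depends heavily on how the
writes are grouped.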

Since beagled is a long-running desktop process, not everything is
tuned for maximum speed; we try to strike a balance between speed and
system resources.

Beagle uses XML messages over a Unix socket for IPC. Setting
MONO_XMLSERIALIZER_THS=0 should give you faster IPC (you need to have
gmcs installed for this to work).

For effectively read-only filesystems (e.g. a system documentation
directory or a backup directory), instead of using beagled and its live
filesystem backend (called "Files"), you can build a read-only index
using beagle-build-index. The read-only index can be added to beagled
for querying only; changes will not be monitored, so you have to rerun
beagle-build-index to update the index with recent changes.

You can control which backends to start with beagled. Backends are the
different data sources, e.g. "Files" for the live filesystem, "Opera" for
Opera browsing history, "KMail" for KMail emails etc. You can either
use the "--backend" option to beagled or disable certain backends
permanently using beagle-settings. Disabling the unused backends will
certainly boost performance, but not by much. Pass "--backend none" to
not start any backend; e.g. "beagled --add-static-backend
/path/to/static/backend --backend none" will only query the read-only
static index.
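
In a test script, the static-only invocation quoted above could be assembled
like this. The helper name static_only_argv is hypothetical; the options are
exactly the ones in the example command and nothing more.

```python
def static_only_argv(static_index_path):
    """Build the beagled command line that queries one static index only."""
    return [
        "beagled",
        "--add-static-backend", static_index_path,  # read-only index to query
        "--backend", "none",                        # start no live backend
    ]
```

Keeping the arguments in a list avoids shell quoting problems when the index
path contains spaces.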

Off the top of my head, I can't remember any other option that
will affect indexing performance.

For testing, we have an in-house stress tester and file system
correctness checker (trunk/testsuite/bludgeon/) - it creates different
directory structures, fills them with random files and indexes/queries
them looking for inconsistency.

I hope that answered your question at least partly. If there is
anything more you are looking for, please let me know.

- dBera

Debajyoti Bera @
beagle / KDE fan
Mandriva / Inspiron-1100 user
