Re: question on performance evaluation of beagle's indexing



> My first question was: do you make design decisions (e.g., choose one
> indexing algorithm over another) based on attributes such as --
> 1. How deep is the directory tree
> 2. what is the distribution of file sizes
> 3. how many media v/s text files are there
> and other such distributions characterizing a file system.
> Does it matter?
>
> In practice, different file systems will have different distributions, so
> there is no one-size-fits-all solution, but still, you can optimize for the
> common case.
> And even better, allow the user/beagle to select the best indexing strategy
> (if many exist) for a given file system distribution.

No, we don't.
1) We don't know of any existing research on this that would tell us
what the options are.
2) It is not clear how to make the decision without crawling/indexing
all the files first. Kind of a chicken-and-egg problem.
3) We try to guard against the worst case. After the initial crawling,
everything else is moderately fast and happens in real time. So a
relatively painless initial crawling is what we aim for.
4) During crawling, we basically crawl according to the rule
EarlierPreviouslyCrawledFirst (with a special exception for the home
directory and its subdirectories).
5) Indexing is basically reading all bytes (or specific portions, for
binary files) of a file found by the crawler. Apart from minimizing
disk seeks, I don't see any other place for optimization.
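
To illustrate the ordering described in point 4, here is a minimal
Python sketch. It is not Beagle's code: the function name crawl_order,
the last_crawled mapping, and the two tie-breaking choices (the home
tree goes first, never-crawled directories count as the oldest) are
assumptions made only for illustration.

```python
import posixpath


def crawl_order(last_crawled, home):
    """Return directory paths in the order a crawler might visit them.

    last_crawled maps a directory path to the timestamp of its previous
    crawl, or None if it has never been crawled (hypothetical input).
    """
    home = posixpath.normpath(home)

    def under_home(path):
        path = posixpath.normpath(path)
        return path == home or path.startswith(home + "/")

    def key(path):
        ts = last_crawled[path]
        return (
            0 if under_home(path) else 1,  # assumed: home tree goes first
            0.0 if ts is None else ts,     # assumed: never crawled = oldest
            path,                          # tie-break for a repeatable order
        )

    return sorted(last_crawled, key=key)
```

Within each group, directories with the oldest previous crawl come
first, which is one plausible reading of the rule name.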

> Secondly, when you use your in-house stress tester, you say it creates
> different directory structures, fills them with random files.
> Is random good enough? Is it meaningful? Do you think you will benefit if
> you had precise control over the statistical distributions?
> And do you just use it for inconsistency? What about performance?
> Do you analyze why performance is good/bad for a particular setup and
> try to improve on it?

The in-house tester is mainly there to test consistency (basically a
unit test to check our Filesystem backend and basic querying/indexing
for regressions). We pay some attention to performance, but mostly in
terms of memory and querying speed. Indexing speed, being a one-time
cost, comes next. Furthermore, all the indexing speed tuning done so
far has been for indexing a single file. I have not done any
benchmarks or tuning based on what the filesystem distribution is,
mainly because I don't know what filesystem to expect.
I also don't see right away how a particular filesystem distribution
would affect performance "too much", positively or negatively.
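
For anyone who wants to run such an experiment, a test tree with
controlled tree depth, file sizes, and text/binary mix could be
generated along the following lines. This is a rough Python sketch and
not the in-house tester; make_tree and all of its parameters are
hypothetical, and the log-normal size distribution is just one
convenient assumption.

```python
import math
import os
import random


def make_tree(root, depth, fanout, files_per_dir, median_size,
              sigma, text_ratio, max_size=1 << 20, seed=0):
    """Create a synthetic tree under root; return a list of (path, size)."""
    rng = random.Random(seed)  # seeded so layout and sizes are repeatable
    created = []

    def fill(directory, level):
        os.makedirs(directory, exist_ok=True)
        for i in range(files_per_dir):
            is_text = rng.random() < text_ratio
            # Log-normal sizes: many small files, a few large ones.
            size = int(rng.lognormvariate(math.log(median_size), sigma))
            size = max(1, min(size, max_size))
            name = f"f{i}.txt" if is_text else f"f{i}.bin"
            path = os.path.join(directory, name)
            with open(path, "wb") as fh:
                if is_text:
                    fh.write((b"lorem ipsum " * (size // 12 + 1))[:size])
                else:
                    fh.write(os.urandom(size))  # contents are not seeded
            created.append((path, size))
        if level < depth:
            for j in range(fanout):
                fill(os.path.join(directory, f"d{j}"), level + 1)

    fill(root, 1)
    return created
```

Crawling and indexing could then be timed over trees that differ in
one parameter at a time (for example depth or text_ratio), to see
whether the distribution matters at all.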

I would be personally interested in knowing the results of any
relevant research.

- dBera

-- 
-----------------------------------------------------
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE fan
Mandriva / Inspiron-1100 user

