Re: Beagle testing questions



Hi,

   I have been testing beagle for many months now (both releases and svn
versions) and I'm still having problems.  It would seem that if I run a
re-index overnight (on 60Gb of compressed pdf/ms docs) the indexer will hit
the vmrss limit five or six times, get killed and restarted.

Yeah, those are text-heavy files. The memory situation gets better with
every release but is not quite there yet. The svn checkout would be
slightly better, but as I see it, it still needs more work. I'll
investigate, thanks.

If I repeat this, but search for more obscure (but english) phrases like
"minimum illumination" or something, I know there is a file with that
phrase, and I can see the plain text - but the search returns no hits!

Hmm... that would be a bug. I never tested phrase queries that
extensively. I will check this. It is unlikely, but possible, that
beagle-search is skipping some hits. Can you please check with
beagle-query? Also, if you search for single words (e.g. illumination),
do you get results?

Should beagle be able to find /every/ sensible english word?  Is it possible

I would like to claim, yes it should.
Beagle svn contains a simple tool, beagle-dump-index, that lists all
the terms in a beagle index. Something like "beagle-dump-index --terms
--indexdir=/path/to/index-directory" would list all the terms in the
index, and you can grep that list to see whether a given word is in
the index. It is not quite that simple: you have to grep for the
stemmed word, and it would take a long time for a large index, but
it's a useful debugging tool.
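
In case it helps with the "grep for the stemmed word" part, here is a
rough Python sketch. It does not reproduce the stemmer beagle actually
uses; instead it treats every prefix of the word (down to a minimum
length) as a candidate stem and reports which of those appear in the
term list. Two assumptions of mine: the terms come out as
whitespace-separated tokens on stdin, and the names find_terms,
candidate_stems and min_len are made up for this example. Stems that
are not plain prefixes of the word (a stemmer may rewrite a trailing
"y" to "i", for instance) will be missed.

```python
import sys


def candidate_stems(word, min_len=4):
    """Return prefixes of `word`, longest first, down to `min_len` chars.

    This is a crude stand-in for a real stemmer, not beagle's own logic.
    """
    word = word.lower()
    shortest = min(min_len, len(word))
    return [word[:n] for n in range(len(word), shortest - 1, -1)]


def find_terms(lines, word, min_len=4):
    """Return the sorted terms in `lines` that look like stems of `word`.

    Assumption: each line holds one or more whitespace-separated terms.
    """
    stems = set(candidate_stems(word, min_len))
    hits = set()
    for line in lines:
        for token in line.split():
            if token.lower() in stems:
                hits.add(token.lower())
    return sorted(hits)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: find_term.py WORD < term-list")
    for term in find_terms(sys.stdin, sys.argv[1]):
        print(term)
```

You would pipe the term listing from the command above into it, e.g.
"... | python find_term.py illumination" (find_term.py is just a
placeholder file name). No output means none of the candidate prefixes
is present as a term, which is a hint, not proof, that the word is
missing from the index.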

I have a partially complete index?  How do I determine what files are
excluded from the index?

I don't think you have a partially complete index. Check index-info
--status; if the scheduler queue is empty, then indexing is definitely
finished. You are using the files backend, right? If you are testing
with static indexes, then index-info does not apply.

There is no direct way to determine what files are not indexed from
the beagle index.

I have seen from the logs that when processing archives (and all my media
files are .gz compressed) that the actual indexing of the child ( i.e. the
.pdf file contained in the archive) is 'deferred' until later.  What happens
if the index helper get killed?  Does this deferral get ignored and the
archive reexamined later?

Yup. Indexing of archives is not complete till all the included (and
sub-included and so on) files are indexed. If killed midway, they
would be re-indexed.

Is there a way to get a report of all the files that haven't been indexed,
either because of missing filters (postscript docs don't work yet) or
exceptions?

Unfortunately nothing better than grepping the logfiles :(

How can I gain confidence in the validity and completeness of the index?

*sigh* Till I read your email, I assumed beagle at least does not have
_this_ problem. I will definitely look into this.

- dBera

--
-----------------------------------------------------
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE fan
Mandriva / Inspiron-1100 user


