Beagle testing questions



All,
   I have been testing beagle for many months now (both releases and svn versions) and I'm still having problems.  It seems that if I run a re-index overnight (on 60 GB of compressed pdf/ms docs), the indexer hits the VmRSS limit five or six times, and each time it gets killed and restarted.

Eventually it all settles down and the indexer finishes.  If I then take a random pdf (say a technical journal article), run beagle-extract-content on it, and look for general technical phrases such as "low-voltage" or "phase error", I can run a search for them (via kate) and get more hits than I can read in a week.
If I repeat this, but search for a more obscure (but English) phrase like "minimum illumination", I know there is a file containing that phrase, and I can see it in the extracted plain text, but the search returns no hits!
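
The spot check described above could be automated roughly as follows. This is only a sketch under assumptions: `search` is a hypothetical callable standing in for whatever query front end is in use (it is not a Beagle API), and `text` is assumed to be plain text that has already been extracted, for example with beagle-extract-content.

```python
import random
import re

# A "word" here is letters only, optionally hyphenated (e.g. "low-voltage").
_WORD = re.compile(r"^[A-Za-z]+(?:-[A-Za-z]+)*$")


def candidate_phrases(text, n_words=2):
    """Return every run of n_words adjacent tokens that are all plain words."""
    tokens = text.split()
    phrases = []
    for i in range(len(tokens) - n_words + 1):
        window = tokens[i:i + n_words]
        if all(_WORD.match(t) for t in window):
            phrases.append(" ".join(window))
    return phrases


def missing_phrases(path, text, search, count=5, n_words=2, seed=0):
    """Sample phrases from text and return those for which search() omits path.

    search(phrase) is a hypothetical callable returning an iterable of the
    file paths that the index reports as containing the phrase.
    """
    candidates = candidate_phrases(text, n_words)
    rng = random.Random(seed)  # fixed seed so a failing check can be repeated
    sample = rng.sample(candidates, min(count, len(candidates)))
    return [p for p in sample if path not in set(search(p))]
```

An empty result for a file does not prove the index is complete, but a non-empty one gives concrete phrases to report.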

Should beagle be able to find /every/ sensible English word?  Is it possible I have a partially complete index?  How do I determine what files are excluded from the index?

I have seen from the logs that when processing archives (and all my media files are .gz compressed), the actual indexing of the child (i.e. the .pdf file contained in the archive) is 'deferred' until later.  What happens if the index helper gets killed?  Does this deferral get ignored, and is the archive re-examined later?

Is there a way to get a report of all the files that haven't been indexed, either because of missing filters (postscript docs don't work yet) or exceptions?
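
Lacking such a report, the "missing filter" half could be approximated crudely from the file names alone. The sketch below is an assumption-laden stand-in, not something Beagle provides: the `supported` set is whatever extensions one believes have working filters, and it says nothing about files that failed with exceptions.

```python
import os


def inner_extension(name):
    """Return the lower-cased extension of the file inside a .gz wrapper."""
    base = name[:-3] if name.lower().endswith(".gz") else name
    return os.path.splitext(base)[1].lower()


def unfiltered_files(root, supported):
    """Yield paths under root whose inner extension is not in supported.

    supported is a set of extensions such as {".pdf", ".doc"}; its contents
    are the caller's guess at which file types have a working filter.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if inner_extension(name) not in supported:
                yield os.path.join(dirpath, name)
```

For example, `sorted(unfiltered_files("/data/docs", {".pdf", ".doc"}))` would list every file, compressed or not, whose type falls outside the assumed set (the path is illustrative).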

How can I gain confidence in the validity and completeness of the index?

Thanks in advance!

Regards,
Dave.

