Re: Help! Your beagle likes to eat my file index!



Hello Joe, and thanks for your answer!

On Tue, 2006-08-29 at 15:13 -0400, Joe Shaw wrote:
> Martin Soto wrote:
[...]
> A likely possibility is that Beagle's indexer process is crashing in the 
> middle of indexing a specific file.  When this happens, the main Beagle 
> daemon notices that the indexer has gone away and spawns a new one. 
> Since the last one crashed, it left behind lock files, and the new 
> process ends up purging the index.  This is a nasty but unfortunately 
> unfixable circumstance.

Well, the log files seem to confirm this.

[...]
> Take a look at the *end* of the previous IndexHelper log.  That's the 
> one more likely to have crashed and caused a problem.  Normally there 
> should be some line about exiting in there.

These are the last lines of the previous log:

060829 1603493920 07830 IndexH DEBUG: +file:///media/hda2/home/lap096/UML/linux-2.4.24/include/asm-sparc64/socket.h
060829 1603494070 07830 IndexH DEBUG: +file:///media/hda2/home/lap096/Desktop/IESE/QPE-Meeting-040721/Dissertation_Vorschlag_04_07_21.ppt
060829 1603503199 07830 IndexH DEBUG: Helper Size: VmRSS=41.9 MB, size=2.55, 38.7%
060829 1603516560 07830 IndexH DEBUG: +file:///media/hda2/home/lap096/Desktop/IESE/QPE-Meeting-040721/040721_ms_minutes.doc
060829 1603529778 07830 IndexH  WARN: DocumentSummaryInformationStream not found in /media/hda2/home/lap096/Desktop/IESE/QPE-Meeting-040721/040721_ms_minutes.doc
060829 1603533374 07830 IndexH DEBUG: Helper Size: VmRSS=43.8 MB, size=2.67, 41.6%

The only message that looks suspicious is the one with the "WARN:
DocumentSummaryInformationStream not found". There are also no normal
shutdown messages, unlike in the other log files. Could it be that the MS
Office parsing library crashed on the last document listed? Is there a
way to run the text-extraction code on that document alone to see if it
works?

> > By the way, I really don't know, but is Lucene so lacking in robustness
> > that you have to completely erase an index that took days to build just
> > because a process crashed while accessing it?
> 
> If we're in the middle of writing out to the index and the process 
> crashes, there's no way we can guarantee the consistency of the data in 
> the index.  I don't know if this could be considered a real weakness in 
> Lucene; I don't think it's unreasonable for it to expect valid data to 
> be written out to it.

Fair enough. On the other hand, if the problem is that the C libraries
crash while parsing some file, one could think of an approach that
reduces the risk of such an event actually corrupting the index.
Wouldn't it be possible to parse the document first, storing the text
somewhere, and only then open the index and write the text into it? It
would certainly be slower and/or require more memory, but I'd gladly pay
that price if it actually helps robustness. I don't think it matters
much if initial indexing takes somewhat longer, as long as you know
your system will be reliably indexed in a few days' time.
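
To make the idea concrete, here is a minimal sketch in Python. It is
not Beagle's actual code: the EXTRACTOR one-liner is a hypothetical
stand-in for a filter that may crash on a bad file, and a plain dict
stands in for the Lucene index. The point is only the ordering: the
risky parsing runs in a child process, and the index is touched only
after the text has been fully extracted.

```python
import subprocess
import sys

# Hypothetical stand-in for a document filter that may crash on bad input.
# Here it simply fails (non-zero exit) if the file is not valid UTF-8.
EXTRACTOR = (
    "import sys; data = open(sys.argv[1], 'rb').read(); "
    "sys.stdout.write(data.decode('utf-8'))"
)


def extract_text(path, timeout=60):
    """Phase 1: run the extractor in a child process.

    Returns the extracted text, or None if the child crashed,
    exited with an error, or ran past the timeout.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", EXTRACTOR, path],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")


def index_file(index, path):
    """Phase 2: write to the index only after extraction succeeded."""
    text = extract_text(path)
    if text is None:
        # The extractor failed; the index was never opened for this file.
        return False
    index[path] = text
    return True
```

With this ordering, a crash in the parser costs one skipped file
instead of leaving the index half-written. The price is the one I
mentioned: the text has to be held somewhere before it is written.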

Thanks a lot,

M. S.



