Re: Help! Your beagle likes to eat my file index!



Unfortunately, it took me two weeks to answer this. For work reasons, I
had to temporarily disable Beagle on my machine, and couldn't test again
until this week.

On Wed, 2006-08-30 at 13:04 -0400, Joe Shaw wrote:
> On Wed, 2006-08-30 at 13:25 +0200, Martin Soto wrote:
> > The only message that looks suspicious is the one with the "WARN:
> > DocumentSummaryInformationStream not found". 
> 
> I've seen this on troublesome documents before as well, mostly in
> PowerPoint files.  After talking to our resident OLE expert, though, my
> understanding is that not having that stream is allowed, although
> uncommon.
> 
> Is there anything odd about that file?  Password protected, perhaps?
> Created by something other than MS Word?

It was a .doc file created by an older version of OpenOffice. 

> > Could it be that the MS Office parsing library crashed with the last 
> > document listed? 
> 
> Yep, this is almost certainly the case.
> 
> > Is there a way to run the text extracting code on that document alone to 
> > see if it works?
> 
> Yep, you can confirm this by running the beagle-extract-content program
> on the file.  It should crash in the same manner.

Indeed, beagle-extract-content crashes consistently on that file. I'll
report it to the developers of the wv library.

> > Fair enough. On the other hand, if the problem is that the C libraries
> > crash while parsing some file, one could think of an approach that
> > reduces the risk of such an event actually corrupting the index.
> > Wouldn't it be possible to parse the document first, storing the text
> > somewhere, and only then open the index and write the text into it? It
> > would certainly be slower and/or require more memory, but I'd gladly pay
> > that price if it actually helps robustness. I think it is not that
> > important if initial indexing takes somewhat longer, as long as you know
> > your system will be reliably indexed in a few days' time.
> 
> Yeah, you are right and it'd be possible to do that, although there is
> another host of problems associated with doing the text extraction
> entirely up front and caching it.  There's essentially no bounds on
> memory usage, for example, which a streaming setup like the one we have
> now avoids.
> 
> In my opinion, it would be a lot better to report this crash upstream
> and try to get it fixed in the Word parsing library rather than
> essentially create an entirely separate indexing codepath.  Part of the
> beauty of open source is that we can get problems like this fixed at the
> cause (the wv library); I'm not diametrically opposed to adding a
> workaround, but I'd prefer it be a last resort.

Although I agree wholeheartedly with trying to fix the bugs upstream
whenever possible, I wonder if this is generally practical. As far as
I've understood, one of the objectives of Beagle is to be able to index
the whole "personal information space". This means indexing as many file
types as possible. The problem is that many of those will require C
libraries, and if crashes in any of those libraries cause the file index
to break, we have a big problem.

I think that the "workaround" you're speaking about is rather an
important feature for robustness. Wouldn't it be possible, for example,
to fork a process that only extracts the contents and writes words back
to the indexer process through a pipe? This way, if the extractor crashes
the indexer can still recover and keep going.
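
To make the idea concrete, here is a minimal sketch in Python (not
Beagle's actual code, which is C#). It assumes the extractor is an
external command that prints the extracted text to stdout; the names
extract_in_child and index_files, and the dict standing in for the
index, are hypothetical and only serve the illustration.

```python
import subprocess
import sys


def extract_in_child(extractor_cmd, path, timeout=60):
    """Run the extractor in a separate process and read its output
    through a pipe. Returns the text, or None if the child crashed,
    exited with an error, or hung past the timeout."""
    try:
        proc = subprocess.run(
            extractor_cmd + [path],
            stdout=subprocess.PIPE,      # the pipe back to the indexer
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:             # killed by a signal or failed
        return None
    return proc.stdout.decode("utf-8", errors="replace")


def index_files(extractor_cmd, paths):
    """Index every file whose extraction succeeds; skip the rest.
    The 'index' here is a plain dict, a stand-in for the real one."""
    index, skipped = {}, []
    for path in paths:
        text = extract_in_child(extractor_cmd, path)
        if text is None:
            skipped.append(path)         # index is never touched
            continue
        index[path] = text.split()
    return index, skipped


if __name__ == "__main__":
    # Fake extractor (assumption): aborts on any path containing "bad".
    fake = [sys.executable, "-c",
            "import os, sys; "
            "os.abort() if 'bad' in sys.argv[1] else print('hello world')"]
    print(index_files(fake, ["a.doc", "bad.doc", "c.doc"]))
```

Note that this version buffers each document's text in the parent
before touching the index, so it does not address the memory concern
raised above; a streaming variant would read the pipe incrementally and
would then need a way to discard a partially indexed document if the
child dies midway.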

Anyway, the good news is that Beagle 0.2.9 managed to go through the
damaged file and has now indexed my almost 200,000 files. Very nice!

Thanks,

M. S.



