Re: Help! Your beagle likes to eat my file index!



Hi,

On Wed, 2006-08-30 at 13:25 +0200, Martin Soto wrote:
> The only message that looks suspicious is the one with the "WARN:
> DocumentSummaryInformationStream not found". 

I've seen this on troublesome documents before as well, mostly in
PowerPoint files.  After talking to our resident OLE expert, though, my
understanding is that not having that stream is allowed, although
uncommon.

Is there anything odd about that file?  Password protected, perhaps?
Created by something other than MS Word?

> There are also no normal shutdown messages, like in other log files.

Yeah, this is the #1 indication that something went wrong.  Normally you
would see "Exiting" if it shut down cleanly.  There is actually a bunch
of debug info spewed out when a Mono app crashes, but for some reason
that doesn't seem to get redirected to the log files like other standard
output does.

> Could it be that the MS Office parsing library crashed with the last 
> document listed? 

Yep, this is almost certainly the case.

> Is there a way to run the text extracting code on that document alone to 
> see if it works?

Yep, you can confirm this by running the beagle-extract-content program
on the file.  It should crash in the same manner.
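
If you want to script that check, here is a minimal sketch in Python.
It assumes beagle-extract-content is on your PATH and takes the file
path as its only argument, and that a crash shows up as a nonzero exit
status; run_extractor and looks_like_crash are names made up for this
example, not part of Beagle.

```python
import subprocess
import sys

def run_extractor(path, command=("beagle-extract-content",)):
    """Run the extractor on one file; return (exit_code, stdout, stderr)."""
    # Capture stderr too, since crash output may not land on stdout.
    proc = subprocess.run(
        [*command, path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        encoding="utf-8",
        errors="replace",
    )
    return proc.returncode, proc.stdout, proc.stderr

def looks_like_crash(exit_code):
    # Assumption: any nonzero status means failure. On POSIX a negative
    # value means the process was killed by a signal.
    return exit_code != 0

if __name__ == "__main__" and len(sys.argv) > 1:
    code, out, err = run_extractor(sys.argv[1])
    print("exit code:", code)
    if looks_like_crash(code):
        print("extractor failed; stderr follows:")
        print(err)
```

Run it with the suspect document as the only argument.  If it reports a
failure, the stderr text is what you would want to attach to a bug
report.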

> Fair enough. On the other hand, if the problem is that the C libraries
> crash while parsing some file, one could think of an approach that
> reduces the risk of such an event actually corrupting the index.
> Wouldn't it be possible to parse the document first, storing the text
> somewhere, and only then open the index and write the text into it? It
> would certainly be slower and/or require more memory, but I'd gladly pay
> that price if it actually helps robustness. I think it is not that
> important if initial indexing takes somewhat longer, as long as you know
> your system will be reliably indexed in a few days' time.

Yeah, you are right and it'd be possible to do that, although there is
another host of problems associated with doing the text extraction
entirely up front and caching it.  There is essentially no bound on
memory usage, for example, which a streaming setup like the one we have
now avoids.
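
To make the memory trade-off concrete, here is a small Python sketch.
It is not Beagle's code, and every name in it is invented for
illustration.  The streaming version hands text to the index one chunk
at a time, while the up-front version holds the entire extracted text
before anything is written.  Each function returns the largest number
of characters it held at once.

```python
import io

CHUNK_SIZE = 4096  # arbitrary size chosen for the illustration

def stream_chunks(reader, chunk_size=CHUNK_SIZE):
    """Yield text chunk by chunk; only one chunk is held at a time."""
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        yield chunk

def index_streaming(reader, add_to_index):
    """Feed the index as text arrives; memory stays near chunk size."""
    peak = 0
    for chunk in stream_chunks(reader):
        peak = max(peak, len(chunk))
        add_to_index(chunk)
    return peak

def index_up_front(reader, add_to_index):
    """Extract everything first; memory grows with the document."""
    text = reader.read()
    add_to_index(text)
    return len(text)

if __name__ == "__main__":
    doc = "x" * 10000
    print(index_streaming(io.StringIO(doc), lambda chunk: None))
    print(index_up_front(io.StringIO(doc), lambda chunk: None))
```

The up-front version does have the property you are after: if
extraction dies partway through, add_to_index is never called.  The
cost is that its peak scales with the size of the document.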

In my opinion, it would be a lot better to report this crash upstream
and try to get it fixed in the Word parsing library rather than
essentially create an entirely separate indexing codepath.  Part of the
beauty of open source is that we can get problems like this fixed at the
cause (the wv library); I'm not diametrically opposed to adding a
workaround, but I'd prefer it be a last resort.

We use the wv1 library for MS Word support.  The project website is
http://wvware.sourceforge.net/ and they use the AbiWord bug tracker:
http://bugzilla.abisource.com/.  If you could file a bug with that
document, that would be great.

Joe



