Taming of Word (.doc) files

Today somebody on IRC was looking for help with indexing Word .doc
files. Beagle handles Word files with wvWare. However, there is a bug
in wv that lets some offending documents crash the wv library
(something neither Beagle nor Mono can do anything about). The user on
IRC wanted to index a vast repository of Word files, but a lot of his
files turned out to be crashers. If you are in such a situation, you
can either turn off Word filtering or keep adding --deny-pattern
options to exclude the crashers by name (this works only if there are
very few crashers). Beagle cannot skip these crashers while indexing
because there is no way of knowing, before the crash happens, whether
a file will cause one (duh). There is actually something else you can
try ... and if luck favours you ... things can get better.

You need Joe's latest magic ExternalFilter for that. Get any of the
command-line text extraction tools for Word files (some are listed at
http://www.linux.com/article.pl?sid=06/02/22/201247). Then use
ExternalFilter (search the mailing list for info on ExternalFilter or
ask on IRC; some details are given at
http://beaglewiki.org/ExternalFiltersRepository) to filter Word .doc
files using these command-line tools. Testing is actually easier with
beagle-0.2.4, which contains beagle-extract-content. All you need to
do is:
1) Install one of these tools ... or all of them. You never know which
one will be the lucky one for you.
2) Edit the external-filters.xml file (at the proper location).
3) Use "beagle-extract-content /path/to/file.doc" to test how the
filter performs.
[ 4) If you find Beagle falling back to FilterDoc for indexing .doc
files instead of using FilterExternal, remove libwv1 from your
system and rebuild Beagle. FilterExternal should have the highest
priority, and that needs a small fix.]
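
The "install several tools, one of them may get lucky" idea can be
tried outside Beagle first. Below is a rough Python sketch, not part
of Beagle: it runs each extractor as a child process and takes the
first one that produces text. The tool list (antiword, catdoc), the
assumption that each tool prints plain text to stdout when given the
file path as its only argument, and the helper name extract_text are
all illustrative assumptions.

```python
import subprocess

# Assumed tools: each is expected to print the document text to stdout
# when called with the .doc path as its only argument.
EXTRACTORS = [["antiword"], ["catdoc"]]


def extract_text(path, extractors=EXTRACTORS, timeout=30):
    """Return text from the first extractor that succeeds, else ''."""
    for cmd in extractors:
        try:
            result = subprocess.run(
                cmd + [path],
                stdout=subprocess.PIPE,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except (OSError, subprocess.TimeoutExpired):
            # Tool not installed, not executable, or hung: try the next one.
            continue
        text = result.stdout.decode("utf-8", errors="replace")
        # Accept only a clean exit that actually produced some text.
        if result.returncode == 0 and text.strip():
            return text
    return ""
```

Calling extract_text("/path/to/file.doc") returns an empty string when
every tool fails. Because the tools run as separate processes, a tool
that dies on a bad file does not take the caller down with it.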

Maybe it will return something, maybe it won't return anything. But as
long as it doesn't crash Mono, everything is good. The point is, even
if it fails to index a file, Beagle can continue to index the other
files as long as Mono doesn't crash. That isn't as bad as indexing
stopping altogether.
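
The same reasoning can be sketched in Python to find out which files
in a big pile are the troublemakers: run the extraction command once
per file in a child process and carry on whatever happens. This is
only an illustration under assumptions. scan is a hypothetical helper,
the command you pass in is up to you, and treating a negative return
code as a crash relies on POSIX behaviour, where a child killed by a
signal reports a negative code. Whether a wv crash under Mono actually
surfaces that way is not something this sketch can promise.

```python
import subprocess


def scan(paths, command, timeout=60):
    """Run command + [path] once per file and sort the paths by outcome."""
    report = {"ok": [], "failed": [], "crashed": []}
    for path in paths:
        try:
            proc = subprocess.run(
                command + [path],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # A hung extractor counts as a failure, not a crash.
            report["failed"].append(path)
            continue
        if proc.returncode == 0:
            report["ok"].append(path)
        elif proc.returncode < 0:
            # POSIX: negative code means the child was killed by a signal.
            report["crashed"].append(path)
        else:
            report["failed"].append(path)
    return report
```

One bad file never stops the loop, and the paths that end up under
"crashed" are the candidates for --deny-pattern.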

OTOH, people of the adventurous kind can try to fix the bug in wv
(it's linked from the bug in Bugzilla with a subject something like
"crashing in .doc" ...).

