Re: Beagle handling of compressed files and man pages



Michael,

First of all, thanks for your patch!  Good handling of compressed files
will be a nice addition to the crawler, and I'm very pleased that you
are working on it.

I just committed slightly tweaked versions of your mime type discovery
improvements to CVS.  In gnome.cs, I renamed GuessMimeType to
GetMimeTypeFromData.

Here are a few thoughts on the rest of the patch, in no particular
order:

Do we really need to introduce the PeekableStream?  It seems to
introduce a lot of complexity, and I'm not quite sure what we are
getting in return.  It probably would be easier to just open the file,
examine the contents to determine the mime-type, close it, and then open
it again later when we extract the contents.
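
To make that concrete, here is a rough sketch of the open/peek/close,
then re-open approach.  It is in Python rather than Beagle's C#, and the
helper names (sniff_header, is_gzip, extract_text) are made up for
illustration; checking only the gzip magic bytes is my simplifying
assumption, not what the mime type code in CVS does.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def sniff_header(path, size=256):
    """Open the file, read a few leading bytes, and close it again."""
    with open(path, "rb") as f:
        return f.read(size)


def is_gzip(path):
    """Type check based only on the leading bytes (hypothetical helper)."""
    return sniff_header(path, 2) == GZIP_MAGIC


def extract_text(path):
    """Re-open the file later, picking the opener based on the sniffed type."""
    opener = gzip.open if is_gzip(path) else open
    with opener(path, "rb") as f:
        return f.read().decode("utf-8", errors="replace")
```

The file gets opened twice, but no stream has to be kept alive or
wrapped between the two steps.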

In IndexableCompressedFile, you don't want to hold a reference to the open
stream.  Beagle queues up the indexables and dumps them into the index
in batches, so they shouldn't hold system resources like file
descriptors --- otherwise you'd almost certainly run out of file
descriptors and crash while crawling a tree with lots of compressed
files (like /usr/share/man, for example).  You want to emulate
IndexableFile's behavior here: carry around the path in a string, and
re-open the file in DoBuild (which gets called right before the
indexable is actually indexed).
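
Roughly, the pattern looks like the sketch below.  Again this is Python
rather than C#, and everything in it is a stand-in: the class name, the
do_build method (mirroring DoBuild) and index_batch (a toy version of
the batching) are assumptions for illustration, not Beagle's actual
code.

```python
import gzip


class CompressedFileIndexable:
    """Carries only the path while queued; holds no file descriptor."""

    def __init__(self, path):
        self.path = path   # plain string, cheap to keep around in a queue
        self.text = None   # filled in by do_build()

    def do_build(self):
        # Open the file only now, right before indexing, and close it
        # again as soon as the contents have been read.
        with gzip.open(self.path, "rt", encoding="utf-8",
                       errors="replace") as stream:
            self.text = stream.read()


def index_batch(indexables):
    """Toy stand-in for the indexer: build each item just before use."""
    results = []
    for item in indexables:
        item.do_build()
        results.append(item.text)
    return results
```

With this shape, at most one file is open at any moment, no matter how
many indexables are sitting in the queue.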

If IndexableCompressedFile ends up mostly duplicating code in
IndexableFile, it might make sense to merge them and just add compressed
file support to IndexableFile.  

In my tests, Gnome-vfs says that compressed man pages are of type
application/x-troff, not application/x-troff-man.  Is there a way to
reliably identify a man file?  It might make sense to instead just have
a general troff filter that was optimized for the case of man pages.
But then again, it might not. :)  I don't really know enough about troff
to say for sure.  (How many non-man troff files are floating around on a
modern Linux system anyway?)

Thanks,
-J








