Re: Beagle handling of compressed files and man pages

On Mon, 2004-06-07 at 09:34, Jon Trowbridge wrote:
> Michael,
> First of all, thanks for your patch!  Good handling of compressed files
> will be a nice addition to the crawler, and I'm very pleased that you
> are working on it.


> I just committed slightly tweaked versions of your mime type discovery
> improvements to CVS.  In gnome.cs, I renamed GuessMimeType to
> GetMimeTypeFromData.

OK, great.
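(For anyone following along: the project code is C#, but the idea behind sniffing a mime type from leading bytes is easy to sketch. Everything below is illustrative, not the actual gnome-vfs table; the magic numbers for gzip, bzip2 and zip are the standard ones.)

```python
# Hypothetical sketch of magic-byte mime sniffing, in the spirit of
# what GetMimeTypeFromData does via gnome-vfs.  The table here is
# illustrative, not the real gnome-vfs database.
MAGIC = [
    (b"\x1f\x8b",   "application/x-gzip"),   # gzip magic bytes
    (b"BZh",        "application/x-bzip"),   # bzip2 magic bytes
    (b"PK\x03\x04", "application/zip"),      # zip local file header
]

def get_mime_type_from_data(head: bytes) -> str:
    """Return a mime type guessed from the first bytes of a file."""
    for magic, mime in MAGIC:
        if head.startswith(magic):
            return mime
    return "application/octet-stream"
```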

> Here are a few thoughts on the rest of the patch, in no particular
> order:
> Do we really need to introduce the PeekableStream?  It seems to
> introduce a lot of complexity, and I'm not quite sure what we are
> getting in return.  It probably would be easier to just open the file,
> examine the contents to determine the mime-type, close it, and then open
> it again later when we extract the contents.

I wondered the same thing when I wrote it. I agree that it is added
complexity which may not be needed (unless we also want to handle things
that aren't files). The reason I decided to include it anyway was that I
thought it might be useful for handling Zip and Tar files (i.e.
"archives" containing multiple files): it would avoid having to re-open
the file and seek to the right entry. Given your next comment, though,
this may not be a good idea to start with ;)

My question then becomes how to "encode" an entry name in a URI or path.
If we are passing paths around then we need a way to identify the
specific entry of a multi-file archive. I may just be showing my
ignorance, but I don't know how to do this. Any ideas? For Gzip and Bzip
the point is moot (I think), but for Zip and Tar we'll probably want
something. If anyone has ideas, I'd gladly remove PeekableStream.cs and
change IndexableCompressedFile. (Would something like
"file:///path/to/file.gz?entry=entry/path/and/file/name" do the trick,
in your opinion?)
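To make the `?entry=` idea concrete, here's a quick Python sketch (the project is C#; the URI scheme here is my proposal, not anything Beagle does today). The one wrinkle is that the entry name must be percent-escaped so slashes inside it survive the query string:

```python
from urllib.parse import quote, urlparse, parse_qs

def archive_entry_uri(archive_path, entry):
    """Build a file: URI naming one entry of a multi-file archive.
    The entry is percent-escaped so '/' inside it survives the query."""
    return "file://%s?entry=%s" % (archive_path, quote(entry, safe=""))

def split_archive_entry_uri(uri):
    """Recover (archive path, entry name) from such a URI."""
    parsed = urlparse(uri)
    entry = parse_qs(parsed.query).get("entry", [None])[0]
    return parsed.path, entry
```

The escaping makes the round trip unambiguous even for entries containing `/`, `?` or spaces.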

> In IndexableCompressedFile, you don't want to hold a reference to the open
> stream.  Beagle queues up the indexables and dumps them into the index
> in batches, so they shouldn't hold system resources like file
> descriptors --- otherwise you'd almost certainly run out of file
> descriptors and crash while crawling a tree with lots of compressed
> files (like /usr/share/man, for example).  You want to emulate
> IndexableFile's behavior here: carry around the path in a string, and
> re-open the file in DoBuild (which gets called right before the
> indexable is actually indexed).

The filehandle point is, of course, quite valid.

> If IndexableCompressedFile ends up mostly duplicating code in
> IndexableFile, it might make sense to merge them and just add compressed
> file support to IndexableFile.  

Yup. It makes sense to refactor.

> In my tests, Gnome-vfs says that compressed man pages are of type
> application/x-troff, not application/x-troff-man.  Is there a way to
> reliably identify a man file?  It might make sense to instead just have
> a general troff filter that was optimized for the case of man pages.
> But then again, it might not. :)  I don't really know enough about troff
> to say for sure.  (How many non-man troff files are floating around on a
> modern linux system anyway?)

I don't speak troff (or man for that matter) ;) I Googled around and
found that a properly formed man page is allowed to start with comments
(.\" as the first characters), and that the first non-comment line must
be \.TH followed by 5 fields (as described in the commented regexp in
FilterMan.cs). I noticed that the man pages on my system don't all have
the 5 fields (hence the relaxed regexp actually used). I haven't been
exhaustive, but in my quick survey I didn't find any man pages which
don't start with \.TH. I have no troff files which aren't man pages (to
my knowledge at least), which might answer your last question. Anyone
else have input? I had planned to add some sniffing for a \.TH line to
help identify man pages.
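The sniffing could look something like this (a Python sketch of the idea; the relaxed regexp in FilterMan.cs is the C# original, and the pattern below is my own loose approximation, not a copy of it):

```python
import re

# Relaxed check: skip leading .\" comment lines and blank lines, then
# insist the first real line is .TH followed by at least a title and a
# section (rather than all 5 fields, which not every page has).
TH_LINE = re.compile(r'^\.TH\s+\S+\s+\S+')

def looks_like_man_page(text):
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith('.\\"'):
            continue  # comment or blank line before the header
        return bool(TH_LINE.match(line))
    return False
```

A generic troff document that opens with some other request (.pp, .de, ...) would fail the check, which is exactly the man-vs-troff distinction being asked about.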


> Thanks,
> -J
