Re: Suggestion for file type detection approach



On Fri, 2004-01-02 at 17:46, Edward Jay Kreps wrote:
> >Given the performance bottleneck imposed by sniffing, I suggest that it
> >not be used anymore in directory listing routines. It should be used
> >when the user tries to open an unknown file. Let's imagine this case:
> 
> I don't think this is a good way of thinking about things.  The questions are: 
> 1. Is sniffing a good idea?  
> 2. If so, does it work correctly?
> 3. If (1) and (2), is it performing fast enough?
> 
> I think others have argued persuasively that sniffing is a good idea
> since unix doesn't generally give file name suffixes to files (even
> though gnome does), and often files have incorrect suffixes.
> 
> A number of people have said that there are instances where sniffing is
> not correctly determining the file type.  If true, this is an excellent
> argument for fixing those cases, but not at all an argument for throwing
> away sniffing altogether.

I've only seen one person say that, and as I understand it, no real
information was provided beyond a comment of "it broke on this one
file." But yes, that is a good argument for improving the system, not
for removing the core functionality.

> From your benchmarking it is clear that (3) is a problem, and sniffing
> is taking too long.  This is not an argument for getting rid of it
> though, just an argument for speeding it up.  Someone suggested running
> it through a profiler, but I doubt that will be worthwhile--the problem
> is almost certainly the multiple disk accesses (your disk is having to
> seek for each file).  As others have pointed out, a two pass technique
> based on extension and then sniffing is also a bad idea since icons,
> etc. would change in the case of a discrepancy.

It is not clear that (3) is a problem. It is only possible that (3) is
an issue on a certain machine with a certain configuration. I guarantee
that the sniffing is not the bottleneck. Running it through a profiler
will be much more worthwhile than sending mail to a list saying "it's
slow." And yes, I am the one who suggested profiling.

The disk I/O is most certainly not the bottleneck. Given the speed of
hard disks today, seek time is not an issue. Even if every file took the
12 ms maximum seek time of a newer hard disk, loading a directory with
1000 files in it would only cost 12 seconds of seeking in the absolute
worst case. Given the size of the cache on newer hard disks, and the
fact that people generally open the same few folders, rather than
opening / and traversing the tree looking for things, or opening odd
folders at random, I would guess the actual seek time is closer to 3-4
milliseconds per file. It is much more likely some other problem,
specific to what Nautilus is doing to display the list of files.

I also suggested things other than profiling, such as writing specific
benchmarks to compare test-mime from gnome-vfs against file, which would
be much more useful than saying "Nautilus must be slow because echo * is
instantaneous" or other such nonsense. Real benchmarks are much more
reliable than "it seemed slow" or "I used a stopwatch", since human
perception and error can easily misjudge how long something actually
took.
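
To be concrete about what I mean by a real benchmark, here is a rough
sketch, in plain POSIX C with nothing gnome-vfs specific, of a little
program that times how long it takes to open every regular file in a
directory and read the first block, which is roughly the on-disk work
sniffing has to do. The 4 KB buffer size is just a placeholder, not what
gnome-vfs actually reads:

/* sniffbench.c -- time opening and reading the first 4 KB of every
 * regular file in a directory, roughly what sniffing costs on disk.
 * Build with: cc -o sniffbench sniffbench.c */
#include <stdio.h>
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>

int
main (int argc, char **argv)
{
	DIR *dir;
	struct dirent *ent;
	struct timeval start, end;
	char path[4096], buf[4096];
	int fd, count = 0;
	double elapsed;

	if (argc != 2) {
		fprintf (stderr, "usage: %s directory\n", argv[0]);
		return 1;
	}

	dir = opendir (argv[1]);
	if (dir == NULL) {
		perror ("opendir");
		return 1;
	}

	gettimeofday (&start, NULL);

	while ((ent = readdir (dir)) != NULL) {
		struct stat st;

		snprintf (path, sizeof (path), "%s/%s", argv[1], ent->d_name);
		if (stat (path, &st) != 0 || !S_ISREG (st.st_mode))
			continue;

		/* open the file and read the first block, like a sniffer has to */
		fd = open (path, O_RDONLY);
		if (fd < 0)
			continue;
		if (read (fd, buf, sizeof (buf)) < 0)
			perror (path);
		close (fd);
		count++;
	}

	gettimeofday (&end, NULL);
	closedir (dir);

	elapsed = (end.tv_sec - start.tv_sec)
		+ (end.tv_usec - start.tv_usec) / 1e6;
	printf ("read the first %zu bytes of %d files in %.3f seconds\n",
		sizeof (buf), count, elapsed);

	return 0;
}

Comparing the numbers from something like that against how long Nautilus
takes on the same directory, and against what a profiler says, would
tell us far more than a stopwatch.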

> Sniffing is slow because it opens every file and reads some of it every
> time you open a given directory. If you want to make this fast, cache
> filetypes; now opening the huge mp3 folder is just a matter of reading a
> single cache file and sniffing those files with a modification time
> later than that of the cache file.  Naturally this would only need to be
> done for those really huge directories; it would probably be a waste for
> directories with only a hundred files or fewer.

Sniffing is not slow because it opens every file and reads some of it.
It may be slow, though, if you end up opening network mounts, since all
of the stat()s are network-bound, which is much slower than local hard
disk access. There can surely be improvements in speed for network-bound
I/O in gnome-vfs, and improvements in file type detection as well, since
almost all web server installations on the Internet are broken. They
also generally use filename extensions to determine what to send in the
Content-Type: header, which is what gnome-vfs currently uses.
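
For what it is worth, the modification-time check behind the cache idea
quoted above is simple enough to sketch. This assumes a hypothetical
per-directory cache file named ".mimecache", which is not anything
gnome-vfs actually has; it is only an illustration of the check Edward
describes:

#include <stdbool.h>
#include <sys/stat.h>

/* Returns true if 'path' would need to be sniffed again, i.e. it was
 * modified after the (hypothetical) cache file was written, or the
 * cache file does not exist yet. */
static bool
needs_resniff (const char *path, const char *cache_path)
{
	struct stat file_st, cache_st;

	if (stat (cache_path, &cache_st) != 0)
		return true;	/* no cache yet: sniff everything */
	if (stat (path, &file_st) != 0)
		return true;	/* can't stat it: let the sniffer decide */

	return file_st.st_mtime > cache_st.st_mtime;
}

Of course, on a network mount each of those stat()s is still a round
trip, so whether it actually helps is again something to benchmark
rather than guess at.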

> I don't think we need to worry about which approach is ultimately going
> to perform faster. For a program like Nautilus either there is or is not
> a human-noticeable lag time; improving performance when there is no lag
> time is totally pointless.

Aye. In general, the speed issues have nothing to do with the way the
mime type is detected. Any claim that one approach is substantially
faster than the other is generally due to misperception. The real issues
generally seem to be at a lower level, or to be totally unrelated, such
as the problems with some of the thumbnailers.

> A number of people have used Windows as an example of why we don't need
> to sniff files; though there are a number of features in Windows worth
> copying this is definitely not one of them.  I have worked on some
> commercial software and the number one frivolous bug report or unfixable
> user issue occurs when the user attempts to open a file that has an
> incorrect filename extension.  People on this list have suggested that
> this is a user problem not a software problem (i.e. that the user was
> stupid and beyond help), but I can assure you that however obvious the
> connection between the hidden Windows filename extension and the error
> message that our program gave is to me and you, it was not obvious to a
> large number of otherwise very intelligent people who just weren't as
> knowledgeable about computers.  People always think that the icon for a
> file is somehow part of the file (it makes sense if you don't think
> about it too hard), and so if a file has, say, a jpeg icon it doesn't
> occur to them that it is not a jpeg.

Indeed.

-- dobey




