Suggestion for file type detection approach



Hello,

I've spent some time today thinking about the costs and benefits of the
two approaches GNOME currently uses to determine the MIME type of a file.

I've also run some experiments to measure the impact that file sniffing
has on Nautilus performance. The impact is impressive. See below.

A comparison:

DETECTION BY SUFFIX
 1. Gives wrong results on invalid input (files with the wrong
    suffix)
 2. Fails to determine the type of files without a suffix, e.g.
    README, COPYING, mbox
 3. Very fast
 4. Easily customizable by users (e.g. adding new file types)
 5. Generates little disk I/O
 6. Code is simple and lightweight. Supporting a new type needs
    no new code

DETECTION BY SNIFFING
 1. Gives wrong results on _valid_ input (files with the correct
    suffix and funny contents)
 2. Detects the file type regardless of suffix
 3. Too slow to be used on many files at once
 4. Unlikely to be customized by users
 5. Generates a lot of disk I/O, since it needs to open the file
 6. Code is complicated and not lightweight. As an example,
    GNOME-VFS includes MP3 detection code.

Currently, these two approaches are combined in the Nautilus directory
listing in a somewhat unclear manner. There seems to be some priority
mechanism that decides whether the type of a file is determined by
content or by suffix. However, the content is always read and tested.

Additionally, there are proposals to implement some kind of fallback
that tests the contents of a file only when its type cannot be
determined by suffix.
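
A minimal sketch of such a fallback, for illustration only. It is not
GNOME-VFS code: the two tables and the names mime_from_suffix,
mime_from_contents and guess_mime are hypothetical, and a real magic
database would be far larger. The point is that the file is opened only
when the suffix gives no answer.

```c
#include <stdio.h>
#include <string.h>
#include <strings.h>  /* strcasecmp (POSIX) */

#define UNKNOWN_MIME "application/octet-stream"

/* Hypothetical suffix table; a real one would be user-editable data. */
static const struct { const char *suffix; const char *mime; } suffix_table[] = {
    { ".zip", "application/zip" },
    { ".png", "image/png" },
    { ".txt", "text/plain" },
};

/* Hypothetical magic table: byte signatures expected at offset 0. */
static const struct { const char *magic; size_t len; const char *mime; } magic_table[] = {
    { "PK\003\004", 4, "application/zip" },
    { "\211PNG\r\n\032\n", 8, "image/png" },
};

/* Cheap path: looks only at the base name, never touches the disk. */
static const char *mime_from_suffix(const char *base_name)
{
    const char *dot = strrchr(base_name, '.');
    size_t i;

    if (dot == NULL || dot == base_name)
        return NULL;  /* no suffix, or a dotfile such as ".bashrc" */
    for (i = 0; i < sizeof suffix_table / sizeof suffix_table[0]; i++)
        if (strcasecmp(dot, suffix_table[i].suffix) == 0)
            return suffix_table[i].mime;
    return NULL;
}

/* Expensive path: opens the file and compares its first bytes. */
static const char *mime_from_contents(const char *path)
{
    unsigned char buf[16];
    size_t n, i;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return NULL;
    n = fread(buf, 1, sizeof buf, fp);
    fclose(fp);
    for (i = 0; i < sizeof magic_table / sizeof magic_table[0]; i++)
        if (n >= magic_table[i].len &&
            memcmp(buf, magic_table[i].magic, magic_table[i].len) == 0)
            return magic_table[i].mime;
    return NULL;
}

/* Suffix first; sniff only when the suffix says nothing. */
const char *guess_mime(const char *path)
{
    const char *slash = strrchr(path, '/');
    const char *mime = mime_from_suffix(slash ? slash + 1 : path);

    if (mime == NULL)
        mime = mime_from_contents(path);
    return mime ? mime : UNKNOWN_MIME;
}
```

With this ordering, a directory full of files with known suffixes costs
no extra disk reads; only files like README or mbox are opened.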

IMO, we could think a bit more and combine these two approaches in a
way very different from simply doing both things when reading the
directory.

Today I ran some tests to check the impact of sniffing on
GNOME/Nautilus performance, and I must confess I am very impressed. I
installed GARNOME and modified GNOME VFS 2.5.3 to disable sniffing. I
have some directories with thousands of 1~2MB files, so I was able to
measure the time that Nautilus takes to show these folders with and
without sniffing. It's a simple test. You can do it yourself in minutes.

After researching a bit and understanding how the VFS system works, I
modified the "modules/file-method.c" file, line 562:

I changed:

   mime_type = gnome_vfs_get_file_mime_type (full_name,
                       stat_buffer,
                       (options & GNOME_VFS_FILE_INFO_FORCE_FAST_MIME_TYPE) != 0);

To: 

   mime_type = gnome_vfs_get_file_mime_type (full_name,
                       stat_buffer, TRUE);

According to "libgnomevfs/gnome-vfs-mime.c", the signature of this
function is:

   gnome_vfs_get_file_mime_type (const char *path,
                       const struct stat *optional_stat_info,
                       gboolean suffix_only)

This is *obviously not* the solution, but I changed it this way to make
sure that Nautilus would never do sniffing while I was testing. 

My simple testbed was this folder:

/home/fabiofb/emu/smd/roms: This directory has 252 files varying from
512K to 2MB. The average is 1MB. There are ZIP, binary (unknown to
nautilus magic) and text files.

Using a simple stopwatch, I tested Nautilus 2.5.3 with and without
sniffing. I rebooted the machine between tests to ensure that the
disk cache did not skew the results. I also ran each test multiple
times. The precision is poor because I had to press the stopwatch
button manually, but with such a difference, precision hardly matters:

with sniffing    : 21 seconds
without sniffing : less than one second
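
For anyone who wants numbers without a stopwatch, here is a rough sketch
of a timing helper. It does not measure Nautilus itself; it only lists a
directory and, optionally, reads the start of every entry, which
approximates the extra I/O that sniffing needs. The name time_listing
and the 4 KB read size are assumptions made for this example, and the
disk cache still has to be cold for the comparison to mean anything.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>
#include <time.h>

/*
 * Lists `dir`. When `sniff` is nonzero, also opens each entry and reads
 * up to 4 KB from it. Returns the number of entries visited (including
 * "." and ".."), or -1 if the directory cannot be opened. On success the
 * elapsed wall-clock time is stored in *seconds.
 */
long time_listing(const char *dir, int sniff, double *seconds)
{
    struct timespec t0, t1;
    struct dirent *ent;
    char path[4096], buf[4096];
    long count = 0;
    DIR *d;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    d = opendir(dir);
    if (d == NULL)
        return -1;
    while ((ent = readdir(d)) != NULL) {
        count++;
        if (sniff) {
            FILE *fp;

            snprintf(path, sizeof path, "%s/%s", dir, ent->d_name);
            fp = fopen(path, "rb");
            if (fp != NULL) {
                size_t n = fread(buf, 1, sizeof buf, fp);
                (void)n;  /* only the I/O cost matters here */
                fclose(fp);
            }
        }
    }
    closedir(d);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *seconds = (double)(t1.tv_sec - t0.tv_sec)
             + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return count;
}
```

Calling it once with sniff set to 0 and once with sniff set to 1 on the
same directory, with a reboot in between, gives the two figures to
compare.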

I saw a similar difference in many folders on my machine, including
/lib, /usr/lib, etc.

My computer is a Duron-950 with 256 MB of RAM and a quite fast IDE hard
disk.

Given the performance bottleneck imposed by sniffing, I suggest that it
no longer be used in directory listing routines. Instead, it should be
used when the user tries to open an unknown file. Let's imagine this case:

- When listing a directory, the system cannot detect the MIME Type of
'my-spreadsheet' by its suffix, so the file gets
"application/octet-stream".

We could exploit the fact that unknown files have this MIME type by
associating a file type detection utility with it. Let's call it
gnomemagic.

- The user double-clicks the file. "application/octet-stream" is
associated with 'gnomemagic'. So gnomemagic is run and displays a dialog
such as:

-------------------------------------------------------
Unknown File Type

The system was unable to determine the type of this file by its name.
Analysing its contents, it looks like a file of type "Gnumeric
Spreadsheet" (application/x-gnumeric).

What do you want to do? 

[ ] Rename the file, appending the ".gnumeric" suffix to match its type

[x] Open the file with [Gnumeric_______][v] (dropdown/combobox)
      [ ] Configure the system to always open unknown files that 
          look like "Gnumeric Spreadsheet" using this application
 
[ ] Configure the system to always open unknown files with the most
    probable associated application, when one is available

[ CANCEL ] [ OK ]
-------------------------------------------------------
Note: [ ] = checkbox
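
The first option in the dialog (renaming the file) could be sketched
roughly as follows. This is only an illustration: suggest_name and the
preferred_suffix table are hypothetical, and only the Gnumeric entry
comes from the example above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical table mapping a sniffed MIME type to its usual suffix. */
static const struct { const char *mime; const char *suffix; } preferred_suffix[] = {
    { "application/x-gnumeric", ".gnumeric" },
    { "application/zip",        ".zip" },
};

/*
 * Writes `name` followed by the preferred suffix for `mime` into `out`.
 * Returns 0 on success, or -1 if the type has no known suffix or the
 * result does not fit in `out_size` bytes.
 */
int suggest_name(const char *name, const char *mime, char *out, size_t out_size)
{
    size_t i;

    for (i = 0; i < sizeof preferred_suffix / sizeof preferred_suffix[0]; i++) {
        if (strcmp(mime, preferred_suffix[i].mime) == 0) {
            int n = snprintf(out, out_size, "%s%s", name,
                             preferred_suffix[i].suffix);
            return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
        }
    }
    return -1;
}
```

For 'my-spreadsheet' sniffed as application/x-gnumeric, this yields
"my-spreadsheet.gnumeric", after which detection by suffix works on its
own.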

'gnomemagic' could be a separate GNOME package. This could ease the
maintainability of the database, allowing user contributions worldwide.
We could provide a website where users can post magic for new file
types. Such magic should be submitted to testing and certification
according to some guidelines.

One cool thing about 'gnomemagic' is that it could be run by
applications after unsuccessfully trying to open invalid, corrupted or
unknown files.

This entire approach would allow GNOME-VFS to forget about sniffing,
making the life of maintainers easier, improving performance and
eliminating (most) unexpected results.

If this idea makes some sense, we can start a more elaborate study. I
would be glad to participate.

Now I am going to my girlfriend's house, where her mother is preparing
endless food. :-)

Thanks for your attention.
 
-- 
Fabio Gomes de Souza <fabio gs2 com br> (+55 81 9127-0597)

.- GS2 TECNOLOGIA DA INFORMACAO LTDA :: www.gs2.com.br
|- IT Infrastructure :: Security :: Embedded systems :: Linux
`- Olinda, Brazil - +55 81 3492-7777 - negocios gs2 com br




