RE: Nautilus, metadata and extended attributes



On Thu, 2004-01-29 at 17:32, Xavier Bestel wrote:

> Ah well ... not really:
> 1) Only the first read matters, otherwise nautilus (or something else)
> will cache the mime-type (or the first bytes) by itself anyway. Just try
> opening nautilus for the first time on a crowded directory, then close
> and reopen it. Feel the difference. So no point here.

Once EAs are widespread, the probability that an EA is cached will be
higher than the probability that the file's contents are.  Meanwhile,
the odds of EAs being cached are the same as for the data currently
used to determine file types.  But read on.


> 2) Uh ?

If you don't know, that's okay.  EAs make lots of things possible.  E.g.
you could use EAs to store file notes, descriptions, unique file
identifiers (to track your ever-moving MP3 collection, from unsorted to
sorted), song rankings, links to covers, personalized icons, and much
more.  All of this info is expected to be read upon directory read, so
placing MIME types in EAs makes sense.  To tell you the truth, plain
files could also be used, but a) there are problems with locking; b)
security problems arise; c) consistency problems arise.  EAs are the
architecturally correct, trustworthy solution to all the problems of
older, non-integrated approaches to metadata.


Think of all the wonderful things Mac OS can do with files, and you
begin to comprehend how useful EAs become.
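
To make this concrete, here's a minimal sketch of how an application
could attach that kind of metadata today on Linux, using the user. EA
namespace.  The attribute names (user.note, user.rank) are made up for
the example; nothing standardizes them yet:

/* Sketch: attach a note and a ranking to a song file via EAs on Linux.
 * The attribute names "user.note" and "user.rank" are illustrative only;
 * nothing standardizes them (yet).  Build with: gcc -o tagsong tagsong.c */
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>      /* setxattr(); some systems use <attr/xattr.h> */

int main(int argc, char *argv[])
{
    const char *note = "Live version, better than the studio take";
    const char *rank = "5";

    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    /* "user." is the EA namespace unprivileged processes may write to,
     * provided the filesystem is mounted with EA support. */
    if (setxattr(argv[1], "user.note", note, strlen(note), 0) == -1)
        perror("setxattr user.note");
    if (setxattr(argv[1], "user.rank", rank, strlen(rank), 0) == -1)
        perror("setxattr user.rank");

    return 0;
}

(setfattr and getfattr from the attr package do the same job from the
shell, if you just want to play with the idea.)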


> 3) When you read one single byte on disk, you in fact read a bunch of
> them (how many exactly depends on the drive, driver, etc.). Moreover,
> the cost of seeking to this byte is so high that the "system" will read
> a bunch more sectors (many many bytes) just in case they're needed
> later. No point there too.

You've just confirmed what I said in my last email.  Thanks to
readahead, reading EAs would be faster than reading the files
themselves to sniff their contents.

Nowadays, Nautilus readdir()s a directory and uses either the file
contents or the file extension to infer the file type (note I say INFER,
not determine; the file type is already determined).  Reading file
contents is *dog slow* (relatively).  Using file extensions is
inaccurate, can lead to security compromises, creates usability
problems, and is plain wrong (the file type shouldn't be a function of
the file name), but it's fast.


Oh, well, let's move on:

If every (every!) application saved files WITHOUT extensions (naysayers,
please hold on to your hats because this might be too radical for you),
and stored the file type in a special, standard, agreed-upon
"file", you would have a reliable way of determining file types (and
perhaps many other things, such as file notes, icons, etcetera).  Wanna
know the file type?  Just look in the "filetype file".

The problem is that you would all have to actually agree 100% on
where (location) and how (format) to store that information, and solve
all the technical hurdles:

* does my current user have permission to save to the filetype file?
* what if another process is currently writing to the file?
* what if some user moves his files around?
* how does the filetype file get updated?

Now you begin to understand the complexity of the problem.

It's a fact of life that we associate data, and one consequence of this
is that we tend to label, categorize and tag.  These tags, labels and
categories are called "metadata", and one way to categorize and sort
files is to define file types.  File types *belong* in the system used
to categorize and tag files: the metadata.  File types literally are
*data about data*: by definition, metadata.  As long as we are human,
metadata will always exist, and you can't do anything to get rid of it.
It's a byproduct of thought, and a useful one.

You need to store it.  Period.  Now, how?  Or perhaps, how best?

The technical solution is to devise a simple, transactional, atomic way
of storing information *alongside* files.  Atomic to guarantee
consistency, and "alongside files" so that the information (file types,
attached notes, etcetera) goes WITH the files when they are moved.  Then
every application can agree on where to store file types, because the
system provides a clean, secure, robust and uniform way to do so.

But who will ever write such a piece of software?  How will it take
form?  Is it a daemon?  Is it a service?  Is it a command that runs
every night?

No.  Due to the properties of such a system, it has to work inside the
kernel, ensuring that when files are moved, copied and so on, the
information stored alongside them (the metadata) goes with them.  Any
other system, outside the kernel, would have to follow the user's
actions and files like a nanny follows a naughty baby.

So EAs were defined, standardized, and developed.  Today, most file
systems available for Linux and Solaris support EAs.  In fact, every
modern system supports them in one way or another (NT calls them
streams; the Mac has the resource fork, of which there is only one per
file, and Apple has standardized ways to store things in it).  They
exist because there is a valid need for them.  E.g. you could label a
folder with a "sticky note" so that when your coworker opens the
folder, he sees the note.

...Rewind 20 years...

Since there was no system to store "metadata" when MS-DOS users
multiplied (and perhaps even before), people started using file
extensions: .DOC, .XLS, .MP3.  It has become so ingrained that nowadays
people think of extensions as the real deal, the "file types",
unknowingly stretching the definition.

To everyone who is interested, here's the memo: file extensions are NOT
file types.  NOT.  NOT.  Whoever uses the terms "file extension" and
"file type" interchangeably is an ignorant moron.  File extensions are
just one way (and a terrible way at that) to distinguish file types.
The fact that they are a bad thing shows up in inconsistencies (two
files apparently named the same, because extensions are hidden) and
mass-mailer viruses (a file posing as a zip file is actually an
executable).  I have said it and I'll say it again: using extensions to
do anything meaningful is a terrible design decision and should be
abandoned ASAP.  Sure, they met the need of discerning file types for
MS-DOSers and Windows'ers, for a while.

Perhaps the fact that even Hollywood has used them in movies shows how
bad they are =).

...return to 2004...

Today, we're mostly stuck using either the extension or the contents of
a file to discern its type.  Since the file type is a function of the
file contents, it's logical that the description of the type of the
file be stored alongside the file.  But not directly in the file,
because the file type is not data that belongs in the file: it's data
about the file.  Metadata again.

But now we have the metadata store that MS-DOS didn't have.  A metadata
store that is standard, will get augmented with search technology, and
is guaranteed to be stable.  Someone has written it for us.  We didn't
have to agree at all.  Well, we have to agree to start using it now, but
that's about it.

I think it's about time that we started taking advantage of it.

Since Nautilus and Konqueror (the two most prominent file managers) both
play such a central role, their support is crucial.  This is the way it
could work:

1) Applications start tagging each file with an EA entry (tentatively
named mimetype) that contains the MIME type of the file.  Portable
libraries would need to be written (libmimetype?  libmetadata?  a
"libmimetype" above "libmetadata"?  KFileMetaInfo extensions?) to
automate this job and even deduce and apply more-or-less functional
fallbacks if EAs can't be used.

   (this has to be easy for devels, perhaps a one-liner in C)
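
   Something along these lines is what I have in mind; mime_tag_file()
   and the user.mimetype attribute name are placeholders for whatever
   such a library would actually standardize:

/* Sketch of the one-liner a hypothetical "libmimetype" could wrap.
 * mime_tag_file() and the EA name "user.mimetype" are assumptions,
 * not an existing API. */
#include <string.h>
#include <sys/xattr.h>

/* Record the MIME type of `path` in a user.mimetype EA.  Returns 0 on
 * success, -1 on failure (e.g. a filesystem without EA support), which
 * is where the library's fallbacks would kick in. */
static int mime_tag_file(const char *path, const char *mimetype)
{
    return setxattr(path, "user.mimetype", mimetype, strlen(mimetype), 0);
}

int main(int argc, char *argv[])
{
    /* e.g.:  ./tagmime report.pdf application/pdf */
    return (argc == 3 && mime_tag_file(argv[1], argv[2]) == 0) ? 0 : 1;
}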

2) File managers start tagging files with the file type they determine
for each file on the first visit to its directory.  From that point on,
file managers never again use the file contents or extension to
ascertain the file type, but instead directly use the EA mimetype
entry.  For those who have read about WinFS, this is akin to
"promoting" files into WinFS.
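
   Here is a sketch of what the lookup could look like on the file
   manager side, again assuming the user.mimetype name, with a dummy
   sniffer standing in for whatever magic-based sniffing the file
   manager already does:

/* Sketch: consult the proposed user.mimetype EA first, and only sniff
 * (then "promote" the file by caching the result in the EA) when the
 * attribute is missing or EAs are unsupported. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Placeholder for the sniffer a file manager already has
 * (gnome-vfs magic, KDE's KMimeType, file(1)-style tests, ...). */
static const char *sniff_mime_type(const char *path)
{
    (void)path;
    return "application/octet-stream";  /* real code would read magic bytes */
}

static const char *get_mime_type(const char *path, char *buf, size_t buflen)
{
    ssize_t len = getxattr(path, "user.mimetype", buf, buflen - 1);
    if (len >= 0) {
        buf[len] = '\0';
        return buf;                     /* fast path: already promoted */
    }

    /* First visit, or no EA support on this filesystem: sniff and promote. */
    const char *mime = sniff_mime_type(path);
    if (mime != NULL)
        setxattr(path, "user.mimetype", mime, strlen(mime), 0);
    return mime;
}

int main(int argc, char *argv[])
{
    char buf[256];
    if (argc < 2)
        return 1;
    printf("%s: %s\n", argv[1], get_mime_type(argv[1], buf, sizeof(buf)));
    return 0;
}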

This has advantages:

1) OpenOffice documents won't ever be detected again as Zip files =)
2) Using EAs as file type stores is much faster than sniffing each file
(I concede extensions are even faster, but since they are BAD, they are
disqualified).
3) New things become possible (file notes? icons? etcetera... it is only
a matter of standardizing them)
4) Downloading a file from the Internet isn't a problem either (the MIME
type is transferred with the most frequently used protocols, and the
file managers can "promote" the status of the file whenever they "see"
it for the first time).

For 2005:

* Faster Linux file managers
* More accurate file types (since every app writes file types along with
files, there's much more certainty)
* New features (per-file icons, notes on files, ACLs)
* Medusa-like/Storage-like search services that take advantage of EAs
(instead of sniffing files for metadata, the EA store could become the
primary source for it, and when files are moved around or sent over the
Internet, the metadata is reintegrated back into the file itself
whenever the format allows it, as with MP3 files)
* Linux leading the pack

I see this plan as the way to stop importing idiocy and start exporting
innovations.

> 
> > Once you've determined the file type and stored it in an EA, subsequent
> > reads would be faster than sniffing the files, for all the
> > aforementioned reasons.
-- 
	Manuel Amador (Rudd-O)
	GPG key ID: 0xC1033CAD at keyserver.net



