Re: Request: Test suite for EFS.




[Michael: comments on OLE2 at the end]

> Disabling searching to avoid parsing the whole file sounds lame to
> me. XML is definitely structured. Storing images in it should not be a
> problem, see the RFC 2397 for one way to do it. Storing embedded
> objects should be no problem either as long as they serialize to
> XML. XML is perfectly happy to let you use multiple DTDs in one file.

People from the Windows world are used to multi-megabyte files.  Some
of the Gnumeric test cases for Excel loading are pretty large.

If we use XML exclusively, I wonder who will be the brave soul
scanning a directory of XML files for information.  Consider a few
hundred files on a server, and you are looking for documents that have
been edited by "Maciej" at some point in their life.

I can picture the disk IO going up, the memory usage going up, and the
time going up.

Can you picture a way in which this could be solved with XML?
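
For concreteness, this is roughly what that scan looks like even with
a streaming parser (a Python sketch; the <author> element name and the
directory are assumptions, a real format would define its own metadata
markup).  Every file still has to be opened and fed through the parser
from the top, even if we bail out at the first author field:

    import os
    import xml.sax

    class _FoundAuthor(Exception):
        """Raised to stop parsing as soon as the author is known."""
        def __init__(self, name):
            self.name = name

    class _AuthorHandler(xml.sax.ContentHandler):
        def __init__(self):
            super().__init__()
            self._inside = False
            self._buf = []
        def startElement(self, name, attrs):
            if name == "author":            # assumed metadata element
                self._inside = True
        def characters(self, data):
            if self._inside:
                self._buf.append(data)
        def endElement(self, name):
            if name == "author":
                raise _FoundAuthor("".join(self._buf))

    def author_of(path):
        try:
            with open(path, "rb") as fh:
                xml.sax.parse(fh, _AuthorHandler())
        except _FoundAuthor as found:
            return found.name
        return None                         # no author field at all

    docs = "/company docs"                  # hypothetical directory
    hits = [f for f in os.listdir(docs)
            if author_of(os.path.join(docs, f)) == "Maciej"]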

> To do good searching you really need a background indexer in any case,
> and that gives equally good performance either way, and people are
> working on various parts of this problem already.

That is one option, and it might work if you set things up properly.
But let's think like a regular, plain user: a small office of people
who do not even have a sysadmin.

They choose to put their docs in "/company docs/", and they accumulate
a few hundred of those.  Who will set up the background indexing for
them?  What if they add another directory?  Is the setting global?  To
the network?  Per application?  Is it even standard across
applications?

The entire scenario described above is avoided completely in current
Microsoft environments, because they can just scan the documents for
the "summary/author" property inside each file.  It does not take a
lot of memory, and it does considerably less IO.
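
As a rough illustration of why that scan is cheap, here is the same
"who wrote this" search against structured storage files, using the
third-party Python "olefile" module (the directory name is made up).
Only the small SummaryInformation property streams get read, not the
document data:

    import os
    import olefile

    def authored_by(path, who):
        if not olefile.isOleFile(path):
            return False
        ole = olefile.OleFileIO(path)
        try:
            meta = ole.get_metadata()   # reads only the property streams
            author = meta.author or b""
            if isinstance(author, bytes):
                author = author.decode("latin-1", "replace")
            return author == who
        finally:
            ole.close()

    docs = "/company docs"              # hypothetical directory
    hits = [f for f in os.listdir(docs)
            if authored_by(os.path.join(docs, f), "Maciej")]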

> It sounds a lot to me like this efs thing is like a tarball, but
> optimized for read-write access. If there were a widely available
> command-line tool to process it, it might not be so bad. 

Yes, it is.  We can write the command-line tools, and even a vfs
module (so people can browse the internals with Nautilus or any other
gnome-vfs application).

> But it would still be extra work to process it with standard XML
> tools, so there would have to be an actual compelling reason for
> preferring an ad-hoc structured binary format to an existing
> structured format that can be processed with many general-purpose
> tools.

Yes, this is my concern as well.  I wanted to use Microsoft's
Structured Storage file format, until Michael told me about its
shortcomings (very short file name limits), although even those could
probably be worked around.

The OLE2SS format is pretty standard in today's universe.  It might
make sense to just use OLE and work around its brokenness.  This way,
even Microsoft tools could search and index our documents, and our
applications would be ready to scan and search theirs.
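
A minimal sketch of what that interoperability buys us: any OLE2-aware
tool can already look inside such a file without knowing anything
about our applications (again using the Python "olefile" module; the
file name is hypothetical):

    import olefile

    ole = olefile.OleFileIO("report.gnumeric")
    for entry in ole.listdir():         # each entry is a stream path
        print("/".join(entry))
    ole.close()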

> I don't think fast random access to specific fields is a compelling
> enough reason. Everyone else is moving away from binary files and
> towards XML for serialization despite this issue.

Not Microsoft.  They do support exporting to XML, but their default
file formats are still binary.

Miguel.


