Re: Request: Test suite for EFS.



On Wed, Feb 16, 2000 at 06:01:06AM -0600, Miguel de Icaza wrote:
> 
> [Michael: comments on OLE2 at the end]
> 
> > Disabling searching to avoid parsing the whole file sounds lame to
> > me. XML is definitely structured. Storing images in it should not be a
> > problem; see RFC 2397 for one way to do it. Storing embedded
> > objects should be no problem either as long as they serialize to
> > XML. XML is perfectly happy to let you use multiple DTDs in one file.
> 
> People from the Windows world are used to multi-megabyte files.  Some
> of the Gnumeric test cases for Excel loading are pretty large.
> 
> If we use XML exclusively, I wonder who is the brave soul who will be
> scanning a directory full of XML files for information.  Consider a few
> hundred files on a server, and you are looking for documents that have
> been edited by "Maciej" at some point in their life.
> 
> I can picture the disk IO action going up, the memory usage going up
> and the time going up.
> 
> Can you picture a way in which this could be solved with XML?

  Not just XML, but a combination.
The server runs WebDAV, and all WebDAV servers allow associating
properties with a given resource. The fact that Maciej edited a document
is exactly the kind of metadata that should be stored on the server, and
it can be accessed without even opening the resource in question.
Indexing such properties and maintaining a complete, coherent view on
the server is far easier that way.
  Honestly, I doubt MS will keep packaging both the resource and the
metadata in a single file in new products. I may be wrong.
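  For instance, finding everything Maciej ever edited becomes one
PROPFIND per collection instead of opening every file. A rough sketch of
the on-the-wire exchange (the property name and its namespace are
invented for the example; the application would register its own):

    PROPFIND /company%20docs/ HTTP/1.1
    Host: server.example.com
    Depth: 1
    Content-Type: text/xml; charset="utf-8"

    <?xml version="1.0" encoding="utf-8"?>
    <D:propfind xmlns:D="DAV:">
      <D:prop xmlns:G="http://gnome.org/props/">
        <G:editors/>
      </D:prop>
    </D:propfind>

The server answers with a single 207 Multi-Status response carrying that
property for each member of the collection; the documents themselves are
never opened.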

> > To do good searching you really need a background indexer in any case,
> > and that gives equally good performance either way, and people are
> > working on various parts of this problem already.
> 
> That is one option, and might work if you set up things properly.  But
> let's think like a regular, plain user.  A small office of people who do
> not even have a sysadmin.
> 
> They choose to put their docs on "/company docs/", and they accumulate
> a few hundred of those.  Who will set up the background indexing for
> them?  What if they add another directory?  Is the setting global
> to the network?  Per application?  Is it even standard across
> applications?
> 
> The entire scenario described above is avoided completely in current
> Microsoft environments, because they can just scan the documents for
> the "summary/author" inside each file.  Does not take a lot of memory,
> and does considerably less IO.

  In current environments, maybe; in future ones, I doubt it.
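  And for the scanning cost itself, a stream parser keeps memory flat
even on the multi-megabyte files you mention. A rough, untested sketch
against libxml's SAX interface, assuming the format stores the author in
an <author> element (that element name is invented for the example):

    #include <stdio.h>
    #include <string.h>
    #include <libxml/parser.h>

    static int  in_author = 0;
    static char author[256] = "";

    static void start_element(void *ctx, const xmlChar *name,
                              const xmlChar **atts)
    {
        if (strcmp((const char *) name, "author") == 0)
            in_author = 1;
    }

    static void end_element(void *ctx, const xmlChar *name)
    {
        if (strcmp((const char *) name, "author") == 0)
            in_author = 0;
    }

    static void characters(void *ctx, const xmlChar *ch, int len)
    {
        int room = (int) (sizeof(author) - strlen(author) - 1);
        if (in_author && room > 0)
            strncat(author, (const char *) ch, len < room ? len : room);
    }

    int main(int argc, char **argv)
    {
        xmlSAXHandler sax;

        if (argc < 2)
            return 1;
        memset(&sax, 0, sizeof(sax));
        sax.startElement = start_element;
        sax.endElement   = end_element;
        sax.characters   = characters;
        /* Streams the file through the callbacks above; no tree is
           built, so memory use does not grow with document size. */
        xmlSAXUserParseFile(&sax, NULL, argv[1]);
        printf("%s: %s\n", argv[1], author);
        return 0;
    }

This still reads the whole file; to bail out as soon as the author is
known one would feed chunks through the push parser interface instead,
so only the head of each file is ever read.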

> > It sounds a lot to me like this efs thing is like a tarball, but
> > optimized for read-write access. If there were a widely available
> > command-line tool to process it, it might not be so bad. 
> 
> Yes, it is.  We can write the command line tools, and even a vfs
> module (so people can browse the internals with Nautilus or any other
> gnome-vfs applications).

  And one will be stuck as soon as one falls back to, say, a plain
command line where those tools are not installed... It sounds like a
hard decision.
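  To make the trade-off concrete: with plain XML files the stock Unix
toolchain already works,

    grep -l Maciej "/company docs"/*.xml

while a binary container is opaque until its own helpers exist and are
installed everywhere (the tool name below is hypothetical):

    efs-cat "/company docs/report.gnumeric" summary | grep Maciej

Each such helper is one more thing to write, document, ship and keep in
sync.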

> > But it would still be extra work to process it with standard XML
> > tools, so there would have to be an actual compelling reason for
> > preferring an ad-hoc structured binary format to an existing
> > structured format that can be processed with many general-purpose
> > tools.
> 
> Yes, this is my concern as well.  I wanted to use Microsoft's
> Structured Storage file format, until Michael told me about the
> shortcomings they had (small file names), although even this could
> probably be worked around.
> 
> OLE2SS format is pretty standard in today's universe.  Might make
> sense to just use OLE and deal with working around its brokenness.
> This way, even Microsoft tools could search and index our documents,
> and our applications would be ready to scan and search theirs.

  Well, I would rather spend time reimplementing the new MS technologies,
like the XML-based stuff, WebDAV, etc., than spend it on the older ones,
especially when we already know their limitations are barely acceptable.
  But I agree I have not coded specifically in this field, and in the
end the one who writes the code makes the final decision.

> > I don't think fast random access to specific fields is a compelling
> > enough reason. Everyone else is moving away from binary files and
> > towards XML for serialization despite this issue.
> 
> Not Microsoft.  They do support exporting to XML, but their default
> file formats are still binary.

  Well, that is not a point in their favor. Do not follow them
on that.
  Maybe better handling of multi-component documents can be added later
as an alternate serialization mechanism (XML-based or not). Getting it
right the first time is nearly impossible, but let's learn from others'
(and especially MS's) mistakes.
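  One cheap way to keep that option open is to hide the container behind
a small vtable from day one, so an efs, OLE2 or XML-on-the-filesystem
backend can be swapped without touching the applications. A rough sketch
in C, all names invented for the example:

    #include <stdio.h>
    #include <string.h>

    typedef struct DocStorage DocStorage;
    struct DocStorage {
        void *(*open) (DocStorage *s, const char *stream, const char *mode);
        int   (*write)(DocStorage *s, void *stream, const char *buf, int len);
        int   (*close)(DocStorage *s, void *stream);
    };

    /* Trivial backend: each named stream is a plain file in the current
       directory.  An efs or OLE2 backend would fill the same slots. */
    static void *dir_open(DocStorage *s, const char *stream, const char *mode)
    {
        return fopen(stream, mode);
    }

    static int dir_write(DocStorage *s, void *stream, const char *buf, int len)
    {
        return (int) fwrite(buf, 1, (size_t) len, (FILE *) stream);
    }

    static int dir_close(DocStorage *s, void *stream)
    {
        return fclose((FILE *) stream);
    }

    static DocStorage dir_storage = { dir_open, dir_write, dir_close };

    int main(void)
    {
        DocStorage *s = &dir_storage;   /* could be an efs backend instead */
        void *st = s->open(s, "author", "w");
        if (st != NULL) {
            s->write(s, st, "Maciej", 6);
            s->close(s, st);
        }
        return 0;
    }

The applications only ever see the vtable, so the day a better
serialization shows up it is one backend to write, not N applications to
port.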

Daniel.

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel: +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax: +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe


