Re: Request: Test suite for EFS.



One way to avoid scanning all of a huge XML file, I think, would be to do
something like this:
<data size="32434">
...some junk...
</data>

Then for quick searching, the search app, when it finds a data tag, will
read the size and skip ahead that many bytes. That should speed things up
a lot, I think. Also, put the searchable tags at the top of the document
if possible, the way HTML keeps its metadata in the head section.
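
A rough sketch of the skip-ahead idea in Python, just to show the shape of
it. Everything here is an assumption on my part: the function names are made
up, the size attribute is taken to count exactly the bytes between the
opening tag and </data>, and the tag matching is deliberately naive (no
attributes on other tags, no comments or CDATA). It is not a real XML parser.

```python
import io
import re

# Assumed layout, following the example above: <data size="N"> then N payload bytes.
DATA_OPEN = re.compile(rb'<data\s+size="(\d+)"\s*>')


def _read_until(stream, stop):
    """Read up to and including the byte `stop`; return None if EOF comes first."""
    out = bytearray()
    while True:
        ch = stream.read(1)
        if not ch:
            return None
        out += ch
        if ch == stop:
            return bytes(out)


def find_text(stream, tag_name):
    """Return the text of the first <tag_name> element, seeking past <data> payloads."""
    open_tag = b"<" + tag_name.encode() + b">"
    while True:
        # Advance to the start of the next tag.
        if _read_until(stream, b"<") is None:
            return None
        rest = _read_until(stream, b">")
        if rest is None:
            return None
        tag = b"<" + rest
        match = DATA_OPEN.fullmatch(tag)
        if match:
            # Skip the payload without reading it.
            stream.seek(int(match.group(1)), io.SEEK_CUR)
        elif tag == open_tag:
            text = _read_until(stream, b"<")
            return None if text is None else text[:-1].decode()


if __name__ == "__main__":
    payload = b"<author>Nobody</author>" * 3  # decoy tags inside the blob
    doc = (b'<doc><data size="%d">' % len(payload)) + payload + \
          b"</data><author>Maciej</author></doc>"
    print(find_text(io.BytesIO(doc), "author"))
```

The seek is where the saving would come from: on a real file the payload
bytes are never read, so the cost depends on the number of tags rather than
the size of the embedded data. It only works if the writer keeps the size
attribute accurate.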

On Wed, 16 Feb 2000, Miguel de Icaza wrote:

> 
> [Michael: comments on OLE2 at the end]
> 
> > Disabling searching to avoid parsing the whole file sounds lame to
> > me. XML is definitely structured. Storing images in it should not be a
> > problem, see the RFC 2397 for one way to do it. Storing embedded
> > objects should be no problem either as long as they serialize to
> > XML. XML is perfectly happy to let you use multiple DTDs in one file.
> 
> People from the Windows world are used to multi-megabyte files.  Some
> of the Gnumeric test cases for Excel loading are pretty large.
> 
> If we use XML exclusively, I wonder who is the brave soul who will be
> scanning a directory for information with an XML file.  Consider a few
> hundred files on a server, and you are looking for documents that have
> been edited by "Maciej" at some point in life.  
> 
> I can picture the disk IO action going up, the memory usage going up
> and the time going up.
> 
> Can you picture a way in which this could be solved with XML?
> 
> > To do good searching you really need a background indexer in any case,
> > and that gives equally good performance either way, and people are
> > working on various parts of this problem already.
> 
> That is one option, and might work if you set up things properly.  But
> let's think as a regular, plain user.  A small office of people who do
> not even have a sysadmin.
> 
> They choose to put their docs on "/company docs/", and they accumulate
> a few hundred of those.  Who will set up the background indexing for
> them?  What if they add another directory?  Is the setting global?
> To the network?  Per application?  Is it even standard across
> applications?
> 
> The entire scenario described above is avoided completely in current
> Microsoft environments, because they can just scan the documents for
> the "summary/author" inside each file.  Does not take a lot of memory,
> and does considerably less IO.
> 
> > It sounds a lot to me like this efs thing is like a tarball, but
> > optimized for read-write access. If there were a widely available
> > command-line tool to process it, it might not be so bad. 
> 
> Yes, it is.  We can write the command line options, and even a vfs
> module (so people can browse the internals with Nautilus or any other
> gnome-vfs applications).
> 
> > But it would still be extra work to process it with standard XML
> > tools, so there would have to be an actual compelling reason for
> > preferring an ad-hoc structured binary format to an existing
> > structured format that can be processed with many general-purpose
> > tools.
> 
> Yes, this is my concern as well.  I wanted to use Microsoft's
> Structured Storage file format, until Michael told me about the
> shortcomings they had (small file names), although even this could
> probably be worked around.
> 
> OLE2SS format is pretty standard in today's universe.  Might make
> sense to just use OLE and deal with working around its brokenness.
> This way, even Microsoft tools could search and index our documents,
> and our applications would be ready to scan and search theirs.
> 
> > I don't think fast random access to specific fields is a compelling
> > enough reason. Everyone else is moving away from binary files and
> > towards XML for serialization despite this issue.
> 
> Not Microsoft.  They do support exporting to XML, but their default
> file format is still binary.
> 
> Miguel.


