Re: New module proposal: tracker



On Wed, 2009-08-19 at 13:07 +0100, Alan Cox wrote:
> > One shortcoming of this approach is that it will cause a problem
> > where multiple applications can be associated with a file type over a
> > period of time. For instance, for .mbox files, the applications could
> > vary: Evolution, Mutt, Pine, Claws, Thunderbird, etc. And it is
> > common for some people to switch applications every few months; not
> > for email, but for other applications like PDF viewers.
> 
> This requires some commonality about indexing and the meaning of
> concepts. There isn't anything wrong with several apps indexing the one
> file (preferably at the same time so we walk the filestore once).

> A more interesting problem is hierarchical breakdowns (a multipart
> mime email of a zip holding a pdf and a jpg file) or xml documents
> with multiple namespaces in use.

The libstreamanalyzer library is ideal for this. It opens a file and then
starts reading it through what it calls a stream. When it reaches a point
in the file that can be recursed into (like a zip file in a zip file), it
opens a stream on top of the root stream and recurses into it.
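
To make the recursion concrete, here is a rough Python sketch of walking
a zip that contains another zip. It is not libstreamanalyzer's API; the
walk() helper is a name made up for this mail, and because the zip format
needs random access the sketch buffers each nested archive in memory
rather than truly streaming it. The point is only that nothing is
unpacked to disk.

```python
import io
import zipfile


def walk(stream, path=""):
    """Yield (path, size) for every leaf entry, recursing into nested zips.

    `stream` is any seekable binary file object holding a zip archive.
    """
    with zipfile.ZipFile(stream) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            full = f"{path}/{info.filename}"
            if info.filename.lower().endswith(".zip"):
                # Assumption: buffer the nested archive in memory so that
                # zipfile can seek in it; no temporary file is written.
                nested = io.BytesIO(archive.read(info))
                yield from walk(nested, full)
            else:
                yield full, info.file_size
```

Called as list(walk(open("outer.zip", "rb"))), this yields entries such as
("/inner.zip/doc.pdf", 8) for files found inside the nested archive.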

Tracker's FS miner is integrating with libstreamanalyzer for extraction.
The libstreamanalyzer library was originally developed by and for KDE's
Strigi project by Jos van den Oever. We're of course in discussion with
Jos about various things.

We realized that their method of extracting metadata is far superior
to our simpler FILE- and fread()-based extractors.

The MBox example is a good one for this: a Base64 encoded image/png
attachment in an E-mail that can be found somewhere deep in a large MBox
file ... can have Exif tags that are indexable (when Base64 decoded).

Surely you don't want to Base64 decode all the attachments in the MBox
file to files in /tmp, and then extract those files' Exif tags? (well,
that's what Tracker did for its Evolution support).

Instead, you want to lay a Base64 decoder stream over the root stream
for the MBox file, and then analyze the image/png image that comes out
of the Base64 decoder stream.

That's what libstreamanalyzer does.
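
As an illustration of the layering idea only (this is not
libstreamanalyzer's API), here is a rough Python sketch. The class
Base64DecodeStream and the helper png_size() are names invented for this
mail. The sketch assumes the root stream is already positioned at the
start of a Base64 body and contains nothing else, and it reads the PNG
width and height as a stand-in for real metadata such as Exif tags.

```python
import base64
import io
import struct


class Base64DecodeStream(io.RawIOBase):
    """Read-only stream that lazily decodes Base64 text read from `source`."""

    def __init__(self, source):
        super().__init__()
        self._source = source   # underlying binary stream (the root stream)
        self._pending = b""     # Base64 characters not yet forming a quartet
        self._decoded = b""     # decoded bytes not yet handed to the caller

    def readable(self):
        return True

    def readinto(self, buffer):
        while not self._decoded:
            chunk = self._source.read(4096)
            if not chunk:
                break  # root stream exhausted
            # Drop line breaks, then decode only complete 4-character groups.
            self._pending += b"".join(chunk.split())
            usable = len(self._pending) - len(self._pending) % 4
            self._decoded = base64.b64decode(self._pending[:usable])
            self._pending = self._pending[usable:]
        n = min(len(buffer), len(self._decoded))
        buffer[:n] = self._decoded[:n]
        self._decoded = self._decoded[n:]
        return n


def png_size(stream):
    """Return (width, height) from a PNG's IHDR chunk, or None if not a PNG."""
    header = stream.read(24)
    if (len(header) < 24
            or header[:8] != b"\x89PNG\r\n\x1a\n"
            or header[12:16] != b"IHDR"):
        return None
    return struct.unpack(">II", header[16:24])
```

With root being the stream positioned at the attachment body,
png_size(io.BufferedReader(Base64DecodeStream(root))) pulls only as much
of the attachment as the analyzer needs, and nothing is written to /tmp.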

> > subject is the metadata etc. So every time the user switches
> > applications, the earlier collected meta-data might need some brushup.
> 
> That assumes that the old meta data is somehow "wrong". When an office
> changes staff the way stuff is indexed may change a bit but the old index
> doesn't become invalid or useless.

Right

> > many sites exist. For desktop the scale of the things is less,
> > individual application-provided-search is enough and will satisfy the
> > needs of most of the users. ctags, mairix etc. can provide specialized
> > and more effective searching.
> 
> The notion that the internet and personal file store are separate is one
> I would question.

Exactly. I think RDF metadata stores can be instrumental in bringing the
web and the desktop closer together. Which is among our goals.

> Why for example would I not be running a query across my personal email
> and a company wide accumulated metadata source of all the internal public
> mailing lists.

This would be possible if we first developed a protocol for doing remote
queries. Again, that is a long-term goal that might not even be part of
Tracker itself (we can easily proxy D-Bus over some TCP/IP or even UDP
service).

You can probably imagine that this would require things like security
policies too? ;-) We wouldn't want random people accessing your data.

Of course.

> Specialized searching is also very different to general contexts. It is
> better at the one job but cannot answer random queries or associations.
> 
> /home is a place where you keep stuff nobody else needs, or you
> want fast access to, or you particularly don't want other people to have
> access to. Indeed if you back up to an internet connected server it's not
> unreasonable to argue that user filestore is simply a cache, nothing more.

Thanks for your input, Alan. It's very helpful.

-- 
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
http://pvanhoof.be/blog
http://codeminded.be


