Re: [Tracker] Mining GVfs metadata



On Mon, 2010-11-22 at 20:25 +0200, Ivan Frade wrote:
Tracker performs well with "defined" keys. It is also possible to
store arbitrary key=value pairs, but it has a performance cost.

How large is this performance drawback? If we end up with thousands
of records of this kind, would searching take a significant amount of
time? Still, it would be the best way of indexing.

2. Metadata with no correspondence in the ontology BUT that makes
sense there -> we update the ontology

So far I don't have a list of the information stored in metadata, but
we can figure that out later and update the code if necessary. At the
moment the only potential usage is tagging.

  3.2 Because it is app-specific information -> let's study the case
(arbitrary key=value or no tracker at all)

I would like to avoid making lists, whitelists and constantly updating
the code. Of course we can blacklist some known useless metadata, but
generally applications should be free to use metadata and should
benefit from automatic pushing to tracker.

And when we add a mapping to a new ontology, we can always force
reindexing.

Stupid question - I've been looking at the NAO ontology (as pointed
out by Adrien Bustany in the other post) and found the nao:Tag class.
Is it possible to have an array of properties, say multiple
nao:prefLabel values associated with a particular file? It would make
our life easier. If not, don't worry, in gvfs we would store tags in a
single string anyway.
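
To make the question concrete, what I would like to be able to send is
something like the update below: several tags attached to one file,
each with its own label. nao:hasTag, nao:Tag and nao:prefLabel are
what I found in the NAO docs, so please correct me if I misread them;
whether gvfs should reuse existing tag resources instead of creating
blank nodes is left open here:

/* Hedged sketch: attach two tags to one file. */
static const char *tag_update =
    "INSERT { "
    "  ?f nao:hasTag [ a nao:Tag ; nao:prefLabel 'holiday' ] , "
    "              [ a nao:Tag ; nao:prefLabel 'beach' ] "
    "} WHERE { ?f nie:url 'file:///home/user/photo.png' }";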

Notifying tracker of changes directly saves a round trip that can
take a few milliseconds and uses I/O (re-reading the file from disk
and processing the metadata). If GVFS already knows the changes, there
is no need to encode them in a file, then read and decode them to put
them into tracker.

I prefer to check carefully those "failure cases" and how they can be
handled assuming direct communication.

I understand it's a performance issue. So I was thinking about the
following:
 - the gvfsd-metadata daemon would push metadata directly to tracker,
and tracker would send some ACK back indicating that the changes have
been stored successfully.
 - if we don't get that ACK before a timeout, or an error happens, we
set a flag saying that manual reindexing is needed.
 - in any case, we need manual reindexing for the initial crawl; the
flag mentioned above will force that.

Something should keep an eye on the flag and run the metadata miner.
If we keep the miner running, it can check periodically and force
reindexing when necessary.
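
A rough sketch of what I mean, using plain GDBus: the method return
serves as the ACK, and anything else flips the "needs reindexing"
flag. The Tracker bus/interface names are written from memory of the
0.9 docs and the flag file path is just an example, so please treat
the whole thing as pseudo-code:

#include <gio/gio.h>

/* Path of the "manual reindexing needed" flag -- a made-up example. */
#define REINDEX_FLAG "/tmp/gvfs-metadata-needs-reindex"

static void
mark_reindex_needed (void)
{
  /* Something (the miner, or gvfsd-metadata itself) checks this flag
   * periodically and triggers a full crawl of the metadata databases. */
  g_file_set_contents (REINDEX_FLAG, "1", 1, NULL);
}

static void
push_metadata (GDBusConnection *bus,
               const char      *sparql)
{
  GError   *error = NULL;
  GVariant *reply;

  /* The method return acts as the ACK: if it doesn't arrive within the
   * timeout, or an error comes back, fall back to flagging a reindex. */
  reply = g_dbus_connection_call_sync (bus,
                                       "org.freedesktop.Tracker1",
                                       "/org/freedesktop/Tracker1/Resources",
                                       "org.freedesktop.Tracker1.Resources",
                                       "SparqlUpdate",
                                       g_variant_new ("(s)", sparql),
                                       NULL,
                                       G_DBUS_CALL_FLAGS_NONE,
                                       5000 /* ms */,
                                       NULL,
                                       &error);
  if (reply == NULL)
    {
      g_warning ("tracker update failed: %s", error->message);
      g_error_free (error);
      mark_reindex_needed ();
      return;
    }

  g_variant_unref (reply);
}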

Of course the dependency on tracker can be made optional with
compilation flags.

That's the idea.

Tracker (store) doesn't start/pause/stop any miner by itself. The
miners are started in the session login scripts and they activate (via
DBus) the store. So far our miners are alive all the time, but nothing
prevents them from appearing/disappearing when needed. They can be
activated by cron or dbus; the store doesn't care.

In the gvfs case, where the miner will be used very frequently, I
wonder what is more efficient: keeping it alive or starting/stopping
it on demand. At least in maemo/meego, starting a process is an
expensive operation.

Cool, I hope people won't mind another process sitting in memory. As
outlined above, the miner could keep an eye on things and reindex when
necessary or when the store appears on d-bus. Then again, the
functionality could be integrated into gvfsd-metadata directly. I need
to think more about it.
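
For the "reindex when the store appears on d-bus" part I was thinking
of something as simple as GDBus name watching. The bus name is again
from memory and schedule_metadata_reindex() is just a placeholder for
whatever the miner or gvfsd-metadata ends up providing:

#include <gio/gio.h>

/* Placeholder for the real work: queue a crawl of the metadata
 * databases and push the result to the store. */
static void
schedule_metadata_reindex (void)
{
  g_print ("tracker store is up, queueing metadata reindex\n");
}

static void
on_store_appeared (GDBusConnection *bus,
                   const gchar     *name,
                   const gchar     *name_owner,
                   gpointer         user_data)
{
  schedule_metadata_reindex ();
}

int
main (void)
{
  GMainLoop *loop;

  g_type_init ();  /* still needed with the GLib versions we target */

  loop = g_main_loop_new (NULL, FALSE);

  /* "org.freedesktop.Tracker1" is the store's bus name as far as I
   * remember; correct me if the miner should watch something else. */
  g_bus_watch_name (G_BUS_TYPE_SESSION,
                    "org.freedesktop.Tracker1",
                    G_BUS_NAME_WATCHER_FLAGS_NONE,
                    on_store_appeared,
                    NULL /* name vanished */,
                    NULL, NULL);

  g_main_loop_run (loop);
  return 0;
}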

I don't think we want to target embedded systems. Gvfs is based on
processes communicating over d-bus and sockets, and every mount is a
separate process.

Take a look at libtracker-miner:
http://library.gnome.org/devel/libtracker-miner/unstable/

It has a common superclass for all miners offering the control
methods over DBus, so an applet/application can start/stop/monitor
them (not sure this is relevant for a low-level miner like GVFS).
There is also the crawling code and a few useful things for writing a
miner.

Thanks, I'm starting to understand the design.
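
To check that I read the docs right, here is roughly how I imagine a
gvfs metadata miner skeleton on top of libtracker-miner. I've only
skimmed the documentation, so the header path and the "started" vfunc
name are assumptions on my part, please correct me if they're wrong:

#include <libtracker-miner/tracker-miner.h>  /* header path assumed */

typedef struct { TrackerMiner      parent;       } GvfsMetadataMiner;
typedef struct { TrackerMinerClass parent_class; } GvfsMetadataMinerClass;

G_DEFINE_TYPE (GvfsMetadataMiner, gvfs_metadata_miner, TRACKER_TYPE_MINER)

static void
gvfs_metadata_miner_started (TrackerMiner *miner)
{
  /* Walk the on-disk metadata databases and push their content to the
   * store; this is also where the "manual reindex" flag would be
   * honoured. */
}

static void
gvfs_metadata_miner_class_init (GvfsMetadataMinerClass *klass)
{
  /* "started" is assumed from the TrackerMiner docs linked above. */
  TRACKER_MINER_CLASS (klass)->started = gvfs_metadata_miner_started;
}

static void
gvfs_metadata_miner_init (GvfsMetadataMiner *miner)
{
}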


Tracker does the initial crawling as it does now. No problem there. I
still wonder what happens afterwards: do we need to monitor the
filesystem via inotify as we do now? Or can we kill our filesystem
miner because gvfs gives us all relevant changes? Or do we need to
receive data from both and handle the duplicated information?

Great. As mentioned above, we should have a way to manually reindex
all metadata. While the filesystem miner could ask GIO for metadata::
info for every single file, that's not efficient since the metadata
databases are stored in one place. Thus it's better to write a new,
separate miner that goes through the databases directly. We only need
to make sure this miner runs for the initial crawl. I wouldn't touch
the existing filesystem miner.
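
For completeness, the per-file approach would look roughly like this
with plain GIO; it does work, but doing it once per file is exactly
what makes it unattractive for a full crawl compared to reading the
databases in one pass (the path in main() is just an example):

#include <gio/gio.h>

/* Print all metadata:: attributes GIO knows about for one file. */
static void
print_metadata (const char *path)
{
  GFile     *file = g_file_new_for_path (path);
  GError    *error = NULL;
  GFileInfo *info;
  char     **attrs;
  int        i;

  info = g_file_query_info (file, "metadata::*",
                            G_FILE_QUERY_INFO_NONE, NULL, &error);
  if (info == NULL)
    {
      g_warning ("%s: %s", path, error->message);
      g_error_free (error);
      g_object_unref (file);
      return;
    }

  attrs = g_file_info_list_attributes (info, "metadata");
  for (i = 0; attrs != NULL && attrs[i] != NULL; i++)
    {
      char *value = g_file_info_get_attribute_as_string (info, attrs[i]);
      g_print ("%s = %s\n", attrs[i], value);
      g_free (value);
    }

  g_strfreev (attrs);
  g_object_unref (info);
  g_object_unref (file);
}

int
main (int argc, char *argv[])
{
  g_type_init ();  /* still needed with the GLib versions we target */
  print_metadata (argc > 1 ? argv[1] : "/home/user/photo.png");
  return 0;
}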

-- 
Tomas Bzatek <tbzatek@redhat.com>



