[Tracker] wip/passive-extraction (and API cleanup?)



Hey all,

I've been talking about this branch on #tracker, but now that most of
the work there is done, it is worth raising on the ML. The branch adds
two new objects to libtracker-miner:

      * TrackerDecorator is a TrackerMiner that implements a passive
        indexing pattern. Instead of being expected to feed data
        directly to tracker-store, it listens for GraphUpdated signals:
        when an item eligible for indexing is added or updated and is
        still missing a nie:dataSource specific to the decorator, it is
        queued for processing. On startup it also queries for all
        elements of the eligible rdf:types that are missing that
        nie:dataSource, so every element is guaranteed to get indexed.
      * TrackerDecoratorFS is a file-specific implementation of that
        object. It basically adds volume monitoring, so that indexing
        of a just-added volume is resumed if it was interrupted
        previously, and elements are removed from the queue if their
        volume is removed.
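
To make the pattern above concrete, here is a toy Python model of it. It
is only a sketch under stated assumptions: the store is a plain dict,
and the names Item, ToyDecorator and the on_* methods are hypothetical,
not the libtracker-miner API in the branch.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Item:
    """Toy stand-in for a resource in the store (not a Tracker type)."""
    urn: str
    rdf_type: str
    volume: str
    data_sources: set = field(default_factory=set)


class ToyDecorator:
    """Hypothetical model of the passive pattern, not the real API."""

    def __init__(self, store, data_source, eligible_types):
        self.store = store                      # dict: urn -> Item
        self.data_source = data_source          # plays the nie:dataSource role
        self.eligible_types = set(eligible_types)
        self.queue = deque()
        # Startup pass: pick up everything that was never decorated.
        for item in store.values():
            self._maybe_enqueue(item)

    def _maybe_enqueue(self, item):
        eligible = item.rdf_type in self.eligible_types
        pending = self.data_source not in item.data_sources
        if eligible and pending and item.urn not in self.queue:
            self.queue.append(item.urn)

    def on_graph_updated(self, urns):
        """Stand-in for reacting to a GraphUpdated signal."""
        for urn in urns:
            if urn in self.store:
                self._maybe_enqueue(self.store[urn])

    def on_volume_added(self, volume):
        """File-specific part: resume pending work on that volume."""
        for item in self.store.values():
            if item.volume == volume:
                self._maybe_enqueue(item)

    def on_volume_removed(self, volume):
        """File-specific part: drop queued items living on that volume."""
        self.queue = deque(
            urn for urn in self.queue if self.store[urn].volume != volume
        )

    def process_next(self):
        """Decorate one item and mark it with our data source."""
        if not self.queue:
            return None
        urn = self.queue.popleft()
        self.store[urn].data_sources.add(self.data_source)
        return urn
```

Once an item carries the decorator's data source it is never queued
again, which is what lets the startup query and the signal handler share
the same eligibility check.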

In that branch, tracker-extract uses these features: it has been
turned into a full-blown standalone miner using TrackerDecorator, while
miner-fs has stopped calling it. On one hand, this greatly simplifies
indexing in tracker-miner-fs, as the task is a lot less prone to
failure now. On the other hand, this brings in the 2-pass indexing
that was being requested: miner-fs promptly crawls and fetches GFile
info, and tracker-extract follows soon after, filling in extra
information.

Current caveats
===============

It is worth noting, though, that not much has been done yet in the
branch about handling extraction failures:
      * extractor modules blocking or taking too much time
      * crashes in extractor modules

Possible solutions involve making extract tasks cancellable and/or
moving all extraction into a subprocess that we can watch over, so the
D-Bus service itself doesn't go away and doesn't need to be restarted.
The latter could also help with Phillip's idea to run extraction in
containers. But about these changes...
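
As a rough illustration of the subprocess idea, here is a minimal
Python sketch of a parent watching over a child that does the
extraction. It is not how tracker-extract works; the run_extractor
helper and its return values are assumptions made up for the example,
and the child is just a Python one-liner standing in for an extractor
module.

```python
import subprocess
import sys


def run_extractor(code, timeout):
    """Run `code` in a child Python process and classify the outcome.

    Returns ("ok", stdout), ("timeout", None) or ("crashed", returncode).
    The parent survives in every case, so it never needs a restart.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising this.
        return ("timeout", None)
    if proc.returncode != 0:
        return ("crashed", proc.returncode)
    return ("ok", proc.stdout.strip())
```

Both failure modes listed above map to a return value here: a blocking
module ends up as "timeout", a crashing one as "crashed", and the caller
can decide whether to retry or skip the file.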

Future plans?
=============

I'm very seriously proposing to make libtracker-extract private
altogether. The usefulness of 3rd party extractors is dubious: neither
letting them reimplement extraction for a well-known mimetype nor having
them implement support for a mimetype we don't know well enough is a
good thing, as both can affect Tracker's stability and how users
perceive it. It also sidesteps the point that if a mimetype has enough
traction, support for it should be in the Tracker tree. The API is also
a mishmash of utility functions that have little to do with the rest of
Tracker, and it is not written in a very future-safe way.

Moreover, googling for "tracker_extract_get_metadata" (the function that
modules must implement), I just see 3 pages of references to Tracker
code, backtraces, and logs, and very few references to external
extractors. This API is 1/3 of the Tracker public API, yet it's been
mostly unused externally for the 3 years it's been around.

So, I think Tracker should offer API that helps others integrate with
Tracker, and this API falls short of that. I propose to keep it in
private land and encourage the use of TrackerDecorator instead, which is
also nice in that multiple sources add up information, unlike extract
modules, which are each individually responsible for filling in every
piece of information.

Actually, I'd like to think we can make 1.0 soon (we technically could
ASAP, we've remained feature-stable for quite some time now) and make
longer stability promises than we do currently (having every GNOME
module that depends on Tracker bump .pc file versions every 6 months is
a PITA). IMO the main milestone is getting the API to a point where we
can think of forward compatibility, and doing this would help greatly.

Phew, long email,
  Carlos


