Re: [Tracker] Discarding broken metadata from miners



Hey :),

On Fri, Feb 24, 2017 at 11:19 AM, Debarshi Ray <rishi is lostca se> wrote:
Hey,

Every now and then, there is a bug or regression in one of the miners
which leads to broken metadata being inserted into the database.  For
example, here is a low impact one:
https://bugzilla.gnome.org/show_bug.cgi?id=767472#c48

However, the breakage can be more serious and visibly impact
applications. For example, this one:
https://bugzilla.gnome.org/show_bug.cgi?id=776723

It broke various fields in gnome-photos' properties dialog, and due to
the wrong nfo:orientation values images were no longer correctly
oriented.

Bugs happen. Such is life. What is the recommended way to deal with
these situations? So far I have been telling users to hard reset their
database and restart the miners using the command line. I am afraid
that isn't an elegant solution.

tracker reset --file is friendlier :), but I do agree. We already keep
several files in ~/.cache/tracker to ensure that the database itself
is up-to-date; the easy way would be to add a fuck-up counter for
miner-fs (and thus tracker-extract) that forces a reindex, and so do
the same maintenance for the database content.

Ideally, this would have better granularity: if a bug only affects
image files, we shouldn't need to reindex everything from scratch. I
wonder if we could apply the same approach as we have for the FTS
tokenizer: keep the most recent commit ID affecting it and check it
against a file; whenever it changes in the user setup, an update is
due.

However, the git files to track are rather varied and spread out:
there are the tracker-extract-*.c files, there's tracker-resource.c,
there's e.g. tracker-xmp.c wherever it applies,... I guess that can
get under control soon.


Do the Tracker miners version the metadata that they insert into the
database? Or, is it possible to programmatically discard metadata
coming from a certain miner and force a reindex?

There's no versioning... For dropping full miner data, I wish we
supported the DROP GRAPH syntax; all filesystem miners in Tracker
share the same TRACKER_OWN_GRAPH_URN define.

This however could be open coded as:
"DELETE WHERE { GRAPH <" TRACKER_OWN_GRAPH_URN "> { ?u a rdfs:Resource }}"

That should leave a clean slate for miners, though maybe a bit too clean :).
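
If the graph URN is only known at runtime, the same query can be
assembled with snprintf(). This is just a sketch: the
build_graph_purge_query() helper and EXAMPLE_GRAPH_URN are made up for
illustration (the real value would come from the Tracker define
mentioned above), and the URN is not validated or escaped:

```c
#include <stdio.h>

/* Assumption: placeholder URN for the example, not Tracker's real one. */
#define EXAMPLE_GRAPH_URN "urn:example:miner-graph"

/* Write the DELETE WHERE query for `graph_urn` into `buf`.
 * Returns 0 on success, -1 if the query does not fit in `len` bytes.
 * The URN is inserted verbatim, so it must come from a trusted source. */
int
build_graph_purge_query (char *buf, size_t len, const char *graph_urn)
{
    int n = snprintf (buf, len,
                      "DELETE WHERE { GRAPH <%s> { ?u a rdfs:Resource }}",
                      graph_urn);

    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```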


In gnome-online-miners (those are the out-of-tree miners used by
gnome-documents/photos to index online accounts advertised by
gnome-online-accounts), we handle this by having each miner tag their
insertions with nie:version (grep for 'version' in
src/gom-miner.c). Whenever a bug that could have inserted broken
metadata is fixed, we bump the miner version. When the user installs
the updated miner, it will automatically purge the old metadata and
re-index.
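
The version bump described above boils down to a small decision at
miner startup. The sketch below is not copied from src/gom-miner.c;
MINER_VERSION, CleanupAction and cleanup_action_for() are hypothetical
names used only to show the shape of the check:

```c
/* Hypothetical: bumped whenever a fix could have left broken metadata. */
#define MINER_VERSION 2

typedef enum {
    ACTION_NONE,              /* stored data matches this miner version */
    ACTION_PURGE_AND_REINDEX  /* stored data came from another version */
} CleanupAction;

/* Decide what to do with data tagged with `stored_version`.
 * Any mismatch triggers a purge, including a downgrade, since the
 * stored data may not be what this miner version expects. */
CleanupAction
cleanup_action_for (int stored_version)
{
    return stored_version == MINER_VERSION ? ACTION_NONE
                                           : ACTION_PURGE_AND_REINDEX;
}
```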

So, any suggestions? Thoughts?

For data maintenance, I suggest you look into inserting g-o-m data
into its own graph; version management is more open to discussion. The
approach you picked does indeed seem the nepomuk-y way, although I'm
not sure how great an argument that is nowadays :).

Cheers,
  Carlos

