[Tracker] Referential Integrity (Was: Re: Fwd: tracker 1.11.2)



Hey Philip :),

On Sun, Dec 11, 2016 at 9:45 PM, Philip Van Hoof <philip codeminded be> wrote:
On Sun, 2016-12-11 at 17:07 +0100, Carlos Garnacho wrote:

[cut]

- Getting as close to supporting the full sparql 1.1 spec as possible
in libtracker-data:
  * property paths: last weekend got halfway with this \o/
  * graph management: for DROP GRAPH I think triggers will perform

Did anything ever happen with cleaning up anonymous nodes of deleted
subjects/contexts, and/or doing reference counting on them (and clearing
them once they reach zero references)?

If not then we are still leaking those in the db afaik. We always wanted
to do something about that.

Yeah, we still do leak those... I remember you/Jürg suggested setting
a foreign key with an ON DELETE RESTRICT action from the various tables'
ID columns to the Resource table, so that cleaning up the Resource table
would fail for the still-referenced nodes. I got as far as seeing
that:

- Performance would be just fine for the common ops: the trigger would
only run when trying to delete the parent key in the Resource table,
which is the once-in-a-while operation, and modifications on the columns
setting the foreign key would be just as fast as they are now.

nod

- It wouldn't, however, fix on its own the other issue I saw happening
before the revert (graph URNs being deleted). I had a patch around that
added a Graph table, so that IDs in the Resource table were ensured to
be in either the rdfs:Resource or the Graph table. That already helped
with identifying and not deleting the graph URNs during garbage
collection, and seems useful for graph management, but I think I can't
just add the same RESTRICT action, as CLEAR/DROP GRAPH will want pretty
much the opposite.

Personally I think some sort of reference counting will be needed for
anonymous nodes referenced by different graphs...

It's not so much reference counting as a "weak pointer" that we want
here, no? I.e. we want to know whether the reference count dropped to 0
everywhere else, and drop the last bit of info in that case.

And that seems to be precisely the info we can obtain from foreign keys
with a RESTRICT action: deleting the parent key would fail unless it's
no longer referenced.
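
To illustrate what I mean, here's a toy sqlite3 session in Python (the
table and column names are made up, this is not our actual schema):

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.execute("PRAGMA foreign_keys = ON")  # FK enforcement is off by default
  db.execute("CREATE TABLE Resource (ID INTEGER PRIMARY KEY, Uri TEXT)")
  # toy stand-in for a class table whose rowid points back at Resource
  db.execute("CREATE TABLE ClassTable (ID INTEGER PRIMARY KEY "
             "REFERENCES Resource(ID) ON DELETE RESTRICT)")
  db.execute("INSERT INTO Resource VALUES (1, 'urn:example:a')")
  db.execute("INSERT INTO ClassTable VALUES (1)")

  try:
      db.execute("DELETE FROM Resource WHERE ID = 1")
  except sqlite3.IntegrityError as e:
      print("still referenced:", e)  # FOREIGN KEY constraint failed

  db.execute("DELETE FROM ClassTable WHERE ID = 1")
  db.execute("DELETE FROM Resource WHERE ID = 1")  # succeeds now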


- I also wondered whether it's more desirable, or allowed by the sparql
spec, that we actually garbage collect the inconsistent nodes. IMO not
leaving this type of data coherence up to miners/apps being educated
when deleting would be a win, but I've only seen mentions in the sparql
spec about impls being free to drop empty graphs, nothing about triples
with elements that are no longer bound.

I also don't think it's a problem to rid ourselves of orphaned anonymous
nodes.

Without a graph to belong to, they can't be referenced other than by
their UUID anyway.

Right, good that you confirm :). I've been playing further with
foreign keys, and I think there are two possible ways to properly
maintain the Resource table:

1) Doing immediate deletes requires:
   a) all rowid columns in class tables to set a foreign key on
Resource.ID with ON DELETE CASCADE
   b) all property columns of type "resource" to set a foreign key on
Resource.ID with ON DELETE SET NULL
   c) triggering deletion on the Resource table when a resource is to
be deleted.

2) Doing deferred deletes requires:
   a) all rowid columns in class tables to set a foreign key on
Resource.ID with ON DELETE RESTRICT
   b) all property columns of type "resource" to set a foreign key on
Resource.ID with ON DELETE RESTRICT
   c) running a maintenance task that tries to harvest no-longer-used
resources, which, past our initial heuristics (e.g. checking the
resource is not in the rdfs:Resource table), would still fail if it's
referenced somewhere. (Rough sketches of both variants below.)

(FWIW, I'm omitting *:graph-id columns management)
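
To make the difference concrete, the a/b parts of both variants would
roughly boil down to this (again toy names, not the generated schema;
1.c / 2.c, the deletion trigger and the maintenance task, aren't shown):

  import sqlite3

  def create_toy_schema(db, immediate):
      # 1) immediate deletes: CASCADE on class-table rowids, SET NULL on
      #    resource-valued property columns.
      # 2) deferred deletes: RESTRICT on both.
      rowid_action = "CASCADE" if immediate else "RESTRICT"
      prop_action = "SET NULL" if immediate else "RESTRICT"
      db.execute("PRAGMA foreign_keys = ON")
      db.execute("CREATE TABLE Resource (ID INTEGER PRIMARY KEY, Uri TEXT)")
      db.execute("CREATE TABLE ClassTable ("
                 " ID INTEGER PRIMARY KEY"
                 "    REFERENCES Resource(ID) ON DELETE " + rowid_action + ","
                 " someProperty INTEGER"
                 "    REFERENCES Resource(ID) ON DELETE " + prop_action + ")")

  create_toy_schema(sqlite3.connect(":memory:"), immediate=True)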

#1 allows this broken-node harvesting to happen immediately, while #2
still relies on the user being educated enough for the harvesting to be
effective. It might be an option, though, to use ON DELETE CASCADE on
2.b so the references pointing to such a node are cut, and we get the
good effects of #1 when nodes are harvested.

But both fail pretty badly on step b... there are many checks going on
in child tables each time a Resource table row is deleted, and those
happen on columns that don't have an index. The SQLite docs [1] recommend
creating an index on the child table column(s) holding the foreign key.
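
I.e. per the docs, each of those child columns would want something along
the lines of (toy names again):

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.execute("CREATE TABLE ClassTable (ID INTEGER PRIMARY KEY,"
             " someProperty INTEGER)")
  # one index per child column holding a foreign key, as [1] suggests
  db.execute("CREATE INDEX ClassTable_someProperty_idx"
             " ON ClassTable (someProperty)")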

Given our current ontology, that means adding some 140 extra indexes;
as each one requires at least an extra page [2], that already means
~0.5MB extra on a database created from a cold start, roughly 1/6th of
the current size.

Without indexes, the required checks get quite noticeable, especially on
mass deletes (e.g. tracker reset -f ~/ ), not so much for sparse delete
ops. And I guess this is where #2 wins: it could at least be made to
split the heavy ops into non-noticeable chunks without cramming
everything with indexes, at the expense of giving up on timing promises.
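
Something along these lines is what the chunked variant could look like
(pure sketch over the toy schema from above; the batch size and the
candidate heuristic are invented):

  import sqlite3

  BATCH = 100  # orphan candidates to try per maintenance step

  def gc_step(db):
      db.execute("PRAGMA foreign_keys = ON")
      # Candidates: IDs our heuristic considers unused (here, simply
      # "not present in the toy class table").
      candidates = db.execute(
          "SELECT ID FROM Resource"
          " WHERE ID NOT IN (SELECT ID FROM ClassTable) LIMIT ?",
          (BATCH,)).fetchall()
      removed = 0
      for (rowid,) in candidates:
          try:
              db.execute("DELETE FROM Resource WHERE ID = ?", (rowid,))
              removed += 1
          except sqlite3.IntegrityError:
              pass  # still referenced somewhere, leave it for later
      return removed  # caller reschedules while this returns > 0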

I still need to test with more nodes though, as I'd expect things to get
linearly worse with data, both in size with the additional indexes and
in performance without them. I kinda expected we'd have to pick our
poison here, though :).

Cheers,
  Carlos

[1] https://www.sqlite.org/foreignkeys.html#fk_indexes
[2] https://www.sqlite.org/limits.html , point 15

