Re: [Tracker] Referential Integrity (Was: Re: Fwd: tracker 1.11.2)



On Wed, 2016-12-21 at 14:29 +0100, Carlos Garnacho wrote:

Hello Carlos,

Personally think some sort of reference counting will be needed for
anonymous nodes references by different graphs..

It's not as much reference counting as it is a "weak pointer" what we
want here, no? i.e. we want to know whether the reference count
dropped to 0 anywhere else, and drop the last bit of info in that
case.

That's correct. Full reference counting isn't strictly necessary, just
what a weak pointer provides should be sufficient.

Reference counting in a column like Resource table would be a possible
implementation.

And that seems precisely the info we can obtain from foreign keys with
RESTRICT action, deleting the parent key would fail unless it's no
longer referenced.

Right.

I also don't think it's a problem to rid ourselves of orphaned anonymous
nodes.

Without a graph to be owned by, they can't be referenced other than by
their uuid anyway.

Right, good that you confirm :). I've been playing further with
foreign keys, and I think there are two possible ways to properly
maintain the Resource table:

1) Doing immediate deletes, requires:
   a) all rowid columns in class tables to set a foreign key on
Resource.ID with ON DELETE CASCADE
   b) all property columns of type "resource" to set a foreign key on
Resource.ID with ON DELETE SET NULL
   c) Triggering deletion on the Resource table when a resource is to
be deleted.

ok

2) Doing deferred deletes, requires:
   a) all rowid columns in class tables to set a foreign key on
Resource.ID with ON DELETE RESTRICT
   b) all property columns of type "resource" to set a foreign key on
Resource.ID with ON DELETE RESTRICT
   c) running a maintenance task that tries to harvest no longer used
resources, which past our initial heuristics (eg. checking it's not in
rdfs:Resource table) would still fail if referenced somewhere.

ok

(FWIW, I'm omitting *:graph-id columns management)

ok

#1 allows this broken node harvesting to happen immediately, while #2
still relies on the user being educated enough for the harvesting to
be effective. It might be an option though to use ON DELETE CASCADE on
2.b so the references pointing to this node are cut, and we get the
good effects of #1 when nodes are harvested.

But both fail pretty bad on step b... there's many checks going on in
child tables each time a Resource table row is deleted, and those
happen on columns that don't have an index. sqlite docs [1] recommend
that an index on the child table column(s) holding the foreign key is
created.

Given our current ontology, that means adding some 140 extra indexes,
as each one requires at least an extra page [2], that already means
~0.5MB extra on a database created from cold start, roughly 1/6th of
the current size.

ok

Without indexes, the required checks get quite noticeable esp. on mass
deletes (eg. tracker reset -f ~/ ), not as much for spare delete ops.
And I guess this is where #2 wins, it at least could be made to split
the heavy ops in non-noticeable chunks without cramming everything
with indexes, at the expense of giving up on timing promises.

I need testing with further nodes though, as I'd expect things to get
linearly worse with data, both in size with the additional indexes and
performance w/o indexes. I kinda expected we'd have to pick our poison
here though :).


Might be that just counting references and incrementing them in a column
in the Resource table is cheaper than letting SQLite do all the extra
required indexes?

But right. Without testing, it's hard to know. And if SQLite does it for
us with CASCADING Delete and RESTRICT: that's less code for us write.

Philip



Attachment: signature.asc
Description: This is a digitally signed message part



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]