Re: [Tracker] Proposal for a new signal mechanism



Availability:
-------------

o. Unreviewed: http://git.gnome.org/browse/tracker/log/?h=class-signal
   Expected in master within the next two weeks (after review and
   bugfixing)
o. This branch will be rebased onto master tomorrow or the day after
   (don't depend on the branch for development unless you're in for
   some merging and conflict fixing, or your name is Adrien, who still
   has to work on the Flickr miner in this branch)

Known open issues:

o. The Flickr miner by Adrien Bustany uses the Writeback signal
   directly. This signal has changed, so the miner's writeback
   features are broken at the moment. Adrien promised me he will fix
   this tomorrow.


Examples:
---------

Note that once this is merged to master, you need to remove
the ?h=class-signal from the URLs below to find the files.

These examples are NOT optimized in any way. You can, for example, cache
IDs in a local hashtable; the examples don't show you how to do that.

They both work with libtracker-sparql, meaning that the queries are
executed locally over direct-access (no extra IPC is involved in the
querying).

A cute test is executing the Plain C and Vala one at the same time. The
Vala one will start doing ~ 10000 insert+delete queries, the Plain C one
will start printing a lot of stuff due to that.

Plain C:
http://git.gnome.org/browse/tracker/tree/examples/class-signal/class-signal.c?h=class-signal

Vala:
http://git.gnome.org/browse/tracker/tree/tests/functional-tests/class-signal-test.vala?h=class-signal


Documentation:
--------------

http://live.gnome.org/Tracker/Documentation/SignalsOnChanges#Tracker_0.9


Enjoy! Code hackers!

Cheers,

Philip


On Thu, 2010-08-12 at 15:03 +0200, Philip Van Hoof wrote:
A new class signal for Tracker

Today's situation

Today we have a simple signal system that causes quite a bit of
overhead, which we have tried to reduce over time. The overhead comes from:
     A. Having to store the URIs of the resources involved in a
        changeset in tracker-store's memory; 
     B. Having to store the predicates involved in a changeset in
        tracker-store's memory (although far less severe than A); 
     C. Having to UTF-8 validate the strings when we emit them over
        D-Bus (D-Bus does this implicitly); 
     D. D-Bus's own copying and handling of string data; 
     E. Heavy traffic on D-Bus; 
     F. Context switching between tracker-store and dbus-daemon; 
     G. We have to wait with turning on the D-Bus objects until after
        we have the latest ontology. So after journal replay. And we
        need to reset the situation after a backup restore. Complex!
Besides this overhead there are problems the consumers have too. I'll
make a list in the next section.

Problems of today's signal 
     1. Aforementioned overhead: consumes a lot of D-Bus traffic. This
        is caused by sending over URLs for the subjects and the
        predicates; 
     2. Doesn't make it possible, in case of a delete of <a>, to know
        <b> in <a> nfo:isLogicalPartOf <b>, as <a> is removed at the
        point of signal emission; 
     3. Round trips to know the literals create more D-Bus traffic; 
     4. Transactional changes can't be reliably identified with
        SubjectsAdded, SubjectsChanged and SubjectsRemoved being
        separate signals; 
     5. A lot of D-Bus objects, instead of letting clients use D-Bus's
        filtering system.

The drive for a solution

Jürg Billeter and I brainstormed a bit about all these problems. Over
the last few months, while optimizing tracker-store's INSERT
performance and memory utilization, we brainstormed a lot about how we
could reduce the overhead. I believe we have a good idea of the current
situation, its internal problems and our current solution (hey of
course, we implemented it :p).

We also gained know-how about most of the problems consumers have from
the maintainer of libqttracker, Petteri Iridian Kiiskinen. Thanks
Iridian!

Today I believe that we must abandon the old ship, redo the signal
system, break the API. Break it all. Get over it, heal our wounds.
Even if that means taking the stress away from all sorts of people
who've been using the old signal system, offering massages, giving out
sauna coupons. You know, the usual stuff that we won't do for real.
Although I'm sure that at a next code-camp in Helsinki we'll have a
good sauna to burn all our own stress away.

Anyway ... *shrug*

A proposed solution

Part one: Direct access
With direct-access we will reduce the round-trip cost of a query from
a consumer who wants a literal object involved in a changeset: it'll
be executed directly on meta.db. You won't use libsqlite's API yourself
but libtracker-sparql; in direct-access mode, however, libtracker-sparql
is a layer on top of the aforementioned libsqlite. The so-called
"round-trip" won't even involve IPC: by utilizing the
TrackerSparqlCursor API, you'll end up doing sqlite3_step() in your own
process, directly on meta.db.

For the consumers of the signal, this removes 3.

Part two: Sending IDs
A while ago we introduced the SPARQL function tracker:id(). The
tracker:id() function gives you the unique number that Tracker's RDF
store uses internally for a resource. It's not RDF; RDF uses subject
URL strings. We just convert these internally for performance reasons,
and with tracker:id() you can access that number.

Each resource, each class and each predicate (the latter two are
resources like any other) has such a unique internal ID.

Given that Tracker's class signal system isn't RDF anyway, we decided
not to give you subject URL strings in it anymore. Instead, we'll give
you these integer IDs.

This for us removes A, B, C, D and E. For the consumers of the signal,
this removes 1. Whoohoo!

Part three: Combine SubjectsAdded and SubjectsChanged, and put
SubjectsRemoved in the same signal
So we give you two arrays: Inserts and Deletes. 

For consumers of the signal, this removes 4.

Part five: Add the class name to the signal
This allows you to use a string filter on your signal subscription in
D-Bus.

For us this removes G. For consumers of the signal, this removes 5.
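
As a rough illustration of such a string filter, here is a sketch of a
D-Bus match rule that only lets through the signal for one class. The
interface and signal names are the ones from the introspection XML in
this mail; the helper name build_match_rule and the example class name
are made up, and whether the class-name string is a prefixed name or a
full URI is an assumption.

```python
def build_match_rule(class_name):
    """Build a D-Bus match rule that filters on the signal's first argument.

    arg0 matching is standard D-Bus match-rule syntax; class_name is the
    first (string) argument of the proposed class signal.
    """
    # Inside a quoted match-rule value an apostrophe is written as '\''.
    escaped = class_name.replace("'", "'\\''")
    return ",".join([
        "type='signal'",
        "interface='org.freedesktop.Tracker1.Resources.Class'",
        "member='class-signal'",
        "arg0='%s'" % escaped,
    ])

# Hypothetical class name; the real format is whatever tracker-store emits.
rule = build_match_rule("nfo:Document")
print(rule)
```

With a rule like this the bus does the filtering, so a consumer only
wakes up for the classes it subscribed to.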

Part six: Pass the object-id for resource objects
You'll get a third number in the Inserts and Deletes arrays:
object-id. We won't send you object literals, although for integral
objects we're still discussing this. But for resource objects we can
give you the object-id without much extra cost.

For consumers of the signal, this removes 2. Whoohoo (this was a hard
one)!
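
To make the benefit concrete, here is a small sketch of a consumer
recovering <b> from the Deletes array after <a> is gone. It assumes each
entry is a (subject-id, predicate-id, object-id) triple; all the IDs
below, including the one standing in for nfo:isLogicalPartOf, are made
up, and the helper name is hypothetical.

```python
from collections import namedtuple

# One entry of the Inserts/Deletes arrays: subject, predicate and object IDs.
Change = namedtuple("Change", ["subject_id", "predicate_id", "object_id"])

def parents_of_deleted(deletes, part_of_predicate_id):
    """Map each deleted subject to the object it was a logical part of."""
    return {
        d.subject_id: d.object_id
        for d in map(Change._make, deletes)
        if d.predicate_id == part_of_predicate_id
    }

# Made-up IDs: 42 stands for nfo:isLogicalPartOf, 800/801 for <a>, 900 for <b>.
NFO_IS_LOGICAL_PART_OF = 42
deletes = [(800, 42, 900), (800, 17, 950), (801, 42, 900)]
parents = parents_of_deleted(deletes, NFO_IS_LOGICAL_PART_OF)
print(parents)
```

No query is needed here: the relation arrives with the signal itself,
which matters because <a> can no longer be queried once it is deleted.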

Part seven: SPARQL IN, tracker:id() and tracker:subject()
We recently added support for SPARQL IN, we already have tracker:id(),
and we'll implement tracker:subject().

This makes things like this possible:

SELECT ?t { ?r nie:title ?t .
            FILTER (tracker:id(?r) IN (800, 801, 802, 807)) }

Where 800, 801, 802 and 807 will be the IDs that you receive in the
class signal.
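
A minimal sketch of building that query from a list of received IDs
could look like the following. The function name title_query is made up;
forcing the IDs to integers and skipping the query for an empty list are
my own precautions, not something the signal requires.

```python
def title_query(subject_ids):
    """Return a SPARQL query for the titles of the given tracker:id() values,
    or None when there is nothing to ask for."""
    ids = sorted({int(i) for i in subject_ids})  # de-duplicate, force integers
    if not ids:
        return None  # skip the query rather than send an empty IN ()
    in_list = ", ".join(str(i) for i in ids)
    return ("SELECT ?t { ?r nie:title ?t . "
            "FILTER (tracker:id(?r) IN (%s)) }" % in_list)

print(title_query([800, 801, 802, 807]))
```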

The tracker:subject() SPARQL function will allow you to make a very
fast version of this:

SELECT ?s { ?s a rdfs:Resource .
            FILTER (tracker:id(?s) IN (800)) }

So it would be something like ... (not sure that you can omit { } in
SPARQL, though):

SELECT tracker:subject (800)

For consumers this removes most of the burden introduced by IDs.
Consumers are also advised to keep a local Map<tracker:id(), subject>
to avoid a lot of SPARQL queries. Although with direct-access it might
be just fine.
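
One possible shape for such a local map is sketched below. The class
name SubjectCache and the resolve callback are hypothetical; in a real
consumer the callback would run one SPARQL query for all missing IDs.
Whether IDs can be reused after a delete is not covered here, so
eviction is left out.

```python
class SubjectCache:
    """Remember tracker:id() -> subject so repeated IDs cost no query."""

    def __init__(self, resolve):
        # resolve: callable taking a list of IDs and returning {id: subject}.
        self._resolve = resolve
        self._known = {}

    def lookup(self, ids):
        # Only ask for IDs we have not seen yet, each of them once.
        missing = [i for i in dict.fromkeys(ids) if i not in self._known]
        if missing:
            self._known.update(self._resolve(missing))
        return {i: self._known[i] for i in ids if i in self._known}

# Stand-in resolver with made-up subjects, recording how often it is called.
calls = []
def fake_resolve(ids):
    calls.append(list(ids))
    return {i: "urn:example:%d" % i for i in ids}

cache = SubjectCache(fake_resolve)
first = cache.lookup([800, 801])
second = cache.lookup([800, 801, 802])
print(calls)
```

The second lookup only resolves 802; the other two come from the map.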

Part eight: What is left?

What is left is context switching between tracker-store and
dbus-daemon, F. But that's our problem. We'll reduce it by grouping
transactions and signals together. It's mostly a problem on ARM
hardware, but yeah that's a major and important target platform for
us. We're on it, we will care about this!
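
Just to illustrate the grouping idea (this is not how tracker-store
does or will do it), here is a sketch that collects changes per class
and emits a single signal per class when flushed, for example at the
end of a group of transactions. SignalBatcher, its methods and all IDs
are hypothetical.

```python
from collections import defaultdict

class SignalBatcher:
    """Collect changes per class and emit one signal per class on flush."""

    def __init__(self, emit):
        self._emit = emit  # callable(class_name, inserts, deletes)
        self._pending = defaultdict(lambda: ([], []))

    def add_insert(self, class_name, triple):
        self._pending[class_name][0].append(triple)

    def add_delete(self, class_name, triple):
        self._pending[class_name][1].append(triple)

    def flush(self):
        # Swap in a fresh buffer first, then emit what was collected.
        pending, self._pending = self._pending, defaultdict(lambda: ([], []))
        for class_name, (inserts, deletes) in sorted(pending.items()):
            self._emit(class_name, inserts, deletes)

emitted = []
batcher = SignalBatcher(lambda *args: emitted.append(args))
batcher.add_insert("nfo:Document", (800, 10, 900))
batcher.add_insert("nfo:Document", (801, 10, 900))
batcher.add_delete("nfo:Document", (802, 10, 900))
batcher.flush()
print(len(emitted))
```

Three changes end up in one emission, so there is one wake-up of
dbus-daemon instead of three.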

Let's take a look!

<node name="/org/freedesktop/Tracker1/Resources">
  <interface name="org.freedesktop.Tracker1.Resources.Class">
    <signal name="class-signal">
      <arg type="s" name="class-name" />
      <arg type="a(iii)" name="inserts" />
      <arg type="a(iii)" name="deletes" />
    </signal>
  </interface>
</node>

Or in short: sa(iii)a(iii). Here's a bit of pseudocode showing how
it'll look on the client side:

void m_callback (cursor) {
  while (cursor.next ()) {
    // With direct-access, each cursor.next () is an sqlite3_step ()
    // call in your own process
    print ("title: %s", cursor.get_string (0));
  }
}

// Argument order follows the signal: class-name, inserts, deletes
void on_signal (class_name, inserted, deleted) {
  string in_qry = "", qry;
  bool first = true;

  foreach (insert in inserted) {
    if (insert.subject_id is_in (my_resources)) {
      if (!first) { in_qry += ", "; }
      in_qry += insert.subject_id;
      first = false;
    }
  }

  // Nothing of interest changed: don't send an empty IN ()
  if (first) { return; }

  qry = string.printf ("SELECT ?titles { ?r nie:title ?titles . " +
                       "FILTER (tracker:id(?r) IN (%s)) }", in_qry);

  connection.query_async (qry, m_callback);
}


Cheers! :-)

Philip


-- 


Philip Van Hoof
philip codeminded be
freelance software developer
Codeminded BVBA - http://codeminded.be