Re: [Tracker] Database access abstraction



On Thu, 2008-11-06 at 09:39 -0500, Jamie McCracken wrote:
> it's not that clear cut

> firstly, pausing the indexer is an async process and does not have to
> be synchronous at all, so there is no round trip with direct access.

Pausing the indexer currently has to be, and is, performed synchronously.

> In most cases the indexer will be idle when querying. A client could
> even bypass it by checking to see if tracker-indexer is running

That's true, and trackerd already does it that way. However, a client
can't check whether or not the indexer is running as easily (or as
reliably) as trackerd can, unless it subscribes to a few signals
emitted by tracker-indexer (which are private API atm).

Also note that Garnacho has changed (or is changing) this
infrastructure recently. Adding Garnacho in CC.

> dbus has overhead with millions of strings - each one has to be type
> verified and copied several times. Unless you are testing with huge
> result sets, like fetching 100,000 music files with all metadata, any
> comparison is invalid. Also, as we move to flattened tables the query
> time will get faster, and this will make the dbus overhead more
> prominent

Agreed. But you probably want to fetch pages, not entire databases,
over D-Bus.

I don't think a client that really needs to fetch an entire database
has been developed correctly. Clients should be cursor-based consumers
that fetch pages of data instead of all data.
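
To make that concrete, here's a minimal sketch in C, directly against
SQLite, of what such a paged consumer looks like. The column names are
assumptions for illustration, not necessarily Tracker's actual schema:

    #include <stdio.h>
    #include <sqlite3.h>

    /* Fetch one page of results; the caller advances by bumping
     * offset by page_size. Columns are hypothetical. */
    static void
    fetch_page (sqlite3 *db, int page_size, int offset)
    {
        sqlite3_stmt *stmt;
        const char *sql = "SELECT ID, Path FROM Services "
                          "LIMIT ?1 OFFSET ?2";

        if (sqlite3_prepare_v2 (db, sql, -1, &stmt, NULL) != SQLITE_OK)
            return;

        sqlite3_bind_int (stmt, 1, page_size);
        sqlite3_bind_int (stmt, 2, offset);

        while (sqlite3_step (stmt) == SQLITE_ROW)
            printf ("%d %s\n",
                    sqlite3_column_int (stmt, 0),
                    (const char *) sqlite3_column_text (stmt, 1));

        sqlite3_finalize (stmt);
    }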

> In other cases, like say using tracker as a gconf backend, you will
> often get calls to fetch individual keys, and these need to be in
> process to avoid round trips for each key

Tracker as a gconf backend?! Let's just use either DConf or GConf for
that. I don't really see the point.

> ergo it's premature to rule out direct access at this point

Yet if we want to insert massive amounts of data fast, then we need to
monopolize the SQLite tables by opening (large) transactions.

We can periodically commit these transactions, but that means the
latency before you can query is at most the duration of one full
transaction, and on average half that duration, unless you instruct
the indexer (using IPC) to shortcut its currently running transaction
and commit it early. For example, with transactions that take (say)
30 seconds, a query can be blocked for up to 30 seconds, and for 15
seconds on average.

Until the indexer has reacted to that, you'll have to wait.

No matter how asynchronously you can DBus-call the indexer, you'll still
have to wait for the transaction to be committed.

Waiting for this is at least the IPC latency plus the amount of time the
indexer itself needed to shortcut and commit the transaction.

The indexer can only respond once its current file has been processed,
as that is the granularity of the state machine that operates the
indexer.

So you always have at least the overhead of DBus plus the overhead of
the indexer, although you can indeed hope that the indexer isn't
running anymore at the time of launching a query. Yet you can't be
sure of that.


On Thu, 2008-11-06 at 11:29 +0100, Philip Van Hoof wrote:
Hi, this is a mail that I once sent to a few people at Nokia who wanted
direct access to SQL too:


Hi guys,

I made a very simple test case that selects a few columns out of the
Services table, with a LIMIT of 1 up to 100.

After I finished the "over DBus" version of it, and while I was
measuring its performance, I was already confident that, as konttori
pointed out too, the DBus overhead truly is minimal when compared to
query time.

I could write the same test with direct access and it would most
likely shave off another few tenths of a millisecond. But for a UI
application I don't really see the point in that (a mainloop iteration
that has to do a few exposes and draws is likely going to take
longer).

So I attached a vala app for testing this and I included the generated
C source code for it. `pkg-config dbus-glib-1 --cflags --libs` it and
for the Vala stuff take a look here:
http://live.gnome.org/Vala/DBusSample

So this is over DBus:

        pvanhoof tinc:~/test$ ./test-sql-tracker 
        
        ...
       
        0.035126 seconds elapsed
        pvanhoof tinc:~/test$


Now, with regard to the queuing:

SQLite takes a write lock per transaction, and it keeps all tables in
the connection involved in the transaction locked. A write lock means
that while we (Tracker) are writing, you (your process, your
connection, your *direct* connection to the SQLite tables indeed) are
locked out.

Tracker writes in long transactions because SQLite is 50 or 60 times
faster at writing if you group lots of writes together. If you don't
do this, you also hit the fsync() problem more often. This is similar
to the problem Firefox started having when it switched to SQLite for
several things.

In short: this means that we can't turn off our use of transactions.
It's vital to Tracker's design and performance.
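
To illustrate the pattern (a sketch, not Tracker's actual code):
grouping many INSERTs into one explicit transaction is what makes that
50-60x difference, because SQLite otherwise wraps every statement in
its own transaction with its own fsync(). The Files table below is
hypothetical:

    #include <sqlite3.h>

    /* Batched writes: one BEGIN/COMMIT pair around many INSERTs.
     * Without it, each INSERT is its own transaction and pays a
     * full fsync(). Table and column names are made up. */
    static void
    insert_batch (sqlite3 *db, char **uris, int n)
    {
        sqlite3_stmt *stmt;
        int i;

        sqlite3_exec (db, "BEGIN", NULL, NULL, NULL);
        sqlite3_prepare_v2 (db, "INSERT INTO Files (Uri) VALUES (?1)",
                            -1, &stmt, NULL);

        for (i = 0; i < n; i++) {
            sqlite3_bind_text (stmt, 1, uris[i], -1, SQLITE_STATIC);
            sqlite3_step (stmt);
            sqlite3_reset (stmt);
        }

        sqlite3_finalize (stmt);
        sqlite3_exec (db, "COMMIT", NULL, NULL, NULL);
    }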

Preempting these transactions requires very strict communication
between the indexer and the front-end query mechanism. This implies a
synchronous DBus message to the indexer, instructing it to do a
preemptive commit of its standing transaction.

This is possible because the transaction is not done for atomicity but
only for improving SQLite's write speed (so we can commit it early,
and we do that).

So even if we'd make a library with a direct connection to the SQLite
database (in process, with the app linking against that library), then
that library, just like trackerd (the front-end query mechanism),
would still need to send a DBus message to request a preemptive commit
from the indexer. That makes a direct connection pointless in the
first place, because you still have the DBus overhead for this one
message anyway.
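
As a sketch of that one unavoidable message, using dbus-glib. The
service, path, interface and method names below are invented for
illustration; they are not Tracker's confirmed DBus API:

    #include <dbus/dbus-glib.h>

    /* Synchronously ask the indexer to commit its open transaction
     * early. All names here are hypothetical. */
    static gboolean
    request_early_commit (GError **error)
    {
        DBusGConnection *bus;
        DBusGProxy *proxy;
        gboolean ok;

        bus = dbus_g_bus_get (DBUS_BUS_SESSION, error);
        if (bus == NULL)
            return FALSE;

        proxy = dbus_g_proxy_new_for_name (bus,
                                           "org.example.Indexer",
                                           "/org/example/Indexer",
                                           "org.example.Indexer");

        /* Blocks until the indexer has actually committed. */
        ok = dbus_g_proxy_call (proxy, "CommitTransaction", error,
                                G_TYPE_INVALID, G_TYPE_INVALID);

        g_object_unref (proxy);
        return ok;
    }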


Some pointers:

http://www.sqlite.org/faq.html#q5
http://www.sqlite.org/lang_transaction.html


What would be possible, to avoid queuing in case a lot of applications
want to query Tracker concurrently, is to introduce a connection pool
for Tracker's query front end, with a queue for each connection in the
pool.

I must warn that, although very much possible, this solution adds
complexity. Unless it's proven that concurrent access will occur very
often, I don't think it's worthwhile to implement this solution right
now.

But we can keep it in mind for when the day comes.
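
For the record, a rough sketch of the pooling idea, and nothing more:
N connections to the same database handed out round-robin. A real
implementation would add the per-connection queues, locking and error
handling; the names here are made up:

    #include <sqlite3.h>

    #define POOL_SIZE 4

    /* Hypothetical connection pool for the query front end. */
    typedef struct {
        sqlite3 *conns[POOL_SIZE];
        int      next;
    } ConnectionPool;

    static int
    pool_open (ConnectionPool *pool, const char *path)
    {
        int i;

        pool->next = 0;
        for (i = 0; i < POOL_SIZE; i++) {
            if (sqlite3_open (path, &pool->conns[i]) != SQLITE_OK)
                return -1;
        }
        return 0;
    }

    static sqlite3 *
    pool_get (ConnectionPool *pool)
    {
        sqlite3 *conn = pool->conns[pool->next];

        pool->next = (pool->next + 1) % POOL_SIZE;
        return conn;
    }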

/EO my advice ;)

On Tue, 2008-10-21 at 08:47 -0400, Jamie McCracken wrote:
On Tue, 2008-10-21 at 09:36 +0100, Martyn Russell wrote:


>> Also in the future I want to support direct access to sqlite via a
>> client lib so we can bypass dbus (and trackerd) for select queries
>> where speed is paramount and the volume of data is too big for dbus
>> to handle optimally (think: get all my 100,000 music tracks with
>> metadata). So this library would have to handle all querying and any
>> future ones (like sparql) - so you will have no problem from me for
>> implementing that support in a lib

> Hmm, I would like to see the difference it makes using DBus and if it
> really is an issue. We have an API like this in DBus now, which
> Philip added - I really don't like the idea of people executing
> random SQL on the databases. It can lead to much bigger problems.
> Philip stresses this in the .xml file where we document this API. I
> think quite rightly so too.

That should be moved to a direct access lib.

The advantage of a direct access lib is that it removes dbus overhead
when large amounts of data are required.

Rob Taylor probably knows this better, but from what I understand dbus
is not optimal for large payloads (~1MB+), and something like "get all
music and metadata" might involve a million-plus strings which dbus
would have to marshal, strdup and validate individually into multiple
packets (IIRC packet size is 4KB?), so you are looking at massive
overhead with multiple ipc calls.



_______________________________________________
tracker-list mailing list
tracker-list gnome org
http://mail.gnome.org/mailman/listinfo/tracker-list



-- 
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be 
gnome: pvanhoof at gnome dot org 
http://pvanhoof.be/blog
http://codeminded.be