[Evolution-hackers] Refactoring EWS GAL sqlitedb



I've just started looking at the performance (and locking) of the EWS
Global Address List.

This is a large data set (in my case 238131 records) which is downloaded
from the Exchange servers. We have to keep the original binary OAB file
around because updates are provided in the form of binary deltas. So we
currently end up with the full vcard data duplicated in two files side
by side in the cache directory:

-rw-r--r--. 1 dwoodhou dwoodhou 292296704 Sep  3 10:27 contacts.db
-rw-rw-r--. 1 dwoodhou dwoodhou 130741567 Sep  3 10:05 Global Address List-1520.oab

It might well make sense to stop including the vcard in the sqlite
database and instead use the 'extra data' to hold a reference to the
OAB file (version and offset therein), so that the vcard can be
recreated on demand.
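
A rough sketch of how a lookup would go under that scheme, with purely
illustrative names and values (the bdata format itself is spelled out
further down):

  -- bdata points into the OAB file instead of duplicating the vcard;
  -- the uid, column names and offset below are illustrative only
  SELECT bdata FROM folder_id WHERE uid = '$uid';
  -- returns e.g. '1520:48392107', i.e. OAB version 1520, byte offset
  -- 48392107; the vcard is then rebuilt on demand by parsing the OAB
  -- record at that offset in 'Global Address List-1520.oab'.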

My primary concern at the moment, though, is the speed of updates. The
current implementation is extremely naïve:
  - Download/create new OAB file version
  - Remove *all* entries from database
  - Iterate over OAB file, adding entries
  - Update 'oab_revision' field in database

We add the new entries 1000 at a time, in an attempt to ensure that the
process completes before the eventual heat death of the universe. It's
only marginally successful; it can take about 90 minutes on my laptop.
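
Expressed as raw SQL (the real code goes through EBookSqlite rather
than issuing statements directly, and the table/column names here are
only illustrative), the cycle amounts to roughly this:

  -- remove *all* entries from the database
  DELETE FROM folder_id;
  -- then, once per batch of 1000 records parsed from the new OAB file:
  BEGIN;
  INSERT INTO folder_id (uid, vcard) VALUES ('$uid', '$vcard');
  -- ...999 more of these...
  COMMIT;
  -- and finally update the stored 'oab_revision' to match the new file.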

For an update which is likely to leave 99% of the entries *unchanged* by
the time we're done, this is horribly slow.

One option I've considered is to store the OAB file version in the
sqlite 'bdata' field; then we could skip the initial removal step and
just iterate over the new OAB file doing 'INSERT OR REPLACE' with the
new OAB revision. For most records, the *only* thing that would change
would be the bdata. And then finally we would do something like
 DELETE FROM folder_id WHERE bdata IS NOT '$new_oab_revision';
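
Pulled together (same illustrative names as above, and wrapped in a
single transaction purely for the sake of the sketch), that scheme
would look something like:

  BEGIN;
  -- once per record parsed from the new OAB file:
  INSERT OR REPLACE INTO folder_id (uid, vcard, bdata)
         VALUES ('$uid', '$vcard', '$new_oab_revision');
  -- anything the loop never touched still carries the old revision in
  -- its bdata, so it no longer exists on the server:
  DELETE FROM folder_id WHERE bdata IS NOT '$new_oab_revision';
  COMMIT;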

How inefficient is REPLACE in that case, though? I'm fairly clueless
about SQL efficiency; should I worry about things like http : / /
code.openark.org/blog/mysql/replace-into-think-twice ?

(Apologies for the non-clickable link but bug 734605 already ate one
draft of this email and I'm scared to try again.)
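
For what it's worth, SQLite resolves a REPLACE conflict by deleting the
offending row and inserting a brand-new one, so even a record whose
data is completely unchanged still pays for a delete plus an insert and
the index churn that goes with it. One possible alternative (just a
sketch, same illustrative names as above) is to UPDATE first and only
INSERT when nothing was updated:

  -- once per record parsed from the new OAB file:
  UPDATE folder_id SET vcard = '$vcard', bdata = '$new_oab_revision'
   WHERE uid = '$uid';
  -- if that touched zero rows (sqlite3_changes() == 0 from C), the
  -- contact is new:
  INSERT INTO folder_id (uid, vcard, bdata)
         VALUES ('$uid', '$vcard', '$new_oab_revision');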

Another option would be to fetch the full list of UIDs present in the
database and skip the initial removal step; then, for each record we
insert while processing the OAB file, remove that UID from our
in-memory list. Finally, if the list is non-empty at the end, remove
the remaining UIDs from the database.

(If we do this, is it OK to just do the insertion and trust that if
nothing has changed, nothing will be done? Or should we explicitly fetch
the previous version of the record and compare, and *only* do the
insertion if it's actually changed?)
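
The explicit-compare variant would be something along these lines
(again only a sketch):

  -- once per record parsed from the new OAB file:
  SELECT vcard FROM folder_id WHERE uid = '$uid';
  -- if no row comes back, INSERT the new contact; if the stored vcard
  -- differs from the freshly generated one, UPDATE it; otherwise leave
  -- the row alone and just note the uid as seen.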

This approach would avoid all the REPLACE activity, but it would mean a
fairly large in-memory list (or hash table, more likely) of UIDs.
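
If the in-memory list turns out to be too big, the same bookkeeping
could instead be pushed into SQLite as a temporary table of seen UIDs;
this is a variant of the idea above rather than exactly what was
described, and the names are again illustrative:

  CREATE TEMP TABLE seen_uids (uid TEXT PRIMARY KEY);
  -- once per record parsed from the new OAB file (after doing whatever
  -- insert/update that record needs):
  INSERT OR IGNORE INTO seen_uids VALUES ('$uid');
  -- afterwards, anything we never saw has disappeared from the GAL:
  DELETE FROM folder_id WHERE uid NOT IN (SELECT uid FROM seen_uids);
  DROP TABLE seen_uids;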

And if we *do* want to store the file offset in the bdata instead of
storing the full vcard in the database, then it's a non-starter: the
offsets shift with every new OAB file, so we're likely to *need* to
update most of the records anyway. The bdata would take the format
"$OABVER:$OFFSET" and the final deletion step would be
 DELETE FROM folder_id WHERE bdata NOT LIKE '$new_oab_revision:%';


Other approaches might include using a *different* folder table rather
than the single one, although EBookSqlite seems fairly clear that we're
only allowed *one* folder_id these days. And I'm not sure it really
helps.

Any other ideas on how to make this happen efficiently?

The locking is also an issue to be considered. As things stand, there
is a long period during which queries may return missing results. And
we certainly wouldn't want to lock the database
entirely for the whole duration of the update. If we die in the *middle*
of an update, we'll need to pick up where we left off. So we're going to
need to *know* that we were in the middle of an update.
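
One sketch of how that could work, assuming a simple key/value table
alongside the contacts (EBookSqlite does have a key/value API that
could hold these; the 'keys' table and the key names below are only
illustrative):

  -- note the revision we are moving to *before* touching any contacts:
  INSERT OR REPLACE INTO keys (key, value)
         VALUES ('oab_update_target', '$new_oab_revision');
  -- ...perform the incremental update in batches as sketched above...
  -- only once the final DELETE of stale rows has committed:
  INSERT OR REPLACE INTO keys (key, value)
         VALUES ('oab_revision', '$new_oab_revision');
  DELETE FROM keys WHERE key = 'oab_update_target';
  -- at startup, an 'oab_update_target' that differs from 'oab_revision'
  -- means we died mid-update and need to resume or restart it.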

-- 
dwmw2
