I've just started looking at the performance (and locking) of the EWS Global Address List. This is a large data set (in my case 238131 records) which is downloaded from the Exchange servers.

We have to keep the original binary OAB file around because updates are provided in the form of binary deltas. So we currently end up with the full vcard data duplicated in two files side by side in the cache directory:

 -rw-r--r--. 1 dwoodhou dwoodhou 292296704 Sep 3 10:27 contacts.db
 -rw-rw-r--. 1 dwoodhou dwoodhou 130741567 Sep 3 10:05 Global Address List-1520.oab

It might well make sense to stop including the vcard in the sqlite database and instead use the 'extra data' to hold a reference to the OAB file (version and offset therein), so that the vcard can be recreated on demand.

My primary concern at the moment, though, is the speed of updates. The current implementation is extremely naïve:

 - Download/create new OAB file version
 - Remove *all* entries from the database
 - Iterate over the OAB file, adding entries
 - Update the 'oab_revision' field in the database

We add the new entries 1000 at a time, in an attempt to ensure that the process completes before the eventual heat death of the universe. It's only marginally successful; it can take about 90 minutes on my laptop. For an update which is likely to leave 99% of the entries *unchanged* by the time we're done, this is horribly slow.

One option I've considered is to store the OAB file version in the sqlite 'bdata' field. Then we could skip the initial removal step and just iterate over the new OAB file doing 'INSERT OR REPLACE' with the new OAB revision. For most records, the *only* thing that would change would be the bdata. And then finally we would do something like

  DELETE FROM folder_id WHERE bdata IS NOT '$new_oab_revision';

(There's a rough sketch of this whole flow below, after the other options.)

How inefficient is REPLACE in that case, though? I'm fairly clueless about SQL efficiency; should I worry about things like
http : / / code.openark.org/blog/mysql/replace-into-think-twice ?
(Apologies for the non-clickable link, but bug 734605 already ate one draft of this email and I'm scared to try again.)

Another option would be to get a full list of the UIDs present in the database, skip the initial removal step, and then remove each UID from that in-memory list as we insert the corresponding record while processing the OAB file. Finally, if the list is non-empty at the end, remove the remaining UIDs from the database. (If we do this, is it OK to just do the insertion and trust that if nothing has changed, nothing will be done? Or should we explicitly fetch the previous version of the record, compare, and *only* do the insertion if it has actually changed?)

This approach would avoid all the REPLACE activity, but it would mean a fairly large in-memory list (or hash table, more likely) of UIDs. And if we *do* want to store the file offset in the bdata instead of storing the full vcard in the database, then it's a non-starter; we're likely to *need* to update most of the records anyway. The bdata would take the format "$OABVER:$OFFSET" and the final deletion step would be

  DELETE FROM folder_id WHERE bdata NOT LIKE '$new_oab_revision:%';

Other approaches might include having a *different* folder table rather than the single one, although EBookSqlite seems fairly clear that we're only allowed *one* folder_id these days. And I'm not sure it really helps.

Any other ideas on how to make this happen efficiently?
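To make that first option concrete, here's roughly the flow I have in mind, written against the raw sqlite3 API purely for illustration. The table and column names (folder_id, uid, vcard, bdata) are the ones from the examples above, but the exact schema is assumed, and the real code would of course go through EBookSqlite rather than poking sqlite directly; error checking is omitted.

#include <sqlite3.h>
#include <stddef.h>

static void
update_gal_from_oab (sqlite3 *db,
                     const char *new_rev,
                     const char **uids,
                     const char **vcards,
                     size_t n_records)
{
	sqlite3_stmt *stmt = NULL;
	size_t i;

	/* One big transaction for the sketch; batching in chunks of 1000
	 * as the current code does would work the same way. */
	sqlite3_exec (db, "BEGIN", NULL, NULL, NULL);

	sqlite3_prepare_v2 (db,
		"INSERT OR REPLACE INTO folder_id (uid, vcard, bdata) "
		"VALUES (?1, ?2, ?3)", -1, &stmt, NULL);

	for (i = 0; i < n_records; i++) {
		/* For an unchanged contact this still rewrites the row,
		 * but the only value which differs is the bdata tag. */
		sqlite3_bind_text (stmt, 1, uids[i], -1, SQLITE_STATIC);
		sqlite3_bind_text (stmt, 2, vcards[i], -1, SQLITE_STATIC);
		sqlite3_bind_text (stmt, 3, new_rev, -1, SQLITE_STATIC);
		sqlite3_step (stmt);
		sqlite3_reset (stmt);
	}
	sqlite3_finalize (stmt);

	/* Anything still carrying an old tag was absent from the new OAB. */
	sqlite3_prepare_v2 (db,
		"DELETE FROM folder_id WHERE bdata IS NOT ?1",
		-1, &stmt, NULL);
	sqlite3_bind_text (stmt, 1, new_rev, -1, SQLITE_STATIC);
	sqlite3_step (stmt);
	sqlite3_finalize (stmt);

	sqlite3_exec (db, "COMMIT", NULL, NULL, NULL);
}

The final DELETE picks up everything that vanished from the GAL; whether rewriting ~238k rows just to bump their bdata tag is cheap enough is exactly the REPLACE-efficiency question above.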
The locking is also an issue to be considered. As things stand, we currently have a long period of time during which queries may return missing results, and we certainly wouldn't want to lock the database entirely for the whole duration of the update.

If we die in the *middle* of an update, we'll need to pick up where we left off. So we're going to need to *know* that we were in the middle of an update; one possible way to record that is sketched below.
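One way of knowing that we were mid-update: record the target OAB revision in a key/value slot before touching any contact rows, and only promote it to the real 'oab_revision' once the final cleanup has run. This is only a sketch; the 'keys' table and the 'oab_update_target' key are invented for illustration, and the real thing would use whatever key/value storage already holds 'oab_revision' (via EBookSqlite, not raw sqlite3).

#include <sqlite3.h>
#include <string.h>

/* Hypothetical helpers for a key/value slot; error checking omitted. */
static void
set_key (sqlite3 *db, const char *key, const char *value)
{
	sqlite3_stmt *stmt = NULL;

	sqlite3_prepare_v2 (db,
		"INSERT OR REPLACE INTO keys (key, value) VALUES (?1, ?2)",
		-1, &stmt, NULL);
	sqlite3_bind_text (stmt, 1, key, -1, SQLITE_STATIC);
	sqlite3_bind_text (stmt, 2, value, -1, SQLITE_STATIC);
	sqlite3_step (stmt);
	sqlite3_finalize (stmt);
}

static char *
get_key (sqlite3 *db, const char *key)
{
	sqlite3_stmt *stmt = NULL;
	char *value = NULL;

	sqlite3_prepare_v2 (db,
		"SELECT value FROM keys WHERE key = ?1", -1, &stmt, NULL);
	sqlite3_bind_text (stmt, 1, key, -1, SQLITE_STATIC);
	if (sqlite3_step (stmt) == SQLITE_ROW)
		value = strdup ((const char *) sqlite3_column_text (stmt, 0));
	sqlite3_finalize (stmt);
	return value;
}

/* Before the first INSERT OR REPLACE:
 *     set_key (db, "oab_update_target", new_rev);
 * After the final DELETE of stale rows:
 *     set_key (db, "oab_revision", new_rev);
 *     (and remove "oab_update_target")
 */

Then on startup, if 'oab_update_target' is present and doesn't match 'oab_revision', we know we died mid-update, and since the INSERT OR REPLACE pass is idempotent we can simply re-run it for that revision rather than starting again from nothing.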
-- dwmw2