[Shotwell] Bulk Import?

Bruno Girin brunogirin at gmail.com
Mon Apr 9 01:09:02 UTC 2012


Hi Oliver,

On 07/04/12 13:55, oliver wrote:
> Hello,
>
> The import of my f-spot folder has been running since last evening...
> maybe it has been running for more than about ten hours now...
>
> When I tried it the first time, I used f-spot import.
> When it took too long I tried strace and ltrace on it
> (while shotwell was already running).
> The latter one killed shotwell.
>
> (It is an old shotwell version; I tested the ltrace killing issue
> with a current version of shotwell on a different machine/system
> and the problem was gone.)
>
> What I experienced when importing, at least with the old shotwell,
> was that it imports slowly.
> A lot of stuff is going on... I could see this when using ltrace
> directly (instead of attaching it later).

That's because it does a lot of things. The F-Spot import works a bit
like this:
- Do a single SELECT on the F-Spot DB to get the full list of photos,
- Then for each photo:
  - Do a SELECT on the photo_versions table and on the tags table to
load each version of the photo independently,
  - Check whether the photo already exists in the Shotwell database,
  - If not, insert it together with all its tags and its event (in
itself a few INSERTs into the Shotwell DB).
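
To make the shape of that loop concrete, here is a rough sketch in
Python/sqlite3 (Shotwell itself is written in Vala, and the table and
column names below are simplified guesses, not the real schemas); it
only shows the per-photo query pattern:

import sqlite3

def import_from_fspot(fspot_path, shotwell_path):
    # Connections are opened once and re-used for the whole import.
    fspot = sqlite3.connect(fspot_path)
    shotwell = sqlite3.connect(shotwell_path)

    # One SELECT for the full photo list...
    photos = fspot.execute("SELECT id FROM photos").fetchall()

    for (photo_id,) in photos:
        # ...then several queries per photo: its versions and its tags.
        versions = fspot.execute(
            "SELECT filename FROM photo_versions WHERE photo_id = ?",
            (photo_id,)).fetchall()
        tags = fspot.execute(
            "SELECT t.name FROM tags t"
            " JOIN photo_tags pt ON pt.tag_id = t.id"
            " WHERE pt.photo_id = ?", (photo_id,)).fetchall()

        for (filename,) in versions:
            # Skip versions that are already in the Shotwell DB.
            if shotwell.execute("SELECT 1 FROM PhotoTable WHERE filename = ?",
                                (filename,)).fetchone():
                continue
            # Otherwise insert the photo plus one row per tag (the event
            # insert is omitted here for brevity).
            shotwell.execute("INSERT INTO PhotoTable (filename) VALUES (?)",
                             (filename,))
            for (tag,) in tags:
                shotwell.execute("INSERT INTO TagTable (name) VALUES (?)",
                                 (tag,))

    shotwell.commit()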

>
> The same also holds true for the current version of shotwell.
> There are a lot of calls going on even when an empty
> archive is used.

Part of that is also due to the following aspects of the F-Spot import:
- In the latest version, F-Spot import is a plugin so calls go through
the SPIT API and then through the SPIT extension point interfaces before
hitting the F-Spot plugin itself.
- The F-Spot plugin is built to auto-detect the version of the F-Spot
database being read and adjust its behaviour accordingly: this is
implemented by an intermediate layer of data access objects, which in
turn means additional intermediary calls.
- The actual import into the Shotwell database is performed in a
background thread, which means that individual photos are handed to
background jobs that perform the actual inserts in the Shotwell DB once
they have been loaded from the F-Spot DB: this adds yet more calls but
is essential to make sure you can still use Shotwell even while it's
importing data.
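
If it helps to picture that last point, the hand-off works along these
lines (again just a Python-style sketch of the general pattern, not the
actual Shotwell/SPIT job classes):

import queue
import threading

# Photos loaded from the F-Spot DB are queued up and a background worker
# performs the Shotwell DB inserts, so the UI thread stays responsive.
job_queue = queue.Queue()

def worker(insert_photo):
    while True:
        photo = job_queue.get()
        if photo is None:        # sentinel: import finished or cancelled
            break
        insert_photo(photo)      # the actual INSERTs into the Shotwell DB

def start_import(photos, insert_photo):
    thread = threading.Thread(target=worker, args=(insert_photo,))
    thread.start()
    for photo in photos:         # loaded from the F-Spot DB, one at a time
        job_queue.put(photo)
    job_queue.put(None)
    return thread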

>
> From what I saw in the ltrace outputs it seems to me
> that every picture file is handled separately (maybe each one as
> an object), and importing a file means: creating an object,
> which individually connects to sqlite.

This is correct. For each photo, there are multiple sqlite SELECT
queries to load the data from the F-Spot DB. There are then multiple
sqlite INSERT calls to store it into the Shotwell DB.
Having said this, the objects that connect to sqlite are created only
once and re-used throughout the import.
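
In sqlite terms the difference is roughly this (illustrative Python with
a hypothetical table, just to show what "created once and re-used" means
here):

import sqlite3

# The connection and cursor are created once, outside the loop; only the
# per-photo queries repeat. Re-opening the connection for every photo is
# what would be genuinely wasteful, and that is not what happens.
conn = sqlite3.connect("photo.db")
cur = conn.cursor()
for photo_id in range(1, 1001):
    cur.execute("SELECT 1 FROM photos WHERE id = ?", (photo_id,))
    cur.fetchone()
conn.close()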

>
> Just from that (without looking at the code) I think
> it would make sense to have an internal representation of the data
> not only for one picture, but for a bunch of pictures,
> and to do a bulk-insertion operation into the database,
> instead of inserting the files into the database individually.

Doing a bulk insert could potentially provide performance improvements.
You'd have to deal with a few complications though:
- Inserting photos one by one means that 1) progress reporting is smooth
and 2) the user can cancel the import at any time and have it stop
immediately. If inserts were bulked, the code would only be able to
abort between blocks of inserts.
- The object tree being persisted is not flat: each photo comes with an
event and a number of tags so even though you would be able to
bulk-insert the photos, it would be a lot more difficult to do that for
tags.
- The combination of the two points above would make it quite difficult
to ensure that the Shotwell DB ends up in a consistent state if you were
to cancel the import half-way through or if writing to the Shotwell DB
failed at any point during the insert.
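
To give an idea of what that would look like, here is a hedged sketch of
batched inserts (Python/sqlite3, hypothetical schema); note where
cancellation can actually take effect:

import sqlite3

def bulk_import(conn, photos, batch_size=1000, cancelled=lambda: False):
    batch = []
    for photo in photos:
        batch.append((photo,))
        if len(batch) >= batch_size:
            flush(conn, batch)
            batch = []
            if cancelled():   # can only react between batches, not mid-batch
                return
    if batch:
        flush(conn, batch)

def flush(conn, batch):
    # 'with conn' commits the whole batch or rolls it back on error, so the
    # DB never ends up with half a batch written -- but the tags and events
    # belonging to those photos would have to be kept in the same transaction
    # to stay consistent, which is where the real complexity lies.
    with conn:
        conn.executemany("INSERT INTO PhotoTable (filename) VALUES (?)", batch)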

>
> Can any of the shotwell developers who know the internals
> of the code confirm or reject my assumption about individual
> sqlite accesses?
>
> And if that's the case, would it be possible to have a bulk insertion feature?
> So that - for example - bunches of e.g. 1000 files would be inserted in one
> operation, instead of inserting each file individually?
>
> It seems that the sqlite access eats up a lot of time
> when adding a massive number of files to the database.

You'd need to see examples of bulk insert timings against repeated
individual insert timings to confirm that bulk inserts are actually more
efficient and to get an estimate of what performance improvement you
could get. So any data you have that demonstrates a difference in
performance between the two approaches would be useful (not necessarily
Shotwell related; basic sqlite comparisons would be good to see as I
really don't know how sqlite behaves under load).
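
I don't have such numbers myself, but a self-contained comparison could
be as simple as the following (plain sqlite, nothing Shotwell-specific;
the absolute numbers will depend heavily on the disk and sqlite version):

import os
import sqlite3
import tempfile
import time

N = 5000
rows = [("photo_%d.jpg" % i,) for i in range(N)]
tmpdir = tempfile.mkdtemp()

def setup(name):
    conn = sqlite3.connect(os.path.join(tmpdir, name))
    conn.execute("CREATE TABLE photos (filename TEXT)")
    return conn

# One INSERT and one commit per photo, as in the current per-photo approach.
conn = setup("individual.db")
start = time.time()
for row in rows:
    conn.execute("INSERT INTO photos (filename) VALUES (?)", row)
    conn.commit()
print("individual commits: %.2fs" % (time.time() - start))

# All rows in a single transaction, as a bulk insert would do.
conn = setup("bulk.db")
start = time.time()
with conn:
    conn.executemany("INSERT INTO photos (filename) VALUES (?)", rows)
print("one bulk transaction: %.2fs" % (time.time() - start))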

Then you would have to deal with the complications detailed above, which
would likely add complexity to the Shotwell import code. The code is
already quite complex as it is, so I'd rather avoid making it more
complex if at all possible.

Having said this, any performance or memory usage analysis is always
useful, so if you've found any sequence of calls that seems wasteful to
you, don't hesitate to share the trace output and we can discuss
individual cases.

Cheers,

Bruno



