Re: Revamped Duplicates Patch



On Mon, 2006-03-13 at 22:51 +0100, Thomas Van Machelen wrote:
> Last weekend I took the Duplicates patch that was sent to this list a
> few months ago, and updated it so works with current cvs head.  I added
> some updater code so that it adds a new md5sum field to the photos
> table, so make sure you take a backup of your database first.

This patch looks pretty promising.  Some minor things:

1) In the database upgrade code, you can assume that the update you're
adding won't be run if it's not needed.  Meaning you can take out the
check for has_md5.  Also, since you go ahead and compute the MD5 hash of
all photos after altering the database, that seems like it should either
open the progress dialog (make it a 'slow' update) or it should happen
in the background once f-spot has started.  Perhaps it should ask the
user if they want it to search for duplicates, too?

2) You create the Duplicate tag, but make sure when you use it that it
does indeed exist, because it can be deleted by the user.

3) A ChangeLog entry would be great - this is a massive patch.

4) Can we just use the byte [] for the md5 sum instead of turning it
into a string?  Anything to make it a bit faster..

5) Speaking of performance, it appears that there are several easy
changes that would speed it way up.

In the worst case (which will actually happen fairly often in practice),
your algorithm runs IsDuplicate for all N photos, which involves doing a
query on the database (which it shouldn't) and (normal case,
not-duplicate) doing N comparisons.  So it's O(n^2) time worst/normal
case, which would not be pretty for Nat's 50,000 photos.

This can be speeded way up by making a hashtable of all the MD5s and
while constructing it, checking for duplicates before inserting a
key/value pair.  This ties into my next point, that I think users would
expect doing a Find Duplicates to not search just the photos they have
selected.  If we can make it so that checking all files for being
duplicates is fast, that should probably be the only option.

Thanks for your hard work!  This looks great.  With some more work,
hopefully we'll get this in soon.

Gabriel Burt




[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]