Re: Revamped Duplicates Patch
- From: "Thomas Van Machelen" <thomas vanmachelen gmail com>
- To: "Gabriel Burt" <gabriel burt gmail com>
- Cc: f-spot-list gnome org
- Subject: Re: Revamped Duplicates Patch
- Date: Wed, 15 Mar 2006 13:40:33 +0100
Hi Gabriel,
2006/3/15, Gabriel Burt <gabriel burt gmail com>:
> On Mon, 2006-03-13 at 22:51 +0100, Thomas Van Machelen wrote:
> > Last weekend I took the Duplicates patch that was sent to this list a
> > few months ago and updated it so it works with current cvs head. I added
> > some updater code that adds a new md5sum field to the photos
> > table, so make sure you take a backup of your database first.
>
> This patch looks pretty promising. Some minor things:
>
Thanks for having a look at it.
> 1) In the database upgrade code, you can assume that the update you're
> adding won't be run if it's not needed. Meaning you can take out the
> check for has_md5. Also, since you go ahead and compute the MD5 hash of
> all photos after altering the database, that seems like it should either
> open the progress dialog (make it a 'slow' update) or it should happen
> in the background once f-spot has started. Perhaps it should ask the
> user if they want it to search for duplicates, too?
>
Point taken, I will adjust the code.
> 2) You create the Duplicate tag, but make sure when you use it that it
> does indeed exist, because it can be deleted by the user.
>
This should normally already be the case. Did you encounter a
situation where the Duplicate tag was not created prior to finding
duplicates?
> 3) A ChangeLog entry would be great - this is a massive patch.
>
Will do.
> 4) Can we just use the byte [] for the md5 sum instead of turning it
> into a string? Anything to make it a bit faster..
>
Again, good remark. Shouldn't be too hard, so I'll do it ;-)
> 5) Speaking of performance, it appears that there are several easy
> changes that would speed it way up.
>
> In the worst case (which will actually happen fairly often in practice),
> your algorithm runs IsDuplicate for all N photos, which involves doing a
> query on the database (which it shouldn't) and (normal case,
> not-duplicate) doing N comparisons. So it's O(n^2) time worst/normal
> case, which would not be pretty for Nat's 50,000 photos.
>
> This can be sped way up by building a hashtable of all the MD5s and,
> while constructing it, checking for duplicates before inserting each
> key/value pair.
The first step was to make the original patch more or less work again;
improving speed was the second step. Thanks for your advice, I think
I'll implement it this way.
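The hashtable approach Gabriel describes can be sketched like this (Python for brevity, since f-spot itself is C#; `photos` and `md5_of` are illustrative names, not f-spot API). One pass over the collection gives expected O(n) time instead of O(n^2):

```python
def find_duplicates(photos, md5_of):
    """Single pass over all photos: map each MD5 to the first photo
    that produced it; any later photo with the same MD5 is a duplicate.

    `photos` is an iterable of photo identifiers and `md5_of` returns
    a photo's raw MD5 digest (see point 4: bytes, not a hex string).
    """
    seen = {}        # md5 digest -> first photo with that digest
    duplicates = []  # (duplicate, original) pairs
    for photo in photos:
        key = md5_of(photo)
        if key in seen:
            duplicates.append((photo, seen[key]))
        else:
            seen[key] = photo
    return duplicates
```

Because the check happens while the table is built, no per-photo database query is needed, which is the other cost Gabriel points out.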
> This ties into my next point: I think users would expect Find
> Duplicates to search all their photos, not just the ones they have
> selected. If we can make checking all files for duplicates fast,
> that should probably be the only option.
>
I don't know about this one. I'll wait for some more feedback to
decide what would be the best approach here.
> Thanks for your hard work! This looks great. With some more work,
> hopefully we'll get this in soon.
>
That would be nice :-) I won't make any promises on the date though, as
I have a couple of busy days ahead.
Regards,
Thomas