Re: Revamped Duplicates Patch



Hi Gabriel,

2006/3/15, Gabriel Burt <gabriel burt gmail com>:
> On Mon, 2006-03-13 at 22:51 +0100, Thomas Van Machelen wrote:
> > Last weekend I took the Duplicates patch that was sent to this list a
> > few months ago, and updated it so that it works with current CVS
> > HEAD.  I added some updater code that adds a new md5sum field to the
> > photos table, so make sure you back up your database first.
>
> This patch looks pretty promising.  Some minor things:
>

Thanks for having a look at it.

> 1) In the database upgrade code, you can assume that the update you're
> adding won't be run if it's not needed.  Meaning you can take out the
> check for has_md5.  Also, since you go ahead and compute the MD5 hash of
> all photos after altering the database, that seems like it should either
> open the progress dialog (make it a 'slow' update) or it should happen
> in the background once f-spot has started.  Perhaps it should ask the
> user if they want it to search for duplicates, too?
>

Point taken, I will adjust the code.

> 2) You create the Duplicate tag, but make sure when you use it that it
> does indeed exist, because it can be deleted by the user.
>

This should normally already be the case.  Did you encounter a
situation where the Duplicate tag was not created prior to finding
duplicates?

> 3) A ChangeLog entry would be great - this is a massive patch.
>

Will do.

> 4) Can we just use the byte [] for the md5 sum instead of turning it
> into a string?  Anything to make it a bit faster..
>

Again, good remark.  Shouldn't be too hard, so I'll do it ;-)
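To illustrate the idea (f-spot is C#, but a Python sketch makes the point just as well): hashing libraries can hand back the raw digest bytes directly, so there is no need to round-trip through a hex string before storing or comparing. The `md5_digest` helper below is a hypothetical stand-in, not f-spot code.

```python
import hashlib

def md5_digest(path):
    """Compute the raw 16-byte MD5 digest of a file.

    Reading in chunks keeps memory flat for large photos; returning
    .digest() (16 bytes) instead of .hexdigest() (a 32-char string)
    skips the string conversion entirely.
    """
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.digest()
```

Comparing two 16-byte values is also cheaper than comparing 32-character strings, which matters when the comparison runs once per photo.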

> 5) Speaking of performance, it appears that there are several easy
> changes that would speed it way up.
>
> In the worst case (which will actually happen fairly often in practice),
> your algorithm runs IsDuplicate for all N photos, which involves doing a
> query on the database (which it shouldn't) and (normal case,
> not-duplicate) doing N comparisons.  So it's O(n^2) time worst/normal
> case, which would not be pretty for Nat's 50,000 photos.
>
> This can be sped up considerably by building a hashtable of all the
> MD5s and, while constructing it, checking for duplicates before
> inserting each key/value pair.

The first step was to make the original patch more or less work again.
Improving speed was the second step.  Thanks for your advice, I think
I'll implement it this way.
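The hashtable approach Gabriel describes could be sketched roughly like this (again in Python for brevity; the photo objects and the `get_md5` callback are hypothetical placeholders for the real f-spot types):

```python
def find_duplicates(photos, get_md5):
    """Find duplicate photos in a single O(n) pass.

    Instead of calling IsDuplicate per photo (O(n^2) pairwise
    comparisons in the worst case), insert each digest into a dict
    and check for a collision *before* inserting.

    `photos` is an iterable of photo objects; `get_md5` returns the
    raw digest bytes for one photo.  Returns (original, duplicate)
    pairs, keeping the first-seen photo as the original.
    """
    seen = {}         # digest -> first photo seen with that digest
    duplicates = []
    for photo in photos:
        digest = get_md5(photo)
        if digest in seen:
            duplicates.append((seen[digest], photo))
        else:
            seen[digest] = photo
    return duplicates
```

With 50,000 photos this is 50,000 hash lookups instead of up to ~2.5 billion comparisons, and no per-photo database query.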

> This ties into my next point, that I think users would
> expect doing a Find Duplicates to not search just the photos they have
> selected.  If we can make it so that checking all files for being
> duplicates is fast, that should probably be the only option.
>

I don't know about this one.  I'll wait for some more feedback to
decide what would be the best approach here.

> Thanks for your hard work!  This looks great.  With some more work,
> hopefully we'll get this in soon.
>

That would be nice :-)  I won't make any promises on the date, though,
as I have a couple of busy days ahead.

Regards,
Thomas
