Re: Adding "Find Duplicates" feature to F-Spot

From: Steve Rosen <steve sjrosen mailshell com>
To: F-Spot <f-spot-list gnome org>
Subject: Re: Adding "Find Duplicates" feature to F-Spot
Date: Mon, 20 Jun 2005 15:54:07 -0400

One possible suggestion: Create MD5 values in a separate thread at low priority during normal use of F-Spot. MD5s would then be created in the background without interrupting the user's work. It wouldn't slow down import at all. It would only slow finding duplicates if MD5s had not been created for all the photos selected for duplicate scanning.

This feature could be optional, turned off by default, but turned on subsequently if the user selects the Find Duplicates menu item or selects to find duplicates during photo import.
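A minimal sketch of the background hashing suggested above, assuming a plain worker thread and an in-memory result table. The class name BackgroundHasher and its members are hypothetical and not part of F-Spot; thread priority is only a hint to the scheduler and its effect depends on the platform.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Threading;

// Hypothetical helper (not F-Spot code): hashes queued files on a low-priority thread.
public sealed class BackgroundHasher
{
    private readonly Queue<string> pending = new Queue<string>();
    private readonly Dictionary<string, string> hashes = new Dictionary<string, string>();
    private readonly object gate = new object();

    public BackgroundHasher()
    {
        Thread worker = new Thread(Run);
        worker.IsBackground = true;               // do not keep the process alive on exit
        worker.Priority = ThreadPriority.Lowest;  // let UI and import work run first
        worker.Start();
    }

    // Queue a photo file for hashing and wake the worker.
    public void Enqueue(string path)
    {
        lock (gate) {
            pending.Enqueue(path);
            Monitor.Pulse(gate);
        }
    }

    // Returns true if the hash for this path has already been computed.
    public bool TryGetHash(string path, out string md5)
    {
        lock (gate) {
            return hashes.TryGetValue(path, out md5);
        }
    }

    private void Run()
    {
        while (true) {
            string path;
            lock (gate) {
                while (pending.Count == 0)
                    Monitor.Wait(gate);           // sleep until Enqueue pulses
                path = pending.Dequeue();
            }
            string md5;
            try { md5 = ComputeMd5(path); }       // hash outside the lock
            catch (IOException) { continue; }     // unreadable file: skip it
            catch (UnauthorizedAccessException) { continue; }
            lock (gate) {
                hashes[path] = md5;
            }
        }
    }

    // MD5 of the file contents as 32 lowercase hex digits.
    public static string ComputeMd5(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        using (MD5 hasher = MD5.Create()) {
            byte[] digest = hasher.ComputeHash(fs);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

A duplicate scan could call TryGetHash for each selected photo and compute synchronously only the ones the worker has not reached yet.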

Steve

Alvaro del Castillo wrote:

Hi!
...

Yes, this could be a good way, but then all users would pay for MD5
generation on all their photos. I think only users who use the
Duplicate feature should spend time generating MD5s for the photos,
unless we find other uses for the MD5 that justify making all users
pay this loading time.

I have to agree with this opinion. This bothered me too. But I think
it would be a good idea to store the created md5 hashes. I saw this
feature in gthumb where it was a bit slow. In this situation we have
the opportunity to store the created hashes in sql for further use. So
perhaps when you run a duplicate search it would be a good idea to
store the hashes as a side effect. The next search would then be a fast
generation for just the new images plus an sql query.


Yes, I think this is the best idea: create the MD5 when the duplicate
feature is used. Storing the MD5 in the database could also be a good
idea, yes! When you load the photo data from the database, the MD5
could be loaded too if it exists, so you don't need to recreate it
later.
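A minimal sketch of the "compute only what is missing" idea, assuming the stored hashes have already been loaded into a dictionary keyed by path. HashCache, FillMissing and the computeMd5 delegate are hypothetical names; the dictionary only stands in for the MD5 database field, whose real schema is not shown in this thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: "stored" stands in for the MD5 field loaded from the database.
public static class HashCache
{
    // Computes hashes only for paths missing from the store.
    // Returns how many hashes had to be computed.
    public static int FillMissing(IEnumerable<string> paths,
                                  IDictionary<string, string> stored,
                                  Func<string, string> computeMd5)
    {
        int computed = 0;
        foreach (string path in paths) {
            if (stored.ContainsKey(path))
                continue;                     // known from an earlier search
            stored[path] = computeMd5(path);  // a real version would also save this to the database
            computed++;
        }
        return computed;
    }
}
```

On the first search every photo is hashed; on later searches only photos added since then cost any time.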

And one last thing. With your original idea you can't alert
the user not to import the same image twice.


No, if you lose the MD5 you can't. So you are thinking about showing
the user a dialog when she tries to import photos that are already in
the albums, no? The user can then say "Don't import any repeated
photos" or "Import all the repeated photos". This could be a nice
feature. We annoy the user with a question but I think she will like
to be informed about it :)
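A minimal sketch of the import-time check discussed above, assuming the MD5s of the photos already in the albums are available as a collection of strings. ImportCheck, Partition and the computeMd5 delegate are hypothetical names, not F-Spot API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: split an import batch into repeated and new photos.
public static class ImportCheck
{
    public static void Partition(IEnumerable<string> incoming,
                                 IEnumerable<string> knownHashes,
                                 Func<string, string> computeMd5,
                                 List<string> repeated,
                                 List<string> fresh)
    {
        // Start from the hashes already in the albums.
        HashSet<string> seen = new HashSet<string>(knownHashes);
        foreach (string path in incoming) {
            // Add returns false when the hash is already present, so a second
            // copy inside the same batch is also reported as repeated.
            if (seen.Add(computeMd5(path)))
                fresh.Add(path);
            else
                repeated.Add(path);
        }
    }
}
```

If the repeated list is not empty, the import dialog can ask the question once for the whole batch instead of once per photo.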


So to fix some points:


1. If the user doesn't use the Duplicate feature, she won't suffer any
time spent creating MD5s. The only extra time will be loading the data
from the MD5 database field. I think this time should be minimal because
you will load the MD5 data field along with lots of other fields.


2. If the user selects the Duplicate feature then:

2.1 If she has selected a group of photos, the duplicate code will work
on this selection. A "Duplicate" tag will be created if it doesn't exist
and all the duplicate photos will be marked with this Duplicate tag.
The Duplicate tag checkbox will then be selected so that only the
duplicate photos appear in the main window and the user can work with
them. Probably she will delete one or more of the copies, if they exist.
Maybe we can preselect for the user all the photos except one original
per duplicate group.

2.2 If she doesn't select any photos, we will work with all the photos.

In both 2.1 and 2.2 we may need to show a progress dialog.
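The grouping in point 2 can be sketched as follows, assuming the MD5 of every photo in scope is already known. DuplicateFinder, FindGroups and Preselect are hypothetical names, and taking the first path in sorted order as the "original" is only an illustrative choice.

```csharp
using System.Collections.Generic;

// Hypothetical sketch: group photos by MD5 and pick the copies to preselect.
public static class DuplicateFinder
{
    // Returns only the groups that contain more than one photo.
    public static List<List<string>> FindGroups(IDictionary<string, string> md5ByPath)
    {
        Dictionary<string, List<string>> byHash = new Dictionary<string, List<string>>();
        foreach (KeyValuePair<string, string> entry in md5ByPath) {
            List<string> group;
            if (!byHash.TryGetValue(entry.Value, out group)) {
                group = new List<string>();
                byHash[entry.Value] = group;
            }
            group.Add(entry.Key);
        }

        List<List<string>> result = new List<List<string>>();
        foreach (List<string> group in byHash.Values) {
            if (group.Count > 1) {
                group.Sort(string.CompareOrdinal);  // stable choice of "original"
                result.Add(group);
            }
        }
        return result;
    }

    // Everything except the first photo of each group is preselected.
    public static List<string> Preselect(List<List<string>> groups)
    {
        List<string> selected = new List<string>();
        foreach (List<string> group in groups)
            selected.AddRange(group.GetRange(1, group.Count - 1));
        return selected;
    }
}
```

Every path returned by FindGroups would get the Duplicate tag, and the paths from Preselect would be the initial selection. Equal MD5s make identical files very likely but do not prove it, so a byte-by-byte comparison before deleting anything would be a cautious extra step.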


How does it sound?

Cheers

Hubidubi

I think the MD5 for photos could be cached in a hash table. This is what
I do in the current implementation.

Some numbers: computing the MD5 of these photo files

acs amigo:~/fotos/airport extreme$ ls -l
total 1360
-rwxr--r--  1 acs root 364713 2005-01-05 14:26 dsc00045.jpg
-rwxr--r--  1 acs root 330323 2005-01-05 14:26 dsc00046.jpg
-rwxr--r--  1 acs root 324022 2005-01-05 14:26 dsc00047.jpg
-rwxr--r--  1 acs root 344558 2005-01-05 14:27 dsc00048.jpg

and measuring the MD5 computation with DateTime.Now.Ticks (I am sure it
isn't the most accurate way to do it) on my computer (Dell X300 with 256
MB RAM and Pentium(R) M processor 1200MHz):

First time:
MD5 compute: 00:00:00.0769270
MD5 compute: 00:00:00.0290020
MD5 compute: 00:00:00.0200700
MD5 compute: 00:00:00.0204300

Second time:
MD5 compute: 00:00:00.0199370
MD5 compute: 00:00:00.0174230
MD5 compute: 00:00:00.0176300
MD5 compute: 00:00:00.0184470

Third time:
MD5 compute: 00:00:00.0219800
MD5 compute: 00:00:00.0203260
MD5 compute: 00:00:00.0194000
MD5 compute: 00:00:00.0199240

Fourth time:
MD5 compute: 00:00:00.0284410
MD5 compute: 00:00:00.0254680
MD5 compute: 00:00:00.0252140
MD5 compute: 00:00:00.0277300

So with not very big photos (1024x768) it takes around 30 ms per
photo. If you have, for example, 6000 photos, you spend 180 seconds (3
minutes). A really bad first experience for the user. Currently, these 3
minutes are spread across the import process, which is already a bit
slow.
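The estimate above is just the per-photo cost multiplied by the number of photos. A minimal sketch of that arithmetic follows, together with a timing helper based on System.Diagnostics.Stopwatch, which is usually a finer-grained clock than DateTime.Now.Ticks. HashCost, Estimate and Measure are hypothetical names.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helpers for measuring and extrapolating hashing cost.
public static class HashCost
{
    // Total time for photoCount photos at a fixed per-photo cost.
    public static TimeSpan Estimate(int photoCount, TimeSpan perPhoto)
    {
        return TimeSpan.FromTicks(perPhoto.Ticks * photoCount);
    }

    // Runs the action once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action action)
    {
        Stopwatch watch = Stopwatch.StartNew();
        action();
        watch.Stop();
        return watch.Elapsed;
    }
}
```

HashCost.Estimate(6000, TimeSpan.FromMilliseconds(30)) gives 180 seconds, matching the 3 minutes quoted above. At the roughly 20 ms seen in the second and third runs, the same library would take about 2 minutes.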

Cheers

-- Alvaro

P.S: To compute the MD5 I use the code

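// Needs: using System.IO; using System.Security.Cryptography; using System.Text;
byte[] md5;
// "using" closes the file and releases the hasher even if ComputeHash throws.
using (FileStream fs = new FileStream(photo.Path, FileMode.Open, FileAccess.Read))
using (MD5 md5ServiceProvider = MD5.Create()) {
        md5 = md5ServiceProvider.ComputeHash(fs);
}

// Render the hash bytes as lowercase hex, two digits per byte.
StringBuilder hash = new StringBuilder(md5.Length * 2);
for (int pos = 0; pos < md5.Length; pos++) {
        hash.Append(md5[pos].ToString("x2"));
}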

taken from Mono bugzilla.

