Re: Adding "Find Duplicates" feature to F-Spot



Hi!
...

> > Yes, this could be a good way but all the users will suffer the md5
> > generation for all the photos. I think that only users that use the
> > Duplicate feature should spend time with the MD5 generation for the
> > photos if we don't find any uses for the MD5 that could justify that all
> > users suffer this loading time.
> 
> I have to agree with this opinion. This bothered me too. But I think
> it would be a good idea to store the created md5 hashes. I saw this
> feature in gthumb where it was a bit slow. In this situation we have
> the opportunity to store the created hashes in sql for further use. So
> perhaps when you run a duplicate search it would be a good idea to
> store the hashes as a side effect. The next search would then only need
> to generate hashes for the new images and run an SQL query.

Yes, I think this is the best idea: create the MD5 only when the
duplicate feature is used. Storing the MD5 in the database is also a
good idea, yes! When you load the photo data from the database, the MD5
can be loaded too if it exists, so you don't need to recompute it
later.
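
To make the "load it if it exists, compute it otherwise" idea concrete,
here is a minimal sketch. Md5Cache, GetOrCompute and ComputedCount are
hypothetical names, not existing F-Spot code, and the dictionary only
stands in for the MD5 field in the database. It also assumes a file does
not change after it has been hashed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not part of F-Spot: remembers one MD5 per photo path.
public class Md5Cache
{
    // Stands in for the MD5 column that would be loaded from the database.
    private readonly Dictionary<string, string> stored =
        new Dictionary<string, string>();

    // Number of hashes that really had to be computed from the file.
    public int ComputedCount { get; private set; }

    public string GetOrCompute(string path)
    {
        string hash;
        if (stored.TryGetValue(path, out hash))
            return hash;        // already known: no file read needed

        hash = ComputeMd5(path);
        stored[path] = hash;    // a real version would also save it to the database
        ComputedCount++;
        return hash;
    }

    // Hashes the file contents and returns 32 lowercase hex characters.
    public static string ComputeMd5(string path)
    {
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (MD5 md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(fs);
            StringBuilder sb = new StringBuilder(digest.Length * 2);
            foreach (byte b in digest)
                sb.Append(b.ToString("x2"));
            return sb.ToString();
        }
    }
}
```

With something like this, only the first duplicate search pays the cost
of reading every file; later searches only hash the photos that have no
stored MD5 yet.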

> And at last one more thing. With your original idea you can't alert
> the user not to import the same image twice.
> 

No, if you lose the MD5 you can't. So you are thinking about showing
the user a dialog when she tries to import photos that are already in
the albums, no? The user can then choose between "Don't import any
repeated photo" and "Import all the repeated photos". This could be a
nice feature. We annoy the user with a question, but I think she will
like to be informed about it :)
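
The check behind such a dialog could look roughly like this. It is only
a sketch: ImportCheck and FindRepeated are made-up names, and it assumes
the MD5s of the photos to import have already been computed.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not part of F-Spot.
public static class ImportCheck
{
    // knownHashes: MD5s of the photos already in the albums.
    // incoming:    path -> MD5 for the photos the user wants to import.
    // Returns the incoming paths whose MD5 is already known, or that
    // repeat an earlier photo of the same import.
    public static List<string> FindRepeated(
        ICollection<string> knownHashes,
        IDictionary<string, string> incoming)
    {
        HashSet<string> seen = new HashSet<string>(knownHashes);
        List<string> repeated = new List<string>();
        foreach (KeyValuePair<string, string> entry in incoming)
        {
            // HashSet.Add returns false when the hash is already present.
            if (!seen.Add(entry.Value))
                repeated.Add(entry.Key);
        }
        return repeated;
    }
}
```

If the returned list is empty there is nothing to ask; otherwise the
dialog can offer to skip or to import the repeated photos.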


So to fix some points:


1. If the user doesn't use the Duplicate feature, she won't suffer any
time spent creating MD5s. The only extra time will be loading the data
from the MD5 database field. I think this time should be minimal because
the MD5 field is loaded together with lots of other fields.


2. If the user selects the Duplicate feature then:

2.1 If she has selected a group of photos, the duplicate code will work
on this selection. A "Duplicate" tag will be created if it doesn't exist
and all the duplicate photos will be marked with this tag. The Duplicate
tag checkbox will then be selected, so that the main window shows only
the duplicate photos and the user can work with them. Probably she will
delete one or more of the copies if they exist. Maybe we can preselect
for the user all the photos except one original per duplicate group.

2.2 If she doesn't select any photos, we will work with all the photos.

In both 2.1 and 2.2 we may need to show a progress dialog.
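
The grouping and the preselection in point 2 could be sketched like
this. DuplicateFinder, FindGroups and Preselect are hypothetical names,
and treating the first photo seen in each group as the "original" is
just an assumption for the example.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not part of F-Spot.
public static class DuplicateFinder
{
    // photos: (path, MD5) pairs for the selection, or for all the photos.
    // Returns one list of paths per MD5 that occurs more than once.
    public static List<List<string>> FindGroups(
        IEnumerable<KeyValuePair<string, string>> photos)
    {
        Dictionary<string, List<string>> byHash =
            new Dictionary<string, List<string>>();
        foreach (KeyValuePair<string, string> photo in photos)
        {
            List<string> group;
            if (!byHash.TryGetValue(photo.Value, out group))
            {
                group = new List<string>();
                byHash[photo.Value] = group;
            }
            group.Add(photo.Key);
        }

        List<List<string>> result = new List<List<string>>();
        foreach (List<string> group in byHash.Values)
        {
            if (group.Count > 1)    // a single photo is not a duplicate
                result.Add(group);
        }
        return result;
    }

    // Everything except the first photo of each group: the copies that
    // could be preselected so the user can delete them.
    public static List<string> Preselect(List<List<string>> groups)
    {
        List<string> selected = new List<string>();
        foreach (List<string> group in groups)
        {
            for (int i = 1; i < group.Count; i++)
                selected.Add(group[i]);
        }
        return selected;
    }
}
```

All the photos in the returned groups would get the Duplicate tag, and
the result of Preselect would be the initial selection in the main
window.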


How does it sound?

Cheers


> Hubidubi
> 
> > 
> > I think the MD5 for photos could be cached in a hash table. This is what
> > I do in the current implementation.
> > 
> > Some numbers: computing the MD5 of these photos
> > 
> > acs amigo:~/fotos/airport extreme$ ls -l
> > total 1360
> > -rwxr--r--  1 acs root 364713 2005-01-05 14:26 dsc00045.jpg
> > -rwxr--r--  1 acs root 330323 2005-01-05 14:26 dsc00046.jpg
> > -rwxr--r--  1 acs root 324022 2005-01-05 14:26 dsc00047.jpg
> > -rwxr--r--  1 acs root 344558 2005-01-05 14:27 dsc00048.jpg
> > 
> > and measuring the MD5 computation with DateTime.Now.Ticks (I am sure it
> > isn't the most accurate way to do it) on my computer (Dell X300 with 256
> > MB RAM and Pentium(R) M processor 1200MHz):
> > 
> > First time:
> > MD5 compute: 00:00:00.0769270
> > MD5 compute: 00:00:00.0290020
> > MD5 compute: 00:00:00.0200700
> > MD5 compute: 00:00:00.0204300
> > 
> > Second time:
> > MD5 compute: 00:00:00.0199370
> > MD5 compute: 00:00:00.0174230
> > MD5 compute: 00:00:00.0176300
> > MD5 compute: 00:00:00.0184470
> > 
> > Third time:
> > MD5 compute: 00:00:00.0219800
> > MD5 compute: 00:00:00.0203260
> > MD5 compute: 00:00:00.0194000
> > MD5 compute: 00:00:00.0199240
> > 
> > Fourth time:
> > MD5 compute: 00:00:00.0284410
> > MD5 compute: 00:00:00.0254680
> > MD5 compute: 00:00:00.0252140
> > MD5 compute: 00:00:00.0277300
> > 
> > So with not very big photos (1024x768) we get around 20-30ms per
> > photo. Taking 30ms, if you have for example 6000 photos you spend 180
> > seconds (3 minutes). A really bad first experience for the user.
> > Currently, these 3 minutes are spread over the time you spend in the
> > importing process, which is actually a bit slow.
> > 
> > Cheers
> > 
> > -- Alvaro
> > 
> > P.S: To compute the MD5 I use the code
> > 
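> > // Needs: using System.IO; using System.Security.Cryptography; using System.Text;
> > StringBuilder hash = new StringBuilder();
> > // Dispose the stream and the hasher once the digest is computed.
> > using (FileStream fs = new FileStream(photo.Path, FileMode.Open,
> >                                       FileAccess.Read))
> > using (MD5 md5ServiceProvider = new MD5CryptoServiceProvider()) {
> >         byte[] md5 = md5ServiceProvider.ComputeHash(fs);
> >         // Append each digest byte as two lowercase hex digits.
> >         for (int pos = 0; pos < md5.Length; pos++) {
> >                 hash.Append(md5[pos].ToString("x2"));
> >         }
> > }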
> > 
> > taken from Mono bugzilla.
> > 
> > 
> 
> 



