Re: Adding "Find Duplicates" feature to F-Spot



Hi!

On Sun, 19-06-2005 at 21:12 +0200, Huba Zsolt wrote:
> Hi!
> 
> A few weeks ago I also thought about this missing feature. My idea
> was that during import the MD5 hash of every new photo should be
> generated and stored in the database. Finding duplicate images
> would then be a single SQL query, so it would be very fast, but I
> expect the import process would be slower because of all the MD5
> generation.
> 

Yes, this could be a good approach, but then every user pays the MD5
generation cost for every photo. I think only users who actually use
the Find Duplicates feature should spend time on MD5 generation, unless
we find other uses for the MD5 that justify making all users pay this
loading time.

I think the MD5 for photos could be cached in a hash table. This is what
I do in the current implementation.
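
A minimal sketch of that caching idea, assuming hashes are keyed by file
path. The Md5Cache class and its method names are hypothetical and are not
taken from the F-Spot sources; it only illustrates computing each hash once
and grouping photos by hash to find duplicates:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not F-Spot code: caches MD5 hex strings by file path.
public class Md5Cache
{
    private readonly Dictionary<string, string> cache =
        new Dictionary<string, string>();

    // Returns the cached hash, computing it only on the first request.
    public string GetHash(string path)
    {
        string hash;
        if (!cache.TryGetValue(path, out hash)) {
            hash = ComputeHash(path);
            cache[path] = hash;
        }
        return hash;
    }

    // Hashes the file contents and formats the digest as lowercase hex.
    private static string ComputeHash(string path)
    {
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (MD5 md5 = MD5.Create()) {
            byte[] digest = md5.ComputeHash(fs);
            StringBuilder sb = new StringBuilder(digest.Length * 2);
            foreach (byte b in digest)
                sb.Append(b.ToString("x2"));
            return sb.ToString();
        }
    }

    // Groups paths by hash; every group with more than one path
    // is a set of byte-identical files.
    public List<List<string>> FindDuplicates(IEnumerable<string> paths)
    {
        Dictionary<string, List<string>> byHash =
            new Dictionary<string, List<string>>();
        foreach (string path in paths) {
            string hash = GetHash(path);
            List<string> group;
            if (!byHash.TryGetValue(hash, out group)) {
                group = new List<string>();
                byHash[hash] = group;
            }
            group.Add(path);
        }

        List<List<string>> duplicates = new List<List<string>>();
        foreach (List<string> group in byHash.Values)
            if (group.Count > 1)
                duplicates.Add(group);
        return duplicates;
    }
}
```

With a cache like this, only the first duplicate search pays the hashing
cost. The cache lives in memory, so it is lost when the application exits
unless it is persisted somewhere.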

Some numbers from computing the MD5 hashes of these photos:

acs amigo:~/fotos/airport extreme$ ls -l
total 1360
-rwxr--r--  1 acs root 364713 2005-01-05 14:26 dsc00045.jpg
-rwxr--r--  1 acs root 330323 2005-01-05 14:26 dsc00046.jpg
-rwxr--r--  1 acs root 324022 2005-01-05 14:26 dsc00047.jpg
-rwxr--r--  1 acs root 344558 2005-01-05 14:27 dsc00048.jpg

and timing the MD5 computation with DateTime.Now.Ticks (I am sure it
isn't the most accurate way to do it) on my computer (Dell X300 with 256
MB RAM and a Pentium(R) M processor at 1200MHz):

First time:
MD5 compute: 00:00:00.0769270
MD5 compute: 00:00:00.0290020
MD5 compute: 00:00:00.0200700
MD5 compute: 00:00:00.0204300

Second time:
MD5 compute: 00:00:00.0199370
MD5 compute: 00:00:00.0174230
MD5 compute: 00:00:00.0176300
MD5 compute: 00:00:00.0184470

Third time:
MD5 compute: 00:00:00.0219800
MD5 compute: 00:00:00.0203260
MD5 compute: 00:00:00.0194000
MD5 compute: 00:00:00.0199240

Fourth time:
MD5 compute: 00:00:00.0284410
MD5 compute: 00:00:00.0254680
MD5 compute: 00:00:00.0252140
MD5 compute: 00:00:00.0277300

So with photos that are not very big (1024x768) we see roughly 20-30 ms
per photo. Taking 30 ms, if you have for example 6000 photos you spend
180 seconds (3 minutes). A really bad first experience for the user.
With hashing done at import time, these 3 minutes are spread over the
minutes you spend in the import process, which is already a bit slow.
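
For anyone who wants to repeat the measurement, here is a rough sketch
using System.Diagnostics.Stopwatch instead of DateTime.Now.Ticks. This
assumes a runtime that provides Stopwatch. The Md5Timing class and its
method names are made up for illustration and are not what produced the
numbers above:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Security.Cryptography;

// Hypothetical timing helper, not the code used for the figures in this mail.
public static class Md5Timing
{
    // Times one MD5 computation over the file at the given path,
    // including the time spent reading the file.
    public static TimeSpan TimeHash(string path)
    {
        Stopwatch watch = Stopwatch.StartNew();
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (MD5 md5 = MD5.Create()) {
            md5.ComputeHash(fs);
        }
        watch.Stop();
        return watch.Elapsed;
    }

    // Extrapolates a per-photo cost to a whole collection.
    public static TimeSpan EstimateTotal(TimeSpan perPhoto, int photoCount)
    {
        return TimeSpan.FromTicks(perPhoto.Ticks * photoCount);
    }

    public static void Main(string[] args)
    {
        // Pass photo paths on the command line to time each one.
        foreach (string path in args)
            Console.WriteLine("MD5 compute: {0}", TimeHash(path));

        // 30 ms per photo and 6000 photos are the figures from this mail.
        TimeSpan total = EstimateTotal(TimeSpan.FromMilliseconds(30), 6000);
        Console.WriteLine("Estimated total: {0} seconds", total.TotalSeconds);
    }
}
```

The first run over a file will usually be slower than later runs because
the file is not yet in the operating system's cache, which matches the
first number in the "First time" block.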

Cheers

-- Alvaro

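P.S.: To compute the MD5 I use this code:

// Needs: using System.IO; using System.Security.Cryptography; using System.Text;
byte[] md5;
// Dispose the stream and the hash object as soon as the digest is computed
using (FileStream fs = new FileStream(photo.Path, FileMode.Open, FileAccess.Read))
using (MD5 md5ServiceProvider = new MD5CryptoServiceProvider()) {
	md5 = md5ServiceProvider.ComputeHash(fs);
}

// Convert the 16-byte digest to a lowercase hex string
StringBuilder hash = new StringBuilder();
for (int pos = 0; pos < md5.Length; pos++) {
	hash.Append(md5[pos].ToString("x2"));
}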

taken from Mono bugzilla.
