Re: Dup detection

From: Asheesh Laroia <asheesh asheesh org>
To: Bill Moseley <moseley hank org>
Cc: f-spot-list gnome org
Subject: Re: Dup detection
Date: Sun, 1 Mar 2009 14:31:16 -0800 (PST)

On Sun, 1 Mar 2009, Bill Moseley wrote:

On Sat, Feb 28, 2009 at 07:39:05AM -0800, Bill Moseley wrote:

Ubuntu 8.10 / f-spot 0.5.0.3


I'm setting up f-spot for someone and imported their entire "My
Documents" folder from their old drive.

The "Detect duplicates" item was checked on the import dialog, yet I
ended up with many duplicates.

Here's a few examples in the f-spot database:

sqlite> select uri,md5_sum from photos where uri like '%visit17%';
file:///home/dawson/Photos/2002/07/27/visit17.jpg|NxQax6OOx2UrTYXuegNDjA==
file:///home/dawson/Photos/2002/07/27/visit17-1.jpg|Dpa4apwy/Wguf2VD+UTXog==
file:///home/dawson/Photos/2002/07/27/visit17c.jpg|KRaGglvxVYNTbsj6IhkEFA==
file:///home/dawson/Photos/2002/07/27/visit17-2.jpg|Bb9Lspcs+WWOt2HiQKf0Xw==

Yet the md5's of the photos match:

$ md5sum Photos/2002/07/27/visit17*.jpg
4b09e7c7cf223687a9d2727230c2c5a4  Photos/2002/07/27/visit17-1.jpg
4b09e7c7cf223687a9d2727230c2c5a4  Photos/2002/07/27/visit17-2.jpg
4b09e7c7cf223687a9d2727230c2c5a4  Photos/2002/07/27/visit17c.jpg
4b09e7c7cf223687a9d2727230c2c5a4  Photos/2002/07/27/visit17.jpg

Clearly, the md5's in the database are not just the file contents.
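
One caveat when comparing the two listings: md5sum prints hex, while the
md5_sum column looks like base64 of a raw 16-byte digest (24 characters
ending in "=="). That encoding is an assumption based only on how the values
look, not on the F-Spot source. Under that assumption, a minimal Python
sketch to put both in the same form (hex_md5_to_base64 is a hypothetical
helper name):

```python
import base64
import binascii


def hex_md5_to_base64(hex_digest: str) -> str:
    """Re-encode a hex MD5 digest (as printed by md5sum) as base64."""
    raw = binascii.unhexlify(hex_digest)  # 16 raw digest bytes
    return base64.b64encode(raw).decode("ascii")


# Hex digest reported by md5sum for all four files above.
file_md5 = "4b09e7c7cf223687a9d2727230c2c5a4"

# md5_sum values copied from the sqlite query above.
db_values = [
    "NxQax6OOx2UrTYXuegNDjA==",
    "Dpa4apwy/Wguf2VD+UTXog==",
    "KRaGglvxVYNTbsj6IhkEFA==",
    "Bb9Lspcs+WWOt2HiQKf0Xw==",
]

converted = hex_md5_to_base64(file_md5)
# True would mean the database stores the plain file MD5.
matches_database = converted in db_values
print(converted, matches_database)
```

If the converted value does not appear among the stored ones, the stored
hashes are computed over something other than the raw file bytes.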


Hum, is this code below not the code that generates the md5 stored in
the photos table?

I'm not sure I understand the code, but is it first creating
a thumbnail and then calculating the md5?  Is the point of generating
the thumbnail first to strip any image meta data that might be
different?

As an outsider to this conversation, it seems to me that it would be
better not to make a thumbnail: if the thumbnail generation process
changes between versions of F-Spot, old MD5s would need to be
regenerated. But I agree with the need to strip the image metadata.

Given that, it seems to me that the best thing would be to decode the
image into a deterministic uncompressed format that has no metadata and
hash that.
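
To make that concrete, here is a rough Python sketch. It assumes some image
library has already decoded the file into raw 8-bit RGB bytes (that step is
not shown), and pixel_hash is a hypothetical name, not anything in F-Spot:

```python
import hashlib
import struct


def pixel_hash(width: int, height: int, rgb_bytes: bytes) -> str:
    """Hash decoded pixels only, so file metadata cannot affect the result.

    rgb_bytes is assumed to be row-major, 3 bytes per pixel (R, G, B).
    """
    if len(rgb_bytes) != width * height * 3:
        raise ValueError("pixel buffer does not match the given dimensions")
    digest = hashlib.md5()
    # Include the dimensions so a 2x1 and a 1x2 image with the same
    # byte stream do not collide.
    digest.update(struct.pack(">II", width, height))
    digest.update(rgb_bytes)
    return digest.hexdigest()


# Two files with different metadata decode to the same pixels,
# so they get the same hash.
pixels = bytes([255, 0, 0, 0, 255, 0])  # one red pixel, one green pixel
hash_a = pixel_hash(2, 1, pixels)
hash_b = pixel_hash(2, 1, bytes(pixels))
print(hash_a == hash_b)
```

This only catches files whose decoded pixels are identical; a re-compressed
JPEG decodes to slightly different pixels and would still hash differently.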

Since you could have pictures that are equivalent except for rotation,
you might also want to pick a canonical orientation before hashing, like
"the longest side is always the vertical." For square images, this
doesn't help; perhaps just storing four hashes per image (one per
rotation) is a better idea. (Or, more efficiently, store only one hash,
and at "check if this is a duplicate" time, decode the image into the
canonical metadata-less format, rotate it all four ways, and check for
each of those hashes.)
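
The "store one hash, check four rotations" variant might look roughly like
this in Python. It is only a sketch: the image is assumed to be already
decoded into a list of rows of (r, g, b) tuples, and all function names are
made up for illustration.

```python
import hashlib


def rotate90(grid):
    """Rotate a row-major pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def grid_hash(grid):
    """MD5 over the dimensions and the raw pixel values of a grid."""
    digest = hashlib.md5()
    digest.update(f"{len(grid[0])}x{len(grid)}".encode("ascii"))
    for row in grid:
        for r, g, b in row:
            digest.update(bytes((r, g, b)))
    return digest.hexdigest()


def rotation_hashes(grid):
    """Hashes of the grid in all four 90-degree orientations."""
    hashes = set()
    for _ in range(4):
        hashes.add(grid_hash(grid))
        grid = rotate90(grid)
    return hashes


def is_duplicate(grid, known_hashes):
    """True if any orientation of the grid matches a stored hash."""
    return not rotation_hashes(grid).isdisjoint(known_hashes)


# Only one hash per photo is stored at import time.
original = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(1, 2, 3), (4, 5, 6), (7, 8, 9)],
]
stored = {grid_hash(original)}

rotated_copy = rotate90(original)           # same photo, turned sideways
different = [[(9, 9, 9)] * 3, [(8, 8, 8)] * 3]

print(is_duplicate(rotated_copy, stored), is_duplicate(different, stored))
```

The cost moves to import time (four hashes per candidate) in exchange for
keeping a single hash per photo in the database.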


Just some thoughts. Thanks all for making F-Spot great!

-- Asheesh.

--
The human race has one really effective weapon, and that is laughter.
		-- Mark Twain

