Re: Dup detection



On Sun, 1 Mar 2009, Bill Moseley wrote:

On Sat, Feb 28, 2009 at 07:39:05AM -0800, Bill Moseley wrote:
Ubuntu 8.10 / f-spot 0.5.0.3


I'm setting up f-spot for someone and imported their entire "My
Documents" folder from their old drive.

The "Detect duplicates" item was checked on the import dialog, yet I
ended up with many duplicates.

Here's a few examples in the f-spot database:

sqlite> select uri,md5_sum from photos where uri like '%visit17%';
file:///home/dawson/Photos/2002/07/27/visit17.jpg|NxQax6OOx2UrTYXuegNDjA==
file:///home/dawson/Photos/2002/07/27/visit17-1.jpg|Dpa4apwy/Wguf2VD+UTXog==
file:///home/dawson/Photos/2002/07/27/visit17c.jpg|KRaGglvxVYNTbsj6IhkEFA==
file:///home/dawson/Photos/2002/07/27/visit17-2.jpg|Bb9Lspcs+WWOt2HiQKf0Xw==

Yet the md5's of the photos match:

$ md5sum Photos/2002/07/27/visit17*.jpg
4b09e7c7cf223687a9d2727230c2c5a4  Photos/2002/07/27/visit17-1.jpg
4b09e7c7cf223687a9d2727230c2c5a4  Photos/2002/07/27/visit17-2.jpg
4b09e7c7cf223687a9d2727230c2c5a4  Photos/2002/07/27/visit17c.jpg
4b09e7c7cf223687a9d2727230c2c5a4  Photos/2002/07/27/visit17.jpg

Clearly, the md5's in the database are not just the file contents.

Hum, is this code below not the code that generates the md5 stored in
the photos table?

I'm not sure I understand the code, but is it first creating
a thumbnail and then calculating the md5?  Is the point of generating
the thumbnail first to strip any image meta data that might be
different?

As an outsider to this conversation, I'd suggest that it would be better not to make a thumbnail: if the thumbnail generation process changes between versions of F-Spot, old MD5s would need to be regenerated. But I agree with the need to strip the image metadata.

Given that, it seems to me that the best thing would be to decode the image into a deterministic uncompressed format that has no metadata and hash that.
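To make that concrete, here is a minimal sketch in Python (not F-Spot's actual code). It assumes the image has already been decoded to raw 8-bit RGB bytes by some image library; pixel_hash and the two file dictionaries are hypothetical names used only for illustration.

```python
import hashlib
import struct


def pixel_hash(width, height, rgb_bytes):
    """MD5 over the dimensions plus the raw 8-bit RGB pixel data, nothing else."""
    if len(rgb_bytes) != width * height * 3:
        raise ValueError("pixel buffer does not match dimensions")
    digest = hashlib.md5()
    # Include the dimensions so a 2x2 and a 1x4 image with the same bytes differ.
    digest.update(struct.pack(">II", width, height))
    digest.update(rgb_bytes)
    return digest.hexdigest()


# Two hypothetical decoded files: identical pixels, different metadata.
pixels = bytes([255, 0, 0, 0, 255, 0, 0, 0, 255, 10, 20, 30])  # 2x2 RGB
file_a = {"metadata": {"comment": "original"}, "size": (2, 2), "pixels": pixels}
file_b = {"metadata": {"comment": "copy, retagged"}, "size": (2, 2), "pixels": pixels}

# The metadata never enters the hash, so both files get the same value.
hash_a = pixel_hash(*file_a["size"], file_a["pixels"])
hash_b = pixel_hash(*file_b["size"], file_b["pixels"])
```

Because only the dimensions and pixel bytes are hashed, the result does not depend on how thumbnails are generated or on which tags a file carries.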

Since you could have pictures that are equivalent except for rotation, you might also want to pick a canonical orientation before hashing, like "the longest side is always the vertical." For square images, this doesn't help; perhaps just storing four hashes per image (one per rotation) is a better idea. (Or, more efficiently, store only one hash, and at "Check if this is a duplicate" time, decode the image into the canonical metadata-less format, rotate it all four ways, and check for each of those hashes.)
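A rough sketch of that last variant, again in Python and again only an illustration: the decoded image is modelled as a list of rows of (R, G, B) tuples, and rotate90, grid_hash, rotation_hashes and is_duplicate are hypothetical helper names, not anything in F-Spot.

```python
import hashlib
import struct


def rotate90(grid):
    """Rotate a row-major pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def grid_hash(grid):
    """MD5 over the dimensions plus the raw RGB bytes of the grid."""
    height, width = len(grid), len(grid[0])
    data = bytes(c for row in grid for pixel in row for c in pixel)
    return hashlib.md5(struct.pack(">II", width, height) + data).hexdigest()


def rotation_hashes(grid):
    """Hashes of the grid at 0, 90, 180 and 270 degrees."""
    hashes = []
    for _ in range(4):
        hashes.append(grid_hash(grid))
        grid = rotate90(grid)
    return hashes


def is_duplicate(grid, stored_hashes):
    """True if any rotation of the grid matches a stored hash."""
    return any(h in stored_hashes for h in rotation_hashes(grid))


R, G, B, W = (255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)
original = [[R, G, B], [W, W, W]]    # 3 wide, 2 tall
stored = {grid_hash(original)}       # only one hash stored per photo
sideways = rotate90(original)        # same picture, rotated
other = [[R, R, R], [W, W, W]]       # a genuinely different picture
```

Here is_duplicate(sideways, stored) is true even though grid_hash(sideways) differs from the stored hash, and is_duplicate(other, stored) is false. The cost is four hashes per import check instead of one, in exchange for storing a single hash per photo.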

Just some thoughts. Thanks all for making F-Spot great!

-- Asheesh.

--
The human race has one really effective weapon, and that is laughter.
		-- Mark Twain

