Re: Dup detection
- From: Asheesh Laroia <asheesh asheesh org>
- To: Bill Moseley <moseley hank org>
- Cc: f-spot-list gnome org
- Subject: Re: Dup detection
- Date: Sun, 1 Mar 2009 14:31:16 -0800 (PST)
On Sun, 1 Mar 2009, Bill Moseley wrote:
On Sat, Feb 28, 2009 at 07:39:05AM -0800, Bill Moseley wrote:
Ubuntu 8.10 / f-spot 0.5.0.3
I'm setting up f-spot for someone and imported their entire "My
Documents" folder from their old drive.
The "Detect duplicates" item was checked on the import dialog, yet I
ended up with many duplicates.
Here's a few examples in the f-spot database:
sqlite> select uri,md5_sum from photos where uri like '%visit17%';
file:///home/dawson/Photos/2002/07/27/visit17.jpg|NxQax6OOx2UrTYXuegNDjA==
file:///home/dawson/Photos/2002/07/27/visit17-1.jpg|Dpa4apwy/Wguf2VD+UTXog==
file:///home/dawson/Photos/2002/07/27/visit17c.jpg|KRaGglvxVYNTbsj6IhkEFA==
file:///home/dawson/Photos/2002/07/27/visit17-2.jpg|Bb9Lspcs+WWOt2HiQKf0Xw==
Yet the md5's of the photos match:
$ md5sum Photos/2002/07/27/visit17*.jpg
4b09e7c7cf223687a9d2727230c2c5a4 Photos/2002/07/27/visit17-1.jpg
4b09e7c7cf223687a9d2727230c2c5a4 Photos/2002/07/27/visit17-2.jpg
4b09e7c7cf223687a9d2727230c2c5a4 Photos/2002/07/27/visit17c.jpg
4b09e7c7cf223687a9d2727230c2c5a4 Photos/2002/07/27/visit17.jpg
Clearly, the md5's in the database are not just the file contents.
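One thing worth ruling out before comparing these values: the md5_sum
column looks like base64, while md5sum prints hex, so the two have to be
put in the same encoding first. A minimal Python sketch, assuming the
column holds a base64-encoded 16-byte digest (the helper name
b64_digest_to_hex is made up for illustration):

```python
import base64
import binascii


def b64_digest_to_hex(value: str) -> str:
    """Decode a base64-encoded digest and return it as lowercase hex."""
    raw = base64.b64decode(value)
    return binascii.hexlify(raw).decode("ascii")


db_value = "NxQax6OOx2UrTYXuegNDjA=="          # md5_sum stored for visit17.jpg
file_md5 = "4b09e7c7cf223687a9d2727230c2c5a4"  # md5sum output for the same file

db_hex = b64_digest_to_hex(db_value)
# The stored value decodes to a 16-byte digest, but not the file's MD5.
print(len(db_hex), db_hex == file_md5)
```

The stored value decodes to 32 hex digits, the right size for an MD5, but
it still differs from the md5sum output, which is consistent with the
database hashing something other than the raw file contents.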
Hum, is this code below not the code that generates the md5 stored in
the photos table?
I'm not sure I understand the code, but is it first creating
a thumbnail and then calculating the md5? Is the point of generating
the thumbnail first to strip any image meta data that might be
different?
As an outsider to this conversation, I can suggest that it seems to me
that it would be better not to make a thumbnail (in case the thumbnail
generation process changes between versions of F-Spot, in which case old
MD5s would need to be regenerated). But I agree with the need to strip
the image metadata.
Given that, it seems to me that the best thing would be to decode the
image into a deterministic uncompressed format that has no metadata and
hash that.
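To make that concrete, here is a rough Python sketch of hashing decoded
pixels rather than file bytes. It is only an illustration, not F-Spot's
code: pixel_hash is a hypothetical helper, and the decoding step is
assumed to be handled by an image library that yields raw 8-bit RGB data.

```python
import hashlib
import struct


def pixel_hash(width: int, height: int, rgb: bytes) -> str:
    """Hash decoded pixel data only, so file metadata cannot affect the result.

    `rgb` is assumed to be raw 8-bit RGB, row by row, with no padding.
    """
    if len(rgb) != width * height * 3:
        raise ValueError("pixel buffer does not match the given dimensions")
    h = hashlib.md5()
    # Include the dimensions so a 2x1 and a 1x2 image with the same bytes differ.
    h.update(struct.pack(">II", width, height))
    h.update(rgb)
    return h.hexdigest()


# Assumption: with Pillow installed, the decoded input could come from
#   img = Image.open(path).convert("RGB")
#   pixel_hash(img.size[0], img.size[1], img.tobytes())
pixels = bytes([255, 0, 0, 0, 0, 255])  # one red pixel, one blue pixel
wide = pixel_hash(2, 1, pixels)
tall = pixel_hash(1, 2, pixels)
print(wide == pixel_hash(2, 1, pixels), wide == tall)
```

Two files that decode to the same pixels get the same hash regardless of
their EXIF or comment blocks, and the result does not depend on how
thumbnails happen to be generated.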
Since you could have pictures that are equivalent except for rotation, you
might also want to pick a canonical orientation before hashing, like "the
longest side is always the vertical." For square images, this doesn't
help; perhaps just storing four hashes per image (one per rotation) is a
better idea. (Or, more efficiently, storing only one, but at "Check if
this is a duplicate" time, decode the image into the canonical
metadata-less format, rotate it all four ways, and then check for each of
those hashes.)
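A small Python sketch of that last variant, assuming the image has
already been decoded into rows of (r, g, b) tuples. All function names
here are made up for illustration, and a real implementation would rotate
through an image library rather than nested lists.

```python
import hashlib


def rotate_cw(grid):
    """Rotate a grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def grid_hash(grid):
    """Hash the dimensions and pixel values of a decoded image."""
    h = hashlib.md5()
    h.update(f"{len(grid[0])}x{len(grid)}:".encode("ascii"))
    for row in grid:
        for r, g, b in row:
            h.update(bytes((r, g, b)))
    return h.hexdigest()


def rotation_hashes(grid):
    """Return the hashes of the image at 0, 90, 180 and 270 degrees."""
    hashes = []
    for _ in range(4):
        hashes.append(grid_hash(grid))
        grid = rotate_cw(grid)
    return hashes


def is_duplicate(grid, known_hashes):
    """True if any rotation of the image matches a stored hash."""
    return any(h in known_hashes for h in rotation_hashes(grid))


R, G, B = (255, 0, 0), (0, 255, 0), (0, 0, 255)
original = [[R, G, B], [B, R, G]]   # 3 pixels wide, 2 tall
stored = {grid_hash(original)}      # only one hash kept per photo

sideways = rotate_cw(original)      # same photo, rotated
different = [[G, G, B], [B, R, G]]  # a genuinely different photo
print(is_duplicate(sideways, stored), is_duplicate(different, stored))
```

This keeps the database at one hash per photo and pays for the three
extra rotations only at import time.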
Just some thoughts. Thanks all for making F-Spot great!
-- Asheesh.
--
The human race has one really effective weapon, and that is laughter.
-- Mark Twain