Re: Duplicates patch new version - Check box in importing



Alvaro, As promised here is a review of the duplicates patch. It is a
very useful patch and does a lot of things very well but I have a few
issues still.

I'm going to go into a lot of detail in this review because I haven't
done enough public review on this list and I want everyone to know what
what I'd like to see.  In particular I'm going to nitpick on formatting
a bit, not because you did anything wrong (I hadn't explicitly stated
any of this before so how could you follow it), but because I want
everyone to get a feel for what I'd like to see. Please forgive me for
using you as a public example I'm extremely greatful for the work you've
done.

On Tue, 2005-10-04 at 06:30 +0200, Alvaro del Castillo wrote:
> I attach a modified patch with the problem solved. Apply it against the
> current CVS.


> Index: src/ImportCommand.cs
> ===================================================================
> RCS file: /cvs/gnome/f-spot/src/ImportCommand.cs,v
> retrieving revision 1.39
> diff -u -b -B -p -r1.39 ImportCommand.cs
> --- src/ImportCommand.cs        22 Aug 2005 10:47:34 -0000      1.39
> +++ src/ImportCommand.cs        4 Oct 2005 04:27:22 -0000
> @@ -362,6 +365,7 @@ public class ImportCommand : FSpot.Glade
>         [Glade.Widget] Gtk.ScrolledWindow photo_scrolled;
>         [Glade.Widget] Gtk.CheckButton attach_check;
>         [Glade.Widget] Gtk.CheckButton recurse_check;
> +    [Glade.Widget] Gtk.CheckButton duplicate_check;
>         [Glade.Widget] Gtk.Button ok_button;
>         [Glade.Widget] Gtk.Image tag_image;
>         [Glade.Widget] Gtk.Label tag_label;
> @@ -378,6 +382,11 @@ public class ImportCommand : FSpot.Glade
>         bool copy;
>  
>         int total;
> +    
> +    private int          duplicates;
> +    private long         total_md5;
> +    private long         total_import;
> +
>         PhotoStore store;

Here we see a problem with the whitespace formatting.  Use tabs and if
you must treat tabs as spaces they are 8 spaces wide not four.

>  
>         FSpot.Delay step;
> @@ -390,6 +399,8 @@ public class ImportCommand : FSpot.Glade
>         {
>                 main_window = mw;
>                 step = new FSpot.Delay (10, new GLib.IdleHandler (Step));
> +        total_md5 = 0;
> +        total_import = 0;
>         }
>  
>         private void HandleDialogResponse (object obj, ResponseArgs args)
> @@ -401,12 +412,16 @@ public class ImportCommand : FSpot.Glade
>                 }
>         }
>  
> -       private void UpdateProgressBar (int count, int total)
> +       private void UpdateProgressBar (int count, int total, int duplicates)
>         {
>                 if (progress_bar == null)
>                         return;
> -
> +        if (duplicates > 0) {
> +            progress_bar.Text = String.Format ("Importing {0} of {1}, {2} duplicate/s", 
> +                                               count-duplicates, total-duplicates, duplicates);
> +        } else {
>                 progress_bar.Text = String.Format ("Importing {0} of {1}", count, total);
> +        }
>                 progress_bar.Fraction = (double) count / System.Math.Max (total, 1);
>         }
>  
> @@ -431,6 +446,10 @@ public class ImportCommand : FSpot.Glade
>                 Pixbuf thumbnail;
>                 int count;
>                 bool ongoing = true;
> +        TimeSpan timeTakenMD5 = new TimeSpan();;
> +
> +        //Console.WriteLine ("Start importing the photo ...");
> +        long startTime = DateTime.Now.Ticks;
>  
>                 if (importer == null)
>                         return false;
> @@ -440,14 +459,48 @@ public class ImportCommand : FSpot.Glade
>                 if (thumbnail == null) {
>                         Console.WriteLine ("Could not import file");
>                 } else {
> -                       //icon_scrolled.Visible = true;
> +            long startTimeMD5 = DateTime.Now.Ticks;
> +            photo.MD5Sum = FSpot.DuplicatesFinder.PhotoMD5 (photo);            
> +            store.Commit (photo);
> +            long endTimeMD5 = DateTime.Now.Ticks;
> +            total_md5 += endTimeMD5 - startTimeMD5;
> +            timeTakenMD5 = new TimeSpan (endTimeMD5 - startTimeMD5);
> +            //Console.WriteLine ("Total time for MD5 {0}", timeTakenMD5.ToString());
> +
> +            if (FSpot.DuplicatesFinder.IsDuplicate (photo)) {
> +                photo.AddTag (MainWindow.Toplevel.Database.Tags.Duplicate);
> +                store.Commit (photo);
> +            }
> +            
> +            if (FSpot.DuplicatesFinder.IsDuplicate (photo)) {
> +                //Console.WriteLine ("Photo: {0} duplicated\n", photo.Name);
> +                duplicates++;
> +                if (duplicate_check.Active) {                    
>                         collection.Add (photo);
> +                } else {
> +                    //Console.WriteLine ("Removing duplicate ...");
> +                    store.Remove (photo);
> +                }

Speed on import is important because it is one of the first things
people will deal with in f-spot and is potentially very long running.
With that in mind I have a couple of questions.  Why are you calling
IsDuplicate twice here, it is potentially quite slow and could trivially
be called only once, unless I'm missing something important.  I'll have
more to say about IsDuplicate later.

In the interest of speed it might also make sense to calculate the
md5sum on the file before adding it to the store so that we could avoid
two writes to the db.  Also there is no point in adding the the
duplicate tag or the md5 if we aren't saving the photo so we can avoid
two more writes in that case

something like:

		ongoing = importer.Step (out photo, out thumbnail, out count);
		
		if (thumbnail == null) {
			Console.WriteLine ("Could not import file");
		} else {
			photo.MD5Sum = FSpot.DuplicatesFinder.PhotoMD5 (photo);			

			if (FSpot.DuplicatesFinder.IsDuplicate (photo)) {
				duplicates++;
				if (duplicate_check.Active) {                    
					photo.AddTag (MainWindow.Toplevel.Database.Tags.Duplicate);
					store.Commit (photo);
					collection.Add (photo);
				} else {
					Console.WriteLine ("Removing duplicate ...");
					store.Remove (photo);
				}
			} else {                
				Console.WriteLine ("Importing Photo: {0}", photo.Name);
				collection.Add (photo);
			}

			//grid.AddThumbnail (thumbnail);
			if (duplicate_check.Active)
				UpdateProgressBar (count, total, 0);
			else
				UpdateProgressBar (count, total, duplicates);

			thumbnail.Dispose ();
		}

Seems like a good idea if we aren't going to explicitly allow you to add
and remove duplicates in the import dialog.  If we do allow that then
tagging the photos as duplicates is good but the logic will also need to
change.


> +            } else {                
> +                //Console.WriteLine ("Importing Photo: {0}", photo.Name);
> +                collection.Add (photo);
> +            }
>                 
>                         //grid.AddThumbnail (thumbnail);
> -                       UpdateProgressBar (count, total);
> +            if (duplicate_check.Active) {
> +                UpdateProgressBar (count, total, 0);
> +            } else {
> +                UpdateProgressBar (count, total, duplicates);
> +            }
>                         thumbnail.Dispose ();
>                 }
>  
> +        long endTime = DateTime.Now.Ticks;
> +        total_import += endTime - startTime;
> +        TimeSpan timeTaken = new TimeSpan(endTime - startTime);
> +        //Console.WriteLine ("Total time importing the photo {0} ({1} MD5)\n", 
> +        //                   timeTaken.ToString(), timeTakenMD5.ToString ());        
> +
>                 if (ongoing && totaol > 0)
>                         return true;
>                 else {
> @@ -473,8 +526,9 @@ public class ImportCommand : FSpot.Glade
>                 this.importer = imp;
>                 AllowFinish = false;
>  
> +        duplicates = 0;
>                 total = importer.Prepare ();
> -               UpdateProgressBar (0, total);
> +               UpdateProgressBar (0, total, 0);
>                 
>                 collection.Clear ();
>                 collection.Capacity = total;
> @@ -492,10 +546,20 @@ public class ImportCommand : FSpot.Glade
>                         }
>                 }
>  
> +        TimeSpan timeTaken = new TimeSpan(total_import);
> +        TimeSpan timeTakenMD5 = new TimeSpan(total_md5);
> +        //Console.WriteLine ("Total time importing the photos {0} ({1} MD5)\n", 
> +        //                   timeTaken.ToString(), timeTakenMD5.ToString ()); 
> +
>                 FSpot.ThumbnailGenerator.Default.PopBlock ();
>                 
> -               if (importer != null)
> +               if (importer != null) {
> +                       try { 
>                         importer.Finish ();
> +            } catch (System.Exception e) {
> +                Console.WriteLine ("Exception finishing import ..." + e);
> +            }
> +        }
>                 
>                 importer = null;
>  
> @@ -619,6 +683,15 @@ public class ImportCommand : FSpot.Glade
>                 this.Start ();
>         }
>  
> +    
> +    private void HandleDuplicateToggled (object sender, System.EventArgs args)
> +       {
> +               this.Cancel ();
> +               while (Application.EventsPending ())
> +                       Application.RunIteration ();
> +               this.Start ();
> +       }
> +
>         public int ImportFromFile (PhotoStore store, string path)
>         {
>                 this.store = store;
> @@ -643,6 +716,7 @@ public class ImportCommand : FSpot.Glade
>                 AllowFinish = false;
>                 
>                 recurse_check.Toggled += HandleRecurseToggled;
> +        duplicate_check.Toggled += HandleRecurseToggled;
>  
>                 tag_option_menu.Menu = tagmenu;
>                 SourceMenu menu = new SourceMenu ();
> @@ -786,7 +860,7 @@ public class ImportCommand : FSpot.Glade
>         {
>                 Db db = new Db (db_path, true);
>  
> -               ImportCommand command = new ImportCommand ();
> +               ImportCommand command = new ImportCommand (main_window);
>  
>                 command.ImportFromPath (db.Photos, directory_path, true);
>  
> Index: src/MainWindow.cs
> ===================================================================
> RCS file: /cvs/gnome/f-spot/src/MainWindow.cs,v
> retrieving revision 1.218
> diff -u -b -B -p -r1.218 MainWindow.cs
> --- src/MainWindow.cs   2 Oct 2005 07:13:38 -0000       1.218
> +++ src/MainWindow.cs   4 Oct 2005 04:27:22 -0000
> @@ -13,10 +13,11 @@ using System.Text.RegularExpressions;
>  using LibGPhoto2;
>  
>  public class MainWindow {
> -        public static MainWindow Toplevel;
> +    public static MainWindow Toplevel = null;
>  
>         Db db;
>  
> +       
>         TagSelectionWidget tag_selection_widget;
>         [Glade.Widget] Gtk.Window main_window;
>         [Glade.Widget] Gtk.VBox left_vbox;
> @@ -68,6 +69,8 @@ public class MainWindow {
>         [Glade.Widget] MenuItem remove_tag;
>         [Glade.Widget] MenuItem find_tag;
>         
> +    [Glade.Widget] MenuItem find_duplicates;
> +       
>         [Glade.Widget] Scale zoom_scale;
>  
>         [Glade.Widget] VPaned info_vpaned;
> @@ -1686,6 +1689,28 @@ public class MainWindow {
>         void HandleClearDateRange (object sender, EventArgs args) {
>                 query.Range = null;
>         }
> +
> +       void HandleFindDuplicates (object sender, EventArgs args) {
> +        find_duplicates.Sensitive = false;
> +        FSpot.DuplicatesFinder finder = new FSpot.DuplicatesFinder (db, query);
> +        finder.CreateDuplicateTag ();
> +        tag_selection_widget.Update ();
> +        if (!PhotoSelectionActive()) {
> +            icon_view.SelectAllCells ();
> +        }
> +        System.Console.WriteLine ("Looking for duplicates ...");
> +        finder.SearchFinished += new FSpot.DuplicatesFinderEnd (HandleEndDuplicates);
> +        finder.startFind (SelectedIds());
> +    }
> +

I don't like selecting all cells here, the other methods that do similar
things leave the selection alone.  Copy the query array or pass null and
handle null specially in the find method.

> +    void HandleEndDuplicates (Boolean success) {
> +        if (success) {
> +            tag_selection_widget.Select (db.Tags.Duplicate);
> +            UpdateQuery ();
> +            SetViewMode (ModeType.IconView);
> +        }
> +        find_duplicates.Sensitive = true;
> +    }
>  
>         // Version Id updates.
>  
> Index: src/Makefile.am
> ===================================================================
> RCS file: /cvs/gnome/f-spot/src/Makefile.am,v
> retrieving revision 1.43
> diff -u -b -B -p -r1.43 Makefile.am
> --- src/Makefile.am     29 Sep 2005 10:37:09 -0000      1.43
> +++ src/Makefile.am     4 Oct 2005 04:27:22 -0000
> @@ -16,6 +16,7 @@ F_SPOT_CSDISTFILES =                          \
>         $(srcdir)/Delay.cs                      \
>         $(srcdir)/DirectoryAdaptor.cs           \
>         $(srcdir)/DirectoryCollection.cs        \
> +       $(srcdir)/DuplicatesFinder.cs   \
>         $(srcdir)/Exif.cs                       \
>         $(srcdir)/ExifUtils.cs                  \
>         $(srcdir)/FlickrExport.cs               \
> Index: src/PhotoStore.cs
> ===================================================================
> RCS file: /cvs/gnome/f-spot/src/PhotoStore.cs,v
> retrieving revision 1.71
> diff -u -b -B -p -r1.71 PhotoStore.cs
> --- src/PhotoStore.cs   1 Sep 2005 10:01:52 -0000       1.71
> +++ src/PhotoStore.cs   4 Oct 2005 04:27:22 -0000
> @@ -28,6 +28,7 @@ public class Photo : DbItem, IComparable
>                 return Compare (this, photo);
>         }
>         
> +    // FIXME: With md5sum field this could be easy
>         public static int Compare (Photo photo1, Photo photo2)
>         {
>                 int result = photo1.Id.CompareTo (photo2.Id);
> @@ -137,6 +138,16 @@ public class Photo : DbItem, IComparable
>                 }
>         }
>  
> +    private string md5sum;
> +    public string MD5Sum {
> +               get {
> +                       return md5sum;
> +               }
> +               set {
> +                       md5sum = value;
> +               }
> +       }
> +
>         // Version management
>         public const int OriginalVersionId = 1;
>         private uint highest_version_id;
> @@ -393,6 +404,7 @@ public class Photo : DbItem, IComparable
>                 this.name = name;
>  
>                 description = "";
> +        md5sum      = "";
>  
>                 // Note that the original version is never stored in the photo_versions table in the
>                 // database.
> @@ -493,7 +505,8 @@ public class PhotoStore : DbStore {
>                         "       directory_path     STRING NOT NULL,                " +
>                         "       name               STRING NOT NULL,                " +
>                         "       description        TEXT NOT NULL,                  " +
> -                       "       default_version_id INTEGER NOT NULL                " +
> +                       "       default_version_id INTEGER NOT NULL,                 " +
> +            "       md5sum             STRING NOT NULL                  " +
>                         ")";
>  
>                 command.ExecuteNonQuery ();
> @@ -538,8 +551,8 @@ public class PhotoStore : DbStore {
>                 command.Connection = Connection;
>  
>                 command.CommandText = String.Format ("INSERT INTO photos (time, " +
> -                                                    "directory_path, name, description, default_version_id) " +
> -                                                    "       VALUES ({0}, '{1}', '{2}', '{3}', {4})                                       ",
> +                                                    "directory_path, name, description, default_version_id, md5sum) " +
> +                                                    "       VALUES ({0}, '{1}', '{2}', '{3}', {4}, '')                                       ",
>                                                      unix_time,
>                                                      SqlString (System.IO.Path.GetDirectoryName (path)),
>                                                      SqlString (System.IO.Path.GetFileName (path)),
> @@ -728,7 +741,8 @@ public class PhotoStore : DbStore {
>                                                      "       directory_path,                       " +
>                                                      "       name,                                 " +
>                                                      "       description,                          " +
> -                                                    "       default_version_id                    " +
> +                                                    "       default_version_id,                   " +
> +                             "       md5sum                                " +
>                                                      "     FROM photos                             " +
>                                                      "     WHERE id = {0}                          ",
>                                                      id);
> @@ -742,6 +756,7 @@ public class PhotoStore : DbStore {
>  
>                         photo.Description = reader[3].ToString ();
>                         photo.DefaultVersionId = Convert.ToUInt32 (reader[4]);
> +            photo.MD5Sum = reader[5].ToString ();
>                         AddToCache (photo);
>                 }
>  
> @@ -895,11 +910,13 @@ public class PhotoStore : DbStore {
>  
>                 SqliteCommand command = new SqliteCommand ();
>                 command.Connection = Connection;
> -               command.CommandText = String.Format ("UPDATE photos SET description = '{0}',     " +
> -                                                    "                  default_version_id = {1} " +
> -                                                    "              WHERE id = {2}",
> +               command.CommandText = String.Format ("UPDATE photos SET description = '{0}'," +
> +                                                    "                default_version_id = {1}," +
> +                             "                md5sum = '{2}'"+                
> +                                                    "                WHERE id = {3}",
>                                                      SqlString (photo.Description),
>                                                      photo.DefaultVersionId,
> +                             SqlString (photo.MD5Sum),  
>                                                      photo.Id);
>                 command.ExecuteNonQuery ();
>                 command.Dispose ();
> @@ -913,6 +930,7 @@ public class PhotoStore : DbStore {
>                 command.Dispose ();
>  
>                 foreach (Tag tag in photo.Tags) {
> +            
>                         command = new SqliteCommand ();
>                         command.Connection = Connection;
>                         command.CommandText = String.Format ("INSERT INTO photo_tags (photo_id, tag_id) " +
> @@ -1004,6 +1022,7 @@ public class PhotoStore : DbStore {
>                                 
>                                 photo.Description = reader[4].ToString ();
>                                 photo.DefaultVersionId = Convert.ToUInt32 (reader[5]);           
> +                photo.MD5Sum = reader[6].ToString ();
>                                 
>                                 version_list.Add (photo);
>                         }
> @@ -1043,7 +1062,8 @@ public class PhotoStore : DbStore {
>                                                      "       photos.directory_path,              " +
>                                                      "       photos.name,                        " +
>                                                      "       photos.description,                 " +
> -                                                    "       photos.default_version_id           " +
> +                                                    "       photos.default_version_id,          " +
> +                             "       photos.md5sum                       " +
>                                                      "     FROM photos                           " +
>                                                      "     WHERE directory_path = \"{0}\"", dir.FullName);
>  
> @@ -1084,7 +1104,8 @@ public class PhotoStore : DbStore {
>                                       "       photos.directory_path,              " +
>                                       "       photos.name,                        " +
>                                       "       photos.description,                 " +
> -                                     "       photos.default_version_id           " +
> +                                     "       photos.default_version_id,          " +
> +                      "       photos.md5sum                       " +
>                                       "     FROM photos                      ");
>                 
>                 if (range != null) {

At this point we really need a way to properly migrate the db for users
there are real people that have real data in f-spot now.  So we need to
detect and change the table at load time, which probably means walking
all the photos and adding a sum.  I mind less if the upgrade takes some
time than if the f-spot crashes with the old db format.

Other than that nothing much to mention here other than the obvious
formatting issues.

> 
> Index: src/TagStore.cs
> ===================================================================
> RCS file: /cvs/gnome/f-spot/src/TagStore.cs,v
> retrieving revision 1.19
> diff -u -b -B -p -r1.19 TagStore.cs
> --- src/TagStore.cs     27 Sep 2005 03:15:05 -0000      1.19
> +++ src/TagStore.cs     4 Oct 2005 04:27:22 -0000
> @@ -257,6 +257,16 @@ public class TagStore : DbStore {
>                 }
>         }
>  
> +    private Tag duplicate;
> +       public Tag Duplicate {
> +               get {
> +                       return duplicate;
> +               }
> +        set {
> +            duplicate = value;
> +        }
> +       }
> +
>         // In this store we keep all the items (i.e. the tags) in memory at all times.  This is
>         // mostly to simplify handling of the parent relationship between tags, but it also makes it
>         // a little bit faster.  We achieve this by passing "true" as the cache_is_immortal to our
> @@ -290,6 +300,9 @@ public class TagStore : DbStore {
>                         
>                         if (tag.Name == "Hidden")
>                                 hidden = tag;
> +
> +            if (tag.Name == "Duplicate")
> +                duplicate = tag;            
>                 }
>  
>                 reader.Close ();

I'm not excited to have another "special" tag like hidden, but in this
case it doesn't really make things much worse than the hidden tag.  I
need to figure out a clean way to fix this.

> @@ -354,6 +367,12 @@ public class TagStore : DbStore {
>                 this.hidden = hidden_tag;
>                 Commit (hidden_tag);
>  
> +        Tag duplicate_tag = CreateTag (RootCategory, "Duplicate");
> +               duplicate_tag.StockIconName = "f-spot-hidden.png";
> +               duplicate_tag.SortPriority = -11;
> +               this.duplicate = duplicate_tag;
> +               Commit (duplicate_tag);
> +
>                 Tag people_category = CreateCategory (RootCategory, "People");
>                 people_category.StockIconName = "f-spot-people.png";
>                 people_category.SortPriority = -8;
> @@ -455,6 +474,11 @@ public class TagStore : DbStore {
>                     category.Children.Length > 0)
>                         throw new InvalidTagOperationException (category, "Cannot remove category that contains children");
>  
> +        // FIXME: Hack!
> +        if (string.Compare (((Tag) item).Name, "Duplicate") == 0) {
> +            duplicate = null;
> +        }
> +        
>                 RemoveFromCache (item);
>                 
>                 ((Tag)item).Category = null;
> Index: src/f-spot.glade
> ===================================================================
> RCS file: /cvs/gnome/f-spot/src/f-spot.glade,v


The only comment I have on the glade changes are that I think having the
duplicate check inline with the other check is probably a little
visually confusing.

On to the included file


> using System;
> using System.Collections;
> using System.IO;
> using System.Security.Cryptography;
> using System.Text;
> 
> namespace FSpot {
>     public delegate void DuplicatesFinderEnd (Boolean success);
> 
>     public class DuplicatesFinder {
>         private System.Threading.Thread    duplicates_thread = null;
>         private FSpot.ThreadProgressDialog progress_dialog_duplicates;
>         private ArrayList                  duplicates = null;
>         private Boolean                    end_duplicates = false;
>         private uint                       duplicates_timer = 0;
>         private System.Threading.Thread    md5_thread = null;
>         private FSpot.ThreadProgressDialog progress_dialog_md5;
>         private Boolean                    end_computeMD5 = false;
>         private uint                       computeMD5_timer = 0;
> 
>         // Shared data variables we need to work with
>         private Db                         db;
>         private int[]                      selected_ids;
>         private FSpot.PhotoQuery           query;
>     
>         // To inform that the Finder has finished
>         public event DuplicatesFinderEnd   SearchFinished;
>     

Did I already mention I don't like to have the variable names aligned?
All to often adding a variable means needless whitespace changes when
you do this, and I don't see much benefit.

>         public DuplicatesFinder (Db                 db, 
>                                  FSpot.PhotoQuery   query)
> 
>         {
>             this.db = db;
>             this.query = query;
>         }


The previous comment goes double for method parameters.  It is ok to
wrap them, but don't align them.

> 
>         public void CreateDuplicateTag () {
>             Tag tag_duplicate = db.Tags.Duplicate;
>             if (tag_duplicate == null) {
>                 Console.WriteLine ("Creating the Duplicate tag ...");
>                 tag_duplicate = db.Tags.CreateTag (null, "Duplicate");
>                 db.Tags.Duplicate = tag_duplicate;
>                 tag_duplicate.StockIconName = "f-spot-hidden.png";
>                 tag_duplicate.SortPriority = -11;
>                 db.Tags.Commit (tag_duplicate);
>             } else {
>                 Console.WriteLine ("The duplicate tag already exists ...");
>             }
>         }
> 
>         public void startFind (int [] selected_ids) {
>             this.selected_ids = selected_ids;
>             md5_thread = 
>                 new System.Threading.Thread (new System.Threading.ThreadStart (this.ComputeMD5));
>             md5_thread.Name = Mono.Posix.Catalog.GetString ("Creating image unique identifiers");
>             progress_dialog_md5 = new 
>                 FSpot.ThreadProgressDialog (md5_thread, selected_ids.Length);
>             progress_dialog_md5.Start();
>             StartComputeMD5Timer ();
>         }
> 

I'd like all method names to be PascalCase not camelCase.

>         private bool HandleDuplicatesTimer ()
>         {
>             Console.WriteLine ("Duplicates Timer ...");
>             if (end_duplicates || !duplicates_thread.IsAlive) {
>                 if (db.Tags.Duplicate == null) {
>                     CreateDuplicateTag ();
>                 }
>                 end_duplicates = false;
>                 duplicates_timer = 0;
>                 foreach (int num in duplicates) {
>                     Console.WriteLine ("Tagging duplicate photo {0}", num);
>                     query.Photos[num].AddTag (db.Tags.Duplicate);
>                     query.Commit (num);
>                 }
>                 if (SearchFinished != null) {
>                     SearchFinished (true);
>                 }
>                 return false;
>             }
>             return true;
>         }
> 
>         private void StartDuplicatesTimer ()
>         {
>             if (duplicates_timer == 0)
>                 duplicates_timer = 
>                     GLib.Timeout.Add (100, new GLib.TimeoutHandler (HandleDuplicatesTimer));
>         }
> 
>         private bool HandleComputeMD5Timer ()
>         {
>             Console.WriteLine ("MD5 Timer ...");
>             if (md5_thread.IsAlive) {
>                 Console.WriteLine ("The MD5 thread is alive ...");
>             }
>             if (end_computeMD5 || !md5_thread.IsAlive) {
>                 if (progress_dialog_md5 != null) {
>                     progress_dialog_md5.Destroy ();
>                 }
>                 computeMD5_timer = 0;
>                 if (end_computeMD5) {
>                     end_computeMD5 = false;
>                     /* duplicates_thread = new System.Threading.Thread 
>                         (new System.Threading.ThreadStart (this.FindDuplicates));
>                     duplicates_thread.Name = Mono.Posix.Catalog.GetString ("Finding Duplicates");
>                     progress_dialog_duplicates = 
>                     new FSpot.ThreadProgressDialog (duplicates_thread, selected_ids.Length);
>                     progress_dialog_duplicates.Start (); */
>                     StartDuplicatesTimer (); 
>                     lock (this) {
>                         end_duplicates = true;
>                     }
>                 } else {
>                     if (SearchFinished != null) {
>                         SearchFinished (false);
>                     }
>                 }
>                 return false;
>             }
>             return true;
>         }
> 
>         private void StartComputeMD5Timer ()
>         {
>             if (computeMD5_timer == 0)
>                 computeMD5_timer = 
>                     GLib.Timeout.Add (1000, new GLib.TimeoutHandler (HandleComputeMD5Timer));
>         }
> 
> 
>         public static string PhotoMD5 (Photo photo) {
>             FileStream fs = new FileStream(photo.Path, FileMode.Open, FileAccess.Read);
>             MD5 md5ServiceProvider = new MD5CryptoServiceProvider();
>             byte[] md5 = md5ServiceProvider.ComputeHash(fs);
>             
>             StringBuilder hash = new StringBuilder();
>             for (int pos = 0; pos < md5.Length; pos++) {
>                 hash.Append(md5[pos].ToString("X2").ToLower());
>             }
>             return hash.ToString ();
>         }
> 

Is there any advantage to using a string rather than the byte array?  It
uses 4 times as much memory per hash, and since the strings would
typically be unique you don't gain anything.  We could still store them
in the db as a string.  

Also Use a using block around around the FileStream and probably use
System.IO.File.OpenRead.

>         /* FIXME: We should use a cache of all photos MD5? */
>         public static bool IsDuplicate (Photo photo) {
>             Photo[] photos;
>             photos = MainWindow.Toplevel.Database.Photos.Query (null, null);
>             if (photos == null) {
>                 Console.WriteLine ("No photos in duplicates ...");
>                 return false;
>             }
>             if (photo.MD5Sum == null) {
>                 photo.MD5Sum = PhotoMD5 (photo);
>             }
>             bool  duplicate = false;
>             foreach (Photo p in photos) {
>                 if (p.MD5Sum == null) {
>                     p.MD5Sum = PhotoMD5 (p);
>                 }
>                 //Console.WriteLine ("Looking: {0},{1} {2},{3}", 
>                 //                   p.Name, p.MD5Sum, photo.Name, photo.MD5Sum);
>                 if (p.MD5Sum.CompareTo(photo.MD5Sum) == 0 &&
>                     p.Id != photo.Id) {
>                     //Console.WriteLine ("DUPLICATES");
>                     duplicate = true;
>                     break;
>                 }
>             }            
>             return duplicate;
>         } 
> 

Ok, this looks like something we definitely want to avoid and is also
part of the reason why I wanted to reduce the two calls to IsDuplicate
to one.  First we query for all the photos in the database then we walk
over all the items.  The database is potentially large (40,000 photos in
Nat's case) and this walks in first in the query then in the in the
foreach.  You are storing the hash in the db why don't you just add a
query for the hash?

>         // Executed in a different thread
>         private void ComputeMD5 () {
>             // A really simple implementation: Build hash table with MD5 of images
>             // and we later use it to find duplicates. If the MD5 is in the hash table
>             // we don't add it. We have only the "original" in this table.
>             Hashtable photos_md5 = new Hashtable ();
>             duplicates = new ArrayList ();
> 
>             int counter = 1;
>             foreach (int num in selected_ids) {
>                 Photo photo = query.Photos [num];
>                 progress_dialog_md5.Message = System.String.Format 
>                     (Mono.Posix.Catalog.GetString ("Creating unique identifier for {0}"), photo.Name);
>                 progress_dialog_md5.Fraction = counter / (double) selected_ids.Length;
>                 progress_dialog_md5.ProgressText = System.String.Format (Mono.Posix.Catalog.GetString ("{0} of {1}"), counter, selected_ids.Length);
>                 if (string.Compare(photo.MD5Sum, "") != 0) {
>                     // It is a duplicate. We should mark it.
>                     if (photos_md5[photo.MD5Sum] != null) {
>                         duplicates.Add (num);
>                     } else {
>                         photos_md5.Add (photo.MD5Sum, num);
>                     }
>                     counter++;
>                     continue;
>                 }
>                 // Computing time is measured for testing purposes
>                 try {
>                     long startTime = DateTime.Now.Ticks;
>                     FileStream fs = new FileStream(photo.Path, FileMode.Open, FileAccess.Read);
>                     MD5 md5ServiceProvider = new MD5CryptoServiceProvider();
>                     byte[] md5 = md5ServiceProvider.ComputeHash(fs);
> 
>                     StringBuilder hash = new StringBuilder();
>                     for (int pos = 0; pos < md5.Length; pos++) {
>                         hash.Append(md5[pos].ToString("X2").ToLower());
>                     }
>                     long endTime = DateTime.Now.Ticks;
>                     TimeSpan timeTaken = new TimeSpan(endTime - startTime);
>                     Console.WriteLine("MD5 compute: {0}", timeTaken.ToString());
>                     photo.MD5Sum = hash.ToString ();
>                     db.Photos.Commit (photo);
>                     counter++;                    
>                     // It is a duplicate. We should mark it.
>                     if (photos_md5[photo.MD5Sum] != null) {
>                         duplicates.Add (num);
>                     } else {
>                         photos_md5.Add (photo.MD5Sum, num);
>                     }
>                 } catch (Exception ex) {
>                     Console.WriteLine ("Problems creating MD5: {0} ", photo.Path);
>                     Console.WriteLine (ex);
>                 }
>             }
>             progress_dialog_md5.Message = Mono.Posix.Catalog.GetString ("Done Creating Unique Identifiers for Photos");
>             progress_dialog_md5.Fraction = 1.0;
>             progress_dialog_md5.ProgressText = Mono.Posix.Catalog.GetString ("All unique identifiers created.");
>             progress_dialog_md5.ButtonLabel = Gtk.Stock.Ok;
>             lock (this) {
>                 end_computeMD5 = true;
>             }
>         }
>     }

This contains another implementation of the PhotoMD5 method, use the one
you already have above (once it is fixed).  Also this might be a
candidate for using the query as well, at least in the nothing is
selected case.

Now that I'm here at the end I'm not sure what the point of the
duplicate tag is.  Is it there for a general purpose or are you planning
to just use it in the import dialog.  If it is the later I think we can
use a different trick.  If it is the former, what are your plans?  Also
I notice the duplicate detection doesn't take versions into account.
You probably need to store a sum on each version as well and check them
in all the tests (ouch that adds more complications).  At least this
would be a good time to add the directory path to the version table
which is badly needed for other things. 

Anyway, the patch is awesome and solves a real problem and would be
great to get in,  but there is a bit of refactoring and cleanup that
still needs to be done.  I'm really sorry I haven't given it a proper
review before now.  I hope this at least clarifies why I didn't think it
was quite ready.

Thank you so much for all the hard work and time it must have taken to
get things this far, I'd love to solve the remaining issues as fast as
possible and move on to the other patches you have waiting.

--Larry






[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]