Re: Duplicates patch new version - Check box in importing



This message is getting long so I'll delete long chunks.

On Sun, 2005-10-23 at 11:31 +0200, Alvaro del Castillo wrote:
> Hi!
> 
> On Wed, 19-10-2005 at 13:32 -0500, Larry Ewing wrote:
> > Alvaro, As promised here is a review of the duplicates patch. It is a
> > very useful patch and does a lot of things very well but I have a few
> > issues still.
> > 
> 
> I expected that! It is the first time we have tried to integrate this
> code, so I understand it perfectly.
> 
> > I'm going to go into a lot of detail in this review because I haven't
> > done enough public review on this list and I want everyone to know
> > what I'd like to see.  In particular I'm going to nitpick on formatting
> > a bit, not because you did anything wrong (I hadn't explicitly stated
> > any of this before, so how could you follow it), but because I want
> > everyone to get a feel for what I'd like to see. Please forgive me for
> > using you as a public example; I'm extremely grateful for the work
> > you've done.
> 
> Don't worry Larry, this is exactly what I was waiting for so that we
> can integrate code upstream. I have tried to follow the coding style in
> some way, but until now there wasn't a documented formatting style, so
> I had no rules to follow. Now it is time to follow them so we have
> clean code :)
> 
> I plan to follow your comments and also make sure that, in the future,
> we have a clear path for other developers to follow the style.

Great, thanks for helping out.

> I am reading right now the HACKING file in the CVS:
> 
> http://cvs.gnome.org/viewcvs/f-spot/HACKING?rev=1.1&view=markup
> 
> So the first step is to be sure we have an open issue in bugzilla about
> the duplicates feature.
> 
> http://bugs.gnome.org/buglist.cgi?product=f-spot&&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED
> 
> Reading the bugs, I think this one could be the right bug to track the
> work and details of the patch:
> 
> http://bugs.gnome.org/show_bug.cgi?id=169646
> 
> So I will finally attach the patch to this bugzilla entry.
> 

Sounds good.

> Now about the coding style:
> 
> <<Beagle attempts to follow the Mono coding conventions.  The following
> description of those conventions was shamelessly stolen from Beagle's
> HACKING file.>>
> 
> s/Beagle/F-Spot
> 

done (whoops :))

[snip]

> > 
> > Here we see a problem with the whitespace formatting.  Use tabs and if
> > you must treat tabs as spaces they are 8 spaces wide not four.
> > 
> 
> I will use 8-space tabs. I am working with Emacs, so I will add the
> configuration proposed in the HACKING file to my ".emacs" file.
> 
> Done! Now there is only one space between the type and the variable
> name, and all the lines use 8-space tabs.
> 

Thanks, I know it is a bit pedantic but it will help in the long run.

[snip]

> > Speed on import is important because it is one of the first things
> > people will deal with in f-spot and is potentially very long running.
> > With that in mind I have a couple of questions.  Why are you calling
> > IsDuplicate twice here, it is potentially quite slow and could trivially
> > be called only once, unless I'm missing something important.  I'll have
> > more to say about IsDuplicate later.
> > 
> 
> It is really strange how I ended up with this code here. It is clearly
> a bug not to reuse the result of the first
> FSpot.DuplicatesFinder.IsDuplicate call in the second block. I have
> fixed it right now.
> 
> I have tried to check carefully the additional time we spend computing
> the MD5. As you said, the user's first impression of F-Spot will come
> from importing photos. If that is a very slow process, we start
> creating a bad feeling in the user, which is a very bad thing.
> 

Right, it is clear from the patch you were thinking about speed and I
greatly appreciate that.

[snip]

> 
> Currently you can check or uncheck the import duplicates checkbox. But
> once you do, the import process restarts from zero, just like it does
> with the include subdirectories checkbox.
> 

Sounds good, once we get this patch in I have some ideas to make the
process of excluding files a little more generic and interactive but
let's get this patch in first.

[snip]

> I have corrected the coding style in the whole file using the correct
> tab size, and I have commented out all the code that measures MD5
> performance. So it seems we have finished reviewing ImportCommand.cs.
> 

Great!

[big snip]

> > 
> > At this point we really need a way to properly migrate the db for
> > users; there are real people with real data in f-spot now.  So we need
> > to detect and change the table at load time, which probably means
> > walking all the photos and adding a sum.  I mind less if the upgrade
> > takes some time than if f-spot crashes with the old db format.
> > 
> 
> Ok, I hadn't done it because the README file says the <<database is
> still changing and the code doesn't handle upgrading existing tables
> when the schema changes>>.
> 
> But I feel the same as you here. We should migrate the old database
> format to the new format and, from now on, update the README file if we
> plan to support the current database scheme, so users can start
> investing time in filling their photo databases.
> 
> I see here a clear stopper for the patch until this upgrade code is
> done, no? I will do it but I don't know exactly the time frame.
> 
> 

Yeah, it needs to at least do something sane when it encounters an old
db.
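Something along these lines might be enough. This is just an untested sketch, not what the patch does: the helper name and connection handling are invented here, and only the photos table and md5sum column names come from the patch.

```csharp
using System;
using Mono.Data.SqliteClient;

public static class PhotoDbUpgrader
{
	// Sketch: detect a missing md5sum column at load time and add it,
	// so an old db doesn't crash f-spot.
	public static void MaybeUpgradePhotosTable (SqliteConnection connection)
	{
		bool have_md5 = false;

		// PRAGMA table_info lists one row per column; column 1 is the name.
		using (SqliteCommand command = new SqliteCommand ("PRAGMA table_info(photos)", connection)) {
			SqliteDataReader reader = command.ExecuteReader ();
			while (reader.Read ()) {
				if (reader [1].ToString () == "md5sum")
					have_md5 = true;
			}
			reader.Close ();
		}

		if (have_md5)
			return;

		// Old schema: add the column.  Existing rows get a NULL sum, which
		// the import code could treat as "not computed yet", so the
		// expensive walk over all photos can happen lazily rather than
		// making startup slow.
		using (SqliteCommand command = new SqliteCommand ("ALTER TABLE photos ADD COLUMN md5sum TEXT", connection)) {
			command.ExecuteNonQuery ();
		}
	}
}
```

That way an old db at least opens cleanly, and the sums get filled in as photos are touched.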

> > Other than the obvious formatting issues, there isn't much to mention
> > here.
> > 
> 
> Yes, I have fixed the formatting issues.
> 
> > > 
> > > Index: src/TagStore.cs
> > > ===================================================================
> > > RCS file: /cvs/gnome/f-spot/src/TagStore.cs,v
> > > retrieving revision 1.19
> > > diff -u -b -B -p -r1.19 TagStore.cs
> > > --- src/TagStore.cs     27 Sep 2005 03:15:05 -0000      1.19
> > > +++ src/TagStore.cs     4 Oct 2005 04:27:22 -0000
> > > @@ -257,6 +257,16 @@ public class TagStore : DbStore {
> > >                 }
> > >         }
> > >  
> > > +    private Tag duplicate;
> > > +       public Tag Duplicate {
> > > +               get {
> > > +                       return duplicate;
> > > +               }
> > > +        set {
> > > +            duplicate = value;
> > > +        }
> > > +       }
> > > +
> > >         // In this store we keep all the items (i.e. the tags) in memory at all times.  This is
> > >         // mostly to simplify handling of the parent relationship between tags, but it also makes it
> > >         // a little bit faster.  We achieve this by passing "true" as the cache_is_immortal to our
> > > @@ -290,6 +300,9 @@ public class TagStore : DbStore {
> > >                         
> > >                         if (tag.Name == "Hidden")
> > >                                 hidden = tag;
> > > +
> > > +            if (tag.Name == "Duplicate")
> > > +                duplicate = tag;            
> > >                 }
> > >  
> > >                 reader.Close ();
> > 
> > I'm not excited to have another "special" tag like hidden, but in this
> > case it doesn't really make things much worse than the hidden tag.  I
> > need to figure out a clean way to fix this.
> 
> Yes, I had the same problem thinking about this issue. But solving the
> hidden issue will probably be the same as solving the duplicate issue,
> so for now I decided to do it that way.
> 

agreed.

> > 
> > Did I already mention I don't like to have the variable names aligned?
> > All too often adding a variable means needless whitespace changes when
> > you do this, and I don't see much benefit.
> 
> I will change it. I think the aligned code is more readable, and I
> inherited the habit from my work on Planner. But I will change it and
> use a single space between the type and the variable name.
> 
> Just an example, try to read:
> 
> private System.Threading.Thread    duplicates_thread = null;
> private FSpot.ThreadProgressDialog progress_dialog_duplicates;
> private ArrayList                  duplicates = null;
> private Boolean                    end_duplicates = false;
> private uint                       duplicates_timer = 0;
> private System.Threading.Thread    md5_thread = null;
> private FSpot.ThreadProgressDialog progress_dialog_md5;
> private Boolean                    end_computeMD5 = false;
> private uint                       computeMD5_timer = 0;
> 
> and now
> 
> private System.Threading.Thread duplicates_thread = null;
> private FSpot.ThreadProgressDialog progress_dialog_duplicates;
> private ArrayList duplicates = null;
> private Boolean end_duplicates = false;
> private uint duplicates_timer = 0;
> private System.Threading.Thread md5_thread = null;
> private FSpot.ThreadProgressDialog progress_dialog_md5;
> private Boolean end_computeMD5 = false;
> private uint computeMD5_timer = 0;
> 

Yeah, my argument is definitely not that it is prettier just that it
leads to less noise in the patches when variables are added and removed.
I'm not terribly strict about this and I care less about newly written
code than reformatting old code.  I also break this rule on things like
enums because the important item there is the difference in value.  

[snip]

> > 
> > Is there any advantage to using a string rather than the byte array?  It
> > uses 4 times as much memory per hash, and since the strings would
> > typically be unique you don't gain anything.  We could still store them
> > in the db as a string.
> 
> 
> Ok, so you are thinking of storing the hash value in memory as a byte
> array and, when it is time to save it to the database, storing it as a
> string, no? We then have the same conversion when we load the photo
> data from the database, because the hash will be a string and we have
> to convert it back to the byte[] structure. I understand that if we
> have thousands of photos in memory this issue could be important.
> 
> So this is going to be the second TODO (the first is to add code to
> upgrade the database schema).
> 

Great!
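For the conversion itself, something small like this should be all it takes (a sketch; the class and method names here are invented):

```csharp
using System;
using System.Text;

public static class Md5Convert
{
	// byte[] -> lowercase hex string, for writing the sum to the db.
	public static string ToDbString (byte [] sum)
	{
		StringBuilder builder = new StringBuilder (sum.Length * 2);
		foreach (byte b in sum)
			builder.Append (b.ToString ("x2"));
		return builder.ToString ();
	}

	// hex string from the db -> byte[], for keeping the sum in memory.
	public static byte [] FromDbString (string sum)
	{
		byte [] bytes = new byte [sum.Length / 2];
		for (int i = 0; i < bytes.Length; i++)
			bytes [i] = Convert.ToByte (sum.Substring (i * 2, 2), 16);
		return bytes;
	}
}
```

An MD5 sum is 16 bytes in memory versus 32 chars (64 bytes as a .NET string), which is roughly the 4x difference mentioned above.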

> 
> > 
> > Also use a using block around the FileStream, and probably use
> > System.IO.File.OpenRead.
> 
> Done! Yes, it is a good idea to force the Dispose call in order to
> close the file as soon as possible.
> 
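For reference, the shape being discussed is roughly this (a sketch, not the patch's actual code; the class and method names are invented):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class PhotoHasher
{
	// Hash a file with the stream released as soon as the hash is done;
	// the using block guarantees Dispose even if ComputeHash throws.
	public static byte [] HashFile (string path)
	{
		using (Stream stream = File.OpenRead (path)) {
			MD5 md5 = MD5.Create ();
			return md5.ComputeHash (stream);
		}
	}
}
```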
> > 
> > >         /* FIXME: We should use a cache of all photos MD5? */
> > >         public static bool IsDuplicate (Photo photo) {
> > >             Photo[] photos;
> > >             photos = MainWindow.Toplevel.Database.Photos.Query (null, null);
> > >             if (photos == null) {
> > >                 Console.WriteLine ("No photos in duplicates ...");
> > >                 return false;
> > >             }
> > >             if (photo.MD5Sum == null) {
> > >                 photo.MD5Sum = PhotoMD5 (photo);
> > >             }
> > >             bool  duplicate = false;
> > >             foreach (Photo p in photos) {
> > >                 if (p.MD5Sum == null) {
> > >                     p.MD5Sum = PhotoMD5 (p);
> > >                 }
> > >                 //Console.WriteLine ("Looking: {0},{1} {2},{3}", 
> > >                 //                   p.Name, p.MD5Sum, photo.Name, photo.MD5Sum);
> > >                 if (p.MD5Sum.CompareTo(photo.MD5Sum) == 0 &&
> > >                     p.Id != photo.Id) {
> > >                     //Console.WriteLine ("DUPLICATES");
> > >                     duplicate = true;
> > >                     break;
> > >                 }
> > >             }            
> > >             return duplicate;
> > >         } 
> > > 
> > 
> > Ok, this looks like something we definitely want to avoid, and it is
> > also part of the reason why I wanted to reduce the two calls to
> > IsDuplicate to one.  First we query for all the photos in the database,
> > then we walk over all the items.  The database is potentially large
> > (40,000 photos in Nat's case) and this walks the photos first in the
> > query and then again in the foreach.  You are storing the hash in the
> > db, so why don't you just add a query for the hash?
> 
> Yes, good point. An SQL query against the database looking for a
> duplicate could save us this loop.
> 
> To implement it I have added a new public method to PhotoStore that
> does the query. In DuplicatesFinder we don't work with the database
> directly, and it is very natural to include this method in PhotoStore.
> 
> public bool IsDuplicate (Photo photo) {
> 		bool duplicate = false;
> 
> 		if (photo.MD5Sum == null) {
> 			photo.MD5Sum = FSpot.DuplicatesFinder.PhotoMD5 (photo);
> 		}
> 
> 		SqliteCommand command = new SqliteCommand ();
> 		command.Connection = Connection;
> 		command.CommandText = String.Format ("SELECT id FROM photos WHERE md5sum = '{0}'", photo.MD5Sum);
> 		SqliteDataReader reader = command.ExecuteReader ();
> 		if (reader.Read ()) {
> 			duplicate = true;
> 		}
> 		reader.Close ();
> 		command.Dispose ();
> 		return duplicate;
> 	}
> 
> 

This looks good, I wonder if there are circumstances where we'd want to
actually get the duplicate photos? 
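For example, a variant that returns the matching ids and binds the hash as a parameter instead of building it into the SQL with String.Format. This is an untested sketch: it assumes the Sqlite binding supports ":name" parameters, and only the photos table and md5sum column come from the patch.

```csharp
using System;
using System.Collections;
using Mono.Data.SqliteClient;

public static class DuplicateQuery
{
	// Return the ids of all photos whose md5sum matches the given one.
	// A parameterized query keeps the hash value out of the SQL text,
	// so odd characters can't break the statement.
	public static ArrayList FindDuplicateIds (SqliteConnection connection, string md5sum)
	{
		ArrayList ids = new ArrayList ();

		SqliteCommand command = new SqliteCommand ();
		command.Connection = connection;
		command.CommandText = "SELECT id FROM photos WHERE md5sum = :md5sum";
		command.Parameters.Add (new SqliteParameter (":md5sum", md5sum));

		SqliteDataReader reader = command.ExecuteReader ();
		while (reader.Read ())
			ids.Add (Convert.ToUInt32 (reader [0]));
		reader.Close ();
		command.Dispose ();

		return ids;
	}
}
```

With something like that, the import dialog could offer to show the existing copies instead of just saying "duplicate".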

[snip]

> > This contains another implementation of the PhotoMD5 method, use the one
> > you already have above (once it is fixed).  Also this might be a
> > candidate for using the query as well, at least in the nothing is
> > selected case.
> 
> Yes, that is the first implementation I did, before we were computing
> the MD5Sum on import. Since the MD5Sum is now computed while importing,
> that code should be cleaned out, so I have removed all of it in my CVS
> copy.
> 

Great, that makes sense.

> > 
> > Now that I'm here at the end, I'm not sure what the point of the
> > duplicate tag is.  Is it there for a general purpose or are you
> > planning to just use it in the import dialog?  If it is the latter, I
> > think we can use a different trick.  If it is the former, what are your
> > plans?  Also I notice the duplicate detection doesn't take versions
> > into account.  You probably need to store a sum on each version as well
> > and check them in all the tests (ouch, that adds more complications).
> > At least this would be a good time to add the directory path to the
> > version table, which is badly needed for other things.
> 
> Ok, we are right now at an important and historic point for the
> patch :)
> 
> When I first started working on the patch, my target was to add
> duplicate support for already imported photos. With this in mind, I
> created the "Find Duplicates" entry and all the dialogs that informed
> the user about the MD5 creation for photos. I found that the best way
> to show the user the duplicates she has is to use the existing search
> system in F-Spot: tags. So it is very natural to find the duplicates
> by searching for the tag Duplicate.
> 
> Talking about the duplicates feature on the list, it was concluded that
> the most needed feature at first is to detect duplicates on import, so
> I implemented it in the importing phase. If we compute the MD5 on
> import, then all the dialogs showing the MD5 computation later in the
> Find Duplicates menu entry aren't needed, the DuplicatesFinder.cs class
> gets lighter and lighter, and I feel we could remove it. I bet that the
> final patch should be:
> 
> - Support for duplicates on import
> - If importing duplicates, tag the photos
> - Use the duplicate tag to find the duplicates
> 
> I think that sometimes people will import duplicates anyway (for
> example a database created before the duplicate feature was added, or a
> mistake while importing) and we have to give them a way to clean them
> up afterwards. The duplicate tag could be the solution; maybe we can
> then remove the "Find duplicates" menu entry. I understand that another
> tag makes F-Spot more complex and the user will need more mental effort
> to use the tool. If you think it is better not to include this support,
> it is very easy to remove the tagging of duplicates for now and include
> the patch without this support at all.
> 

The tag seems a bit confusing to me and extends the meaning of tags in
a way I'm not sure we need.  Let's not tag them for now, and just use
the menu item.  If we come up with a compelling use for the tag we can
always add it later.

> >  
> > 
> > Anyway, the patch is awesome and solves a real problem and would be
> > great to get in,  but there is a bit of refactoring and cleanup that
> > still needs to be done.  I'm really sorry I haven't given it a proper
> > review before now.  I hope this at least clarifies why I didn't think it
> > was quite ready.
> > 
> 
> Sure! It is normal, the first time a new coder wants to contribute to
> a project, for the project maintainer to give them some rules to
> follow. It is great that now I can follow them and we can start working
> more closely in the near future.
> 

Yeah, I'm very sorry I've been so bad about this.  I'll try to be more
responsive going forward.  I really do appreciate all the contributions.

> 
> > Thank you so much for all the hard work and time it must have taken to
> > get things this far, I'd love to solve the remaining issues as fast as
> > possible and move on to the other patches you have waiting.
> > 
> 
> Sure, I think we can resolve all the pending issues. The remaining
> issues are:
> 
> 1. Should the database schema upgrade code be included before the patch
> goes in? In any case I will work on it, sooner or later depending on
> the answer.
> 

At least to the point of not crashing with old databases.

> 2. The MD5Sum should be stored in memory as a byte array and converted
> when reading it as a string from the database and writing it as a
> string to the database. Right?

Sounds good.

> 3. IsDuplicate is now moved to PhotoStore.cs and queries the database
> to find out if the photo to be imported is a duplicate. Right?
> 

yes.

> 4. Do we want duplicate photo detection after importing? If not, I
> will clean out all the code for it. If yes, should we continue using
> the duplicate tag? If maybe, I will clean the code so the import part
> of the patch can go upstream and we can keep talking about the rest.
> 

Let's keep the duplicate detection menu item for after import, but not
use the tag for now, unless you have a strong reason for it.

> 
> Ok, I think that's all folks! Now it is time to make some decisions and
> I will send an updated patch to the list.
> 

Thanks again Alvaro, I really appreciate all the work.

--Larry





