Re: Ignore contents: compare by file name, date/time & size only

From: Kai <kai willadsen gmail com>
To: Kevin Grover <kevin kevingrover net>
Cc: meld-list gnome org, Martin Spacek <gmane martinspacek mm st>
Subject: Re: Ignore contents: compare by file name, date/time & size only
Date: Sat, 18 Dec 2010 10:29:24 +1000

On 13 December 2010 14:31, Kevin Grover <kevin kevingrover net> wrote:
> On Sat, Dec 11, 2010 at 21:13, Martin Spacek <gmane martinspacek mm st>
> wrote:
>>
>> I often need to compare directories with 10s or 100s of GB of binary files
>> in them (generally between my internal and external drives). Coming from
>> "Beyond Compare", I'm used to directory comparisons relying only on file
>> name, date/time, and file size to decide which files differ, which are
>> missing, etc.
>>
>> Meld looks great, but seems to insist on comparing the contents of each
>> file. This is far too expensive an operation for big files. Am I missing
>> some hidden option that allows you to turn off "compare by contents" in the
>> directory comparison view? Would this be a difficult feature to add?
>>
>> I can't seem to find any mention of this in the mailing list. I did a
>> brief search of bugzilla and came up with nothing. I'm running the latest
>> version from git. All I found were some blogs/articles mentioning this
>> limitation in Meld.
>>
>> Cheers,
>>
>> Martin
>>
>
> You could do this with some scripting.  I did this a while ago while
> experimenting with image manipulation programs: I wanted to see if files had
> been renamed or changed.
>
> I basically created a hash (md5 or sha1, I can't remember) and saved this to
> a file (sorted by hash).  I did one 'hash list file' for each tree: the
> original and the current.  I then diff'ed those text files.
>
> find PATH -type f -print0 | xargs -0 md5sum | sort > hash-original.txt
>
> Do some work
>
> find PATH -type f -print0 | xargs -0 md5sum | sort > hash-new.txt
>
> diff hash-original.txt hash-new.txt
>
> If you cache the hashes and only rehash files that have changed (e.g.
> date/time or file size is different than last time), you can significantly
> speed things up.  I've been evolving some Python code to do this hash
> caching, which I'm using for an in-house file cataloging program.   I've
> wanted to clean it up and post it, but have not made myself take the time.

Meld already caches comparison results (i.e., did we decide that this
file was the same as that file) but we don't persist the cache across
sessions. Persisting this could make a big difference, since in most
cases, people complain about how slow these comparisons are precisely
because they do them multiple times. I'll look into what would be
involved.
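
For illustration only, here is a minimal sketch of what a persisted
comparison-result cache could look like. This is not Meld's code: the
ComparisonCache class, the JSON file format, and the use of size plus
mtime as a staleness check are all assumptions made for the example.

```python
import filecmp
import json
import os


def stat_signature(path):
    """Return [size, mtime_ns]; used to detect that a cached result is stale."""
    st = os.stat(path)
    return [st.st_size, st.st_mtime_ns]


class ComparisonCache:
    """Hypothetical on-disk cache of "are these two files the same?" results."""

    def __init__(self, cache_file):
        self.cache_file = cache_file
        try:
            with open(cache_file, encoding="utf-8") as f:
                self.entries = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            self.entries = {}

    def same(self, path_a, path_b):
        key = json.dumps([os.path.abspath(path_a), os.path.abspath(path_b)])
        sigs = [stat_signature(path_a), stat_signature(path_b)]
        entry = self.entries.get(key)
        if entry is not None and entry["sigs"] == sigs:
            # Neither file changed since the result was stored: reuse it.
            return entry["same"]
        # shallow=False makes filecmp read and compare the file contents.
        result = filecmp.cmp(path_a, path_b, shallow=False)
        self.entries[key] = {"sigs": sigs, "same": result}
        return result

    def save(self):
        """Write the cache out so a later session can load it."""
        with open(self.cache_file, "w", encoding="utf-8") as f:
            json.dump(self.entries, f)
```

A second session would construct ComparisonCache with the same cache file
and only reread file pairs whose size or modification time has changed.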

Caching hashes has some tradeoffs. Storing the hash instead of the
comparison result means that we'd be required to read the whole file
the first time; currently we can early-exit if the files are different
and there are no filters. We'd also need to keep one hash for the raw
file, and another hash for when filters are applied, and the second
hash would need to be invalidated if the filter set changed. On the
other hand, it would make some comparison scenarios much better (i.e.,
comparing Dir1 to Dir2, then Dir2 to Dir3). I don't know whether it
would be worth it or not.
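
To make the tradeoff concrete, here is a rough sketch of a hash cache
that keeps separate entries for the raw file and for each filter set.
Nothing here is Meld's implementation: HashCache and file_hash are
hypothetical names, and filters are modelled as byte regexes whose
matches are deleted before hashing, which is an assumption. Keying on
the filter set means a changed filter set simply misses the cache
instead of needing explicit invalidation.

```python
import hashlib
import os
import re


def file_hash(path, filters=()):
    """SHA-1 of a file; with filters, of the content after regex removal."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        if not filters:
            # Raw hash: stream the file in 1 MiB chunks.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        else:
            # Filtered hash: this sketch reads the whole file into memory.
            data = f.read()
            for pattern in filters:
                data = re.sub(pattern, b"", data)
            h.update(data)
    return h.hexdigest()


class HashCache:
    """Hypothetical in-memory cache keyed by (path, filter set)."""

    def __init__(self):
        self.entries = {}

    def get(self, path, filters=()):
        st = os.stat(path)
        sig = (st.st_size, st.st_mtime_ns)
        key = (os.path.abspath(path), tuple(filters))
        cached = self.entries.get(key)
        if cached is not None and cached[0] == sig:
            return cached[1]  # file unchanged since it was last hashed
        digest = file_hash(path, filters)
        self.entries[key] = (sig, digest)
        return digest
```

With something like this, comparing Dir1 to Dir2 and then Dir2 to Dir3
would hash each file in Dir2 only once, but every file has to be read
in full the first time, so the early exit on differing files is lost.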

cheers,
Kai

References:
- Ignore contents: compare by file name, date/time & size only
  - From: Martin Spacek
- Re: Ignore contents: compare by file name, date/time & size only
  - From: Kevin Grover
