Re: [Evolution] feature=remove_double_mail(); exist(feature)?; banner(' :-)');

On Mon, 2004-05-03 at 06:55, guenther wrote:
If you really are using third party tools like fetchmail, you should not
be afraid of using third party tools for eliminating dupes, in this case
formail... ;)

$ cat mbox | formail -D 8192 .msgid -s >> dedupe.mbox
$ mv -f dedupe.mbox mbox

Anything like that for email stored in Maildirs (specifically, Courier 

Oops, yes, should have mentioned the above is for mbox files only -- as
one could guess by the file names... ;)

No maildir solution OTOH unfortunately, but 'man formail' will be
informative. As formail will store the cache on disk after it is done
processing the current job, filtering on unique Message-IDs should be
possible with single mail files in maildir format as well.
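
For what it's worth, here is a minimal sketch of the Message-ID idea for a
Maildir folder. It does not use formail: it pulls the Message-ID header out
of each file with awk and reports every file whose ID has already been seen.
The function names (msgid_of, list_maildir_dups) are made up for
illustration, and it assumes the usual cur/ and new/ subdirectories, paths
without tabs or newlines, and a Message-ID header that is not folded onto a
second line.

```shell
#!/bin/sh
# Hypothetical helpers, not part of formail or any mail tool.

# Print the Message-ID value of one mail file (header section only).
msgid_of() {
    awk '
        /^\r?$/ { exit }                    # blank line ends the headers
        tolower($0) ~ /^message-id:/ {
            sub(/^[^:]*:[ \t]*/, "")        # drop the header name
            print
            exit
        }
    ' "$1"
}

# List every file in a Maildir folder whose Message-ID was already seen
# in an earlier file (earlier = sorts first by path).
# Files without a Message-ID are never reported.
list_maildir_dups() {
    find "$1/cur" "$1/new" -type f 2>/dev/null | LC_ALL=C sort |
    while IFS= read -r f; do
        id=$(msgid_of "$f")
        if [ -n "$id" ]; then
            printf '%s\t%s\n' "$id" "$f"
        fi
    done |
    awk -F '\t' 'seen[$1]++ { print $2 }'
}
```

Something like "list_maildir_dups ~/Maildir/.folder > /tmp/.folder.dups"
would then give a list to look over before removing anything. According to
the formail man page, -D returns success for a duplicate when it is not
splitting, so a per-file formail call could probably replace the awk
bookkeeping here.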

Maybe someone else already has done this?


You can do this relatively easily with "classify" (somewhere around the
net), or my sequivs (available on request) program.  Both are able to
divide files up into equivalence classes (groups of identical or similar
files).  classify is more flexible, but slower than my sequivs program. 
And it may be tough to google on "classify".  Both are O(N^2)
unfortunately, but I came up with a heuristic that speeds things a lot
anyway.  I keep thinking I should make it O(n log n) someday, but I haven't
found a real need for it.
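
To illustrate the equivalence-class idea for the simplest case only, here is
a rough sketch that groups byte-identical files by checksum. This is not how
classify or sequivs work (their algorithms are not described here); it is a
swapped-in technique that sorts checksums, so the grouping step is
O(n log n) in the number of files. It assumes GNU coreutils' sha256sum and
paths without backslashes or newlines, and list_identical_dups is a made-up
name.

```shell
#!/bin/sh
# Hypothetical helper: print all but the first member of every group of
# byte-identical files under a directory, one path per line.
list_identical_dups() {
    # sha256sum prints "<64 hex digits>  <path>"; sorting puts files with
    # equal checksums next to each other, ordered by path within a group.
    find "$1" -type f -exec sha256sum {} + | LC_ALL=C sort |
    awk '{
        hash = substr($0, 1, 64)
        if (hash == prev) print substr($0, 67)   # same class as line above
        prev = hash
    }'
}
```

Note that this only catches exact copies. Two deliveries of the same message
often differ in their Received headers, so they would not be byte-identical,
and a tool that can also group similar files would still be needed for those.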

Usage with sequivs would be a bit like:

find ~/Maildir/.folder -type f -print | sequivs | \
    sed -e 's/^[^ ]* //' -e '/^$/d' > /tmp/.folder.dups
xargs rm < /tmp/.folder.dups

Or you could combine it into one step if you trust my off-the-cuff
scripting too much.  :)  I'd really suggest going over .folder.dups
first to make sure those really are all duplicated files.
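
One possible way to go over the list, sketched under the assumption that it
holds one path per line: print a few identifying headers for each listed
file, so that messages that should not be in the list stand out. review_dups
is a hypothetical name, and the choice of headers is arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: for each path in a list file, show the Message-ID,
# Subject and Date headers (header section only), or flag a missing file.
review_dups() {
    while IFS= read -r f; do
        [ -f "$f" ] || { printf 'missing: %s\n' "$f"; continue; }
        printf '== %s\n' "$f"
        awk '
            /^\r?$/ { exit }                               # end of headers
            tolower($0) ~ /^(message-id|subject|date):/    # print matches
        ' "$f"
    done < "$1"
}
```

Usage would be along the lines of "review_dups /tmp/.folder.dups | less",
followed by the xargs rm step only once the output looks right.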

If it's a big folder, you could be waiting a while.  However, I did it
recently overnight (probably less than overnight, but I don't know by
how much) with 40k+ messages, so it's not that bad.

If I get enough requests, I'll add sequivs to  equivs is already
there, but its usage is slightly different, and it won't work on
collections of potentially duplicated files that are as big.

Dan Stromberg DCS/NACS/UCI <strombrg dcs nac uci edu>

