On Mon, 2004-05-03 at 06:55, guenther wrote:
> > > If you really are using third party tools like fetchmail, you should
> > > not be afraid of using third party tools for eliminating dupes, in
> > > this case formail... ;)
> > >
> > >   $ cat mbox | formail -D 8192 .msgid -s >> dedupe.mbox
> > >   $ mv -f dedupe.mbox mbox
> >
> > Anything like that for email stored in Maildirs (specifically, Courier
> > IMAP)?
>
> Oops, yes, should have mentioned the above is for mbox files only -- as
> one could guess by the file names... ;)
>
> No maildir solution offhand, unfortunately, but 'man formail' will be
> informative. As formail stores the cache on disk after it is done
> processing the current job, filtering on unique Message-IDs should be
> possible with single mail files in maildir format as well.
>
> Maybe someone else has already done this?
>
> ...guenther

You can do this relatively easily with "classify" (somewhere around the
net), or my sequivs program (available on request). Both are able to
divide files up into equivalence classes (groups of identical or similar
files).

classify is more flexible, but slower than sequivs. And it may be tough
to google for "classify". Both are O(N^2) unfortunately, but I came up
with a heuristic that speeds things up a lot anyway. I keep thinking I
should make it O(N log N) someday, but I haven't found a real need for it.

Usage with sequivs would be a bit like:

  find ~/Maildir/.folder -type f -print | sequivs |
      sed -e 's/^[^ ]* //' -e '/^$/d' > /tmp/.folder.dups
  xargs rm < /tmp/.folder.dups

The sed step drops the first file name of each group, so one copy is
kept, and deletes the lines that are left empty. Or you could combine it
all into one step if you trust my off-the-cuff scripting too much. :)
I'd really suggest going over /tmp/.folder.dups first to make sure those
really are all duplicated files.

If it's a big folder, you could be waiting a while. However, I did it
recently overnight (probably less than overnight, but I don't know by how
much) with 40k+ messages, so it's not that bad.

If I get enough requests, I'll add sequivs to
http://dcs.nac.uci.edu/~strombrg/software/index.html. equivs is already
there, but its usage is slightly different, and it won't work on
collections of potentially-duplicated files that are as big.

-- 
Dan Stromberg DCS/NACS/UCI <strombrg dcs nac uci edu>
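
For readers without sequivs or classify, here is a minimal sketch of the same idea (grouping byte-identical files and listing all but one of each group) using only standard tools. It is not the sequivs algorithm: the function name list_dups is hypothetical, it sorts on the POSIX cksum CRC and size instead of comparing every pair, and it confirms each match with cmp. It assumes file names contain no newlines and no leading or trailing blanks, which holds for typical Maildir names. It only catches exact copies, so duplicates whose headers differ will be missed.

```shell
#!/bin/sh
# list_dups DIR: print every file under DIR whose contents are byte-identical
# to an earlier file in the same checksum group. The first file of each
# group (lowest path in sort order) is kept and never printed.
# Hypothetical helper for illustration; not part of sequivs or classify.
list_dups() {
    prev_key=
    first=
    # cksum prints "CRC SIZE PATH" for each file.
    find "$1" -type f -exec cksum {} + |
        sort -k1,1n -k2,2n -k3 |
        while read -r crc size path; do
            # Same CRC and size as the current group: confirm with cmp,
            # since a 32-bit CRC can collide on large collections.
            if [ "$crc $size" = "$prev_key" ] && cmp -s "$first" "$path"; then
                printf '%s\n' "$path"
            else
                prev_key="$crc $size"
                first=$path
            fi
        done
}
```

Usage would be along the lines of "list_dups ~/Maildir/.folder > /tmp/.folder.dups", followed by a look through the list before feeding it to rm. Nothing is deleted by the function itself.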