Re: [Evolution] Removing dupes

From: Dan Jones <ddjones riddlemaster org>
To: Evolution Mailing Lists <evolution lists ximian com>
Subject: Re: [Evolution] Removing dupes
Date: Sat, 04 Dec 2004 12:53:18 -0500

On Mon, 2004-11-22 at 16:35 +0100, guenther wrote:

This time, Evolution finished the import without crashing.  The
problem is that both imports worked - I now have two copies of every
message in my archives, which is close to 100MB of messages (now
taking up almost 200MB.)  Is there any way or any tool to go through
and weed out the duplicate messages?


I don't believe there is any simple method for doing this at this point 
in time - something that I've been asking for for ages...


If you really need it, use the 'formail' method I just posted yet again.
There already is a working [1] solution. You can wrap in an easy to use
shell script. And I doubt, the average user really needs such a feature
with the potential to shoot his foot.

[1] based on Message-Id headers, which are *not* guaranteed to be unique

This already has been beaten to death on this list. There are a couple
of threads arguing why it should go in, and a lot of reasons, why it can
not work reliably. Please search the archives first.

This is Open Source. If you really need that feature, go ahead and hack
it.


Rather than start hacking the source of Evolution, which would have had
quite a learning curve, I took a different approach.  I've written a
Perl script to handle it.  It looks for matching Message-ID headers,
then compares the MD5 hashes of the message bodies to ensure that they
actually are dupes.  (I suppose that this isn't absolutely guaranteed,
but the chance of two messages having the same ID and the same MD5 hash
and not being duplicates is infinitesimal.)  You can call it with the
names of one or more mailboxes (you must either be in the directory
holding the mailbox or pass it the full path):

rdbox Inbox

You can also use the -d switch, with the name of a directory.  If you
add the -r switch, it'll recurse subdirectories.  For my setup, using
Evolution on Novell Linux Desktop, I used the following command:

rdbox -r -d /home/mylogin/.evolution/mail/local

and it finds every mailbox and checks for dupes.

The script doesn't touch (other than reading) your actual mailboxes.  It
creates a new mailbox with a .clean extension, which contains all of
your messages without dupes.  You can then rename your originals or move
them to a safe location and rename the new files by stripping off the
.clean extension.  I'll probably automate this in the next iteration of
the script.
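
Until that is automated, the manual rename step could be sketched roughly
as below.  This is only an illustration, not part of rdbox: the function
name swap_clean and the .orig backup suffix are my own assumptions, and it
is probably best run while Evolution is closed.

```shell
#!/bin/sh
# Hypothetical helper (not part of rdbox): for every FILE.clean below the
# given directory, keep the original as FILE.orig and move the cleaned
# mailbox into its place.
swap_clean() {
    dir=$1
    find "$dir" -type f -name '*.clean' | while IFS= read -r clean; do
        orig=${clean%.clean}          # strip the .clean extension
        # Skip .clean files that have no matching original mailbox.
        [ -f "$orig" ] || continue
        # Back up the original first; only then move the clean copy in.
        mv -- "$orig" "$orig.orig" && mv -- "$clean" "$orig"
    done
}
```

For the setup above, that would be called as
swap_clean /home/mylogin/.evolution/mail/local, and the .orig files can be
deleted once the cleaned mailboxes have been checked.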

It uses two modules, Digest::MD5 and Getopt::Long, which are available
at CPAN.

This is essentially alpha software - it works on my system but hasn't
been extensively tested.  As I said, it shouldn't touch your original
mailbox, but it doesn't come with any guarantees.  Bug reports or problem
reports are welcome at bugsATriddlemaster.org.

To create the script, just cut and paste it into a file called rdbox.pl
and set execute permissions on the file.  Hope you find it helpful!

-----------------------------------------------------------------------

#!/usr/bin/perl

use strict;
use warnings;

use Digest::MD5 qw(md5);
use Getopt::Long;

#SUB DECLARATIONS
sub ProcFile($);
sub ProcDirectory($);
sub ProcMessage($);
sub PrintUsage();

#GLOBAL VARIABLES
my (%MessageStore, $FileWrites);

#COMMANDLINE ARGUMENTS
my (@directories, @files);
my $recurse = '';
my $verbose = '';
my $usage = '';
my $global = '';

my $result = GetOptions("directory=s" => \@directories,
                        "file=s"      => \@files,
                        "recurse"     => \$recurse,
                        "verbose"     => \$verbose,
                        "usage"       => \$usage,
                        "global"      => \$global);


my $GoodArg = 0;
if(@files)
{
        $GoodArg = 1;
        for(@files)
        {
                ProcFile($_);
        }
}

if(@directories)
{
        $GoodArg = 1;
        for(@directories)
        {
                ProcDirectory($_);
        }
}

if(@ARGV)
{
        $GoodArg = 1;
        for(@ARGV)
        {
                ProcFile($_);
        }
}

PrintUsage() unless $GoodArg;

sub ProcFile($) 
{
        my $mbox = shift;
        
        print "Processing file $mbox\n";
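        # Three-argument open, so odd characters in the file name are not
        # misread as a mode; $! reports why the open failed.
        open MAILBOX, '<', $mbox or die "Can't open $mbox: $!\n";
        open CLEAN, '>', "$mbox.clean" or die "Can't open $mbox.clean: $!\n";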
        
        %MessageStore = () unless $global;
        
        $FileWrites = 0;
        
        local $/ = "\n\nFrom ";

        my $Counter = 1;
        $_ = <MAILBOX>;
        $_ =~ s/\n\nFrom $//;
        ProcMessage($_);
        #print "Processed Message \#$Counter\n";
        
        while(<MAILBOX>) 
        {
                $Counter++;
                $_ =~ s/\n\nFrom $//;
                ProcMessage("\n\nFrom $_");
                print "Processed Message \#$Counter\n" if $verbose;
        }

        close MAILBOX;
        close CLEAN;
}

sub ProcDirectory($)
{
        my $directory = shift;
        
        chdir $directory or die "Can't change to directory $directory\n";
        
        opendir DIRECTORY, $directory or die "Can't open directory $directory\n";
        my @DirList = grep !/^\.\.?$/, readdir DIRECTORY;
        for(@DirList)
        {
                if(-d)
                {
                        print "Found directory $_\n" if $verbose;
                        if($recurse) 
                        {
                                ProcDirectory("$directory/$_");
                                chdir $directory;
                        }
                }
                elsif(/(.*)\.ibex\.index$/){
                        print "Found file $1\n" if $verbose;
                        ProcFile($1);
                }
        }
        print "\n";
}

sub ProcMessage($)
{
        my $Message = shift;
        my @MessageParts;
        my $HashValue;
        my $MessageId;
        
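        my $InitWS;
        my $WSLength;
        
        # Capture the leading whitespace in list context, so a failed match
        # cannot pick up a stale $1 from an earlier match.
        ($InitWS) = $Message =~ /^(\s+)/;
        
        # Skip the leading whitespace when splitting, but leave $Message
        # itself untouched so the blank-line separator still reaches CLEAN.
        $WSLength = defined $InitWS ? length $InitWS : 0;
        
        @MessageParts = split /\n\n/, substr($Message, $WSLength), 2;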
        unless($MessageParts[1])
        {
                print "Error in message!\n$Message\n\n";
                return;
        }
        
        $HashValue = md5($MessageParts[1]);
        
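        # List-context match: $MessageId is undef when no header is found,
        # rather than inheriting $1 from a previous match.
        ($MessageId) = $MessageParts[0] =~ /Message-I[dD]: (.*)/;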
        
        unless($MessageId)
        {
                print STDERR "Can't find Id in this message:\n$MessageParts[0]";
                return;
        }
        
        if(exists $MessageStore{$MessageId}) 
        {
                if($MessageStore{$MessageId} eq $HashValue)
                {
                        print "Found dupe of MessageID $MessageId!\n" if $verbose;
                }
                else
                {
                        print CLEAN $Message;
                        print "False positive of $MessageId\n" if $verbose;
                }
        }
        else
        {
                $MessageStore{$MessageId} = $HashValue;
                unless ($FileWrites) {
                        $Message =~ s/^\n+//;
                }
                print CLEAN $Message;
                $FileWrites++;
                print "Storing Message number $FileWrites, ID#: $MessageId\n" if $verbose;
        }
}

sub PrintUsage() {
        
print <<USAGE;
rdbox - utility to remove duplicates from mbox files.
usage: rdbox [options] filename
       rdbox [options] -d directoryname

options:
        -d/-directory   name of directory containing mbox files

        -r/-recurse     recurse into subdirectories

        -u/-usage       print this message

        -f/-file        name of mbox file(s)

        -v/-verbose     print extra messages

        -g/-global      check for duplicates across all mailboxes
       
USAGE
}
